Understanding the IaaS: EVPN and VXLAN Solution

 

Market Overview

In addition to owning their transport infrastructure, service providers are also in the business of offering managed IT and managed data center services to a large variety of customers. Because service providers own the infrastructure, they can offer stronger service-level agreements (SLAs), quality of service (QoS), and security, as these services are often provided over dedicated circuits. However, the cost structure of these services can be relatively high, especially in comparison with nimble, fast-executing Web services companies, whose cost structures are much leaner.

As service providers increasingly feel this competitive pressure, they need to innovate their business models and adopt cloud computing architectures in order to lower costs, increase efficiency, and maintain their competitiveness in Infrastructure as a Service (IaaS) offerings. While they continue to use SLAs, flexibility of deployment, and choice of topologies to differentiate themselves from Web services providers, service providers also need to invest significantly in building highly automated networks. These improvements help cut operating expenses and enable new sources of revenue from new services, allowing service providers to compete more effectively.

Service providers vary widely in how they build traditional networks, and there is not one specific standard or topology that is followed. However, as they move forward and extend their networks to offer cloud services, many providers are converging around two general topologies based on some high-level requirements:

  • A large percentage of standalone bare-metal servers (BMSs), with some part of the network dedicated to offering virtualized compute services. This type of design keeps the “intelligence” in the traditional physical network.

  • Largely virtualized services, with some small amount of BMS-based services. This type of design moves the “intelligence” out of the physical network and into the virtual network, and generally requires a software-defined network (SDN) controller.

This solution guide focuses on the first use case, with a particular focus on the BMS environment. This guide will help you understand the requirements for an IaaS network, the architecture required to build the network, how to configure each layer, and how to verify its operational state.

Solution Overview

Traditionally, data centers have used Layer 2 technologies such as Spanning Tree Protocol (STP) and multichassis link aggregation groups (MC-LAG) to connect compute and storage resources. As the design of these data centers evolves to scale out multitenant networks, a new data center architecture is needed that decouples the underlay (physical) network from a tenant overlay network. Using a Layer 3 IP-based underlay coupled with a VXLAN-Ethernet VPN (EVPN) overlay, data center and cloud operators can deploy much larger networks than are otherwise possible with traditional Layer 2 Ethernet-based architectures. With overlays, endpoints (servers or virtual machines [VMs]) can be placed anywhere in the network and remain connected to the same logical Layer 2 network, enabling the virtual topology to be decoupled from the physical topology.

For the reasons of scale and operational efficiency outlined above, virtual networking is being widely deployed in data centers. At the same time, bare-metal compute has become more relevant for high-performance, scale-out, or container-driven workloads. This solution guide describes how standards-based control-plane and forwarding-plane protocols can enable interconnectivity by leveraging control-plane learning. In particular, this guide describes how using EVPN for control-plane learning can facilitate BMS interconnection within VXLAN virtual networks (VNs), and between VNs using a gateway such as a Juniper Networks QFX Series switch.

Solution Elements

Underlay Network

In data center environments, the role of the physical underlay network is to provide an IP fabric, also known as a Clos network. Its responsibility is to provide unicast IP connectivity from any physical device (server, storage device, router, or switch) to any other physical device. An ideal underlay network provides low-latency, nonblocking, high-bandwidth connectivity from any point in the network to any other point in the network.

At the underlay layer, devices maintain and share reachability information about the physical network itself. However, this layer does not contain any “per-tenant” state; that is, devices do not maintain and share reachability information about virtual or physical endpoints. This is a task for the overlay layer.

IP fabrics can vary in size and scale. A typical solution uses two layers—spine and leaf—to form what is known as a three-stage Clos network, where each leaf device is connected to each spine device, as shown in Figure 1. A spine and leaf fabric is sometimes referred to as a folded, three-stage Clos network, because the first and third stages—the ingress and egress nodes—are folded back on top of each other. In this configuration, spine devices are typically Layer 3 switches that provide connectivity between leaf devices, and leaf devices are top-of-rack (TOR) switches that provide connectivity to the servers.

Figure 1: Three-Stage Clos-Based IP Fabric

As the scale of the fabric increases, it can be necessary to expand to a five-stage Clos network, as shown in Figure 2. This scenario adds a fabric layer to provide inter-POD (point of delivery) or inter-data center connectivity.

Figure 2: Five-Stage Clos-Based IP Fabric

A key benefit of a Clos-based fabric is natural resiliency. High availability mechanisms, such as MC-LAG or Virtual Chassis, are not required as the IP fabric uses multiple links at each layer and device; resiliency and redundancy are provided by the physical network infrastructure itself.

An IP fabric is straightforward to build, and it serves as a strong foundation for overlay technologies such as EVPN and VXLAN.

Note

For more information about Clos-based IP fabrics, see Clos IP Fabrics with QFX5100 Switches.

Overlay

Using an overlay architecture in the data center allows you to decouple physical network devices from the endpoints in the network. This decoupling allows the data center network to be programmatically provisioned at a per-tenant level. Overlay networking generally supports both Layer 2 and Layer 3 transport between servers or VMs. It also supports a much larger scale: a traditional network using VLANs for separation can support a maximum of about 4,000 tenants, while an overlay protocol such as VXLAN supports over 16 million.

Note

At the time of this writing, QFX5100 and QFX10000 Series switches support 4000 virtual network identifiers (VNIs) per device.

Virtual networks (VNs) are a key concept in an overlay environment. VNs are logical constructs implemented on top of the physical networks that replace VLAN-based isolation and provide multitenancy in a virtualized data center. Each VN is isolated from other VNs unless explicitly allowed by security policy. VNs can be interconnected within a data center, and between data centers.

In data center networks, tunneling protocols such as VXLAN are used to create the data plane for the overlay layer. For devices using VXLAN, each entity that performs the encapsulation and decapsulation of packets is called a VXLAN tunnel endpoint (VTEP). VTEPs typically reside within the hypervisor of virtualized hosts, but can also reside in network devices to support BMS endpoints.

Figure 3 shows a typical overlay architecture.

Figure 3: Overlay Architecture

In the diagram, the server to the left of the IP fabric has been virtualized with a hypervisor. The hypervisor contains a VTEP that handles the encapsulation of data-plane traffic between VMs, as well as MAC address learning, provisioning of new virtual networks, and other configuration changes. The physical servers above and to the right of the IP fabric do not have any VTEP capabilities of their own. In order for these servers to participate in the overlay architecture and communicate with other endpoints (physical or virtual), they need help to encapsulate the data-plane traffic and perform MAC address learning. In this case, that help comes from the attached network device, typically a top-of-rack (TOR) switch or a leaf device in the IP fabric. Supporting the VTEP role in a network device simplifies the overlay architecture; any device with physical servers connected to it can simply perform the overlay encapsulation and control-plane functions on their behalf. From the point of view of a physical server, the network functions as usual.

Note

For more information on VXLAN and VTEPs in overlay networks, see Learn About: VXLAN in Virtualized Data Center Networks.

To support the scale of data center networks, the overlay layer typically requires a control-plane protocol to facilitate learning and sharing of endpoints. EVPN is a popular choice for this function.

EVPN is a control-plane technology that uses Multiprotocol BGP (MP-BGP) for MAC and IP address (endpoint) distribution, with MAC addresses being treated as “routes.” Route entries can contain just a MAC address, or a MAC address plus an IP address (ARP entry). As used in data center environments, EVPN enables devices acting as VTEPs to exchange reachability information with each other about their endpoints.

To support its range of capabilities, EVPN introduces several new concepts, including new route types and BGP communities. It also defines a new BGP network layer reachability information (NLRI), called the EVPN NLRI.

For this solution, two route types are of particular note:

  • EVPN Route Type 2: MAC/IP Advertisement route—Extends BGP to advertise MAC and IP addresses in the EVPN NLRI. Key uses of this route type include advertising host MAC and IP reachability, allowing control plane-based MAC learning for remote PE devices, minimizing flooding across a WAN, and allowing PE devices to perform proxy-ARP locally for remote hosts. Typically, the Type 2 route is used to support Layer 2 (intra-VXLAN) traffic, though it can also support Layer 3 (inter-VXLAN) traffic.

  • EVPN Route Type 5: IP Prefix route—Extends EVPN with a route type for the advertisement of IP prefixes. This route type decouples the advertisement of IP information from the advertisement of MAC addresses. The ability to advertise an entire IP prefix provides improved scaling (versus advertising MAC/IP information for every host), as well as increased efficiency in advertising and withdrawing routes. Typically, the Type 5 route is used to support Layer 3 (inter-VXLAN) traffic.

Note

For more information on EVPN in a data center context, see Improve Data Center Interconnect, L2 Services with Juniper’s EVPN.

Moving to an overlay architecture shifts the “intelligence” of the data center. Traditionally, servers and VMs each consume a MAC address and host route entry in the physical (underlay) network. However, with an overlay architecture, only the VTEPs consume a MAC address and host route entry in the physical network. All host-to-host traffic is now encapsulated between VTEPs, and the MAC address and host route of each server or VM aren’t visible to the underlying networking equipment. The MAC address and host route scale have been moved from the underlay environment into the overlay.

Gateways

A gateway in a virtualized network environment typically refers to physical routers or switches that connect the tenant virtual networks to physical networks such as the Internet, a customer VPN, another data center, or nonvirtualized servers. This solution uses multiple types of gateways.

A Layer 2 VXLAN gateway, also known as a VTEP gateway, maps VLANs to VXLANs and handles VXLAN encapsulation and decapsulation so that non-virtualized resources do not need to support the VXLAN protocol. This permits the VXLAN and VLAN segments to act as one forwarding domain.

In data center environments, a VTEP gateway often runs in software as a virtual switch or virtual router instance on a virtualized server. However, switches and routers can also function as VTEP gateways, encapsulating and decapsulating VXLAN packets on behalf of bare-metal servers, as shown earlier in Figure 3. This setup is referred to as a hardware VTEP gateway. In this solution, the QFX5100 (leaf) devices act as Layer 2 gateways to support intra-VXLAN traffic.
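
As an illustration, a minimal Layer 2 gateway sketch for a QFX5100 leaf device might look like the following. The interface name, VLAN ID, VNI, route distinguisher, and route target are placeholders for illustration only, not the validated values used later in this guide.

    # Hypothetical leaf (Layer 2 VXLAN gateway) configuration sketch
    set switch-options vtep-source-interface lo0.0
    set switch-options route-distinguisher 10.0.0.21:1
    set switch-options vrf-target target:65000:1
    set protocols evpn encapsulation vxlan
    set protocols evpn extended-vni-list all
    # Map VLAN 100 on the server-facing port to VXLAN VNI 100100
    set vlans BD100 vlan-id 100
    set vlans BD100 vxlan vni 100100
    set interfaces xe-0/0/10 unit 0 family ethernet-switching vlan members BD100

With a configuration along these lines, traffic arriving from the bare-metal server on VLAN 100 is mapped to VNI 100100 and encapsulated toward remote VTEPs learned through EVPN.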

To forward traffic between VXLANs, a Layer 3 gateway is required. In this solution, the QFX10002 (spine) devices act as Layer 3 gateways to support inter-VXLAN traffic.

Note

For more information on Layer 3 gateways in a data center context, see Day One: Using Ethernet VPNs for Data Center Interconnect and Juniper Networks EVPN Implementation for Next-Generation Data Center Architectures.

Design Considerations

There are several design considerations when implementing an IaaS network.

Fabric Connectivity

Data center fabrics can be based on Layer 2 or Layer 3 technologies. Ethernet fabrics, such as Juniper Networks Virtual Chassis Fabric, are simple to manage and provide scale and equal-cost multipath (ECMP) capabilities to a certain degree. However, as the fabric increases in size, the scale of the network eventually becomes too much for an Ethernet fabric to handle. Tenant separation is another issue; as Ethernet fabrics have no overlay network, VLANs must be used, adding another limitation to the scalability of the network.

An IaaS data center network requires Layer 3 protocols to provide the ECMP and scale capabilities for a network of this size. While IGPs provide excellent ECMP capabilities, BGP is the ideal option to provide the proper scaling and performance required by this solution. BGP was designed to handle the scale of the global Internet, and can be repurposed to support the needs of top-tier service provider data centers.

BGP Design (Underlay)

With BGP decided upon as the routing protocol for the fabric, the next decision is whether to use internal BGP (IBGP) or external BGP (EBGP). The very nature of an IP fabric requires having multiple, equal-cost paths; therefore, the key factor to consider here is how IBGP and EBGP implement ECMP functionality.

IBGP requires that all devices peer with one another. In an IaaS network, BGP route reflectors typically would be implemented in the spine layer of the network to help with scaling. However, standard BGP route reflection reflects only the single best path for each prefix to its clients. In order to enable full ECMP, you need to configure the BGP AddPath feature so that route reflectors advertise additional ECMP paths to their clients.
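
As a sketch only, BGP AddPath on a Junos OS route reflector is enabled per address family, roughly as follows; the group name, addresses, and path count are hypothetical.

    # Hypothetical IBGP route-reflector group on a spine device
    set protocols bgp group IBGP-UNDERLAY type internal
    set protocols bgp group IBGP-UNDERLAY local-address 10.0.0.11
    set protocols bgp group IBGP-UNDERLAY cluster 10.0.0.11
    # Advertise up to four paths per prefix to route-reflector clients
    set protocols bgp group IBGP-UNDERLAY family inet unicast add-path send path-count 4
    set protocols bgp group IBGP-UNDERLAY family inet unicast add-path receive
    set protocols bgp group IBGP-UNDERLAY neighbor 10.0.0.21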

Alternatively, EBGP supports ECMP without enabling additional features. It is easy to configure, and also facilitates traffic engineering if desired through standard EBGP techniques such as autonomous system (AS) padding.

With EBGP, each device in the IP fabric uses a different AS number. It is also a good practice to align the AS numbers within each layer. As an example, Figure 4 shows the spine layer with AS numbering in the 651xx range, and the leaf layer with AS numbering in the 652xx range.

Figure 4: AS Numbering in an IP Fabric Underlay

Because EBGP supports ECMP in a more straightforward fashion, an EBGP-based IP fabric is typically used at the underlay layer.
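
The following is a minimal sketch of an EBGP underlay configuration on a leaf device, following the AS numbering shown in Figure 4. The group name, policy name, peer addresses, loopback prefix, and AS numbers are illustrative placeholders.

    # Hypothetical underlay EBGP configuration on a leaf device (AS 65201)
    set protocols bgp group UNDERLAY type external
    set protocols bgp group UNDERLAY local-as 65201
    set protocols bgp group UNDERLAY multipath multiple-as
    set protocols bgp group UNDERLAY export EXPORT-LOOPBACK
    set protocols bgp group UNDERLAY neighbor 172.16.1.0 peer-as 65101
    set protocols bgp group UNDERLAY neighbor 172.16.2.0 peer-as 65102
    # Advertise the loopback addresses so VTEP (overlay) endpoints are reachable
    set policy-options policy-statement EXPORT-LOOPBACK term lo0 from protocol direct
    set policy-options policy-statement EXPORT-LOOPBACK term lo0 from route-filter 10.0.0.0/24 orlonger
    set policy-options policy-statement EXPORT-LOOPBACK term lo0 then accept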

Note

For information on the Juniper Networks validated Clos-based Layer 3 IP fabric solution, see Solution Guide: Software as a Service.

BGP Design (Overlay)

At the overlay layer, similar decisions must be made. Again, the very nature of an IP fabric requires multiple, equal-cost paths. In addition, you must consider the overlay protocol being used. This solution uses EVPN as the control-plane protocol for the overlay; given that EVPN uses MP-BGP for communication (signaling), BGP is again a logical choice for the overlay.

There is more than one way to design the overlay environment. Because this solution is “controllerless,” meaning there is no SDN controller in use, the network itself must perform both the underlay and overlay functions. This solution uses an IBGP overlay design with route reflection, as shown in Figure 5. With this design, leaf devices within a given point of delivery (POD) share endpoint information upstream as EVPN routes to the spine devices, which are acting as route reflectors. The spine devices reflect the routes downstream to the other leaf devices.

Figure 5: BGP (EVPN) Overlay Design—Single POD
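
As a sketch of this design, a spine device acting as the route reflector for its POD might be configured roughly as follows. The overlay AS number, loopback addresses, and group names are illustrative placeholders.

    # Hypothetical overlay IBGP (EVPN) configuration on a spine route reflector
    set routing-options autonomous-system 65000
    set protocols bgp group OVERLAY-RR type internal
    set protocols bgp group OVERLAY-RR local-address 10.0.0.11
    set protocols bgp group OVERLAY-RR family evpn signaling
    set protocols bgp group OVERLAY-RR cluster 10.0.0.11
    # Leaf devices in this POD are route-reflector clients
    set protocols bgp group OVERLAY-RR neighbor 10.0.0.21
    set protocols bgp group OVERLAY-RR neighbor 10.0.0.22
    # A separate full-mesh IBGP group to the other spine devices carries inter-POD EVPN routes
    set protocols bgp group OVERLAY-MESH type internal
    set protocols bgp group OVERLAY-MESH local-address 10.0.0.11
    set protocols bgp group OVERLAY-MESH family evpn signaling
    set protocols bgp group OVERLAY-MESH neighbor 10.0.0.12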

The spine devices can also advertise the EVPN routes to other PODs. As shown in Figure 6, the spine devices use an MP-IBGP full mesh to share EVPN routes and provide inter-POD communication.

Figure 6: BGP (EVPN) Overlay Design—Multiple PODs

Note

For more information about Clos-based IP fabric design, see Clos IP Fabrics with QFX5100 Switches.

EVPN Design

As noted above, this solution uses EVPN as the control-plane protocol for the overlay. EVPN runs between VXLAN gateways and removes the need for VXLAN to learn MAC and IP reachability information through data-plane flooding by providing this functionality in the control plane.

A multitenant data center environment requires mechanisms to support traffic flows both within and between VNs. For this solution, intra-VXLAN traffic is handled at the leaf layer, with the QFX5100 switches acting as VXLAN Layer 2 gateways. Inter-VXLAN traffic is handled at the spine layer, with the QFX10002 switches acting as VXLAN Layer 3 gateways. Spine devices are configured with integrated routing and bridging (IRB) interfaces, which endpoints use as a default gateway for non-local traffic.

Intra-VXLAN forwarding is typically performed with the help of EVPN route Type 2 announcements, which advertise MAC addresses (along with their related IP address). Inter-VXLAN routing can also be performed using EVPN route Type 2 announcements, though it is increasingly performed with the help of EVPN route Type 5 announcements, which advertise entire IP prefixes.

Inter-VXLAN routing supports two operating modes: asymmetric and symmetric. These terms relate to the number of lookups performed by the devices at each end of a VXLAN tunnel. The following describes the two modes:

Asymmetric mode

  • The sending device maintains explicit reachability to all remote endpoints.

  • Benefit: just a single lookup is required on the receiving device (since the endpoint was already known by the sending device).

  • Drawback: large environments can cause very large lookup tables.

Symmetric mode

  • The sending device does not maintain explicit reachability to all remote endpoints; rather, it puts remote traffic into a single “routing” VXLAN tunnel and lets the receiving device perform the endpoint lookup locally.

  • Benefit: reduces lookup table size.

  • Drawback: an additional lookup is required by the receiving device (since the endpoint was not explicitly known by the sending device).

This solution uses symmetric mode for inter-VXLAN routing. This mode is generally preferred, as current Junos OS platforms can perform multiple lookups in hardware with no impact to line-rate performance.
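
A minimal sketch of a tenant VRF using EVPN Type 5 routes on a QFX10002 spine device follows. The instance name, route distinguisher, route target, and routing VNI are hypothetical values for illustration.

    # Hypothetical tenant VRF advertising IP prefixes as EVPN Type 5 routes
    set routing-instances TENANT-1 instance-type vrf
    set routing-instances TENANT-1 interface irb.100
    set routing-instances TENANT-1 route-distinguisher 10.0.0.11:10
    set routing-instances TENANT-1 vrf-target target:65000:10
    # Routed traffic between VXLANs is carried in a dedicated "routing" VNI (9999)
    set routing-instances TENANT-1 protocols evpn ip-prefix-routes advertise direct-nexthop
    set routing-instances TENANT-1 protocols evpn ip-prefix-routes encapsulation vxlan
    set routing-instances TENANT-1 protocols evpn ip-prefix-routes vni 9999

With this approach, the sending spine routes remote traffic into VNI 9999 and the receiving spine performs the final lookup toward the destination endpoint, which is the symmetric behavior described above.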

Note

At the time of this writing, QFX10002 switches and MX Series routers support asymmetric mode with EVPN route Type 2. QFX10002 switches also support symmetric mode with EVPN route Type 5.

Note

For more detailed information on inter-VXLAN routing, see Configuring EVPN Type 5 for QFX10000 Series Switches.

EVPN supports “all-active” (multipath) forwarding for endpoints, allowing them to be connected to two or more leaf devices for redundant connectivity, as shown in Figure 7.

Figure 7: EVPN Server Multihoming

In EVPN terms, the links to a multihomed server are defined as a single Ethernet segment. Each Ethernet segment is identified using a unique Ethernet segment identifier (ESI).
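
For illustration, the server-facing aggregated Ethernet interface on each of the two leaf devices might carry an ESI configuration along these lines; the interface names, ESI value, and LACP system ID are placeholders.

    # Hypothetical all-active Ethernet segment toward a multihomed bare-metal server
    set interfaces xe-0/0/10 ether-options 802.3ad ae0
    set interfaces ae0 esi 00:01:01:01:01:01:01:01:01:01
    set interfaces ae0 esi all-active
    # Both leaf devices use the same LACP system ID so the server sees a single LAG
    set interfaces ae0 aggregated-ether-options lacp active
    set interfaces ae0 aggregated-ether-options lacp system-id 00:00:00:01:01:01
    set interfaces ae0 unit 0 family ethernet-switching vlan members BD100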

Note

For more detailed information about EVPN ESIs, see EVPN Multihoming Overview.

VXLAN Design

VXLAN in the overlay has the following design characteristics:

  • Each bridge domain / VXLAN network identifier (VNI) must have a VXLAN tunnel to each spine and leaf in a full mesh, that is, any-to-any connectivity.

  • VXLAN is the data plane encapsulation between servers.

  • EVPN is used as the control plane for MAC address learning.

An example of the VXLAN design for this solution is shown in Figure 8.

Figure 8: VXLAN Design

Tenant Design

This solution provides tenant separation and connectivity at the spine and leaf layers.

Tenant design in the spine devices has the following design characteristics:

  • Each tenant gets its own VRF.

  • Each tenant VRF can have multiple bridge domains.

  • Bridge domains within a VRF can switch and route freely.

  • Bridge domains in different VRFs must not switch or route traffic to each other.

  • Each bridge domain must provide VXLAN Layer 2 gateway functionality.

  • Each bridge domain will have a routed Layer 3 interface.

  • IRB interfaces must be able to perform inter-VXLAN routing.

  • Each spine device in the POD must be configured with identical VRF, bridge domain, and IRB components.

An example of the spine tenant design for this solution is shown in Figure 9.

Figure 9: Tenant Design in Spine Devices
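
To make this structure concrete, a sketch of one tenant on a spine device follows; the instance name, VLAN IDs, VNIs, route distinguisher, and route target are hypothetical.

    # Hypothetical tenant with two bridge domains, each with a routed IRB interface
    set vlans BD1 vlan-id 100
    set vlans BD1 l3-interface irb.100
    set vlans BD1 vxlan vni 100100
    set vlans BD2 vlan-id 101
    set vlans BD2 l3-interface irb.101
    set vlans BD2 vxlan vni 100101
    # Both IRB interfaces belong to the tenant VRF, so BD1 and BD2 can route freely
    set routing-instances TENANT-1 instance-type vrf
    set routing-instances TENANT-1 interface irb.100
    set routing-instances TENANT-1 interface irb.101
    set routing-instances TENANT-1 route-distinguisher 10.0.0.11:10
    set routing-instances TENANT-1 vrf-target target:65000:10

A second tenant would use its own VRF with a different route target, which prevents its bridge domains from switching or routing to TENANT-1.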

By comparison, tenant design in the leaf devices is very simple, with the following design characteristics:

  • Leaf devices are Layer 2 only (no VRF or IRB interfaces).

  • By default, all traffic is isolated per bridge domain.

  • Although a given tenant might own BD1, BD2, and BD3, there are no VRFs on the leaf device.

An example of the leaf tenant design for this solution is shown in Figure 10.

Figure 10: Tenant Design in Leaf Devices

IRB Design

Inter-VXLAN gateway functionality is implemented in this solution at the spine layer, using IRB interfaces. These interfaces have the following design characteristics:

  • Every bridge domain must have a routed Layer 3 interface, implemented as an IRB interface.

  • Each bridge domain’s IRB interface can use IPv4 addressing, IPv6 addressing, or both.

  • Each spine device must use the same IPv4 and IPv6 IRB interface addresses (this reduces the number of public IP addresses consumed at scale).

  • Each spine must implement EVPN anycast gateway.

An example of the IRB interface design for this solution is shown in Figure 11.

Figure 11: IRB Interface Design on Spine Devices
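
A minimal sketch of one IRB interface follows. It uses the virtual-gateway-address form of EVPN anycast gateway; the addresses are illustrative, and the exact anycast gateway method in a given deployment may differ from this sketch.

    # Hypothetical IRB interface for one bridge domain on a spine device
    set interfaces irb unit 100 family inet address 10.1.100.2/24 virtual-gateway-address 10.1.100.1
    set interfaces irb unit 100 family inet6 address 2001:db8:100::2/64 virtual-gateway-address 2001:db8:100::1
    # Endpoints in the bridge domain use the virtual gateway address as their default gateway
    set vlans BD1 l3-interface irb.100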

Solution Implementation Summary

The following hardware equipment and key software features were used to create the IaaS solution described in the upcoming example:

Fabric

  • Four QFX5100-24Q switches

  • Underlay network

    • EBGP peering with the downstream (spine) devices using two-byte AS numbers

  • BFD for all BGP sessions (see the configuration sketch after this list)

  • Traffic load balancing

    • EBGP multipath

    • Resilient hashing

    • Per-packet load balancing
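
The following sketch shows one way BFD and the load-balancing features listed above are commonly enabled in Junos OS; the group name, policy name, and timer values are illustrative and may differ from the validated configuration.

    # Hypothetical BFD on the underlay EBGP group (350 ms x 3)
    set protocols bgp group UNDERLAY bfd-liveness-detection minimum-interval 350
    set protocols bgp group UNDERLAY bfd-liveness-detection multiplier 3
    # EBGP multipath across peers in different AS numbers
    set protocols bgp group UNDERLAY multipath multiple-as
    # Per-packet load balancing (per-flow hashing) across ECMP next hops
    set policy-options policy-statement ECMP-POLICY then load-balance per-packet
    set routing-options forwarding-table export ECMP-POLICY
    # Resilient hashing to minimize flow remapping when an ECMP member changes
    set forwarding-options enhanced-hash-key ecmp-resilient-hash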

Spine

  • Four QFX10002-72Q switches

  • Underlay network

    • EBGP peering with the upstream (fabric) devices using two-byte AS numbers

    • EBGP peering with the downstream (leaf) devices using two-byte AS numbers

  • Overlay network

    • EVPN / IBGP full mesh between all spine devices

    • EVPN / IBGP route reflection to leaf devices

      • Each spine device is a route reflector for leaf devices in its POD

      • Each POD is a separate cluster

  • BFD for all BGP sessions

  • Traffic load balancing

    • EBGP multipath

    • Resilient hashing

    • Per-packet load balancing

  • Nine VLANs (100 to 108) to illustrate intra-VLAN and inter-VLAN traffic using EVPN route Type 2

  • Two VLANs (999 on Spine 1 and Spine 2, 888 on Spine 3 and Spine 4) to illustrate inter-VLAN traffic using EVPN route Type 5

Leaf

  • Four QFX5100-48S switches

  • Underlay network

    • EBGP peering with the upstream (spine) devices using two-byte AS numbers

  • Overlay network

    • EVPN / IBGP peering with the upstream (spine) devices using two-byte AS numbers

  • BFD for all BGP sessions

  • Traffic load balancing

    • EBGP multipath

    • Resilient hashing

    • Per-packet load balancing

  • Nine VLANs (100 to 108) to illustrate intra-VLAN and inter-VLAN traffic using EVPN route Type 2

  • Two VLANs (999 on Leaf 1 and Leaf 2, 888 on Leaf 3 and Leaf 4) to illustrate inter-VLAN traffic using EVPN route Type 5

Servers / End hosts

  • Bare-metal servers attached to leaf devices

    • Traffic generator simulating BMS hosts, sending intra- and inter-VLAN traffic