
Understanding the Infrastructure as a Service Solution


Market Overview

In addition to owning their transport infrastructure, service providers are in the business of offering managed IT and managed data center services to a wide variety of customers. Because they own the infrastructure, service providers can offer stronger service-level agreements (SLAs), quality of service (QoS), and security, as these services are often delivered over dedicated circuits. However, the cost structure of these services can be relatively high, especially compared to that of the nimble, fast-executing Web services companies, whose cost structures are very lean.

As service providers increasingly feel this competitive pressure, they need to innovate their business models and adopt cloud computing architectures to lower costs, increase efficiency, and keep their IaaS offerings competitive. While they continue to use SLAs, flexibility of deployment, and choice of topologies to differentiate themselves from Web services providers, service providers also need to invest significantly in building highly automated networks. These improvements help cut operating expenses and open new sources of revenue through new services, enabling them to compete more effectively.

Service providers vary widely in how they build traditional networks, and there is not one specific standard or topology that is followed. However, as they move forward and extend their networks to offer cloud services, many providers are converging around two general topologies based on some high-level requirements:

  • A large percentage of standalone bare-metal servers (BMSs), with some part of the network dedicated to offering virtualized compute services. This type of design keeps the “intelligence” in the traditional physical network.

  • Largely virtualized services, with some small amount of BMS-based services. This type of design moves the “intelligence” out of the physical network and into the virtual network, and generally requires a software-defined network (SDN) controller.

This solution guide focuses on the second use case. It will help you understand the requirements for an Infrastructure as a Service (IaaS) network, the architecture required to build the network, how to configure each layer, and how to verify its operational state.

Solution Overview

Traditionally, data centers have used Layer 2 technologies such as Spanning Tree Protocol (STP) and multichassis link aggregation groups (MC-LAG) to connect compute and storage resources. As the design of these data centers evolves to scale out multitenant networks, a new data center architecture is needed that decouples the underlay (physical) network from a tenant overlay network. Using a Layer 3 IP-based underlay coupled with a VXLAN-Ethernet VPN (EVPN) overlay, data center and cloud operators can deploy much larger networks than are otherwise possible with traditional Layer 2 Ethernet-based architectures. With overlays, endpoints (servers or virtual machines [VMs]) can be placed anywhere in the network and remain connected to the same logical Layer 2 network, enabling the virtual topology to be decoupled from the physical topology.

For the reasons of scale and operational efficiency outlined above, virtual networking is being widely deployed in data centers. However, applications running in virtual networks still have to be able to communicate with systems on bare-metal servers. This solution guide describes how standards-based control and forwarding plane protocols can enable interconnectivity between virtual and physical domains by leveraging control-plane learning. In particular, it describes how control-plane learning using the Open vSwitch Database (OVSDB) protocol configured on a top-of-rack (TOR) switch can facilitate direct interconnection between VMs in a virtual network domain, managed by Contrail, and physical servers connected through a switch. Additionally, this guide shows how VMs and physical servers in different virtual network domains can be interconnected using a gateway such as a Juniper Networks MX Series 3D Universal Edge Router.

Solution Elements

Underlay Network

In data center environments, the role of the physical underlay network is to provide an IP fabric, also known as a Clos network, whose responsibility is to provide unicast IP connectivity from any physical device (server, storage device, router, or switch) to any other physical device. An ideal underlay network provides low-latency, nonblocking, high-bandwidth connectivity from any point in the network to any other point in the network.

Underlay devices do not contain any “per-tenant” state; that is, they do not contain any MAC addresses, IP addresses, or policies for virtual machines or endpoints. The forwarding tables of underlay devices contain only the IP prefixes or MAC addresses of the physical servers. Gateway routers or switches that connect a virtual network to a physical network are an exception; they do need to contain tenant MAC or IP addresses.

IP fabrics can vary in size and scale. This solution uses two layers—spine and leaf—to form what is known as a three-stage Clos network, where each leaf device is connected to each spine device, as shown in Figure 1. A spine and leaf fabric is sometimes referred to as a folded, three-stage Clos network, because the first and third stages—the ingress and egress nodes—are folded back on top of each other. In this configuration, spine devices are typically Layer 3 switches that provide connectivity between leaf devices, and leaf devices are top-of-rack (TOR) switches that provide connectivity to the servers.

Figure 1: Clos-Based IP Fabric

A key benefit of a Clos-based fabric is natural resiliency. High availability mechanisms, such as MC-LAGs or Virtual Chassis, are not required as the IP fabric uses multiple links at each layer and device; resiliency and redundancy are provided by the physical network infrastructure itself.
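The scale properties behind this resiliency can be sketched numerically. The figures below are hypothetical and only illustrate the arithmetic: with every leaf connected to every spine, traffic between any two leaf devices has as many equal-cost paths as there are spines.

```python
# Sketch: link, path, and server counts in a folded three-stage Clos
# (spine-and-leaf) fabric. Values are illustrative, not from this guide.

def clos_fabric(spines: int, leaves: int, ports_per_leaf_down: int):
    fabric_links = spines * leaves          # each leaf connects to each spine
    ecmp_paths = spines                     # leaf-to-leaf equal-cost paths
    servers = leaves * ports_per_leaf_down  # maximum attached servers
    return fabric_links, ecmp_paths, servers

# A small fabric resembling this solution: 2 spines, 5 leaves
links, paths, servers = clos_fabric(spines=2, leaves=5, ports_per_leaf_down=48)
print(links, paths, servers)  # 10 fabric links, 2 ECMP paths, 240 server ports
```

Losing any single link or spine reduces capacity but never isolates a leaf, which is why MC-LAG-style protection is unnecessary.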

Building an IP fabric is very straightforward and serves as a great foundation for overlay technologies such as Juniper Networks Contrail.


For more information about Clos-based IP fabrics, see Clos IP Fabrics with QFX5100 Switches.


Overlay Network

Using an overlay architecture in the data center allows you to decouple physical hardware from the overall network, which is one of the key tenets of virtualization. Decoupling the network from the physical hardware allows the data center network to be programmatically provisioned within seconds. Overlay networking generally supports both Layer 2 and Layer 3 transport between VMs and servers. It also supports a much larger scale: a traditional network using VLANs for separation can support about 4,000 tenants, while an overlay protocol such as VXLAN supports over 16 million.
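The tenant-scale difference comes directly from field widths: a VLAN ID is a 12-bit field, while a VXLAN Network Identifier (VNI) is 24 bits. A quick sketch of the arithmetic:

```python
# Sketch: VLAN vs. VXLAN segment scale, derived from the identifier widths.
# (In practice a couple of VLAN IDs are reserved, leaving 4,094 usable.)

vlan_ids = 2 ** 12    # 12-bit VLAN ID field
vxlan_vnis = 2 ** 24  # 24-bit VXLAN Network Identifier (VNI)

print(vlan_ids)    # 4096
print(vxlan_vnis)  # 16777216
```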

Virtual networks (VNs) are a key concept in an overlay environment. VNs are logical constructs implemented on top of the physical networks that replace VLAN-based isolation and provide multitenancy in a virtualized data center. Each VN is isolated from other VNs unless explicitly allowed by security policy. VNs can be interconnected within a data center, and between data centers.

In data center networks, tunneling protocols such as VXLAN are used to create overlays. For devices using VXLAN, the entity that performs the encapsulation and decapsulation of packets is called a VXLAN tunnel endpoint (VTEP). VTEPs typically reside within the hypervisor of virtualized hosts. Each VTEP has two interfaces: one is a switching interface that faces the VMs in the host and provides communication between VMs, while the other is an IP interface that faces the Layer 3 network.
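To make the VTEP's encapsulation step concrete, the following sketch builds the 8-byte VXLAN header defined in RFC 7348 using only the standard library. This is a simplified illustration: a real VTEP also prepends outer Ethernet, IP, and UDP headers (UDP destination port 4789) before forwarding into the underlay.

```python
# Sketch: constructing the 8-byte VXLAN header (RFC 7348).
# Layout: 1 flags byte (0x08 = "VNI valid"), 3 reserved bytes,
# 24-bit VNI in the upper bits of the last 4 bytes, 1 reserved byte.
import struct

VXLAN_UDP_PORT = 4789  # well-known UDP destination port for VXLAN

def vxlan_header(vni: int) -> bytes:
    assert 0 <= vni < 2 ** 24, "VNI is a 24-bit field"
    flags = 0x08  # I flag set: VNI field is valid
    # '!B3xI' = network byte order: 1 byte flags, 3 pad bytes, 4-byte word
    # holding the VNI shifted into its upper 24 bits.
    return struct.pack("!B3xI", flags, vni << 8)

hdr = vxlan_header(vni=5001)
print(len(hdr), hdr.hex())  # 8 0800000000138900
```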

Figure 2 shows a typical overlay architecture.

Figure 2: Overlay Architecture

In the diagram, the servers to the left and right of the IP fabric have been virtualized with a hypervisor. Each hypervisor contains a VTEP that handles the encapsulation of data-plane traffic between VMs. Each VTEP also handles MAC address learning, provisioning of new virtual networks, and other configuration changes. The server above the IP fabric is a standard physical server, and doesn’t have any VTEP capabilities of its own. In order for the physical server to participate in the overlay architecture, it needs help to encapsulate the data-plane traffic and perform MAC address learning. In this case, that help comes from the TOR switch. Supporting the VTEP role in a physical device (such as a TOR switch) simplifies the overlay architecture; any TOR switch can perform the overlay encapsulation and control-plane functions on behalf of the physical servers connected to it. From the point of view of a physical server, the network functions as usual.

Moving to an overlay architecture places a different “network tax” on the data center. Traditionally, servers and VMs each consume a MAC address and host route entry in the network. However, in an overlay architecture only the VTEPs consume a MAC address and host route entry in the network. All host-to-host traffic is now encapsulated between VTEPs, and the MAC address and host route of each server and VM aren’t visible to the underlying networking equipment. The MAC address and host route scale have been moved from the physical network hardware into the hypervisor.
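The shift in “network tax” can be illustrated with a simple count. The numbers below are hypothetical, chosen only to show how the underlay's forwarding-state burden changes when the overlay hides per-VM addresses behind VTEPs:

```python
# Sketch: forwarding-table entries the underlay must hold, with and without
# an overlay. All values are illustrative, not from this guide.

hypervisors = 40
vms_per_hypervisor = 50

# Without an overlay, every VM's MAC/host route is visible to the network.
without_overlay = hypervisors * vms_per_hypervisor

# With an overlay, the underlay sees only the VTEP (hypervisor) addresses.
with_overlay = hypervisors

print(without_overlay, with_overlay)  # 2000 vs. 40 underlay entries
```

The per-VM state has not disappeared; it has moved into the hypervisors, which scale out with the compute itself.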


For more information on overlay networks, see Learn About: VXLAN in Virtualized Data Center Networks.


Gateways

A gateway in a virtualized network environment typically refers to a physical router or switch that connects the tenant virtual networks to physical networks such as the Internet, a customer VPN, another data center, or nonvirtualized servers. This solution uses multiple types of gateways.

A Layer 2 gateway, also known as a VTEP gateway, maps VLANs to VXLANs and handles VXLAN encapsulation and decapsulation so that the nonvirtualized resources do not need to support the VXLAN protocol. This permits the VXLAN and VLAN segments to act as one forwarding domain.

Typically, a VTEP gateway runs in software as a virtual switch or virtual router instance on a virtualized server. Some switches and routers can also function as VTEP gateways, encapsulating and decapsulating VXLAN packets on behalf of bare-metal servers, as shown in Figure 2. This setup is referred to as a hardware VTEP gateway. In this solution, the QFX5100 (leaf) devices act as Layer 2 gateways to support intra-VN traffic.

To forward traffic between VXLANs, a Layer 3 gateway is required. In this solution, the MX Series (spine) devices act as Layer 3 gateways to support inter-VN traffic.


For more information on Layer 3 gateways in a data center context, see Day One: Using Ethernet VPNs for Data Center Interconnect and Juniper Networks EVPN Implementation for Next-Generation Data Center Architectures.


Contrail

Juniper Networks Contrail is a simple, open, and agile cloud network automation product that leverages SDN technology to orchestrate the creation of highly scalable virtual networks. It provides self-service provisioning, improves network troubleshooting and diagnostics, and enables service chaining for dynamic application environments. Service providers can use Contrail to enable a range of innovative new services, including cloud-based offerings and virtualized managed services.

Contrail provides the ability to abstract virtual networks at a higher layer to eliminate device-level configuration and easily control and manage policies for tenant virtual networks. A browser-based user interface enables users to define virtual network and network service policies, then configure and interconnect networks simply by attaching policies.

Contrail can be used with open cloud orchestration systems such as OpenStack. It can also interact with other operations support systems (OSS) and business support systems (BSS) using northbound APIs.

Figure 3 shows the Contrail architecture.

Figure 3: Contrail Architecture

An orchestrator, such as OpenStack, provides overall configuration management for virtualized elements in the data center, including compute (VMs), storage, and network resources.

Contrail takes the role of software-defined networking (SDN) controller to implement the network and networking services. The orchestrator sends instructions to the SDN controller on how to orchestrate the network at a very high level of abstraction. The SDN controller is responsible for translating these requests from a high level of abstraction into real actions on the physical and virtual network devices.

A Contrail system consists of two main components: the Contrail Controller and the vRouter. The Contrail Controller is a logically centralized but physically distributed SDN controller that is responsible for providing the management, control, and analytics functions of the virtualized network. The vRouter is a forwarding plane (of a distributed router) that runs in the hypervisor of virtualized compute nodes. The Contrail Controller provides the logically centralized control plane and management plane of the overall system and orchestrates the vRouters.

The Contrail Controller uses the Extensible Messaging and Presence Protocol (XMPP) to control the vRouters, and a combination of BGP and NETCONF protocols to communicate with physical devices (except TORs).


For more information about Juniper Networks Contrail, see Juniper Networks Contrail.

TOR Switches for BMS Integration

It is rare to find a data center that has virtualized all of its compute resources. Typically there are a few applications that cannot be virtualized due to performance, compliance, or other reasons, and must be kept on bare-metal servers (BMSs). This raises the question of how to enable these physical servers to interwork with virtualized elements in the data center’s overlay environment.

Overlay architectures support several mechanisms to provide connectivity to physical servers. As noted earlier, the most common option is to embed a VTEP into the physical TOR switch located at the top of a rack of physical servers. A BMS typically does not run the protocols needed to learn about other endpoints in the overlay network, nor does it interact with a controller. It also cannot encapsulate its own traffic. Therefore, the TOR switch performs these functions on its behalf.

Since this solution is controller-based, communication must be enabled between the Contrail Controller and TOR switches. Contrail Release 2.1 (and later) supports extending a cluster to include BMSs and other virtual instances connected to a TOR switch. The Open vSwitch Database Management (OVSDB) protocol is used to configure the TOR switch and to import dynamically-learned addresses, while VXLAN encapsulation is used for data-plane communication. The BMSs and other virtual instances attached to the TOR switch can belong to any of the virtual networks configured in the Contrail cluster.

To support this functionality, a new node, the TOR services node (TSN), is added to the Contrail system. The TSN effectively translates the route exchanges between vRouters on virtual instances (which communicate using XMPP) and physical network devices (which communicate using OVSDB). The TSN also handles broadcast packets from the TOR switch, and replicates them to the required compute nodes in the cluster.

This interaction is shown in Figure 4.

Figure 4: Contrail Interaction with TOR Switches

A TSN contains four components:

  • OVSDB Client—Maintains a session with the OVSDB server on a TOR switch. Route updates and configuration changes are sent over the OVSDB session.

  • TOR Agent—Maintains an XMPP session with the Contrail Controller and mediates between XMPP and OVSDB messages.

  • vRouter Forwarder—The forwarding component of a vRouter, which traps and responds to broadcast packets sent from TOR switch VTEPs.

  • TOR Control Agent—A vRouter that provides proxy services (DHCP, DNS, and ARP) for broadcast traffic arriving from servers attached to TOR switches.

As noted above, the TSN typically includes a TOR agent, which acts as an OVSDB client to the TOR switch and facilitates all interactions between the switch and Contrail system.

At the control-plane layer, the TOR agent receives route entries (MAC addresses of VMs) from the Contrail control node for the virtual networks in which the TOR switch’s attached BMSs are members, and adds the entries to its OVSDB.

The process is similar in reverse. MAC addresses learned locally by the TOR switch are propagated to the TOR agent using OVSDB. The TOR agent then exports the addresses to the Contrail control node using XMPP; from there they are distributed to compute nodes and other nodes in the cluster.

To handle broadcast traffic, the TSN receives the replication tree for each virtual network from the control node. It then adds the relevant addresses from the TOR switch, forming a complete replication tree. The TSN sends the completed tree back to the control node, which forwards it along to the other compute nodes.

At the data plane layer, VXLAN is used as the encapsulation protocol. The VXLAN tunnel endpoint (VTEP) for BMSs is on the TOR switch.

Unicast traffic from BMSs to a known destination MAC address is VXLAN-encapsulated by the TOR switch and forwarded to the destination VTEP, where the VXLAN tunnel is terminated and the packet is forwarded to the virtual instance (endpoint). Likewise, unicast traffic from virtual instances in the Contrail cluster is forwarded to the TOR switch, where the VXLAN tunnel is terminated, and the packet is forwarded to the relevant BMS.

Broadcast traffic from BMSs flows through the TOR switch to the TSN, which uses the replication tree to flood the broadcast packets in the appropriate virtual network. Likewise, broadcast traffic from the virtual instances in the Contrail cluster is sent to the TSN, which replicates the packets to the TOR switch and BMSs.
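The TSN's broadcast handling described above can be sketched as a per-virtual-network replication list: a broadcast is replicated to every member of the tree except the endpoint it arrived from. All names and memberships below are hypothetical.

```python
# Sketch of TSN broadcast replication. For each virtual network, the TSN
# holds a replication tree (here simplified to a flat member list) and
# floods broadcasts to every member except the source.

replication_tree = {
    "vn-blue": ["compute-1", "compute-2", "tor-1"],
    "vn-red": ["compute-1", "tor-1"],
}

def flood(vn, source):
    """Return the endpoints a broadcast from `source` is replicated to."""
    return [member for member in replication_tree[vn] if member != source]

# A broadcast from a BMS behind tor-1 reaches the compute nodes in vn-blue:
print(flood("vn-blue", source="tor-1"))  # ['compute-1', 'compute-2']

# A broadcast from a VM on compute-1 in vn-red reaches the TOR (and its BMSs):
print(flood("vn-red", source="compute-1"))  # ['tor-1']
```

In the real system the tree is computed by the control node, with the TSN adding the TOR switch's own addresses before replication.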


For more information on using OVSDB-enabled TOR switches to integrate BMSs into a Contrail virtualized environment, see Using TOR Switches and OVSDB to Extend the Contrail Cluster.

For more detailed information on control plane and data plane interactions in intra- and inter-VN scenarios, see Using Contrail with OVSDB in Top-of-Rack Switches.

Design Considerations

There are several design considerations when implementing an IaaS network.

Fabric connectivity—Data center fabrics can be based on Layer 2 or Layer 3 technologies. Ethernet fabrics, such as Juniper Networks Virtual Chassis Fabric, are simple to manage and provide scale and equal-cost multipath (ECMP) capabilities to a certain degree. However, as the fabric grows, the network eventually becomes too large for an Ethernet fabric to handle. Tenant separation is another issue: because Ethernet fabrics have no overlay network, VLANs must be used, adding another limitation to the scalability of the network.

An IaaS data center network requires Layer 3 protocols to provide the ECMP and scale capabilities for a network of this size. While IGPs provide excellent ECMP capabilities, BGP is the ideal option to provide the proper scaling and performance required by this solution. BGP was designed to handle the scale of the global Internet, and can be repurposed to support the needs of top-tier service provider data centers.

BGP design—With BGP decided upon as the routing protocol for the fabric, the next decision is whether to use internal BGP (IBGP) or external BGP (EBGP). The very nature of an IP fabric requires having multiple, equal-cost paths; therefore, the key factor to consider here is how IBGP and EBGP implement ECMP functionality.

IBGP requires that all devices peer with one another. In an IaaS network, BGP route reflectors typically would be implemented in the spine layer of the network to help with scaling. However, standard BGP route reflection only reflects the best (single) prefix to clients. In order to enable full ECMP, you need to configure the BGP AddPath feature to provide additional ECMP paths into the BGP route reflection advertisements to clients.

Alternatively, EBGP supports ECMP without enabling additional features. It is easy to configure, and also facilitates traffic engineering if desired through standard EBGP techniques such as autonomous system (AS) padding.
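The practical difference between the two designs is how many paths a device ends up with for each prefix. The sketch below is a deliberately simplified illustration of that contrast, not a model of actual BGP machinery:

```python
# Sketch: paths a leaf receives for one prefix under IBGP route reflection
# versus IBGP with AddPath. Purely illustrative names and values.

# Two equal-cost paths to a prefix exist in the fabric:
paths_to_prefix = ["via-spine-1", "via-spine-2"]

# Standard route reflection advertises only the single best path,
# so clients cannot load-balance:
standard_rr_advertises = paths_to_prefix[:1]

# With AddPath enabled, the route reflector advertises all paths,
# restoring ECMP at the clients:
addpath_advertises = list(paths_to_prefix)

print(standard_rr_advertises)  # ['via-spine-1']
print(addpath_advertises)      # ['via-spine-1', 'via-spine-2']
```

EBGP avoids the problem entirely, because each device learns all equal-cost paths directly from its peers.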

With EBGP, each device in the IP fabric uses a different AS number. It is also a good practice to align the AS numbers within each layer. As an example, Figure 5 shows the spine layer with AS numbering in the 651xx range, and the leaf layer with AS numbering in the 652xx range.

Figure 5: AS Numbering in an IP Fabric

Because EBGP supports ECMP in a more straightforward fashion, an EBGP-based IP fabric is typically used at the underlay layer.
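The per-layer AS numbering convention from Figure 5 can be expressed as a simple assignment rule. The exact values below (65101, 65102, and so on) are illustrative examples of the 651xx/652xx scheme, not configuration from this guide:

```python
# Sketch: assigning one private AS number per device, aligned by layer
# as in Figure 5 (spines in the 651xx range, leaves in 652xx).

def assign_as_numbers(spines, leaves):
    spine_as = {f"spine-{i}": 65100 + i for i in range(1, spines + 1)}
    leaf_as = {f"leaf-{i}": 65200 + i for i in range(1, leaves + 1)}
    return spine_as, leaf_as

spine_as, leaf_as = assign_as_numbers(spines=2, leaves=5)
print(spine_as)  # {'spine-1': 65101, 'spine-2': 65102}
print(leaf_as)   # {'leaf-1': 65201, ..., 'leaf-5': 65205}
```

Aligning AS numbers by layer makes it easy to read a device's role straight out of an AS path.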


For information on Juniper Networks’ validated Clos-based Layer 3 IP fabric solution, see Solution Guide: Software as a Service.

Note though that the overlay implementation can affect EBGP usage in the underlay. In this solution, the spine devices act as Layer 3 gateways for the Contrail system. Communication between these elements uses IBGP, so the spine devices must use the same AS number. An example of this setup is shown in Figure 6.

Figure 6: AS Numbering in an IP Fabric with IBGP at the Spine Layer

Note that this hybrid model does not affect the overall BGP design. EBGP remains in use between the spine and leaf layers, thus continuing to provide the ECMP benefits noted above.


For more information about Clos-based IP fabric design, see Clos IP Fabrics with QFX5100 Switches.

For more information about the BGP AddPath feature, see Understanding the Advertisement of Multiple Paths to a Single Destination in BGP.

Contrail High Availability

As mentioned earlier, the Contrail Controller is logically centralized but physically distributed. Physically distributed means that the controller consists of multiple types of nodes, each of which can have multiple instances to support high availability and scaling. The node instances can be physical servers or virtual machines.

When multiple redundant instances of the nodes that comprise the Contrail Controller are configured, they operate in active/active mode, providing fault tolerance for node failures. This approach applies to all node types: the configuration node, the control node, the analytics node, and the TOR services node.

As part of a Contrail high availability (HA) implementation, each vRouter establishes connections with at least two control nodes. The vRouter receives all state information (routes, routing instance, configuration, and so on) from both control nodes, and makes a local decision about which copy to use. If a control node fails, the vRouter flushes all state information from the failed control node. Since it already has a redundant copy of the state information from the other control node, the vRouter can immediately switch over without any need for resynchronization. The vRouter then creates a connection to a new control node, to replace the failed control node and reestablish dual-node redundancy.
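The resynchronization-free failover described above can be sketched as state kept per control-node connection: when one connection fails, only that copy is flushed, and the surviving copy keeps the forwarding state intact. Class and route names below are hypothetical, not Contrail APIs.

```python
# Sketch of vRouter HA behavior: one state copy per control-node session,
# flush-on-failure, immediate switchover, then reconnect to a new node.

class VRouterState:
    def __init__(self, control_nodes):
        # one copy of received state per control-node connection
        self.copies = {cn: set() for cn in control_nodes}

    def learn(self, control_node, route):
        self.copies[control_node].add(route)

    def control_node_failed(self, failed, replacement):
        self.copies.pop(failed)           # flush state from the failed node
        self.copies[replacement] = set()  # connect to a new control node

    def active_routes(self):
        # union of all copies: unaffected by losing one control node
        return set().union(*self.copies.values())

vr = VRouterState(["cn-1", "cn-2"])
for cn in ("cn-1", "cn-2"):       # both nodes advertise the same state
    vr.learn(cn, "10.1.1.0/24")

vr.control_node_failed("cn-1", replacement="cn-3")
print(vr.active_routes())  # {'10.1.1.0/24'} — no resynchronization needed
```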

Contrail Release 2.20 (and later) supports TOR agent redundancy through the use of HAProxy, an open source solution offering high availability, load balancing, and proxying for TCP- and HTTP-based applications. With TOR agent redundancy, two TSNs are assigned to manage the same TOR switch. Both TSNs receive the same configuration from the control node, and sync their routing information. The TOR switch establishes a connection through HAProxy to one of the two TOR agents, based on current load. If a failure occurs, HAProxy reconnects to the TOR agent on the standby TSN.

To extend HA even further, you can enable redundant HAProxy instances in an active/standby configuration. Using the Virtual Router Redundancy Protocol (VRRP) and keepalives to maintain awareness of each other and detect failures, both HAProxy instances monitor the TOR agents. The TOR switch uses the VRRP virtual IP address to connect to the active HAProxy instance on the master VRRP node, which proxies the connection to a TOR agent. If the primary HAProxy instance fails, the VRRP backup node becomes master and the standby HAProxy instance becomes active. The TOR switch reconnects to the new active HAProxy instance.


For more information about high availability for Contrail, see Contrail High Availability Support, Contrail Scale-Out Architecture and High Availability, and Configuring High Availability for the Contrail OVSDB TOR Agent.


Since the architecture described in this guide is based on well-known and well-supported industry standards, equipment from a number of vendors can serve as the gateway router or the TOR switch. In the configuration example that follows, MX Series routers serve as gateway devices, and QFX5100 switches are used as the TOR switches. Each of these devices supports the required protocols and is well-suited to its respective role.

The following hardware equipment and key software features were used to create the IaaS solution described in the upcoming example:


Spine Devices

  • Two MX960 routers

    • Function as Layer 3 gateway devices

    • Connected to five downstream (leaf) devices: two QFX5100 VCs and three standalone QFX5100s

    • EBGP peering with leaf devices, with multipath for load balancing

    • IBGP peering with the Contrail control nodes

    • OSPF for IBGP reachability (spine to spine, spine to control nodes)

    • Per-packet load balancing

    • BFD for all BGP sessions


Leaf Devices

  • Five devices: two QFX5100 VCs and three standalone QFX5100s

    • Connected to two upstream (spine) MX960s

    • EBGP peering with spine devices, with multipath for load balancing

    • OSPF for inter-spine reachability

    • Per-packet load balancing

    • BFD for all BGP sessions


Virtual Chassis Fabric (VCF) is supported in place of VCs at the leaf layer.


Contrail Cluster

  • One QFX5100 hosting the Contrail cluster

  • Three Contrail control nodes

    • Three Super Micro (physical) servers, each hosting a control node

  • Three Contrail TOR services nodes (TSNs)

    • One IBM Flex Blade server with ESXi OS, hosting two TSNs

    • One Super Micro (physical) server hosting one TSN

  • Contrail high availability

Servers / Compute

  • Two compute nodes (including vRouter, hypervisor, and VMs), attached to QFX5100 VC leaf devices

    • Two Super Micro (physical) servers, each hosting a compute node

  • Two bare-metal servers (BMSs), attached to QFX5100 VC leaf devices

    • Traffic generator simulating BMS traffic