
Contrail Service Setup


Before starting our investigation, let’s look at the setup. The setup built for this book includes the following devices, and most of our case studies are based on it:

  • one CentOS server running as the k8s master and Contrail Controller

  • two CentOS servers, each running as a k8s node with a Contrail vRouter

  • one Juniper QFX switch running as the underlay leaf

  • one Juniper MX router running as a gateway router, or a spine

  • one CentOS server running as an Internet host machine

Figure 1 depicts the setup.

Figure 1: Contrail Service Setup

To minimize resource utilization, all servers are actually CentOS VMs created by a VMware ESXi hypervisor running on one physical HP server. This is also the same testbed used for ingress. The Appendix of this book has details about the setup.

Contrail ClusterIP Service

Chapter 3 demonstrated how to create and verify a clusterIP service. This section revisits that lab to look at some important details of Contrail’s specific implementation. Let’s continue on and add a few more tests to illustrate the Contrail service load balancer implementation details.

ClusterIP as Floating IP

Here is the YAML file used to create a clusterIP service:
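A minimal manifest along these lines would do it (a sketch only: the service name, port 8888, targetPort 80, and the app: webserver selector are assumed from the outputs discussed in this section):

```yaml
# Sketch of a clusterIP service manifest; values assumed from this
# section's outputs, not copied verbatim from the book's file.
apiVersion: v1
kind: Service
metadata:
  name: service-web-clusterip
spec:
  # The default type is ClusterIP, so no explicit "type:" line is needed.
  selector:
    app: webserver          # must match the backend pod's label
  ports:
  - protocol: TCP
    port: 8888              # the service (clusterIP) port
    targetPort: 80          # the container port in the backend pod
```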

And here’s a review of what we got from the service lab in Chapter 3:

You can see one service is created, with one pod running as its backend. The label in the pod matches the SELECTOR in the service. The pod name also indicates this is a deployment-generated pod. Later we can scale the deployment for the ECMP case study, but for now let’s stick to one pod and examine the clusterIP implementation details.

In Contrail, a clusterIP is essentially implemented in the form of a floating IP. Once a service is created, a floating IP is allocated from the service subnet and associated with all the backend pod VMIs to form the ECMP load balancing. Now all the backend pods can be reached via the clusterIP (along with the pod IP). This clusterIP (floating IP) acts as a VIP to the client pods inside of the cluster.


Why does Contrail choose a floating IP to implement the clusterIP? In the previous section, you learned that Contrail does NAT for floating IPs, and a service also needs NAT. So, it is natural to use the floating IP for the clusterIP.

For the load balancer type of service, Contrail allocates a second floating IP, the EXTERNAL-IP, as the VIP, and this external VIP is advertised outside of the cluster through the gateway router. You will get more details about this later.

From the UI you can see the automatically allocated floating IP as the clusterIP in Figure 2.

Figure 2: Cluster IP as Floating IP

And the floating IP is also associated with the pod VMI and pod IP; in this case the VMI represents the pod interface shown in Figure 3.

Figure 3: Pod Interface

The interface can be expanded to display more details as in the next screen capture, shown in Figure 4.

Figure 4: Pod Interface Detail

Expand the fip_list, and you’ll see the information here:

Service/clusterIP/FIP maps to podIP/fixed_ip. The port_map says that port 8888 is a nat_port, and 6 is the protocol number, meaning TCP. Overall, clusterIP:port will be translated to podIP:targetPort, and vice versa.

Now you understand that, with a floating IP representing the clusterIP, NAT happens in the service. NAT will be examined again in the flow table.
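The translation just described can be modeled as a tiny lookup table. This is a toy sketch only, not vRouter code, and the service IP 10.99.225.17 and pod IP 10.47.255.238 are made-up illustrative addresses, not values from this lab:

```python
# Toy model of the DNAT/SNAT that the vRouter performs for a
# clusterIP service. All addresses here are made up for illustration.
PROTO_TCP = 6

# (clusterIP, service port, protocol) -> (pod IP, targetPort)
nat_map = {
    ("10.99.225.17", 8888, PROTO_TCP): ("10.47.255.238", 80),
}

def dnat(dst_ip, dst_port, proto):
    """Forward direction: rewrite service VIP:port to pod IP:targetPort."""
    return nat_map.get((dst_ip, dst_port, proto), (dst_ip, dst_port))

def snat(src_ip, src_port, proto):
    """Return direction: rewrite pod IP:targetPort back to VIP:port."""
    for (vip, vport, p), (pod_ip, pod_port) in nat_map.items():
        if p == proto and (src_ip, src_port) == (pod_ip, pod_port):
            return (vip, vport)
    return (src_ip, src_port)

# clusterIP:8888 is translated to podIP:80, and vice versa
assert dnat("10.99.225.17", 8888, PROTO_TCP) == ("10.47.255.238", 80)
assert snat("10.47.255.238", 80, PROTO_TCP) == ("10.99.225.17", 8888)
```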

Scaling Backend Pods

In Chapter 3’s clusterIP service example, we created a service and a backend pod. To verify ECMP, let’s increase the replicas number to 2 to generate a second backend pod. This is a more realistic model: the pods now back each other up to avoid a single point of failure.

Instead of using a YAML file to manually create a new webserver pod, with the Kubernetes spirit in mind, scale the deployment, as was done earlier in this book. In our service example we’ve been using a deployment object on purpose to spawn our webserver pod:
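Assuming the deployment is named webserver (as the pod names in the earlier outputs suggest), the scaling is a one-liner:

```shell
# Scale the deployment behind the service to two replicas;
# Kubernetes launches the second backend pod automatically.
kubectl scale deployment webserver --replicas=2

# Verify: two pods, and both pod IPs listed in the endpoints object.
kubectl get pod -o wide
kubectl get endpoints service-web-clusterip -o wide
```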

Immediately after you scale the deployment to replicas 2, a new pod is launched. You end up having two backend pods: one running in the same node, cent222, as the client pod (the local node for the client pod); the other running in the other node, cent333 (the remote node from the client pod’s perspective). And the endpoints object gets updated to reflect the current set of backend pods behind the service.


Without the -o wide option, only the first endpoint will be displayed properly.

Let’s check the floating IP again.

Figure 5: Cluster IP as Floating IP (ECMP)

In Figure 5 you can see the same floating IP, but now it is associated with two podIPs, each representing a separate pod.

ECMP Routing Table

First let’s examine the ECMP. Let’s take a look at the routing table in the controller’s routing instance in the screen capture seen in Figure 6.

Figure 6: Control Node Routing Instance Table

The routing instance (RI) has a full name with the following format:


In most cases the RI inherits the same name from its virtual network, so in this case the full IPv4 routing table has this name:

The .inet.0 indicates the routing table type is unicast IPv4. There are many other tables that are not of interest to us right now.

Two routing entries with the exact same prefix of the clusterIP show up in the routing table, with two different next hops, each pointing to a different node. This gives us a hint about the route propagation process: both (compute) nodes have advertised the same clusterIP toward the master (Contrail Controller) to indicate that running backend pods are present on them. This route propagation is via XMPP. The master (Contrail Controller) then reflects the routes to all the other compute nodes.

Compute Node Perspective

Next, starting from the client pod node cent222, let’s look at the pod’s VRF table to understand how the packets are forwarded towards the backend pods in the screen capture in Figure 7.

Figure 7: vRouter VRF Table

The most important part of the screenshot in Figure 7 is the routing entry Prefix: /32 (1 Route), as it is our clusterIP address. Underneath the prefix is the statement ECMP Composite sub nh count: 2. This indicates the prefix has multiple possible next hops.

Now, expand the ECMP statement by clicking the small triangle icon on the left and you will see a lot more details about this prefix, as shown in the next screen capture in Figure 8.

The most important of all the details in this output is our focus, nh_index: 87, which is the next hop ID (NHID) for the clusterIP prefix. From the vRouter agent Docker container, you can further resolve the composite NHID to the sub-NHs, which are the member next hops under the composite next hop.

Figure 8: vRouter ECMP Next Hop

Don’t forget to execute the vRouter commands from the vRouter Docker container. Doing it directly from the host may not work:
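A sketch of the commands (the container name is an assumption for this sketch; the NH and vif IDs follow this section’s outputs and will differ in another setup):

```shell
# Enter the vRouter agent container first; the utilities below
# usually aren't in the host's PATH.
docker exec -it vrouter_vrouter-agent_1 bash

# Resolve the composite next hop 87 into its member (sub) next hops
nh --get 87

# Resolve a vif index to the pod interface details
vif --get 8
```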

Some important information to highlight from this output:

  • NHID 87 is an ECMP composite next hop.

  • The ECMP next hop contains two sub-next hops: next hop 51 and next hop 37, each representing a separate path towards a backend pod.

  • Next hop 51 represents an MPLSoUDP tunnel toward the backend pod in the remote node. The tunnel is established from the current node, cent222, with the source IP being the local fabric IP, to the other node, cent333. If you recall where our two backend pods are located, this is the forwarding path between the two nodes.

  • Next hop 37 represents a local path, towards vif 0/8 (Oif:8), which is the local backend pod’s interface.

To resolve the vRouter vif interface, use the vif --get 8 command:

The output displays the corresponding local pod interface’s name, IP, etc.

ClusterIP Service Workflow

The clusterIP service’s load balancer ECMP workflow is illustrated in Figure 9.

Figure 9: Contrail ClusterIP Service Load Balancer ECMP Forwarding

This is what happens in the forwarding plane:

  • A pod client located in node cent222 needs to access a service service-web-clusterip. It sends a packet towards the service’s clusterIP and port 8888.

  • The pod client sends the packet to node cent222 vRouter based on the default route.

  • The vRouter on node cent222 gets the packet, checks its corresponding VRF table, gets a Composite next hop ID 87, which resolves to two sub-next hops 51 and 37, representing a remote and local backend pod, respectively. This indicates ECMP.

  • The vRouter on node cent222 starts to forward the packet to one of the pods based on its ECMP algorithm. Suppose the remote backend pod is selected: the packet is sent through the MPLSoUDP tunnel to the remote pod on node cent333, after the flow is established in the flow table. All subsequent packets belonging to the same flow follow this same path. The same applies to the local path towards the local backend pod.
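The per-flow stickiness in the last step can be sketched with a toy 5-tuple hash. This is illustrative only: the real vRouter uses its own hash and per-flow table, and the addresses below are made up:

```python
import hashlib

def pick_next_hop(src_ip, dst_ip, proto, sport, dport, members):
    """Hash the flow 5-tuple so every packet of a flow takes the same path."""
    key = "%s|%s|%d|%d|%d" % (src_ip, dst_ip, proto, sport, dport)
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return members[digest % len(members)]

# Sub-next hops from the composite NH in this section:
# 51 = MPLSoUDP tunnel to the remote node, 37 = local pod vif
members = [51, 37]

# The same 5-tuple (made-up addresses) always maps to the same member,
# so all packets of one TCP session follow one path.
first = pick_next_hop("10.47.255.249", "10.99.225.17", 6, 40001, 8888, members)
assert all(
    pick_next_hop("10.47.255.249", "10.99.225.17", 6, 40001, 8888, members) == first
    for _ in range(5)
)
```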

Multiple Port Service

You should now understand how the service Layer 4 ECMP and the LB objects in the lab work. Figure 10 shows the LB and relevant objects: one LB may have two or more LB listeners, and each listener has an individual backend pool with one or multiple members.

Figure 10: Service Load Balancer

In Kubernetes, this 1:N mapping between load balancer and listeners indicates a multiple-port service: one service with multiple ports. Let’s look at its YAML file, svc/service-web-clusterip-mp.yaml:
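Based on the description that follows, the manifest would look roughly like this (a sketch; names and ports assumed from the text):

```yaml
# Sketch of the multiple-port service; values assumed from the text.
apiVersion: v1
kind: Service
metadata:
  name: service-web-clusterip-mp
spec:
  selector:
    app: webserver
  ports:
  - name: port1             # with multiple ports, each needs a name
    protocol: TCP
    port: 8888
    targetPort: 80
  - name: port2
    protocol: TCP
    port: 9999
    targetPort: 90
```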

What has been added is another item in the ports list: a new service port 9999 that maps to the container’s targetPort 90. Now, with two port mappings, you have to give each port a name, say, port1 and port2, respectively.


Without a port name the multiple ports’ YAML file won’t work.

Now apply the YAML file. A new service service-web-clusterip-mp with two ports is created:


To simplify the case study, the backend deployment’s replicas number has been scaled down to one.

Everything looks okay, doesn’t it? The new service comes up with two service ports exposed: 8888, the old one we’ve tested in previous examples, and the new port 9999, which should work equally well. But it turns out that is not the case. Let’s investigate.

Service port 8888 works:

Service port 9999 doesn’t work:

The request towards port 9999 is rejected. The reason is that nothing is listening on the targetPort in the pod container, so there is no way to get a response from it:

The readinessProbe introduced in Chapter 3 is the official Kubernetes tool to detect this situation: when the pod is not ready it is taken out of the service endpoints, and you will catch the events.

To resolve this, let’s start a new server in the pod to listen on the new port 90. One of the easiest ways to start an HTTP server is to use the SimpleHTTPServer module that comes with the Python package. In this test we only need to set its listening port to 90 (the default value is 8000):
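For example (the pod name here is an assumption; with Python 2 the module is SimpleHTTPServer, while Python 3 renamed it to http.server):

```shell
# Start a simple HTTP server on port 90 inside the backend pod
kubectl exec -it webserver-846c9ccb8b-xkjpw -- \
    python -m SimpleHTTPServer 90

# Python 3 equivalent:
# kubectl exec -it webserver-846c9ccb8b-xkjpw -- python3 -m http.server 90
```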

The targetPort is now listening. You can again start the request towards service port 9999 from the cirros pod. This time it succeeds and gets the returned webpage from Python’s SimpleHTTPServer:

Next, for each incoming request, the SimpleHTTPServer logs one line of output with an IP address showing where the request came from. In this case, the request is coming from the client pod with the IP address:

Contrail Flow Table

So far, we’ve tested the clusterIP service and seen that client requests are sent towards the service IP. In Contrail, the vRouter is the module that does all of the packet forwarding. When the vRouter gets a packet from the client pod, it looks up the client pod’s VRF table in the vRouter module, gets the next hop, and resolves the correct egress interface and proper encapsulation. In our setup, the client and backend pods are on two different nodes, and the source vRouter decides the packets need to be sent in the MPLSoUDP tunnel towards the node where the backend pod is running. What interests us the most?

  • How are the service IP and backend pod IP translated to each other?

  • Is there a way to capture and see the two IPs in a flow, before and after the translations, for comparison purposes?

The most straightforward method you would think of is to capture the packets, decode them, and then see the results. Doing that, however, may not be as easy as you expect. First you need to capture the packet at different places:

  • At the pod interface: this is after the address is translated, and that’s easy.

  • At the fabric interface: this is before the packet is translated and reaches the pod interface. Here the packets carry MPLSoUDP encapsulation, since data plane packets are tunneled between nodes.

Then you need to copy the pcap file out and load it into Wireshark to decode it. You probably also need to configure Wireshark to recognize the MPLSoUDP encapsulation.

An easier way to do this is to check the vRouter flow table, which records IP and port details about a traffic flow. Let’s test it by preparing a big file, file.txt, in the backend webserver pod and try to download it from the client pod.
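One way to prepare the file (a sketch; the pod name and web root path are assumptions):

```shell
# Create a ~100 MB file in the webserver pod's web root so the
# download holds the TCP session open long enough to inspect the flow.
kubectl exec -it webserver-846c9ccb8b-xkjpw -- \
    dd if=/dev/zero of=/var/www/html/file.txt bs=1M count=100
```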


You may wonder: to trigger a flow, why don’t we simply use the same curl test to pull the webpage, as we did in an earlier test? In theory, that is fine. The only problem is that the flow entry follows the TCP session. In our previous test with curl, the TCP session starts and stops immediately after the webpage is retrieved, and then the vRouter clears the flow right away. You won’t be fast enough to capture the flow table at the right moment. Instead, downloading a big file will hold the TCP session (as long as the file transfer is ongoing, the session remains) and you can take your time to investigate the flow. Later on, the Ingress section will demonstrate a different method with a one-line shell script.

So, in the client pod curl URL, instead of just giving the root path / to list the files in the folder, let’s try to pull the file file.txt:

And in the server pod we see the log indicating the file download has started:

Now, with the file transfer going on, there’s enough time to collect the flow table from both the client and server nodes, in the vRouter container:
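The flow utility runs inside the vRouter agent container; a sketch (the --match address is a placeholder, not a lab value):

```shell
# List all active flow entries
flow -l

# Or filter down to the flows of interest, e.g. by an IP or IP:port
flow --match "<client-pod-IP>"
```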

Client node flow table:

Highlights in this output are:

  • The client pod starts the TCP connection from its pod IP and a random source port, towards the service IP and server port 9999.

  • The flow TCP flag SSrEEr indicates the session is established bi-directionally.

  • The Action: F means forwarding. Note that there is no special processing like NAT happening here.


When using a filter such as --match, only flow entries with the matching IPs are displayed.

We can conclude that, from the client node’s perspective, it only sees the service IP and is not aware of any backend pod IP at all.

Let’s look at the server node flow table in the server node vRouter Docker container:

Look at the second flow entry first: the IPs look the same as the ones we just saw in the client-side capture. Traffic lands on the vRouter fabric interface from the remote client pod’s node, across the MPLSoUDP tunnel. The destination IP and port are the service IP and the service port, respectively. Nothing special here.

However, the flow Action is now set to N(DPd), not F. According to the header lines in the flow command output, this means NAT: specifically, DNAT (destination address translation) with DPAT (destination port translation). Both the service IP and service port are translated to the backend pod IP and port.

Now look at the first flow entry. The source IP is the backend pod IP and the source port is the Python server port 90 opened in the backend container. Obviously, this is the returning traffic, indicating that the file download is still ongoing. The Action is also NAT (N), but this time it is the reverse operation: source NAT (SNAT) and source PAT (SPAT).

The vRouter translates the backend’s source IP and source port to the service IP and port before putting the packet into the MPLSoUDP tunnel back to the client pod in the remote node.

The complete end-to-end traffic flow is illustrated in Figure 11.

Figure 11: ClusterIP Service Traffic Flow (NAT)

Contrail Load Balancer Service

Chapter 3 briefly discussed the load balancer service. It mentioned that if the goal is to expose the service to the external world outside of the cluster, you just specify the ServiceType as LoadBalancer in the service YAML file.

In Contrail, whenever a service of type: LoadBalancer gets created, not only is a clusterIP allocated and exposed to other pods within the cluster, but a floating IP from the public floating IP pool is also assigned to the load balancer instance as an external IP and exposed to the public world outside of the cluster.

While the clusterIP is still acting as a VIP to the clients inside of the cluster, the floating IP, or external IP, essentially acts as a VIP facing those clients sitting outside of the cluster, for example, a remote Internet host that sends requests to the service across the gateway router.

The next section demonstrates how the LoadBalancer type of service works in our end-to-end lab setup, which includes the Kubernetes cluster, fabric switch, gateway router, and Internet host.

External IP as Floating IP

Let’s look at the YAML file of a LoadBalancer service. It’s the same as the clusterIP service except just one more line declaring the service type:
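A sketch of such a manifest (the service name and ports are assumptions, mirroring the clusterIP example):

```yaml
# Sketch of a LoadBalancer service; identical to the clusterIP
# manifest except for the explicit type line.
apiVersion: v1
kind: Service
metadata:
  name: service-web-lb
spec:
  type: LoadBalancer        # the one extra line
  selector:
    app: webserver
  ports:
  - protocol: TCP
    port: 8888
    targetPort: 80
```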

Create and verify the service:

Compare the output with the clusterIP service type: this time there is an IP allocated in the EXTERNAL-IP column. If you remember what we covered in the floating IP pool section, you should understand this EXTERNAL-IP is actually another FIP, allocated from the namespace FIP pool or the global FIP pool. We did not give any specific floating IP pool information in the service object YAML file, so the right floating IP pool is selected automatically based on the algorithm.

From the UI you can see that for the load balancer service we now have two floating IPs: one as the clusterIP (internal VIP) and the other as the EXTERNAL-IP (external VIP), as can be seen in Figure 12:

Figure 12: Two Floating IPs for a Load Balancer Service

Both floating IPs are associated with the pod interface shown in the next screen capture, Figure 13.

Figure 13: Pod Interface

Expand the tap interface and you will see two floating IPs listed in the fip_list:

Figure 14: Pod Interface Detail

Now you should understand that the only difference between the two types of services is that for the load balancer service, an extra FIP is allocated from the public FIP pool, advertised to the gateway router, and acts as the outside-facing VIP. That is how the load balancer service exposes itself to the external world.

Gateway Router VRF Table

In the Contrail floating IP section you learned how to advertise a floating IP. Now let’s review the main concepts to understand how it works in the Contrail service implementation.

The route-target community setting in the floating IP VN makes it reachable by the Internet host, so effectively our service is now also exposed to the Internet instead of only to pods inside of the cluster. Examining the gateway router’s VRF table reveals this:
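On the gateway this is a plain VRF route lookup; a sketch, with the VRF name k8s-public being an assumption for this example:

```
show route table k8s-public.inet.0 protocol bgp
```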

The floating IP host route is learned by the gateway router from the Contrail controller (more specifically, the Contrail control node), which acts as a standard MP-BGP VPN route reflector, reflecting routes between the compute nodes and the gateway router. A further look at the detailed version of the same route displays more information about the process:

Highlights of the output here are:

  • The Source indicates from which BGP peer the route is learned; it is the Contrail Controller (and Kubernetes master) in the book’s lab.

  • The protocol next hop tells who generated the route; it is node cent333, where the backend webserver pod is running.

  • The gr-2/2/0.32771 is an interface representing the MPLS-over-GRE tunnel between the gateway router and node cent333.

Load Balancer Service Workflow

To summarize: the floating IP given to the service as its external IP is advertised to the gateway router and gets loaded into the router’s VRF table. When the Internet host sends a request to the floating IP, the gateway router forwards it through the MPLSoGRE tunnel to the compute node where the backend pod is located.

The packet flow is illustrated in Figure 15.

Figure 15: Load Balancer Service Workflow

Here is the full story of Figure 15:

  • Create a FIP pool from a public VN, with a route target. The VN is advertised to the remote gateway router via MP-BGP.

  • Create a pod with a label app: webserver, and Kubernetes decides the pod will be created in node cent333. The node publishes the pod IP via XMPP.

  • Create a loadbalancer type of service with service port and label selector app=webserver. Kubernetes allocates a service IP.

  • Kubernetes finds the pod with the matching label and updates the endpoint with the pod IP and port information.

  • Contrail creates a loadbalancer instance and assigns a floating IP to it. Contrail also associates that floating IP with the pod interface, so there will be a one-to-one NAT operation between the floating IP and the pod IP.

  • Via XMPP, node cent333 advertises this floating IP to Contrail Controller cent111, which then advertises it to the gateway router.

  • On receiving the floating IP prefix, the gateway router checks that the RT of the prefix matches what it’s expecting and imports the prefix into the local VRF table. At this moment the gateway learns that the next hop of the floating IP is cent333, so it creates a soft GRE tunnel toward cent333.

  • When the gateway router sees a request coming from the Internet toward the floating IP, it will send the request to the node cent333 through the MPLS over GRE tunnel.

  • The vRouter in the node sees the packets destined to the floating IP and performs NAT so the packets are sent to the right backend pod.

Verify Load Balancer Service

To verify end-to-end service access from the Internet host to the backend pod, let’s log in to the Internet host desktop and launch a browser, with the URL pointing to the service’s external IP.


Keep in mind that the Internet host request has to be sent to the public floating IP, not to the service IP (clusterIP) or the backend pod IP, which are only reachable from inside the cluster!

You can see the returned web page in the browser in Figure 16.

Figure 16: Verify End-to-End Service

This book’s lab installed a CentOS desktop as an Internet host.

To simplify the test, you can also SSH into the Internet host and test it with the curl tool:
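A sketch of the test (the placeholder stands for the EXTERNAL-IP shown by kubectl get service; the real address is not reproduced here):

```shell
# From the Internet host: the request must target the public floating IP
curl http://<EXTERNAL-IP>:8888/
```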

And the Kubernetes service is available from the Internet!

Load Balancer Service ECMP

You’ve seen how the load balancer type of service is exposed to the Internet and how the floating IP did the trick. In the clusterIP service section, you also saw how the service load balancer ECMP works. But what you haven’t seen yet is how the ECMP processing works under the load balancer type of service. To demonstrate this, we again scale the deployment to generate one more backend pod behind the service:

Here’s the question: with two pods on different nodes, both as backends now, from the gateway router’s perspective, when it gets the service request, which node does it choose to forward the traffic to?

Let’s check the gateway router’s VRF table again:

The same floating IP prefix is imported, as we saw in the previous example, except that now the same route is learned twice and an additional MPLSoGRE tunnel is created. Previously, in the clusterIP service example, the detail option was used in the show route command to find the tunnel endpoints. This time we examine the soft GRE gr- interface to find the same:

The IP-Header of the gr- interface indicates the two end points of the GRE tunnel:

  • Here the tunnel is between node cent222 and the gateway router.

  • Here the tunnel is between node cent333 and the gateway router.

We end up with two tunnels in the gateway router, each pointing to a different node where a backend pod is running. Now we expect the router to perform ECMP load balancing between the two GRE tunnels whenever it gets a service request toward the same floating IP. Let’s check it out.

Verify Load Balancer Service ECMP

To verify ECMP, let’s just pull the webpage a few more times; we can expect to see both pod IPs eventually displayed.

Turns out this never happens!


Lynx is another terminal web browser similar to the w3m program that we used earlier.

The only webpage returned is from the first backend pod, webserver-846c9ccb8b-xkjpw, running on node cent222. The other one never shows up, so the expected ECMP is not happening yet. But when you examine the routes using the detail or extensive keyword, you’ll find the root cause:

This reveals that even though the router learned the same prefix from both nodes, only one is Active; the other won’t take effect because it is NotBest. Therefore, the second route and the corresponding GRE interface gr-2/2/0.32771 will never get loaded into the forwarding table:

This is the default Junos BGP path selection behavior, but a detailed discussion of that is beyond the scope of this book.


For the Junos BGP path selection algorithm, go to the Juniper TechLibrary: https://www.

The solution is to enable the multipath vpn-unequal-cost knob under the VRF routing instance:
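In Junos configuration terms this is one statement under the routing instance’s routing-options (the VRF name k8s-public is an assumption for this sketch):

```
set routing-instances k8s-public routing-options multipath vpn-unequal-cost
```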

Now let’s check the VRF table again:

A Multipath entry with both GRE interfaces is added under the floating IP prefix, and the forwarding table reflects the same:

Now, pull the webpage from the Internet host multiple times with curl or a web browser and you’ll see the randomized result: both backend pods get requests and send responses back:

The end-to-end packet flow is illustrated in Figure 17.

Figure 17: Load Balancer Service ECMP