Type 5 EVPN/VXLAN GPU Backend Implementation – Forwarding Plane
Each VTEP (VXLAN Tunnel Endpoint) is responsible for encapsulating and de-encapsulating traffic as it enters and exits the VXLAN fabric. In the context of a GPU multitenancy design, these VTEPs are located at the leaf nodes of the network, where tenant workloads (including GPU-accelerated compute instances) are hosted. Each VTEP maps tenant-specific traffic into the appropriate VXLAN segment, maintaining isolation and enabling east-west communication across the fabric.
RoCEv2 Traffic Encapsulation
RoCEv2 (RDMA over Converged Ethernet version 2) traffic can be transported across an Ethernet-based IP network using VXLAN encapsulation, which allows RDMA workloads to operate across Layer 3 boundaries while preserving performance and scalability. In this model, the original RDMA payload is encapsulated inside a VXLAN packet, which is further wrapped in standard UDP and IP headers, enabling transport across IP-based fabrics.
The encapsulation begins with the original RoCEv2 payload, which consists of InfiniBand headers and data. This is encapsulated in a VXLAN header, where the VXLAN Network Identifier (VNI) uniquely identifies the Layer 2 segment associated with the RDMA flow. A UDP header (with a destination port typically set to 4789) precedes the VXLAN header, allowing the packet to traverse standard IP networks without requiring special handling.
The outer IP header carries the source and destination IP addresses of the VTEPs (VXLAN Tunnel Endpoints), and the outer MAC header ensures correct delivery across the Ethernet fabric. Importantly, the outer and inner IP headers are independent; each can be either IPv4 or IPv6, and they do not need to match. For example, it is entirely valid to encapsulate an IPv6-based RoCEv2 flow within an IPv4 VXLAN tunnel, or vice versa, depending on the underlay and overlay configurations. All testing related to this JVD was completed using RoCEv2 over IPv6.
Figure 43: RDMA Encapsulation over IPv4/IPv6
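The header stack described above can be sketched with a short Python fragment that assembles the UDP and VXLAN headers around an opaque inner frame. This is a minimal illustration only: the VNI, port, and payload values are assumptions, and the outer IP and Ethernet headers (which the VTEP would prepend, in either IPv4 or IPv6 form) are omitted.

```python
import struct

VXLAN_PORT = 4789  # IANA-assigned VXLAN UDP destination port

def vxlan_header(vni: int) -> bytes:
    """8-byte VXLAN header: flags word with the VNI-valid (I) bit set,
    then the 24-bit VNI shifted into the upper bits of the second word."""
    return struct.pack("!II", 0x08 << 24, vni << 8)

def udp_header(src_port: int, payload_len: int) -> bytes:
    """Outer UDP header; a zero checksum is permitted for VXLAN over IPv4."""
    return struct.pack("!HHHH", src_port, VXLAN_PORT, 8 + payload_len, 0)

def encapsulate(inner_frame: bytes, vni: int, entropy_port: int) -> bytes:
    """Wrap an inner frame (e.g. a RoCEv2 packet) in UDP/VXLAN.
    The outer IP header is independent of the inner one, so an IPv6
    RoCEv2 flow can ride inside an IPv4 tunnel or vice versa."""
    vxlan = vxlan_header(vni)
    return udp_header(entropy_port, len(vxlan) + len(inner_frame)) + vxlan + inner_frame

inner = b"\x00" * 64  # stand-in for the original RoCEv2 frame
packet = encapsulate(inner, vni=1, entropy_port=54321)
```

The 8-byte fixed overhead of UDP plus the 8-byte VXLAN header (before the outer IP and Ethernet headers) is visible in the resulting buffer length.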
As described in the example in the previous section, the server-facing interfaces on the leaf nodes are configured as Layer 3 routed interfaces and are mapped into a tenant-specific IP-VRF. Tenant A has been assigned the first GPU on servers 1 through 4 (namely, GPU1 to GPU4). The interfaces connecting these GPUs are associated with an IP-VRF named Tenant A.
GPU1 and GPU2 (on servers 1 and 2) are connected to the same leaf node (Stripe 1, Leaf 1) and are mapped to the Tenant A VRF. Likewise, GPU3 and GPU4 (on servers 3 and 4) are connected to a different leaf node (Stripe 2, Leaf 1) and are also mapped to the same VRF. Communication between GPUs connected to the same leaf node occurs locally, while traffic between GPUs on different leaf nodes is routed across the fabric using the outer IP header added during VXLAN encapsulation, as described earlier.
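As a sketch of this interface-to-VRF mapping, a Junos-style configuration for the tenant IP-VRF with EVPN Type 5 routes might look like the following. All names, the route distinguisher, the route target, and the VNI value are illustrative assumptions, not the validated configuration:

```
set routing-instances TENANT-1_VRF instance-type vrf
set routing-instances TENANT-1_VRF interface et-0/0/12.0
set routing-instances TENANT-1_VRF interface et-0/0/13.0
set routing-instances TENANT-1_VRF route-distinguisher 10.0.0.11:1
set routing-instances TENANT-1_VRF vrf-target target:65000:1
set routing-instances TENANT-1_VRF protocols evpn ip-prefix-routes advertise direct-nexthop
set routing-instances TENANT-1_VRF protocols evpn ip-prefix-routes encapsulation vxlan
set routing-instances TENANT-1_VRF protocols evpn ip-prefix-routes vni 1
```

The `ip-prefix-routes` stanza is what turns routes for the directly connected GPU subnets into EVPN Type 5 advertisements carried over the VXLAN tunnel identified by the tenant VNI.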
RoCEv2 Traffic Flows Across the Fabric
Consider the example in Figure 44, which shows RoCEv2 traffic flows between four GPUs on four different servers that are assigned to Tenant-1.
Figure 44: RoCEv2 Traffic Flow Across the Fabric
Traffic between Server H100-01 GPU0 (Tenant’s GPU1) and Server H100-02 GPU0 (Tenant’s GPU2) is Switched Locally at the Leaf Node
- Traffic Origination:
GPU 0 on Server 1 initiates a RoCEv2 RDMA WRITE targeting GPU 0 on Server 2.
RoCEv2 packets are encapsulated in UDP over IP like any other IP traffic.
The source and destination IP addresses are the autoconfigured IPv6 addresses associated with each GPU (FC00:1:1:1:a288:c2ff:fe3b:55d6 and FC00:1:1:2:5aa2:e1ff:fe46:c6ca), while the source and destination MAC addresses correspond to the MAC address of the NIC associated with GPU 0 on Server 1 and the MAC address of interface et-0/0/12.0 on Stripe 1 Leaf 1.
- Leaf Forwarding/Delivery to Tenant’s GPU 2:
The leaf node strips off the L2 header, performs a route lookup in the tenant-specific routing table (TENANT-1_VRF), and re-encapsulates the packet with a new L2 header whose source and destination MAC addresses correspond to the leaf’s MAC address on interface et-0/0/13.0 and the MAC address of the NIC associated with GPU 0 on Server 2.
Note: Traffic between Server 3 and Server 4 (same tenant, same leaf) is handled in the same way.
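The local switching path above can be modeled as a small Python sketch: the tenant VRF maps destination prefixes to an egress interface and next-hop MAC, and the leaf rewrites the L2 header accordingly. All names, prefixes, and MAC addresses here are hypothetical placeholders, not values from the validated design.

```python
# Illustrative tenant VRF state: destination prefix -> (egress interface,
# NIC MAC of the destination GPU). Placeholder values only.
TENANT1_VRF = {
    "fc00:1:1:2::/64": ("et-0/0/13.0", "aa:bb:cc:00:00:02"),  # GPU 2 subnet
}
# The leaf's own MAC per server-facing interface (placeholder values).
LEAF_MAC = {"et-0/0/13.0": "aa:bb:cc:ff:00:13"}

def route_locally(dst_prefix: str, inner_packet: bytes):
    """Model of local routed forwarding: the arriving L2 header is already
    stripped; look up the destination in the tenant VRF and build the new
    L2 header (leaf egress MAC as source, GPU NIC MAC as destination)."""
    ifname, nic_mac = TENANT1_VRF[dst_prefix]
    new_l2 = (LEAF_MAC[ifname], nic_mac)  # (src MAC, dst MAC)
    return ifname, new_l2, inner_packet

egress, l2, pkt = route_locally("fc00:1:1:2::/64", b"roce-payload")
```

Because both GPUs hang off the same leaf, no VXLAN encapsulation is involved: only a VRF lookup and an L2 rewrite.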
Traffic Between Server H100-01 GPU0 (Tenant’s GPU1) and Server H100-03 GPU0 (Tenant’s GPU3) has to be Encapsulated in VXLAN and Forwarded Across the Fabric
- Traffic Origination:
GPU 0 on Server 1 initiates a RoCEv2 RDMA WRITE targeting GPU 0 on Server 3.
RoCEv2 packets are encapsulated in UDP over IP like any other IP traffic.
The source and destination IP addresses are the autoconfigured IPv6 addresses associated with each GPU (FC00:1:1:1:a288:c2ff:fe3b:55d6 and FC00:1:1:3:966d:aeff:fef5:9c5c), while the source and destination MAC addresses correspond to the MAC address of the NIC associated with GPU 0 on Server 1 and the MAC address of interface et-0/0/12.0 on Stripe 1 Leaf 1.
- Source Leaf Forwarding:
Stripe 1 Leaf 1 strips off the L2 header and performs a route lookup in the tenant-specific routing table (TENANT-1_VRF). The route to the destination, which was installed in this VRF via an EVPN Type 5 route advertisement, points to the loopback interface of Stripe 2 Leaf 1 and indicates that the traffic must be encapsulated using VXLAN.
The leaf re-encapsulates the packet in VXLAN, using a tenant-specific VNI and the remote VTEP’s router MAC address received with the EVPN Type 5 route. An additional IP and UDP header (the outer header) and a new L2 header are added to the packet. The outer source and destination IP addresses are the loopback interface addresses of Stripe 1 Leaf 1 and Stripe 2 Leaf 1, and the outer source and destination MAC addresses are those of Stripe 1 Leaf 1 and Spine 1.
Note: Spine 1 is used here as an example. Traffic is load-balanced across all leaf–spine links in the fabric, as reviewed later.
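The source-leaf decision can be sketched in Python: the EVPN Type 5 route resolves the tenant prefix to a remote VTEP loopback and VNI, and the VTEP derives the outer UDP source port from a hash of the inner flow so that spines can ECMP-balance the tunnel. The forwarding state and the CRC32 hash below are illustrative assumptions; the hash actually used by the hardware is implementation-specific.

```python
import zlib

# Illustrative EVPN Type 5 forwarding state in TENANT-1_VRF:
# tenant prefix -> (remote VTEP loopback, tenant VNI). Placeholder values.
TENANT1_VRF_T5 = {
    "fc00:1:1:3::/64": ("10.0.0.21", 1),  # Stripe 2 Leaf 1 loopback, VNI 1
}

def vxlan_next_hop(dst_prefix: str, inner_flow: tuple):
    """Resolve the EVPN Type 5 route and pick an outer UDP source port
    hashed from the inner flow tuple, giving the spines per-flow ECMP
    entropy while keeping packets of one flow on one path."""
    remote_vtep, vni = TENANT1_VRF_T5[dst_prefix]
    src_port = 49152 + (zlib.crc32(repr(inner_flow).encode()) % 16384)
    return remote_vtep, vni, src_port

# Inner flow: source IP, destination IP, RoCEv2 UDP destination port (4791).
vtep, vni, sport = vxlan_next_hop(
    "fc00:1:1:3::/64",
    ("fc00:1:1:1::1", "fc00:1:1:3::1", 4791),
)
```

Keeping the source port a deterministic function of the inner flow is what lets the fabric spread different RDMA flows across all leaf–spine links without reordering packets within a flow.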
- Spine (Intermediate) Forwarding:
Spine 1 is not aware of the VXLAN encapsulation and simply routes the packets based on the outer IP header, in this case toward Stripe 2 Leaf 1. The outer source and destination IP addresses are not modified.
- Destination Leaf Forwarding/Delivery to Tenant’s GPU 3:
Stripe 2 Leaf 1 receives the VXLAN packet and decapsulates it. It extracts the VNI from the VXLAN header to determine the proper routing table for the arriving packet. Since VNI 1 is mapped to TENANT-1_VRF, the leaf performs a route lookup in the corresponding table, which indicates that the destination is directly connected on interface et-0/0/12.0.
The leaf node applies a new L2 header with source and destination MAC addresses corresponding to the leaf and the NIC connected to the tenant’s GPU, and forwards the packet to the destination GPU.
Note: The forwarding process works the same way when the server NICs are configured with IPv4 addresses.
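The destination-leaf decapsulation steps can be sketched as follows: parse the VXLAN header out of the received UDP payload, check the VNI-valid flag, and use the VNI to select the tenant routing table. The VNI-to-VRF mapping is an illustrative assumption matching the example above.

```python
import struct

# Illustrative mapping maintained on the leaf: VNI -> tenant routing table.
VNI_TO_VRF = {1: "TENANT-1_VRF"}

def decapsulate(vxlan_pdu: bytes):
    """Split a received UDP payload into the 8-byte VXLAN header and the
    inner frame, and pick the tenant VRF from the VNI (sketch only)."""
    flags_word, vni_word = struct.unpack("!II", vxlan_pdu[:8])
    assert flags_word & 0x08000000, "VNI-valid (I) flag must be set"
    vni = vni_word >> 8  # 24-bit VNI sits above 8 reserved bits
    return VNI_TO_VRF[vni], vxlan_pdu[8:]

# Build a sample PDU (VNI 1) and decapsulate it.
pdu = struct.pack("!II", 0x08000000, 1 << 8) + b"inner-frame"
vrf, inner_frame = decapsulate(pdu)
```

After this step the leaf is back to plain routed forwarding in the selected VRF, exactly as in the locally switched case.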