Type 5 EVPN-VXLAN Implementation – Forwarding Plane
VTEPs—VXLAN Tunnel Endpoints—are responsible for encapsulating and de-encapsulating traffic as it enters and exits the VXLAN fabric. In the context of a GPU multitenancy design, these VTEPs are located at the leaf nodes of the network, where tenant workloads (including GPU-accelerated compute instances) are hosted. Each VTEP maps tenant-specific traffic into the appropriate VXLAN segment, maintaining isolation and enabling east-west communication across the fabric.
RoCEv2 Traffic Encapsulation
RoCEv2 (RDMA over Converged Ethernet version 2) traffic can be transported across an Ethernet-based IP network using VXLAN encapsulation, which allows RDMA workloads to operate across Layer 3 boundaries while preserving performance and scalability. In this model, the original RDMA payload is encapsulated inside a VXLAN packet, which is further wrapped in standard UDP and IP headers, enabling transport across IP-based fabrics.
The encapsulation begins with the original RoCEv2 payload, which consists of InfiniBand headers and data. This is encapsulated in a VXLAN header, where the VXLAN Network Identifier (VNI) uniquely identifies the Layer 2 segment associated with the RDMA flow. The VXLAN header is preceded by a UDP header (with the destination port typically set to 4789), allowing the packet to traverse standard IP networks without requiring special handling.
The outer IP header carries the source and destination IP addresses of the VTEPs (VXLAN Tunnel Endpoints), and the outer MAC header ensures correct delivery across the Ethernet fabric. Importantly, the outer and inner IP headers are independent; each can be either IPv4 or IPv6, and they do not need to match. For example, it is entirely valid to encapsulate an IPv6-based RoCEv2 flow within an IPv4 VXLAN tunnel—or vice versa—depending on the underlay and overlay configurations.
Figure 43: RDMA encapsulation over IPv4/IPv6.
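To make the layering concrete, the following minimal Python sketch builds the 8-byte VXLAN header and lists the header stack of a RoCEv2 flow carried over VXLAN. The VNI value and layer descriptions are illustrative placeholders; the header order and the UDP ports (4789 for VXLAN, as noted above, and 4791, the registered RoCEv2 destination port) are the only specifics carried over from the encapsulation described here.

```python
import struct

VXLAN_PORT = 4789   # outer UDP destination port for VXLAN
ROCEV2_PORT = 4791  # inner UDP destination port for RoCEv2

def vxlan_header(vni: int) -> bytes:
    """Build an 8-byte VXLAN header: flags byte with the I-bit set, then a 24-bit VNI."""
    flags = 0x08 << 24          # I flag = VNI is valid; reserved bits zero
    return struct.pack("!II", flags, vni << 8)

# Conceptual header stack for a RoCEv2 flow carried over VXLAN.
# The outer and inner IP layers are independent (each can be IPv4 or IPv6).
packet_layers = [
    "outer Ethernet (next-hop MACs, rewritten hop by hop)",
    "outer IP (source VTEP -> destination VTEP, v4 or v6)",
    f"outer UDP (dst port {VXLAN_PORT})",
    f"VXLAN header: {vxlan_header(1).hex()} (VNI 1 -> tenant segment)",
    "inner Ethernet",
    "inner IP (GPU NIC -> GPU NIC, v4 or v6)",
    f"inner UDP (dst port {ROCEV2_PORT})",
    "InfiniBand BTH + RDMA payload",
]

for layer in packet_layers:
    print(layer)
```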
As described in the example in the previous section, the server-facing interfaces on the leaf nodes are configured as Layer 3 routed interfaces and are mapped into a tenant-specific IP-VRF. Tenant A has been assigned the first GPU on each of servers 1 through 4 (referred to here as the tenant's GPU1 through GPU4). The interfaces connecting these GPUs are associated with Tenant A's IP-VRF (TENANT-A_VRF).
GPU1 and GPU2 (on servers 1 and 2) are connected to the same leaf node (Stripe 1, Leaf 1) and are mapped to the Tenant A VRF. Likewise, GPU3 and GPU4 (on servers 3 and 4) are connected to a different leaf node (Stripe 2, Leaf 1) and are also mapped to the same VRF. These server-to-leaf links may be configured using either IPv4 or IPv6 addressing, depending on the tenant or infrastructure requirements. Communication between GPUs connected to the same leaf node occurs locally, while traffic between GPUs on different leaf nodes is routed across the fabric using the outer IP header added during VXLAN encapsulation, as described earlier.
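As an illustration of this mapping, here is a small Python model, assuming the leaf and VRF names used in this example. It is not device configuration, only a summary of which traffic stays local to a leaf and which crosses the fabric.

```python
# Illustrative model of the tenant mapping described above (not device config):
# each tenant GPU attaches to a leaf routed interface placed in the tenant's
# IP-VRF; traffic stays local when both GPUs share a leaf, otherwise it is
# routed across the fabric inside a VXLAN tunnel.
ATTACHMENTS = {
    # tenant GPU        (leaf,            tenant IP-VRF)
    "TenantA-GPU1": ("Stripe1-Leaf1", "TENANT-A_VRF"),   # Server 1
    "TenantA-GPU2": ("Stripe1-Leaf1", "TENANT-A_VRF"),   # Server 2
    "TenantA-GPU3": ("Stripe2-Leaf1", "TENANT-A_VRF"),   # Server 3
    "TenantA-GPU4": ("Stripe2-Leaf1", "TENANT-A_VRF"),   # Server 4
}

def forwarding_path(src_gpu: str, dst_gpu: str) -> str:
    src_leaf, src_vrf = ATTACHMENTS[src_gpu]
    dst_leaf, dst_vrf = ATTACHMENTS[dst_gpu]
    if src_vrf != dst_vrf:
        return "blocked: different tenants (no route between IP-VRFs)"
    if src_leaf == dst_leaf:
        return f"local routing on {src_leaf} within {src_vrf}"
    return f"VXLAN-routed across the fabric: {src_leaf} -> {dst_leaf} ({src_vrf})"

print(forwarding_path("TenantA-GPU1", "TenantA-GPU2"))  # same leaf: stays local
print(forwarding_path("TenantA-GPU1", "TenantA-GPU3"))  # different leaves: crosses the fabric
```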
Table 12. Solution options summary
Option | GPU server to leaf node links | Leaf to spine node links | Leaf and spine node loopback interface addresses | Underlay BGP sessions | Overlay BGP sessions |
---|---|---|---|---|---|
IPv4 underlay and IPv4 overlay | Statically configured IPv4 addresses | Statically configured IPv4 addresses | Statically configured IPv4 addresses | Statically configured IPv4 neighbors | Statically configured IPv4 neighbors |
IPv6 underlay and IPv6 overlay | Statically configured IPv4 addresses OR statically configured IPv6 addresses * | Statically configured IPv6 addresses | Statically configured IPv6 addresses | Statically configured IPv6 neighbors | Statically configured IPv6 neighbors |
IPv6 Link-Local (RFC 5549) underlay and IPv4 overlay | Statically configured IPv4 addresses OR statically configured IPv6 addresses * | Automatically assigned IPv6 link-local addresses | Statically configured IPv4 addresses | Automatically discovered IPv6 neighbors | Statically configured IPv4 neighbors |
IPv6 Link-Local (RFC 5549) underlay and IPv6 overlay (RECOMMENDED) | Statically configured IPv4 addresses OR statically configured IPv6 addresses * | Automatically assigned IPv6 link-local addresses | Statically configured IPv6 addresses | Automatically discovered IPv6 neighbors | Statically configured IPv6 neighbors |

* Can be dynamically assigned using SLAAC (Stateless Address Autoconfiguration).
RoCEv2 Traffic Flows Across the Fabric Example
Figure 44 shows an example of how RoCEv2 traffic flows in an IPv6 overlay/underlay fabric, with IPv6 configured between the GPU server NICs and the leaf nodes.
Figure 44: RoCEv2 traffic flow across the fabric
Here is a detailed description of the traffic flow between GPUs.
GPU 1, Server 1 (Tenant A's GPU1) – FC00:1:1::0 <=> GPU 1, Server 2 (Tenant A's GPU2) – FC00:1:1::16
- Traffic Origination:
GPU 1 on Server 1 initiates a RoCEv2 RDMA write targeting GPU 1 on Server 2 (Tenant A's GPU2).
RoCEv2 packets are encapsulated in UDP over IP, meaning they are treated as standard IP traffic.
The source and destination IP addresses are the point-to-point addresses assigned to the NICs associated with each GPU (FC00:1:1::0 and FC00:1:1::16), while the source and destination MAC addresses are the MAC address of the NIC associated with GPU 1 on Server 1 and the MAC address of interface et-0/0/12.0 on Stripe1-Leaf1.
- Leaf Forwarding:
The leaf node strips off the L2 header, performs a route lookup in the tenant-specific routing table (TENANT-A_VRF), and re-encapsulates the packet with a new L2 header whose source and destination MAC addresses are the leaf's MAC address on interface et-0/0/16.0 and the MAC address of the NIC associated with GPU 1 on Server 2 (a sketch of this lookup-and-rewrite step follows this flow).
The leaf node then forwards the packet out of interface et-0/0/16.0.
Note: Traffic between Server 3 and Server 4 (same tenant, same leaf) is handled in the same way.
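The same-leaf forwarding just described can be sketched in a few lines of Python. The interface names, addresses, and VRF name come from the example above; the prefix lengths and MAC strings are placeholders assumed purely for illustration.

```python
import ipaddress

# Same-leaf forwarding on Stripe1-Leaf1: strip the old L2 header, look up the
# destination in the tenant VRF, rebuild the L2 header toward the connected NIC.
# Prefix lengths and MAC strings below are illustrative placeholders.
TENANT_A_VRF = {
    ipaddress.ip_network("fc00:1:1::0/127"):  ("et-0/0/12.0", "mac:gpu1-server1-nic"),
    ipaddress.ip_network("fc00:1:1::16/127"): ("et-0/0/16.0", "mac:gpu1-server2-nic"),
}

LEAF_MAC = "mac:stripe1-leaf1"  # new source MAC on the rewritten L2 header

def forward_local(dst_ip: str) -> dict:
    """Return the egress interface and rewritten L2 header for a local destination."""
    dst = ipaddress.ip_address(dst_ip)
    for prefix, (egress_if, nic_mac) in TENANT_A_VRF.items():
        if dst in prefix:
            return {"egress": egress_if, "src_mac": LEAF_MAC, "dst_mac": nic_mac}
    raise LookupError(f"no route for {dst_ip} in TENANT-A_VRF")

# GPU 1 on Server 1 -> GPU 1 on Server 2: the packet never leaves the leaf.
print(forward_local("fc00:1:1::16"))
```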
GPU 1, Server 1 (Tenant A's GPU1) – FC00:1:1::0 <=> GPU 1, Server 3 (Tenant A's GPU3) – FC00:1:1::32
- Traffic Origination:
GPU 1 on Server 1 initiates a RoCEv2 RDMA operation targeting GPU 1 on Server 3 (Tenant A's GPU3).
RoCEv2 packets are again encapsulated in UDP over IP, so this is standard IP traffic.
The source and destination IP addresses are the point-to-point addresses assigned to the NICs associated with the tenant's GPUs (FC00:1:1::0 and FC00:1:1::32 in this case), while the source and destination MAC addresses are the MAC address of the NIC associated with GPU 1 on Server 1 and the MAC address of interface et-0/0/12.0 on Stripe1-Leaf1, as before.
- Source Leaf Forwarding:
Stripe1-Leaf1 strips off the L2 header and performs a route lookup in the tenant-specific routing table (TENANT-A_VRF), which in this case points across the fabric. This route to the destination server's interface was installed in the VRF via an EVPN Type 5 route advertisement.
The leaf re-encapsulates the packet in VXLAN, using the tenant-specific VNI and the remote VTEP's router MAC received with the EVPN Type 5 route. An outer IP and UDP header is added, with source and destination IP addresses set to the VTEP loopback addresses of Stripe1-Leaf1 and Stripe2-Leaf1, along with a new L2 header whose source and destination MAC addresses are the leaf's and the spine's MAC addresses. The packet is then forwarded out of interface et-0/0/0:0 towards Spine 1 (see the encapsulation/decapsulation sketch after this flow).
Note: Spine 1 is used here as an example. Traffic is load-balanced across all leaf–spine links in the fabric, as reviewed later.
- Spine (Intermediate) Forwarding:
Spines are not aware of the VXLAN encapsulation and simply route the packets based on the outer IP header, in this case towards Stripe2-Leaf1. The outer source and destination IP addresses are not modified.
- Destination Leaf Forwarding:
Stripe2-Leaf1 receives the VXLAN packet and decapsulates it. It extracts the VNI from the VXLAN header to determine the proper routing table for the arriving packet. Since VNI 1 is mapped to TENANT-A_VRF, the leaf performs a route lookup in the corresponding table, which indicates that the destination is directly connected on interface et-0/0/12.0.
The leaf applies a new L2 header whose source and destination MAC addresses are the leaf's MAC address and the MAC address of the NIC connected to the tenant's GPU, and forwards the packet to the destination GPU.
- Delivery to the Destination GPU:
The packet is delivered to Server 3 and routed internally to GPU 1 (Tenant A's GPU3).
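The inter-leaf flow above can be summarized with the sketch below: VXLAN encapsulation at the source leaf, routing on the outer header only at the spine, and decapsulation plus VNI-to-VRF selection at the destination leaf. The VTEP loopback addresses are assumed placeholders; the VNI-to-VRF mapping and the UDP port follow the example.

```python
# Illustrative end-to-end model of the inter-leaf flow (not a device implementation).
VNI_TO_VRF = {1: "TENANT-A_VRF"}  # VNI 1 maps to Tenant A's IP-VRF

def encapsulate(inner_packet: dict, vni: int, src_vtep: str, dst_vtep: str) -> dict:
    """Source leaf: wrap the routed tenant packet in VXLAN/UDP/IP toward the remote VTEP."""
    return {
        "outer_src_ip": src_vtep,      # Stripe1-Leaf1 loopback (VTEP)
        "outer_dst_ip": dst_vtep,      # Stripe2-Leaf1 loopback (VTEP), learned via EVPN Type 5
        "outer_udp_dst": 4789,
        "vni": vni,
        "inner": inner_packet,
    }

def spine_forward(vxlan_packet: dict) -> str:
    """Spine: routes on the outer IP header only; the VXLAN payload is opaque to it."""
    return f"route toward {vxlan_packet['outer_dst_ip']} (outer header unchanged)"

def decapsulate(vxlan_packet: dict) -> tuple:
    """Destination leaf: strip the outer headers and pick the tenant VRF from the VNI."""
    vrf = VNI_TO_VRF[vxlan_packet["vni"]]
    return vrf, vxlan_packet["inner"]   # followed by a normal route lookup in that VRF

inner = {"src_ip": "fc00:1:1::0", "dst_ip": "fc00:1:1::32"}            # GPU1 -> GPU3
vxlan = encapsulate(inner, vni=1, src_vtep="2001:db8::1", dst_vtep="2001:db8::2")
print(spine_forward(vxlan))
print(decapsulate(vxlan))
```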
The forwarding process works the same way when the server NICs are configured with IPv4 addresses.