
Appendix A – IPv4 Overlay Over IPv6 Underlay Fabric Implementation

When the underlay BGP sessions use IPv6 and peer auto-discovery, and the overlay is IPv4, the overlay BGP sessions must be configured to advertise IPv4 routes with IPv6 next-hops, as described in RFC 5549 (Advertising IPv4 Network Layer Reachability Information with an IPv6 Next Hop).

Consider the example depicted in the figure below.

Figure: IPv6 Link-Local Underlay and IPv4 Overlay Example


IPv4 GPU Server NICs to Leaf Nodes Connections

The links between the GPU interfaces and the leaf nodes are statically configured with /31 IPv4 addresses, as shown in the table below. No router advertisements are sent by the leaf nodes, and SLAAC is not used in this case. All the IPv4 addresses in the example are subnets of 10.200.0.0/16 (with 10.200.0.0/24 assigned to the links between the GPU servers and the leaf nodes in stripe 1, and 10.200.1.0/24 assigned to the links between the GPU servers and the leaf nodes in stripe 2).

LEAF NODE INTERFACE LEAF NODE IPv4 ADDRESS GPU NIC GPU NIC IPv4 ADDRESS

Stripe 1 Leaf 1 - et-0/0/0:0 10.200.0.0/31 Server 1 - gpu0_eth 10.200.0.1/31
Stripe 1 Leaf 2 - et-0/0/0:0 10.200.0.2/31 Server 1 - gpu1_eth 10.200.0.3/31
Stripe 1 Leaf 3 - et-0/0/0:0 10.200.0.4/31 Server 1 - gpu2_eth 10.200.0.5/31
Stripe 1 Leaf 4 - et-0/0/0:0 10.200.0.6/31 Server 1 - gpu3_eth 10.200.0.7/31
Stripe 1 Leaf 5 - et-0/0/0:0 10.200.0.8/31 Server 1 - gpu4_eth 10.200.0.9/31
Stripe 1 Leaf 6 - et-0/0/0:0 10.200.0.10/31 Server 1 - gpu5_eth 10.200.0.11/31
Stripe 1 Leaf 7 - et-0/0/0:0 10.200.0.12/31 Server 1 - gpu6_eth 10.200.0.13/31
Stripe 1 Leaf 8 - et-0/0/0:0 10.200.0.14/31 Server 1 - gpu7_eth 10.200.0.15/31
Stripe 1 Leaf 1 - et-0/0/1:0 10.200.0.16/31 Server 2 - gpu0_eth 10.200.0.17/31
Stripe 1 Leaf 2 - et-0/0/1:0 10.200.0.18/31 Server 2 - gpu1_eth 10.200.0.19/31
Stripe 1 Leaf 3 - et-0/0/1:0 10.200.0.20/31 Server 2 - gpu2_eth 10.200.0.21/31
Stripe 1 Leaf 4 - et-0/0/1:0 10.200.0.22/31 Server 2 - gpu3_eth 10.200.0.23/31
Stripe 1 Leaf 5 - et-0/0/1:0 10.200.0.24/31 Server 2 - gpu4_eth 10.200.0.25/31
Stripe 1 Leaf 6 - et-0/0/1:0 10.200.0.26/31 Server 2 - gpu5_eth 10.200.0.27/31
Stripe 1 Leaf 7 - et-0/0/1:0 10.200.0.28/31 Server 2 - gpu6_eth 10.200.0.29/31
Stripe 1 Leaf 8 - et-0/0/1:0 10.200.0.30/31 Server 2 - gpu7_eth 10.200.0.31/31
Stripe 1 Leaf 1 - et-0/0/2:0 10.200.0.32/31 Server 3 - gpu0_eth 10.200.0.33/31
Stripe 1 Leaf 2 - et-0/0/2:0 10.200.0.34/31 Server 3 - gpu1_eth 10.200.0.35/31
Stripe 1 Leaf 3 - et-0/0/2:0 10.200.0.36/31 Server 3 - gpu2_eth 10.200.0.37/31
Stripe 1 Leaf 4 - et-0/0/2:0 10.200.0.38/31 Server 3 - gpu3_eth 10.200.0.39/31
Stripe 1 Leaf 5 - et-0/0/2:0 10.200.0.40/31 Server 3 - gpu4_eth 10.200.0.41/31
Stripe 1 Leaf 6 - et-0/0/2:0 10.200.0.42/31 Server 3 - gpu5_eth 10.200.0.43/31
Stripe 1 Leaf 7 - et-0/0/2:0 10.200.0.44/31 Server 3 - gpu6_eth 10.200.0.45/31
Stripe 1 Leaf 8 - et-0/0/2:0 10.200.0.46/31 Server 3 - gpu7_eth 10.200.0.47/31
       
Stripe 2 Leaf 1 - et-0/0/0:0 10.200.1.0/31 Server 9 - gpu0_eth 10.200.1.1/31
Stripe 2 Leaf 2 - et-0/0/0:0 10.200.1.2/31 Server 9 - gpu1_eth 10.200.1.3/31
Stripe 2 Leaf 3 - et-0/0/0:0 10.200.1.4/31 Server 9 - gpu2_eth 10.200.1.5/31
Stripe 2 Leaf 4 - et-0/0/0:0 10.200.1.6/31 Server 9 - gpu3_eth 10.200.1.7/31
Stripe 2 Leaf 5 - et-0/0/0:0 10.200.1.8/31 Server 9 - gpu4_eth 10.200.1.9/31
Stripe 2 Leaf 6 - et-0/0/0:0 10.200.1.10/31 Server 9 - gpu5_eth 10.200.1.11/31
Stripe 2 Leaf 7 - et-0/0/0:0 10.200.1.12/31 Server 9 - gpu6_eth 10.200.1.13/31

The following example shows the configuration of the interfaces on the leaf node. Only family inet (IPv4) is enabled, with a static /31 IPv4 address.
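For instance, on Stripe 1 Leaf 1 the port facing Server 1's gpu0_eth might look like the following minimal sketch, using the first address pair from the table above:

interfaces {
    et-0/0/0:0 {
        unit 0 {
            family inet {
                address 10.200.0.0/31;
            }
        }
    }
}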

The following example shows the configuration of the interfaces on the server side. Only IPv4 is enabled, with a static /31 IPv4 address.

The netplan configuration disables dhcp4 and configures a static IPv4 address on each of the gpu_eth interfaces. It also configures, for each gpu_eth interface, a static route for prefix 10.200.0.0/16 pointing to the address of the leaf node. Each route also specifies the address of the local interface as its source address, which guarantees that the correct interface is used when sending traffic from a gpu_eth interface to a remote address belonging to the same tenant.


Netplan Example
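A minimal netplan sketch for gpu0_eth on Server 1, following the description above (the MTU value is an assumption; the remaining gpu_eth interfaces follow the same pattern with their own addresses):

network:
  version: 2
  ethernets:
    gpu0_eth:
      dhcp4: false
      mtu: 9000
      addresses:
        - 10.200.0.1/31
      routes:
        - to: 10.200.0.0/16   # tenant-wide prefix
          via: 10.200.0.0     # leaf node address on this /31
          from: 10.200.0.1    # pins the source address, and hence the egress interface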

Refer to the following documentation for details on configuring the interfaces on AMD GPU servers and NVIDIA GPU servers, respectively:

All leaf and spine nodes are configured with IPv4 addresses under the loopback interface (lo0.0). The loopback addresses and autonomous system numbers for all devices in the fabric are included in Table 23:

Table 23. Spine and Leaf Loopback Addresses and ASNs

NODE lo0.0 IPv4 ADDRESS LOCAL AS #
Stripe 1 Leaf 1 10.0.1.1/32 201
Stripe 1 Leaf 2 10.0.1.2/32 202
Stripe 1 Leaf 3 10.0.1.3/32 203
Stripe 1 Leaf 4 10.0.1.4/32 204
Stripe 1 Leaf 5 10.0.1.5/32 205
Stripe 1 Leaf 6 10.0.1.6/32 206
Stripe 1 Leaf 7 10.0.1.7/32 207
Stripe 1 Leaf 8 10.0.1.8/32 208
Stripe 2 Leaf 1 10.0.1.9/32 209
Stripe 2 Leaf 2 10.0.1.10/32 210

...
SPINE1   101
SPINE2   102
SPINE3   103
SPINE4   104

IPv6 Leaf Nodes to Spine Nodes Connections Using Link-Local Addresses

When deploying the underlay using IPv6 link-local addressing, the interfaces between the leaf and spine nodes do not require explicitly configured IP addresses. They are configured as untagged interfaces with only family inet6 enabled, which allows them to process IPv6 traffic, as shown in Figure 50.

Figure 50: Leaf nodes to spine nodes connectivity


Table 24. Spine to Leaf Interface Configuration Example

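For instance, a leaf uplink might look like the following minimal sketch (interface name from Table 25; no IP address is configured, and only family inet6 is enabled):

interfaces {
    et-0/0/30:0 {
        unit 0 {
            family inet6;
        }
    }
}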

Enabling IPv6 on an interface automatically assigns a link-local IPv6 address. The switch autogenerates link-local addresses for the interfaces using the EUI-64 format (based on the interface's MAC address), as shown in Table 25.
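For example, the first entry in Table 25 corresponds to a leaf interface whose MAC address is 9c:5a:80:c1:ae:00: inserting ff:fe between the third and fourth octets and inverting the universal/local bit (0x02) of the first octet yields the interface identifier 9e5a:80ff:fec1:ae00, and therefore the link-local address fe80::9e5a:80ff:fec1:ae00.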

Table 25. Spine and Leaf IPv6-Enabled Interface Link Local Addresses

LEAF NODE INTERFACE LEAF NODE IPv6 ADDRESS SPINE NODE INTERFACE SPINE IPv6 ADDRESS
Stripe 1 Leaf 1 - et-0/0/30:0 fe80::9e5a:80ff:fec1:ae00/64 Spine 1 – et-0/0/0:0 fe80::9e5a:80ff:feef:a28f/64
Stripe 1 Leaf 1 - et-0/0/31:0 fe80::9e5a:80ff:fec1:ae08/64 Spine 2 – et-0/0/0:0 fe80::5a86:70ff:fe7b:ced5/64
Stripe 1 Leaf 1 - et-0/0/32:0 fe80::9e5a:80ff:fec1:af00/64 Spine 3 – et-0/0/0:0 fe80::5a86:70ff:fe78:e0d5/64
Stripe 1 Leaf 1 - et-0/0/33:0 fe80::9e5a:80ff:fec1:af08/64 Spine 4 – et-0/0/0:0 fe80::5a86:70ff:fe79:3d5/64
Stripe 1 Leaf 2 - et-0/0/30:0 fe80::5a86:70ff:fe79:dad5/64 Spine 1 – et-0/0/1:0 fe80::9e5a:80ff:feef:a297/64
Stripe 1 Leaf 2 - et-0/0/31:0 fe80::5a86:70ff:fe79:dadd/64 Spine 2 – et-0/0/1:0 fe80::5a86:70ff:fe7b:cedd/64
Stripe 1 Leaf 2 - et-0/0/32:0 fe80::5a86:70ff:fe79:dbd5/64 Spine 3 – et-0/0/1:0 fe80::5a86:70ff:fe78:e0dd/64
Stripe 1 Leaf 2 - et-0/0/33:0 fe80::5a86:70ff:fe79:dbdd/64 Spine 4 – et-0/0/1:0 fe80::5a86:70ff:fe79:3dd/64

...

These addresses need to be advertised through standard router advertisements, as part of the IPv6 Neighbor Discovery process, to allow the leaf and spine nodes to establish BGP sessions between them. Router advertisement must be enabled on all the interfaces between the leaf and spine nodes, as shown in Table 26:

Table 26. IPv6 Router Advertisement on Leaf and Spine Interfaces
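A minimal sketch of the corresponding leaf-side configuration (the spines carry the same stanza for their own leaf-facing interfaces):

protocols {
    router-advertisement {
        interface et-0/0/30:0.0;
        interface et-0/0/31:0.0;
        interface et-0/0/32:0.0;
        interface et-0/0/33:0.0;
    }
}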

To verify that router advertisements are being sent, use: show ipv6 router-advertisement interface <interface> and show ipv6 neighbors

Example:

The loopback interface IPv6 addresses and the autonomous system numbers for all devices in the fabric are included in Table 27:

Table 27. Spine and Leaf Loopback Addresses and ASNs

NODE lo0.0 IPv6 ADDRESS LOCAL AS #
Stripe 1 Leaf 1 FC00:10:0:1::1/128 201
Stripe 1 Leaf 2 FC00:10:0:1::2/128 202
Stripe 1 Leaf 3 FC00:10:0:1::3/128 203
Stripe 1 Leaf 4 FC00:10:0:1::4/128 204
Stripe 1 Leaf 5 FC00:10:0:1::5/128 205
Stripe 1 Leaf 6 FC00:10:0:1::6/128 206
Stripe 1 Leaf 7 FC00:10:0:1::7/128 207
Stripe 1 Leaf 8 FC00:10:0:1::8/128 208
Stripe 2 Leaf 1 FC00:10:0:1::9/128 209
Stripe 2 Leaf 2 FC00:10:0:1::10/128 210

...
SPINE1 FC00:10:0::1/128 101
SPINE2 FC00:10:0::2/128 102
SPINE3 FC00:10:0::3/128 103
SPINE4 FC00:10:0::4/128 104

Table 28. Spine and Leaf Loopback Address Configuration
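A minimal sketch for Stripe 1 Leaf 1, using the address and AS number from Table 27 (whether the ASN is set under routing-options, as assumed here, or as a local-as on the BGP groups is deployment-specific):

interfaces {
    lo0 {
        unit 0 {
            family inet6 {
                address fc00:10:0:1::1/128;
            }
        }
    }
}
routing-options {
    autonomous-system 201;
}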

Recommended MTU

Configure the MTU consistently across the fabric, and make sure that the MTU of the server-to-leaf links, plus the extra overhead of the VXLAN encapsulation, does not exceed the MTU of the leaf-to-spine links.

VXLAN Overhead Calculation

For IPv6, the VXLAN encapsulation overhead can be calculated as follows:

Table 29. VXLAN Overhead Calculation

HEADER BYTES
Outer Ethernet 14
Outer IP (IPv6) 40
UDP 8
VXLAN 8
Total 70 bytes
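For example, with a server-facing MTU of 9000 bytes, a VXLAN-encapsulated packet can reach 9000 + 70 = 9070 bytes on the wire, so the leaf-to-spine links must be configured with an MTU of at least 9070 bytes.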

Recommended MTU Strategy

Table 30. Recommended MTU

LINK TYPE MTU
Server ↔ Leaf 9000
Leaf ↔ Spine (IPv6) ≥ 9070

It is important to keep in mind that RoCEv2 message sizes are still limited by the RDMA MTU reported by ibv_devinfo.

Table 31. MTU Types: Ownership and Functional Role

MTU TYPE OWNER PURPOSE
Interface MTU (e.g. 9000; shown by ifconfig or ip) Linux network stack Defines the maximum L3/IP packet size
RDMA MTU (e.g. 4096; shown by ibv_devinfo) RDMA stack Defines the maximum RDMA message size per Work Queue Element (WQE)

The RDMA MTU is configured at the verbs level and negotiated during QP (Queue Pair) setup. You cannot override it by simply setting the NIC's MTU to a higher value; instead, you need to use low-level tools or RDMA applications.

Some performance tools, such as ib_send_bw and ib_write_bw, can set the RDMA MTU via the -m flag. For example:

ib_write_bw -m 1024 # sets RDMA MTU to 1024 bytes

ib_write_bw -m 4096 # sets RDMA MTU to 4096 (max allowed according to the output of ibv_devinfo shown before)

RDMA MTU must be ≤ Interface MTU – encapsulation overhead.

IPv6 GPU Backend Fabric Underlay Using BGP Neighbor Discovery

Refer to Configure BGP Unnumbered EVPN Fabric | Juniper Networks for more information.

The underlay EBGP sessions between the leaf and spine nodes are configured to use peer auto-discovery and to advertise the loopback interface addresses, as shown in the example between Stripe 1 Leaf 1 and Spine 1 below:

Table 32. GPU Backend Fabric: BGP Underlay with Peer Auto-Discovery Configuration
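A minimal sketch of such an underlay group on Stripe 1 Leaf 1, reconstructed from the statements described below (the interface list, BFD timers, and export-policy expression are assumptions):

protocols {
    bgp {
        group l3clos-inet6-auto-underlay {
            type external;
            export ( LEAF_TO_SPINE_FABRIC_OUT && BGP-AOS-Policy );   # assumed policy chain, see below
            family inet {
                unicast {
                    extended-nexthop;    # required for IPv4 routes with IPv6 next-hops (RFC 5549)
                }
            }
            family inet6 {
                unicast;
            }
            multipath {
                multiple-as;
            }
            bfd-liveness-detection {
                minimum-interval 1000;   # assumed timer values
                multiplier 3;
            }
            dynamic-neighbor underlay-dynamic-neighbors {
                peer-auto-discovery {
                    family inet6 {
                        ipv6-nd;         # discover peers via IPv6 Neighbor Discovery
                    }
                    interface et-0/0/30:0.0;   # uplinks to the four spines
                    interface et-0/0/31:0.0;
                    interface et-0/0/32:0.0;
                    interface et-0/0/33:0.0;
                }
            }
            peer-as-list discovered-as-list;
        }
    }
}
policy-options {
    as-list discovered-as-list members 101-104;   # spine AS range
}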

To configure peer auto-discovery, the dynamic-neighbor stanza named underlay-dynamic-neighbors under BGP group l3clos-inet6-auto-underlay specifies the interfaces on which auto-discovery is permitted. This replaces the neighbor a.b.c.d statements that would otherwise statically configure the neighbors.

The family inet unicast and family inet6 unicast statements configure the sessions to advertise both IPv4 and IPv6 routes; IPv4 is required to support the IPv4 overlay. When BGP sessions are established over IPv6 link-local addresses but carry IPv4 routes (IPv4 overlay), the extended-nexthop statement must be configured under family inet unicast. This allows IPv4 prefixes to be advertised with IPv6 next-hops and installed correctly in the routing table, as described in RFC 5549. Omitting extended-nexthop results in hidden routes, because the protocol next-hop cannot be resolved.

The family inet6 ipv6-nd statement enables the use of IPv6 Neighbor Discovery to dynamically determine the addresses of the neighbors with which to establish BGP sessions. To control and secure dynamic peer formation, a peer-as-list (discovered-as-list) is configured, restricting peering to neighbors whose autonomous system numbers fall within the defined range of AS 101-104.

The BGP sessions are also configured with multipath multiple-as, allowing multiple paths (even with different AS paths) to be considered for ECMP (Equal-Cost Multi-Path) routing. BFD (Bidirectional Forwarding Detection) is additionally enabled to accelerate convergence in case of link or neighbor failures.

You can check that the sessions have been established using: show bgp summary group <group-name>

Example:

Notice that when BGP sessions are established using link-local addresses, Junos displays the neighbor address along with the interface scope (e.g., fe80::5a86:70ff:fe78:e0d5%et-0/0/1:0.0). The scope identifier (the part after the %) is necessary because the same link-local address (fe80::/10) could exist on multiple interfaces, and the device must know which interface to use to send packets to that neighbor. Thus, after peer discovery is completed, the show bgp summary output lists the neighbor using the format IPv6_link-local_address%interface-name.

Even though the sessions are established using the IPv6 link-local addresses, the advertised routes are IPv4 and are installed in the inet.0 routing table.

You can check details about discovered neighbors using: show bgp neighbor auto-discovered <peer-id>

Example:

To verify the operation of BFD for the BGP sessions use: show bfd session

Example:

To control the propagation of routes and make sure the loopback interface addresses are advertised, export policies are applied to these EBGP sessions, as shown in the example in Table 33.

Table 33. Export Policy Example: IPv4 Underlay with Auto-Discovery

These policies ensure loopback reachability without advertising unnecessary routes.

On the spine nodes, routes are exported only if they are accepted by both the SPINE_TO_LEAF_FABRIC_OUT and BGP-AOS-Policy export policies.

  • The SPINE_TO_LEAF_FABRIC_OUT policy has no match conditions and accepts all routes unconditionally, tagging them with the FROM_SPINE_FABRIC_TIER community (0:15).
  • The BGP-AOS-Policy accepts BGP-learned routes as well as any routes accepted by the nested AllPodNetworks policy.
  • The AllPodNetworks policy, in turn, matches directly connected IPv6 routes and tags them with the DEFAULT_DIRECT_V6 community (1:20008 and 21001:26000 on Spine1).

As a result, each spine advertises both its directly connected routes (including its loopback interface) and any routes it has received from other leaf nodes.
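Taken together, the spine-side policies described above might look like the following sketch (term names are illustrative; community values are taken from the text):

policy-options {
    community FROM_SPINE_FABRIC_TIER members 0:15;
    community DEFAULT_DIRECT_V6 members [ 1:20008 21001:26000 ];
    policy-statement SPINE_TO_LEAF_FABRIC_OUT {
        term TagAndAccept {
            then {
                community add FROM_SPINE_FABRIC_TIER;
                accept;
            }
        }
    }
    policy-statement BGP-AOS-Policy {
        term BgpRoutes {
            from protocol bgp;
            then accept;
        }
        term PodNetworks {
            from policy AllPodNetworks;
            then accept;
        }
        term Reject {
            then reject;
        }
    }
    policy-statement AllPodNetworks {
        term DirectV6 {
            from {
                protocol direct;
                family inet6;
            }
            then {
                community add DEFAULT_DIRECT_V6;
                accept;
            }
        }
        term Reject {
            then reject;
        }
    }
}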

You can verify that the expected routes are being advertised by the spine node using: show route advertising-protocol bgp <peer-id> table inet6.0

Example:

The following example shows the routes advertised to Stripe 1 Leaf 1 by Spine 1, which correspond to the loopback interface addresses of Spine 1 itself, as well as those of Stripe 1 Leaf 2, Stripe 2 Leaf 1, and Stripe 2 Leaf 2.

To verify that routes are received by the leaf nodes, use: show route receive-protocol bgp <peer-id> table inet6.0

Example:

On the leaf nodes, routes are exported only if they are accepted by both the LEAF_TO_SPINE_FABRIC_OUT and BGP-AOS-Policy export policies.

  • The LEAF_TO_SPINE_FABRIC_OUT policy accepts all routes except those learned via BGP that are tagged with the FROM_SPINE_FABRIC_TIER community (0:15). These routes are explicitly rejected to prevent re-advertisement of spine-learned routes back into the spine layer. As described earlier, spine nodes tag all routes they advertise to leaf nodes with this community to facilitate this filtering logic.
  • The BGP-AOS-Policy accepts all routes allowed by the nested AllPodNetworks policy, which matches directly connected IPv6 routes and tags them with the DEFAULT_DIRECT_V4 community (5:20007 and 21001:26000 for Stripe1-Leaf1).

As a result, leaf nodes will advertise only their directly connected interface routes, including their loopback interfaces, to the spines.
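The leaf-side filtering policy might look like the following sketch (term names are illustrative; BGP-AOS-Policy and AllPodNetworks mirror the spine versions shown earlier, with the DEFAULT_DIRECT_V4 community instead):

policy-options {
    community FROM_SPINE_FABRIC_TIER members 0:15;
    policy-statement LEAF_TO_SPINE_FABRIC_OUT {
        term RejectSpineRoutes {
            from {
                protocol bgp;
                community FROM_SPINE_FABRIC_TIER;
            }
            then reject;
        }
        term AcceptRest {
            then accept;
        }
    }
}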

You can verify that the expected routes are being advertised by the leaf node using: show route advertising-protocol bgp <peer-id> table inet6.0

Example:

The following example shows the routes advertised to Spine 1 by Stripe 1 Leaf 1.

To verify routes are received by the spine node, use: show route receive-protocol bgp <peer-id> table inet6.0

Example:

IPv6 GPU Backend Fabric Overlay

GPU Backend Fabric Overlay Using IPv4

The overlay EBGP sessions are configured between the leaf and spine nodes using the IPv4 addresses of the loopback interfaces, as shown in the example between Stripe 1 Leaf 1/Stripe 2 Leaf 1 and Spine 1.

Table 34. GPU Backend Fabric Overlay Using IPv4 Loopback Addresses – Stripe 1 Example


Table 35. GPU Backend Fabric Overlay Using IPv4 Loopback Addresses – Stripe 2 Example


The overlay BGP sessions use family evpn signaling to enable EVPN route exchange. The multihop ttl 1 statement allows EBGP sessions to be established between the loopback interfaces.

As with the underlay BGP sessions, these sessions are configured with multipath multiple-as, allowing multiple EVPN paths with different AS paths to be considered for ECMP (Equal-Cost Multi-Path) routing. BFD (Bidirectional Forwarding Detection) is also enabled to improve convergence time in case of failures.

The no-nexthop-change knob on the spine nodes is used to preserve the original next-hop address, which is critical in EVPN for ensuring that the remote VTEP can be reached directly. The vpn-apply-export statement is included to ensure that the export policies are evaluated for VPN address families, such as EVPN, allowing fine-grained control over which routes are advertised to each peer.
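A minimal sketch of such an overlay group on Stripe 1 Leaf 1 (the group name, BFD timers, spine loopback address, and peer AS placement are assumptions; the local address is the leaf loopback from Table 23):

protocols {
    bgp {
        group l3clos-evpn-overlay {      # group name is illustrative
            type external;
            multihop {
                ttl 1;
            }
            local-address 10.0.1.1;      # Stripe 1 Leaf 1 loopback
            family evpn {
                signaling;
            }
            multipath {
                multiple-as;
            }
            vpn-apply-export;
            bfd-liveness-detection {
                minimum-interval 1000;   # assumed timer values
                multiplier 3;
            }
            neighbor 10.0.0.1 {          # spine loopback address (hypothetical)
                peer-as 101;
            }
        }
    }
}

On the spine nodes, the same group would additionally carry the no-nexthop-change statement under multihop, as described above.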

To control the propagation of routes, export policies are applied to these EBGP sessions, as shown in the example in Table 36.

Table 36. Export Policy Example to Advertise EVPN Routes over the IPv4 Overlay


These policies are simpler in structure and are intended to enable end-to-end EVPN reachability between tenant GPUs, while preventing route loops within the overlay.

Routes will only be advertised if EVPN routing instances have been created, as in the following example:

Table 37. EVPN Routing Instances for a Single Tenant Across Different Leaf Nodes
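A minimal sketch of what such a tenant routing instance might look like on Stripe 1 Leaf 1, assuming an EVPN Type 5 (ip-prefix-routes) design; the instance name, route distinguisher, route target, and VNI are illustrative:

routing-instances {
    TENANT-1 {
        instance-type vrf;
        interface et-0/0/0:0.0;               # server-facing interface in this tenant
        route-distinguisher 10.0.1.1:100;     # leaf loopback:tenant ID (illustrative)
        vrf-target target:100:100;            # illustrative route target
        protocols {
            evpn {
                ip-prefix-routes {
                    advertise direct-nexthop;
                    encapsulation vxlan;
                    vni 100;                  # illustrative VNI
                }
            }
        }
    }
}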

On the spine nodes, routes are exported if they are accepted by the SPINE_TO_LEAF_EVPN_OUT policy.

The SPINE_TO_LEAF_EVPN_OUT policy has no match conditions and accepts all routes. It tags each exported route with the FROM_SPINE_EVPN_TIER community (0:14).

As a result, the spine nodes export EVPN routes received from one leaf to all other leaf nodes, allowing tenant-to-tenant communication across the fabric.
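The spine-side policy might look like the following sketch (the term name is illustrative):

policy-options {
    community FROM_SPINE_EVPN_TIER members 0:14;
    policy-statement SPINE_TO_LEAF_EVPN_OUT {
        term TagAndAccept {
            then {
                community add FROM_SPINE_EVPN_TIER;
                accept;
            }
        }
    }
}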

Example:

On the leaf nodes, routes are exported if they are accepted by both the LEAF_TO_SPINE_EVPN_OUT and EVPN_EXPORT policies:

  • The LEAF_TO_SPINE_EVPN_OUT policy rejects any BGP-learned routes that carry the FROM_SPINE_EVPN_TIER community (0:14). These routes are explicitly rejected to prevent re-advertisement of spine-learned routes back into the spine layer. As described earlier, spine nodes tag all routes they advertise to leaf nodes with this community to facilitate this filtering logic.
  • The EVPN_EXPORT policy accepts all routes without additional conditions.

As a result, the leaf nodes export only locally originated EVPN routes for the directly connected interfaces between GPU servers and the leaf nodes. These routes are part of the tenant routing instances and are required to establish reachability between GPUs belonging to the same tenant.
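The leaf-side policies might look like the following sketch (term names are illustrative):

policy-options {
    community FROM_SPINE_EVPN_TIER members 0:14;
    policy-statement LEAF_TO_SPINE_EVPN_OUT {
        term RejectSpineEvpnRoutes {
            from {
                protocol bgp;
                community FROM_SPINE_EVPN_TIER;
            }
            then reject;
        }
        term AcceptRest {
            then accept;
        }
    }
    policy-statement EVPN_EXPORT {
        term AcceptAll {
            then accept;
        }
    }
}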