Appendix A – IPv4 Overlay Over IPv6 Underlay Fabric Implementation
When the underlay BGP sessions use IPv6 link-local peering and peer auto-discovery, and the overlay is IPv4, the BGP sessions must be configured to advertise IPv4 routes with IPv6 next hops as described in RFC 5549 (Advertising IPv4 Network Layer Reachability Information with an IPv6 Next Hop).
Consider the example depicted in Figure below.
Figure: IPv6 Link-Local Underlay and IPv4 Overlay Example
IPv4 GPU Server NICs to Leaf Nodes Connections
The links between the GPU interfaces and the leaf nodes are statically configured with /31 IPv4 addresses, as shown in the table below. No router advertisements are sent by the leaf nodes, and SLAAC is not used in this case. All the IPv4 addresses in the example are subnets of 10.200.0.0/16 (with 10.200.0.0/24 assigned to the links between the GPU servers and the leaf nodes in stripe 1, and 10.200.1.0/24 assigned to the links between the GPU servers and the leaf nodes in stripe 2).
| LEAF NODE INTERFACE | LEAF NODE IPv4 ADDRESS | GPU NIC | GPU NIC IPv4 ADDRESS |
|---|---|---|---|
| Stripe 1 Leaf 1 - et-0/0/0:0 | 10.200.0.0/31 | Server 1 - gpu0_eth | 10.200.0.1/31 |
| Stripe 1 Leaf 2 - et-0/0/0:0 | 10.200.0.2/31 | Server 1 - gpu1_eth | 10.200.0.3/31 |
| Stripe 1 Leaf 3 - et-0/0/0:0 | 10.200.0.4/31 | Server 1 - gpu2_eth | 10.200.0.5/31 |
| Stripe 1 Leaf 4 - et-0/0/0:0 | 10.200.0.6/31 | Server 1 - gpu3_eth | 10.200.0.7/31 |
| Stripe 1 Leaf 5 - et-0/0/0:0 | 10.200.0.8/31 | Server 1 - gpu4_eth | 10.200.0.9/31 |
| Stripe 1 Leaf 6 - et-0/0/0:0 | 10.200.0.10/31 | Server 1 - gpu5_eth | 10.200.0.11/31 |
| Stripe 1 Leaf 7 - et-0/0/0:0 | 10.200.0.12/31 | Server 1 - gpu6_eth | 10.200.0.13/31 |
| Stripe 1 Leaf 8 - et-0/0/0:0 | 10.200.0.14/31 | Server 1 - gpu7_eth | 10.200.0.15/31 |
| Stripe 1 Leaf 1 - et-0/0/1:0 | 10.200.0.16/31 | Server 2 - gpu0_eth | 10.200.0.17/31 |
| Stripe 1 Leaf 2 - et-0/0/1:0 | 10.200.0.18/31 | Server 2 - gpu1_eth | 10.200.0.19/31 |
| Stripe 1 Leaf 3 - et-0/0/1:0 | 10.200.0.20/31 | Server 2 - gpu2_eth | 10.200.0.21/31 |
| Stripe 1 Leaf 4 - et-0/0/1:0 | 10.200.0.22/31 | Server 2 - gpu3_eth | 10.200.0.23/31 |
| Stripe 1 Leaf 5 - et-0/0/1:0 | 10.200.0.24/31 | Server 2 - gpu4_eth | 10.200.0.25/31 |
| Stripe 1 Leaf 6 - et-0/0/1:0 | 10.200.0.26/31 | Server 2 - gpu5_eth | 10.200.0.27/31 |
| Stripe 1 Leaf 7 - et-0/0/1:0 | 10.200.0.28/31 | Server 2 - gpu6_eth | 10.200.0.29/31 |
| Stripe 1 Leaf 8 - et-0/0/1:0 | 10.200.0.30/31 | Server 2 - gpu7_eth | 10.200.0.31/31 |
| Stripe 1 Leaf 1 - et-0/0/2:0 | 10.200.0.32/31 | Server 3 - gpu0_eth | 10.200.0.33/31 |
| Stripe 1 Leaf 2 - et-0/0/2:0 | 10.200.0.34/31 | Server 3 - gpu1_eth | 10.200.0.35/31 |
| Stripe 1 Leaf 3 - et-0/0/2:0 | 10.200.0.36/31 | Server 3 - gpu2_eth | 10.200.0.37/31 |
| Stripe 1 Leaf 4 - et-0/0/2:0 | 10.200.0.38/31 | Server 3 - gpu3_eth | 10.200.0.39/31 |
| Stripe 1 Leaf 5 - et-0/0/2:0 | 10.200.0.40/31 | Server 3 - gpu4_eth | 10.200.0.41/31 |
| Stripe 1 Leaf 6 - et-0/0/2:0 | 10.200.0.42/31 | Server 3 - gpu5_eth | 10.200.0.43/31 |
| Stripe 1 Leaf 7 - et-0/0/2:0 | 10.200.0.44/31 | Server 3 - gpu6_eth | 10.200.0.45/31 |
| Stripe 1 Leaf 8 - et-0/0/2:0 | 10.200.0.46/31 | Server 3 - gpu7_eth | 10.200.0.47/31 |
| Stripe 2 Leaf 1 - et-0/0/0:0 | 10.200.1.0/31 | Server 9 - gpu0_eth | 10.200.1.1/31 |
| Stripe 2 Leaf 2 - et-0/0/0:0 | 10.200.1.2/31 | Server 9 - gpu1_eth | 10.200.1.3/31 |
| Stripe 2 Leaf 3 - et-0/0/0:0 | 10.200.1.4/31 | Server 9 - gpu2_eth | 10.200.1.5/31 |
| Stripe 2 Leaf 4 - et-0/0/0:0 | 10.200.1.6/31 | Server 9 - gpu3_eth | 10.200.1.7/31 |
| Stripe 2 Leaf 5 - et-0/0/0:0 | 10.200.1.8/31 | Server 9 - gpu4_eth | 10.200.1.9/31 |
| Stripe 2 Leaf 6 - et-0/0/0:0 | 10.200.1.10/31 | Server 9 - gpu5_eth | 10.200.1.11/31 |
| Stripe 2 Leaf 7 - et-0/0/0:0 | 10.200.1.12/31 | Server 9 - gpu6_eth | 10.200.1.13/31 |
The following example shows the configuration of the interfaces on the leaf node. Only family inet (IPv4) is enabled, with a static IPv4 address.
[edit interfaces et-0/0/0]
jnpr@stripe1-leaf1# show
description "Breakout et-0/0/0";
number-of-sub-ports 2;
speed 400g;
[edit interfaces et-0/0/0:0]
jnpr@stripe1-leaf1# show
mtu 9216;
unit 2 {
    family inet {
        address 10.200.0.254/24;
    }
}

The following example shows the configuration of the interfaces on the server side. Only family inet (IPv4) is enabled, with a static IPv4 address.
gpu0_eth:
  match:
    macaddress: a0:88:c2:3b:50:66
  dhcp4: false
  mtu: 9000
  addresses:
    - 10.200.0.10/24
  routes:
    - to: 10.200.0.0/16
      via: 10.200.0.254
      from: 10.200.0.10
  set-name: gpu0_eth

The netplan configuration disables dhcp4 and configures a static IPv4 address on each of the gpu_eth interfaces. It also configures, for each gpu_eth, a static route for prefix 10.200.0.0/16 pointing to the address of the leaf node. The route includes the source (from) address of the interface, which guarantees that the correct interface is used when sending traffic from a gpu_eth interface to a remote address belonging to the same tenant.
Netplan Example
jnpr@H100-01:/etc/netplan$ sudo cat 00-installer-config-type5_vrf.yaml
# This is the network config written by 'subiquity'
network:
  version: 2
  ethernets:
    mgmt_eth:
      match:
        macaddress: 6c:fe:54:48:2e:48
      dhcp4: false
      addresses:
        - 10.10.1.16/31
      nameservers:
        addresses:
          - 8.8.8.8
      routes:
        - to: default
          via: 10.10.1.17
      set-name: mgmt_eth
    gpu0_eth:
      match:
        macaddress: a0:88:c2:3b:50:66
      dhcp4: false
      mtu: 9000
      addresses:
        - 10.200.0.10/24
      routes:
        - to: 10.200.0.0/16
          via: 10.200.0.254
          from: 10.200.0.10
      set-name: gpu0_eth
    gpu1_eth:
      match:
        macaddress: a0:88:c2:3b:50:6a
      dhcp4: false
      mtu: 9000
      addresses:
        - 10.200.1.10/24
      routes:
        - to: 10.200.0.0/16
          via: 10.200.1.254
          from: 10.200.1.10
      set-name: gpu1_eth
    gpu2_eth:
      match:
        macaddress: a0:88:c2:3b:50:6e
      dhcp4: false
      mtu: 9000
      addresses:
        - 10.200.2.10/24
      routes:
        - to: 10.200.0.0/16
          via: 10.200.2.254
          from: 10.200.2.10
      set-name: gpu2_eth
    gpu3_eth:
      match:
        macaddress: a0:88:c2:3b:50:72
      dhcp4: false
      mtu: 9000
      addresses:
        - 10.200.3.10/24
      routes:
        - to: 10.200.0.0/16
          via: 10.200.3.254
          from: 10.200.3.10
      set-name: gpu3_eth
    gpu4_eth:
      match:
        macaddress: a0:88:c2:0a:79:48
      dhcp4: false
      mtu: 9000
      addresses:
        - 10.200.4.10/24
      routes:
        - to: 10.200.0.0/16
          via: 10.200.4.254
          from: 10.200.4.10
      set-name: gpu4_eth
    gpu5_eth:
      match:
        macaddress: a0:88:c2:0a:79:4c
      dhcp4: false
      mtu: 9000
      addresses:
        - 10.200.5.10/24
      routes:
        - to: 10.200.0.0/16
          via: 10.200.5.254
          from: 10.200.5.10
      set-name: gpu5_eth
    gpu6_eth:
      match:
        macaddress: a0:88:c2:0a:79:40
      dhcp4: false
      mtu: 9000
      addresses:
        - 10.200.6.10/24
      routes:
        - to: 10.200.0.0/16
          via: 10.200.6.254
          from: 10.200.6.10
      set-name: gpu6_eth
    gpu7_eth:
      match:
        macaddress: a0:88:c2:0a:79:44
      dhcp4: false
      mtu: 9000
      addresses:
        - 10.200.7.10/24
      routes:
        - to: 10.200.0.0/16
          via: 10.200.7.254
          from: 10.200.7.10
      set-name: gpu7_eth
    stor0_eth:
      match:
        macaddress: b8:3f:d2:63:e5:44
      dhcp4: false
      mtu: 9000
      addresses:
        - 10.100.1.13/31
      routes:
        - to: 10.100.0.0/21
          via: 10.100.1.12
      set-name: stor0_eth

Refer to the following documentation for details on configuring the interfaces on AMD GPU servers or NVIDIA GPU servers, respectively:
All leaf and spine nodes are configured with IPv4 addresses under the loopback interface (lo0.0). The loopback addresses and Autonomous System numbers for all devices in the fabric are included in Table 23:
Table 23. Spine and Leaf Loopback Addresses and ASNs
| NODE | lo0.0 IPv4 ADDRESS | Local AS # |
|---|---|---|
| Stripe 1 Leaf 1 | 10.0.1.1/32 | 201 |
| Stripe 1 Leaf 2 | 10.0.1.2/32 | 202 |
| Stripe 1 Leaf 3 | 10.0.1.3/32 | 203 |
| Stripe 1 Leaf 4 | 10.0.1.4/32 | 204 |
| Stripe 1 Leaf 5 | 10.0.1.5/32 | 205 |
| Stripe 1 Leaf 6 | 10.0.1.6/32 | 206 |
| Stripe 1 Leaf 7 | 10.0.1.7/32 | 207 |
| Stripe 1 Leaf 8 | 10.0.1.8/32 | 208 |
| Stripe 2 Leaf 1 | 10.0.1.9/32 | 209 |
| Stripe 2 Leaf 2 | 10.0.1.10/32 | 210 |
| . . . | . . . | . . . |
| SPINE1 | 10.0.0.1/32 | 101 |
| SPINE2 | 10.0.0.2/32 | 102 |
| SPINE3 | 10.0.0.3/32 | 103 |
| SPINE4 | 10.0.0.4/32 | 104 |
IPv6 Leaf Nodes to Spine Nodes Connections Using Link Local Addresses
When deploying the underlay using IPv6 link-local addressing, the interfaces between the leaf and spine nodes do not require explicitly configured IP addresses. They are configured as untagged interfaces with only family inet6, to enable processing of IPv6 traffic, as shown in Figure 50.
Figure 50: Leaf nodes to spine nodes connectivity
Table 24. Spine to Leaf Interface Configuration Example
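The full configuration is shown in the table above. As a minimal sketch (the interface name and MTU value are illustrative, based on the values used elsewhere in this example), a fabric-facing interface on a leaf node could look like this, with no explicit address under family inet6:

[edit interfaces et-0/0/30:0]
mtu 9216;
unit 0 {
    family inet6;
}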
Enabling IPv6 on an interface automatically assigns a link-local IPv6 address. The switch autogenerates link local addresses for the interfaces using the EUI-64 address format (based on the interface’s MAC address), as shown in Table 25.
Table 25. Spine and Leaf IPv6-Enabled Interface Link Local Addresses
| LEAF NODE INTERFACE | LEAF NODE IPv6 ADDRESS | SPINE NODE INTERFACE | SPINE IPv6 ADDRESS |
|---|---|---|---|
| Stripe 1 Leaf 1 - et-0/0/30:0 | fe80::9e5a:80ff:fec1:ae00/64 | Spine 1 – et-0/0/0:0 | fe80::9e5a:80ff:feef:a28f/64 |
| Stripe 1 Leaf 1 - et-0/0/31:0 | fe80::9e5a:80ff:fec1:ae08/64 | Spine 2 – et-0/0/0:0 | fe80::5a86:70ff:fe7b:ced5/64 |
| Stripe 1 Leaf 1 - et-0/0/32:0 | fe80::9e5a:80ff:fec1:af00/64 | Spine 3 – et-0/0/0:0 | fe80::5a86:70ff:fe78:e0d5/64 |
| Stripe 1 Leaf 1 - et-0/0/33:0 | fe80::9e5a:80ff:fec1:af08/64 | Spine 4 – et-0/0/0:0 | fe80::5a86:70ff:fe79:3d5/64 |
| Stripe 1 Leaf 2 - et-0/0/30:0 | fe80::5a86:70ff:fe79:dad5/64 | Spine 1 – et-0/0/1:0 | fe80::9e5a:80ff:feef:a297/64 |
| Stripe 1 Leaf 2 - et-0/0/31:0 | fe80::5a86:70ff:fe79:dadd/64 | Spine 2 – et-0/0/1:0 | fe80::5a86:70ff:fe7b:cedd/64 |
| Stripe 1 Leaf 2 - et-0/0/32:0 | fe80::5a86:70ff:fe79:dbd5/64 | Spine 3 – et-0/0/1:0 | fe80::5a86:70ff:fe78:e0dd/64 |
| Stripe 1 Leaf 2 - et-0/0/33:0 | fe80::5a86:70ff:fe79:dbdd/64 | Spine 4 – et-0/0/1:0 | fe80::5a86:70ff:fe79:3dd/64 |
| . . . | . . . | . . . | . . . |
These addresses need to be advertised through standard router advertisements as part of the IPv6 Neighbor Discovery process to allow the leaf and spine nodes to establish BGP sessions with each other. Router advertisements must be enabled on all the interfaces between the leaf and spine nodes, as shown below:
Table 26. IPv6 Router Advertisement on Leaf and Spine Interfaces
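The table contents are not reproduced here. A minimal sketch of enabling router advertisements on the fabric-facing interfaces of Stripe 1 Leaf 1 could look like this (the interface list is illustrative):

[edit protocols router-advertisement]
interface et-0/0/30:0.0;
interface et-0/0/31:0.0;
interface et-0/0/32:0.0;
interface et-0/0/33:0.0;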
To verify that router advertisements are being sent, use the show ipv6 router-advertisement interface <interface> and show ipv6 neighbors commands.
Example:
jnpr@stripe1-leaf1> show ipv6 router-advertisement interface et-0/0/30:0
Interface: et-0/0/30:0.0
  Advertisements sent: 4, last sent 00:02:28 ago
  Solicits sent: 1, last sent 00:08:06 ago
  Solicits received: 0
  Advertisements received: 3
  Solicited router advertisement unicast: Disable
  IPv6 RA Preference: DEFAULT/MEDIUM
  Passive mode: Disable
  Upstream mode: Disable
  Downstream mode: Disable
  Proxy blackout timer: Not Running
  Advertisement from fe80::9e5a:80ff:feef:a28f, heard 00:01:57 ago
    Managed: 0
    Other configuration: 0
    Reachable time: 0 ms
    Default lifetime: 1800 sec
    Retransmit timer: 0 ms
    Current hop limit: 64
jnpr@stripe1-leaf1> show ipv6 neighbors
IPv6 Address Linklayer Address State Exp Rtr Secure Interface
fe80::5a86:70ff:fe78:e0d5 58:86:70:78:e0:d5 reachable 11 yes no et-0/0/31:0.0
fe80::5a86:70ff:fe79:3d5 58:86:70:79:03:d5 reachable 23 yes no et-0/0/33:0.0
fe80::5a86:70ff:fe7b:ced5 58:86:70:7b:ce:d5 reachable 13 yes no et-0/0/32:0.0
fe80::9e5a:80ff:feef:a28f 9c:5a:80:ef:a2:8f reachable 25 yes no et-0/0/30:0.0
Total entries: 4

The loopback interface IPv6 addresses and the Autonomous System numbers for all devices in the fabric are included in Table 26:
Table 26. Spine and Leaf Loopback Addresses and ASNs
| NODE | lo0.0 IPv6 ADDRESS | Local AS # |
|---|---|---|
| Stripe 1 Leaf 1 | FC00:10:0:1::1/128 | 201 |
| Stripe 1 Leaf 2 | FC00:10:0:1::2/128 | 202 |
| Stripe 1 Leaf 3 | FC00:10:0:1::3/128 | 203 |
| Stripe 1 Leaf 4 | FC00:10:0:1::4/128 | 204 |
| Stripe 1 Leaf 5 | FC00:10:0:1::5/128 | 205 |
| Stripe 1 Leaf 6 | FC00:10:0:1::6/128 | 206 |
| Stripe 1 Leaf 7 | FC00:10:0:1::7/128 | 207 |
| Stripe 1 Leaf 8 | FC00:10:0:1::8/128 | 208 |
| Stripe 2 Leaf 1 | FC00:10:0:1::9/128 | 209 |
| Stripe 2 Leaf 2 | FC00:10:0:1::10/128 | 210 |
| . . . | . . . | . . . |
| SPINE1 | FC00:10:0::1/128 | 101 |
| SPINE2 | FC00:10:0::2/128 | 102 |
| SPINE3 | FC00:10:0::3/128 | 103 |
| SPINE4 | FC00:10:0::4/128 | 104 |
Table 27. Spine and Leaf Loopback Address Configuration
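The configuration itself is shown in the table above. As a minimal sketch for Stripe 1 Leaf 1, assuming the IPv4 and IPv6 loopback addresses from Table 23 and Table 26 and a router ID equal to the IPv4 loopback address, it could look like this:

[edit]
interfaces {
    lo0 {
        unit 0 {
            family inet {
                address 10.0.1.1/32;
            }
            family inet6 {
                address fc00:10:0:1::1/128;
            }
        }
    }
}
routing-options {
    router-id 10.0.1.1;
}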
Recommended MTU
Configure the MTU consistently across the fabric, and make sure that the MTU of the server-to-leaf links does not exceed the MTU of the leaf-to-spine links minus the extra overhead of the VXLAN encapsulation.
VXLAN Overhead Calculation
With an IPv6 underlay, the VXLAN encapsulation overhead is calculated as follows:
Table 28 VXLAN Overhead Calculation
| HEADER | BYTES |
|---|---|
| Outer Ethernet | 14 |
| Outer IP (IPv6) | 40 |
| UDP | 8 |
| VXLAN | 8 |
| Total | 70 bytes |
Recommended MTU Strategy
Table 29. Recommended MTU
| LINK TYPE | MTU |
|---|---|
| Server ↔ Leaf | 9000 |
| Leaf ↔ Spine (IPv6 underlay) | ≥ 9070 |
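For example, a 9000-byte IP packet received from a GPU server becomes a 9000 + 70 = 9070-byte packet once the leaf adds the VXLAN encapsulation with an IPv6 outer header, so the leaf-to-spine links must be configured with an MTU of at least 9070; configuring a larger value such as the 9216 used elsewhere in this example leaves additional headroom.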
It is important to keep in mind that RoCEv2 message sizes are still limited by the RDMA MTU reported by ibv_devinfo:
jnpr@MI300-01:~/SCRIPTS$ ibv_devinfo -d bnxt_re0
hca_id: bnxt_re0
        transport:                      InfiniBand (0)
        fw_ver:                         230.2.49.0
        node_guid:                      7ec2:55ff:febd:75d0
        sys_image_guid:                 7ec2:55ff:febd:75d0
        vendor_id:                      0x14e4
        vendor_part_id:                 5984
        hw_ver:                         0x1D42
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

Table 30. MTU Types: Ownership and Functional Role
| MTU TYPE | OWNER | PURPOSE |
|---|---|---|
| Interface MTU (e.g. 9000), set with ifconfig/ip | Linux network stack | Defines the max L3/IP packet size |
| RDMA MTU (e.g. 4096), reported by ibv_devinfo | RDMA stack | Defines the max RDMA message size per Work Queue Element (WQE) |
The RDMA MTU is configured at the verbs level and is negotiated during QP (Queue Pair) setup. You cannot override it by simply setting the NIC MTU to a higher value; instead, you need to use low-level tools or RDMA applications. Some performance tools, such as ib_send_bw and ib_write_bw, allow you to set it through the -m flag. For example:
ib_write_bw -m 1024 # sets RDMA MTU to 1024 bytes
ib_write_bw -m 4096 # sets RDMA MTU to 4096 (max allowed according to the output of ibv_devinfo shown before)
RDMA MTU must be ≤ Interface MTU – encapsulation overhead.
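As a rough illustration (the header sizes are approximate and depend on whether IPv4 or IPv6 is used for the RoCEv2 outer headers), each RoCEv2 packet adds on the order of 58 to 78 bytes of Ethernet/IP/UDP/BTH/ICRC overhead on top of the RDMA payload, so the 4096-byte RDMA MTU reported above fits comfortably within a 9000-byte interface MTU. The -m value would only need to be lowered if the interface MTU were reduced to less than the RDMA MTU plus this encapsulation overhead.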
IPv6 GPU Backend Fabric Underlay Using BGP Neighbor Discovery
Refer to Configure BGP Unnumbered EVPN Fabric | Juniper Networks for more information.
The underlay EBGP sessions between the leaf and spine nodes are configured to use peer auto-discovery and to advertise the loopback interface addresses, as shown in the example between Stripe 1 Leaf 1 and Spine 1 below:
Table 31. GPU Backend Fabric: BGP Underlay with Peer Auto-Discovery Configuration
To configure peer auto-discovery, the dynamic-neighbor statement named underlay-dynamic-neighbors, under BGP group l3clos-inet6-auto-underlay, specifies the interfaces on which auto-discovery is permitted. This replaces the neighbor a.b.c.d statements that would otherwise statically configure the neighbors.
The family inet unicast and family inet6 unicast statements configure the sessions to advertise both IPv4 and IPv6 routes; IPv4 is required to support the IPv4 overlay. When BGP sessions are established over IPv6 link-local addresses but carry IPv4 routes (IPv4 overlay), the extended-nexthop statement must be configured under family inet unicast. This allows IPv4 prefixes advertised with IPv6 next hops to be resolved across an IPv6 transport session, enabling correct installation of the IPv4 prefixes in the routing table, as described in RFC 5549. Omitting extended-nexthop results in hidden routes, because the protocol next hop cannot be resolved.
The family inet6 ipv6-nd statement enables the use of IPv6 Neighbor Discovery to dynamically determine the addresses of the neighbors with which to establish BGP sessions. To control and secure dynamic peer formation, a peer-as-list (discovered-as-list) is configured, restricting peering to neighbors whose autonomous system numbers fall within the defined range of AS 101–104.
The BGP sessions are also configured with multipath multiple-as, allowing multiple paths (even with different AS paths) to be considered for ECMP (Equal-Cost Multi-Path) routing. BFD (Bidirectional Forwarding Detection) is additionally enabled to accelerate convergence in case of link or neighbor failures.
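The complete configuration is shown in Table 31. The following is a condensed sketch of what the underlay group on Stripe 1 Leaf 1 could look like, combining the statements discussed above; the interface list, BFD timers, and export policy names follow the examples elsewhere in this appendix, and the exact structure in Table 31 may differ:

[edit protocols bgp group l3clos-inet6-auto-underlay]
type external;
multipath {
    multiple-as;
}
family inet {
    unicast {
        extended-nexthop;
    }
}
family inet6 {
    unicast;
}
export [ LEAF_TO_SPINE_FABRIC_OUT BGP-AOS-Policy ];
bfd-liveness-detection {
    minimum-interval 3000;
    multiplier 3;
}
peer-as-list discovered-as-list;
dynamic-neighbor underlay-dynamic-neighbors {
    peer-auto-discovery {
        family inet6 {
            ipv6-nd;
        }
        interface et-0/0/30:0.0;
        interface et-0/0/31:0.0;
        interface et-0/0/32:0.0;
        interface et-0/0/33:0.0;
    }
}

[edit policy-options]
as-list discovered-as-list members 101-104;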
You can check that the sessions have been established using: show bgp summary group <group-name>
Example:
jnpr@stripe1-leaf1> show bgp summary group l3clos-inet6-auto-underlay
fe80::5a86:70ff:fe78:e0d5%et-0/0/31:0.0       102        201        196       0       0     1:29:35 Establ
  inet.0: 4/4/4/0
fe80::5a86:70ff:fe79:3d5%et-0/0/33:0.0        104        201        196       0       0     1:29:15 Establ
  inet.0: 4/4/4/0
fe80::5a86:70ff:fe7b:ced5%et-0/0/32:0.0       103        201        196       0       0     1:29:21 Establ
  inet.0: 4/4/4/0
fe80::9e5a:80ff:feef:a28f%et-0/0/30:0.0       101        202        197       0       0     1:29:30 Establ
  inet.0: 4/4/4/0
Notice that when BGP sessions are established using link-local addresses Junos displays the neighbor address along with the interface scope (e.g. fe80::5a86:70ff:fe78:e0d5%et-0/0/1:0.0). The scope identifier (the part after the %) is necessary because the same link-local address (fe80::/10) could exist on multiple interfaces. The device must know which interface to use to send packets to that neighbor. Thus, after peer discovery is completed, the show bgp summary output lists the neighbor using the format: IPv6_link-local_address%interface-name.
Even though the sessions are established using IPv6 link-local addresses, the advertised routes are IPv4 and are installed in the inet.0 routing table.
You can check details about discovered neighbors using: show bgp neighbor auto-discovered <peer-id>
Example:
jnpr@stripe1-leaf1> show bgp neighbor auto-discovered fe80::5a86:70ff:fe78:e0d5%et-0/0/31:0.0
Peer: fe80::5a86:70ff:fe78:e0d5%et-0/0/31:0.0+179 AS 102 Local: fe80::9e5a:80ff:fec1:ae08%et-0/0/31:0.0+53984 AS 201
Group: l3clos-inet-auto-underlay Routing-Instance: master
Forwarding routing-instance: master
Type: External State: Established Flags: <Sync PeerAsList AutoDiscoveredNdp>
Last State: OpenConfirm Last Event: RecvKeepAlive
Last Error: None
Export: [ (LEAF_TO_SPINE_FABRIC_OUT && BGP-AOS-Policy) ]
Options: <GracefulRestart AddressFamily Multipath LocalAS Refresh>
Options: <MultipathAs BfdEnabled>
Options: <GracefulShutdownRcv>
Address families configured: inet-unicast
Holdtime: 90 Preference: 170
Graceful Shutdown Receiver local-preference: 0
Local AS: 201 Local System AS: 201
Number of flaps: 0
Receive eBGP Origin Validation community: Reject
Peer ID: 10.0.0.2 Local ID: 10.0.1.1 Active Holdtime: 90
Keepalive Interval: 30 Group index: 0 Peer index: 0 SNMP index: 30
I/O Session Thread: bgpio-0 State: Enabled
BFD: enabled, up
Local Interface: et-0/0/1:0.0
NLRI for restart configured on peer: inet-unicast
NLRI advertised by peer: inet-unicast
NLRI for this session: inet-unicast
Peer supports Refresh capability (2)
Restart time configured on the peer: 120
Stale routes from peer are kept for: 300
Restart time requested by this peer: 120
Restart flag received from the peer: Notification
NLRI that peer supports restart for: inet-unicast
NLRI peer can save forwarding state: inet-unicast
NLRI that peer saved forwarding for: inet-unicast
NLRI that restart is negotiated for: inet-unicast
NLRI of received end-of-rib markers: inet-unicast
NLRI of all end-of-rib markers sent: inet-unicast
Peer does not support LLGR Restarter functionality
Peer supports 4 byte AS extension (peer-as 102)
Peer does not support Addpath
NLRI(s) enabled for color nexthop resolution: inet-unicast
Table inet.0 Bit: 20000
RIB State: BGP restart is complete
Send state: in sync
Active prefixes: 4
Received prefixes: 4
Accepted prefixes: 4
Suppressed due to damping: 0
Advertised prefixes: 1
Last traffic (seconds): Received 20 Sent 24 Checked 5788
Input messages: Total 216 Updates 5 Refreshes 0 Octets 4535
Output messages: Total 212 Updates 1 Refreshes 0 Octets 4125
Output Queue[1]: 0 (inet.0, inet-unicast)
Trace options: all
Trace file: /var/log//bgp size 131072 files 10

To verify the operation of BFD for the BGP sessions, use: show bfd session
Example:
jnpr@stripe1-leaf1> show bfd session
                                                      Detect   Transmit
Address                     State     Interface       Time     Interval  Multiplier
fe80::5a86:70ff:fe78:e0d5   Up        et-0/0/31:0.0   9.000     3.000        3
fe80::5a86:70ff:fe79:3d5    Up        et-0/0/33:0.0   9.000     3.000        3
fe80::5a86:70ff:fe7b:ced5   Up        et-0/0/32:0.0   9.000     3.000        3
fe80::9e5a:80ff:feef:a28f   Up        et-0/0/30:0.0   9.000     3.000        3

8 sessions, 8 clients
Cumulative transmit rate 2.7 pps, cumulative receive rate 2.7 pps

To control the propagation of routes and make sure the loopback interface addresses are advertised, export policies are applied to these EBGP sessions, as shown in the example in Table 32.
Table 32. Export Policy Example: IPv4 Underlay with Auto-Discovery
These policies ensure loopback reachability without advertising unnecessary routes.
On the spine nodes, routes are exported only if they are accepted by both the SPINE_TO_LEAF_FABRIC_OUT and BGP-AOS-Policy export policies.
- The SPINE_TO_LEAF_FABRIC_OUT policy has no match conditions and accepts all routes unconditionally, tagging them with the FROM_SPINE_FABRIC_TIER community (0:15).
- The BGP-AOS-Policy accepts BGP-learned routes as well as any routes accepted by the nested AllPodNetworks policy.
- The AllPodNetworks policy, in turn, matches directly connected IPv6 routes and tags them with the DEFAULT_DIRECT_V6 community (1:20008 and 21001:26000 on Spine1).
As a result, each spine advertises both its directly connected routes (including its loopback interface) and any routes it has received from other leaf nodes.
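The policies themselves are shown in Table 32. As a minimal sketch of the structure described above (term names are illustrative, and the policies generated for the actual fabric may contain additional terms), the spine-side policies could look like this:

[edit policy-options]
policy-statement SPINE_TO_LEAF_FABRIC_OUT {
    term TagAndAccept {
        then {
            community add FROM_SPINE_FABRIC_TIER;
            accept;
        }
    }
}
policy-statement BGP-AOS-Policy {
    term PodNetworks {
        from policy AllPodNetworks;
        then accept;
    }
    term Bgp {
        from protocol bgp;
        then accept;
    }
}
policy-statement AllPodNetworks {
    term Direct {
        from protocol direct;
        then {
            community add DEFAULT_DIRECT_V6;
            accept;
        }
    }
}
community FROM_SPINE_FABRIC_TIER members 0:15;
community DEFAULT_DIRECT_V6 members [ 1:20008 21001:26000 ];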
You can verify that the expected routes are being advertised by the spine node using: show route advertising-protocol bgp <peer-id> table inet.0
Example:
The following example shows the routes advertised to Stripe 1 Leaf 1 by Spine 1, which correspond to the loopback interface addresses of Spine 1 itself, as well as those of Stripe 1 Leaf 2, Stripe 2 Leaf 1, and Stripe 2 Leaf 2.
jnpr@spine1> show route advertising-protocol bgp fe80::9e5a:80ff:fec1:ae00%et-0/0/30:0.0 table inet.0

inet.0: 5 destinations, 5 routes (5 active, 0 holddown, 0 hidden)
Restart Complete
  Prefix                  Nexthop                      MED     Lclpref    AS path
* 10.0.0.1/32             Self                                            I
* 10.0.1.2/32             Self                                            202 I
* 10.0.1.9/32             Self                                            209 I
* 10.0.1.10/32            Self                                            210 I
To verify routes are received by the leaf nodes, use: show route receive-protocol bgp <peer-id> table inet.0
Example:
jnpr@stripe1-leaf1> show route receive-protocol bgp fe80::5a86:70ff:fe78:e0d5%et-0/0/1:0.0 table inet.0

inet.0: 14 destinations, 23 routes (14 active, 0 holddown, 0 hidden)
Restart Complete
  Prefix                  Nexthop                      MED     Lclpref    AS path
* 10.0.0.1/32             fe80::9e5a:80ff:feef:a28f                       101 I
  10.0.1.2/32             fe80::9e5a:80ff:feef:a28f                       101 202 I
  10.0.1.9/32             fe80::9e5a:80ff:feef:a28f                       101 209 I
  10.0.1.10/32            fe80::9e5a:80ff:feef:a28f                       101 210 I
On the leaf nodes, routes are exported only if they are accepted by both the LEAF_TO_SPINE_FABRIC_OUT and BGP-AOS-Policy export policies.
- The LEAF_TO_SPINE_FABRIC_OUT policy accepts all routes except those learned via BGP that are tagged with the FROM_SPINE_FABRIC_TIER community (0:15). These routes are explicitly rejected to prevent re-advertisement of spine-learned routes back into the spine layer. As described earlier, spine nodes tag all routes they advertise to leaf nodes with this community to facilitate this filtering logic.
- The BGP-AOS-Policy accepts all routes allowed by the nested AllPodNetworks policy, which matches directly connected IPv4 routes and tags them with the DEFAULT_DIRECT_V4 community (5:20007 and 21001:26000 for Stripe1-Leaf1).
As a result, leaf nodes will advertise only their directly connected interface routes, including their loopback interfaces, to the spines.
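Again, the full policies are in Table 32. A minimal sketch of the leaf-side filtering described above could look like this (the BGP-AOS-Policy and AllPodNetworks policies mirror the spine versions, with the DEFAULT_DIRECT_V4 community applied to direct routes):

[edit policy-options]
policy-statement LEAF_TO_SPINE_FABRIC_OUT {
    term RejectSpineTierRoutes {
        from {
            protocol bgp;
            community FROM_SPINE_FABRIC_TIER;
        }
        then reject;
    }
    term AcceptEverythingElse {
        then accept;
    }
}
community FROM_SPINE_FABRIC_TIER members 0:15;
community DEFAULT_DIRECT_V4 members [ 5:20007 21001:26000 ];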
You can verify that the expected routes are being advertised by the leaf node using: show route advertising-protocol bgp <peer-id> table inet.0
Example:
The following example shows the routes advertised to Spine 1 by Stripe 1 Leaf 1.
jnpr@stripe1-leaf1> show route advertising-protocol bgp fe80::5a86:70ff:fe78:e0d5%et-0/0/30:0.0 table inet.0

inet.0: 14 destinations, 23 routes (14 active, 0 holddown, 0 hidden)
Restart Complete
  Prefix                  Nexthop                      MED     Lclpref    AS path
* 10.0.1.1/32             Self                                            I
To verify routes are received by the spine node, use: show route receive-protocol bgp <peer-id> table inet.0
Example:
jnpr@spine1> show route receive-protocol bgp fe80::9e5a:80ff:fec1:ae00%et-0/0/0:0.0 table inet.0

inet.0: 11 destinations, 11 routes (11 active, 0 holddown, 0 hidden)
Restart Complete
  Prefix                  Nexthop                      MED     Lclpref    AS path
* 10.0.1.1/32             fe80::9e5a:80ff:fec1:ae00                       201 I
GPU Backend Fabric Overlay Using IPv4
The overlay EBGP sessions are configured between the leaf and spine nodes using the IPv4 addresses of the loopback interfaces, as shown in the examples between Stripe 1 Leaf 1/Stripe 2 Leaf 1 and Spine 1 below.
Table 33. GPU Backend Fabric Overlay Using IPv4 Loopback Addresses – Stripe 1 Example
Table 34. GPU Backend Fabric Overlay Using IPv4 Loopback Addresses – Stripe 2 Example
The overlay BGP sessions use family evpn signaling to enable EVPN route exchange. The multihop ttl 1 statement allows EBGP sessions to be established between the loopback interfaces.
As with the underlay BGP sessions, these sessions are configured with multipath multiple-as, allowing multiple EVPN paths with different AS paths to be considered for ECMP (Equal-Cost Multi-Path) routing. BFD (Bidirectional Forwarding Detection) is also enabled to improve convergence time in case of failures.
The no-nexthop-change knob on the spine nodes is used to preserve the original next-hop address, which is critical in EVPN for ensuring that the remote VTEP can be reached directly. The vpn-apply-export statement is included to ensure that the export policies are evaluated for VPN address families, such as EVPN, allowing fine-grained control over which routes are advertised to each peer.
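The complete configuration is shown in Table 33 and Table 34. As a minimal sketch of the spine-side overlay group combining the statements above (the group name and BFD timers are illustrative; the loopback addresses and AS numbers follow Table 23):

[edit protocols bgp group evpn-overlay]
type external;
multihop {
    ttl 1;
    no-nexthop-change;
}
local-address 10.0.0.1;
family evpn {
    signaling;
}
multipath {
    multiple-as;
}
vpn-apply-export;
bfd-liveness-detection {
    minimum-interval 3000;
    multiplier 3;
}
neighbor 10.0.1.1 {
    peer-as 201;
}
neighbor 10.0.1.2 {
    peer-as 202;
}

On the leaf nodes the group is the mirror image, using the leaf loopback as local-address, the spine loopbacks as neighbors, and no no-nexthop-change statement.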
To control the propagation of routes, export policies are applied to these EBGP sessions, as shown in the example in Table 35.
Table 35. Export Policy example to advertise EVPN routes over IPv4 overlay
These policies are simpler in structure and are intended to enable end-to-end EVPN reachability between tenant GPUs, while preventing route loops within the overlay.
Routes will only be advertised if EVPN routing-instances have been created. Example:
Table 36. EVPN Routing-Instances Example for a Single Tenant Across Different Leaf Nodes
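The routing-instance configuration is shown in Table 36. A minimal sketch of a pure type-5 instance for Tenant-1 on Stripe 1 Leaf 1 could look like this (the VNI, route target, and interface units are illustrative; the route distinguisher matches the one visible in the EVPN routes below):

[edit routing-instances Tenant-1]
instance-type vrf;
interface et-0/0/0:0.0;
interface et-0/0/1:0.0;
route-distinguisher 10.0.1.1:2001;
vrf-target target:2001:1;
protocols {
    evpn {
        ip-prefix-routes {
            advertise direct-nexthop;
            encapsulation vxlan;
            vni 2001;
        }
    }
}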
On the spine nodes, routes are exported if they are accepted by the SPINE_TO_LEAF_EVPN_OUT policy.
The SPINE_TO_LEAF_EVPN_OUT policy has no match conditions and accepts all routes. It tags each exported route with the FROM_SPINE_EVPN_TIER community (0:14).
As a result, the spine nodes export EVPN routes received from one leaf to all other leaf nodes, allowing tenant-to-tenant communication across the fabric.
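As a minimal sketch of this spine export policy (term name illustrative):

[edit policy-options]
policy-statement SPINE_TO_LEAF_EVPN_OUT {
    term TagAndAccept {
        then {
            community add FROM_SPINE_EVPN_TIER;
            accept;
        }
    }
}
community FROM_SPINE_EVPN_TIER members 0:14;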
Example:
jnpr@spine1> show route advertising-protocol bgp 10.0.1.1 | match 5:10.*2001.*31
5:10.0.1.2:2001::0::10.200.0.2::31/248
5:10.0.1.2:2001::0::10.200.0.34::31/248
5:10.0.1.9:2001::0::10.200.1.0::31/248
5:10.0.1.9:2001::0::10.200.1.32::31/248
5:10.0.1.10:2001::0::10.200.1.2::31/248
5:10.0.1.10:2001::0::10.200.1.34::31/248

jnpr@spine1> show route advertising-protocol bgp 10.0.1.1 match-prefix 5:10.0.1.9:2001::0::10.200.1.0::31/248

bgp.evpn.0: 378 destinations, 378 routes (378 active, 0 holddown, 0 hidden)
Restart Complete
  Prefix                  Nexthop              MED     Lclpref    AS path
  5:10.0.1.9:2001::0::10.200.1.0::31/248
*                         10.0.1.9                                209 I
On the leaf nodes, routes are exported if they are accepted by both the LEAF_TO_SPINE_EVPN_OUT and EVPN_EXPORT policies:
- The LEAF_TO_SPINE_EVPN_OUT policy rejects any BGP-learned routes that carry the FROM_SPINE_EVPN_TIER community (0:14). These routes are explicitly rejected to prevent re-advertisement of spine-learned routes back into the spine layer. As described earlier, spine nodes tag all routes they advertise to leaf nodes with this community to facilitate this filtering logic.
- The EVPN_EXPORT policy accepts all routes without additional conditions.
As a result, the leaf nodes export only locally originated EVPN routes for the directly connected interfaces between GPU servers and the leaf nodes. These routes are part of the tenant routing instances and are required to establish reachability between GPUs belonging to the same tenant.
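As a minimal sketch of these two leaf-side policies (term names illustrative):

[edit policy-options]
policy-statement LEAF_TO_SPINE_EVPN_OUT {
    term RejectSpineEvpnRoutes {
        from {
            protocol bgp;
            community FROM_SPINE_EVPN_TIER;
        }
        then reject;
    }
    term AcceptEverythingElse {
        then accept;
    }
}
policy-statement EVPN_EXPORT {
    term AcceptAll {
        then accept;
    }
}
community FROM_SPINE_EVPN_TIER members 0:14;

The locally originated type-5 routes that result can be verified per tenant routing instance: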
jnpr@stripe1-leaf1> show route advertising-protocol bgp 10.0.0.1 table Tenant-1

Tenant-1.evpn.0: 8 destinations, 20 routes (8 active, 0 holddown, 0 hidden)
Restart Complete
  Prefix                  Nexthop              MED     Lclpref    AS path
  5:10.0.1.1:2001::0::10.200.0.0::31/248
*                         Self                                    I
  5:10.0.1.1:2001::0::10.200.0.16::31/248
*                         Self                                    I

jnpr@stripe1-leaf1> show route advertising-protocol bgp 10.0.0.1 table Tenant-2

Tenant-2.evpn.0: 8 destinations, 20 routes (8 active, 0 holddown, 0 hidden)
Restart Complete
  Prefix                  Nexthop              MED     Lclpref    AS path
  5:10.0.1.1:2002::0::10.200.0.2::31/248
*                         Self                                    I
  5:10.0.1.1:2002::0::10.200.0.18::31/248
*                         Self                                    I