Juniper RDMA-aware Load Balancing (LB) and BGP-DPF – GPU Backend Fabric Implementation

This section outlines the configuration details to implement Juniper RDMA-aware Load Balancing (LB) and BGP-DPF. All configuration and verification examples in this section are based on the following example:

Figure 1: Implementation Example

GPU Server to Leaf Nodes Connections Using IPv6 SLAAC (Stateless Address Autoconfiguration)

The GPU servers are connected following a rail-aligned architecture, as described in the Backend GPU Rail Optimized Stripe Architecture section, where GPU 0 on all servers is connected to the first leaf node, GPU 1 on all servers is connected to the second leaf node, and so on. This is shown in Figure 2.

Figure 2: GPU Servers to Leaf Nodes Rail-Aligned Connectivity

Connectivity between the servers and the leaf nodes is Layer 2 VLAN-based, with an IRB interface on the leaf nodes acting as the default gateway for the servers.

Figure 3: IRB Interface Example

The physical interfaces connecting the servers and the leaf nodes are configured with family ethernet-switching and are mapped to a VLAN whose associated IRB is configured as the l3-interface.

Example:

The following example shows the configuration of the connection between stripe1-leaf1 and the gpu0_eth interfaces on Server 1 and Server 2. The irb.2 interface on the switch is configured with four /64 IPv6 addresses:

  • fc00:1:1:1::1/64,
  • fc00:1:1:2::1/64,
  • fc00:1:1:3::1/64, and
  • fc00:1:1:4::1/64.

These prefixes are advertised to both Server 1 and Server 2.
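A minimal configuration sketch consistent with this example (the VLAN name and VLAN ID are illustrative; the interface names for Server 1 and Server 2 follow Table 4):

set interfaces et-0/0/16:0 unit 0 family ethernet-switching vlan members gpu0_vlan
set interfaces et-0/0/16:1 unit 0 family ethernet-switching vlan members gpu0_vlan
set vlans gpu0_vlan vlan-id 2
set vlans gpu0_vlan l3-interface irb.2
set interfaces irb unit 2 family inet6 address fc00:1:1:1::1/64
set interfaces irb unit 2 family inet6 address fc00:1:1:2::1/64
set interfaces irb unit 2 family inet6 address fc00:1:1:3::1/64
set interfaces irb unit 2 family inet6 address fc00:1:1:4::1/64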

You can verify that the IPv6 addresses have been correctly assigned to the irb.2 interface and that it is associated with the proper VLAN using the following commands:
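For example (VLAN name as in the sketch above):

show interfaces irb.2 terse
show vlans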

Server SLAAC Configuration:

To configure the interfaces on the NVIDIA GPU servers, follow the steps in NVIDIA Configuration | Juniper Networks. You will need to ensure that the Netplan configuration includes statements to disable DHCPv6.

The interfaces on the servers do not need to be configured with any IPv6 address or have IPv6 explicitly enabled. Disabling DHCPv6 is enough.

Example:

The following are Netplan examples. You can use these examples as templates to configure the interfaces on all the servers.
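A minimal sketch of such a Netplan file (the file name and the set of interfaces are illustrative; repeat the same stanza for each gpu#_eth interface):

# /etc/netplan/01-gpu-fabric.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    gpu0_eth:
      dhcp4: false
      dhcp6: false
      accept-ra: true
    gpu1_eth:
      dhcp4: false
      dhcp6: false
      accept-ra: true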

Note:

Make sure IPv4 is not enabled on the gpu#_eth interfaces.

The servers must also be configured to accept and process Router Advertisement (RA) messages for IPv6 address autoconfiguration to work. In most cases, this is enabled by default, but the steps to configure it are described here:

The configuration has two layers:

  1. Interface-level RA policy in Netplan or systemd
  2. Kernel-level sysctl parameters (accept_ra, autoconf)

Both must align to ensure proper RA behavior.

  • If the system uses Netplan with systemd-networkd (common on Ubuntu Server):

In the Netplan YAML file (e.g., /etc/netplan/01-netcfg.yaml), add the following under each interface:
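For example (added under the interface's stanza):

      accept-ra: true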

Then apply the changes:
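sudo netplan apply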

This ensures that Netplan renders a .network file for systemd-networkd with IPv6AcceptRA=yes, which enables RA-based autoconfiguration.

However, this alone is not enough if the kernel is still configured to ignore RAs. You must also verify that the kernel is set to accept RAs at runtime. You can check using:
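sysctl net.ipv6.conf.<interface>.accept_ra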

If the value is 0, RAs will be ignored regardless of Netplan settings. This can be temporarily corrected with:

sudo sysctl -w net.ipv6.conf.<interface>.accept_ra=1

To make it persistent across reboots, add the following to a sysctl configuration file (e.g., /etc/sysctl.d/99-accept-ra.conf):
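For example, replacing <interface> with the actual interface name:

net.ipv6.conf.<interface>.accept_ra = 1
net.ipv6.conf.<interface>.autoconf = 1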

And apply it with:
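sudo sysctl --system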

Note:

Parameters such as accept_ra can be enabled or disabled globally or on a per-interface basis.

Table 1: Parameters in IPv6 Configuration
SYSCTL SCOPE EFFECT
net.ipv6.conf.all.accept_ra Global (all current interfaces) Applies immediately to all existing interfaces. Note that RAs are ignored when IPv6 forwarding is enabled unless accept_ra is set to 2.
net.ipv6.conf.default.accept_ra Global (for future interfaces) Sets the default value used when a new interface comes up (e.g., plugged in or created later)
net.ipv6.conf.gpu0_eth.accept_ra Per-interface Controls RA processing for a specific active interface
  • If the interface is managed directly by the kernel (not using Netplan/systemd):

Enable RA acceptance and autoconfiguration by setting:
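sudo sysctl -w net.ipv6.conf.<interface>.accept_ra=1
sudo sysctl -w net.ipv6.conf.<interface>.autoconf=1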

Leaf node SLAAC Configuration

To enable SLAAC, the Leaf nodes must be configured with IPv6 addresses on the GPU server facing interfaces.

Example:
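For example, on stripe1-leaf1 (irb.2, addresses from Table 2):

set interfaces irb unit 2 family inet6 address fc00:1:1:1::1/64
set interfaces irb unit 2 family inet6 address fc00:1:1:2::1/64
set interfaces irb unit 2 family inet6 address fc00:1:1:3::1/64
set interfaces irb unit 2 family inet6 address fc00:1:1:4::1/64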

After configuring the IPv6 addresses, you must enable the advertisement of all prefixes under protocols router-advertisement, as shown in the example:

Note:

Configuring router advertisements for a given prefix requires an IPv6 address within that same prefix to be configured on the interface where the router advertisements are configured. An error is returned when committing the configuration if a prefix configured under router advertisements is not also configured under the interface.

Example:
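A minimal sketch for stripe1-leaf1 (irb.2), advertising the prefixes configured above:

set protocols router-advertisement interface irb.2 prefix fc00:1:1:1::/64
set protocols router-advertisement interface irb.2 prefix fc00:1:1:2::/64
set protocols router-advertisement interface irb.2 prefix fc00:1:1:3::/64
set protocols router-advertisement interface irb.2 prefix fc00:1:1:4::/64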

The following table summarizes the prefixes and IP addresses on each irb interface, on each leaf node:

Table 2: IPV6 Address Assignments
  IRB VLAN SUBNET 1 IRB IP ADDRESS 1 SUBNET 2 IRB IP ADDRESS 2 SUBNET 3 IRB IP ADDRESS 3 SUBNET 4 IRB IP ADDRESS 4 GPU INTERFACE
STRIPE 1 LEAF 1 1 2 fc00:1:1:1::/64 fc00:1:1:1::1 fc00:1:1:2::/64 fc00:1:1:2::1 fc00:1:1:3::/64 fc00:1:1:3::1 fc00:1:1:4::/64 fc00:1:1:4::1 gpu0_eth
STRIPE 1 LEAF 2 2 3 fc00:1:2:1::/64 fc00:1:2:1::1 fc00:1:2:2::/64 fc00:1:2:2::1 fc00:1:2:3::/64 fc00:1:2:3::1 fc00:1:2:4::/64 fc00:1:2:4::1 gpu1_eth
STRIPE 1 LEAF 3 3 4 fc00:1:3:1::/64 fc00:1:3:1::1 fc00:1:3:2::/64 fc00:1:3:2::1 fc00:1:3:3::/64 fc00:1:3:3::1 fc00:1:3:4::/64 fc00:1:3:4::1 gpu2_eth
STRIPE 1 LEAF 4 4 5 fc00:1:4:1::/64 fc00:1:4:1::1 fc00:1:4:2::/64 fc00:1:4:2::1 fc00:1:4:3::/64 fc00:1:4:3::1 fc00:1:4:4::/64 fc00:1:4:4::1 gpu3_eth
STRIPE 1 LEAF 5 5 6 fc00:1:5:1::/64 fc00:1:5:1::1 fc00:1:5:2::/64 fc00:1:5:2::1 fc00:1:5:3::/64 fc00:1:5:3::1 fc00:1:5:4::/64 fc00:1:5:4::1 gpu4_eth
STRIPE 1 LEAF 6 6 7 fc00:1:6:1::/64 fc00:1:6:1::1 fc00:1:6:2::/64 fc00:1:6:2::1 fc00:1:6:3::/64 fc00:1:6:3::1 fc00:1:6:4::/64 fc00:1:6:4::1 gpu5_eth
STRIPE 1 LEAF 7 7 8 fc00:1:7:1::/64 fc00:1:7:1::1 fc00:1:7:2::/64 fc00:1:7:2::1 fc00:1:7:3::/64 fc00:1:7:3::1 fc00:1:7:4::/64 fc00:1:7:4::1 gpu6_eth
STRIPE 1 LEAF 8 8 9 fc00:1:8:1::/64 fc00:1:8:1::1 fc00:1:8:2::/64 fc00:1:8:2::1 fc00:1:8:3::/64 fc00:1:8:3::1 fc00:1:8:4::/64 fc00:1:8:4::1 gpu7_eth
STRIPE 2 LEAF 1 9 10 fc00:2:1:1::/64 fc00:2:1:1::1 fc00:2:1:2::/64 fc00:2:1:2::1 fc00:2:1:3::/64 fc00:2:1:3::1 fc00:2:1:4::/64 fc00:2:1:4::1 gpu0_eth
STRIPE 2 LEAF 2 10 11 fc00:2:2:1::/64 fc00:2:2:1::1 fc00:2:2:2::/64 fc00:2:2:2::1 fc00:2:2:3::/64 fc00:2:2:3::1 fc00:2:2:4::/64 fc00:2:2:4::1 gpu1_eth
STRIPE 2 LEAF 3 11 12 fc00:2:3:1::/64 fc00:2:3:1::1 fc00:2:3:2::/64 fc00:2:3:2::1 fc00:2:3:3::/64 fc00:2:3:3::1 fc00:2:3:4::/64 fc00:2:3:4::1 gpu2_eth
STRIPE 2 LEAF 4 12 13 fc00:2:4:1::/64 fc00:2:4:1::1 fc00:2:4:2::/64 fc00:2:4:2::1 fc00:2:4:3::/64 fc00:2:4:3::1 fc00:2:4:4::/64 fc00:2:4:4::1 gpu3_eth
STRIPE 2 LEAF 5 13 14 fc00:2:5:1::/64 fc00:2:5:1::1 fc00:2:5:2::/64 fc00:2:5:2::1 fc00:2:5:3::/64 fc00:2:5:3::1 fc00:2:5:4::/64 fc00:2:5:4::1 gpu4_eth
STRIPE 2 LEAF 6 14 15 fc00:2:6:1::/64 fc00:2:6:1::1 fc00:2:6:2::/64 fc00:2:6:2::1 fc00:2:6:3::/64 fc00:2:6:3::1 fc00:2:6:4::/64 fc00:2:6:4::1 gpu5_eth
STRIPE 2 LEAF 7 15 16 fc00:2:7:1::/64 fc00:2:7:1::1 fc00:2:7:2::/64 fc00:2:7:2::1 fc00:2:7:3::/64 fc00:2:7:3::1 fc00:2:7:4::/64 fc00:2:7:4::1 gpu6_eth
STRIPE 2 LEAF 8 16 17 fc00:2:8:1::/64 fc00:2:8:1::1 fc00:2:8:2::/64 fc00:2:8:2::1 fc00:2:8:3::/64 fc00:2:8:3::1 fc00:2:8:4::/64 fc00:2:8:4::1 gpu7_eth

SLAAC Verification:

To verify that RA-based configuration works and that the GPU interface has autoconfigured its IPv6 addresses and installed the corresponding routes, use the following commands:
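For example (interface name is illustrative):

ip -6 addr show dev gpu0_eth
ip -6 route show dev gpu0_eth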

You should see a global inet6 address in the SLAAC format (prefix::EUI-64) marked as dynamic or mngtmpaddr. You will also see the interface's link-local address (fe80::EUI-64).

Example:

You can also observe incoming RA messages with:
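For example, capturing ICMPv6 Router Advertisements (type 134) with tcpdump (interface name is illustrative):

sudo tcpdump -i gpu0_eth -vv 'icmp6 and ip6[40] == 134'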

Example:

In some cases, especially after changing RA settings or switching between static and dynamic configurations, the interface may need to be reset to trigger address reassignment:
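For example:

sudo ip link set gpu0_eth down
sudo ip link set gpu0_eth up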

After bringing the interface back up, wait a few seconds and re-check the IPv6 address with:
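ip -6 addr show dev gpu0_eth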

This ensures that stale addresses are removed, and fresh RAs are processed.

Note:

All IPv6 settings can be found under /proc/sys/net/ipv6/conf.

To verify that router advertisements are being sent, you can use the following command: show ipv6 router-advertisement interface <interface>.

Example:

You can also capture router advertisement packets on the interface using:
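For example, a sketch using the Junos monitor traffic command on the leaf node (the interface name is illustrative, and the exact match syntax and capture support may vary by platform and release):

monitor traffic interface et-0/0/16:0 no-resolve matching "icmp6"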

Example:

Notice that Router Advertisements are sent from the link-local address of the Leaf node interface to the IPv6 all-nodes multicast address (ff02::1), with next header ICMPv6 (58). The following are the most relevant attributes for these messages:

Table 3: Fields and Semantics in IPv6 Router Advertisement Prefix Information
PARAMETER VALUE DESCRIPTION
Flags onlink, auto onlink: hosts can assume addresses in this prefix are on the local link. auto: this prefix can be used for SLAAC (Stateless Address Autoconfiguration).
Valid Lifetime 2592000 Prefix is valid for 30 days (used for reachability).
Preferred Lifetime 604800 Preferred lifetime of 7 days (after which the address becomes deprecated for new connections).
Router Lifetime 1800s The router is considered a default gateway for 1800 seconds.

After receiving the router advertisement, the server's NIC interfaces autoconfigure their IPv6 addresses by concatenating the prefix advertised by the Leaf node with a host portion calculated using the EUI-64 format (based on the interface's MAC address), as shown in Table 4. For example, MAC address a0:88:c2:3b:50:66 yields the interface identifier a288:c2ff:fe3b:5066: the bytes ff:fe are inserted in the middle of the MAC address and the universal/local bit of the first byte is flipped (a0 becomes a2), so the address in prefix fc00:1:1:1::/64 is fc00:1:1:1:a288:c2ff:fe3b:5066.

Table 4: GPU to Leaf Nodes IPv6 Addresses
LEAF NODE INTERFACE LEAF NODE IPv6 ADDRESSES GPU NIC GPU NIC MAC ADDRESS GPU NIC IPv6 ADDRESSES

Stripe 1 Leaf 1 - et-0/0/16:0
fc00:1:1:1::1, fc00:1:1:2::1, fc00:1:1:3::1, fc00:1:1:4::1
Server 1 - gpu0_eth a0:88:c2:3b:50:66
fc00:1:1:1:a288:c2ff:fe3b:5066, fc00:1:1:2:a288:c2ff:fe3b:5066, fc00:1:1:3:a288:c2ff:fe3b:5066, fc00:1:1:4:a288:c2ff:fe3b:5066

Stripe 1 Leaf 1 - et-0/0/16:1
fc00:1:1:1::1, fc00:1:1:2::1, fc00:1:1:3::1, fc00:1:1:4::1
Server 2 - gpu0_eth 58:a2:e1:46:c6:ca
fc00:1:1:1:5aa2:e1ff:fe46:c6ca, fc00:1:1:2:5aa2:e1ff:fe46:c6ca, fc00:1:1:3:5aa2:e1ff:fe46:c6ca, fc00:1:1:4:5aa2:e1ff:fe46:c6ca

Stripe 1 Leaf 1 - et-0/0/17:0
fc00:1:1:1::1, fc00:1:1:2::1, fc00:1:1:3::1, fc00:1:1:4::1
Server 3 - gpu0_eth a0:88:c2:3b:50:6e
fc00:1:1:1:a288:c2ff:fe3b:506e, fc00:1:1:2:a288:c2ff:fe3b:506e, fc00:1:1:3:a288:c2ff:fe3b:506e, fc00:1:1:4:a288:c2ff:fe3b:506e

...

BGP DPF (Deterministic Path Forwarding) Using IPv6 Neighbor Discovery

This section describes how to configure BGP to establish peering sessions between the leaf and spine nodes automatically, using IPv6 neighbor discovery, and how to enable Deterministic Path Forwarding. It also covers other related BGP configuration parameters.

BGP Auto-discovery

Because each connection between a leaf and spine node must be mapped to a different fabric color, and BGP is configured for auto-discovery, a separate BGP group per neighbor is required. This ensures that each dynamically discovered peer is associated with the correct color.

Each BGP group is configured as an external BGP session and assigned a local AS number:

Example:
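A minimal sketch (group names and AS numbers are illustrative):

set protocols bgp group SPINE-1 type external
set protocols bgp group SPINE-1 local-as 65101
set protocols bgp group SPINE-2 type external
set protocols bgp group SPINE-2 local-as 65101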


To enable BGP peer auto-discovery, each group includes a dynamic-neighbor template, which specifies the interface(s) where discovery is permitted. This replaces the traditional neighbor a.b.c.d configuration used for static peers. Auto-discovery is enabled using the peer-auto-discovery and ipv6-nd options:

Example:
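A sketch assembled from the statements named above (group, template, and interface names are illustrative, and the exact hierarchy may vary by Junos release):

set protocols bgp group SPINE-1 dynamic-neighbor SPINE-1-DYN peer-auto-discovery family inet6 ipv6-nd
set protocols bgp group SPINE-1 dynamic-neighbor SPINE-1-DYN peer-auto-discovery interface et-0/0/0:0.0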


This allows Junos to determine neighbor addresses dynamically using IPv6 Neighbor Discovery.

To secure peer formation, each group references a defined AS range using the peer-as-list statement. This ensures that only neighbors with matching AS numbers can establish BGP sessions. The list itself is defined under policy-options:

Auto-discovery must be enabled on both the leaf and spine nodes.

Example:
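A sketch (the AS-list name and range are illustrative):

set policy-options as-list SPINE-AS members 65200-65299
set protocols bgp group SPINE-1 peer-as-list SPINE-AS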


Each group is also assigned a fabric color, which is defined as a BGP community under policy-options:

where: <community-name> = <color>
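A sketch (the community values and the mapping of colors to values are illustrative; the document only confirms that green corresponds to color:0:1):

set policy-options community green members color:0:1
set policy-options community blue members color:0:2
set policy-options community red members color:0:3
set policy-options community orange members color:0:4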


IPv6 NLRI support

The advertisement of IPv6 routes must be enabled on both leaf and spine nodes using:
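For example (group name as above):

set protocols bgp group SPINE-1 family inet6 unicast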

ECMP multipath

BGP multipath must be configured to allow Equal-Cost Multipath (ECMP) routing over the protected paths, ensuring both load balancing and failover across multiple spine uplinks. This is achieved using:
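For example:

set protocols bgp group SPINE-1 multipath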

BGP multipath must be configured on both leaf and spine nodes.

BFD (Bidirectional Forwarding Detection)

BFD is configured to improve failure detection time using:
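For example, using the recommended values from Table 6:

set protocols bgp group SPINE-1 bfd-liveness-detection minimum-interval 1000
set protocols bgp group SPINE-1 bfd-liveness-detection multiplier 3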

Table 6: BFD Options
Options Description

minimum-interval

(required)

Minimum time (in milliseconds) between BFD hello packets sent by the local device and expected from the neighbor.

Range: 1-255,000

Recommended value: 1000

multiplier

(Optional)

number of hello packets not received by a neighbor that causes the originating interface to be declared down.

Range: 1-255

Recommended: 3 (default)

BFD must be configured on both the leaf and spine nodes.

DPF (Deterministic Path Forwarding)

The leaf nodes must be configured to advertise the same individual /64 IPv6 prefixes advertised to the GPU servers via Router Advertisements (RAs), as described in the GPU Server to Leaf Nodes Connections Using IPv6 SLAAC section.

Prefixes are advertised using the fabric-advertise statement:
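The general form is sketched below (the placement under protocols bgp and the exact option syntax are assumptions; color and backup-color refer to the communities defined earlier):

set protocols bgp fabric-advertise <prefix> color <color>
set protocols bgp fabric-advertise <prefix> backup-color <backup-color>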

Note:

The fabric-advertise commands are only needed on the leaf nodes.

The color option specifies that the prefix should be advertised to BGP peers matching <color>. The prefix is advertised tagged with the community value associated with <color> and with the AIGP attribute set to 0. The backup-color option specifies that the prefix should be advertised to BGP peers matching <backup-color>. The advertised prefix is tagged with the community value associated with the peer, but without the AIGP attribute.

Example:

Figure 5: Prefix fc00:1:1:1::/64 Advertisements

The following commands assign fabric colors to each BGP peer (i.e., each SPINE node when configured on a leaf node):
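A sketch (group names and the color-to-spine mapping follow the example in the text; placing fabric-color under the group is an assumption):

set protocols bgp group SPINE-1 fabric-color green
set protocols bgp group SPINE-2 fabric-color blue
set protocols bgp group SPINE-3 fabric-color red
set protocols bgp group SPINE-4 fabric-color orange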


To advertise a prefix (fc00:1:1:1::/64) across the preferred path to reach that prefix, and assign the color green to the prefix, the following commands are used:
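For example (hierarchy as sketched above):

set protocols bgp fabric-advertise fc00:1:1:1::/64 color green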

Note:

Make sure the correct prefixes are configured: no commit error is generated for an incorrect prefix, but the prefix will not be advertised.

This causes prefix fc00:1:1:1::/64 to be advertised to Spine 1. Because the color assigned to the prefix (green) matches the color of the peer (spine 1), the route is advertised with community color:0:1 (green) and includes the AIGP attribute, marking it as the preferred path.

To advertise the same prefix across all other paths (backup path), the following commands are used:
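For example (hierarchy as sketched above; whether backup colors are listed individually or expressed differently may vary):

set protocols bgp fabric-advertise fc00:1:1:1::/64 backup-color blue
set protocols bgp fabric-advertise fc00:1:1:1::/64 backup-color red
set protocols bgp fabric-advertise fc00:1:1:1::/64 backup-color orange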

This causes the prefix to be advertised to all remaining spines, each tagged with its corresponding color community (blue, red, orange). Because the color assigned to the prefix (green) does not match the color of the peers (spines 2, 3 and 4), these advertisements do not include the AIGP attribute, ensuring they are less preferred.

The following table summarizes the routing advertisement of prefix fc00:1:1:1::/64:

Table: Community and AIGP for Prefix fc00:1:1:1::/64

The previous example showed how a single prefix is advertised to all spine nodes, with only one spine receiving the route with the AIGP attribute to indicate the preferred path.

The table below expands on this by summarizing all four prefixes advertised by stripe1-leaf1 to all spine nodes. For each prefix, it shows:

  • The assigned color for the intended path.
  • The BGP community (color:0:X) used for tagging.
  • Whether the AIGP attribute is included (for preferred path only).

Table: Community and AIGP Per Spine, Per Prefix Example


Example:


Note:

When neighbors are configured statically, the fabric-color can be configured directly under the neighbor statement, and therefore it is possible to have a single BGP group.

Preventing Route Re-advertisement to Spines

To configure the leaf nodes to advertise only the prefixes associated with the irb interface, and to prevent the advertisement of routes learned from one spine back to the other spine peers, an export policy is applied to all BGP groups. This ensures that the leaf acts as a non-transit node for spine-to-spine traffic and maintains proper routing in the fabric.

This policy matches any route matching prefixes fc00::/16 and fd00::/16 with a prefix length of 16 bits or longer, and rejects it during export. Any additional IPv6 prefix to be advertised can be added to the prefix-list local.

Without this policy, a leaf node would re-advertise prefixes learned from one spine node to all the others, which could lead to unwanted routing behavior, and inefficient traffic distribution across the different paths.

Additionally, routes are tagged with community local, which the spine nodes use to prevent advertising prefixes back to the leaf nodes.
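A minimal sketch of such a policy (policy, term, prefix-list, and community names are illustrative, and the community value for local is an assumption; the terms follow the description above):

set policy-options prefix-list local fc00:1:1:1::/64
set policy-options prefix-list local fc00:1:1:2::/64
set policy-options prefix-list local fc00:1:1:3::/64
set policy-options prefix-list local fc00:1:1:4::/64
set policy-options community local members 65000:1
set policy-options policy-statement EXPORT-LOCAL term local-prefixes from prefix-list local
set policy-options policy-statement EXPORT-LOCAL term local-prefixes then community add local
set policy-options policy-statement EXPORT-LOCAL term local-prefixes then accept
set policy-options policy-statement EXPORT-LOCAL term no-transit from route-filter fc00::/16 orlonger
set policy-options policy-statement EXPORT-LOCAL term no-transit from route-filter fd00::/16 orlonger
set policy-options policy-statement EXPORT-LOCAL term no-transit then reject
set protocols bgp group SPINE-1 export EXPORT-LOCAL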

BGP Session Auto-discovery and DPF Verification

You can check that the sessions have been established by using the show bgp summary command:

Notice that when BGP sessions are established using link-local addresses Junos displays the neighbor address along with the interface scope (e.g. fe80::5a86:70ff:fe78:e0d5%et-0/0/1:0.0). The scope identifier (the part after the %) is necessary because the same link-local address (fe80::/10) could exist on multiple interfaces. The device must know which interface to use to send packets to that neighbor. Thus, after peer discovery is completed, the show bgp summary output lists the neighbor using the format: IPv6_link-local_address%interface-name

You can quickly check the status of the discovered neighbors using show bgp summary autodiscovered, as shown in the example below:

You can verify the status of a specific neighbor based on its fabric-color:

To verify that the prefixes are advertised correctly to all peers use show route advertising-protocol bgp <peer-address>.

In the example, fe80::9e5a:80ff:feef:a28f%et-0/0/0:0.0 is the address of Spine 1.

Thus, all prefixes are advertised with community color:0:1 (green). fc00:1:1:1::/64 is also advertised with the AIGP attribute as it is also configured with the same color. The same prefix is also advertised to the other spines but without the AIGP value. fe80::5a86:70ff:fe78:e0d5%et-0/0/1:0.0 is the address of Spine 2.

fe80::5a86:70ff:fe7b:ced5%et-0/0/2:0.0 is the address of Spine 3.

fe80::5a86:70ff:fe79:3d5%et-0/0/3:0.0 is the address of Spine 4.

Juniper NCCL Plug-in

The Juniper NCCL Net Plugin assigns a unique IPv6 address to each Queue Pair (QP) on every RDMA interface. This enables QP flows to use distinct source and destination addresses, allowing the fabric to forward them along separate paths. To support this behavior, the plugin and its supporting libraries must be installed, and the servers must be configured with additional routing information to ensure proper forwarding of each IPv6 address—both from the server to the leaf node and vice versa.

Installing the plug-in on the servers

The Juniper NCCL net plugin is distributed as a compressed tar-ball (juniper-ib_2.23.4-1.tar.gz) and can be found in …

To install, extract the tar-ball to the root directory:

$ tar -xzvf juniper-ib_0.0.5.tar.gz -C /

Table 7: Key Installation Components
COMPONENT DESCRIPTION/USAGE
/usr/local/lib/libnccl-net-juniper-ib.so The NCCL network plugin shared object
/usr/local/bin/jnpr-fabric-topo-gen Script for generating fabric topology json file
/usr/local/bin/jnpr-AI-LB-dpf-config-gen Script for generating AI-LB DPF configurations
/usr/local/bin/jnpr-nccl-net-setup Tool for configuring GPU Server network settings
/usr/local/bin/gids.py Helper module for finding GIDs
/usr/local/bin/jnpr-rdma-ping Tool for testing RDMA connectivity
/usr/local/bin/jnpr-find-gids Tool for finding GIDs.

Configuring routing parameters on the GPU servers

On the GPU servers, each IPv6 address assigned to an interface must be associated with a separate routing table, containing the appropriate routes to forward traffic through the correct default gateway. Conversely, IP rules must be created to direct incoming traffic destined for each specific IPv6 address to its corresponding routing table.

Therefore, when configuring the servers to operate correctly with the NCCL plugin for RDMA Load Balancing (RLB), you must create a number of routing tables equal to the number of IPv6 addresses assigned per interface (based on the number of uplinks, as previously described) multiplied by the number of NICs.

In the example shown in Figure 6, each interface is assigned four IPv6 addresses, resulting in a total of 32 routing tables. Each routing table must include a default route and a corresponding prefix route. Additionally, an IP rule must be added for each routing table.

Figure 6: Example from Server 1 (H100-01)

The jnpr-nccl-net-setup utility live-run option can be used to automatically create the necessary tables, routes, and IP rules on each server.

The command requires exporting the NCCL socket interface, and the Address family as shown in the example below:

If you need to create the routing tables, routes, and rules manually, follow the steps below on all GPU servers:

  1. Create Routing Tables

    Create the file jnpr_nccl_net.conf under /etc/iproute2/rt_tables.d on each server and add each table id and name per line as shown in the example.

    EXAMPLE:

    Table 8: Routing Tables Example
    STRIPE 1 / STRIPE 2
    INTERFACE ID TABLE
    gpu0_eth 10000 gpu0_subnet1
    10001 gpu0_subnet2
    10002 gpu0_subnet3
    10003 gpu0_subnet4
    gpu1_eth 10004 gpu1_subnet1
    10005 gpu1_subnet2
    10006 gpu1_subnet3
    10007 gpu1_subnet4
    gpu2_eth 10008 gpu2_subnet1
    10009 gpu2_subnet2
    10010 gpu2_subnet3
    10011 gpu2_subnet4
    gpu3_eth 10012 gpu3_subnet1
    10013 gpu3_subnet2
    10014 gpu3_subnet3
    10015 gpu3_subnet4
    gpu4_eth 10016 gpu4_subnet1
    10017 gpu4_subnet2
    10018 gpu4_subnet3
    10019 gpu4_subnet4
    gpu5_eth 10020 gpu5_subnet1
    10021 gpu5_subnet2
    10022 gpu5_subnet3
    10023 gpu5_subnet4
    gpu6_eth 10024 gpu6_subnet1
    10025 gpu6_subnet2
    10026 gpu6_subnet3
    10027 gpu6_subnet4
    gpu7_eth 10028 gpu7_subnet1
    10029 gpu7_subnet2
    10030 gpu7_subnet3
    10031 gpu7_subnet4
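    For example, the first lines of /etc/iproute2/rt_tables.d/jnpr_nccl_net.conf would be:

    10000 gpu0_subnet1
    10001 gpu0_subnet2
    10002 gpu0_subnet3
    10003 gpu0_subnet4
    10004 gpu1_subnet1
    ...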
  2. Verify that the tables were created.
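    For example:

    cat /etc/iproute2/rt_tables.d/jnpr_nccl_net.conf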
  3. Add IPv6 routes on each routing table.

    Configure a default route and prefix route on each routing table using the following commands, as shown in the example.
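    For example, for routing table gpu0_subnet1 on a stripe 1 server (prefix and default gateway taken from Table 9):

    sudo ip -6 route add fc00:1:1:1::/64 dev gpu0_eth table gpu0_subnet1
    sudo ip -6 route add default via fc00:1:1:1::1 dev gpu0_eth table gpu0_subnet1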

    If you need to remove all existing routes from a table before you create the required routes, you can run:
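    For example (repeat for each table; gpu0_subnet1 shown):

    sudo ip -6 route flush table gpu0_subnet1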

    EXAMPLE:

    Table 9: Routes Example with 4 IPv6 Addresses per Interface
      STRIPE 1 STRIPE 2
    <INTERFACE> <TABLE> <PREFIX> <DEFAULT GATEWAY> <TABLE> <PREFIX> <DEFAULT GATEWAY>
    gpu0_eth gpu0_subnet1 fc00:1:1:1::/64

    fc00:1:1:1::1

    (leaf 1 irb.2)

    gpu0_subnet1 fc00:2:1:1::/64

    fc00:2:1:1::1

    (leaf 1 irb.10)

    gpu0_subnet2 fc00:1:1:2::/64

    fc00:1:1:2::1

    (leaf 1 irb.2)

    gpu0_subnet2 fc00:2:1:2::/64

    fc00:2:1:2::1

    (leaf 1 irb.10)

    gpu0_subnet3 fc00:1:1:3::/64

    fc00:1:1:3::1

    (leaf 1 irb.2)

    gpu0_subnet3 fc00:2:1:3::/64

    fc00:2:1:3::1

    (leaf 1 irb.10)

    gpu0_subnet4 fc00:1:1:4::/64

    fc00:1:1:4::1

    (leaf 1 irb.2)

    gpu0_subnet4 fc00:2:1:4::/64

    fc00:2:1:4::1

    (leaf 1 irb.10)

    gpu1_eth gpu1_subnet1 fc00:1:2:1::/64

    fc00:1:2:1::1

    (leaf 2 irb.3)

    gpu1_subnet1 fc00:2:2:1::/64

    fc00:2:2:1::1

    (leaf 2 irb.11)

    gpu1_subnet2 fc00:1:2:2::/64

    fc00:1:2:2::1

    (leaf 2 irb.3)

    gpu1_subnet2 fc00:2:2:2::/64

    fc00:2:2:2::1

    (leaf 2 irb.11)

    gpu1_subnet3 fc00:1:2:3::/64

    fc00:1:2:3::1

    (leaf 2 irb.3)

    gpu1_subnet3 fc00:2:2:3::/64

    fc00:2:2:3::1

    (leaf 2 irb.11)

    gpu1_subnet4 fc00:1:2:4::/64

    fc00:1:2:4::1

    (leaf 2 irb.3)

    gpu1_subnet4 fc00:2:2:4::/64

    fc00:2:2:4::1

    (leaf 2 irb.11)

    gpu2_eth gpu2_subnet1 fc00:1:3:1::/64

    fc00:1:3:1::1

    (leaf 3 irb.4)

    gpu2_subnet1 fc00:2:3:1::/64

    fc00:2:3:1::1

    (leaf 3 irb.12)

    gpu2_subnet2 fc00:1:3:2::/64

    fc00:1:3:2::1

    (leaf 3 irb.4)

    gpu2_subnet2 fc00:2:3:2::/64

    fc00:2:3:2::1

    (leaf 3 irb.12)

    gpu2_subnet3 fc00:1:3:3::/64

    fc00:1:3:3::1

    (leaf 3 irb.4)

    gpu2_subnet3 fc00:2:3:3::/64

    fc00:2:3:3::1

    (leaf 3 irb.12)

    gpu2_subnet4 fc00:1:3:4::/64

    fc00:1:3:4::1

    (leaf 3 irb.4)

    gpu2_subnet4 fc00:2:3:4::/64

    fc00:2:3:4::1

    (leaf 3 irb.12)

    gpu3_eth gpu3_subnet1 fc00:1:4:1::/64

    fc00:1:4:1::1

    (leaf 4 irb.5)

    gpu3_subnet1 fc00:2:4:1::/64

    fc00:2:4:1::1

    (leaf 4 irb.13)

    gpu3_subnet2 fc00:1:4:2::/64

    fc00:1:4:2::1

    (leaf 4 irb.5)

    gpu3_subnet2 fc00:2:4:2::/64

    fc00:2:4:2::1

    (leaf 4 irb.13)

    gpu3_subnet3 fc00:1:4:3::/64

    fc00:1:4:3::1

    (leaf 4 irb.5)

    gpu3_subnet3 fc00:2:4:3::/64

    fc00:2:4:3::1

    (leaf 4 irb.13)

    gpu3_subnet4 fc00:1:4:4::/64

    fc00:1:4:4::1

    (leaf 4 irb.5)

    gpu3_subnet4 fc00:2:4:4::/64

    fc00:2:4:4::1

    (leaf 4 irb.13)

    gpu4_eth gpu4_subnet1 fc00:1:5:1::/64

    fc00:1:5:1::1

    (leaf 5 irb.6)

    gpu4_subnet1 fc00:2:5:1::/64

    fc00:2:5:1::1

    (leaf 5 irb.14)

    gpu4_subnet2 fc00:1:5:2::/64

    fc00:1:5:2::1

    (leaf 5 irb.6)

    gpu4_subnet2 fc00:2:5:2::/64

    fc00:2:5:2::1

    (leaf 5 irb.14)

    gpu4_subnet3 fc00:1:5:3::/64

    fc00:1:5:3::1

    (leaf 5 irb.6)

    gpu4_subnet3 fc00:2:5:3::/64

    fc00:2:5:3::1

    (leaf 5 irb.14)

    gpu4_subnet4 fc00:1:5:4::/64

    fc00:1:5:4::1

    (leaf 5 irb.6)

    gpu4_subnet4 fc00:2:5:4::/64

    fc00:2:5:4::1

    (leaf 5 irb.14)

    gpu5_eth gpu5_subnet1 fc00:1:6:1::/64

    fc00:1:6:1::1

    (leaf 6 irb.7)

    gpu5_subnet1 fc00:2:6:1::/64

    fc00:2:6:1::1

    (leaf 6 irb.15)

    gpu5_subnet2 fc00:1:6:2::/64

    fc00:1:6:2::1

    (leaf 6 irb.7)

    gpu5_subnet2 fc00:2:6:2::/64

    fc00:2:6:2::1

    (leaf 6 irb.15)

    gpu5_subnet3 fc00:1:6:3::/64

    fc00:1:6:3::1

    (leaf 6 irb.7)

    gpu5_subnet3 fc00:2:6:3::/64

    fc00:2:6:3::1

    (leaf 6 irb.15)

    gpu5_subnet4 fc00:1:6:4::/64

    fc00:1:6:4::1

    (leaf 6 irb.7)

    gpu5_subnet4 fc00:2:6:4::/64

    fc00:2:6:4::1

    (leaf 6 irb.15)

    gpu6_eth gpu6_subnet1 fc00:1:7:1::/64

    fc00:1:7:1::1

    (leaf 7 irb.8)

    gpu6_subnet1 fc00:2:7:1::/64

    fc00:2:7:1::1

    (leaf 7 irb.16)

    gpu6_subnet2 fc00:1:7:2::/64

    fc00:1:7:2::1

    (leaf 7 irb.8)

    gpu6_subnet2 fc00:2:7:2::/64

    fc00:2:7:2::1

    (leaf 7 irb.16)

    gpu6_subnet3 fc00:1:7:3::/64

    fc00:1:7:3::1

    (leaf 7 irb.8)

    gpu6_subnet3 fc00:2:7:3::/64

    fc00:2:7:3::1

    (leaf 7 irb.16)

    gpu6_subnet4 fc00:1:7:4::/64

    fc00:1:7:4::1

    (leaf 7 irb.8)

    gpu6_subnet4 fc00:2:7:4::/64

    fc00:2:7:4::1

    (leaf 7 irb.16)

    gpu7_eth gpu7_subnet1 fc00:1:8:1::/64

    fc00:1:8:1::1

    (leaf 8 irb.9)

    gpu7_subnet1 fc00:2:8:1::/64

    fc00:2:8:1::1

    (leaf 8 irb.17)

    gpu7_subnet2 fc00:1:8:2::/64

    fc00:1:8:2::1

    (leaf 8 irb.9)

    gpu7_subnet2 fc00:2:8:2::/64

    fc00:2:8:2::1

    (leaf 8 irb.17)

    gpu7_subnet3 fc00:1:8:3::/64

    fc00:1:8:3::1

    (leaf 8 irb.9)

    gpu7_subnet3 fc00:2:8:3::/64

    fc00:2:8:3::1

    (leaf 8 irb.17)

    gpu7_subnet4 fc00:1:8:4::/64

    fc00:1:8:4::1

    (leaf 8 irb.9)

    gpu7_subnet4 fc00:2:8:4::/64

    fc00:2:8:4::1

    (leaf 8 irb.17)

    Stripe1: user@H100-01:/$ sudo ip -6 route add fc00:1:1:1::/64 dev gpu0_eth table gpu0_subnet1

    Stripe2 (on a stripe 2 server): sudo ip -6 route add fc00:2:1:1::/64 dev gpu0_eth table gpu0_subnet1

  4. Verify route creation.
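    For example, for one of the tables:

    ip -6 route show table gpu0_subnet1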

    If you need to remove all routes you can run:

    EXAMPLE:

  5. Add routing policy rules.

    Configure a routing policy rule for each routing table using the following command, as shown in the example.

    Table 10: Table and Prefix for Each Policy Rule
      Stripe 1 Stripe 2
    INTERFACE TABLE PREFIX TABLE PREFIX
    gpu0_eth gpu0_subnet1 fc00:1:1:1::/64 gpu0_subnet1 fc00:2:1:1::/64
    gpu0_subnet2 fc00:1:1:2::/64 gpu0_subnet2 fc00:2:1:2::/64
    gpu0_subnet3 fc00:1:1:3::/64 gpu0_subnet3 fc00:2:1:3::/64
    gpu0_subnet4 fc00:1:1:4::/64 gpu0_subnet4 fc00:2:1:4::/64
    gpu1_eth gpu1_subnet1 fc00:1:2:1::/64 gpu1_subnet1 fc00:2:2:1::/64
    gpu1_subnet2 fc00:1:2:2::/64 gpu1_subnet2 fc00:2:2:2::/64
    gpu1_subnet3 fc00:1:2:3::/64 gpu1_subnet3 fc00:2:2:3::/64
    gpu1_subnet4 fc00:1:2:4::/64 gpu1_subnet4 fc00:2:2:4::/64
    gpu2_eth gpu2_subnet1 fc00:1:3:1::/64 gpu2_subnet1 fc00:2:3:1::/64
    gpu2_subnet2 fc00:1:3:2::/64 gpu2_subnet2 fc00:2:3:2::/64
    gpu2_subnet3 fc00:1:3:3::/64 gpu2_subnet3 fc00:2:3:3::/64
    gpu2_subnet4 fc00:1:3:4::/64 gpu2_subnet4 fc00:2:3:4::/64
    gpu3_eth gpu3_subnet1 fc00:1:4:1::/64 gpu3_subnet1 fc00:2:4:1::/64
    gpu3_subnet2 fc00:1:4:2::/64 gpu3_subnet2 fc00:2:4:2::/64
    gpu3_subnet3 fc00:1:4:3::/64 gpu3_subnet3 fc00:2:4:3::/64
    gpu3_subnet4 fc00:1:4:4::/64 gpu3_subnet4 fc00:2:4:4::/64
    gpu4_eth gpu4_subnet1 fc00:1:5:1::/64 gpu4_subnet1 fc00:2:5:1::/64
    gpu4_subnet2 fc00:1:5:2::/64 gpu4_subnet2 fc00:2:5:2::/64
    gpu4_subnet3 fc00:1:5:3::/64 gpu4_subnet3 fc00:2:5:3::/64
    gpu4_subnet4 fc00:1:5:4::/64 gpu4_subnet4 fc00:2:5:4::/64
    gpu5_eth gpu5_subnet1 fc00:1:6:1::/64 gpu5_subnet1 fc00:2:6:1::/64
    gpu5_subnet2 fc00:1:6:2::/64 gpu5_subnet2 fc00:2:6:2::/64
    gpu5_subnet3 fc00:1:6:3::/64 gpu5_subnet3 fc00:2:6:3::/64
    gpu5_subnet4 fc00:1:6:4::/64 gpu5_subnet4 fc00:2:6:4::/64
    gpu6_eth gpu6_subnet1 fc00:1:7:1::/64 gpu6_subnet1 fc00:2:7:1::/64
    gpu6_subnet2 fc00:1:7:2::/64 gpu6_subnet2 fc00:2:7:2::/64
    gpu6_subnet3 fc00:1:7:3::/64 gpu6_subnet3 fc00:2:7:3::/64
    gpu6_subnet4 fc00:1:7:4::/64 gpu6_subnet4 fc00:2:7:4::/64
    gpu7_eth gpu7_subnet1 fc00:1:8:1::/64 gpu7_subnet1 fc00:2:8:1::/64
    gpu7_subnet2 fc00:1:8:2::/64 gpu7_subnet2 fc00:2:8:2::/64
    gpu7_subnet3 fc00:1:8:3::/64 gpu7_subnet3 fc00:2:8:3::/64
    gpu7_subnet4 fc00:1:8:4::/64 gpu7_subnet4 fc00:2:8:4::/64

    EXAMPLE:
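    A sketch of the rules for gpu0_eth on a stripe 1 server (whether each rule matches on the source or the destination prefix depends on how the plugin steers traffic; a source-based form is shown here as an assumption):

    sudo ip -6 rule add from fc00:1:1:1::/64 table gpu0_subnet1
    sudo ip -6 rule add from fc00:1:1:2::/64 table gpu0_subnet2
    sudo ip -6 rule add from fc00:1:1:3::/64 table gpu0_subnet3
    sudo ip -6 rule add from fc00:1:1:4::/64 table gpu0_subnet4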

  6. Verify rule creation:
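    For example:

    ip -6 rule show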

If you need to remove all rules you can run:

EXAMPLE:

Running the NCCL Job

The following NCCL variables are required to run an NCCL test while mapping QPs to IPv6 addresses:

NCCL_NET_PLUGIN=juniper-ib
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/
NCCL_IB_QPS_PER_CONNECTION=4
NCCL_IB_SPLIT_DATA_ON_QPS=1
NCCL_IB_ADDR_FAMILY=AF_INET6
NCCL_SOCKET_FAMILY=AF_INET6
NCCL_SOCKET_NTHREADS=8
UCX_IB_GID_INDEX=3
UCX_NET_DEVICES=mlx5_0:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_9:1,mlx5_10:1,mlx5_11:1
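For example, a sketch of how these variables might be passed to an nccl-tests run with Open MPI (the hostfile, process counts, and the all_reduce_perf path are illustrative):

mpirun -np 32 -N 8 --hostfile hostfile \
  -x NCCL_NET_PLUGIN=juniper-ib \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/ \
  -x NCCL_IB_QPS_PER_CONNECTION=4 \
  -x NCCL_IB_SPLIT_DATA_ON_QPS=1 \
  -x NCCL_IB_ADDR_FAMILY=AF_INET6 \
  -x NCCL_SOCKET_FAMILY=AF_INET6 \
  -x NCCL_SOCKET_NTHREADS=8 \
  -x UCX_IB_GID_INDEX=3 \
  -x UCX_NET_DEVICES=mlx5_0:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_9:1,mlx5_10:1,mlx5_11:1 \
  ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1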
Table 11: NCCL Variables Description
VARIABLE DESCRIPTION VALUES ACCEPTED DEFAULT

NCCL_NET_PLUGIN

(since 2.11)

Set it to either a suffix string or to a library name to choose among multiple NCCL net plugins. This setting will cause NCCL to look for the net plugin library using the following strategy:

  • If NCCL_NET_PLUGIN is set, attempt loading the library with name specified by NCCL_NET_PLUGIN;
  • If NCCL_NET_PLUGIN is set and previous failed, attempt loading libnccl-net-<NCCL_NET_PLUGIN>.so;
  • If NCCL_NET_PLUGIN is not set, attempt loading libnccl-net.so;
  • If no plugin was found (neither user defined nor default), use internal network plugin.

For example, setting NCCL_NET_PLUGIN=foo will cause NCCL to try to load foo and, if foo cannot be found, libnccl-net-foo.so (provided that it exists on the system).

Plugin suffix, plugin file name, or “none”.  
LD_LIBRARY_PATH Points to the directory containing the Juniper NCCL net plugin shared object    

NCCL_IB_QPS_PER_CONNECTION

(since 2.10)

Number of IB queue pairs to use for each connection between two ranks. This can be useful on multi-level fabrics which need multiple queue pairs to have good routing entropy.

See NCCL_IB_SPLIT_DATA_ON_QPS for different ways to split data on multiple QPs, as it can affect performance.

1 - 128 1

NCCL_IB_SPLIT_DATA_ON_QPS

(since 2.18)

This parameter controls how the queue pairs are used when more than one is created. Set to 1 (split mode), each message is split evenly across the queue pairs; this may cause a visible latency degradation if many QPs are used. Set to 0 (round-robin mode), queue pairs are used in round-robin fashion for each message sent; operations that do not send multiple messages will not use all QPs.

0 or 1

0 (1 for 2.18 and 2.19)

NCCL_IB_ADDR_FAMILY Specifies address family used for InfiniBand. AF_INET (IPv4) or AF_INET6 (IPv6) AF_INET
NCCL_SOCKET_FAMILY Specifies address family used for sockets. Should match the address type of the fabric. AF_INET or AF_INET6 AF_UNSPEC (fallback logic)
NCCL_SOCKET_NTHREADS Number of threads per socket used by NCCL. Can improve performance with multiple network interfaces. Integer (commonly set between 1–16 depending on CPU/network load) 1
UCX_IB_GID_INDEX Specifies the GID index for InfiniBand device. Needed to select the correct GID (e.g., global IPv6 GID). Integer, typically 3 for RoCEv2 over IPv6 (but depends on NIC config) NIC-dependent
UCX_NET_DEVICES Specifies the list of UCX-enabled network devices to use. Needed to pin traffic to selected NIC ports. Comma-separated list like mlx5_0:1,mlx5_3:1,... All available devices

Reference: Environment Variables — NCCL 2.27.5 documentation

Note:

Check Appendix A – How to run NCCL test using autoconfigured IPv6 address to determine the value of UCX_IB_GID_INDEX. Make sure the selected address is not the link local IPv6 address.

Increasing the number of QPs can improve job performance, but only up to a point. Beyond that, performance remains the same or even starts to drop due to internal processing limits within the GPU servers (NIC cache constraints, scheduling overhead, cache contention, and how completion queues are managed); these limits are not caused by the network fabric or the traffic balancing mechanisms.

As a rule of thumb, configure the number of queue pairs per connection to be equal to the number of uplinks (leaf-to-spine links) for optimal performance. Increasing the number of queue pairs beyond that generally does not provide any benefit.

As an example, consider the performance results for NCCL tests with varying numbers of queue pairs presented in Table 12. The average bus bandwidth for the all-reduce tests improves as the number of QPs increases and reaches its highest value when the number of queue pairs equals the number of uplinks. Increasing the number beyond that point does not provide any benefit and can even result in a drastic decrease in performance.

Table 12: Average Bus Bandwidth with NCCL All-Reduce, and All-to-All NCCL Tests with Varying Numbers of Queue Pairs
Number of Uplinks Number of QPs Average Bus Bandwidth [Gbps]
  all-reduce all-to-all
4 1 189.067 17.785
4 2 379.499 31.666
4 4 386.753 43.552
4 8 386.734 28.130
4 16 383.412 13.537
4 32 381.354 6.294
8 1 66.488 15.835
8 2 201.376 26.355
8 4 364.614 43.284
8 8 386.739 28.404
8 16 383.396 13.662
8 32 381.060 6.338
Note:

These tests were completed with 4 nodes, 8 GPUs per node, NVIDIA H100 servers with ConnectX-7 NICs, and NCCL version 2.23.4+cuda12.6.