Juniper RDMA-aware Load Balancing (LB) and BGP-DPF – GPU Backend Fabric Implementation

This section outlines the configuration details to implement Juniper RDMA-aware Load Balancing (LB) and BGP-DPF. All configuration and verification examples in this section are based on the following example:

Figure 1: Implementation Example

GPU Server to Leaf Nodes Connections Using IPv6 SLAAC (Stateless Address Autoconfiguration)

The GPU servers are connected following a rail-aligned architecture, as described in the Backend GPU Rail Optimized Stripe Architecture section, where GPU 0 on all servers is connected to the first leaf node, GPU 1 on all servers is connected to the second leaf node, and so on. This is shown in Figure 2.

Figure 2: GPU Servers to Leaf Nodes Rail-Aligned Connectivity

Connectivity between the servers and the leaf nodes is Layer 2 VLAN-based, with an IRB interface on the leaf nodes acting as the default gateway for the servers.

Figure 3: IRB Interface Example

The physical interfaces connecting the servers and the leaf nodes are configured with family ethernet-switching and are mapped to a VLAN whose associated IRB is configured as the l3-interface.

Example:

The following example shows the configuration of the connection between stripe1-leaf1 and the gpu0_eth interfaces on Server 1 and Server 2. The irb.2 interface on the switch is configured with four /64 IPv6 addresses:

  • fc00:1:1:1::1/64,
  • fc00:1:1:2::1/64,
  • fc00:1:1:3::1/64, and
  • fc00:1:1:4::1/64.

These prefixes are advertised to both Server 1 and Server 2.
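A minimal configuration sketch consistent with this example (the VLAN name and VLAN ID are illustrative; the interface names for Server 1 and Server 2 follow Table 4):

set interfaces et-0/0/16:0 unit 0 family ethernet-switching vlan members gpu0_vlan
set interfaces et-0/0/16:1 unit 0 family ethernet-switching vlan members gpu0_vlan
set vlans gpu0_vlan vlan-id 2
set vlans gpu0_vlan l3-interface irb.2
set interfaces irb unit 2 family inet6 address fc00:1:1:1::1/64
set interfaces irb unit 2 family inet6 address fc00:1:1:2::1/64
set interfaces irb unit 2 family inet6 address fc00:1:1:3::1/64
set interfaces irb unit 2 family inet6 address fc00:1:1:4::1/64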

You can verify that the IPv6 addresses have been correctly assigned to the irb.2 interface and that it is associated with the proper VLAN using the following commands:
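For example (VLAN name as in the sketch above):

show interfaces irb.2 terse
show vlans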

Server SLAAC Configuration:

To configure the interfaces on the NVIDIA GPU servers, follow the steps in NVIDIA Configuration | Juniper Networks. You will need to ensure that the Netplan configuration includes statements to disable DHCPv6.

The interfaces on the servers do not need to be configured with any IPv6 address or have IPv6 explicitly enabled. Disabling DHCPv6 is enough.

Example:

The following are Netplan examples. You can use these examples as templates to configure the interfaces on all the servers.
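A minimal sketch of such a Netplan file (the file name and the set of interfaces are illustrative; repeat the same stanza for each gpu#_eth interface):

# /etc/netplan/01-gpu-fabric.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    gpu0_eth:
      dhcp4: false
      dhcp6: false
      accept-ra: true
    gpu1_eth:
      dhcp4: false
      dhcp6: false
      accept-ra: true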

Note:

Make sure IPv4 is not enabled on the gpu#_eth interfaces.

The servers must also be configured to accept and process Router Advertisement (RA) messages for IPv6 address autoconfiguration to work. In most cases, this is enabled by default, but the steps to configure it are described here:

The configuration has two layers:

  1. Interface-level RA policy in Netplan or systemd
  2. Kernel-level sysctl parameters (accept_ra, autoconf)

Both must align to ensure proper RA behavior.

  • If the system uses Netplan with systemd-networkd (common on Ubuntu Server):

In the Netplan YAML file (e.g., /etc/netplan/01-netcfg.yaml), add the following under each interface:
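For example (added under the interface's stanza):

      accept-ra: true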

Then apply the changes:
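sudo netplan apply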

This ensures that Netplan renders a .network file for systemd-networkd with IPv6AcceptRA=yes, which enables RA-based autoconfiguration.

However, this alone is not enough if the kernel is still configured to ignore RAs. You must also verify that the kernel is set to accept RAs at runtime. You can check using:
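sysctl net.ipv6.conf.<interface>.accept_ra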

If the value is 0, RAs will be ignored regardless of Netplan settings. This can be temporarily corrected with:

sudo sysctl -w net.ipv6.conf.<interface>.accept_ra=1

To make it persistent across reboots, add the following to a sysctl configuration file (e.g., /etc/sysctl.d/99-accept-ra.conf):
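For example, replacing <interface> with the actual interface name:

net.ipv6.conf.<interface>.accept_ra = 1
net.ipv6.conf.<interface>.autoconf = 1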

And apply it with:
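sudo sysctl --system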

Note:

Parameters such as accept_ra can be enabled or disabled globally or on a per-interface basis.

Table 1: Parameters in IPv6 Configuration
SYSCTL SCOPE EFFECT
net.ipv6.conf.all.accept_ra Global (all current interfaces) Applies immediately to all existing interfaces. Note that RAs are ignored when IPv6 forwarding is enabled unless accept_ra is set to 2.
net.ipv6.conf.default.accept_ra Global (for future interfaces) Sets the default value used when a new interface comes up (e.g., plugged in or created later)
net.ipv6.conf.gpu0_eth.accept_ra Per-interface Controls RA processing for a specific active interface
  • If the interface is managed directly by the kernel (not using Netplan/systemd):

Enable RA acceptance and autoconfiguration by setting:
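sudo sysctl -w net.ipv6.conf.<interface>.accept_ra=1
sudo sysctl -w net.ipv6.conf.<interface>.autoconf=1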

Leaf node SLAAC Configuration

To enable SLAAC, the Leaf nodes must be configured with IPv6 addresses on the GPU server facing interfaces.

Example:
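For example, on stripe1-leaf1 (irb.2, addresses from Table 2):

set interfaces irb unit 2 family inet6 address fc00:1:1:1::1/64
set interfaces irb unit 2 family inet6 address fc00:1:1:2::1/64
set interfaces irb unit 2 family inet6 address fc00:1:1:3::1/64
set interfaces irb unit 2 family inet6 address fc00:1:1:4::1/64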

After configuring the IPv6 addresses, you must enable the advertisement of all prefixes under protocols router-advertisement, as shown in the example:

Note:

Configuring router advertisements for a given prefix requires an IPv6 address within that same prefix to be configured on the interface where the router advertisements are configured. An error is returned when committing the configuration if a prefix configured under router advertisements is not also configured under the interface.

Example:
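A minimal sketch for stripe1-leaf1 (irb.2), advertising the prefixes configured above:

set protocols router-advertisement interface irb.2 prefix fc00:1:1:1::/64
set protocols router-advertisement interface irb.2 prefix fc00:1:1:2::/64
set protocols router-advertisement interface irb.2 prefix fc00:1:1:3::/64
set protocols router-advertisement interface irb.2 prefix fc00:1:1:4::/64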

The following table summarizes the prefixes and IP addresses on each irb interface, on each leaf node:

Table 2: IPV6 Address Assignments
  IRB VLAN SUBNET 1 IRB IP ADDRESS 1 SUBNET 2 IRB IP ADDRESS 2 SUBNET 3 IRB IP ADDRESS 3 SUBNET 4 IRB IP ADDRESS 4 GPU INTERFACE
STRIPE 1 LEAF 1 1 2 fc00:1:1:1::/64 fc00:1:1:1::1 fc00:1:1:2::/64 fc00:1:1:2::1 fc00:1:1:3::/64 fc00:1:1:3::1 fc00:1:1:4::/64 fc00:1:1:4::1 gpu0_eth
STRIPE 1 LEAF 2 2 3 fc00:1:2:1::/64 fc00:1:2:1::1 fc00:1:2:2::/64 fc00:1:2:2::1 fc00:1:2:3::/64 fc00:1:2:3::1 fc00:1:2:4::/64 fc00:1:2:4::1 gpu1_eth
STRIPE 1 LEAF 3 3 4 fc00:1:3:1::/64 fc00:1:3:1::1 fc00:1:3:2::/64 fc00:1:3:2::1 fc00:1:3:3::/64 fc00:1:3:3::1 fc00:1:3:4::/64 fc00:1:3:4::1 gpu2_eth
STRIPE 1 LEAF 4 4 5 fc00:1:4:1::/64 fc00:1:4:1::1 fc00:1:4:2::/64 fc00:1:4:2::1 fc00:1:4:3::/64 fc00:1:4:3::1 fc00:1:4:4::/64 fc00:1:4:4::1 gpu3_eth
STRIPE 1 LEAF 5 5 6 fc00:1:5:1::/64 fc00:1:5:1::1 fc00:1:5:2::/64 fc00:1:5:2::1 fc00:1:5:3::/64 fc00:1:5:3::1 fc00:1:5:4::/64 fc00:1:5:4::1 gpu4_eth
STRIPE 1 LEAF 6 6 7 fc00:1:6:1::/64 fc00:1:6:1::1 fc00:1:6:2::/64 fc00:1:6:2::1 fc00:1:6:3::/64 fc00:1:6:3::1 fc00:1:6:4::/64 fc00:1:6:4::1 gpu5_eth
STRIPE 1 LEAF 7 7 8 fc00:1:7:1::/64 fc00:1:7:1::1 fc00:1:7:2::/64 fc00:1:7:2::1 fc00:1:7:3::/64 fc00:1:7:3::1 fc00:1:7:4::/64 fc00:1:7:4::1 gpu6_eth
STRIPE 1 LEAF 8 8 9 fc00:1:8:1::/64 fc00:1:8:1::1 fc00:1:8:2::/64 fc00:1:8:2::1 fc00:1:8:3::/64 fc00:1:8:3::1 fc00:1:8:4::/64 fc00:1:8:4::1 gpu7_eth
STRIPE 2 LEAF 1 9 10 fc00:2:1:1::/64 fc00:2:1:1::1 fc00:2:1:2::/64 fc00:2:1:2::1 fc00:2:1:3::/64 fc00:2:1:3::1 fc00:2:1:4::/64 fc00:2:1:4::1 gpu0_eth
STRIPE 2 LEAF 2 10 11 fc00:2:2:1::/64 fc00:2:2:1::1 fc00:2:2:2::/64 fc00:2:2:2::1 fc00:2:2:3::/64 fc00:2:2:3::1 fc00:2:2:4::/64 fc00:2:2:4::1 gpu1_eth
STRIPE 2 LEAF 3 11 12 fc00:2:3:1::/64 fc00:2:3:1::1 fc00:2:3:2::/64 fc00:2:3:2::1 fc00:2:3:3::/64 fc00:2:3:3::1 fc00:2:3:4::/64 fc00:2:3:4::1 gpu2_eth
STRIPE 2 LEAF 4 12 13 fc00:2:4:1::/64 fc00:2:4:1::1 fc00:2:4:2::/64 fc00:2:4:2::1 fc00:2:4:3::/64 fc00:2:4:3::1 fc00:2:4:4::/64 fc00:2:4:4::1 gpu3_eth
STRIPE 2 LEAF 5 13 14 fc00:2:5:1::/64 fc00:2:5:1::1 fc00:2:5:2::/64 fc00:2:5:2::1 fc00:2:5:3::/64 fc00:2:5:3::1 fc00:2:5:4::/64 fc00:2:5:4::1 gpu4_eth
STRIPE 2 LEAF 6 14 15 fc00:2:6:1::/64 fc00:2:6:1::1 fc00:2:6:2::/64 fc00:2:6:2::1 fc00:2:6:3::/64 fc00:2:6:3::1 fc00:2:6:4::/64 fc00:2:6:4::1 gpu5_eth
STRIPE 2 LEAF 7 15 16 fc00:2:7:1::/64 fc00:2:7:1::1 fc00:2:7:2::/64 fc00:2:7:2::1 fc00:2:7:3::/64 fc00:2:7:3::1 fc00:2:7:4::/64 fc00:2:7:4::1 gpu6_eth
STRIPE 2 LEAF 8 16 17 fc00:2:8:1::/64 fc00:2:8:1::1 fc00:2:8:2::/64 fc00:2:8:2::1 fc00:2:8:3::/64 fc00:2:8:3::1 fc00:2:8:4::/64 fc00:2:8:4::1 gpu7_eth

SLAAC Verification:

To verify that RA-based configuration works and that the GPU interface has autoconfigured its IPv6 addresses and installed the corresponding routes, use the following commands:
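For example (interface name is illustrative):

ip -6 addr show dev gpu0_eth
ip -6 route show dev gpu0_eth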

You should see a global inet6 address in the SLAAC format (prefix::EUI-64) marked as dynamic or mngtmpaddr. You will also see the interface's link-local address (fe80::EUI-64).

Example:

You can also observe incoming RA messages with:
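For example, capturing ICMPv6 Router Advertisements (type 134) with tcpdump (interface name is illustrative):

sudo tcpdump -i gpu0_eth -vv 'icmp6 and ip6[40] == 134'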

Example:

In some cases, especially after changing RA settings or switching between static and dynamic configurations, the interface may need to be reset to trigger address reassignment:
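For example:

sudo ip link set gpu0_eth down
sudo ip link set gpu0_eth up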

After bringing the interface back up, wait a few seconds and re-check the IPv6 address with:
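ip -6 addr show dev gpu0_eth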

This ensures that stale addresses are removed, and fresh RAs are processed.

Note:

All IPv6 settings can be found under /proc/sys/net/ipv6/conf.

To verify that router advertisements are being sent, you can use the following command: show ipv6 router-advertisement interface <interface>.

Example:

You can also capture router advertisement packets on the interface using:
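For example, a sketch using the Junos monitor traffic command on the leaf node (the interface name is illustrative, and the exact match syntax and capture support may vary by platform and release):

monitor traffic interface et-0/0/16:0 no-resolve matching "icmp6"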

Example:

Notice that Router Advertisements are sent from the link-local address of the Leaf node interface to the IPv6 all-nodes multicast address (ff02::1), with next header ICMPv6 (58). The following are the most relevant attributes for these messages:

Table 3: Fields and Semantics in IPv6 Router Advertisement Prefix Information
PARAMETER VALUE DESCRIPTION
Flags onlink, auto onlink: hosts can assume addresses in this prefix are on the local link. auto: this prefix can be used for SLAAC (Stateless Address Autoconfiguration).
Valid Lifetime 2592000 Prefix is valid for 30 days (used for reachability).
Preferred Lifetime 604800 Preferred lifetime of 7 days (after which the address becomes deprecated for new connections).
Router Lifetime 1800s The router is considered a default gateway for 1800 seconds.

After receiving the router advertisement, the server's NIC interfaces autoconfigure their IPv6 addresses by concatenating the prefix advertised by the Leaf node with a host portion calculated using the EUI-64 format (based on the interface's MAC address), as shown in Table 4. For example, MAC address a0:88:c2:3b:50:66 yields the interface identifier a288:c2ff:fe3b:5066: the bytes ff:fe are inserted in the middle of the MAC address and the universal/local bit of the first byte is flipped (a0 becomes a2), so the address in prefix fc00:1:1:1::/64 is fc00:1:1:1:a288:c2ff:fe3b:5066.

Table 4: GPU to Leaf Nodes IPv6 Addresses
LEAF NODE INTERFACE LEAF NODE IPv6 ADDRESSES GPU NIC GPU NIC MAC ADDRESS GPU NIC IPv6 ADDRESSES

Stripe 1 Leaf 1 - et-0/0/16:0
fc00:1:1:1::1, fc00:1:1:2::1, fc00:1:1:3::1, fc00:1:1:4::1
Server 1 - gpu0_eth a0:88:c2:3b:50:66
fc00:1:1:1:a288:c2ff:fe3b:5066, fc00:1:1:2:a288:c2ff:fe3b:5066, fc00:1:1:3:a288:c2ff:fe3b:5066, fc00:1:1:4:a288:c2ff:fe3b:5066

Stripe 1 Leaf 1 - et-0/0/16:1
fc00:1:1:1::1, fc00:1:1:2::1, fc00:1:1:3::1, fc00:1:1:4::1
Server 2 - gpu0_eth 58:a2:e1:46:c6:ca
fc00:1:1:1:5aa2:e1ff:fe46:c6ca, fc00:1:1:2:5aa2:e1ff:fe46:c6ca, fc00:1:1:3:5aa2:e1ff:fe46:c6ca, fc00:1:1:4:5aa2:e1ff:fe46:c6ca

Stripe 1 Leaf 1 - et-0/0/17:0
fc00:1:1:1::1, fc00:1:1:2::1, fc00:1:1:3::1, fc00:1:1:4::1
Server 3 - gpu0_eth a0:88:c2:3b:50:6e
fc00:1:1:1:a288:c2ff:fe3b:506e, fc00:1:1:2:a288:c2ff:fe3b:506e, fc00:1:1:3:a288:c2ff:fe3b:506e, fc00:1:1:4:a288:c2ff:fe3b:506e

...

BGP DPF (Deterministic Path Forwarding) Using IPv6 Neighbor Discovery

This section describes how to configure BGP to establish peering sessions between the leaf and spine nodes automatically, using IPv6 neighbor discovery, and how to enable Deterministic Path Forwarding. It also covers other related BGP configuration parameters.

BGP Auto-discovery

Because each connection between a leaf and spine node must be mapped to a different fabric color, and BGP is configured for auto-discovery, a separate BGP group per neighbor is required. This ensures that each dynamically discovered peer is associated with the correct color.

Each BGP group is configured as an external BGP session and assigned a local AS number:

Example:
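A minimal sketch (group names and AS numbers are illustrative):

set protocols bgp group SPINE-1 type external
set protocols bgp group SPINE-1 local-as 65101
set protocols bgp group SPINE-2 type external
set protocols bgp group SPINE-2 local-as 65101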


To enable BGP peer auto-discovery, each group includes a dynamic-neighbor template, which specifies the interface(s) where discovery is permitted. This replaces the traditional neighbor a.b.c.d configuration used for static peers. Auto-discovery is enabled using the peer-auto-discovery and ipv6-nd options:

Example:
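A sketch assembled from the statements named above (group, template, and interface names are illustrative, and the exact hierarchy may vary by Junos release):

set protocols bgp group SPINE-1 dynamic-neighbor SPINE-1-DYN peer-auto-discovery family inet6 ipv6-nd
set protocols bgp group SPINE-1 dynamic-neighbor SPINE-1-DYN peer-auto-discovery interface et-0/0/0:0.0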


This allows Junos to determine neighbor addresses dynamically using IPv6 Neighbor Discovery.

To secure peer formation, each group references a defined AS range using the peer-as-list statement. This ensures that only neighbors with matching AS numbers can establish BGP sessions. The list itself is defined under policy-options:

Auto-discovery must be enabled on both the leaf and spine nodes.

Example:
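A sketch (the AS-list name and range are illustrative):

set policy-options as-list SPINE-AS members 65200-65299
set protocols bgp group SPINE-1 peer-as-list SPINE-AS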


Each group is also assigned a fabric color, which is defined as a BGP community under policy-options:

where: <community-name> = <color>
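A sketch (the community values and the mapping of colors to values are illustrative; the document only confirms that green corresponds to color:0:1):

set policy-options community green members color:0:1
set policy-options community blue members color:0:2
set policy-options community red members color:0:3
set policy-options community orange members color:0:4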


IPv6 NLRI support

The advertisement of IPv6 routes must be enabled on both leaf and spine nodes using:
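For example (group name as above):

set protocols bgp group SPINE-1 family inet6 unicast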

ECMP multipath

BGP multipath must be configured to allow Equal-Cost Multipath (ECMP) routing over the protected paths, ensuring both load balancing and failover across multiple spine uplinks. This is achieved using:
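For example:

set protocols bgp group SPINE-1 multipath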

BGP multipath must be configured on both leaf and spine nodes.

BFD (Bidirectional Forwarding Detection)

BFD is configured to improve failure detection time using:
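For example, using the recommended values from Table 6:

set protocols bgp group SPINE-1 bfd-liveness-detection minimum-interval 1000
set protocols bgp group SPINE-1 bfd-liveness-detection multiplier 3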

Table 6: BFD Options
Options Description

minimum-interval

(required)

Minimum time (in milliseconds) between BFD hello packets sent by the local device and expected from the neighbor.

Range: 1-255,000

Recommended value: 1000

multiplier

(Optional)

number of hello packets not received by a neighbor that causes the originating interface to be declared down.

Range: 1-255

Recommended: 3 (default)

BFD must be configured on both the leaf and spine nodes.

DPF (Deterministic Path Forwarding)

The leaf nodes must be configured to advertise the same individual /64 IPv6 prefixes advertised to the GPU servers via Router Advertisements (RAs), as described in the GPU Server to Leaf Nodes Connections Using IPv6 SLAAC section.

Prefixes are advertised using the fabric-advertise statement:
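The general form is sketched below (the placement under protocols bgp and the exact option syntax are assumptions; color and backup-color refer to the communities defined earlier):

set protocols bgp fabric-advertise <prefix> color <color>
set protocols bgp fabric-advertise <prefix> backup-color <backup-color>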

Note:

The fabric-advertise commands are only needed on the leaf nodes.

The color option specifies that the prefix should be advertised to BGP peers matching <color>. The prefix is advertised tagged with the community value associated with <color> and with the AIGP attribute set to 0. The backup-color option specifies that the prefix should be advertised to BGP peers matching <backup-color>. The advertised prefix is tagged with the community value associated with the peer, but without the AIGP attribute.

Example:

Figure 5: Prefix fc00:1:1:1::/64 Advertisements

The following commands assign fabric colors to each BGP peer (i.e., each SPINE node when configured on a leaf node):
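A sketch (group names and the color-to-spine mapping follow the example in the text; placing fabric-color under the group is an assumption):

set protocols bgp group SPINE-1 fabric-color green
set protocols bgp group SPINE-2 fabric-color blue
set protocols bgp group SPINE-3 fabric-color red
set protocols bgp group SPINE-4 fabric-color orange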


To advertise a prefix (fc00:1:1:1::/64) across the preferred path to reach that prefix, and assign the color green to the prefix, the following commands are used:
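For example (hierarchy as sketched above):

set protocols bgp fabric-advertise fc00:1:1:1::/64 color green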

Note:

Make sure the correct prefixes are configured: no commit error is generated for an incorrect prefix, but the prefix will not be advertised.

This causes prefix fc00:1:1:1::/64 to be advertised to Spine 1. Because the color assigned to the prefix (green) matches the color of the peer (spine 1), the route is advertised with community color:0:1 (green) and includes the AIGP attribute, marking it as the preferred path.

To advertise the same prefix across all other paths (backup path), the following commands are used:
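For example (hierarchy as sketched above; whether backup colors are listed individually or expressed differently may vary):

set protocols bgp fabric-advertise fc00:1:1:1::/64 backup-color blue
set protocols bgp fabric-advertise fc00:1:1:1::/64 backup-color red
set protocols bgp fabric-advertise fc00:1:1:1::/64 backup-color orange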

This causes the prefix to be advertised to all remaining spines, each tagged with its corresponding color community (blue, red, orange). Because the color assigned to the prefix (green) does not match the color of the peers (spines 2, 3 and 4), these advertisements do not include the AIGP attribute, ensuring they are less preferred.

The following table summarizes the routing advertisement of prefix fc00:1:1:1::/64:

Table: Community and AIGP for Prefix fc00:1:1:1::/64

The previous example showed how a single prefix is advertised to all spine nodes, with only one spine receiving the route with the AIGP attribute to indicate the preferred path.

The table below expands on this by summarizing all four prefixes advertised by stripe1-leaf1 to all spine nodes. For each prefix, it shows:

  • The assigned color for the intended path.
  • The BGP community (color:0:X) used for tagging.
  • Whether the AIGP attribute is included (for preferred path only).

Table: Community and AIGP Per Spine, Per Prefix Example


Example:


Note:

When neighbors are configured statically, the fabric-color can be configured directly under the neighbor statement, and therefore it is possible to have a single BGP group.

Preventing Route Re-advertisement to Spines

To configure the leaf nodes to advertise only the prefixes associated with the irb interface, and to prevent the advertisement of routes learned from one spine back to the other spine peers, an export policy is applied to all BGP groups. This ensures that the leaf acts as a non-transit node for spine-to-spine traffic and maintains proper routing in the fabric.

This policy matches any route matching prefixes fc00::/16 and fd00::/16 with a prefix length of 16 bits or longer, and rejects it during export. Any additional IPv6 prefix to be advertised can be added to the prefix-list local.

Without this policy, a leaf node would re-advertise prefixes learned from one spine node to all the others, which could lead to unwanted routing behavior, and inefficient traffic distribution across the different paths.

Additionally, routes are tagged with community local, which the spine nodes use to prevent advertising prefixes back to the leaf nodes.
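A minimal sketch of such a policy (policy, term, prefix-list, and community names are illustrative, and the community value for local is an assumption; the terms follow the description above):

set policy-options prefix-list local fc00:1:1:1::/64
set policy-options prefix-list local fc00:1:1:2::/64
set policy-options prefix-list local fc00:1:1:3::/64
set policy-options prefix-list local fc00:1:1:4::/64
set policy-options community local members 65000:1
set policy-options policy-statement EXPORT-LOCAL term local-prefixes from prefix-list local
set policy-options policy-statement EXPORT-LOCAL term local-prefixes then community add local
set policy-options policy-statement EXPORT-LOCAL term local-prefixes then accept
set policy-options policy-statement EXPORT-LOCAL term no-transit from route-filter fc00::/16 orlonger
set policy-options policy-statement EXPORT-LOCAL term no-transit from route-filter fd00::/16 orlonger
set policy-options policy-statement EXPORT-LOCAL term no-transit then reject
set protocols bgp group SPINE-1 export EXPORT-LOCAL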

BGP Session Auto-discovery and DPF Verification

You can check that the sessions have been established by using the show bgp summary command:

Notice that when BGP sessions are established using link-local addresses Junos displays the neighbor address along with the interface scope (e.g. fe80::5a86:70ff:fe78:e0d5%et-0/0/1:0.0). The scope identifier (the part after the %) is necessary because the same link-local address (fe80::/10) could exist on multiple interfaces. The device must know which interface to use to send packets to that neighbor. Thus, after peer discovery is completed, the show bgp summary output lists the neighbor using the format: IPv6_link-local_address%interface-name

You can quickly check the status of the discovered neighbors using show bgp summary autodiscovered, as shown in the example below:

You can verify the status of a specific neighbor based on its fabric-color:

To verify that the prefixes are advertised correctly to all peers use show route advertising-protocol bgp <peer-address>.

In the example, fe80::9e5a:80ff:feef:a28f%et-0/0/0:0.0 is the address of Spine 1.

Thus, all prefixes are advertised with community color:0:1 (green). fc00:1:1:1::/64 is also advertised with the AIGP attribute as it is also configured with the same color. The same prefix is also advertised to the other spines but without the AIGP value. fe80::5a86:70ff:fe78:e0d5%et-0/0/1:0.0 is the address of Spine 2.

fe80::5a86:70ff:fe7b:ced5%et-0/0/2:0.0 is the address of Spine 3.

fe80::5a86:70ff:fe79:3d5%et-0/0/3:0.0 is the address of Spine 4.

Juniper NCCL Plug-in

The Juniper NCCL Net Plugin assigns a unique IPv6 address to each Queue Pair (QP) on every RDMA interface. This enables QP flows to use distinct source and destination addresses, allowing the fabric to forward them along separate paths. To support this behavior, the plugin and its supporting libraries must be installed, and the servers must be configured with additional routing information to ensure proper forwarding of each IPv6 address—both from the server to the leaf node and vice versa.

Installing the plug-in on the servers

The Juniper NCCL net plugin is distributed as a compressed tar-ball (juniper-ib_2.23.4-1.tar.gz) and can be found in …

To install, extract the tar-ball to the root directory:

$ tar -xzvf juniper-ib_0.0.5.tar.gz -C /

Table 7: Key Installation Components
COMPONENT DESCRIPTION/USAGE
/usr/local/lib/libnccl-net-juniper-ib.so The NCCL network plugin shared object
/usr/local/bin/jnpr-fabric-topo-gen Script for generating fabric topology json file
/usr/local/bin/jnpr-AI-LB-dpf-config-gen Script for generating AI-LB DPF configurations
/usr/local/bin/jnpr-nccl-net-setup Tool for configuring GPU Server network settings
/usr/local/bin/gids.py Helper module for finding GIDs
/usr/local/bin/jnpr-rdma-ping Tool for testing RDMA connectivity
/usr/local/bin/jnpr-find-gids Tool for finding GIDs.

Configuring routing parameters on the GPU servers

On the GPU servers, each IPv6 address assigned to an interface must be associated with a separate routing table, containing the appropriate routes to forward traffic through the correct default gateway. Conversely, IP rules must be created to direct incoming traffic destined for each specific IPv6 address to its corresponding routing table.

Therefore, when configuring the servers to operate correctly with the NCCL plugin for RDMA Load Balancing (RLB), you must create a number of routing tables equal to the number of IPv6 addresses assigned per interface (based on the number of uplinks, as previously described) multiplied by the number of NICs.

In the example shown in Figure 6, each interface is assigned four IPv6 addresses, resulting in a total of 32 routing tables. Each routing table must include a default route and a corresponding prefix route. Additionally, an IP rule must be added for each routing table.

Figure 6: Example from Server 1 (H100-01)

The jnpr-nccl-net-setup utility live-run option can be used to automatically create the necessary tables, routes, and IP rules on each server.

The command requires exporting the NCCL socket interface, and the Address family as shown in the example below:

If you need to create the routing tables, routes, and rules manually, follow the steps below on all GPU servers:

  1. Create Routing Tables

    Create the file jnpr_nccl_net.conf under /etc/iproute2/rt_tables.d on each server and add each table id and name per line as shown in the example.

    EXAMPLE:

    Table 8: Routing Tables Example
    STRIPE 1 / STRIPE 2
    INTERFACE ID TABLE
    gpu0_eth 10000 gpu0_subnet1
    10001 gpu0_subnet2
    10002 gpu0_subnet3
    10003 gpu0_subnet4
    gpu1_eth 10004 gpu1_subnet1
    10005 gpu1_subnet2
    10006 gpu1_subnet3
    10007 gpu1_subnet4
    gpu2_eth 10008 gpu2_subnet1
    10009 gpu2_subnet2
    10010 gpu2_subnet3
    10011 gpu2_subnet4
    gpu3_eth 10012 gpu3_subnet1
    10013 gpu3_subnet2
    10014 gpu3_subnet3
    10015 gpu3_subnet4
    gpu4_eth 10016 gpu4_subnet1
    10017 gpu4_subnet2
    10018 gpu4_subnet3
    10019 gpu4_subnet4
    gpu5_eth 10020 gpu5_subnet1
    10021 gpu5_subnet2
    10022 gpu5_subnet3
    10023 gpu5_subnet4
    gpu6_eth 10024 gpu6_subnet1
    10025 gpu6_subnet2
    10026 gpu6_subnet3
    10027 gpu6_subnet4
    gpu7_eth 10028 gpu7_subnet1
    10029 gpu7_subnet2
    10030 gpu7_subnet3
    10031 gpu7_subnet4
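    For example, the first lines of /etc/iproute2/rt_tables.d/jnpr_nccl_net.conf would be:

    10000 gpu0_subnet1
    10001 gpu0_subnet2
    10002 gpu0_subnet3
    10003 gpu0_subnet4
    10004 gpu1_subnet1
    ...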
  2. Verify that the tables were created.
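    For example:

    cat /etc/iproute2/rt_tables.d/jnpr_nccl_net.conf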
  3. Add IPv6 routes on each routing table.

    Configure a default route and prefix route on each routing table using the following commands, as shown in the example.
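    For example, for routing table gpu0_subnet1 on a stripe 1 server (prefix and default gateway taken from Table 9):

    sudo ip -6 route add fc00:1:1:1::/64 dev gpu0_eth table gpu0_subnet1
    sudo ip -6 route add default via fc00:1:1:1::1 dev gpu0_eth table gpu0_subnet1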

    If you need to remove all existing routes from a table before you create the required routes, you can run:
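    For example (repeat for each table; gpu0_subnet1 shown):

    sudo ip -6 route flush table gpu0_subnet1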

    EXAMPLE:

    Table 9: Routes Example with 4 IPv6 Addresses per Interface
      STRIPE 1 STRIPE 2
    <INTERFACE> <TABLE> <PREFIX> <DEFAULT GATEWAY> <TABLE> <PREFIX> <DEFAULT GATEWAY>
    gpu0_eth gpu0_subnet1 fc00:1:1:1::/64

    fc00:1:1:1::1

    (leaf 1 irb.2)

    gpu0_subnet1 fc00:2:1:1::/64

    fc00:2:1:1::1

    (leaf 1 irb.10)

    gpu0_subnet2 fc00:1:1:2::/64

    fc00:1:1:2::1

    (leaf 1 irb.2)

    gpu0_subnet2 fc00:2:1:2::/64

    fc00:2:1:2::1

    (leaf 1 irb.10)

    gpu0_subnet3 fc00:1:1:3::/64

    fc00:1:1:3::1

    (leaf 1 irb.2)

    gpu0_subnet3 fc00:2:1:3::/64

    fc00:2:1:3::1

    (leaf 1 irb.10)

    gpu0_subnet4 fc00:1:1:4::/64

    fc00:1:1:4::1

    (leaf 1 irb.2)

    gpu0_subnet4 fc00:2:1:4::/64

    fc00:2:1:4::1

    (leaf 1 irb.10)

    gpu1_eth gpu1_subnet1 fc00:1:2:1::/64

    fc00:1:2:1::1

    (leaf 2 irb.3)

    gpu1_subnet1 fc00:2:2:1::/64

    fc00:2:2:1::1

    (leaf 2 irb.11)

    gpu1_subnet2 fc00:1:2:2::/64

    fc00:1:2:2::1

    (leaf 2 irb.3)

    gpu1_subnet2 fc00:2:2:2::/64

    fc00:2:2:2::1

    (leaf 2 irb.11)

    gpu1_subnet3 fc00:1:2:3::/64

    fc00:1:2:3::1

    (leaf 2 irb.3)

    gpu1_subnet3 fc00:2:2:3::/64

    fc00:2:2:3::1

    (leaf 2 irb.11)

    gpu1_subnet4 fc00:1:2:4::/64

    fc00:1:2:4::1

    (leaf 2 irb.3)

    gpu1_subnet4 fc00:2:2:4::/64

    fc00:2:2:4::1

    (leaf 2 irb.11)

    gpu2_eth gpu2_subnet1 fc00:1:3:1::/64

    fc00:1:3:1::1

    (leaf 3 irb.4)

    gpu2_subnet1 fc00:2:3:1::/64

    fc00:2:3:1::1

    (leaf 3 irb.12)

    gpu2_subnet2 fc00:1:3:2::/64

    fc00:1:3:2::1

    (leaf 3 irb.4)

    gpu2_subnet2 fc00:2:3:2::/64

    fc00:2:3:2::1

    (leaf 3 irb.12)

    gpu2_subnet3 fc00:1:3:3::/64

    fc00:1:3:3::1

    (leaf 3 irb.4)

    gpu2_subnet3 fc00:2:3:3::/64

    fc00:2:3:3::1

    (leaf 3 irb.12)

    gpu2_subnet4 fc00:1:3:4::/64

    fc00:1:3:4::1

    (leaf 3 irb.4)

    gpu2_subnet4 fc00:2:3:4::/64

    fc00:2:3:4::1

    (leaf 3 irb.12)

    gpu3_eth gpu3_subnet1 fc00:1:4:1::/64

    fc00:1:4:1::1

    (leaf 4 irb.5)

    gpu3_subnet1 fc00:2:4:1::/64

    fc00:2:4:1::1

    (leaf 4 irb.13)

    gpu3_subnet2 fc00:1:4:2::/64

    fc00:1:4:2::1

    (leaf 4 irb.5)

    gpu3_subnet2 fc00:2:4:2::/64

    fc00:2:4:2::1

    (leaf 4 irb.13)

    gpu3_subnet3 fc00:1:4:3::/64

    fc00:1:4:3::1

    (leaf 4 irb.5)

    gpu3_subnet3 fc00:2:4:3::/64

    fc00:2:4:3::1

    (leaf 4 irb.13)

    gpu3_subnet4 fc00:1:4:4::/64

    fc00:1:4:4::1

    (leaf 4 irb.5)

    gpu3_subnet4 fc00:2:4:4::/64

    fc00:2:4:4::1

    (leaf 4 irb.13)

    gpu4_eth gpu4_subnet1 fc00:1:5:1::/64

    fc00:1:5:1::1

    (leaf 5 irb.6)

    gpu4_subnet1 fc00:2:5:1::/64

    fc00:2:5:1::1

    (leaf 5 irb.14)

    gpu4_subnet2 fc00:1:5:2::/64

    fc00:1:5:2::1

    (leaf 5 irb.6)

    gpu4_subnet2 fc00:2:5:2::/64

    fc00:2:5:2::1

    (leaf 5 irb.14)

    gpu4_subnet3 fc00:1:5:3::/64

    fc00:1:5:3::1

    (leaf 5 irb.6)

    gpu4_subnet3 fc00:2:5:3::/64

    fc00:2:5:3::1

    (leaf 5 irb.14)

    gpu4_subnet4 fc00:1:5:4::/64

    fc00:1:5:4::1

    (leaf 5 irb.6)

    gpu4_subnet4 fc00:2:5:4::/64

    fc00:2:5:4::1

    (leaf 5 irb.14)

    gpu5_eth gpu5_subnet1 fc00:1:6:1::/64

    fc00:1:6:1::1

    (leaf 6 irb.7)

    gpu5_subnet1 fc00:2:6:1::/64

    fc00:2:6:1::1

    (leaf 6 irb.15)

    gpu5_subnet2 fc00:1:6:2::/64

    fc00:1:6:2::1

    (leaf 6 irb.7)

    gpu5_subnet2 fc00:2:6:2::/64

    fc00:2:6:2::1

    (leaf 6 irb.15)

    gpu5_subnet3 fc00:1:6:3::/64

    fc00:1:6:3::1

    (leaf 6 irb.7)

    gpu5_subnet3 fc00:2:6:3::/64

    fc00:2:6:3::1

    (leaf 6 irb.15)

    gpu5_subnet4 fc00:1:6:4::/64

    fc00:1:6:4::1

    (leaf 6 irb.7)

    gpu5_subnet4 fc00:2:6:4::/64

    fc00:2:6:4::1

    (leaf 6 irb.15)

    gpu6_eth gpu6_subnet1 fc00:1:7:1::/64

    fc00:1:7:1::1

    (leaf 7 irb.8)

    gpu6_subnet1 fc00:2:7:1::/64

    fc00:2:7:1::1

    (leaf 7 irb.16)

    gpu6_subnet2 fc00:1:7:2::/64

    fc00:1:7:2::1

    (leaf 7 irb.8)

    gpu6_subnet2 fc00:2:7:2::/64

    fc00:2:7:2::1

    (leaf 7 irb.16)

    gpu6_subnet3 fc00:1:7:3::/64

    fc00:1:7:3::1

    (leaf 7 irb.8)

    gpu6_subnet3 fc00:2:7:3::/64

    fc00:2:7:3::1

    (leaf 7 irb.16)

    gpu6_subnet4 fc00:1:7:4::/64

    fc00:1:7:4::1

    (leaf 7 irb.8)

    gpu6_subnet4 fc00:2:7:4::/64

    fc00:2:7:4::1

    (leaf 7 irb.16)

    gpu7_eth gpu7_subnet1 fc00:1:8:1::/64

    fc00:1:8:1::1

    (leaf 8 irb.9)

    gpu7_subnet1 fc00:2:8:1::/64

    fc00:2:8:1::1

    (leaf 8 irb.17)

    gpu7_subnet2 fc00:1:8:2::/64

    fc00:1:8:2::1

    (leaf 8 irb.9)

    gpu7_subnet2 fc00:2:8:2::/64

    fc00:2:8:2::1

    (leaf 8 irb.17)

    gpu7_subnet3 fc00:1:8:3::/64

    fc00:1:8:3::1

    (leaf 8 irb.9)

    gpu7_subnet3 fc00:2:8:3::/64

    fc00:2:8:3::1

    (leaf 8 irb.17)

    gpu7_subnet4 fc00:1:8:4::/64

    fc00:1:8:4::1

    (leaf 8 irb.9)

    gpu7_subnet4 fc00:2:8:4::/64

    fc00:2:8:4::1

    (leaf 8 irb.17)

    Stripe1: user@H100-01:/$ sudo ip -6 route add fc00:1:1:1::/64 dev gpu0_eth table gpu0_subnet1

    Stripe2 (on a stripe 2 server): sudo ip -6 route add fc00:2:1:1::/64 dev gpu0_eth table gpu0_subnet1

  4. Verify route creation.
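    For example, for one of the tables:

    ip -6 route show table gpu0_subnet1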

    If you need to remove all routes you can run:

    EXAMPLE:

  5. Add routing policy rules.

    Configure a routing policy rule for each routing table using the following command, as shown in the example.

    Table 10: Table and Prefix for Each Policy Rule
      Stripe 1 Stripe 2
    INTERFACE TABLE PREFIX TABLE PREFIX
    gpu0_eth gpu0_subnet1 fc00:1:1:1::/64 gpu0_subnet1 fc00:2:1:1::/64
    gpu0_subnet2 fc00:1:1:2::/64 gpu0_subnet2 fc00:2:1:2::/64
    gpu0_subnet3 fc00:1:1:3::/64 gpu0_subnet3 fc00:2:1:3::/64
    gpu0_subnet4 fc00:1:1:4::/64 gpu0_subnet4 fc00:2:1:4::/64
    gpu1_eth gpu1_subnet1 fc00:1:2:1::/64 gpu1_subnet1 fc00:2:2:1::/64
    gpu1_subnet2 fc00:1:2:2::/64 gpu1_subnet2 fc00:2:2:2::/64
    gpu1_subnet3 fc00:1:2:3::/64 gpu1_subnet3 fc00:2:2:3::/64
    gpu1_subnet4 fc00:1:2:4::/64 gpu1_subnet4 fc00:2:2:4::/64
    gpu2_eth gpu2_subnet1 fc00:1:3:1::/64 gpu2_subnet1 fc00:2:3:1::/64
    gpu2_subnet2 fc00:1:3:2::/64 gpu2_subnet2 fc00:2:3:2::/64
    gpu2_subnet3 fc00:1:3:3::/64 gpu2_subnet3 fc00:2:3:3::/64
    gpu2_subnet4 fc00:1:3:4::/64 gpu2_subnet4 fc00:2:3:4::/64
    gpu3_eth gpu3_subnet1 fc00:1:4:1::/64 gpu3_subnet1 fc00:2:4:1::/64
    gpu3_subnet2 fc00:1:4:2::/64 gpu3_subnet2 fc00:2:4:2::/64
    gpu3_subnet3 fc00:1:4:3::/64 gpu3_subnet3 fc00:2:4:3::/64
    gpu3_subnet4 fc00:1:4:4::/64 gpu3_subnet4 fc00:2:4:4::/64
    gpu4_eth gpu4_subnet1 fc00:1:5:1::/64 gpu4_subnet1 fc00:2:5:1::/64
    gpu4_subnet2 fc00:1:5:2::/64 gpu4_subnet2 fc00:2:5:2::/64
    gpu4_subnet3 fc00:1:5:3::/64 gpu4_subnet3 fc00:2:5:3::/64
    gpu4_subnet4 fc00:1:5:4::/64 gpu4_subnet4 fc00:2:5:4::/64
    gpu5_eth gpu5_subnet1 fc00:1:6:1::/64 gpu5_subnet1 fc00:2:6:1::/64
    gpu5_subnet2 fc00:1:6:2::/64 gpu5_subnet2 fc00:2:6:2::/64
    gpu5_subnet3 fc00:1:6:3::/64 gpu5_subnet3 fc00:2:6:3::/64
    gpu5_subnet4 fc00:1:6:4::/64 gpu5_subnet4 fc00:2:6:4::/64
    gpu6_eth gpu6_subnet1 fc00:1:7:1::/64 gpu6_subnet1 fc00:2:7:1::/64
    gpu6_subnet2 fc00:1:7:2::/64 gpu6_subnet2 fc00:2:7:2::/64
    gpu6_subnet3 fc00:1:7:3::/64 gpu6_subnet3 fc00:2:7:3::/64
    gpu6_subnet4 fc00:1:7:4::/64 gpu6_subnet4 fc00:2:7:4::/64
    gpu7_eth gpu7_subnet1 fc00:1:8:1::/64 gpu7_subnet1 fc00:2:8:1::/64
    gpu7_subnet2 fc00:1:8:2::/64 gpu7_subnet2 fc00:2:8:2::/64
    gpu7_subnet3 fc00:1:8:3::/64 gpu7_subnet3 fc00:2:8:3::/64
    gpu7_subnet4 fc00:1:8:4::/64 gpu7_subnet4 fc00:2:8:4::/64

    EXAMPLE:
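    A sketch of the rules for gpu0_eth on a stripe 1 server (whether each rule matches on the source or the destination prefix depends on how the plugin steers traffic; a source-based form is shown here as an assumption):

    sudo ip -6 rule add from fc00:1:1:1::/64 table gpu0_subnet1
    sudo ip -6 rule add from fc00:1:1:2::/64 table gpu0_subnet2
    sudo ip -6 rule add from fc00:1:1:3::/64 table gpu0_subnet3
    sudo ip -6 rule add from fc00:1:1:4::/64 table gpu0_subnet4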

  6. Verify rule creation:
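    For example:

    ip -6 rule show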

If you need to remove all rules you can run:

EXAMPLE:

Running the NCCL Job

The following NCCL variables are required to run an NCCL test while mapping QPs to IPv6 addresses:

NCCL_NET_PLUGIN=juniper-ib
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/
NCCL_IB_QPS_PER_CONNECTION=4
NCCL_IB_SPLIT_DATA_ON_QPS=1
NCCL_IB_ADDR_FAMILY=AF_INET6
NCCL_SOCKET_FAMILY=AF_INET6
NCCL_SOCKET_NTHREADS=8
UCX_IB_GID_INDEX=3
UCX_NET_DEVICES=mlx5_0:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_9:1,mlx5_10:1,mlx5_11:1
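For example, a sketch of how these variables might be passed to an nccl-tests run with Open MPI (the hostfile, process counts, and the all_reduce_perf path are illustrative):

mpirun -np 32 -N 8 --hostfile hostfile \
  -x NCCL_NET_PLUGIN=juniper-ib \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/ \
  -x NCCL_IB_QPS_PER_CONNECTION=4 \
  -x NCCL_IB_SPLIT_DATA_ON_QPS=1 \
  -x NCCL_IB_ADDR_FAMILY=AF_INET6 \
  -x NCCL_SOCKET_FAMILY=AF_INET6 \
  -x NCCL_SOCKET_NTHREADS=8 \
  -x UCX_IB_GID_INDEX=3 \
  -x UCX_NET_DEVICES=mlx5_0:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_9:1,mlx5_10:1,mlx5_11:1 \
  ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1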
Table 11: NCCL Variables Description
VARIABLE DESCRIPTION VALUES ACCEPTED DEFAULT

NCCL_NET_PLUGIN

(since 2.11)

Set it to either a suffix string or to a library name to choose among multiple NCCL net plugins. This setting will cause NCCL to look for the net plugin library using the following strategy:

  • If NCCL_NET_PLUGIN is set, attempt loading the library with name specified by NCCL_NET_PLUGIN;
  • If NCCL_NET_PLUGIN is set and previous failed, attempt loading libnccl-net-<NCCL_NET_PLUGIN>.so;
  • If NCCL_NET_PLUGIN is not set, attempt loading libnccl-net.so;
  • If no plugin was found (neither user defined nor default), use internal network plugin.

For example, setting NCCL_NET_PLUGIN=foo will cause NCCL to try to load foo and, if foo cannot be found, libnccl-net-foo.so (provided that it exists on the system).

Plugin suffix, plugin file name, or “none”.  
LD_LIBRARY_PATH Points to the directory containing the Juniper NCCL net plugin shared object    

NCCL_IB_QPS_PER_CONNECTION

(since 2.10)

Number of IB queue pairs to use for each connection between two ranks. This can be useful on multi-level fabrics which need multiple queue pairs to have good routing entropy.

See NCCL_IB_SPLIT_DATA_ON_QPS for different ways to split data on multiple QPs, as it can affect performance.

1 - 128 1

NCCL_IB_SPLIT_DATA_ON_QPS

(since 2.18)

This parameter controls how the queue pairs are used when more than one is created. Set to 1 (split mode), each message is split evenly across the queue pairs; this may cause a visible latency degradation if many QPs are used. Set to 0 (round-robin mode), queue pairs are used in round-robin fashion for each message sent; operations that do not send multiple messages will not use all QPs.

0 or 1

0 (1 for 2.18 and 2.19)

NCCL_IB_ADDR_FAMILY Specifies address family used for InfiniBand. AF_INET (IPv4) or AF_INET6 (IPv6) AF_INET
NCCL_SOCKET_FAMILY Specifies address family used for sockets. Should match the address type of the fabric. AF_INET or AF_INET6 AF_UNSPEC (fallback logic)
NCCL_SOCKET_NTHREADS Number of threads per socket used by NCCL. Can improve performance with multiple network interfaces. Integer (commonly set between 1–16 depending on CPU/network load) 1
UCX_IB_GID_INDEX Specifies the GID index for InfiniBand device. Needed to select the correct GID (e.g., global IPv6 GID). Integer, typically 3 for RoCEv2 over IPv6 (but depends on NIC config) NIC-dependent
UCX_NET_DEVICES Specifies the list of UCX-enabled network devices to use. Needed to pin traffic to selected NIC ports. Comma-separated list like mlx5_0:1,mlx5_3:1,... All available devices

Reference: Environment Variables — NCCL 2.27.5 documentation

Note:

Check Appendix A – How to run NCCL test using autoconfigured IPv6 address to determine the value of UCX_IB_GID_INDEX. Make sure the selected address is not the link local IPv6 address.

Increasing the number of QPs can improve job performance, but only up to a point. Beyond that, performance remains the same or even starts to drop due to internal processing limits within the GPU servers (NIC cache constraints, scheduling overhead, cache contention, and how completion queues are managed); these limits are not caused by the network fabric or the traffic balancing mechanisms.

As a rule of thumb, configure the number of queue pairs per connection to be equal to the number of uplinks (leaf-to-spine links) for optimal performance. Increasing the number of queue pairs beyond that generally does not provide any benefit.

As an example, consider the performance results for NCCL tests with varying numbers of queue pairs presented in Table 12. The average bus bandwidth for the all-reduce tests improves as the number of QPs increases and reaches its highest value when the number of queue pairs equals the number of uplinks. Increasing the number beyond that point does not provide any benefit and can even result in a drastic decrease in performance.

Table 12: Average Bus Bandwidth with NCCL All-Reduce, and All-to-All NCCL Tests with Varying Numbers of Queue Pairs
Number of Uplinks Number of QPs Average Bus Bandwidth [Gbps]
  all-reduce all-to-all
4 1 189.067 17.785
4 2 379.499 31.666
4 4 386.753 43.552
4 8 386.734 28.130
4 16 383.412 13.537
4 32 381.354 6.294
8 1 66.488 15.835
8 2 201.376 26.355
8 4 364.614 43.284
8 8 386.739 28.404
8 16 383.396 13.662
8 32 381.060 6.338
Note:

These tests were completed with 4 nodes, 8 GPUs per node, NVIDIA H100 servers with ConnectX-7 NICs, and NCCL version 2.23.4+cuda12.6.