Appendix D – How to Run NCCL Tests Using Autoconfigured IPv6 Address

To run a model or NCCL test using global IPv6 addresses, assigned either statically or automatically via SLAAC, the value of the NCCL_IB_GID_INDEX variable must be adjusted.

Note:

Starting with NCCL 2.21, the GID index no longer needs to be specified manually. It is automatically handled based on the NCCL_SOCKET_FAMILY setting. If NCCL_SOCKET_FAMILY is set to AF_INET6 and IPv6 connectivity between hosts is in place, RoCEv2 traffic over IPv6 should work as expected.

The NCCL_IB_GID_INDEX variable defines the Global ID index used by RoCE (RDMA) communication. The default value is -1, which means that NCCL will automatically select the correct GID index based on the active link layer of the InfiniBand device. If the link layer is Ethernet (RoCE), NCCL will use the GID index that returns a GID with RoCE v2 support (usually GID index 3, depending on driver/firmware).
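The version-dependent behavior described above can be sketched as a small helper that builds the environment variables for a run. This is a minimal illustration, not part of the JVD scripts: the function name nccl_ipv6_env and its arguments are hypothetical, and the 2.21 cutoff is taken from the note above.

```python
def nccl_ipv6_env(nccl_version, gid_index=None):
    """Build NCCL environment variables for RoCEv2 over IPv6.

    nccl_version: (major, minor) tuple, e.g. (2, 21).
    gid_index:    GID index to pin; required before NCCL 2.21,
                  optional override from 2.21 onward.
    """
    env = {"NCCL_SOCKET_FAMILY": "AF_INET6"}
    if tuple(nccl_version) < (2, 21):
        # Older releases: the GID index must be given explicitly.
        if gid_index is None:
            raise ValueError("NCCL older than 2.21 needs an explicit GID index")
        env["NCCL_IB_GID_INDEX"] = str(gid_index)
    elif gid_index is not None:
        # Newer releases select the index automatically; pin it only on request.
        env["NCCL_IB_GID_INDEX"] = str(gid_index)
    return env
```

The returned dictionary can be merged into the environment of whatever launcher starts the NCCL test.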

For more details, see NVIDIA’s NCCL Environment Variables documentation.

To find the GID for the desired address, use the following command:

To find the Mellanox interface name, you can use the following script:

Example:

Note: Make sure the GID matches on all nodes.
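As one possible way to perform the lookup described above, the sketch below walks the Linux RDMA sysfs tree and reports every GID table entry that equals a given IPv6 address, together with the Mellanox device name, port, GID type, and associated network interface. It is not the command or script this appendix originally referred to. The function name find_gid_entries is hypothetical, and the sysfs layout (gids, gid_attrs/types, gid_attrs/ndevs under each port) is assumed to match the host's driver.

```python
import ipaddress
from pathlib import Path


def _read(path):
    """Return the stripped file content, or '' if the entry cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return ""


def find_gid_entries(ipv6_addr, sysfs_root="/sys/class/infiniband"):
    """Return (device, port, gid_index, gid_type, netdev) for each GID
    table entry whose value equals ipv6_addr."""
    wanted = ipaddress.IPv6Address(ipv6_addr)
    matches = []
    for gid_file in Path(sysfs_root).glob("*/ports/*/gids/*"):
        port_dir = gid_file.parent.parent      # .../<device>/ports/<port>
        device = port_dir.parent.parent.name   # e.g. mlx5_0
        index = gid_file.name
        try:
            gid = ipaddress.IPv6Address(gid_file.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed entry
        if gid != wanted:
            continue
        matches.append((
            device,
            port_dir.name,
            int(index),
            _read(port_dir / "gid_attrs" / "types" / index),
            _read(port_dir / "gid_attrs" / "ndevs" / index),
        ))
    return sorted(matches)
```

Calling find_gid_entries with the interface's global IPv6 address on each node gives the candidate indexes. An address commonly appears once per RoCE version, in which case the entry whose type reads "RoCE v2" is the one of interest.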

To easily find the mapping between the Mellanox interface names, the user-assigned interface names (e.g. gpu0_eth), the NICs, and the GPUs, you can use the script find_pxb_gpu_nic_pairs.py, which can be found under: https://github.com/Juniper/jvd/tree/main/Data%20Center/AIDC/backend/AI_ML_Multitenancy

Example:

Once you have identified the GID, you can run an NCCL test using:

TENANT=<TENANT#> GID=<GID> ./run-tenant.sh

The run-tenant.sh script can also be found under: https://github.com/Juniper/jvd/tree/main/Data%20Center/AIDC/backend/AI_ML_Multitenancy

Note: The script was created for Tenants = 1-8.

Example:
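When the test is launched from automation rather than typed by hand, the invocation above can be wrapped so that the tenant number is checked before the script starts. This is a sketch only: tenant_env and run_tenant are hypothetical helper names, and the only facts assumed about run-tenant.sh are the ones stated above, namely that it reads TENANT and GID from the environment and supports tenants 1 through 8.

```python
import os
import subprocess


def tenant_env(tenant, gid, base_env=None):
    """Return a copy of the environment with TENANT and GID set."""
    tenant = int(tenant)
    if not 1 <= tenant <= 8:
        raise ValueError("run-tenant.sh was created for tenants 1-8")
    env = dict(os.environ if base_env is None else base_env)
    env["TENANT"] = str(tenant)
    env["GID"] = str(gid)
    return env


def run_tenant(tenant, gid, script="./run-tenant.sh"):
    """Run the script with the validated environment; raise if it fails."""
    return subprocess.run([script], env=tenant_env(tenant, gid), check=True)
```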

To check whether the correct GPU is being used when running an NCCL test, use the following:

Example:
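One way to confirm which GPUs are active during a test is to sample per-GPU utilization with nvidia-smi while the test runs. The sketch below is an assumption about how this check could be scripted, not the command this appendix originally showed. It relies on the nvidia-smi --query-gpu and --format=csv options; the helper names and the 50 percent threshold are illustrative.

```python
import subprocess

# Prints one "index, utilization" line per GPU, without header or units.
QUERY = ["nvidia-smi", "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"]


def parse_utilization(csv_text):
    """Map GPU index to utilization percent, skipping unparsable lines."""
    util = {}
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue
        try:
            util[int(parts[0])] = int(parts[1])
        except ValueError:
            continue  # value reported as not available
    return util


def busy_gpus(util, threshold=50):
    """Return the indexes of GPUs at or above the utilization threshold."""
    return sorted(i for i, pct in util.items() if pct >= threshold)


def sample_busy_gpus(threshold=50):
    """Query nvidia-smi once and return the busy GPU indexes."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return busy_gpus(parse_utilization(out), threshold)
```

Running sample_busy_gpus() while the NCCL test is in progress should list only the GPU assigned to the tenant under test, provided no other workload is running on the host.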

GPU–NIC Mapping and Topology Awareness

Make sure that the correct GPU and NIC are mapped to each Tenant. Maintaining tight NUMA and PCIe alignment between the assigned GPU and NIC ensures the best performance. Each tenant’s GPU and NIC should be strategically co-located within the same NUMA region and PCIe hierarchy whenever possible.

The nvidia-smi topo -m command displays the interconnect topology between GPUs, NICs, and CPUs in the system. The output is shown as a matrix where rows and columns represent devices, and each cell indicates the connection type (or “distance”) between them. These connection types reveal how traffic flows across PCIe switches, host bridges, and CPU sockets, helping identify which GPU–NIC pairings deliver the best performance.

X	Same device (diagonal of the matrix).
PIX	Traverses at most a single PCIe bridge. Shortest path, fastest communication.
PXB	Traverses multiple PCIe bridges without traversing the PCIe Host Bridge. Slightly longer path and latency.
PHB	Traverses a PCIe Host Bridge (typically the CPU). Lower performance.
NODE	Traverses PCIe and the interconnect between PCIe Host Bridges within the same NUMA node. More latency.
SYS	Traverses PCIe and the SMP interconnect between NUMA nodes (for example, QPI/UPI between CPU sockets). Slowest path; avoid for RDMA or latency-sensitive traffic.

For RDMA traffic, choose PXB or PIX paths for GPU↔NIC pairs to keep communication within the same NUMA domain and PCIe Host Bridge. Avoid SYS or NODE paths whenever possible, as they add unnecessary latency and reduce bandwidth efficiency.

As an example, consider a case where GPU2 and NIC0 are assigned to Tenant‑A, and GPU5 and NIC9 are assigned to Tenant‑B, as shown in the figure below. The nvidia-smi topo -m output in Figure ## indicates that traffic from GPU2→NIC0 must traverse multiple PCIe host bridges and cross NUMA domains, resulting in degraded performance for Tenant‑A. In contrast, GPU5→NIC9 communicates through multiple PCIe bridges within the same root complex, avoiding CPU traversal and maintaining better performance for Tenant‑B.

Figure 61. Tenants GPU and NIC assignment example
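The pairing rule above can be expressed as a short check that ranks connection types from closest to farthest and flags tenant assignments worse than PXB. This is an illustrative sketch: the function name and data layout are hypothetical, the ranking follows the legend printed by nvidia-smi topo -m, and the connection types for Tenant‑A (SYS) and Tenant‑B (PXB) are inferred from the description of the example rather than read from real output.

```python
# Connection types ordered from shortest to longest path.
RANK = {"X": 0, "PIX": 1, "PXB": 2, "PHB": 3, "NODE": 4, "SYS": 5}


def poorly_placed_tenants(assignments, max_allowed="PXB"):
    """Return the tenants whose GPU-NIC path is worse than max_allowed.

    assignments: {tenant: (gpu, nic, connection_type)}
    """
    limit = RANK[max_allowed]
    flagged = []
    for tenant, (_gpu, _nic, conn) in assignments.items():
        if conn not in RANK:
            raise ValueError(f"unknown connection type: {conn}")
        if RANK[conn] > limit:
            flagged.append(tenant)
    return sorted(flagged)


# Connection types inferred from the example in the text.
example = {
    "Tenant-A": ("GPU2", "NIC0", "SYS"),
    "Tenant-B": ("GPU5", "NIC9", "PXB"),
}
print(poorly_placed_tenants(example))
```

With these inputs only Tenant-A is reported, which matches the conclusion drawn from the topology matrix.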