Appendix C – How to Run NCCL Test Using Autoconfigured IPV6 Address

To run a model or NCCL test using a global IPv6 address assigned either statically or automatically via SLAAC, the value of the NCCL_IB_GID_INDEX environment variable must be adjusted.

Note:

Starting with NCCL 2.21, the GID index no longer needs to be specified manually. It is automatically handled based on the NCCL_SOCKET_FAMILY setting. If NCCL_SOCKET_FAMILY is set to AF_INET6 and IPv6 connectivity between hosts is in place, RoCEv2 traffic over IPv6 should work as expected.

The NCCL_IB_GID_INDEX variable defines the Global ID index used by RoCE (RDMA) communication. The default value is -1, which means that NCCL will automatically select the correct GID index based on the active link layer of the InfiniBand device. If the link layer is Ethernet (RoCE), NCCL will use the GID index that returns a GID with RoCE v2 support (usually GID index 3, depending on driver/firmware).
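The automatic selection described above can be sketched as follows. This is an illustrative model of the behavior, not NCCL's actual source code; the GID table contents and the "prefer a non-link-local RoCE v2 entry" heuristic are assumptions based on the typical mlx5 GID layout:

```python
import ipaddress

def pick_gid_index(gid_table):
    """Sketch of automatic GID selection for an Ethernet (RoCE) link layer.

    gid_table: list of (index, gid_string, roce_type) tuples, where
    roce_type mirrors the content of
    /sys/class/infiniband/<dev>/ports/<port>/gid_attrs/types/<index>,
    e.g. "IB/RoCE v1" or "RoCE v2".
    Prefers a RoCE v2 entry whose GID is not link-local (fe80::/10),
    falls back to any RoCE v2 entry, and returns -1 if none exists.
    """
    v2_entries = [(i, g) for i, g, t in gid_table if t == "RoCE v2"]
    for index, gid in v2_entries:
        if not ipaddress.ip_address(gid).is_link_local:
            return index
    return v2_entries[0][0] if v2_entries else -1

# Illustrative table: indices 0/1 carry the link-local address (v1/v2),
# indices 2/3 carry the IPv4-mapped address (v1/v2).
sample_table = [
    (0, "fe80::ba3f:d2ff:fe2e:1a20", "IB/RoCE v1"),
    (1, "fe80::ba3f:d2ff:fe2e:1a20", "RoCE v2"),
    (2, "::ffff:10.0.0.5",           "IB/RoCE v1"),
    (3, "::ffff:10.0.0.5",           "RoCE v2"),
]
print(pick_gid_index(sample_table))  # → 3
```

This matches the note above that the selected index is usually 3 on such systems, since that is the first RoCE v2 entry carrying a routable address.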

For more details, see NVIDIA's Environment Variables documentation.

To find the GID for the desired address, use the following command:

ibv_devinfo -vvv -d <mellanox-interface-name> | grep GID
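Locating the right index in the command's output can be automated. The sketch below parses grep-filtered ibv_devinfo output for a given address; the exact line format varies across rdma-core versions, so the sample lines and the regular expression are assumptions:

```python
import re

# Assumed ibv_devinfo GID line shape, e.g.:
#   GID[  5]:   2001:db8:1::5, RoCE v2
GID_LINE = re.compile(r"GID\[\s*(\d+)\]:\s*([0-9a-fA-F:.]+)(?:,\s*(.*))?")

def find_gid_index(devinfo_output, address, roce_type="RoCE v2"):
    """Return the GID index holding `address` with the given RoCE type."""
    for line in devinfo_output.splitlines():
        m = GID_LINE.search(line)
        if m and m.group(2) == address and m.group(3) == roce_type:
            return int(m.group(1))
    return None

# Illustrative output for a port with a link-local and a global IPv6 address.
sample = """\
GID[  0]:   fe80::ba3f:d2ff:fe2e:1a20, RoCE v1
GID[  1]:   fe80::ba3f:d2ff:fe2e:1a20, RoCE v2
GID[  4]:   2001:db8:1::5, RoCE v1
GID[  5]:   2001:db8:1::5, RoCE v2
"""
print(find_gid_index(sample, "2001:db8:1::5"))  # → 5
```

Filtering on the RoCE type matters because each address typically appears twice in the table, once as RoCE v1 and once as RoCE v2.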

To find the Mellanox interface name, you can use the following script:

Example:
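A minimal sketch of how such a device-to-interface mapping can be derived, assuming the standard sysfs layout where each RDMA device lists its network interfaces under /sys/class/infiniband/<dev>/device/net/ (the device and interface names below are illustrative, not output from the script itself):

```python
import os

def rdma_to_netdev(sysfs_root="/sys/class/infiniband"):
    """Map each RDMA device (e.g. mlx5_0) to its network interface names."""
    mapping = {}
    if not os.path.isdir(sysfs_root):
        return mapping  # no RDMA devices present
    for dev in sorted(os.listdir(sysfs_root)):
        net_dir = os.path.join(sysfs_root, dev, "device", "net")
        if os.path.isdir(net_dir):
            mapping[dev] = sorted(os.listdir(net_dir))
    return mapping

if __name__ == "__main__":
    for dev, netdevs in rdma_to_netdev().items():
        print(dev, "->", ", ".join(netdevs))
```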

Once you have identified the GID index, you can run an NCCL test as shown in the following example:

NCCL_PXN_DISABLE=1 NCCL_IB_QPS_PER_CONNECTION=4 NCCL_IB_GID_INDEX=5 ./nccl_run_rails_all_H100.sh -b 1G -e 1G -n 200 -i 0 -m 10
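If the test is launched from a wrapper rather than an interactive shell, the same inline variable assignments can be passed as environment overrides. A sketch, reusing the script name, flags, and GID index from the command line above:

```python
import shlex

# Environment overrides equivalent to the VAR=value prefixes in the
# command line above; NCCL_IB_GID_INDEX carries the index found earlier.
env_overrides = {
    "NCCL_PXN_DISABLE": "1",
    "NCCL_IB_QPS_PER_CONNECTION": "4",
    "NCCL_IB_GID_INDEX": "5",
}
cmd = ["./nccl_run_rails_all_H100.sh", "-b", "1G", "-e", "1G",
       "-n", "200", "-i", "0", "-m", "10"]

# Render the equivalent one-line shell invocation.
print(" ".join(f"{k}={v}" for k, v in env_overrides.items()), shlex.join(cmd))
```

To actually run it, pass `env={**os.environ, **env_overrides}` to `subprocess.run(cmd, ...)` so the overrides extend rather than replace the inherited environment.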

Note:

The following script provides mapping information between the Mellanox interface names, the NIC numbers, and the user-assigned interface names (e.g., gpu0_eth). It also provides mapping information between the interfaces and the GPUs.

Example:

jnpr@A100-01:~/SCRIPTS$ python3 find_pxb_gpu_nic_pairs.py