NVIDIA Configuration

The NVIDIA® ConnectX® family of network interface cards (NICs) offers advanced hardware offload and acceleration features and speeds of up to 400G, supporting both the Ethernet and InfiniBand protocols.

Always refer to the official manufacturer documentation when making changes. This section provides some guidelines based on the AI JVD lab testing.

Converting NVIDIA ConnectX NICs from InfiniBand to Ethernet

By default, the NVIDIA ConnectX NICs are set to operate as InfiniBand interfaces and must be converted to Ethernet using the mlxconfig tool.

1) Check the status of the ConnectX NICs using sudo mst status.

Note:

Mellanox Software Tools (MST) is part of the Mellanox firmware tools suite and can be used to manage and interact with Mellanox network adapters.

user@A100-01:/dev/mst$ sudo mst -h

Usage:

/usr/bin/mst {start|stop|status|remote|server|restart|save|load|rm|add|help|version|gearbox|cable}

Type "/usr/bin/mst help" for detailed help

Start the mst service or load the mst modules if necessary.

Example:

The example shows “MST PCI module is not loaded”. To load it, use the command modprobe mst_pci.
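If the module is not loaded, a minimal sequence like the following (a sketch; hostnames and prompts omitted) loads it, starts the MST service, and verifies the result:

sudo modprobe mst_pci      # load the MST PCI kernel module
sudo mst start             # start the MST service and create the /dev/mst device nodes
sudo mst status            # confirm that the modules are loaded and the devices are listed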

2) Identify the interface that you want to convert.

The sudo mst status -v command provides a list of the Mellanox devices (ConnectX-6 and ConnectX-7 NICs) detected on the system, along with their type, Mellanox device name, PCI address, RDMA interface name, NET interface name, and NUMA ID, as shown in the example below:

For the first interface in the list, you can identify the following:

  • Type = ConnectX7(rev:0)
  • Mellanox device name = mt4129_pciconf7 (/dev/mst/mt4129_pciconf7)
  • PCI addresses = cb:00.0
  • RDMA interface name = mlx5_12
  • NET interface name = net-gpu6_eth
  • NUMA = 1

Notice that for some of the interfaces the name follows the standard Linux interface naming scheme (e.g., net-enp14s0f1np1), while others do not (e.g., net-gpu0_eth). The interface names that do not follow the standard are user-defined names chosen for easy identification, meaning that the default name was changed in the /etc/netplan/ configuration. An example of how to do this is shown later in this section.

3) Identify which mode a given interface is running in using

mlxconfig -d <device> query

EXAMPLE:

Notice that you need to use the Mellanox device name, including the path (/dev/mst/mt4129_pciconf7).

Also, LINK_TYPE_P1 and LINK_TYPE_P2 refer to the two physical ports in a dual-port Mellanox adapter.
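As a sketch, querying the example device and filtering for the link type (the grep is only for brevity) looks like this:

sudo mlxconfig -d /dev/mst/mt4129_pciconf7 query | grep LINK_TYPE

In the output, LINK_TYPE_P1 and LINK_TYPE_P2 show IB(1) when a port is in InfiniBand mode and ETH(2) when it is in Ethernet mode.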

4) If an interface is operating in InfiniBand mode, you can change it to Ethernet mode using

mlxconfig -d <device> set [LINK_TYPE_P1=<link_type>] [LINK_TYPE_P2=<link_type>]

Example

Again, notice that you need to use the Mellanox device name, including the path (/dev/mst/mt4129_pciconf7).
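As a sketch, converting both ports of the example device to Ethernet (ETH = 2, IB = 1) looks like this:

sudo mlxconfig -d /dev/mst/mt4129_pciconf7 set LINK_TYPE_P1=2 LINK_TYPE_P2=2

mlxconfig prompts for confirmation before writing the new configuration to the firmware.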

Note:

Changes made via mlxconfig require the server to be power cycled before they take effect.

To check the status of the interface, you can use the mlxlink tool:
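For example, a sketch of checking port 1 of the example device:

sudo mlxlink -d /dev/mst/mt4129_pciconf7 -p 1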

For more details you can refer to:

HowTo Find Mellanox Adapter Type and Firmware/Driver version (Linux) (nvidia.com)

Firmware Support and Downloads - Identifying Adapter Cards (nvidia.com)

Identifying NIC and GPU mappings and assigning the appropriate interface name

Any NIC can be used by any GPU at any time; it is not hard-coded that a given GPU can only communicate with the outside world using a specific NIC. However, there are preferred communication paths between GPUs and NICs, which in some cases can be seen as a 1:1 correspondence between them. This is shown in the steps below.

NCCL (NVIDIA Collective Communications Library) will choose the path that has the best connection from a given GPU to one of the NICs.

To identify the paths selected by NCCL and what the best path between a GPU and a NIC is, follow these steps:

Use the nvidia-smi topo -m command, which displays topological information about the system, to identify the connection type between GPUs and NICs:

EXAMPLES:

  • DGX H100:

Figure 44. Nvidia H100 System Management Interface (SMI) system topology information

System Management Interface SMI | NVIDIA Developer

Based on our research:

Connection Type   Description                                                             Performance
PIX               PCIe on the same switch                                                 Good
PXB               PCIe through multiple switches, but not through the host bridge         Good
PHB               PCIe switch and across a host bridge on the same NUMA node (uses CPU)   OK
NODE              PCIe switch and across multiple host bridges on the same NUMA node      Bad
SYS               PCIe switch and across the QPI/UPI bus between NUMA nodes (uses CPU)    Very Bad
NV#               NVLink                                                                  Very Good
  • DGX A100:

Figure 45. Nvidia A100 System Management Interface (SMI) system topology information

Identify PXB Connections

If you focus on the highlighted sections of the nvidia-smi output, you can see that for each GPU there are one or more NIC connections of type PXB. These are the preferred “direct” paths from each GPU to a given NIC. That means that when a GPU needs to communicate with a remote device, it will use one of these specific NICs as its first option.

  • DGX H100:

Figure 46. Nvidia H100 System Management Interface (SMI) system topology PXB connections


  • DGX A100:

Figure 47. Nvidia A100 System Management Interface (SMI) system topology PXB connections


Note:

These paths are fixed.

You can also find these mappings in Nvidia’s A100 or H100 user guides.

For example, on a DGX H100/H200 system, the port mappings according to Tables 5 and 6 of the NVIDIA DGX H100/H200 System User Guide are as follows:

      Port      ConnectX  GPU  Default    RDMA     NIC
2     OSFP4P2   CX1       0    ibp24s0    mlx5_0   NIC0
2     OSFP3P2   CX3       1    ibp64s0    mlx5_3   NIC3
1     OSFP3P1   CX2       2    ibp79s0    mlx5_4   NIC4
1     OSFP4P1   CX0       3    ibp94s0    mlx5_5   NIC5
2     OSFP1P2   CX1       4    ibp154s0   mlx5_6   NIC6
2     OSFP2P2   CX3       5    ibp192s0   mlx5_9   NIC9
1     OSFP2P1   CX2       6    ibp206s0   mlx5_10  NIC10
1     OSFP1P1   CX0       7    ibp220s0   mlx5_11  NIC11

Figure 48. Nvidia H100 Port mappings example

NIC GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
NIC0 PXB SYS SYS SYS SYS SYS SYS SYS
NIC3 SYS PXB SYS SYS SYS SYS SYS SYS
NIC4 SYS SYS PXB SYS SYS SYS SYS SYS
NIC5 SYS SYS SYS PXB SYS SYS SYS SYS
NIC6 SYS SYS SYS SYS PXB SYS SYS SYS
NIC9 SYS SYS SYS SYS SYS PXB SYS SYS
NIC10 SYS SYS SYS SYS SYS SYS PXB SYS
NIC11 SYS SYS SYS SYS SYS SYS SYS PXB


For more information and for the mappings on the A100 systems check:

Introduction to the NVIDIA DGX A100 System — NVIDIA DGX A100 User Guide 1 documentation

Introduction to NVIDIA DGX H100/H200 Systems — NVIDIA DGX H100/H200 User Guide 1 documentation

Changing NIC attributes

How to Change a NIC’s Interface Name, and Assign IP Addresses and Routes

Changes to NIC attributes such as the IP address or the interface name can be made by editing the netplan configuration and reapplying it.

The network configuration is described in the file /etc/netplan/01-netcfg.yaml, as shown in the example below. Any attribute change involves editing this file and reapplying the network plan, as shown in the examples later in this section.

Example:

netcfg.yaml output

jvd@A100-01:/etc/netplan$ more 01-netcfg.yaml
# This is the network config written by 'subiquity'
network:
  version: 2
  ethernets:
    mgmt_eth:
      match:
        macaddress: 7c:c2:55:42:b2:28
      dhcp4: false
      addresses:
        - 10.10.1.0/31
      nameservers:
        addresses:
          - 8.8.8.8
      routes:
        - to: default
          via: 10.10.1.1
      set-name: mgmt_eth
    weka_eth:
      match:
        macaddress: b8:3f:d2:8b:68:e0
      dhcp4: false
      mtu: 9000
      addresses:
        - 10.100.1.0/31
      routes:
        - to: 10.100.0.0/22
          via: 10.100.1.1
      set-name: weka_eth
    gpu0_eth:
      match:
        macaddress: 94:6d:ae:54:72:22
      dhcp4: false
      mtu: 9000
      addresses:
        - 10.200.0.8/24
      routes:
        - to: 10.200.0.0/16
          via: 10.200.0.254
          from: 10.200.0.8
      set-name: gpu0_eth
    gpu1_eth:
      match:
        macaddress: 94:6d:ae:5b:01:d0
      dhcp4: false
      mtu: 9000
      addresses:
        - 10.200.1.8/24
      routes:
        - to: 10.200.0.0/16
          via: 10.200.1.254
          from: 10.200.1.8
      set-name: gpu1_eth
    gpu2_eth:
      match:
        macaddress: 94:6d:ae:5b:28:60
      dhcp4: false
      mtu: 9000
      addresses:
        - 10.200.2.8/24
      routes:
        - to: 10.200.0.0/16
          via: 10.200.2.254
          from: 10.200.2.8
      set-name: gpu2_eth
    gpu3_eth:
      match:
        macaddress: 94:6d:ae:5b:01:e0
      dhcp4: false
      mtu: 9000
      addresses:
        - 10.200.3.8/24
      routes:
        - to: 10.200.0.0/16
          via: 10.200.3.254
          from: 10.200.3.8
      set-name: gpu3_eth
    gpu4_eth:
      match:
        macaddress: 94:6d:ae:5b:28:70
      dhcp4: false
      mtu: 9000
      addresses:
        - 10.200.4.8/24
      routes:
        - to: 10.200.0.0/16
          via: 10.200.4.254
          from: 10.200.4.8
      set-name: gpu4_eth
    gpu5_eth:
      match:
        macaddress: 94:6d:ae:5b:27:f0
      dhcp4: false
      mtu: 9000
      addresses:
        - 10.200.5.8/24
      routes:
        - to: 10.200.0.0/16
          via: 10.200.5.254
          from: 10.200.5.8
      set-name: gpu5_eth
    gpu6_eth:
      match:
        macaddress: 94:6d:ae:54:78:e2
      dhcp4: false
      mtu: 9000
      addresses:
        - 10.200.6.8/24
      routes:
        - to: 10.200.0.0/16
          via: 10.200.6.254
          from: 10.200.6.8
      set-name: gpu6_eth
    gpu7_eth:
      match:
        macaddress: 94:6d:ae:54:72:12
      dhcp4: false
      mtu: 9000
      addresses:
        - 10.200.7.8/24
      routes:
        - to: 10.200.0.0/16
          via: 10.200.7.254
          from: 10.200.7.8
      set-name: gpu7_eth

To Map an Interface Name to a Specific NIC (Physical Interface)

Map the interface name to the MAC of the physical interface in the configuration file:

Figure 49. Nvidia A100 physical interface identification example

where:

en = Ethernet network interface.

p203s0 = physical location of the network interface.

203 = bus number.

s0 = slot number 0 on the bus.

f1 = function number 1 for the network interface.

np1 = Network Port 1


Function 0: Might be the primary Ethernet interface.

Function 1: Might be a second Ethernet interface.

Function 2: Might be a management or diagnostics interface.

Figure 50. Nvidia A100 netplan file modification example

You can find the names of all the logical interfaces in the devnames file:

Apply the changes using the netplan apply command

Figure 51. Nvidia A100 netplan application example
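A minimal sketch of this step (the interface name is taken from the example configuration above):

sudo netplan apply
ip addr show gpu0_eth      # verify that the interface came up with the expected name and address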

To Change the NIC Name

Change the value of set-name in the configuration file and save the changes:

Figure 52. Nvidia A100 netplan interface name change example
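For example, to rename gpu0_eth (MAC address taken from the example configuration; the new name is hypothetical), only the set-name value changes and the other lines of the stanza stay as they are:

    gpu0_eth:
      match:
        macaddress: 94:6d:ae:54:72:22
      set-name: gpu0_rail      # new interface name; the stanza key itself does not need to change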

Apply the Changes Using the netplan apply command

Figure 53. Nvidia A100 netplan interface name change application and verification example


To Change the Current IP Address or Assign an IP Address to the NIC

Change or add the address under the proper interface in the configuration file, and save the changes:

Figure 54. Nvidia A100 netplan interface IP address change example

Enter the IP addresses preceded with a hyphen and indented; make sure to add the subnet mask.
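For example, a sketch for gpu0_eth (the new address is hypothetical; only the changed lines of the stanza are shown):

      addresses:
        - 10.200.0.9/24      # new address, written with its subnet mask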

Apply the Changes Using the netplan apply Command

Figure 55. Nvidia A100 netplan interface new IP address application and verification example

To Change or Add Routes to the NIC

Change or add the routes under the proper interface in the configuration file, and save the changes.

Figure 56. Nvidia A100 netplan additional routes example
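For example, a sketch for gpu0_eth (the additional destination prefix is hypothetical; only the routes section of the stanza is shown):

      routes:
        - to: 10.201.0.0/16      # additional destination prefix
          via: 10.200.0.254      # next hop on the local subnet
          from: 10.200.0.8       # source address to use for this route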

Apply the changes using the netplan apply command

Figure 57. Nvidia A100 netplan additional routes application and verification example

Configuring NVIDIA DCQCN – ECN

Figure 58: NVIDIA DCQCN – ECN

Starting from MLNX_OFED 4.1, ECN is enabled by default (in the firmware).

The ECN bits in the IP header are always marked with 10 for RoCE traffic.

The ECN parameters are located under the following path: /sys/class/net/<interface>/ecn

Use the following command to find the interface:
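A minimal sketch, assuming MLNX_OFED's ibdev2netdev utility and the interface names used in the lab:

ibdev2netdev                       # maps the RDMA devices (mlx5_X) to their net interface names
ls /sys/class/net/gpu0_eth/ecn/    # lists the roce_np and roce_rp parameter directories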

Notification Point (NP) Parameters

When the ECN-enabled receiver receives ECN-marked RoCE packets, it responds by sending CNP (Congestion Notification Packets).

The following commands describe the notification parameters:

Examples:

cnp_802p_prio = the value of the PCP (Priority Code Point) field of the CNP packets.

PCP is a 3-bit field within an Ethernet frame header when using VLAN tagged frames as defined by IEEE 802.1Q.

cnp_dscp = the value of the DSCP (Differentiated Services Code Point) field of the CNP packets.

min_time_between_cnps = minimal time between two consecutive CNPs sent. If an ECN-marked RoCE packet arrives within a period smaller than min_time_between_cnps since the previously sent CNP, no CNP will be sent in response. This value is in microseconds. Default = 0.

The output shows that roce_np is enabled for all priority values.
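These NP parameters can also be read directly from sysfs; a sketch, assuming the path layout described above and the lab interface name:

cat /sys/class/net/gpu0_eth/ecn/roce_np/enable/3              # 1 = CNP generation enabled for priority 3
cat /sys/class/net/gpu0_eth/ecn/roce_np/cnp_dscp
cat /sys/class/net/gpu0_eth/ecn/roce_np/cnp_802p_prio
cat /sys/class/net/gpu0_eth/ecn/roce_np/min_time_between_cnps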

Note:

Sending CNP packets is handled globally per port; any priority enabled here will set sending CNP packets to on (1).

To change the attributes described above, use the mlxconfig utility:

Example:
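A minimal sketch, assuming the example device from earlier and the CNP-related mlxconfig parameter names (CNP_DSCP_P1, CNP_802P_PRIO_P1):

sudo mlxconfig -d /dev/mst/mt4129_pciconf7 set CNP_DSCP_P1=48 CNP_802P_PRIO_P1=6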

Reaction Point (RP) Parameters

When the ECN-enabled sender receives CNP packets, it responds by slowing down transmission for the specified flows (priority).

The following parameters define how traffic flows will be rate limited after CNP packets arrive:

Examples:

rpg_max_rate = Maximum rate at which reaction point node can transmit. Once this limit is reached, RP is no longer rate limited.

This value is configured in Mbits/sec. Default = 0 (full speed – no max)

The output shows that roce_rp is enabled for all priority values.
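As with the NP side, the RP parameters can be read from sysfs; a sketch, assuming the same path layout and lab interface name:

cat /sys/class/net/gpu0_eth/ecn/roce_rp/enable/3       # 1 = rate limiting in response to CNPs enabled for priority 3
cat /sys/class/net/gpu0_eth/ecn/roce_rp/rpg_max_rate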

To check the ECN statistics use: ethtool -S <interface> | grep ecn

Note:

Handling CNP is configured per priority.

Example:

NVIDIA DCQCN – PFC Configuration

IEEE 802.1Qbb applies pause functionality to specific classes of traffic on the Ethernet link.

Figure 59: NVIDIA DCQCN – PFC Configuration

To check whether PFC is enabled on an interface use: ethtool -a <interface>

Example:

To check the current configuration parameters on an interface you can also use: mlnx_qos -i <interface>

Example:

To enable/disable PFC use: sudo ethtool -A <interface> rx <on|off> tx <on|off>

Example:

PFC should be enabled for a specific priority using the mlnx_qos utility:

Example:

Current configuration:

Enable PFC for Priority=2 instead of 3.
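A minimal sketch, using the lab interface name (the --pfc argument is a comma-separated on/off flag per priority 0-7):

sudo mlnx_qos -i gpu0_eth --pfc 0,0,1,0,0,0,0,0      # enable PFC on priority 2 only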

To check the PFC statistics use ethtool -S <interface>

Example:

Note:

The Pause counters are visible via ethtool only for priorities on which PFC is enabled.

NVIDIA TOS/DSCP Configuration for RDMA-CM QPs (RDMA Traffic)

Figure 60: NVIDIA TOS/DSCP


RDMA traffic must be properly marked to allow the switch to correctly classify it and place it in the lossless queue for proper treatment. Marking can be either DSCP within the IP header or PCP in the VLAN tag field of the Ethernet frame. Whether DSCP or PCP is used depends on whether the interface between the GPU server and the switch is doing VLAN tagging (802.1Q) or not.

To check the current configuration and to change the values of TOS for the RDMA outbound traffic, use the cma_roce_tos script that is part of MLNX_OFED 4.0.

To check the current value of the TOS field enter sudo cma_roce_tos without any options.

Example:

In the example, the current TOS value = 106, which corresponds to a DSCP value of 26 (the upper six bits of the TOS) with the ECN bits set to 10.

Note:

The TOS field is 8 bits, while DSCP uses the upper 6 bits of it. To set a DSCP value of X, multiply that value by 4 (a left shift of 2 bits). For example, to set a DSCP value of 24, set the TOS field to 24 x 4 = 96; add 2 to also set the ECN bits to 10, giving a TOS of 98.


To change the value use: cma_roce_tos -d <ib_device> -t <TOS>

You need to enter the ib_device in this command. The following script automatically does the mapping between the physical interfaces and the ib_device.
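A minimal sketch of such a script, assuming MLNX_OFED's ibdev2netdev utility and a target TOS value of 106:

#!/bin/bash
# Set the RoCE TOS on every RDMA device, printing the matching net interface for reference
TOS=106
ibdev2netdev | while read -r dev _ _ _ netif _; do
    echo "Setting TOS=$TOS on $dev ($netif)"
    sudo cma_roce_tos -d "$dev" -t "$TOS"
done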

Example:

Figure 61. Script results example

Figure 62. Reference TOS, DSCP Mappings:
