AMD configuration

The AI servers covered as part of the JVD include two Supermicro AS-8125GS-TNMR2 dual AMD EPYC 8U GPU servers and two Dell PowerEdge XE9680 servers.

This section provides guidelines for installing and configuring the interfaces and other relevant parameters based on the AI JVD lab testing. Always refer to the official manufacturer's documentation for more details and before making changes.

Setting AMD MI300X BIOS Parameters

Each vendor's BIOS settings differ based on its UI, its GPU mappings, and the server's internal architecture.

SuperMicro AS-8125GS-TNMR2

Boot the server into Setup mode (the Supermicro splash screen can take several minutes to appear):

UEFI/BIOS Area                              Value
Advanced -> NB Configuration                ACS Enable = Disable
Advanced -> NB Configuration -> xGMI        xGMI Link Width Control = Manual
                                            xGMI Force Link Width Control = Force
                                            xGMI Force Link Width = 2
                                            xGMI Max Link Width Control = Manual
                                            xGMI Link Max Speed = Auto
Advanced -> PCIe/PCI/PnP Configuration      Above 4G Decoding = Enabled
                                            Re-Size BAR Support = Enabled
                                            SR-IOV Support = Enabled
                                            Workload = Not Configured

DELL XE9680

The following BIOS settings are recommended by Dell for the XE9680 AI/ML server. These settings also disable IOMMU and ACS on the host.

UEFI/BIOS Area                      Value
BIOS -> Processor Settings          Logical Processor = Disable
                                    Virtualization Technology = Disable
                                    Sub NUMA Cluster = Disable
                                    MADT Core Enumeration = Linear
BIOS -> Integrated Devices          Global SR-IOV = Disable (1)
BIOS -> System Profile Settings     System Profile = Performance
                                    Workload Profile = Not Configured
BIOS -> System Security             AC Recovery Delay = Random (highly recommended)

(1) Dell recommends enabling Global SR-IOV, but on the Dell DUTs in this lab setup this setting was incompatible with the Thor2 NIC port mode 0 for the storage and frontend fabrics (2x200Gb vs. 1x400Gb), causing the DUT to fault on boot. Consult your Dell account team for recommendations about this setting in your setup.

Follow the configuration steps described in the Single-node network configuration for AMD Instinct accelerators — GPU cluster networking documentation. Note that the ACS-disable script used in step 6 must also be rerun before any workloads after a server has been rebooted.
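The ACS-disable script itself is not reproduced here. A minimal sketch of the usual approach (an assumption, not the official script: it clears the ACS Control register on every PCIe bridge via setpci; verify against the script in the AMD guide before use):

```shell
#!/bin/bash
# Sketch: disable ACS on all PCIe bridges (class 0604).
# Assumption: pciutils with named extended-capability support (ECAP_ACS).
for bdf in $(lspci -d ::0604 | awk '{print $1}'); do
    # ECAP_ACS+0x6.w is the ACS Control register inside the ACS extended capability
    if sudo setpci -s "$bdf" ECAP_ACS+0x6.w=0000 2>/dev/null; then
        echo "ACS disabled on $bdf"
    fi
done
```

Because the change does not persist, this loop (or the official script) must run after every reboot, before workloads start.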

Ethernet Network Adapter (NICs) for AI Data centers

AI/ML workloads have increased in complexity and scale, making the network crucial for efficient job completion times. The network adapters (NICs) are the connection points between the GPUs and the data center fabrics, so they must handle large amounts of data and support high-speed, low-latency communication between GPU servers. At a minimum, the NICs should support key AI/ML functionality such as:

  • RDMA over Converged Ethernet (RoCE) and congestion control.
  • The ability to handle 400G of bidirectional data with low latency.
  • Advanced congestion control mechanisms that can detect and react to network congestion and optimize traffic flow.
  • GPU scalability, ensuring robust performance as the number of GPUs increases.

For Server NICs, we have two options:

  • Broadcom Thor2—The Broadcom Thor2 network adapters were validated for AI/ML workloads and job completion times in Phase 1.
  • AMD Pollara—The AMD Pollara 400 Ethernet network adapters were validated in Phase 2.

For more information on the AMD Pensando Pollara 400 Ethernet adapter, refer to this link.

Identifying NICs and GPUs mappings

Along with the fabric and GPU server setup, this JVD covers the configuration and setup of the Ethernet network adapters (NICs). The Broadcom BCM957608 (Thor2) Ethernet network adapter was validated in Phase 1, and the AMD Pollara 400 NICs were validated in Phase 2.

All four servers are equipped with the components listed below, plus either of the following NICs:

Note: Most of the setup commands apply to both the Thor2 and the AMD Pollara 400 NIC. Any differences in steps are called out at the appropriate places in the document.

Dell devices:

AMD MI300x GPU Server and NIC Firmware and RCCL libraries

For the purposes of the Broadcom Thor2 NIC validation, the following are the main OS and firmware versions configured on the MI300X GPU servers:

Broadcom Thor 2 Ethernet Adapter

Below are the details of the Operating System (OS), firmware and AMD libraries installed:

OS/Firmware                       Version
Ubuntu                            Ubuntu 24.04.2 LTS
Broadcom Thor2 NIC firmware       231.2.63.0

The following libraries were installed for the RCCL tests with the Thor2 network adapter:

RCCL test library     Version                        Command
rocm/noble            6.4.0.60400-47~24.04 amd64     apt list rocm
RCCL (1)              2.22.3.60400-47~24.04 amd64    apt list rccl
MPI (Open MPI)        5.0.8a1                        mpirun --version
UCX                   1.15.0                         /opt/ucx/bin/ucx_info -v
(https://github.com/openucx/ucx.git)
Note: For AMD drivers and host utilities, please contact your regional AMD representative.

AMD Pensando Pollara 400 Ethernet Adapter

For the purposes of the AMD Pollara 400 NIC validation, the following are the main OS and firmware versions configured on the MI300x GPU servers:

OS/Firmware                       Version
Ubuntu                            Ubuntu 22.04.5 LTS
AMD Pollara NIC firmware          1.110.0-a-79

Output of the Ubuntu version 22.04 installed on the MI300X servers:

Output of the AMD Pollara 400 NIC card Firmware version

The following libraries were installed for the RCCL tests with the AMD Pollara 400 NIC adapter:

RCCL test library     Version                        Command
rocm/jammy            6.3.3.60303-74~22.04 amd64     apt list rocm
RCCL (1)              7961624
MPI (Open MPI)        5.1.0a1                        /opt/ompi/bin/mpirun --version
UCX                   1.20.0                         /opt/ucx/bin/ucx_info -v
(https://github.com/openucx/ucx.git)
rccl-tests            revision 6704fc6               Git branch: https://github.com/ROCm/rccl-tests.git
ANP plugin (2)
Note: For AMD drivers and host utilities, please contact your regional AMD representative.
(1) The RCCL library is a private build provided by AMD.
(2) The ANP plugin is a private build provided by AMD.

High-level steps for installing this software and its dependent libraries are provided later, in the section AMD Pollara firmware and dependent libraries, because those steps can only be performed once the NICs and GPUs are mapped as described in the sections below.

In this section, we will explore some of the options to find information about and configure the NICs and GPUs.

ROCm Communication Collectives Library (RCCL)

In AMD servers, ROCm's RCCL provides multi-GPU and multi-node collective communication primitives optimized for AMD GPUs. RCCL implements collective operations such as all-reduce, all-gather, reduce, broadcast, and all-to-all, as well as point-to-point send and receive, across multiple GPUs in one or more GPU servers.

Communication between GPUs on a single server is implemented using xGMI (inter-chip global memory interconnect), part of AMD's Infinity Fabric technology. The Infinity Fabric is a high-bandwidth, low-latency interconnect for the various components within a system including CPUs, GPUs, memory, NICs, and other devices. xGMI provides socket-to-socket communication, allowing direct CPU-to-CPU or GPU-to-GPU communication.

Communication between different servers is processed by RDMA-capable NICs (e.g., RoCEv2 over Ethernet) and routed across the GPU backend fabric. These NICs can be used by any GPU at any time as there is no hard coded 1-to-1 GPU to NIC mapping. However, the use of preferred communication paths between GPUs and NICs creates the appearance of a 1:1 correspondence.

RCCL will always choose the path that has the best connection between GPUs and between GPUs and NICs, aiming to optimize bandwidth and latency. Optimized intra-node paths are used before traffic is forwarded inter-node.

The rocm-smi (Radeon Open Compute Platform System Management Interface) CLI provides tools for configuring and monitoring AMD GPUs. It can be used to identify GPU hardware details as well as topology information using options such as:

--showproductname : shows product details
--showtopo : shows hardware topology information
--showtopoaccess : shows the link accessibility between GPUs
--showtopohops : shows the number of hops between GPUs
--showtopotype : shows the link type between GPUs
--showtoponuma : shows the NUMA nodes
--shownodesbw : shows the NUMA node bandwidth
--showhw : shows the hardware details

Examples from AMD Instinct MI300X OAM:

The --showproductname option shows the GPU series, model, and vendor along with additional details. The example output shows that AMD Instinct™ MI300X Platform GPUs are installed in the server.

The --showhw option shows information about the GPUs in the system. The fields are defined as follows:

GPU           Index of the GPU on the system, starting from 0.
NODE          NUMA (Non-Uniform Memory Access) node ID associated with the GPU. Helps identify memory locality; optimal GPU/NIC mapping often relies on NUMA proximity.
DID           Device ID of the GPU; a unique identifier for the specific GPU model. Useful for verifying the exact model. For example, 0x74a1 corresponds to an MI300X-series GPU.
GUID          GPU unique identifier. This value is specific to each GPU and may relate to its PCIe device. Useful for distinguishing GPUs in a multi-GPU environment.
GFX VER       The version of the GPU architecture (e.g., gfx942 corresponds to the AMD CDNA 3 family used by the MI300X). The GFX prefix is part of AMD's internal naming convention for its GPU microarchitecture families. See GPU architecture hardware specifications — ROCm Documentation.
GFX RAS       Status of GPU RAS (Reliability, Availability, Serviceability) features. Indicates error handling.
SDMA RAS      Status of SDMA (System Direct Memory Access) RAS features.
UMC RAS       Status of Unified Memory Controller (UMC) RAS features.
VBIOS         VBIOS (Video BIOS) version; the firmware version running on the GPU. An identical firmware version (113-M3000100-102) on all GPUs indicates a uniform configuration.
BUS           PCIe bus address of the GPU. Helps map the GPU to its physical slot. For example, 0000:05:00.0 is a PCIe address; it allows you to correlate GPUs with physical slots or NUMA nodes.
PARTITION ID  GPU partition or instance ID. For multi-instance GPUs (e.g., MI300X) this identifies instances. All values being 0 indicates that no multi-instance partitioning is enabled for these GPUs.

The --showbus option shows PCI bus-related information, including the correspondence between GPU IDs and PCI bus IDs.

The --showmetrics option provides comprehensive information about GPU status and performance, including metrics such as temperature, clock frequency, power, and PCIe bandwidth.

The --showtopo option shows how the GPUs in the system communicate with each other via xGMI (Link Type), with one hop between any two GPUs. The weight of 15 indicates that this direct communication is the preferred path.

The link type, number of hops, and weight can also be obtained individually using the options --showtopotype, --showtopohops, and --showtopoweight:

The --shownodesbw option shows the bandwidth available for internal GPU-to-GPU communication:

For additional options and details, as well as more information about rocm-smi and the newer amd-smi CLI, check: ROCm Documentation, AMD SMI documentation, ROCm and AMD SMI.

NICs and GPUs mappings

Next, map the NICs to GPUs as shown in the steps below. These steps are the same for both the Thor2 and the AMD Pollara 400 NIC.

The information from other commands can be combined with some of the options above to correlate GPUs and NICs, following these steps:

  1. Identify NUMA Nodes and GPUs

    Use the output from rocm-smi --showtoponuma or just rocm-smi --showtopo to find mappings between GPUs and NUMA nodes.

    Look for NUMA Affinity for each GPU in the output. A description of what this attribute means is included later in this section.

    Note which GPUs are associated with which NUMA nodes.

    Example:

    GPU 0–3 → NUMA Node 0

    GPU 4–7 → NUMA Node 1

  2. Identify NUMA Nodes for NICs

    Navigate to the /sys/class/net/ directory and check the NUMA node affinity for each network interface (excluding lo or docker interfaces):

    Note the NUMA node affinity for each NIC interface.

    EXAMPLE:

  3. Correlate GPUs to NICs Based on NUMA Affinity

Using the NUMA node affinity from Step 1 (GPUs) and Step 2 (NICs), map each GPU to the NICs within the same NUMA node:

EXAMPLE:

NOTE: You can also use the following script to automate the steps above:

EXAMPLE:
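The NUMA-correlation logic in steps 1-3 can be sketched as a small helper that reads the sysfs attributes used above (interface names are illustrative; on a live server, compare the output against rocm-smi --showtoponuma to group GPUs and NICs by node):

```shell
# map_nic_numa: print "<nic> <numa-node>" for every non-loopback NIC found
# under the given sysfs root (defaults to /sys).
map_nic_numa() {
    local root=${1:-/sys} dev nic numa
    for dev in "$root"/class/net/*; do
        nic=$(basename "$dev")
        case "$nic" in lo|docker*) continue ;; esac
        # skip interfaces that expose no NUMA affinity (e.g., virtual devices)
        numa=$(cat "$dev/device/numa_node" 2>/dev/null) || continue
        printf '%s %s\n' "$nic" "$numa"
    done
}

map_nic_numa /sys
```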

You will notice that there is not a 1:1 GPU-to-NIC association. Instead, multiple NIC interfaces are associated with each GPU, because they share the same Non-Uniform Memory Access (NUMA) node affinity.

Systems employing a NUMA architecture contain collections of hardware resources, including CPUs, GPUs, memory, and PCIe devices (including NICs), grouped together into what is known as a "NUMA node". These resources are considered "local" to each other. From the point of view of a GPU, devices in the same NUMA node are the most closely associated with that GPU. The NUMA node is identified by the NUMA Affinity.

Multiple NICs and GPUs may be connected to the same PCIe complex or switch within a NUMA node. This makes the NICs accessible to all GPUs sharing that complex. However, while all NICs in a NUMA node are accessible to any GPU in the same node, the NICs are allocated dynamically for usage by a given GPU, based on availability, traffic type, latency, and so on.

Communication Between GPUs on the Same NUMA Node (e.g., GPU1 ↔ GPU2):

GPUs on the same NUMA node (e.g., GPU1 and GPU2) communicate directly over the high-bandwidth, low-latency interconnect, such as Infinity Fabric (in AMD systems).

These interconnects avoid the CPU and main memory entirely, offering much faster communication compared to NUMA-crossing communication. Since both GPUs are "local" to the same memory controller and CPU, the communication path is highly optimized.

Communication Between GPUs on Different NUMA Nodes (e.g., GPU1 ↔ GPU4):

Communication between GPUs on different NUMA nodes (e.g., GPU1 on NUMA 0 and GPU4 on NUMA 1) must traverse additional layers of the system architecture, which introduces higher latency. The path typically follows:

  • GPU1 → CPU (NUMA 0): Data is sent from GPU1 to the CPU on NUMA 0.
  • Inter-NUMA Link: The CPUs in NUMA 0 and NUMA 1 are connected via an interconnect such as Infinity Fabric or UPI (Ultra Path Interconnect).
  • CPU (NUMA 1) → GPU4: The data is forwarded from the CPU on NUMA 1 to GPU4.

Changing NIC attributes

This section shows you how to add or change a NIC’s Interface Name, MTU, DNS, IP Addresses and Routing table entries.

Editing and reapplying the network configuration (netplan) file

The network configuration is described in the netplan *.yaml file found under: /etc/netplan/.

Notice that the actual file name might vary. Examples:

/etc/netplan/01-netcfg.yaml

/etc/netplan/00-installer-config.yaml

Changing any interface attribute involves editing this file and reapplying the network plan as shown below:

  1. Find the default names of the logical interfaces.

    You can use the following steps to achieve this:

    Thor2 NIC output:

    Interface ens31np0:

    Where:

    • en: Ethernet network interface.
    • s31: the physical location of the network interface on the system bus (slot 31).
    • np0:
      • n: network (indicates it's a network port).
      • p0: port 0 (the first port of this network interface).

    AMD Pollara 400 NIC output

    You can use the script gpunic.py to find the mappings between GPUs and NICs per PCIe bus, and to identify how the NICs need to be renamed for consistency.

    EXAMPLE:

    To further identify the interfaces, you can use the following command.

    You want to make sure that the NICs connected to the GPU Backend fabric, the Storage Backend fabric, and the Frontend fabric are 400GE interfaces, 200GE interfaces, and 100GE interfaces, respectively.

    DEFAULT INTERFACE NAME NEW NAME Speed
    enp6s0np0 gpu0_eth 400GE
    enp35s0np0 gpu1_eth 400GE
    enp67s0np0 gpu2_eth 400GE
    enp102s0np0 gpu3_eth 400GE
    enp134s0np0 gpu4_eth 400GE
    enp163s0np0 gpu5_eth 400GE
    enp195s0np0 gpu6_eth 400GE
    enp230s0np0 gpu7_eth 400GE
    enp47s0f0np0 stor0_eth 200GE
    enp47s0f0np1 stor1_eth 200GE
    enp208s0f0np0 mgmt_eth 100GE
  2. Find the interface’s MAC address:

    You can use the ip link show <device> command.

    EXAMPLE:

    DEFAULT INTERFACE NAME NEW NAME MAC address
    enp6s0np0 gpu0_eth 7c:c2:55:bd:75:d0
    enp35s0np0 gpu1_eth 7c:c2:55:bd:79:20
    enp67s0np0 gpu2_eth 7c:c2:55:bd:7d:f0
    enp102s0np0 gpu3_eth 7c:c2:55:bd:7e:20
    enp134s0np0 gpu4_eth 7c:c2:55:bd:75:10
    enp163s0np0 gpu5_eth 7c:c2:55:bd:7d:c0
    enp195s0np0 gpu6_eth 7c:c2:55:bd:84:90
    enp230s0np0 gpu7_eth 7c:c2:55:bd:83:10
    enp47s0f0np0 stor0_eth 5c:25:73:66:bc:5e
    enp47s0f0np1 stor1_eth 5c:25:73:66:bc:5f
    enp208s0f0np0 mgmt_eth 5c:25:73:66:c3:ee
  3. Modify the netplan configuration file using the new name and MAC addresses determined in the previous steps.

    EXAMPLE:

    Make sure to keep proper indentation, and hyphens where appropriate (e.g., before IP addresses, routes, etc.), when editing the file. For the IP addresses, make sure to include the subnet mask.

    The following is an example of the netplan configuration file for one of the MI300X servers in the lab:
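    The file itself is not reproduced here; an illustrative sketch of its shape follows (the MTU, addresses, and routes are placeholders, not the lab values; the MAC is the gpu0_eth address from step 2, and only one interface stanza is shown):

```yaml
# /etc/netplan/01-netcfg.yaml -- illustrative fragment only
network:
  version: 2
  renderer: networkd
  ethernets:
    gpu0_eth:
      match:
        macaddress: "7c:c2:55:bd:75:d0"   # from step 2
      set-name: gpu0_eth                  # new name from step 1
      mtu: 9000                           # placeholder; use your fabric MTU
      addresses:
        - 10.1.1.2/24                     # placeholder; include the subnet mask
      routes:
        - to: 10.1.0.0/16                 # placeholder route
          via: 10.1.1.1
```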

  4. Save the file and apply the changes using the netplan apply command.

    jnpr@MI300X-01:/etc/netplan$ sudo netplan apply

    jnpr@MI300X-01:/etc/netplan$

  5. Verify that the changes were correctly applied.

Check that the new interface names are correct:

Thor2 NIC output:

Notice that the gpu#_eth (#=0-7) interfaces are Broadcom BCM957608 interfaces while the mgmt_eth and stor#_eth interfaces are Mellanox MT2910 (ConnectX-7) interfaces. This will become important in the next section where we will cover the interfaces CoS configuration.

AMD Pollara NIC output for same command:

Notice that the gpu#_eth (#=0-7) interfaces are AMD Pollara 400 NIC interfaces while the mgmt_eth and stor#_eth interfaces are Mellanox MT2910 (ConnectX-7) interfaces. This will become important in the next section where we will cover the interfaces CoS configuration. The eth3 interface is the Supermicro IPMI interface.

Verify that the IP addresses were configured correctly:

OR

Check that the routes were added correctly to the routing table:

OR

Check address resolution:

AMD Pollara firmware and dependent libraries

Note: For AMD drivers and host utilities, please contact your regional AMD representative.

For brevity, the steps described here pertain only to enabling the RCCL test for the AMD Pollara 400 NIC; all of the dependent software and libraries required for the RCCL test to run must be installed. The steps pertain to the libraries listed in the AMD Server and NIC Firmware and RCCL supporting libraries table.

  1. Ensure that the Ubuntu OS version is 22.04, as suggested in the section AMD Server and NIC Firmware and RCCL supporting libraries.
  2. Install the RCCL library as suggested in the steps below. Note that RCCL and ANP are private libraries provided by AMD.
  3. Install the Unified Communication Framework (UCX). UCX is an open-source, cross-platform framework designed to provide a common set of communication interfaces for various network programming models and interfaces; refer to the AMD documentation for more information.
  4. Next, install Open MPI. Note that Open MPI is fetched from GitHub, so GitHub credentials may be required. The Open MPI Project is an open-source Message Passing Interface implementation developed and maintained by a consortium of academic, research, and industry partners, combining expertise, technologies, and resources from across the High Performance Computing community; refer to OpenMPI for more information.
  5. Install the Pollara drivers and firmware. This is an AMD-provided firmware bundle. It also installs the 'nicctl' command-line utility, used to interact with the Pollara NICs and run commands to reset cards, configure QoS, and so on.

    Output of the Firmware version:

  6. Once the firmware install is complete, reset the card so that the new firmware version takes effect.

    Pollara NIC output of Firmware update

  7. Install the ANP plugin. ANP is a plugin library designed to enhance the RCCL collective communication library with extended network transport support. The ANP plugin library is a private AMD library.
  8. Lastly, build the RCCL tests.

Broadcom BCM957608 Thor2 DCQCN configuration for RDMA Traffic

Default DCQCN-ECN/PFC attributes in AMD servers

The network interface adapters are configured with the following Class of Service (including DCQCN-ECN) parameters for RoCE traffic:

For Thor2 NIC adapter:

  • RoCEv2 (RDMA over IPv4) enabled
  • Congestion Control (ECN) and PFC enabled
  • RoCE traffic tagged with DSCP 26 on PRIORITY 3
  • RoCE CNP traffic tagged with DSCP 48 and PRIORITY 7

Mapping Broadcom and logical interface names to configure DCQCN-ECN/PFC and TOS/DSCP attributes for RDMA traffic in AMD servers

DCQCN ECN, PFC, and traffic marking need to be configured on the interfaces connected to the GPU backend; that is, on the gpu#_eth (#=0-7) interfaces only.

In the Changing NIC attributes section of this document, we determined that the gpu#_eth interfaces in our servers are Broadcom BCM957608 NICs (shown below).

All the steps for configuring Class of Service in this section will be focused on these Broadcom interfaces.

We will be using a combination of Linux system commands and Broadcom tools to enable, tune, and monitor DCQCN ECN/PFC operation and RoCE traffic marking. For some of these commands we will need the Broadcom interface name associated with each gpu interface. Follow these steps to find these mappings:

  1. Find the PCI address of each gpu#_eth interface using the following logic:

    EXAMPLE:

  2. Find the bnxt_re# (#=0-7) device that corresponds to each PCI address using the following logic:

    EXAMPLE:

  3. Map the GPU interfaces to the bnxt_re# or mlx5_# interface names.

Combine the outputs from steps 1 and 2 to create a full mapping from gpu#_eth to bnxt_re# or mlx5_#. You can see from the outputs that, for example, gpu0_eth corresponds to bnxt_re3 (0000:66:00.0).

You can use the following logic to simplify the process:

EXAMPLE:
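One way to sketch that combined logic (an assumption, not the lab's exact script) is to resolve the PCI device behind each gpu NIC and match it against the PCI device behind each RDMA device in sysfs:

```shell
# map_rdma_dev: print "gpu#_eth => <pci-address> => <rdma-device>" by matching
# the PCI device behind each gpu NIC with the one behind each RDMA device.
map_rdma_dev() {
    local root=${1:-/sys} net ib pci
    for net in "$root"/class/net/gpu*_eth; do
        [ -e "$net/device" ] || continue
        pci=$(basename "$(readlink -f "$net/device")")
        for ib in "$root"/class/infiniband/*; do
            [ -e "$ib/device" ] || continue
            if [ "$(basename "$(readlink -f "$ib/device")")" = "$pci" ]; then
                printf '%s => %s => %s\n' "$(basename "$net")" "$pci" "$(basename "$ib")"
            fi
        done
    done
}

map_rdma_dev /sys
```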

Configuring DCQCN-ECN/PFC and TOS/DSCP attributes for RDMA traffic in AMD servers (Broadcom interfaces)

Some of the parameters related to DCQCN-ECN/PFC and TOS/DSCP are listed in the following table:

Table 25. Server DCQCN configuration parameters

PARAMETER          DESCRIPTION                                                                    DEFAULT
cc_mode            0 for Deterministic Marking (DCQCN-D); 1 for Probabilistic Marking (DCQCN-P)   1
cnp_ecn            Marks CNP packets as ECN-eligible                                              0x1 (enabled)
cnp_dscp           DSCP value for RoCE congestion notification packets                            48
cnp_prio           Priority for RoCE congestion notification packets                              7
cnp_ratio_th       Threshold ratio for generating CNPs; determines the rate at which CNPs are
                   sent in response to congestion, controlling the aggressiveness of the
                   feedback mechanism                                                             0x0
ecn_enable         Enables congestion control                                                     0x1 (enabled)
ecn_marking        Enables tagging of packets as ECN-enabled (ECN = 01)                           0x1 (enabled)
default_roce_mode  Default RoCE mode for RDMA                                                     RoCE v2
default_roce_tos   Default ToS value for RDMA traffic                                             104
roce_dscp          DSCP value for RoCE packets                                                    26
roce_prio          Priority for RoCE packets                                                      3
rtt                Time period (µs) over which CNP and transmitted packet counts accumulate; at
                   the end of rtt, the ratio between CNPs and TxPkts is computed and the CP is
                   updated                                                                        40 µs

BCM95741X Ethernet network adapters support three transmit and receive queues for each Ethernet port: 0, 4, and 5.

BCM95750X Ethernet network adapters support eight transmit and receive queues for each Ethernet port: 0 through 7.

By default, all queues are configured for weighted-fair-queueing (WFQ), with priority 0 traffic mapped to queue 4.

When the RoCE bnxt_re driver is loaded, CoSQ 0 is configured for lossless traffic, and CoSQ 5 is changed from WFQ to strict priority (SP) for CNP processing.

RoCE and CNP traffic can be tagged with different DSCP values or use VLAN tags instead.

By default, the RoCE ToS field is set to 104; the upper six bits of the ToS byte correspond to DSCP 26, and the lower two bits carry the ECN field.
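The relationship between a ToS byte and its DSCP/ECN fields can be checked with shell arithmetic:

```shell
# The ToS byte packs DSCP in the upper six bits and ECN in the lower two.
tos=104                  # default_roce_tos from the table above
dscp=$(( tos >> 2 ))     # upper 6 bits -> 26, matching roce_dscp
ecn=$(( tos & 3 ))       # lower 2 bits -> the ECN field
echo "ToS $tos -> DSCP $dscp, ECN $ecn"
```

The same arithmetic lets you derive the ToS value to write for any desired DSCP: tos = (dscp << 2) | ecn.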

These parameters can be adjusted using three different methods:

  • Configuring DCQCN/RDMA marking values directly (by editing the configfs files).
  • Configuring DCQCN/RDMA marking values using Broadcom tools such as niccli or lldptool.
  • Configuring DCQCN/RDMA marking values using the bnxt_setupcc.sh utility, which uses either niccli or lldptool (default) behind the scenes.

The following sections describe the steps to make changes using these different options.

NOTE: Please ensure that all changes are consistent with the configuration of switches within the fabric. Example:

Configuring DCQCN-ECN/PFC and TOS/DSCP attributes for RDMA traffic directly

You can make changes to the DCQCN and traffic marking by directly editing the files that contain the values of each parameter. This method is the easiest and does not require installation of any additional tools. However, it is not an option for PFC related parameters, nor is it supported on all types of network adapters.

To complete these changes for a specific interface, you must be in the proper interface directory. Follow these steps:

  1. Create interface directories for qos related values

    We determined the mappings between the gpu#_eth interfaces and the corresponding Broadcom interface names

    GPU-to-NIC Mapping:

    gpu0_eth => 0000:06:00.0 => bnxt_re0

    gpu1_eth => 0000:23:00.0 => bnxt_re1

    gpu2_eth => 0000:43:00.0 => bnxt_re2

    gpu3_eth => 0000:66:00.0 => bnxt_re3

    gpu4_eth => 0000:86:00.0 => bnxt_re4

    gpu5_eth => 0000:a3:00.0 => bnxt_re5

    gpu6_eth => 0000:c3:00.0 => bnxt_re6

    gpu7_eth => 0000:e6:00.0 => bnxt_re7

    We will use the Broadcom interface names to create the directories (rdma_cm and bnxt_re) where the DCQCN attributes as well as other parameters and statistics will be located for each interface.

    The interface specific directories do not exist until created using the following commands:

    Notice that these two directories must be present.

    If the rdma_cm directory for example is missing, try the following:

    EXAMPLE:

    Repeat these steps for all the gpu interfaces.

    NOTE: You must be a root user to make these changes.

    The new directories will contain values pertaining to ECN, ROCE traffic, and other functions:

    You can find a description of some of these parameters, as well as their current values, by using cat within the /sys/kernel/config/bnxt_re/bnxt_re0/ports/1/cc directory.

    EXAMPLE:

  2. Enable RoCEv2 operation.

    Even though RoCEv2 should be the default mode, the command to enable RoCEv2 is shown here.

    NOTE: This change is made under the rdma_cm directory
    NOTE: Enter the value exactly as shown, including the space: “RoCE v2” (case sensitive).

    After setting the parameter, apply the new values as follows:

    Verify the changes:

  3. Enable ECN response and notification functions.

    Even though ECN should be enabled by default, the command to enable ECN is shown here.
NOTE: This change is made under the bnxt_re0 directory.
If needed, you can disable ECN by writing 0x0 to the same attribute instead.

When ECN is enabled on the Broadcom interfaces, they will respond to CNP packets (RP) and will generate CNP packets when ECN-marked packets are received (NP).

To disable it, enter echo -n 0x0 > cnp_ecn instead.

After setting the parameter, apply the new values:

Verify the changes:

You can also enable the marking of both CNP and RoCE packets as ECN-eligible (meaning these packets can be marked across the network when congestion occurs).

To summarize these attributes:

ecn_enable    Enables/disables the RP (response point) side of ECN; lets the device respond to CNP packets. Default = 1 (enabled).
cnp_ecn       Configures marking of CNP packets as ECN-eligible (ECT field value of 01 or 10).
ecn_marking   Configures marking of RoCE packets as ECN-eligible (ECT field value of 01 or 10).
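These three attributes can be set together in one pass. A sketch, assuming root privileges and the configfs path created in step 1 (CC_DIR is an illustrative variable, not part of the driver interface):

```shell
# Set the three ECN-related attributes summarized above on one interface.
CC_DIR=${CC_DIR:-/sys/kernel/config/bnxt_re/bnxt_re0/ports/1/cc}

set_cc() {   # set_cc <attribute> <value>: write one congestion-control attribute
    printf '%s' "$2" > "$CC_DIR/$1"
}

if [ -d "$CC_DIR" ]; then
    set_cc ecn_enable  0x1   # RP side: respond to received CNPs
    set_cc cnp_ecn     0x1   # mark CNP packets as ECN-eligible
    set_cc ecn_marking 0x1   # mark RoCE packets as ECN-eligible
fi
```

Repeat for bnxt_re1 through bnxt_re7 by pointing CC_DIR at each interface's directory.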
  4. Configure the DSCP and PRIO values for CNP and RoCEv2 packets.

    NOTE: Configuring these values manually, as shown below, is not an option for all types of Broadcom interface cards. For example, on BCM95741X devices you can use this method to configure the ECN and RoCE priority values, but on BCM95750X/BCM957608 devices not all of these parameters can be set directly.

    See Broadcom Ethernet Network Adapter Congestion Control Parameters.

    NOTE: These changes are made under the bnxt_re0 directory.
    NOTE: The following error indicates that changing the value of a parameter directly is not supported. On the BCM957608, roce_prio and cnp_prio need to be configured using bnxt_setupcc.sh (described later).

    After setting the parameter, apply the new values:

    Verify the changes:

  5. Configure the DCQCN algorithm (under the bnxt_re directory).

    The default DCQCN congestion control (cc_mode) algorithm in Broadcom Ethernet network adapters is DCQCN-P. The mode can be changed using these commands:

    NOTE: This change is made under the bnxt_re0 directory.

    To use DCQCN-P, configure:

    To use DCQCN-D, configure:

  6. Check all the attributes that were configured.

The following command shows all the interface parameters:
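One hedged way to list every attribute at once is to walk the configfs directory (CC_DIR is an illustrative variable matching the per-interface path used earlier in this section):

```shell
# show_cc: print every congestion-control attribute under the given directory
# with its current value.
show_cc() {
    local dir=${1:-/sys/kernel/config/bnxt_re/bnxt_re0/ports/1/cc} f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        printf '%-20s %s\n' "$(basename "$f"):" "$(cat "$f" 2>/dev/null)"
    done
}

show_cc
```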

For more information on the DCQCN algorithm in Broadcom Ethernet network adapter check the following documents: Changing Congestion Control Mode Settings and RoCE Congestion Control

EXAMPLE:

We have highlighted some ECN/CNP related parameters:

Configuring DCQCN-ECN/PFC and TOS/DSCP attributes for RDMA traffic using niccli

You can make changes to the DCQCN and traffic marking using the NICCLI Configuration Utility.

niccli is a management tool for Broadcom Ethernet network adapters that provides detailed information, including type, status, serial number, and firmware version. It also enables the configuration of interface attributes such as DCQCN-ECN, PFC, and TOS/DSCP for optimizing RDMA traffic.

NOTE: The niccli tools need to be installed in your system.

Installing the NICCLI Configuration Utility

You can obtain a summary of the interface adapters and Ethernet ports present on the server that can be managed with niccli by using niccli listdev or list-eth, as shown in the example below.

You can use niccli in one-line mode, interactive mode, or batch mode. The niccli -h help provides a high-level description of these modes. In this section, we show examples of how to use the one-line and interactive modes for DCQCN-ECN, PFC, and TOS/DSCP configuration.

Entering niccli with no options allows you to work in the interactive mode, where you select an adapter/interface (by index) and then the proper <command> (e.g., show, get_qos, set_map) to obtain information or make changes to the selected interface.

You can identify the interface index corresponding to each interface using the method described in the Mapping Broadcom interface name with logical interface name section. This gives you the mappings between interfaces and PCIe addresses, which you can then correlate with the niccli output below.

Once identified, enter the interface index (first column in the output) as shown in the example below.

EXAMPLE:

Entering the one-line form allows you to issue the same commands, specifying the target interface and then the command, all on one line. The niccli -list command can be used to determine the interface index.

EXAMPLE

The sudo niccli help provides an extensive list of commands and options available for both interactive and one-line mode.

NOTE: We will use the one-line mode for all the examples below to obtain information and make configuration changes.

The following examples show you how to use niccli to obtain information about a specific interface.

  1. Check interface status.

    The niccli -i <interface> show command provides details about the interface, such as type, MAC address, firmware, serial number, device health, and temperature.

    EXAMPLE:

  2. Check QoS settings.

    These commands show the mappings between DSCP and priority values, and between priority values, traffic classes (TC), and the output queues.

The outputs in the example show the defaults for:

  • Queues status. Only queues 0, 1, and 2 are enabled.
  • Priority to DSCP mappings: priority 7 => DSCP 48 & priority 3 => DSCP 26.
  • Priority to TC (traffic class) and queue mappings: priority 7 => TC2 (queue 0) => DSCP 48 & priority 3 => TC1 (queue 5) => DSCP 26.
NOTE: The output might be confusing; the Queue ID displayed is an internal CoS queue number. It really means that queuing for traffic classes 0, 1, and 2 is enabled, and all other traffic classes are disabled.
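As a sanity check when correlating these DSCP values with packet captures, note that the ToS byte seen on the wire is the DSCP value shifted left by two bits (the low two bits carry ECN):

```shell
# ToS byte = DSCP << 2
printf 'DSCP 48 -> ToS 0x%02x\n' $((48 << 2))   # DSCP 48 -> ToS 0xc0
printf 'DSCP 26 -> ToS 0x%02x\n' $((26 << 2))   # DSCP 26 -> ToS 0x68
```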

The sudo niccli -i <interface-index> get_qos command provides a summary of the QoS configuration on the interface.

EXAMPLE:
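For instance, with interface index 1 (an assumed index for this setup):

```shell
sudo niccli -i 1 get_qos
```

The fields of the resulting output are described below.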

IEEE 802.1Qaz ETS Configuration TLV: shows the Enhanced Transmission Selection (ETS) configuration.

  • PRIO_MAP: 0:0 1:0 2:0 3:1 4:0 5:0 6:0 7:2
    Maps priorities to Traffic Classes (TC): priorities 0, 1, 2, 4, 5, and 6 map to TC 0; priority 3 maps to TC 1; priority 7 maps to TC 2.
  • TC Bandwidth: 50% 50% 0%
    Allocates bandwidth percentages to traffic classes: TC 0 gets 50% of the total bandwidth, TC 1 gets 50%, and TC 2 gets 0%.
  • TSA_MAP: 0:ets 1:ets 2:strict
    Specifies the Transmission Selection Algorithm (TSA) used for each TC: TC 0 and TC 1 use ETS (Enhanced Transmission Selection) and share the available bandwidth 50/50, while TC 2 uses strict priority, meaning TC 2 traffic is always sent first. Together with TC Bandwidth, TSA_MAP allocates resources and defines the service priority for each TC. It is equivalent to schedulers and scheduler-maps in Junos.

IEEE 802.1Qaz PFC TLV: shows the Priority-based Flow Control (PFC) configuration.

  • PFC enabled: 3
    Indicates that PFC is enabled on priority 3 only. PFC ensures that traffic with this priority can be paused instead of dropped during congestion.

IEEE 802.1Qaz APP TLV: defines traffic classification using the APP TLV (Type-Length-Value) format. It maps traffic to priorities and is equivalent to multifield classifiers in Junos.

  • APP#0 (Priority: 7, Sel: 5, DSCP: 48)
    Traffic marked with DSCP 48 is mapped to priority 7.
  • APP#1 (Priority: 3, Sel: 5, DSCP: 26)
    Traffic marked with DSCP 26 is mapped to priority 3.
  • APP#2 (Priority: 3, Sel: 3, UDP or DCCP: 4791)
    UDP or DCCP traffic with port 4791 (RoCEv2) is mapped to priority 3.

TC Rate Limit: 100% 100% 100% 0% 0% 0% 0% 0%

  • TC 0, TC 1, and TC 2 can use up to 100% of the bandwidth allocated to them. TC 3 through TC 7 are set to 0%, meaning they are not currently configured to transmit traffic.

If needed, change the priority-to-traffic-class mappings or the application-to-traffic-class mappings.

We recommend keeping the default settings and making sure they are consistent with the class-of-service configuration on the leaf nodes in the GPU backend fabric. If there is a requirement to change these mappings, the following commands can be used:

Priority to traffic class mappings

EXAMPLE:

Applications to traffic class mappings

EXAMPLE:

If needed, change ETS configuration attributes.

We recommend keeping the default settings and making sure they are consistent with the class-of-service configuration on the leaf nodes in the GPU backend fabric.

EXAMPLE:

If needed, configure PFC

EXAMPLE:

The following command attempts to enable PFC on priorities 5 and 6, and demonstrates that only one queue (one priority) can be configured as a lossless (PFC-enabled) queue.

Configuring DCQCN and RoCE traffic marking values using bnxt_setupcc.sh

The bnxt_setupcc.sh utility simplifies enabling or disabling both ECN and PFC, and changing the DSCP and priority values for both RoCE and CNP packets on a given interface. Under the hood it uses either niccli (the default) or lldptool, which can be selected as part of the command.

You need to enter bnxt_setupcc.sh followed by your selected options as described in the help menu:

EXAMPLE:

The default DSCP marking for CNP packets for interface gpu0 (bnxt_re0) is 0 as shown in the output below:

bnxt_setupcc.sh can be used to change it to the value expected by the fabric (48) as follows:

Where:

  • -u 3: Uses the Broadcom niccli utility.
  • -p 48: Sets the DSCP value for CNP packets to 48 (0x30).
  • -c 6: Sets the priority for CNP packets to 6.
  • -s 26: Sets the DSCP value for regular RoCE packets to 26 (0x1a).
  • -r 5: Sets the priority for regular RoCE packets to 5.
  • -m 3: Configures both PFC and congestion control (ECN).
NOTE: The device option (-i) is required for the script to complete. Also, you cannot configure only one of the DSCP/PRIO values: you must configure the CNP-DSCP value (-p), the CNP-PRI value (-c), the RoCE-DSCP value (-s), and the RoCE-PRIO value (-r) for the command to work.
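Putting the options above together, the full command would look like the following (the device name bnxt_re0 is an example from this setup):

```shell
# Set CNP packets to DSCP 48 / priority 6, RoCE packets to DSCP 26 /
# priority 5, and enable both PFC and ECN, using niccli under the hood
sudo bnxt_setupcc.sh -i bnxt_re0 -u 3 -p 48 -c 6 -s 26 -r 5 -m 3
```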

Verify the results with:

NOTE: Make sure that bnxt_setupcc.sh is installed and executable, and that at least one of the underlying tools (niccli or lldptool) is installed as well.

The following example shows that bnxt_setupcc.sh and niccli are installed, but lldptool is not. It also shows examples of installing and using the lldptool.

The lldptool is used to check or modify the LLDP (Link Layer Discovery Protocol) settings. To enable LLDP you need to install lldpad, which also installs lldptool automatically.

To install lldpad and lldptool follow these steps:

  1. Install required dependencies.

    Before installing lldpad, ensure that the necessary libraries are installed by running the following command:

    • libconfig9 – A configuration file processing library.
    • libnl-3-200 – A library for interacting with the Linux Netlink interface.
  2. Install lldpad.

    Install lldpad by running the following command:

    This package enables LLDP on the system, allowing it to exchange network topology information with other devices.

  3. Enable lldpad.

    Enable lldpad using systemctl:

    This creates a system service that ensures lldpad is always running after a reboot.

  4. Start the lldpad service.

    Activate lldpad using systemctl:

    This activates lldpad immediately, allowing it to process LLDP packets.

    NOTE:
    To restart lldpad manually, use: sudo systemctl restart lldpad
    To disable lldpad from starting at boot, use: sudo systemctl disable lldpad
  5. Verify the installation.

Check the service status using systemctl

This ensures that the tool is installed and ready to use. If everything is working properly, you should see an "active (running)" status.
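The installation steps above can be summarized as the following command sequence (assuming a Debian or Ubuntu system with apt, as used in this lab):

```shell
# 1. Install the required dependencies
sudo apt-get install -y libconfig9 libnl-3-200

# 2. Install lldpad (lldptool is installed along with it)
sudo apt-get install -y lldpad

# 3. Enable lldpad so it starts automatically after a reboot
sudo systemctl enable lldpad

# 4. Start the service now
sudo systemctl start lldpad

# 5. Verify: the output should show "active (running)"
systemctl status lldpad
```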

You can use lldptool to enable or disable LLDP on an interface, and to check the LLDP status and the neighbors discovered on that interface. The lldptool -h command shows all the available options:
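For example, LLDP can be enabled and checked on an interface as follows (gpu0_eth is an example interface name from this setup):

```shell
# Enable LLDP transmit and receive on the interface
sudo lldptool set-lldp -i gpu0_eth adminStatus=rxtx

# Confirm the admin status
sudo lldptool get-lldp -i gpu0_eth adminStatus

# Display the TLVs received from the neighbor on that interface
sudo lldptool get-tlv -n -i gpu0_eth
```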

Check the Installing and Configuring Software Manually section of the Broadcom Ethernet Network Adapter User Guide or Installing the NICCLI Configuration Utility for more details.

Monitor interface and ECN/PFC operation:

Once you have the Broadcom name for a particular GPU interface, as described at the beginning of this section, you can locate the directories that hold the interface's operational status, RoCE traffic statistics, and congestion control statistics.

  1. Navigate to the corresponding directory

/sys/class/infiniband/<Broadcom-interface-name>

EXAMPLE:

For gpu0_eth:

Here you can check attributes such as operational state, address, MTU, and speed, as well as interface statistics, including transmitted and received packets, dropped packets, ECN-marked packets, and CNP packets received and transmitted:
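A sketch of these checks, assuming gpu0_eth maps to the Broadcom RoCE device bnxt_re0 (an example mapping for this setup):

```shell
# RoCE device directory with per-port counters
ls /sys/class/infiniband/bnxt_re0/ports/1/

# Basic attributes of the associated net device
cat /sys/class/net/gpu0_eth/operstate
cat /sys/class/net/gpu0_eth/mtu
cat /sys/class/net/gpu0_eth/speed
```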

To check ECN statistics, check the related counters for the specific interface:

To check PFC statistics use:

EXAMPLE:
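A sketch of both checks (the device name bnxt_re0 is an example, and the counter names vary with the bnxt_re driver version):

```shell
# ECN/CNP counters exposed by the RoCE device
grep -i -r cnp /sys/class/infiniband/bnxt_re0/ports/1/hw_counters/

# PFC pause statistics on the Ethernet interface
ethtool -S gpu0_eth | grep -i -E 'pfc|pause'
```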

Configuring the server to use the management interface for RCCL control traffic:

The ROCm Communication Collectives Library (RCCL) creates TCP sessions to coordinate processes and exchange Queue Pair information for RoCE: GIDs (Global IDs), local and remote buffer addresses, and RDMA keys (RKEYs, for memory access permissions).

NOTE: This traffic is separate from the RoCEv2 traffic (port 4791) and is used for synchronizing model parameters, partial results of operations, etc.

These TCP sessions are created when the job starts and by default use one of the GPU interfaces (same interfaces used for RoCEv2 traffic).

Example:

It is recommended to use the management interface, which is connected to the Frontend Fabric. To achieve this, include the following when starting a job: export NCCL_SOCKET_IFNAME="mgmt_eth". The same environment variable applies to both NCCL and RCCL.

Example:

NOTE: ECN is enabled by default for these sessions (net.ipv4.tcp_ecn = 1), but it can be disabled with: sudo sysctl -w net.ipv4.tcp_ecn=0
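The two settings above can be applied as follows before launching a job:

```shell
# Pin RCCL/NCCL control-plane TCP sessions to the management interface
export NCCL_SOCKET_IFNAME="mgmt_eth"

# Check whether ECN is enabled for TCP (1 = enabled, the default here)
sysctl net.ipv4.tcp_ecn

# Disable it if required
sudo sysctl -w net.ipv4.tcp_ecn=0
```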

AMD Pollara DCQCN configuration for RDMA Traffic

For the AMD Pollara validation, DCQCN needs to be enabled and QoS applied on the AMD NIC cards.

  1. Configure QoS on the NICs using the script. The DSCP parameters are equivalent to the values suggested in Table 25, Server DCQCN configuration parameters.
  2. Using the AMD nicctl command-line utility, configure the QoS parameters shown below:
  3. The rdma link command can be used to check whether the RoCE devices associated with the AMD Pollara NIC cards exist.

    The RoCE devices are created when the ionic_rdma kernel module is loaded; a RoCE device should be created for each NIC card.

  4. To configure DCQCN on the AMD Pollara NICs, run the script below with the appropriate parameters.
  5. Using the nicctl command, check the DCQCN profile for each RoCE device.
  6. Finally, run the rccl_test.sh script as shown below. The example shows the tests run for "All reduce".
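For step 3 above, the RoCE-device association can be checked with the iproute2 rdma tool (the module name ionic_rdma comes from the steps above):

```shell
# Confirm the ionic_rdma kernel module is loaded
lsmod | grep ionic

# List RDMA link state; one RoCE device should appear per Pollara NIC port
rdma link show
```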