AMD Configuration

The AI servers covered as part of this JVD include two Supermicro AS-8125GS-TNMR2 dual AMD EPYC 8U GPU servers and two Dell PowerEdge XE9680 servers.

This section provides guidelines for installing and configuring the interfaces and other relevant parameters, based on the AI JVD lab testing. Always refer to the official manufacturer documentation for more details and when making changes.

AMD MI300X: Setting BIOS Parameters

Each vendor has different BIOS settings, reflecting differences in its UI, its GPU mappings, and the servers' internal architectures.

Supermicro AS-8125GS-TNMR2

Boot the server into Setup mode (the Supermicro splash screen can take several minutes to appear):

UEFI/BIOS Area: Advanced -> NB Configuration
  ACS Enable = Disable

UEFI/BIOS Area: Advanced -> NB Configuration -> xGMI
  xGMI Link Width Control = Manual
  xGMI Force Link Width Control = Force
  xGMI Force Link Width = 2
  xGMI Max Link Width Control = Manual
  xGMI Link Max Speed = Auto

UEFI/BIOS Area: Advanced -> PCIe/PCI/PnP Configuration
  Above 4G Decoding = Enabled
  Re-Size BAR Support = Enabled
  SR-IOV Support = Enabled
  Workload = Not Configured

Dell PowerEdge XE9680

The following BIOS settings are recommended by Dell for the XE9680 AI/ML server. These BIOS settings also disable IOMMU and ACS on the host.

UEFI/BIOS Area: BIOS -> Processor Settings
  Logical Processor = Disable
  Virtualization Technology = Disable
  SubNumaCluster = Disable
  MADT Core Cluster = Linear

UEFI/BIOS Area: BIOS -> Integrated Devices
  Global SR-IOV = Disable (1)

UEFI/BIOS Area: BIOS -> System Profile Settings
  System Profile = Performance
  Workload = Not Configured

UEFI/BIOS Area: BIOS -> System Security
  AC Recovery Delay = Random (highly recommended)

(1) Dell recommends enabling Global SR-IOV, but on the Dell DUTs in this lab setup, this setting was incompatible with the Thor2 NIC port mode 0 used for the storage and frontend fabrics (2x200Gb vs. 1x400Gb), causing the DUT to fault on boot. Consult your Dell account team for recommendations about this setting in your setup.

Follow the configuration steps described in the Single-node network configuration for AMD Instinct accelerators — GPU cluster networking documentation. Note that the disable-ACS script used in step 6 must also be re-run after a server reboot, before running any workloads.
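A minimal sketch of such a disable-ACS script, assuming lspci and setpci from pciutils and root privileges, is shown below; the script in the referenced AMD documentation remains the authoritative version:

    #!/bin/bash
    # Sketch: clear the PCIe ACS control register on every device that exposes
    # the ACS extended capability. Re-run after every reboot, before workloads.
    for bdf in $(lspci | awk '{print $1}'); do
        if lspci -vvv -s "$bdf" 2>/dev/null | grep -q "Access Control Services"; then
            setpci -v -s "$bdf" ECAP_ACS+0x6.w=0000
        fi
    done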

Identifying NIC and GPU Mappings

Along with the fabric and GPU server setup, this JVD covers the configuration and setup of the Ethernet network adapters (NICs) listed below. The Broadcom BCM57608 (Thor2) Ethernet network adapter was validated in Phase 1, and the AMD Pollara 400 NIC was validated in Phase 2.

All four servers are equipped with:

And either of the following NICs:

Or

Note: Most of the setup commands apply to both the Thor2 and the AMD Pollara 400 NIC. Any differences in the steps are called out at the appropriate places within the document.

Dell devices:

AMD MI300X GPU Server and NIC Firmware and RCCL Libraries

For the purposes of the Broadcom Thor2 NIC validation, the following are the main OS and firmware versions configured on the MI300X GPU servers:

Broadcom Thor 2 Ethernet Adapter

Below are the details of the Operating System (OS), firmware and AMD libraries installed:

OS/Firmware                    Version
Ubuntu                         24.04.2 LTS
Broadcom Thor2 NIC firmware    231.2.63.0

The following libraries were installed for the RCCL tests with the Thor2 network adapter:

RCCL test library                           Version                       Command
rocm/noble                                  6.4.0.60400-47~24.04 amd64    apt list rocm
rccl (1)                                    2.22.3.60400-47~24.04 amd64   apt list rccl
Open MPI                                    5.0.8a1                       mpirun --version
UCX (https://github.com/openucx/ucx.git)    1.15.0                        /opt/ucx/bin/ucx_info -v
Note: For AMD drivers and host utilities, contact your regional AMD representative.

AMD Pensando Pollara 400 Ethernet Adapter

For the purposes of the AMD Pollara 400 NIC validation, the following are the main OS and firmware versions configured on the MI300X GPU servers:

OS/Firmware                    Version
Ubuntu                         22.04.5 LTS
AMD Pollara NIC firmware       1.110.0-a-79

Output showing the Ubuntu 22.04 version installed on the MI300X servers:

Output showing the AMD Pollara 400 NIC firmware version:

The following libraries were installed for the RCCL tests with the AMD Pollara 400 NIC adapter:

RCCL test library                           Version                       Command
rocm/jammy                                  6.3.3.60303-74~22.04 amd64    apt list rocm
rccl (1)                                    7961624
Open MPI                                    5.1.0a1                       /opt/ompi/bin/mpirun --version
UCX (https://github.com/openucx/ucx.git)    1.20.0                        /opt/ucx/bin/ucx_info -v
rccl-tests                                  revision 6704fc6              Git branch: https://github.com/ROCm/rccl-tests.git
ANP plugin (2)
Note: For AMD drivers and host utilities, contact your regional AMD representative.
(1) The RCCL library is a private build provided by AMD.
(2) The ANP plugin is a private build provided by AMD.

High-level steps for installing this software and its dependent libraries are provided later in the AMD Pollara Firmware and Dependent Libraries section, because those steps can only be performed after the NICs and GPUs are mapped as described in the sections below.

In this section, we will explore some of the options to find information about and configure the NICs and GPUs.

ROCm Communication Collectives Library (RCCL)

In AMD servers, RCCL provides multi-GPU and multi-node collective communication primitives optimized for AMD GPUs. These collectives implement operations such as all-reduce, all-gather, reduce, broadcast, and all-to-all, as well as send and receive, across multiple GPUs in one or more GPU servers.

Communication between GPUs in a single server is implemented using xGMI (inter-chip global memory interconnect), part of AMD's Infinity Fabric technology. The Infinity Fabric is a high-bandwidth, low-latency interconnect for the various components within a system including CPUs, GPUs, memory, NICs and other devices. xGMI provides socket-to-socket communication, allowing direct CPU-to-CPU or GPU-to-GPU communication.

Communication between different servers is processed by RDMA-capable NICs (e.g., RoCEv2 over Ethernet) and routed across the GPU backend fabric. These NICs can be used by any GPU at any time as there is no hard coded 1-to-1 GPU to NIC mapping. However, the use of preferred communication paths between GPUs and NICs creates the appearance of a 1:1 correspondence.

RCCL always chooses the best-connected path between GPUs, and between GPUs and NICs, aiming to optimize bandwidth and latency. The optimized intra-node path is used before traffic is forwarded inter-node.
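As an end-to-end exercise of these collectives, the rccl-tests suite (covered later in this document) can be launched under MPI. The following is only a sketch; the binary path, the /opt/ompi prefix, and the process count are assumptions that must match your build and server:

    # Sketch: 8-GPU all-reduce bandwidth/latency sweep with rccl-tests
    # (paths and process count are assumptions; adjust to your environment).
    /opt/ompi/bin/mpirun --allow-run-as-root -np 8 \
        ./rccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1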

The rocm-smi (Radeon Open Compute Platform System Management Interface) CLI provides tools for configuring and monitoring AMD GPUs. It can be used to identify GPU hardware details as well as topology information, using options such as the following (a short invocation sketch follows the list):

--showproductname: shows product details

--showtopo: shows hardware topology information

--showtopoaccess: shows the link accessibility between GPUs

--showtopohops: shows the number of hops between GPUs

--showtopotype: shows the link type between GPUs

--showtoponuma: shows the NUMA nodes

--shownodesbw: shows the NUMA nodes bandwidth

--showhw: shows the hardware details
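For instance, a first pass on a new server might capture the product, hardware, and topology views into a single file (a minimal sketch; the output file name is arbitrary):

    # Sketch: capture the rocm-smi views used in this section into one file
    {
        rocm-smi --showproductname
        rocm-smi --showhw
        rocm-smi --showtopo
        rocm-smi --showtoponuma
    } > rocm-smi-inventory.txt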

Examples from AMD Instinct MI300X OAM:

The --showproductname option shows the GPU series, model, and vendor along with additional details. The example output shows that AMD Instinct™ MI300X Platform GPUs are installed in the server.

The --showhw option shows information about the GPUs in the system, including IDs.

The fields are defined as follows:

GPU: Index of the GPU on the system, starting from 0.

NODE: NUMA (Non-Uniform Memory Access) node ID associated with the GPU. Helps identify memory locality; optimal GPU/NIC mapping often relies on NUMA proximity.

DID: Device ID of the GPU. This is a unique identifier for the specific GPU model and is useful for verifying the exact model. For example, 0x74a1 corresponds to an MI300X-series GPU.

GUID: GPU unique identifier. This value is specific to each GPU and may relate to its PCIe device. Useful for distinguishing GPUs in a multi-GPU environment.

GFX VER: The version of the GPU architecture (e.g., gfx942 is part of AMD's CDNA 3 family). In AMD GPUs, the GFX prefix is part of AMD's internal naming convention for their GPU microarchitecture families. See GPU architecture hardware specifications — ROCm Documentation.

GFX RAS: Status of GPU RAS (Reliability, Availability, Serviceability) features. Indicates error handling.

SDMA RAS: Status of SDMA (System Direct Memory Access) RAS features.

UMC RAS: Status of Unified Memory Controller (UMC) RAS features.

VBIOS: VBIOS (Video BIOS) version. Indicates the firmware version running on the GPU. An identical firmware version (113-M3000100-102) across all GPUs indicates a uniform configuration.

BUS: PCIe bus address of the GPU. Helps map the GPU to its physical slot. For example, 0000:05:00.0 is the PCIe address; it allows you to correlate GPUs to physical slots or NUMA nodes.

PARTITION ID: GPU partition or instance ID. For multi-instance GPUs (e.g., MI300X), this identifies the instance. Values of 0 indicate that no multi-instance partitioning is enabled for these GPUs.

The --showbus option shows PCI bus-related information, including the correspondence between GPU IDs and PCI bus IDs.

The --showmetrics option provides comprehensive information about GPU status and performance, including metrics such as temperature, clock frequency, power, and PCIe bandwidth.

The --showtopo option shows how the GPUs in the system can communicate with each other via xGMI (Link Type), representing one hop between any two GPUs. The weight of 15 indicates that this direct communication is the preferred path.

The link type, number of hops, and weight can also be obtained using the specific options --showtopotype, --showtopohops, and --showtopoweight:

The --shownodesbw option shows the bandwidth available for internal GPU-to-GPU communication:

For additional options and details, run rocm-smi -h.
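If you need these values in scripts rather than on screen, rocm-smi can also emit machine-readable output; the sketch below assumes your installed version supports the --json flag:

    # Sketch: dump bus and NUMA information as JSON for later correlation
    rocm-smi --showbus --showtoponuma --json > gpu-topology.json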

For more information about ROCm-SMI as well as the newer AMD-SMI CLI, see:

NIC and GPU Mappings

The next step is to map the NICs to GPUs as shown in the steps below. These steps are the same for both the Thor2 and the AMD Pollara 400 NIC.

The information from other commands can be combined with some of the options above to correlate GPUs and NICs, following these steps:

  1. Identify NUMA Nodes and GPUs

    Use the output from rocm-smi --showtoponuma or just rocm-smi --showtopo to find mappings between GPUs and NUMA nodes.

    Look for NUMA Affinity for each GPU in the output. A description of what this attribute means is included later in this section.

    Note down which GPUs are associated with which NUMA nodes.

    Example:

    GPU 0–3 → NUMA Node 0

    GPU 4–7 → NUMA Node 1

  2. Identify NUMA Nodes for NICs

    Navigate to the /sys/class/net/ directory and check the NUMA node affinity for each network interface (excluding lo or docker interfaces):

    Note the NUMA node affinity for each NIC interface.

    Example:

  3. Correlate GPUs to NICs Based on NUMA Affinity

Using the NUMA node affinity from Step 1 (GPUs) and Step 2 (NICs), map each GPU to the NICs within the same NUMA node (a combined sketch of these steps follows after this procedure):

Example:

Note: You can also use the following script to automate the steps above:

Example:
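A minimal sketch of such an automation, assuming a bash shell, rocm-smi in the PATH, and PCI NICs that expose device/numa_node in sysfs:

    #!/bin/bash
    # Sketch: list GPU and NIC NUMA affinity side by side for manual correlation.
    echo "== GPU NUMA affinity (rocm-smi) =="
    rocm-smi --showtoponuma

    echo "== NIC NUMA affinity (sysfs) =="
    for dev in /sys/class/net/*; do
        nic=$(basename "$dev")
        # Skip lo, docker, and other virtual interfaces with no PCI device entry
        [ -e "$dev/device/numa_node" ] || continue
        echo "$nic -> NUMA node $(cat "$dev/device/numa_node")"
    done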

You will notice that there is not a 1:1 GPU-to-NIC association. Instead, multiple NIC interfaces are associated with each GPU, because they share the same Non-Uniform Memory Access (NUMA) node affinity.

Systems employing a NUMA architecture contain collections of hardware resources, including CPUs, GPUs, memory, and PCIe devices (such as NICs), grouped together in what is known as a “NUMA node”. These resources are considered "local" to each other. From the point of view of a GPU, devices in the same NUMA node are the most closely associated with that GPU. The NUMA node is identified by the NUMA Affinity value.

Multiple NICs and GPUs may be connected to the same PCIe complex or switch within a NUMA node. This makes the NICs accessible to all GPUs sharing that complex. However, while all NICs in a NUMA node are accessible to any GPU in the same node, the NICs are allocated dynamically for usage by a given GPU, based on availability, traffic type, latency, and so on.

Communication Between GPUs on the Same NUMA Node (e.g., GPU1 ↔ GPU2):

GPUs on the same NUMA node (e.g., GPU1 and GPU2) communicate directly over a high-bandwidth, low-latency interconnect such as Infinity Fabric (in AMD systems).

These interconnects avoid the CPU and main memory entirely, offering much faster communication compared to NUMA-crossing communication. Since both GPUs are "local" to the same memory controller and CPU, the communication path is highly optimized.

Communication Between GPUs on Different NUMA Nodes (e.g., GPU1 ↔ GPU4):

Communication between GPUs on different NUMA nodes (e.g., GPU1 on NUMA 0 and GPU4 on NUMA 1) must traverse additional layers of the system architecture, which introduces higher latency. The path typically follows:

  • GPU1 → CPU (NUMA 0): Data is sent from GPU1 to the CPU on NUMA 0.
  • Inter-NUMA Link: The CPUs in NUMA 0 and NUMA 1 are connected via an interconnect such as Infinity Fabric or UPI (Ultra Path Interconnect).
  • CPU (NUMA 1) → GPU4: The data is forwarded from the CPU on NUMA 1 to GPU4.
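To inspect the NUMA layout and the inter-node distances that this cross-NUMA path traverses, you can use numactl (assuming the numactl package is installed):

    # Show NUMA nodes, their CPUs and memory, and the node distance matrix
    numactl --hardware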

Changing NIC attributes

This section shows you how to add or change a NIC's interface name, MTU, DNS, IP addresses, and routing table entries.

Editing and reapplying the network configuration (netplan) file

The network configuration is described in the netplan *.yaml file found under: /etc/netplan/.

Notice that the actual file name might vary. Examples:

/etc/netplan/01-netcfg.yaml

/etc/netplan/00-installer-config.yaml

Changing any interface attribute involves editing this file and reapplying the network plan. A generic sketch of the file structure is shown below, followed by the detailed steps:
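The sketch below shows the general shape of a renamed interface entry with a static address, MTU, and a route, written via a heredoc. All interface names, MAC addresses, and IP addresses are hypothetical placeholders, not the lab values:

    # Sketch: write a minimal netplan file (hypothetical interface, MAC, and addresses)
    sudo tee /etc/netplan/01-netcfg.yaml > /dev/null <<'EOF'
    network:
      version: 2
      renderer: networkd
      ethernets:
        gpu0_eth:
          match:
            macaddress: 00:11:22:33:44:55   # placeholder MAC
          set-name: gpu0_eth
          mtu: 9000                         # placeholder MTU
          addresses:
            - 10.0.1.10/24                  # placeholder address with subnet mask
          routes:
            - to: 10.0.0.0/16               # placeholder route
              via: 10.0.1.1
    EOF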

  1. Find the default names of the logical interfaces.

    You can use the following steps to achieve this:

    Thor2 NIC output:

    Interface ens31np0:

    Where

    • en: Ethernet network interface.
    • s31: indicates the physical location of the network interface on the system bus (slot number 31).
    • np0:
      • n: network (indicates that it is a network port).
      • p0: port 0 (the first port of this network interface).

    AMD Pollara 400 NIC output

    You can use the script gpunic.py to find mappings between GPUs and NICs per PCIe bus, to identify how the NICs need to be renamed for consistency.

    Example:

    To further identify the interfaces, you can use the sudo ethtool <device> | grep Speed command.

    You want to make sure that the NICs connected to the GPU backend fabric, the storage backend fabric, and the frontend fabric are 400GE, 200GE, and 100GE interfaces, respectively.

    DEFAULT INTERFACE NAME NEW NAME Speed
    enp6s0np0 gpu0_eth 400GE
    enp35s0np0 gpu1_eth 400GE
    enp67s0np0 gpu2_eth 400GE
    enp102s0np0 gpu3_eth 400GE
    enp134s0np0 gpu4_eth 400GE
    enp163s0np0 gpu5_eth 400GE
    enp195s0np0 gpu6_eth 400GE
    enp230s0np0 gpu7_eth 400GE
    enp47s0f0np0 stor0_eth 200GE
    enp47s0f0np1 stor1_eth 200GE
    enp208s0f0np0 mgmt_eth 100GE
  2. Find the interface’s MAC address:

    You can use the ip link show <device> command.

    Example:

    DEFAULT INTERFACE NAME NEW NAME MAC address
    enp6s0np0 gpu0_eth 7c:c2:55:bd:75:d0
    enp35s0np0 gpu1_eth 7c:c2:55:bd:79:20
    enp67s0np0 gpu2_eth 7c:c2:55:bd:7d:f0
    enp102s0np0 gpu3_eth 7c:c2:55:bd:7e:20
    enp134s0np0 gpu4_eth 7c:c2:55:bd:75:10
    enp163s0np0 gpu5_eth 7c:c2:55:bd:7d:c0
    enp195s0np0 gpu6_eth 7c:c2:55:bd:84:90
    enp230s0np0 gpu7_eth 7c:c2:55:bd:83:10
    enp47s0f0np0 stor0_eth 5c:25:73:66:bc:5e
    enp47s0f0np1 stor1_eth 5c:25:73:66:bc:5f
    enp208s0f0np0 mgmt_eth 5c:25:73:66:c3:ee
  3. Modify the netplan configuration file using the new names and MAC addresses determined in the previous steps.

    Example:

    Make sure to keep proper indentation, and hyphens where appropriate (e.g., before IP addresses, routes, etc.), when editing the file. For the IP addresses, make sure to include the subnet mask.

    The following is an example of the netplan configuration file for one of the MI300X servers in the lab:

  4. Save the file and apply the changes using the netplan apply command.

    jnpr@MI300X-01:/etc/netplan$ sudo netplan apply
    jnpr@MI300X-01:/etc/netplan$
  5. Verify the changes were correctly applied.

Check that the new interface names are correct:

Thor2 NIC output:

Note: Notice that the gpu#_eth (#=0-7) interfaces are Broadcom BCM57608 interfaces, while the mgmt_eth and stor#_eth interfaces are Mellanox MT2910 (ConnectX-7) interfaces. This becomes important in the next section, where we cover the interface CoS configuration.

AMD Pollara NIC output for same command:

Note: Notice that the gpu#_eth (#=0-7) interfaces are AMD Pollara 400 NIC interfaces, while the mgmt_eth and stor#_eth interfaces are Mellanox MT2910 (ConnectX-7) interfaces. This becomes important in the next section, where we cover the interface CoS configuration. The eth3 interface is the Supermicro IPMI interface.

Verify that the IP addresses were configured correctly:

OR

Check that the routes were added correctly to the routing table:

OR

Check address resolution:
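The verification steps above can also be run from the command line with standard iproute2 tools; the following is a sketch in which the interface name and addresses are placeholders:

    # Sketch: post-netplan verification (placeholder interface and addresses)
    ip -br link show gpu0_eth          # new interface name and link state
    ip -br addr show gpu0_eth          # IP address and subnet mask
    ip route                           # routing table entries
    ping -c 3 10.0.1.1                 # reachability of the placeholder gateway
    ip neigh show dev gpu0_eth         # address resolution (ARP/ND) entries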

AMD Pollara Firmware and Dependent Libraries

Note: For AMD drivers and host utilities, contact your regional AMD representative.

For brevity, the steps described here pertain only to enabling the RCCL tests with the AMD Pollara 400 NIC; they cover installing all of the dependent software and libraries required for the RCCL tests to run. The steps involve the libraries listed in the AMD MI300X GPU Server and NIC Firmware and RCCL Libraries section.

  1. Ensure that the Ubuntu OS version is 22.04, as listed in the AMD MI300X GPU Server and NIC Firmware and RCCL Libraries section.
  2. Install the RCCL library as described in the steps below. Note that the RCCL and ANP libraries are private builds provided by AMD.
  3. Install the Unified Communication Framework (UCX). UCX is an open-source, cross-platform framework designed to provide a common set of communication interfaces for various network programming models and interfaces; refer to the AMD documentation for more information.
  4. Next, install Open MPI. Note that Open MPI is obtained from a GitHub link, so GitHub credentials may be required. The Open MPI Project is an open-source Message Passing Interface implementation developed and maintained by a consortium of academic, research, and industry partners, combining expertise, technologies, and resources from across the High Performance Computing community to build the best MPI library available; refer to the Open MPI documentation for more information.
  5. Install the Pollara drivers and firmware. This is an AMD-provided firmware bundle. The bundle also installs the nicctl command-line utility, which is used to interact with the Pollara NICs and run commands such as resetting cards or configuring QoS.

    Output of the Firmware version:

  6. Once the firmware installation is complete, reset the card so that the new firmware version takes effect.

    Pollara NIC output of the firmware update:

  7. Install the ANP plugin. ANP is a plugin library designed to enhance the RCCL collective communication library with extended network transport support. The ANP plugin library is a private AMD library.
  8. Lastly, build the RCCL tests (a build sketch follows this list).
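As an illustration of the last step only, the following sketch clones and builds rccl-tests against an existing ROCm and Open MPI installation; the install prefixes and make variables are assumptions, so check the rccl-tests README for the exact options of the revision you use:

    # Sketch: build rccl-tests with MPI support (paths and variables are assumptions)
    git clone https://github.com/ROCm/rccl-tests.git
    cd rccl-tests
    make MPI=1 MPI_HOME=/opt/ompi HIP_HOME=/opt/rocm
    # The resulting binaries (e.g., build/all_reduce_perf) are run under mpirun.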

Configuring AMD DCQCN (ECN/PFC) and TOS/DSCP for RDMA Traffic

In the IP Services for AI Networks section, we discussed the need for congestion control and traffic prioritization in the backend GPU fabric to transport RoCE traffic between GPU servers. For these mechanisms to work, the servers need to be configured to react correctly to congestion notifications from both ECN and PFC, and to mark RDMA and non-RDMA traffic properly (matching the classification configuration of the fabric). This section covers how to configure the AMD servers to meet these requirements.

Congestion Control (CC) or ECN (Explicit Congestion Notification)

Congestion Control (CC) or ECN (Explicit Congestion Notification) is a standard (RFC 3168) backpressure mechanism for Ethernet network devices that signals congestion and causes traffic to temporarily slow down to avoid packet drops.

ECN for RoCE traffic relies on fabric switches that can detect congestion and implement ECN marking for traffic downstream, and on devices that can respond to these markings, as shown in Figure 53.

  • the receiving NIC or Notification Point (NP), which transmits CNP packets when it receives ECN-marked packets
  • the sending NIC or Reaction Point (RP), which receives the CNP packets and reacts accordingly.

Figure 53: DCQCN – ECN Operation

Details about the DCQCN – ECN (Congestion Control in Broadcom terminology) implementation in the BCM5741X Ethernet network adapter acting as NP and RP can be found in the following documents: Traffic Control Synopsis and RoCE Congestion Control.

Priority Flow Control (PFC)

Priority Flow Control (PFC) is a standard (IEEE 802.1Qbb) backpressure mechanism for Ethernet network devices that signals congestion and causes traffic on a particular priority to temporarily stop to avoid packet drops.

PFC for RoCE traffic relies on fabric switches that can detect congestion and generate PFC pause frames upstream, and on devices that can respond to these frames:

  • the sending NIC that receives the PFC Pause frames and reacts accordingly.

Details about the DCQCN – PFC implementation in BCM5741X Ethernet network adapters acting as RP can be found in the following documents: Traffic Control Synopsis, Priority Flow Control Feature in Ethernet Network Adapters, and Quality of Service.

Figure 54: DCQCN – PFC Operation

TOS/DSCP for RDMA Traffic

RDMA traffic must be properly marked to allow the switch to classify it correctly and place it in the lossless queue for proper treatment. Marking can be either DSCP within the IP header or PCP in the Ethernet frame VLAN-tag field. Whether DSCP or PCP is used depends on whether the interface between the GPU server and the switch is doing VLAN tagging (802.1Q) or not. Figure 55 shows how RDMA and CNP packets are marked differently and, as a result, how the fabric switch classifies and schedules the two types of packets differently.
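As a quick spot check that the host is marking RDMA traffic as intended, you can capture on the GPU interface and filter on the TOS byte. This is only a sketch; the interface name is a placeholder, and DSCP 26 is purely illustrative and should be replaced with the value your fabric classifies as lossless:

    # Sketch: show packets whose DSCP is 26 (TOS byte 0x68) on a placeholder interface
    sudo tcpdump -ni gpu0_eth -c 10 'ip[1] & 0xfc == 0x68'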

Figure 55: TOS/DSCP operation