GPU Backend Fabric

The GPU Backend fabric provides the infrastructure for GPUs to communicate with each other within a cluster, using RDMA over Converged Ethernet version 2 (RoCEv2). RoCEv2 boosts data center efficiency, reduces overall complexity, and increases data delivery performance by enabling the GPUs to communicate over Ethernet much as they would with the InfiniBand protocol.

Packet loss can increase job completion times and should be avoided. Therefore, when designing the network infrastructure to support RoCEv2 for an AI cluster, one of the key objectives is to provide a lossless fabric while also achieving maximum throughput, minimal latency, and minimal network interference for the AI traffic flows. RoCEv2 is more efficient over lossless networks, resulting in optimal job completion times.

The GPU Backend fabric in this JVD was designed with these goals in mind and follows a 3-stage IP Clos architecture combined with NVIDIA's Rail-Optimized Stripe Architecture (discussed in the next section), as shown in Figure 5.

Figure 5: GPU Backend Fabric Architecture

The GPU Backend devices included in this fabric, and the connections between them, are summarized in the following table:

Table 3: GPU Backend devices per cluster and stripe

Cluster | Stripe | Nvidia DGX GPU Servers | GPU Backend Leaf Nodes switch model (gpu-backend-leaf#) | GPU Backend Spine Nodes switch model (gpu-backend-spine#)
1 | 1 | A100-01 to A100-04 | QFX5230-64CD x 8 | QFX5230-64CD x 2 or PTX10008 with JNP10K-LC1201 (shared by both stripes)
1 | 2 | A100-05 to A100-08 | QFX5220-32CD x 8 | (same spine nodes as stripe 1)
2 | 1 | H100-01 to H100-02 | QFX5240-64OD x 8 | QFX5230-64OD x 4 (shared by both stripes)
2 | 2 | H100-03 to H100-04 | QFX5240-64OD x 8 | (same spine nodes as stripe 1)

Table 4: Connections between servers, leaf and spine nodes per cluster and stripe in the GPU Backend

Cluster | Stripe | GPU Servers <=> GPU Backend Leaf Nodes | GPU Backend Spine Nodes <=> GPU Backend Leaf Nodes
1 | 1 | 1 x 200GE link between each A100 server and each leaf node (200GE x 8 links per server) | 2 x 400GE links between each leaf node and each spine node (2 x 400GE x 2 links per leaf node)
1 | 2 | 1 x 200GE link between each A100 server and each leaf node (200GE x 8 links per server) | 2 x 400GE links between each leaf node and each spine node (2 x 400GE x 2 links per leaf node)
2 | 1 | 1 x 400GE link between each H100 server and each leaf node (400GE x 8 links per server) | 2 x 400GE links between each leaf node and each spine node (2 x 400GE x 4 links per leaf node)
2 | 2 | 1 x 400GE link between each H100 server and each leaf node (400GE x 8 links per server) | 2 x 400GE links between each leaf node and each spine node (2 x 400GE x 4 links per leaf node)

  • All the Nvidia A100 servers in the lab are connected to the QFX5220 and QFX5230 leaf nodes in cluster 1 using 200GE interfaces, while the H100 servers are connected to the QFX5240 leaf nodes in cluster 2 using 400GE interfaces.
  • This fabric is a pure L3 IP fabric (either IPv4 or IPv6) that uses EBGP for route advertisement (described in the networking section).
  • Connectivity between the servers and the leaf nodes is L2 VLAN-based, with an IRB interface on the leaf nodes acting as the default gateway for the servers (described in the networking section).
The speed and number of links between the GPU servers and the leaf nodes, and between the leaf and spine nodes, determine the oversubscription factor. As an example, consider the number of GPU servers available in the lab and how they are connected to the GPU Backend fabric as described above.

Table 5: Per cluster, per stripe Server to Leaf Bandwidth

Server to Leaf Bandwidth per Stripe (per Cluster)
Cluster | AI Systems (server type) | Servers per Stripe | Server <=> Leaf Links per Server | Bandwidth of Server <=> Leaf Links [Gbps] | Total Bandwidth Servers <=> Leaf per Stripe [Tbps]
1 | A100 | 4 | 8 | 200 | 4 x 8 x 200/1000 = 6.4
2 | H100 | 2 | 8 | 400 | 2 x 8 x 400/1000 = 6.4

Table 6: Per cluster, per stripe Leaf to Spine Bandwidth

Leaf to Spine Bandwidth per Stripe
Leaf <=> Spine Links per Spine Node & per Stripe | Speed of Leaf <=> Spine Links [Gbps] | Number of Spine Nodes | Total Bandwidth Leaf <=> Spine per Stripe [Tbps]
8 | 2 x 400 | 2 | 12.8

The (over)subscription ratio is calculated by comparing the numbers from the two tables above.

In cluster 1, the bandwidth between the servers and the leaf nodes is 6.4 Tbps per stripe, while the bandwidth available between the leaf and spine nodes is 12.8 Tbps per stripe. This means the fabric has enough capacity to carry all traffic between the GPUs even if this traffic were 100% inter-stripe, while still having spare capacity to accommodate additional servers without becoming oversubscribed.
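As an illustration only, the following Python sketch reproduces the per-stripe arithmetic above for cluster 1. The function name and parameters are hypothetical and are not part of the validated design; the input values are taken from Tables 5 and 6.

```python
def per_stripe_subscription(servers, links_per_server, server_link_gbps,
                            leaf_nodes, links_per_leaf_per_spine,
                            leaf_link_gbps, spine_nodes):
    """Return (server-to-leaf Tbps, leaf-to-spine Tbps, subscription factor)."""
    server_to_leaf = servers * links_per_server * server_link_gbps / 1000
    leaf_to_spine = (leaf_nodes * links_per_leaf_per_spine
                     * leaf_link_gbps * spine_nodes / 1000)
    return server_to_leaf, leaf_to_spine, server_to_leaf / leaf_to_spine

# Cluster 1, one stripe: 4 A100 servers with 8 x 200GE links each;
# 8 leaf nodes, each with 2 x 400GE links to each of the 2 spine nodes.
print(per_stripe_subscription(4, 8, 200, 8, 2, 400, 2))
# -> (6.4, 12.8, 0.5): the stripe is undersubscribed, leaving spare leaf-to-spine capacity
```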

Figure 6: Extra Capacity Example

We also tested connecting the H100 GPU servers alongside the A100 servers to the stripes in Cluster 1, as follows:

Figure 7: 1:1 Subscription Example

Table 7: Per cluster, per stripe Server to Leaf Bandwidth with all servers connected to same cluster

Server to Leaf Bandwidth per Stripe
Cluster | AI Systems | Servers per Stripe | Server <=> Leaf Links per Server | Server <=> Leaf Links Bandwidth [Gbps] | Total Servers <=> Leaf Links Bandwidth per Stripe [Tbps]
1 | A100 | 4 | 8 | 200 | 4 x 8 x 200/1000 = 6.4
1 | H100 | 2 | 8 | 400 | 2 x 8 x 400/1000 = 6.4
Total Bandwidth of Server <=> Leaf Links: 12.8

The bandwidth between the servers and the leaf nodes is now 12.8 Tbps per stripe, while the bandwidth available between the leaf and spine nodes is also 12.8 Tbps per stripe (as shown in Table 6 above). This means the fabric still has enough capacity to carry all traffic between the GPUs even if this traffic were 100% inter-stripe, but there is no longer any spare capacity to accommodate additional servers. The subscription factor in this case is 1:1 (no oversubscription).
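For reference, a minimal back-of-the-envelope check of this 1:1 case, using only the figures from Tables 6 and 7:

```python
# 1:1 case: A100 and H100 servers share the cluster 1 stripes
server_to_leaf_tbps = (4 * 8 * 200 + 2 * 8 * 400) / 1000   # 12.8 Tbps (Table 7)
leaf_to_spine_tbps = 8 * 2 * 400 * 2 / 1000                 # 12.8 Tbps (Table 6)
print(server_to_leaf_tbps / leaf_to_spine_tbps)             # 1.0 -> 1:1 subscription
```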

To run oversubscription testing, we disabled some of the interfaces between the leaf and spine nodes to reduce the available bandwidth, as shown in the example in Figure 8:

Figure 8: 2:1 Oversubscription Example

The total server-to-leaf bandwidth per stripe has not changed. It is still 12.8 Tbps, as shown in Table 7 in the previous scenario.

However, the bandwidth available between the leaf and spine nodes is now only 6.4 Tbps per stripe.

Table 8: Per Stripe Leaf to Spine Bandwidth

Leaf to Spine Bandwidth per Stripe
Leaf <=> Spine Links per Spine Node & per Stripe | Speed of Leaf <=> Spine Links [Gbps] | Number of Spine Nodes | Total Bandwidth Leaf <=> Spine per Stripe [Tbps]
8 | 1 x 400 | 2 | 6.4

This means that the fabric no longer has enough capacity to carry all traffic between the GPUs if this traffic were 100% inter-stripe, potentially causing congestion and traffic loss. The oversubscription factor in this case is 2:1.
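A corresponding sketch of the 2:1 arithmetic, again using only the figures from Tables 7 and 8:

```python
# 2:1 case: one of the two 400GE links per leaf-to-spine pair disabled
server_to_leaf_tbps = 12.8                        # unchanged (Table 7)
leaf_to_spine_tbps = 8 * 1 * 400 * 2 / 1000       # 6.4 Tbps (Table 8)
print(server_to_leaf_tbps / leaf_to_spine_tbps)   # 2.0 -> 2:1 oversubscription
```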