IP Fabric GPU Backend Fabric Architecture

The GPU backend fabric in this cluster is built using Juniper QFX5240-64OD switches, which serve as both leaf and spine nodes, as shown in Figure 1. The architecture includes two stripes, each composed of eight QFX5240 leaf nodes. All leaf nodes are interconnected across four QFX5240 spine nodes, forming a Layer 3 IPv6 fabric that uses EBGP for route advertisement and native IPv6 for forwarding.

Figure 1: GPU Backend Fabric Architecture
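
For reference, the short Python sketch below enumerates the fabric inventory described above (two stripes of eight QFX5240-64OD leaf nodes sharing four QFX5240-64OD spine nodes), using the device naming shown later in Table 2. The script is illustrative only and assumes nothing beyond the counts and names already given in this document.

```python
# Illustrative inventory of the GPU backend fabric described above.
# Device names follow the convention in Table 2.

STRIPES = 2
LEAVES_PER_STRIPE = 8
SPINES = 4

# Four spine nodes, shared by both stripes.
spines = [f"gpu-backend-spine{n}" for n in range(1, SPINES + 1)]

# Eight leaf nodes per stripe, e.g. gpu-backend-001_leaf1 ... gpu-backend-001_leaf8.
leaves = {
    stripe: [f"gpu-backend-{stripe:03d}_leaf{n}" for n in range(1, LEAVES_PER_STRIPE + 1)]
    for stripe in range(1, STRIPES + 1)
}

if __name__ == "__main__":
    print("Spine nodes:", spines)
    for stripe, names in leaves.items():
        print(f"Stripe {stripe} leaf nodes:", names)
```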

The NVIDIA H100 servers, as well as the non-queue-pair-pinning (non-QPP) servers, connect to the leaf nodes over 400GE interfaces. The leaf-to-spine links are also configured as 400GE uplinks. Although the Juniper QFX5240 switches support 800GE uplinks to the spine nodes (see Table 1), the current configuration intentionally uses 400GE: with 400GE uplinks, congestion scenarios can be tested using the existing resources.

Table 1: Comparison of Oversubscription Ratios Based on Leaf-to-Spine Link Speed

| Leaf-to-Spine Link Speed | Leaf Uplink Capacity (per leaf) | Total Leaf-to-Spine BW | Server-to-Leaf BW (4 servers) | Server-to-Leaf BW (8 servers) | Oversubscription Ratio (4 servers) | Oversubscription Ratio (8 servers) |
| --- | --- | --- | --- | --- | --- | --- |
| 200 Gbps | 4 × 200G = 800 Gbps | 12.8 Tbps | 12.8 Tbps | 25.6 Tbps | 1:1 (balanced) | 2:1 (oversubscribed) |
| 400 Gbps | 4 × 400G = 1.6 Tbps | 25.6 Tbps | 12.8 Tbps | 25.6 Tbps | 1:2 (overprovisioned) | 1:1 (balanced) |
| 800 Gbps | 4 × 800G = 3.2 Tbps | 51.2 Tbps | 12.8 Tbps | 25.6 Tbps | 1:4 (heavily overprovisioned) | 1:2 (overprovisioned) |
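
The ratios in Table 1 can be reproduced with the arithmetic below. This is a minimal sketch assuming 16 leaf nodes, four spine-facing uplinks per leaf (one per spine node), and 8 x 400GE links per GPU server, as described in this section.

```python
# Sketch of the oversubscription arithmetic behind Table 1.

from fractions import Fraction

LEAF_NODES = 16
UPLINKS_PER_LEAF = 4          # one uplink per spine node
SERVER_LINKS = 8              # 400GE links per server (one per GPU)
SERVER_LINK_GBPS = 400

def oversubscription(uplink_speed_gbps: int, servers: int) -> Fraction:
    """Server-to-leaf bandwidth divided by leaf-to-spine bandwidth."""
    leaf_to_spine = LEAF_NODES * UPLINKS_PER_LEAF * uplink_speed_gbps
    server_to_leaf = servers * SERVER_LINKS * SERVER_LINK_GBPS
    return Fraction(server_to_leaf, leaf_to_spine)

for speed in (200, 400, 800):
    for servers in (4, 8):
        ratio = oversubscription(speed, servers)
        print(f"{speed}G uplinks, {servers} servers: "
              f"{ratio.numerator}:{ratio.denominator}")
```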

Currently, the lab includes four servers where Queue Pair Pinning (QPP) can be implemented. Two additional non-QPP-enabled servers per stripe were added, increasing the total server bandwidth to match the available spine uplink capacity. Additional RoCEv2 traffic is injected using an IXIA traffic generator to create realistic congestion scenarios and validate congestion control mechanisms.

A summary of the GPU backend fabric components and their connectivity is provided in Tables 2 and 3 below.

Table 2: GPU Backend Devices per Cluster and Stripe

| Stripe | GPU Servers | GPU Backend Leaf Node Switch Model | GPU Backend Spine Node Switch Model |
| --- | --- | --- | --- |
| 1 | H100 x 2 (H100-01 & H100-02) | QFX5240-64OD x 8 (gpu-backend-001_leaf#; #=1-8) | QFX5240-64OD x 4 (gpu-backend-spine#; #=1-4) |
| 2 | H100 x 2 (H100-01 & H100-02) | QFX5240-64OD x 8 (gpu-backend-002_leaf#; #=1-8) | QFX5240-64OD x 4 (gpu-backend-spine#; #=1-4) |

Table 3: GPU Backend Connections between Servers, Leaf Nodes, and Spine Nodes

| Stripe | GPU Servers <=> GPU Backend Leaf Nodes | GPU Backend Leaf Nodes <=> GPU Backend Spine Nodes |
| --- | --- | --- |
| 1 | Total 400GE links between servers and leaf nodes = 8 (GPUs per server) x 1 (400GE server-to-leaf link per GPU) x 4 (servers) = 32 | Total 400GE links between GPU backend leaf nodes and spine nodes = 8 (leaf nodes) x 2 (400GE links per leaf-to-spine connection) x 4 (spine nodes) = 64 |
| 2 | Total 400GE links between servers and leaf nodes = 8 (GPUs per server) x 1 (400GE server-to-leaf link per GPU) x 4 (servers) = 32 | Total 400GE links between GPU backend leaf nodes and spine nodes = 8 (leaf nodes) x 2 (400GE links per leaf-to-spine connection) x 4 (spine nodes) = 64 |
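
The per-stripe link counts in Table 3 follow directly from the multipliers listed there. The sketch below simply reproduces that arithmetic as given in the table; the per-connection link counts are taken from Table 3, not derived independently.

```python
# Per-stripe link counts, using the multipliers exactly as given in Table 3.

GPUS_PER_SERVER = 8
SERVER_LINKS_PER_GPU = 1      # one 400GE server-to-leaf link per GPU
SERVERS_PER_STRIPE = 4
LEAF_NODES_PER_STRIPE = 8
SPINE_NODES = 4
LINKS_PER_LEAF_SPINE_CONNECTION = 2   # as listed in Table 3

server_to_leaf_links = GPUS_PER_SERVER * SERVER_LINKS_PER_GPU * SERVERS_PER_STRIPE
leaf_to_spine_links = LEAF_NODES_PER_STRIPE * LINKS_PER_LEAF_SPINE_CONNECTION * SPINE_NODES

print(f"400GE server-to-leaf links per stripe: {server_to_leaf_links}")  # 32
print(f"400GE leaf-to-spine links per stripe: {leaf_to_spine_links}")    # 64
```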

Oversubscription Factor

The speed and number of links between the GPU servers and leaf nodes, as well as the links between the leaf and spine nodes, determine the overall oversubscription factor of the fabric.

With only the four NVIDIA H100 QPP-enabled servers connecting to the fabric using 8 × 400GE interfaces (3.2 Tbps per server), the total server-to-leaf bandwidth is 12.8 Tbps (see Table 4). Each of the 16 leaf nodes connects to the four spine nodes using 400GE links, providing a total leaf-to-spine bandwidth of 25.6 Tbps (see Table 5). This results in a 1:2 ratio, meaning the fabric is overprovisioned and has more than enough bandwidth for full GPU-to-GPU communication, even under 100% inter-stripe traffic.

To implement a balanced, non-oversubscribed (1:1) configuration that can still be pushed into congestion for testing, two additional servers per stripe, without Queue Pair Pinning (QPP) support, were added. This brings the total to eight servers and increases the server-to-leaf bandwidth to 25.6 Tbps (see Table 6), which matches the available spine uplink capacity. In this extended setup, additional RoCEv2 traffic is injected using IXIA to create realistic congestion scenarios and validate congestion control mechanisms across the backend fabric.
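
A quick check of the two scenarios described above: with 8 x 400GE links per server (3.2 Tbps) and 16 leaf nodes each using 4 x 400GE spine-facing uplinks, the sketch below computes server-to-leaf versus leaf-to-spine bandwidth for the four-server and eight-server cases (compare Tables 4, 5, and 6).

```python
# Bandwidth check for the four-server (QPP only) and eight-server scenarios.

SERVER_BW_TBPS = 8 * 400 / 1000            # 3.2 Tbps per server (8 x 400GE)
LEAF_TO_SPINE_TBPS = 16 * 4 * 400 / 1000   # 25.6 Tbps (16 leaves x 4 x 400GE uplinks)

for servers in (4, 8):
    server_to_leaf = servers * SERVER_BW_TBPS
    ratio = server_to_leaf / LEAF_TO_SPINE_TBPS
    print(f"{servers} servers: {server_to_leaf:.1f} Tbps server-to-leaf vs "
          f"{LEAF_TO_SPINE_TBPS:.1f} Tbps leaf-to-spine (ratio {ratio:.2f}:1)")
```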

Note:

The recommendation is to deploy the fabric with a 1:1 oversubscription ratio.

Table 4: Per Stripe Server to Leaf Bandwidth (Four QPP Servers)

| Stripe | Number of Servers per Stripe | Number of 400GE Server <=> Leaf Links per Server (same as number of leaf nodes and GPUs per server) | Server <=> Leaf Link Bandwidth | Total Server <=> Leaf Bandwidth per Stripe |
| --- | --- | --- | --- | --- |
| 1 | 2 | 8 | 400 Gbps | 2 x 8 x 400 Gbps = 6.4 Tbps |
| 2 | 2 | 8 | 400 Gbps | 2 x 8 x 400 Gbps = 6.4 Tbps |
| Total Server <=> Leaf Bandwidth | | | | 12.8 Tbps |
Table 5: Per Stripe Leaf to Spine Bandwidth

| Stripe | Number of Leaf Nodes | Number of Spine Nodes | Number of 400GE Leaf <=> Spine Links per Leaf-to-Spine Connection | Leaf <=> Spine Link Bandwidth | Total Leaf <=> Spine Bandwidth per Stripe |
| --- | --- | --- | --- | --- | --- |
| 1 | 8 | 4 | 1 | 400 Gbps | 8 x 4 x 1 x 400 Gbps = 12.8 Tbps |
| 2 | 8 | 4 | 1 | 400 Gbps | 8 x 4 x 1 x 400 Gbps = 12.8 Tbps |
| Total Leaf <=> Spine Bandwidth | | | | | 25.6 Tbps |

Table 6: Per Stripe Server to Leaf Bandwidth (Eight Servers)

| Stripe | Number of Servers per Stripe | Number of 400GE Server <=> Leaf Links per Server (same as number of leaf nodes and GPUs per server) | Server <=> Leaf Link Bandwidth | Total Server <=> Leaf Bandwidth per Stripe |
| --- | --- | --- | --- | --- |
| 1 | 4 | 8 | 400 Gbps | 4 x 8 x 400 Gbps = 12.8 Tbps |
| 2 | 4 | 8 | 400 Gbps | 4 x 8 x 400 Gbps = 12.8 Tbps |
| Total Server <=> Leaf Bandwidth | | | | 25.6 Tbps |

GPU server-to-leaf-node connectivity follows the rail-optimized architecture described in the Backend GPU Rail Optimized Stripe Architecture section of the AI Data Center Network with Juniper Apstra, NVIDIA GPUs, and WEKA Storage JVD.
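
As a rough illustration only, the sketch below assumes the common rail-optimized convention in which GPU (rail) n of every server in a stripe connects to leaf node n of the same stripe over a single 400GE link; the authoritative wiring is the one described in the referenced JVD section. The server names H100-01 and H100-02 and the leaf naming follow Table 2.

```python
# Hypothetical rail-optimized wiring map for one stripe. The GPU-to-leaf
# mapping (rail n -> leaf n) is an assumption based on the referenced
# rail-optimized stripe architecture, not taken verbatim from this document.

GPUS_PER_SERVER = 8
QPP_SERVERS = ["H100-01", "H100-02"]  # QPP-enabled servers of a stripe (Table 2)

def rail_map(stripe: int) -> list[tuple[str, int, str]]:
    """Return (server, gpu_index, leaf_name) triples for one stripe."""
    links = []
    for server in QPP_SERVERS:
        for gpu in range(1, GPUS_PER_SERVER + 1):
            leaf = f"gpu-backend-{stripe:03d}_leaf{gpu}"
            links.append((server, gpu, leaf))
    return links

if __name__ == "__main__":
    for server, gpu, leaf in rail_map(stripe=1):
        print(f"{server} GPU{gpu} -> {leaf} (400GE)")
```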