IP Fabric GPU Backend Fabric Architecture
The GPU backend fabric in this cluster is built using Juniper QFX5240-64OD switches, which serve as both leaf and spine nodes, as shown in Figure 1. The architecture includes two stripes, each composed of eight QFX5240 leaf nodes. All leaf nodes are interconnected across four QFX5240 spine nodes, forming a Layer 3 IPv6 fabric that uses EBGP for route advertisement and native IPv6 for forwarding.
The NVIDIA H100 servers, as well as the non-queue-pair-pinning (non-QPP) servers, connect to the leaf nodes using 400GE interfaces. The leaf-to-spine links are also 400GE. Although the Juniper QFX5240 switches support 800GE uplinks to the spine nodes (refer to Table 1), the current configuration intentionally uses 400GE uplinks so that congestion scenarios can be created and tested with the existing resources.
| Leaf-to-Spine Link Speed | Leaf Uplink Capacity (per leaf) | Total Leaf-to-Spine BW | Server-to-Leaf BW (4 servers) | Server-to-Leaf BW (8 servers) | Oversubscription Ratio (4 servers) | Oversubscription Ratio (8 servers) |
|---|---|---|---|---|---|---|
| 200 Gbps | 4 × 200G = 800 Gbps | 12.8 Tbps | 12.8 Tbps | 25.6 Tbps | 1:1 (balanced) | 2:1 (oversubscribed) |
| 400 Gbps | 4 × 400G = 1.6 Tbps | 25.6 Tbps | 12.8 Tbps | 25.6 Tbps | 1:2 (overprovisioned) | 1:1 (balanced) |
| 800 Gbps | 4 × 800G = 3.2 Tbps | 51.2 Tbps | 12.8 Tbps | 25.6 Tbps | 1:4 (heavily overprovisioned) | 1:2 (overprovisioned) |
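The ratios in the table above follow directly from the bandwidth figures. The short Python sketch below is illustrative only; it assumes the topology described in this section (16 leaf nodes, 4 spine nodes, one uplink per leaf-to-spine pair, and 8 × 400GE links per GPU server) and simply divides server-facing bandwidth by leaf-to-spine bandwidth.

```python
# Illustrative sketch (not from the JVD): reproduce the ratios in the table
# above. Assumes 16 leaf nodes (2 stripes x 8), 4 spine nodes, one uplink per
# leaf-to-spine pair, and 8 x 400GE links per GPU server.

LEAF_NODES = 16
SPINES_PER_LEAF = 4        # one uplink per spine node
SERVER_LINKS = 8           # one 400GE link per GPU
SERVER_LINK_GBPS = 400

def subscription_ratio(uplink_gbps: int, servers: int) -> float:
    """Server-to-leaf bandwidth divided by leaf-to-spine bandwidth."""
    server_bw = servers * SERVER_LINKS * SERVER_LINK_GBPS      # Gbps
    fabric_bw = LEAF_NODES * SPINES_PER_LEAF * uplink_gbps     # Gbps
    return server_bw / fabric_bw

for uplink in (200, 400, 800):
    for servers in (4, 8):
        r = subscription_ratio(uplink, servers)
        print(f"{uplink}G uplinks, {servers} servers: ratio = {r:.2f}")
# ratio > 1 means oversubscribed (e.g. 2.00 -> 2:1), ratio < 1 means
# overprovisioned (e.g. 0.50 -> 1:2), and 1.00 is balanced.
```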
Currently, the lab includes four servers where Queue Pair Pinning (QPP) can be implemented. Two additional non-QPP-enabled servers per stripe were added, increasing the total server bandwidth to match the available spine uplink capacity. Additional RoCEv2 traffic is injected using an IXIA traffic generator to create realistic congestion scenarios and validate congestion control mechanisms.
A summary of the GPU backend fabric components and their connectivity is provided in Tables 2 and 3 below.
| Stripe | GPU Servers | GPU Backend Leaf Nodes Switch Model | GPU Backend Spine Nodes Switch Model |
|---|---|---|---|
| 1 | H100 x 2 (H100-01 & H100-02) | QFX5240-64OD x 8 (gpu-backend-001_leaf#; #=1-8) | QFX5240-64OD x 4 (gpu-backend-spine#; #=1-4) |
| 2 | H100 x 2 (H100-01 & H100-02) | QFX5240-64OD x 8 (gpu-backend-002_leaf#; #=1-8) | QFX5240-64OD x 4 (gpu-backend-spine#; #=1-4, shared with Stripe 1) |
| Stripe | GPU Servers <=> GPU Backend Leaf Nodes | GPU Backend Leaf Nodes <=> GPU Backend Spine Nodes |
|---|---|---|
| 1 | Total 400GE links between servers and leaf nodes = 8 (GPUs per server) x 1 (400GE server-to-leaf link per GPU) x 4 (servers) = 32 | Total 400GE links between leaf nodes and spine nodes = 8 (leaf nodes) x 1 (400GE link per leaf-to-spine connection) x 4 (spine nodes) = 32 |
| 2 | Total 400GE links between servers and leaf nodes = 8 (GPUs per server) x 1 (400GE server-to-leaf link per GPU) x 4 (servers) = 32 | Total 400GE links between leaf nodes and spine nodes = 8 (leaf nodes) x 1 (400GE link per leaf-to-spine connection) x 4 (spine nodes) = 32 |
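The per-stripe link counts in the table above can be reproduced from the stripe parameters. The following sketch is a simple illustration under the assumptions noted in its comments; the variable names are ours and not part of the design.

```python
# Illustrative sketch: derive the per-stripe 400GE link counts shown above.
# Assumptions: 4 servers per stripe (8 GPUs each, one 400GE link per GPU),
# 8 leaf nodes per stripe, 4 shared spine nodes, and one 400GE link per
# leaf-to-spine connection.

GPUS_PER_SERVER = 8
SERVERS_PER_STRIPE = 4
LEAVES_PER_STRIPE = 8
SPINE_NODES = 4
LINKS_PER_LEAF_SPINE_CONNECTION = 1

server_to_leaf_links = GPUS_PER_SERVER * SERVERS_PER_STRIPE            # 32
leaf_to_spine_links = (LEAVES_PER_STRIPE * SPINE_NODES
                       * LINKS_PER_LEAF_SPINE_CONNECTION)              # 32

print(f"400GE server-to-leaf links per stripe: {server_to_leaf_links}")
print(f"400GE leaf-to-spine links per stripe:  {leaf_to_spine_links}")
```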
Oversubscription Factor
The speed and number of links between the GPU servers and leaf nodes, as well as the links between the leaf and spine nodes, determine the overall oversubscription factor of the fabric.
With only the four NVIDIA H100 QPP-enabled servers connected to the fabric using 8 × 400GE interfaces (3.2 Tbps per server), the total server-to-leaf bandwidth is 12.8 Tbps. Each of the 16 leaf nodes connects to the four spine nodes using 400GE links, providing a total leaf-to-spine bandwidth of 25.6 Tbps (see Table 4). The result is a 1:2 subscription ratio: the fabric is overprovisioned, with more than enough bandwidth for full GPU-to-GPU communication even under 100% inter-stripe traffic.
To implement a balanced, non-oversubscribed (1:1) configuration that can still be driven into congestion for testing, two additional servers per stripe, without Queue Pair Pinning (QPP) support, were added. This brings the total to eight servers and increases the server-to-leaf bandwidth to 25.6 Tbps (see Table 5), matching the available spine uplink capacity. In this extended setup, additional RoCEv2 traffic is injected using IXIA to create realistic congestion scenarios and validate congestion control mechanisms across the backend fabric.
The recommendation is to deploy the fabric following a 1:1 subscription factor.
Server to Leaf Bandwidth per Stripe (four QPP servers, two per stripe)

| Stripe | Number of Servers per Stripe | Number of 400GE Server <=> Leaf Links per Server (equal to the number of leaf nodes and GPUs per server) | Server <=> Leaf Link Bandwidth | Total Server <=> Leaf Bandwidth per Stripe |
|---|---|---|---|---|
| 1 | 2 | 8 | 400 Gbps | 2 x 8 x 400 Gbps = 6.4 Tbps |
| 2 | 2 | 8 | 400 Gbps | 2 x 8 x 400 Gbps = 6.4 Tbps |
| | | | Total Server <=> Leaf Bandwidth | 12.8 Tbps |
Leaf to Spine Bandwidth per Stripe

| Stripe | Number of Leaf Nodes | Number of Spine Nodes | Number of 400GE Links per Leaf <=> Spine Connection | Leaf <=> Spine Link Bandwidth | Leaf <=> Spine Bandwidth per Stripe |
|---|---|---|---|---|---|
| 1 | 8 | 4 | 1 | 400 Gbps | 8 x 4 x 1 x 400 Gbps = 12.8 Tbps |
| 2 | 8 | 4 | 1 | 400 Gbps | 8 x 4 x 1 x 400 Gbps = 12.8 Tbps |
| | | | | Total Leaf <=> Spine Bandwidth | 25.6 Tbps |
Server to Leaf Bandwidth per Stripe (eight servers, four per stripe)

| Stripe | Number of Servers per Stripe | Number of 400GE Server <=> Leaf Links per Server (equal to the number of leaf nodes and GPUs per server) | Server <=> Leaf Link Bandwidth | Total Server <=> Leaf Bandwidth per Stripe |
|---|---|---|---|---|
| 1 | 4 | 8 | 400 Gbps | 4 x 8 x 400 Gbps = 12.8 Tbps |
| 2 | 4 | 8 | 400 Gbps | 4 x 8 x 400 Gbps = 12.8 Tbps |
| | | | Total Server <=> Leaf Bandwidth | 25.6 Tbps |
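The per-stripe totals in the bandwidth tables above, and the resulting subscription factor for the four-server and eight-server setups, can be checked with the sketch below. It is only a worked example using the values already given in this section; the helper names are illustrative.

```python
# Illustrative check of the per-stripe bandwidth tables and the resulting
# subscription factor. Values are taken from this section; the helper names
# are illustrative only.

STRIPES = 2
LEAVES_PER_STRIPE = 8
SPINE_NODES = 4
LINK_GBPS = 400            # 400GE links throughout this design
LINKS_PER_SERVER = 8       # one 400GE link per GPU

def server_to_leaf_tbps(servers_per_stripe: int) -> float:
    """Total server-to-leaf bandwidth across both stripes, in Tbps."""
    return STRIPES * servers_per_stripe * LINKS_PER_SERVER * LINK_GBPS / 1000

def leaf_to_spine_tbps() -> float:
    """Total leaf-to-spine bandwidth across both stripes, in Tbps."""
    return STRIPES * LEAVES_PER_STRIPE * SPINE_NODES * LINK_GBPS / 1000

for servers in (2, 4):  # servers per stripe: four-server and eight-server setups
    s2l = server_to_leaf_tbps(servers)
    l2s = leaf_to_spine_tbps()
    print(f"{servers} servers/stripe: server-to-leaf {s2l:.1f} Tbps, "
          f"leaf-to-spine {l2s:.1f} Tbps, subscription {s2l / l2s:.2f}:1")
# 2 servers/stripe -> 12.8 vs 25.6 Tbps (0.50:1, i.e. 1:2 overprovisioned)
# 4 servers/stripe -> 25.6 vs 25.6 Tbps (1.00:1, balanced)
```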
GPU server to leaf node connectivity follows the rail-optimized architecture described in the Backend GPU Rail Optimized Stripe Architecture section of the AI Data Center Network with Juniper Apstra, NVIDIA GPUs, and WEKA Storage JVD.