Recommendations

The AI Data Center Network with Juniper Apstra, NVIDIA GPUs, and WEKA Storage JVD follows an industry-standard dedicated IP Fabric design. Three distinct fabrics provide maximum efficiency while maintaining focus on AI model scale, expedited completion times, and rapid evolution with the advent of AI technologies.

To follow best practice recommendations:

  • A minimum of 4 spines in each fabric is suggested.
    Note: Although the design for cluster 1 in this document includes only 2 spines, we found that under certain dual-failure scenarios combined with congestion, the fabric becomes susceptible to PFC storms (a behavior that is not unique to any vendor). We recommend deploying the solution with 4 spines, as described for the QFX5240 fabric (cluster 2), even when using different switch models.

  • Follow a rail-optimized fabric design, and maintain a 1:1 bandwidth subscription ratio (no oversubscription) and leaf-to-GPU symmetry.
  • Implement Dynamic Load Balancing instead of traditional ECMP for optimal load distribution.
  • Implement DCQCN (PFC and ECN) to ensure a lossless fabric in the GPU Backend Fabric and, where the storage vendor recommends it, in the Storage Backend Fabric.
  • The minimum recommended Junos OS releases for this JVD are:
      • Junos OS Release 23.4R2-S3 for the Juniper QFX5130-32CD
      • Junos OS Release 23.4X100-D20 for the Juniper QFX5220-32CD
      • Junos OS Release 23.4X100-D20 for the Juniper QFX5230-64CD
      • Junos OS Release 23.4X100-D20 for the Juniper QFX5240-64CD
      • Junos OS Release 23.4R2-S3 for the Juniper PTX10008
  • Configure DCQCN (PFC and ECN) parameters on the NVIDIA servers, and set the NCCL socket interface (the NCCL_SOCKET_IFNAME environment variable) to the management (frontend) interface.
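
As an illustration of the last item, the following minimal sketch builds an environment for a training job in which NCCL's socket interface is pinned to the frontend interface. NCCL_SOCKET_IFNAME is the NCCL environment variable that selects the IP interface used for NCCL's socket-based communication. The interface name mgmt0 and the helper function are hypothetical placeholders, not values taken from this JVD.

```python
import os

# Hypothetical name of the server's management (frontend) interface.
# Replace it with the actual interface name on your servers.
FRONTEND_IFNAME = "mgmt0"


def pin_nccl_socket_interface(ifname: str = FRONTEND_IFNAME) -> dict:
    """Return a copy of the current environment with NCCL_SOCKET_IFNAME set."""
    env = dict(os.environ)
    # NCCL reads this variable to choose the interface for socket traffic.
    env["NCCL_SOCKET_IFNAME"] = ifname
    return env


if __name__ == "__main__":
    env = pin_nccl_socket_interface()
    print("NCCL_SOCKET_IFNAME =", env["NCCL_SOCKET_IFNAME"])
```

The returned dictionary can be passed as the env argument of subprocess.run when launching the job, so the setting applies only to that process rather than to the whole shell session.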

The Juniper switches listed in the Juniper Hardware and Software Components section are the platforms best suited, in terms of features and performance, to the roles specified in this JVD.
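
The spine-count and 1:1 subscription recommendations in this section can be sanity-checked with simple arithmetic before deployment. The sketch below is one way to do that; the LeafPlan class, its field names, and the port counts in the example are illustrative assumptions, not figures from this JVD.

```python
from dataclasses import dataclass

MIN_SPINES = 4  # minimum spine count suggested in the recommendations


@dataclass
class LeafPlan:
    """Hypothetical description of one leaf's port allocation."""
    gpu_ports: int          # server-facing (GPU NIC) ports
    gpu_port_gbps: int      # speed of each GPU-facing port
    uplinks_per_spine: int  # links from this leaf to each spine
    uplink_gbps: int        # speed of each uplink
    spines: int             # number of spines in the fabric

    def subscription_ratio(self) -> float:
        """Downlink bandwidth divided by uplink bandwidth (1.0 means 1:1)."""
        down = self.gpu_ports * self.gpu_port_gbps
        up = self.uplinks_per_spine * self.spines * self.uplink_gbps
        return down / up


def check(plan: LeafPlan) -> list[str]:
    """Return deviations from the recommendations (empty list if none)."""
    issues = []
    if plan.spines < MIN_SPINES:
        issues.append(f"only {plan.spines} spines; at least {MIN_SPINES} suggested")
    ratio = plan.subscription_ratio()
    if ratio > 1.0:
        issues.append(f"oversubscribed {ratio:.2f}:1; 1:1 recommended")
    return issues


if __name__ == "__main__":
    # Illustrative numbers only: 32 x 400G GPU ports,
    # 8 x 400G uplinks to each of 4 spines.
    plan = LeafPlan(gpu_ports=32, gpu_port_gbps=400,
                    uplinks_per_spine=8, uplink_gbps=400, spines=4)
    print(check(plan) or "plan matches the spine-count and 1:1 recommendations")
```

With the same leaf but only 2 spines, the uplink bandwidth halves, so the check reports both the low spine count and a 2:1 oversubscription.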

Table 53: Revision History

Date          | Version                    | Description
December 2024 | JVD-AICLUSTERDC-AIML-02-08 | Added PTX as spine.
November 2024 | JVD-AICLUSTERDC-AIML-02-05 | Utilized Junos OS Evolved Release 23.4X100-D20 for the leaf and spine switches.