Recommendations Summary

The AI Data Center Network with Juniper Apstra, AMD GPUs, and VAST Storage JVD follows an industry-standard dedicated IP fabric design. Three distinct fabrics (Frontend, GPU Backend, and Storage Backend) provide maximum efficiency while keeping the focus on AI model scale, fast job completion times, and the ability to evolve quickly as AI technologies advance.

To follow best practice recommendations:

  • Deploy a minimum of 4 spines in each fabric.
Note:

Although the design for cluster 1 in this document includes only 2 spines, we found that under certain dual-failure scenarios combined with congestion, the fabric becomes susceptible to PFC storms (this behavior is not vendor-specific). We recommend deploying the solution with 4 spines, as described for the QFX5240 fabric (cluster 2), even when using different switch models.

  • Follow a rail-optimized fabric design, and maintain a 1:1 bandwidth subscription ratio and leaf-to-GPU symmetry.
  • Implement Dynamic Load Balancing (DLB) instead of traditional ECMP for optimal load distribution.
  • Implement DCQCN (PFC and ECN) to ensure a lossless fabric in the GPU Backend Fabric and, where the storage vendor recommends it, in the Storage Backend Fabric.
  • Configure DCQCN (PFC and ECN) parameters on the AMD servers, and set the NCCL_SOCKET interface to the management (frontend) interface.
  • The minimum recommended Junos OS releases for this JVD are:
    • Junos OS Release 23.4X100-D20 for the Juniper QFX5240-64CD
Note:

For the minimum software releases for the QFX5220-64CD, QFX5230-64CD, and PTX10008, see the Recommendations section of the AI Data Center Network with Juniper Apstra, NVIDIA GPUs, and WEKA Storage—Juniper Validated Design (JVD).
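The 4-spine and 1:1 subscription recommendations above can be sanity-checked with simple bandwidth arithmetic. The following is a minimal sketch, not part of the JVD: the port counts, link speeds, and function names are hypothetical, and it assumes each failure removes a whole spine (the JVD does not specify the failure type behind its dual-failure finding).

```python
def subscription_ratio(gpu_ports, gpu_port_gbps, uplinks, uplink_gbps):
    """Return the downlink-to-uplink bandwidth ratio of one leaf (1.0 means 1:1)."""
    downlink = gpu_ports * gpu_port_gbps
    uplink = uplinks * uplink_gbps
    if uplink == 0:
        return float("inf")  # no surviving uplink capacity
    return downlink / uplink


def ratio_after_spine_failures(gpu_ports, gpu_port_gbps, spines,
                               links_per_spine, uplink_gbps, failed_spines):
    """Recompute the ratio assuming every link to a failed spine is lost."""
    surviving_spines = max(spines - failed_spines, 0)
    return subscription_ratio(gpu_ports, gpu_port_gbps,
                              surviving_spines * links_per_spine, uplink_gbps)


# Hypothetical leaf: 32 GPU-facing 400G ports and 32 x 400G uplinks in total,
# spread over either 2 spines (16 links each) or 4 spines (8 links each).
two_spine_dual_failure = ratio_after_spine_failures(32, 400, 2, 16, 400, 2)   # inf
four_spine_dual_failure = ratio_after_spine_failures(32, 400, 4, 8, 400, 2)   # 2.0
healthy = ratio_after_spine_failures(32, 400, 4, 8, 400, 0)                   # 1.0
```

Under these assumptions, a dual spine failure leaves the hypothetical 2-spine leaf with no uplink capacity, while the 4-spine leaf degrades to 2:1. A higher ratio under congestion means more sustained queue buildup, which is the condition under which PFC pause frames are generated.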

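To illustrate why the recommendations above prefer Dynamic Load Balancing over traditional ECMP, the toy model below compares hash-based placement with load-aware placement for a handful of large flows. It is a simplified sketch and does not model the switch implementation, which runs in hardware. The flow list, the use of CRC32 as the hash, and the function names are all assumptions made for illustration.

```python
import zlib


def static_ecmp(flows, num_links):
    """Place each flow by hashing its 5-tuple; current link load is never consulted."""
    load = [0] * num_links
    for five_tuple, gbps in flows:
        index = zlib.crc32(repr(five_tuple).encode()) % num_links
        load[index] += gbps
    return load


def load_aware(flows, num_links):
    """Place each flow on whichever link currently carries the least traffic."""
    load = [0] * num_links
    for _five_tuple, gbps in flows:
        index = load.index(min(load))
        load[index] += gbps
    return load


# Eight hypothetical 100 Gbps RoCEv2 flows (UDP destination port 4791)
# sharing four equal-cost uplinks.
flows = [
    ((f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791, "udp"), 100)
    for i in range(8)
]

ecmp_load = static_ecmp(flows, 4)
dlb_load = load_aware(flows, 4)
print("static hash per-link load:", ecmp_load)
print("load-aware per-link load: ", dlb_load)
```

AI training traffic typically consists of a small number of long-lived, high-bandwidth flows, so a static hash has few flows to average over and can place several of them on the same uplink. The load-aware variant in this sketch always spreads equal-size flows evenly.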
The Juniper hardware listed in the Juniper Hardware and Software Components section comprises the switch platforms best suited, in features and performance, to the roles specified in this JVD.
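The recommendation to point the NCCL_SOCKET interface at the management (frontend) interface can be sketched as follows. This assumes the setting refers to the `NCCL_SOCKET_IFNAME` environment variable, which NCCL documents for selecting the interface used for socket-based communication, and that the collective library on the AMD servers honors it. The interface name `eno1` and the helper name are hypothetical. Substitute the actual frontend interface of the server.

```python
import os

# Hypothetical name of the management (frontend) interface on the GPU server.
MGMT_IFNAME = "eno1"


def build_job_env(base_env, mgmt_ifname):
    """Return a copy of base_env with NCCL_SOCKET_IFNAME set to the frontend interface."""
    env = dict(base_env)
    env["NCCL_SOCKET_IFNAME"] = mgmt_ifname
    return env


# Environment to hand to the training job launcher.
job_env = build_job_env(os.environ, MGMT_IFNAME)
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when starting the job, which leaves the environment of the parent process unchanged.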