Test Objectives

The primary objectives of the JVD testing can be summarized as:

  • Qualification of the complete AI fabric design functionality including the Frontend, GPU Backend, and Storage Backend fabrics, and connectivity between NVIDIA GPUs and WEKA Storage.
  • Qualification of the deployment steps based on Juniper Apstra.
  • Confirmation that the design is well documented and produces a reliable, predictable deployment for the customer.

The qualification objectives included validating:

  • Blueprint deployment
  • Device upgrade
  • Incremental configuration pushes and provisioning
  • Telemetry and analytics checks
  • Failure mode analysis
  • Congestion avoidance and mitigation
  • Host, storage, and GPU traffic

Test Goals

The AI JVD testing for the described network included the following:

  • Design and blueprint deployment through Apstra of three distinct fabrics
  • Fabric operation and monitoring through Apstra analytics and telemetry dashboard
  • Congestion management with PFC and ECN, including failure scenarios
  • End-to-end traffic flow, with Dynamic Load Balancing
  • System health, ARP, ND, MAC, BGP (route, next hop), interface traffic counters, and so on
  • Software operation verification (no anomalies or issues found)
  • AI fabric with Juniper Apstra performing successfully under the following required scenarios:
    • Node failure (reboot)
    • Interface failures (interface down/up, laser on/off)

Under these scenarios, the following were evaluated and validated:

  • Completion of AI job models within the MLCommons Training benchmarks
  • Traffic recovery after every failure scenario
  • Impact on the fabric and anomaly reporting in Apstra
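
As an illustration of how traffic recovery after a failure could be quantified, the following minimal Python sketch computes the time from a failure event until throughput returns to within a tolerance of its baseline and stays there. The function name, the (timestamp, rate) sample format, and the 5% tolerance are assumptions made for this example; they are not taken from the JVD test tooling or from Apstra.

```python
from typing import Optional, Sequence, Tuple


def recovery_time(
    samples: Sequence[Tuple[float, float]],
    baseline: float,
    failure_at: float,
    tolerance: float = 0.05,
) -> Optional[float]:
    """Return seconds from failure_at until throughput is back within
    `tolerance` of `baseline` and remains there for all later samples.

    samples: (timestamp_seconds, rate) pairs in ascending time order
             (hypothetical format, e.g. polled interface counters).
    Returns None if throughput never recovers within the sample window.
    """
    threshold = baseline * (1.0 - tolerance)
    recovered_at: Optional[float] = None
    for timestamp, rate in samples:
        if timestamp < failure_at:
            continue  # ignore samples taken before the failure event
        if rate >= threshold:
            if recovered_at is None:
                recovered_at = timestamp  # first sample back above threshold
        else:
            recovered_at = None  # dipped again, so recovery did not hold
    if recovered_at is None:
        return None
    return recovered_at - failure_at


# Illustrative data only: an interface goes down at t=2 s.
samples = [(0.0, 100.0), (1.0, 100.0), (2.0, 40.0),
           (3.0, 60.0), (4.0, 97.0), (5.0, 99.0)]
print(recovery_time(samples, baseline=100.0, failure_at=2.0))
```

Requiring the rate to stay above the threshold for all remaining samples avoids reporting a recovery that is followed by another dip.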

Other features tested:

  • Mellanox ConnectX NIC default settings
  • DSCP and CNP configuration on the NICs
  • Connectivity from fabric-connected hosts created by Apstra to NSX-managed hosts
  • BERT/DLRM test completion times
  • Llama 2 inference against existing infrastructure

Refer to the test report for more information.

Test Non-Goals

This test report covers AI JVD phase 2 (release 2), which adds the QFX5240 as a spine. Future releases will include the PTX10008, multi-tenancy, and the newest Selective DLB and Global Load Balancing features.