Test Objectives
The primary objectives of the JVD testing can be summarized as:
- Qualification of the complete AI fabric design functionality including the Frontend, GPU Backend, and Storage Backend fabrics, and connectivity between NVIDIA GPUs and WEKA Storage.
- Qualification of the deployment steps based on Juniper Apstra.
- Assurance that the design is well documented and produces a reliable, predictable deployment for the customer.
The qualification objectives included validation of:
- Blueprint deployment, device upgrade, incremental configuration pushes and provisioning, telemetry and analytics checks, failure mode analysis, congestion avoidance and mitigation, and host, storage, and GPU traffic.
Test Goals
The AI JVD testing for the described network included the following:
- Design and blueprint deployment through Apstra of three distinct fabrics
- Fabric operation and monitoring through Apstra analytics and telemetry dashboard
- Congestion management with PFC and ECN, including failure scenarios
- End-to-end traffic flow, with Dynamic Load Balancing
- System health and state checks, including ARP, ND, MAC, BGP (route, next hop), and interface traffic counters
- Software operation verification (no anomalies or issues found)
- AI fabric with Juniper Apstra successfully performing under the following required scenarios (must):
- Node failure (reboot)
- Interface failures (interface down/up, laser on/off)
Under these scenarios, the following were evaluated or validated:
- Completion of AI job models within MLCommons Training benchmarks
- Traffic recovery after all failure scenarios
- Impact on the fabric, and anomaly reporting in Apstra
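As an illustration only, the traffic recovery check listed above could be expressed along the following lines. This is a minimal sketch, not the procedure used in the JVD testing: the function name, the 5% tolerance, and the idea of sampling throughput at fixed intervals after a failure are all assumptions made for the example, and no Apstra or Junos API is involved.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RecoveryResult:
    recovered: bool
    # Index of the first post-failure sample from which throughput
    # stays within tolerance of the baseline; None if never recovered.
    first_stable_sample: Optional[int]


def check_traffic_recovery(
    baseline_gbps: float,
    post_failure_gbps: Sequence[float],
    tolerance: float = 0.05,  # hypothetical pass threshold (5% below baseline)
) -> RecoveryResult:
    """Report whether throughput returned to baseline and stayed there."""
    threshold = baseline_gbps * (1.0 - tolerance)
    for i in range(len(post_failure_gbps)):
        # Recovery requires every remaining sample to meet the threshold,
        # so a brief spike followed by another drop does not count.
        if all(sample >= threshold for sample in post_failure_gbps[i:]):
            return RecoveryResult(True, i)
    return RecoveryResult(False, None)


if __name__ == "__main__":
    # Hypothetical samples taken after an interface is brought down and up.
    samples = [10.0, 50.0, 96.0, 97.0, 99.0]
    print(check_traffic_recovery(100.0, samples))
```

Requiring all remaining samples to meet the threshold is a deliberate choice here: it treats a flapping link as not recovered, at the cost of needing a long enough sampling window after each failure event.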
Other features tested:
- Mellanox ConnectX NIC default settings
- DSCP and CNP configuration on the NICs
- Connectivity between fabric-connected hosts created by Apstra and NSX-managed hosts
- BERT/DLRM test completion times
- Llama 2 inference against the existing infrastructure
Refer to the test report for more information.
Test Non-Goals
This test report covers AI JVD phase 2 (release 2), which adds the QFX5240 as a spine. Future releases will include the PTX10008, multi-tenancy, and the newest Selective DLB and Global Load Balancing features.