JVD Validation Goals and Scope
Test Objectives
The primary objectives of the JVD testing can be summarized as follows:
- Qualification of the complete AI fabric design functionality, including the Frontend, GPU Backend, and Storage Backend fabrics, and the connectivity between AMD GPUs and Vast Storage.
- Qualification of the deployment steps based on Juniper Apstra.
- Ensuring that the design is well documented and produces a reliable, predictable deployment for the customer.
The qualification objectives included validation of:
- Blueprint deployment
- Device upgrades
- Incremental configuration pushes and provisioning
- Telemetry and analytics checks
- Failure mode analysis
- Congestion avoidance and mitigation
- Verification of host, storage, and GPU traffic
Test Scope
The AI JVD testing for the described network included the following:
- Design and blueprint deployment through Apstra of three distinct fabrics
- Fabric operation and monitoring through Apstra analytics and telemetry dashboard
- Congestion management with PFC and ECN, including failure scenarios
- End-to-end traffic flow, with Dynamic Load Balancing (DLB)
- System health checks: ARP, ND, MAC, BGP (routes and next hops), interface traffic counters, and so on (an illustrative spot-check sketch follows this list)
- Software operation verification (no anomalies or issues found)
- AI fabric with Juniper Apstra successfully performing under the following required scenarios (must):
- Node failure (reboot)
- Interface failures (interface down/up, laser on/off)
Under these scenarios, the following were evaluated and validated:
- Completion of AI job models within MLCommons Training benchmarks
- Traffic recovery after each failure scenario
- Impact to the fabric and anomaly reporting in Apstra
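In the JVD itself, the per-device checks listed above (interface traffic counters, BGP routes and next hops) are observed through Apstra telemetry and analytics. Purely as an illustration, the following sketch shows how similar spot checks could be scripted directly against a fabric switch with the junos-eznc (PyEZ) library; the hostname, credentials, and interface name are hypothetical placeholders and are not part of the validated design.

```python
# Illustrative sketch only: per-device health spot checks (interface traffic
# counters and BGP peer state) using junos-eznc (PyEZ). Hostname, credentials,
# and interface name below are hypothetical placeholders.
from jnpr.junos import Device

LEAF = "leaf1.example.net"   # hypothetical fabric leaf
INTERFACE = "et-0/0/0"       # hypothetical GPU-facing port

with Device(host=LEAF, user="labuser", password="lab123") as dev:
    # Interface statistics: confirm traffic counters are incrementing
    # after a failure/recovery event.
    phy = dev.rpc.get_interface_information(interface_name=INTERFACE,
                                            extensive=True)
    in_pkts = phy.findtext(".//input-packets")
    out_pkts = phy.findtext(".//output-packets")
    print(f"{INTERFACE}: input packets={in_pkts}, output packets={out_pkts}")

    # BGP summary: confirm fabric peers are re-established and
    # routes/next hops have converged after the event.
    bgp = dev.rpc.get_bgp_summary_information()
    for peer in bgp.findall(".//bgp-peer"):
        addr = peer.findtext("peer-address")
        state = peer.findtext("peer-state")
        print(f"peer {addr}: {state}")
```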
Other Features Tested
- Broadcom 97608 THOR2 NICs
- Mellanox ConnectX NIC default settings
- DSCP and CNP configuration on the NICs
- Connectivity, created through Apstra, between fabric-connected hosts and NSX-managed hosts
- BERT/Llama 3 test completion times
- Llama 2 inference against existing infrastructure
Refer to the test report for more information.
Tested Optics
Table 37: Frontend Fabric Optics
Part number | Optics Name | Device Role | Device Model | Interface/NIC type |
---|---|---|---|---|
740-085351 | QSFP56-DD-400GBASE-DR4 | spine | QFX5130-32CD | QSFP-DD |
740-085351 | QSFP56-DD-400GBASE-DR4 | leaf | QFX5130-32CD | QSFP-DD |
740-061405 | QSFP-100GBASE-SR4-T2 | leaf | QFX5130-32CD | QSFP28 |
740-046565 | QSFP+-40G-SR4 w/ 4x10G breakout cable | leaf | QFX5130-32CD | QSFP+ |
AFBR-709SMZ | AVAGO 10GBASE-SR SFP+ 300m | Server | SuperMicro Headend Server | Intel X710 |
AFBR-89CDDZ | AVAGO 100GbE QSFP28 300m | GPU Server | AMD MI300X Dell XE9680 | BCM97608 THOR2 |
AFBR-89CDDZ | AVAGO 100GbE QSFP28 300m | GPU Server | AMD MI300X SuperMicro AS-8125GS-TNMR2 | ConnectX-7 |
Table 38: Backend Storage Fabric Optics
Part number | Optics Name | Device Role | Device Model | Interface/NIC type |
---|---|---|---|---|
740-085351 | QSFP56-DD-400GBASE-DR4 | spine | QFX5220-32CD | QSFP-DD |
740-085351 | QSFP56-DD-400GBASE-DR4 | leaf | QFX5220-32CD | QSFP-DD |
740-058734 | QSFP-100GBASE-SR4 | leaf | QFX5220-32CD | QSFP28 |
720-128730 | QSFP56-DD-2x200GBASE-CR4-CU-2.5M w/ 400G DAC breakout into 2x200G | leaf | QFX5220-32CD | QSFP-DD |
740-061405 | QSFP-100GBASE-SR4 | leaf | QFX5220-32CD | QSFP28 |
740-159002 | QSFP56-DD-2x200G-BOAOC-5M | GPU Server | AMD MI300X Dell XE9680 | BCM97608 THOR2 |
740-159002 | QSFP56-DD-2x200G-BOAOC-5M | GPU Server | AMD MI300X SuperMicro AS-8125GS-TNMR2 | ConnectX-7 |
740-061405 | QSFP-100GBASE-SR4 | Storage | Vast Storage CBOX | ConnectX-6 |
740-061405 | QSFP-100GBASE-SR4 | Storage | Vast Storage DBOX | ConnectX-6 |
Table 39: Backend GPU Fabric Optics
Part number | Optics Name | Device Role | Device Model | Interface/NIC type |
---|---|---|---|---|
740-174933 | OSFP-800G-DR8 | spine | QFX5240-64OD | OSFP800 |
740-174933 | OSFP-800G-DR8 | leaf | QFX5240-64OD | OSFP800 |
740-085351 | QDD-400G-DR4 | GPU Server | AMD MI300X Dell XE9680 | BCM97608 THOR2 |
740-085351 | QDD-400G-DR4 | GPU Server | AMD MI300X SuperMicro AS-8125GS-TNMR2 | BCM97608 THOR2 |
For optics tested on the QFX5220-64CD, QFX5230-64CD, PTX10008, WEKA storage, and NVIDIA GPU servers, see the Tested Optics section of the AI Data Center Network with Juniper Apstra, NVIDIA GPUs, and WEKA Storage Juniper Validated Design (JVD).