Terraform Automation of Apstra for the AI Fabric
AI Terraform Configs
Juniper has compiled a set of Terraform configs to help set up data center fabrics for an AI cluster. AI training requires a dedicated GPU Backend fabric, a dedicated Storage Backend fabric, and a Frontend fabric. Here we show such Apstra-managed network fabrics, deploying logical devices, rack types, and templates for DGX (or equivalent HGX) servers based on A100 and H100 GPUs, which use 200GE and 400GE access connectivity respectively. The logical devices, rack types, and templates defined here create the NVIDIA rail-optimized topology.
The GitHub repository for the AI designs using Apstra can be found at:
https://github.com/Juniper/terraform-apstra-examples/tree/master/ai-cluster-designs/
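
For illustration, the minimal sketch below shows the general shape of such a config using the Juniper Apstra Terraform provider: a leaf logical device with 400GE server-facing ports of the kind an H100 design would use. The resource names, port counts, and roles here are illustrative assumptions only, not the repository's actual definitions; confirm the exact attribute schema against the provider documentation and the repository above.

# Minimal sketch, not the repository's actual definitions. Assumes the
# Juniper/apstra Terraform provider; the Apstra URL and credentials are
# expected to come from a provider block or environment variables.
terraform {
  required_providers {
    apstra = {
      source = "Juniper/apstra"
    }
  }
}

# Example leaf logical device: 16 x 400GE uplinks toward the spines and
# 16 x 400GE access ports toward H100 (400GE) servers. Names and port
# counts are placeholders, not the JVD values.
resource "apstra_logical_device" "example_h100_leaf" {
  name = "example-AI-leaf-32x400"
  panels = [
    {
      rows    = 2
      columns = 16
      port_groups = [
        {
          port_count = 16
          port_speed = "400G"
          port_roles = ["spine"]     # fabric uplinks
        },
        {
          port_count = 16
          port_speed = "400G"
          port_roles = ["generic"]   # server-facing 400GE access
        }
      ]
    }
  ]
}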
AI JVD-Specific Terraform Configs
Based on the AI cluster designs with rail-optimized GPU fabrics of various sizes, this Terraform config for Apstra builds a set of three blueprints for a reference AI cluster: a dedicated GPU Backend fabric, a dedicated Storage Backend fabric, and a Frontend fabric.
This example serves as a Juniper Validated Design (JVD) set of configurations that can also be applied to larger clusters. It contains two NVIDIA rail-optimized groups: one stripe of eight Juniper QFX5220 leaf switches and another stripe of eight QFX5230 leaf switches. It includes options for either QFX5230 spines or high-radix PTX10008 spines, with examples for A100- and H100-based servers in uniform racks as well as a "Lab Leaf" rack with mixed server access, half A100 and half H100 connectivity. The mixed rack is included both as an example and because it is what is used in the real lab test environment for this configuration.
The GitHub repository for this specific AI JVD can be found at:
https://github.com/Juniper/terraform-apstra-examples/tree/master/ai-cluster-jvd/
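
Continuing the sketch above, a rack type built from such a logical device can be rolled into a rack-based template and then instantiated as a blueprint, which is roughly how the three blueprints (GPU Backend, Storage Backend, Frontend) are produced. Every name, count, and ID below is a placeholder assumption; the repository holds the real rack types, templates, and blueprint definitions, and attribute names should be checked against the provider documentation.

# Minimal sketch only: a rail-optimized rack type, a rack-based template,
# and a blueprint built from it. IDs and counts are placeholders.
resource "apstra_rack_type" "example_h100_rail" {
  name                       = "example-H100-rail"
  fabric_connectivity_design = "l3_clos"

  leaf_switches = {
    rail_leaf = {
      logical_device_id = apstra_logical_device.example_h100_leaf.id
      spine_link_count  = 2
      spine_link_speed  = "400G"
    }
  }

  generic_systems = {
    h100_servers = {
      count             = 16
      logical_device_id = "REPLACE-server-logical-device-id"  # placeholder
      links = {
        gpu_nic = {
          target_switch_name = "rail_leaf"
          links_per_switch   = 1
          speed              = "400G"
        }
      }
    }
  }
}

resource "apstra_template_rack_based" "example_gpu_backend" {
  name                     = "example-GPU-Backend"
  asn_allocation_scheme    = "unique"
  overlay_control_protocol = "static"   # assumption; the JVD design may differ
  spine = {
    logical_device_id = "REPLACE-spine-logical-device-id"  # placeholder
    count             = 2
  }
  rack_infos = {
    (apstra_rack_type.example_h100_rail.id) = { count = 2 }
  }
}

# One blueprint per fabric; the Storage Backend and Frontend blueprints
# would follow the same pattern with their own templates.
resource "apstra_datacenter_blueprint" "example_gpu_backend" {
  name        = "example-GPU-Backend"
  template_id = apstra_template_rack_based.example_gpu_backend.id
}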
Figure 63: Sample GPU Backend Terraform Template
Figure 64: Sample GPU Backend Terraform Template: Rack Type
Figure 65: Sample GPU Backend Terraform Template: Logical Device
Figure 66: Terraform Template: All Templates Examples