Scaling
The size of an AI cluster varies widely with the requirements of the workload. The number of nodes is driven by factors such as the complexity of the machine learning models, the size of the datasets, the target training speed, and the available budget, and it ranges from a small cluster of fewer than 100 nodes to a data center-wide cluster comprising tens of thousands of compute, storage, and networking nodes. Regardless of scale, a minimum of four spines should always be deployed to provide path diversity and reduce the impact of PFC failure paths.
Table 18: Fabric Scaling - Devices and Positioning
| Small | Medium | Large |
|---|---|---|
| 64–2048 GPUs | 2048–8192 GPUs | 8192–32768 GPUs |
| With support for up to 2048 GPUs, the Juniper QFX5240-64CD/QFX5241-64CD or QFX5230-64CD can serve as both spine and leaf devices in single- or dual-stripe designs. Following best-practice recommendations, a minimum of four spines should be deployed even in a single-stripe fabric. | For 2048–8192 GPUs, the Juniper QFX5240-64CD/QFX5241-64CD can serve as spine and leaf devices to achieve the appropriate scale. This 3-stage, rail-based fabric design provides physical connectivity to 16 stripes from 64 spines and 1024 leaf nodes, maintaining a 1:1 subscription throughput model. | For infrastructures supporting more than 8192 GPUs, Juniper PTX1000x chassis spines with QFX5240-64CD/QFX5241-64CD leaf nodes can support up to 32768 GPUs. This 3-stage, rail-based fabric design provides physical connectivity to 64 stripes from 64 spines and 4096 leaf nodes, maintaining a 1:1 subscription throughput model. |
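The medium and large designs in Table 18 share the same ratios: 512 GPUs per stripe (8192 GPUs / 16 stripes, 32768 GPUs / 64 stripes) and 8 GPUs per leaf node (8192 / 1024, 32768 / 4096). As a rough sketch only, the following Python snippet derives stripe and leaf counts from a target GPU count using those table-derived constants; the constant names and the `fabric_size` helper are illustrative, not part of any Juniper tooling, and a real design must also account for spine count, port speeds, and oversubscription targets.

```python
import math

# Ratios derived from Table 18; assumed constant across the
# medium and large 3-stage, rail-based designs.
GPUS_PER_STRIPE = 512
GPUS_PER_LEAF = 8
MIN_SPINES = 4  # best-practice minimum for path diversity


def fabric_size(gpus: int) -> dict:
    """Estimate stripe/leaf counts for a 3-stage rail-based fabric.

    Rounds up, since partially filled stripes and leaves still
    consume physical devices.
    """
    return {
        "stripes": math.ceil(gpus / GPUS_PER_STRIPE),
        "leaves": math.ceil(gpus / GPUS_PER_LEAF),
        "min_spines": MIN_SPINES,
    }


# Cross-check against the table's medium and large columns.
print(fabric_size(8192))   # 16 stripes, 1024 leaves
print(fabric_size(32768))  # 64 stripes, 4096 leaves
```

Applying the helper to the table's upper bounds reproduces the published stripe and leaf counts, which is a useful sanity check when interpolating for intermediate GPU counts.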


