AI-ML Data Center Overview

As artificial intelligence (AI) and machine learning (ML) applications expand, the networks that support them require increased capacity to handle large data flows. This is especially true for the data centers that store AI-ML data sets. Junos® OS Evolved offers a set of innovative features for AI-ML data centers. Use this guide to learn how to configure these features to optimize operations inside AI-ML data center fabrics.

Generative AI and ML applications such as large language models (LLMs) are based on statistical analysis of data sets: the more often the computational model finds a pattern in the data, the more it reinforces that pattern in its output. Through this repetitive pattern finding, these models are able to accomplish tasks such as convincingly imitating human speech. However, a generative AI application is only as good as the data set it is trained on. The larger the data set, the more patterns the model is able to detect. For this reason, AI and ML applications require large data sets. These data sets are stored in data centers.

To increase the speed of training, AI and ML models are often trained within the data center network through parallel computing. Complex computations occur simultaneously on graphics processing unit (GPU) clusters. The GPUs in a cluster are hosted on server nodes that are distributed across the data center. The network must synchronize the output from the GPUs within a cluster to create a fully trained model. This synchronization requires the continuous movement of large data flows across the back end of the network.
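The GPU synchronization described above is typically performed with a collective operation such as ring all-reduce, in which partial results travel hop by hop around a logical ring of workers until every worker holds the combined result. The sketch below simulates that exchange in plain Python; the function name and data layout are illustrative, not any vendor's API, and a real deployment would use a collective communication library over the fabric.

```python
def ring_allreduce(workers):
    """Simulate ring all-reduce: each inner list is one worker's gradient
    vector (length must be divisible by the worker count). Returns the
    per-worker vectors after the exchange; all end up holding the sum."""
    n = len(workers)
    size = len(workers[0])
    assert size % n == 0, "vector length must divide evenly into n chunks"
    chunk = size // n
    data = [list(w) for w in workers]  # each worker's local copy

    def sl(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. In each of n-1 steps, every worker i sends
    # one chunk to its ring neighbor (i+1) % n, which adds it in.
    for step in range(n - 1):
        sends = [((i - step) % n, data[i][sl((i - step) % n)]) for i in range(n)]
        for i, (c, payload) in enumerate(sends):
            dst = (i + 1) % n
            data[dst][sl(c)] = [a + b for a, b in zip(data[dst][sl(c)], payload)]

    # After reduce-scatter, worker i holds the fully reduced chunk (i+1) % n.
    # Phase 2: all-gather. The completed chunks circulate around the ring
    # for n-1 steps until every worker has every reduced chunk.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, data[i][sl((i + 1 - step) % n)]) for i in range(n)]
        for i, (c, payload) in enumerate(sends):
            data[(i + 1) % n][sl(c)] = payload

    return data
```

Each step moves only one chunk per worker between ring neighbors, which is why this pattern produces the sustained, uniform elephant flows across the back-end fabric that the rest of this guide is concerned with.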

The large data flows, also known as elephant flows, in AI-ML data centers require robust networks. When dealing with large data flows, an insufficient network quickly encounters problems such as traffic congestion, dropped packets, and link failures. These network problems are especially unacceptable when dealing with data that requires high levels of accuracy. One robust network design ideal for AI-ML data centers is the Rail-Optimized Stripe. In this AI cluster architecture, corresponding GPUs across server nodes connect to the same leaf switch, or rail. The architecture minimizes network disruption by first moving data within the server to the GPU that shares a rail with the destination GPU, so the flow crosses the fabric on a single rail. An IP Clos architecture is another functional AI-ML data center fabric design.

Juniper Networks® QFX Series Switches running Junos OS Evolved are ideal candidates for both Rail-Optimized Stripe architectures and IP Clos network designs. For example, the QFX5220-32CD, QFX5230-64CD, QFX5240-64OD, and QFX5240-64QD switches work well in both network types as leaf, spine, and superspine devices. These switches also function well in a group of leaf-spine switches called a point of delivery (POD). To build larger AI-ML clusters in your data center, you can use a superspine layer to interconnect different PODs. You can deploy these switches as a single POD or as multiple PODs for maximum flexibility and network redundancy. In addition, these devices support advanced AI-ML features that solve many load-balancing and traffic-management problems common to AI-ML data centers.
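One example of such a feature is dynamic load balancing (DLB), which moves flowlets of an elephant flow onto the least-loaded ECMP member instead of pinning the whole flow to a single hash-selected link. The stanza below is a hedged sketch only: the `ecmp-dlb` hierarchy under `forwarding-options enhanced-hash-key` and its supported options vary by QFX platform and Junos OS Evolved release, so verify the exact syntax against the documentation for your version.

```
# Sketch: enable flowlet-mode dynamic load balancing over ECMP paths.
# The inactivity-interval value (in microseconds) is illustrative.
set forwarding-options enhanced-hash-key ecmp-dlb flowlet
set forwarding-options enhanced-hash-key ecmp-dlb flowlet inactivity-interval 16
```

In flowlet mode, a gap in a flow longer than the inactivity interval lets the switch safely rebalance the next burst onto a different link without reordering packets within the flow.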