What Is AI Data Center Networking?

AI data center networking refers to the data center networking fabric that enables artificial intelligence (AI). It supports the rigorous scalability, performance, and low-latency requirements of AI and machine learning (ML) workloads, which are particularly demanding in the AI training phase.

In early high-performance computing (HPC) and AI training networks, InfiniBand, a high-speed, low-latency, proprietary networking technology, initially gained popularity for its fast and efficient communication between servers and storage systems. Today, the open alternative is Ethernet, which is gaining significant traction in the AI data center networking market and is expected to become the dominant technology.

There are multiple reasons for Ethernet’s growing adoption, but operations and cost stand apart. The talent pool of network professionals who can build and operate an Ethernet network is far larger than for proprietary InfiniBand, and a much broader array of tools is available to manage Ethernet networks; InfiniBand technology, by contrast, is sourced primarily from Nvidia.


What AI-Driven Requirements Are Addressed by AI Data Center Networking?

Generative AI is proving to be a transformative technology around the world, and it, like large deep-learning models in general, brings new AI data center networking requirements. An AI model is developed in three phases:

  • Phase 1: Data preparation–Gathering and curating data sets to be fed into the AI model.
  • Phase 2: AI training–Teaching an AI model to perform a specific task by exposing it to large amounts of data. During this phase, the AI model learns patterns and relationships within the training data to develop virtual synapses to mimic intelligence.
  • Phase 3: AI inference–Operating in a real-world environment to make predictions or decisions based on new, unseen data.

Phase 3 is generally supported by existing data center and cloud networks. However, Phase 2 (AI training) requires extensive data and compute resources to support its iterative process, in which the AI model learns from continuously gathered data to refine its parameters. Graphics processing units (GPUs) are well suited to AI training and inference workloads but must work in clusters to be efficient. Scaling up clusters improves training efficiency but also increases cost, so it is critical to use AI data center networking that does not impede the cluster’s efficiency.

Many GPU servers, sometimes tens of thousands (at costs exceeding $400,000 per server in 2023), must be connected to train large models. As a result, optimizing job completion time (JCT) and minimizing or eliminating tail latency (a condition where outlier AI workloads slow the completion of the entire AI job) are key to optimizing the return on GPU investment. In this use case, the AI data center network must be 100% reliable and cause no efficiency degradation in the cluster.
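
To see why tail latency matters so much, consider a minimal simulation sketch (all numbers are illustrative assumptions, not measurements). In synchronous training, every iteration waits for the slowest worker, so a rare stall on any one of hundreds of GPUs routinely lands on the critical path:

```python
# Illustrative sketch: why tail latency dominates job completion time (JCT)
# in synchronous AI training. All figures are assumptions, not benchmarks.
import random

random.seed(7)

WORKERS = 512        # assumed GPU worker count
ITERATIONS = 1_000   # assumed number of training iterations
BASE_MS = 10.0       # nominal per-iteration compute + communication time

def iteration_ms(stall_prob: float, stall_ms: float) -> float:
    """One synchronous step finishes only when the slowest worker does."""
    return max(
        BASE_MS + (stall_ms if random.random() < stall_prob else 0.0)
        for _ in range(WORKERS)
    )

def jct_seconds(stall_prob: float, stall_ms: float) -> float:
    return sum(iteration_ms(stall_prob, stall_ms) for _ in range(ITERATIONS)) / 1000

print(f"lossless fabric           : {jct_seconds(0.0, 0.0):6.1f} s")
print(f"0.1% chance of 50 ms stall: {jct_seconds(0.001, 50.0):6.1f} s")
```

With 512 workers, a 0.1% per-worker stall probability means roughly 40% of iterations hit at least one stall, which is why a lossless, congestion-managed fabric translates so directly into GPU return on investment.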


How Does AI Data Center Networking Work?

Although pricey GPU servers typically dominate the overall cost of AI data centers, AI data center networking is critical because a high-performing network is required to maximize GPU utilization. Ethernet is an open, proven technology well suited to this task when deployed in a data center network architecture enhanced for AI. The enhancements include congestion management, load balancing, and minimized latency to optimize JCT. Finally, simplified management and automation ensure reliability and sustained performance.

Fabric Design

Various fabric designs may be used in AI data center networking; however, an any-to-any, non-blocking Clos fabric is recommended to optimize the training framework. These fabrics are built using a consistent networking speed of 400 Gbps (moving to 800 Gbps) from the NIC to the leaf and through the spine. A two-layer, three-stage non-blocking fabric or a three-layer, five-stage non-blocking fabric may be used depending on the model size and GPU scale.
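
As a rough illustration of how such a fabric is dimensioned, the sketch below sizes a two-layer (three-stage) non-blocking Clos fabric. The 64-port switch radix and the 1:1 split between downlinks and uplinks are assumptions made for the example, not product specifications:

```python
# Back-of-the-envelope sizing for a two-layer (three-stage) non-blocking
# Clos fabric. Switch radix and port split are illustrative assumptions.
import math

SWITCH_RADIX = 64                  # assumed 64 x 400GbE ports per switch
DOWN_PER_LEAF = SWITCH_RADIX // 2  # half the ports face GPU NICs
UP_PER_LEAF = SWITCH_RADIX // 2    # half face the spine (1:1, non-blocking)

def fabric_size(gpu_nics: int) -> dict:
    """Leaf/spine counts needed to attach gpu_nics ports without blocking."""
    leaves = math.ceil(gpu_nics / DOWN_PER_LEAF)
    spines = math.ceil(leaves * UP_PER_LEAF / SWITCH_RADIX)  # terminate every uplink
    return {"leaves": leaves, "spines": spines,
            "fabric_links": leaves * UP_PER_LEAF}

print(fabric_size(1024))  # -> {'leaves': 32, 'spines': 16, 'fabric_links': 1024}
```

Once the required leaf count exceeds what the spine radix can terminate, the design grows into the three-layer, five-stage variant mentioned above.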

Flow Control and Congestion Avoidance

Beyond raw fabric capacity, further design considerations increase the reliability and efficiency of the overall fabric. These include properly sized fabric interconnects with the optimal number of links and the ability to detect and correct flow imbalances to avoid congestion and packet loss. Explicit congestion notification (ECN) with data center quantized congestion notification (DCQCN), combined with priority-based flow control (PFC), resolves flow imbalances to ensure lossless transmission.

To reduce congestion, dynamic and adaptive load balancing is deployed at the switch. Dynamic load balancing redistributes flows locally at the switch so they are spread evenly across links. Adaptive load balancing monitors flow forwarding and next-hop tables to identify imbalances and steer traffic away from congested paths.
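
Why static hashing alone is not enough becomes clear with a small sketch. AI traffic consists of a few very large "elephant" flows, so hash collisions concentrate load on a handful of uplinks; a least-loaded placement (a simplified stand-in for the dynamic and adaptive schemes above) evens things out. The flow set and uplink count here are hypothetical:

```python
# Hypothetical contrast: static ECMP-style hashing vs. a simplified
# least-loaded placement standing in for dynamic/adaptive load balancing.
import hashlib
from collections import Counter

UPLINKS = 8
# A few large "elephant" flows (src, dst, Gbps), typical of AI training.
FLOWS = [(f"10.0.0.{i}", f"10.0.1.{i % 4}", 400) for i in range(16)]

def static_link(src: str, dst: str) -> int:
    """ECMP-style placement: hash the flow key onto one of the uplinks."""
    digest = hashlib.sha256(f"{src}->{dst}".encode()).hexdigest()
    return int(digest, 16) % UPLINKS

static_load = Counter({link: 0 for link in range(UPLINKS)})
for src, dst, gbps in FLOWS:
    static_load[static_link(src, dst)] += gbps

# Simplified dynamic balancing: place each flow on the least-loaded uplink.
dynamic_load = [0] * UPLINKS
for _, _, gbps in sorted(FLOWS, key=lambda f: -f[2]):
    dynamic_load[dynamic_load.index(min(dynamic_load))] += gbps

print("static hashing :", sorted(static_load.values(), reverse=True))
print("least-loaded   :", sorted(dynamic_load, reverse=True))
```

With only 16 flows across 8 uplinks, static hashing typically leaves some links carrying two or three flows while others sit idle, which is exactly the imbalance the switch-level schemes above are designed to detect and correct.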

When congestion cannot be avoided, ECN provides early notification to applications. During these periods, leaf and spine switches mark ECN-capable packets to notify senders of the congestion, causing the senders to slow transmission and avoid packet drops in transit. If the endpoints do not react in time, PFC allows Ethernet receivers to share buffer-availability feedback with senders, pausing or throttling traffic on specific links to reduce congestion and avoid packet drops, enabling lossless transmission for specific traffic classes.
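
The division of labor between the two mechanisms can be sketched as a toy control loop (a simplification for intuition, not the DCQCN specification; all thresholds and rates are invented):

```python
# Toy control loop: ECN slows senders early; PFC pauses as a last resort.
# Thresholds, rates, and gains are invented for illustration only.

QUEUE_LIMIT = 1_000  # switch buffer depth, in packets
ECN_MARK = 200       # start marking packets above this queue depth
PFC_PAUSE = 900      # pause the upstream sender above this depth
LINE_RATE = 400      # sender's maximum rate, packets per tick
DRAIN = 150          # packets the egress port drains per tick

def tick(queue: float, rate: float) -> tuple:
    arrivals = 0.0 if queue >= PFC_PAUSE else rate  # PFC: hard pause
    queue = min(QUEUE_LIMIT, queue + arrivals)
    if queue > ECN_MARK:
        rate *= 0.5                       # ECN echo: multiplicative decrease
    else:
        rate = min(rate + 10, LINE_RATE)  # additive recovery
    return max(0.0, queue - DRAIN), rate

queue, rate = 0.0, float(LINE_RATE)
for t in range(12):
    queue, rate = tick(queue, rate)
    print(f"t={t:2d}  queue={queue:6.0f}  rate={rate:6.1f}")
```

The pattern to notice is that the send rate settles into oscillating around the point where the queue stays near the marking threshold, so the PFC pause path is rarely exercised; ECN does the steady-state work and PFC is the backstop.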

Scale and Performance

Ethernet has emerged as the open-standard solution of choice for handling the rigors of high-performance computing and AI applications. It has evolved over time (including the current progression to 800 GbE and data center bridging (DCB)) to become faster, more reliable, and more scalable, making it the preferred choice for the high-throughput, low-latency requirements of mission-critical AI applications.

Automation

Automation is the final piece for an effective AI data center networking solution, though not all automation is created equal. For full value, the automation software must provide experience-first operations. It is used in design, deployment, and management of the AI data center on an ongoing basis. It automates and validates the AI data center network lifecycle from Day 0 through Day 2+. This results in repeatable and continuously validated AI data center designs and deployments that not only remove human error but also take advantage of telemetry and flow data to optimize performance, facilitate proactive troubleshooting, and avert outages.   


Juniper AI Data Center Networking Solution Builds Upon Decades of Networking Experience and AIOps Innovations

Juniper’s AI data center networking solution builds upon our decades of networking experience and AIOps innovations to round out open, fast, and simple-to-manage Ethernet-based AI networking solutions. These high-capacity, scalable, non-blocking fabrics deliver the highest AI performance, fastest job completion time, and most efficient GPU utilization. The Juniper AI data center networking solution leverages three fundamental architectural pillars:

  • Massively scalable performance–To optimize job completion time and therefore GPU efficiency
  • Industry-standard openness–To extend existing data center technologies with industry-driven ecosystems that promote innovation and drive down costs over the long term
  • Experience-first operations–To automate and simplify AI data center design, deployment, and operations for back-end, front-end, and storage fabrics

These pillars are supported by:

  • A high-capacity, lossless AI data center network design taking advantage of an any-to-any non-blocking Clos fabric, the most versatile topology to optimize AI training frameworks
  • High-performance switches and routers, including Juniper PTX Series Routers, based on Juniper Express Silicon for the spine/super spine, and QFX Series Switches, based on Broadcom’s Tomahawk ASICs as leaf switches providing AI server connectivity
  • Fabric efficiency with flow control and congestion avoidance
  • Open, standards-based Ethernet scale and performance with 800 GbE
  • Extensive automation using Juniper Apstra® intent-based networking software to automate and validate the AI data center network lifecycle from Day 0 through Day 2+


AI Data Center Networking FAQs

What problem does AI data center networking solve?

AI data center networking addresses the performance requirements of generative AI and of large deep-learning models in general. AI training, in particular, requires extensive data and compute resources to support its iterative process, in which the AI model learns from continuously gathered data to refine its parameters. Graphics processing units (GPUs) are well suited to AI training and inference workloads but must work in clusters to be efficient. Scaling up clusters improves training efficiency but also increases cost, so it is critical to use AI data center networking that does not impede the efficiency of the cluster.

Many GPU servers, sometimes tens of thousands (at costs exceeding $400,000 per server in 2023), must be connected to train large models. As a result, optimizing job completion time and minimizing or eliminating tail latency (a condition where outlier AI workloads slow the completion of the entire AI job) are key to optimizing the return on GPU investment. In this use case, the AI data center network must be 100% reliable and cause no efficiency degradation in the cluster.

What are the advantages of Ethernet over InfiniBand for AI data center networking?

In early high-performance computing (HPC) and AI training networks, InfiniBand, a high-speed, low-latency, proprietary networking technology, initially gained popularity for its fast and efficient communication between servers and storage systems. Today, Ethernet, the open alternative, is gaining significant traction in the AI data center networking market and is expected to become the dominant technology.

While proprietary technologies like InfiniBand can bring advancements and innovation, they are expensive, charging premiums where competitive supply-and-demand markets can’t regulate costs. In addition, the talent pool of network professionals who can build and operate an Ethernet network is far larger than for proprietary InfiniBand, and a much broader array of tools is available to manage Ethernet networks; InfiniBand technology, by contrast, is sourced primarily from Nvidia.

Next to IP, Ethernet is the world's most widely adopted networking technology. It has evolved to become faster, more reliable, and more scalable, making it preferred for the high data throughput and low-latency requirements of AI applications. The progression to 800 GbE and data center bridging (DCB) enhancements enables high-capacity, low-latency, lossless data transmission, making Ethernet fabrics highly desirable for high-priority and mission-critical AI traffic.

What AI data center networking solutions, products, and technologies does Juniper offer?

Juniper’s AI data center networking solution provides a high-capacity, lossless AI data center network design that uses an any-to-any non-blocking Clos fabric, the most versatile topology to optimize AI training frameworks. The solution takes advantage of high-performance, open standards-based Ethernet switches and routers with interfaces up to 800 GbE. In addition, it uses Juniper Apstra intent-based networking software to automate and validate the AI data center network lifecycle from Day 0 through Day 2+.