Load Balancing Overview for AI-ML Data Centers

AI-ML data centers process elephant flows (large, long-lived flows that consume a disproportionate share of link bandwidth), and these flows create significant load balancing challenges. If elephant flows are not load-balanced properly across the network, they are likely to cause traffic congestion. When congestion does occur, ineffective load balancing can compound the problem by inadvertently directing more traffic to links that are already congested. Junos OS Evolved offers several types of load-balancing configurations that are optimized for the challenges of elephant flows.

As the network administrator, you can configure three main types of load balancing on your network:

  • Static load balancing (SLB)—In SLB, you configure certain types of traffic to always use certain links. SLB is the most basic type of load balancing.

  • Dynamic load balancing (DLB)—DLB dynamically chooses the link for a traffic flow based on the size of the traffic queue and the local link bandwidth utilization. DLB also checks the health of a link before rerouting traffic. DLB is more effective at avoiding traffic congestion than SLB.

    DLB has several modes and types that allow for customization, including:

    • Selective DLB—Selectively enable DLB for certain per-packet scenarios and use SLB for others.

    • Flowlet mode—In flowlet mode, DLB tracks the status of flows using an inactivity timer. When the inactivity timer expires for a particular flow, DLB rechecks whether the flow's current egress link is still optimal. If the link is no longer optimal, DLB selects a new egress link.

    • Reactive path rebalancing—Use this enhancement to DLB to move traffic to a better-quality link even when flowlet mode is enabled and the flow never pauses long enough for the inactivity timer to expire.

  • Global load balancing (GLB)—GLB is an improvement on DLB. While DLB takes into account only the local link bandwidth utilization, GLB also has visibility into the bandwidth utilization of links at the next-to-next-hop (NNH) level. As a result, GLB can reroute traffic flows to avoid congestion that occurs farther out in the network than DLB can detect.
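
The flowlet-mode behavior described in the list above can be illustrated with a short Python sketch. This is a conceptual model only, not how Junos OS Evolved implements DLB: the `Link` and `FlowletTable` names, the normalized 0-to-1 metrics, and the equal weighting of queue depth and utilization are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    name: str
    queue_depth: float   # normalized 0..1 (assumed scale)
    utilization: float   # normalized 0..1 (assumed scale)
    healthy: bool = True

    def cost(self) -> float:
        # Lower is better. Equal weighting of the two metrics is an assumption.
        return self.queue_depth + self.utilization


@dataclass
class FlowletTable:
    inactivity_interval: float                    # same time unit as `now`
    entries: dict = field(default_factory=dict)   # flow_id -> (link name, last seen)

    def select_link(self, flow_id, links, now):
        entry = self.entries.get(flow_id)
        if entry is not None:
            link_name, last_seen = entry
            if now - last_seen < self.inactivity_interval:
                # Timer has not expired: the flow stays on its current link.
                self.entries[flow_id] = (link_name, now)
                return link_name

        # New flow or expired timer: re-evaluate, considering healthy links only.
        healthy = [link for link in links if link.healthy]
        if not healthy:
            raise RuntimeError("no healthy egress link available")
        best = min(healthy, key=Link.cost)
        if entry is not None:
            current = next((l for l in healthy if l.name == entry[0]), None)
            if current is not None and current.cost() <= best.cost():
                best = current  # current link is still optimal, so keep it
        self.entries[flow_id] = (best.name, now)
        return best.name


links = [Link("link-a", 0.2, 0.3), Link("link-b", 0.1, 0.1)]
table = FlowletTable(inactivity_interval=0.5)

first = table.select_link("flow-1", links, now=0.0)   # new flow: best link wins
links[1].queue_depth = 0.9                            # link-b becomes congested
second = table.select_link("flow-1", links, now=0.1)  # timer not expired: stays put
third = table.select_link("flow-1", links, now=1.0)   # gap >= interval: re-evaluated
print(first, second, third)  # link-b link-b link-a
```

In this model the flow remains on the congested link until it pauses for at least the inactivity interval, which is the gap that reactive path rebalancing is meant to close.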

You can use these different load balancing techniques in parallel within your AI-ML data center fabric.
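
To make the difference between DLB and GLB concrete, the following Python sketch compares a choice based only on local link utilization with one that also considers the link beyond the next hop. It is a simplified illustration under stated assumptions: the `Path` type, the spine names, the utilization values, and the use of the worst (bottleneck) utilization as the GLB path metric are all hypothetical and do not describe the actual Junos OS Evolved algorithm.

```python
from typing import NamedTuple


class Path(NamedTuple):
    next_hop: str
    local_util: float       # utilization of the local link to the next hop
    downstream_util: float  # utilization of the next hop's link toward the NNH


def dlb_choice(paths):
    # DLB-style decision: only the local link utilization is visible.
    return min(paths, key=lambda p: p.local_util).next_hop


def glb_choice(paths):
    # GLB-style decision: also account for the link one hop farther out.
    # Scoring a path by its most-utilized link is an assumption of this sketch.
    return min(paths, key=lambda p: max(p.local_util, p.downstream_util)).next_hop


paths = [
    Path("spine-1", local_util=0.20, downstream_util=0.95),  # congested downstream
    Path("spine-2", local_util=0.40, downstream_util=0.30),
]

print(dlb_choice(paths))  # spine-1
print(glb_choice(paths))  # spine-2
```

Here the local-only decision sends traffic toward spine-1 because its local link looks lightly loaded, while the decision with next-to-next-hop visibility avoids the congested downstream link by choosing spine-2.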