
Juniper RDMA-aware Load Balancing (LB) and BGP-DPF – GPU Backend Fabric IP Services

In this section, we describe the strategies employed to address traffic congestion and optimize load distribution within the Backend GPU fabric.

Congestion Management and Congestion Control Configuration

Congestion management and congestion control are implemented using the Data Center Quantized Congestion Notification (DCQCN) approach, which ensures traffic fairness and maintains stability across the lossless fabric.

AI clusters impose unique demands on network infrastructure due to their high-density and low-entropy traffic patterns, characterized by frequent elephant flows and minimal flow variability. Moreover, most AI training workloads require uninterrupted, lossless packet delivery to be completed successfully. Consequently, when designing a network infrastructure for AI traffic flows, the key objectives include maximizing throughput, minimizing latency, and minimizing network interference while ensuring lossless operation. These requirements necessitate the deployment of effective congestion control mechanisms.

Data Center Quantized Congestion Notification (DCQCN) has become the industry-standard method for end-to-end congestion control in RoCEv2 environments. DCQCN provides mechanisms to adjust traffic rates in response to congestion events without relying on packet drops, striking a balance between reducing traffic rates and maintaining ongoing traffic flow.

It is important to note that DCQCN is primarily required in the GPU backend fabric, where the majority of AI workload traffic resides. It is generally unnecessary in the Frontend and Storage Backend fabrics.

DCQCN combines two complementary mechanisms to implement flow and congestion control:

  • Priority-based Flow Control (PFC)
  • Explicit Congestion Notification (ECN)

Priority-Based Flow Control (PFC) mitigates data loss by pausing traffic transmission for specific traffic classes, based on IEEE 802.1p priorities or DSCP markings mapped to queues.

When congestion is detected, PFC operates by sending PAUSE control frames upstream, requesting the sender to halt transmission of traffic associated with a specific priority. The sender completely stops sending traffic for that priority until the congestion subsides or the PAUSE timer expires.

While PFC prevents packet drops and allows the receiver to catch up, it also impacts application performance for traffic using the affected queues. Furthermore, resuming transmission after a pause can lead to sudden traffic surges, potentially re-triggering congestion. For these reasons, PFC should be configured carefully so that it is used only as a last resort.

Explicit Congestion Notification (ECN) offers a proactive congestion signaling mechanism, reducing transmission rates while allowing traffic to continue flowing during congestion periods.

When congestion occurs, ECN bits in the IP header are marked (11), prompting the receiver to generate Congestion Notification Packets (CNPs), which inform the source to throttle its transmission rate. Unlike PFC, ECN aims to gradually reduce congestion without halting traffic completely or triggering packet drops.

Best Practice: Combining PFC and ECN provides the most effective congestion control strategy in a lossless IP fabric supporting RoCEv2. Their parameters must be carefully tuned so that ECN mechanisms are triggered before PFC.

For more detailed guidance, refer to Introduction to Congestion Control in Juniper AI Networks, which outlines best practices for building lossless fabrics for AI workloads using DCQCN (ECN and PFC) congestion control methods alongside Dynamic Load Balancing (DLB). The document is based on validation against DLRM training models and demonstrates how ECN thresholds, PFC parameters, input drops, and tail drops can be monitored and adjusted to optimize fabric performance for RoCEv2 traffic.

Note:

While we provide general recommendations and lab-validated parameters, each AI workload may present distinct traffic patterns. Class of Service (CoS) and load balancing attributes might need to be further tuned to match the specific characteristics of a particular model and cluster environment.

The leaf and spine nodes in this JVD are configured with the CoS parameters that were determined to provide the best performance.

The following configuration is applied uniformly across all devices in the fabric.

Traffic Classification

Traffic classification is based on DSCP and implemented using the fabric-dscp classifier, which defines two forwarding classes: NO-LOSS and CNP. This classifier is applied to all et-* unit * logical interfaces.

All incoming traffic with DSCP 011010 (26) is classified as NO-LOSS, while traffic marked with DSCP 110000 (48) is classified as CNP. All GPU servers are configured to mark RoCEv2 traffic with DSCP 26 and Congestion Notification Packets (CNPs) with DSCP 48.
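Assuming standard Junos CoS syntax, the classifier described above can be sketched as follows (illustrative only, not the validated JVD configuration):

```
class-of-service {
    classifiers {
        dscp fabric-dscp {
            /* DSCP 26: RoCEv2 data traffic */
            forwarding-class NO-LOSS {
                loss-priority low code-points 011010;
            }
            /* DSCP 48: Congestion Notification Packets */
            forwarding-class CNP {
                loss-priority low code-points 110000;
            }
        }
    }
    interfaces {
        et-* {
            unit * {
                classifiers {
                    dscp fabric-dscp;
                }
            }
        }
    }
}
```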

Note:

Refer to the Configuring NVIDIA DCQCN – ECN section of the AI Data Center Network with Juniper Apstra, NVIDIA GPUs, and WEKA Storage JVD and the Configuring AMD DCQCN – ECN section of the AI Data Center Network with Juniper Apstra, AMD GPUs, and Vast Storage JVD for details on how to configure DCQCN parameters on the Nvidia and AMD GPU servers.

CNP traffic is assigned to output queue 3, while NO-LOSS traffic is assigned to output queue 4.

Queue 4 is configured as lossless using the no-loss attribute and is mapped to PFC priority 3. Defining a queue as lossless ensures that packets mapped to this class are not dropped due to congestion, an essential requirement for RoCEv2. Configuring a forwarding class as lossless also impacts buffer allocation on the switch, reserving additional space to support flow control mechanisms such as PFC.
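A sketch of the corresponding forwarding-class definitions, assuming the usual Junos QFX syntax for lossless classes:

```
class-of-service {
    forwarding-classes {
        /* CNPs use queue 3 */
        class CNP queue-num 3;
        /* RoCEv2 data uses queue 4, lossless, mapped to PFC priority 3 */
        class NO-LOSS queue-num 4 no-loss pfc-priority 3;
    }
}
```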

There are two types of buffers:

  • Shared Buffer Pool: A global memory space dynamically shared by all ports. It is partitioned between lossy and lossless traffic types. Larger shared buffers help absorb traffic bursts.
  • Dedicated Buffer Pool: A reserved portion of memory allocated per port, which is then divided among the queues on that port. Though it can be tuned, a minimum amount is always reserved by the system. A larger dedicated buffer pool means congestion on one port is less likely to affect traffic on another port, because traffic does not need to use as much shared buffer space. However, the larger the dedicated buffer pool, the less bursty traffic the switch can handle, because less dynamic shared buffer memory remains.

The recommended values for the Shared and Dedicated Buffers in this JVD are as follows:

Shared buffers:

  • Ingress lossless percent 66: Reserves 66% of the ingress shared buffer space for lossless traffic (e.g., RoCEv2).
  • Ingress lossless-headroom percent 24: Carves out an additional 24% of ingress buffer space specifically as headroom for burst absorption. This ensures that RoCEv2 flows have sufficient space to accommodate microbursts while waiting for PFC pause frames to take effect.
  • Ingress lossy percent 10: Reserves 10% of ingress shared buffer space for lossy traffic.
  • Ingress lossless dynamic-threshold 10: Allows the lossless buffer pool to dynamically expand into unused lossy buffer space by up to 10%, providing flexibility under heavy load.
  • Egress lossless percent 66: Reserves 66% of egress shared buffer space for lossless traffic.
  • Egress lossy percent 10: Allocates 10% for lossy traffic.
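The shared-buffer percentages above can be sketched as a Junos stanza. The buffer-partition statement names are assumed from standard QFX shared-buffer configuration and may differ by platform and release:

```
class-of-service {
    shared-buffer {
        ingress {
            buffer-partition lossless {
                percent 66;
                /* allow expansion into unused lossy space */
                dynamic-threshold 10;
            }
            buffer-partition lossless-headroom {
                /* headroom for burst absorption while PFC takes effect */
                percent 24;
            }
            buffer-partition lossy {
                percent 10;
            }
        }
        egress {
            buffer-partition lossless {
                percent 66;
            }
            buffer-partition lossy {
                percent 10;
            }
        }
    }
}
```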

Dedicated Buffers (per-port or per-queue):

  • Ingress percent 15: Allocates 15% of the total ingress buffer capacity as dedicated buffers. These are not shared and are reserved for specific traffic classes or ports.
  • Egress percent 30: Reserves 30% of egress buffer space for dedicated use.

When this buffer space begins to fill, the PFC mechanism sends Ethernet PAUSE frames to the traffic source instructing it to temporarily halt transmission and prevent packet loss.

Since traffic classification is DSCP-based and the interfaces between GPU servers and leaf nodes are untagged, the PFC implementation is DSCP-based PFC. The congestion-notification-profile pfc, which is applied to all et-* interfaces, defines the operational details of PFC.

Note:

The name congestion-notification-profile might suggest a relationship to the Congestion Notification Packets used by ECN, and it is sometimes abbreviated as CNP in documentation. However, this profile defines the behavior of PFC, not ECN.

The PFC watchdog function monitors for deadlock or stuck queues caused by persistent PFC pause conditions. If a queue remains paused for too long (indicating possible head-of-line blocking), the watchdog can take corrective actions to avoid traffic stall conditions.

The input dscp code-point 011010 pfc statement specifies that incoming traffic marked with DSCP value 011010 (decimal 26) should trigger PFC when congestion is detected. Essentially, if DSCP 26 (RoCEv2) traffic is experiencing congestion, PFC frames for priority 3 will be generated to pause upstream senders (PFC priority 3 is mapped to code point 26). The pause frames are generated for priority 3 based on the forwarding-class NO-LOSS configuration previously described.

In the example below:

Figure 1: PFC Pause Frames Generation Example

The combination of the following commands, applied to interfaces et-0/0/0:0 and et-0/0/1:0, configures the device to classify all inbound traffic with DSCP 26 into the forwarding class NO-LOSS (assigned to queue 4 and mapped to PFC priority 3), makes queue 4 a no-loss queue, and enables PFC for traffic with DSCP 26.

The output ieee-802.1 code-point 011 flow-control-queue 4 statement specifies that when pause frames with priority 3 are received, traffic on queue 4 must stop.

Figure 2: PFC Received Pause Frames Behavior
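Putting the two statements quoted above together, the PFC profile and its interface binding can be sketched as (the hierarchy is inferred from the statement names in this section):

```
class-of-service {
    congestion-notification-profile {
        pfc {
            input {
                dscp {
                    /* DSCP 26 (RoCEv2): generate priority-3 pause frames on congestion */
                    code-point 011010 {
                        pfc;
                    }
                }
            }
            output {
                ieee-802.1 {
                    /* received priority-3 pause frames stop queue 4 */
                    code-point 011 {
                        flow-control-queue 4;
                    }
                }
            }
        }
    }
    interfaces {
        et-* {
            congestion-notification-profile pfc;
        }
    }
}
```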

Traffic Scheduling

The scheduler map sm1 is applied to all et-* interfaces and defines how traffic for each forwarding class is scheduled.

Two schedulers are included:

  • s1 for NO-LOSS traffic (queue 4)
  • s2-cnp for CNP traffic (queue 3)

NO-LOSS Traffic Scheduling (Scheduler s1)

Scheduler s1 controls how traffic in the NO-LOSS forwarding class (queue 4) is serviced. It applies the drop-profile dp1 and enables Explicit Congestion Notification (ECN) marking using the explicit-congestion-notification statement.

Note:

Drop profiles in Junos are commonly used to control how aggressively packets are dropped as the queue buffer fills up. However, when ECN is enabled, the profile is used to mark packets instead of dropping them. Marking packets means setting the Congestion Experienced (CE) bit in the IP header based on the configured thresholds.

Figure 3: ECN Profile Example

The profile dp1 defines a linear drop curve where:

  • At 55% buffer fill, packets are not marked (0% probability).
  • At 90% buffer fill, all matching packets are marked (100% probability).
  • Between 55% and 90%, the marking probability increases linearly from 0% to 100%.

This approach ensures early congestion feedback to RoCEv2 endpoints while maintaining lossless delivery.
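Under the common Junos interpolate syntax, the dp1 curve and its use by scheduler s1 can be sketched as follows (a sketch built from the thresholds above, not the exact JVD configuration):

```
class-of-service {
    drop-profiles {
        dp1 {
            /* linear curve: 0% marking at 55% fill, 100% at 90% fill */
            interpolate {
                fill-level [ 55 90 ];
                drop-probability [ 0 100 ];
            }
        }
    }
    schedulers {
        s1 {
            /* with ECN enabled, dp1 marks packets (CE bit) instead of dropping */
            drop-profile-map loss-priority any protocol any drop-profile dp1;
            explicit-congestion-notification;
        }
    }
}
```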

CNP Traffic Scheduling (Scheduler s2-cnp)

Scheduler s2-cnp specifies how CNP traffic in queue 3 is serviced. It assigns the queue strict-high priority and reserves 5% of the interface’s bandwidth:

Assigning strict-high priority along with a minimum bandwidth ensures that, during congestion, the Congestion Notification Packets (CNPs) required to trigger source-based rate reduction in DCQCN can be transmitted across the fabric.
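A sketch of scheduler s2-cnp and the scheduler map sm1 that binds both schedulers to their forwarding classes, assuming standard Junos scheduler syntax:

```
class-of-service {
    schedulers {
        s2-cnp {
            /* reserve 5% of interface bandwidth for CNPs */
            transmit-rate percent 5;
            /* CNPs are serviced before lower-priority queues */
            priority strict-high;
        }
    }
    scheduler-maps {
        sm1 {
            forwarding-class NO-LOSS scheduler s1;
            forwarding-class CNP scheduler s2-cnp;
        }
    }
    interfaces {
        et-* {
            scheduler-map sm1;
        }
    }
}
```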

Note:

Strict-high priority queues are always serviced before any other queues—except for other high-priority queues—which could potentially starve lower-priority traffic. However, the risk of starvation in this case is minimal, because CNP traffic is generally very low volume. As a result, there is no need to rate-limit this queue.

Congestion Management and Congestion Control Verification

The show class-of-service interface <interface> command shows the scheduler map, whether congestion-notification is enabled (and the profile name), as well as the classifier applied to the interface.

The show class-of-service classifier <classifier-name> command shows the mapping between DSCP values and forwarding classes and can be used to confirm correct assignments (CNP => 48, NO-LOSS => 26).

The show class-of-service forwarding-class command output shows the forwarding-class to queue mapping. It can be used to confirm the correct mapping (CNP => queue 3, NO-LOSS => queue 4), as well as the no-loss status and PFC priority of the NO-LOSS queue.

The show class-of-service scheduler-map sm1 command output shows the scheduler map sm1 and the schedulers s1, and s2-cnp, including their priority, assigned rate, and whether ECN is enabled.

The show interfaces queue <interface> command, combined with different options and output filters, can help determine whether there have been any packet drops, ECN markings, or PFC pause frames.

The output shows the number of CNP packets (DSCP 48) that have been queued. Increments in this value indicate congestion has been detected along the path and the receiver is sending CNP packets in response to packets with CE = 1.

The output shows the number of NO-LOSS packets (DSCP = 26) marked with CE=1. If this number is increasing that is an indication that congestion has been detected.

The output shows the number of packets marked with CE=1 that have been seen on interface et-0/0/0:0.

The output shows the number of PFC pause frames that have been sent/received per priority on interface et-0/0/0:0.

The output shows bandwidth allocation, transmit rate, and queue priority for the forwarding classes CNP, and NO-LOSS on interface et-0/0/0:0.

The output shows peak queue occupancy for each queue on interface et-0/0/0:0.

The output shows systems buffer allocations.
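For convenience, the verification commands referenced in this section can be collected as follows (the interface name is an example):

```
show class-of-service interface et-0/0/0:0
show class-of-service classifier fabric-dscp
show class-of-service forwarding-class
show class-of-service scheduler-map sm1
show interfaces queue et-0/0/0:0
```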

Note:

Juniper ITM (Ingress Traffic Manager) is a component that manages packet buffering and queues.

Load Balancing Failure Scenario – Fallback to DLB

RDMA-aware Load Balancing, as described in this document, is the primary load-balancing solution. When a link or switch fails, traffic falls back to backup ECMP multipaths and leverages DLB for ECMP decisions. As an example, consider a scenario with four spines and four configured colors:

Under normal conditions, traffic for each color is forwarded across its preferred path.

Table 1: Per Color Path Summary
COLOR    PREFERRED PATH    BACKUP PATHS
Green    SPINE 1           SPINE 2, SPINE 3, SPINE 4
Blue     SPINE 2           SPINE 1, SPINE 3, SPINE 4
Red      SPINE 3           SPINE 1, SPINE 2, SPINE 4
Orange   SPINE 4           SPINE 1, SPINE 2, SPINE 3
Figure 4: Traffic Forwarding Across Preferred Paths

If the link between Stripe 1 Leaf 1 and SPINE 4 fails, orange traffic is rerouted across the backup paths. The load is distributed based on the DLB.

Figure 5: Traffic Forwarding Across Backup Paths after a Failure

Dynamic Load Balancing (DLB)

Dynamic Load Balancing (DLB) ensures that all paths are utilized more fairly, by not only looking at the packet headers, but also considering real-time link quality based on port load (link utilization) and port queue depth when selecting a path. This method provides better results when multiple long-lived flows moving large amounts of data need to be load balanced.

DLB can be configured in two different modes:

  • Per-packet mode: packets from the same flow are sprayed across the link members of an IP ECMP group, which can cause packets to arrive out of order.
  • Flowlet mode: packets from the same flowlet are sent across the same link member of an IP ECMP group. A flowlet is a burst of packets from the same flow, separated from other bursts by a period of inactivity. If a flow pauses for longer than the configured inactivity interval, the link members' quality is reevaluated and the flow can be reassigned to a different link.

In this JVD, both the leaf and spine nodes are configured to load-balance traffic using DLB in flowlet mode, applied to both IPv4 and IPv6 traffic.

For more information, refer to Load Balancing in the Data Center, which provides a comprehensive deep dive into the various load-balancing mechanisms and their evolution to suit the needs of the data center.

The following example shows the configuration applied on all devices:
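A sketch of the relevant stanza, built from the statement names discussed below (the exact hierarchy is assumed and may vary by platform and Junos release):

```
forwarding-options {
    hash-key {
        family inet {
            /* hash on IP addresses and TCP/UDP ports */
            layer-3;
            layer-4;
        }
    }
    enhanced-hash-key {
        ecmp-dlb {
            flowlet {
                /* inter-packet gap (microseconds) that bounds a flowlet */
                inactivity-interval 128;
            }
            /* max tracked flowset (macroflow) entries; multiple of 8 */
            flowset-table-size 2048;
            ether-type {
                ipv4;
                ipv6;
            }
            /* sample 1 in every million packets to update quality scores */
            sampling-rate 1000000;
        }
    }
}
```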

This configuration defines how flows are identified and the conditions for reassigning them to alternate ECMP paths based on real-time congestion and flow characteristics.

The hash-key family inet layer-3 and hash-key family inet layer-4 statements configure the ECMP hashing function to include both IP addresses and TCP/UDP ports, ensuring granular distribution of IPv4 flows across ECMP paths.

The parameters under enhanced-hash-key modify the DLB hashing algorithm for ECMP traffic forwarding, enabling flowlet-based detection and intelligent reassignment. These include:

  • ecmp-dlb flowlet inactivity-interval

Specifies the minimum inter-packet gap (in microseconds) used to detect the boundary between flowlets. A new flowlet is recognized when this threshold is exceeded.

The recommended value is 128 µsec.

  • ecmp-dlb flowset-table-size

Defines the maximum number of flowset (macroflow) entries that can be stored in the DLB hash table. This controls how many active flows the device can track for dynamic reassignment. This value must be a multiple of 8.

The recommended value is 2048.

  • sampling-rate

Defines the sampling rate used to detect congestion by configuring the QFX forwarding ASIC to sample the port load on egress ECMP members and update their quality scores.

The recommended value is 1,000,000, which means 1 in every million packets is sampled, balancing overhead and responsiveness.

  • ether-type ipv4 and ether-type ipv6

Enable enhanced ECMP DLB for both IPv4 and IPv6 packets.

Load Balancing Verification

To verify the DLB parameters currently in use, you can use the operational command show forwarding-options enhanced-hash-key. The output shows the values applied by the system for ECMP Dynamic Load Balancing (DLB), including flowlet behavior.

The Egress Port Load Weight shown in the output defines the weights given to port load and port queue length when calculating the port quality score. The EgressBytes Min and EgressBytes Max Thresholds define quality bands. DLB assigns any egress port with a port load falling below this minimum to the highest quality band (7). Any port load larger than the maximum threshold falls into the lowest quality band (0). DLB divides the remaining port load quantities among quality bands 1 through 6.

We recommend maintaining the default values: Egress Port Load Weight (50), EgressBytes Min Threshold (10), and EgressBytes Max Threshold (50). No configuration is needed to use these values.

Figure 6: DLB Quality Bands