GPU Backend Fabric with EVPN/VXLAN Type 5 IP Services
In this section, we describe the strategies employed to address traffic congestion and optimize load distribution within the Backend GPU fabric.
Congestion Management and Congestion Control Configuration
Congestion management and congestion control are implemented through a VXLAN-aware Data Center Quantized Congestion Notification (DCQCN) approach, ensuring traffic fairness and maintaining stability across the lossless fabric.
AI clusters impose unique demands on network infrastructure due to their high-density and low-entropy traffic patterns, characterized by frequent elephant flows and minimal flow variability. Moreover, most AI training workloads require uninterrupted, lossless packet delivery to be completed successfully. Consequently, when designing a network infrastructure for AI traffic flows, the key objectives include maximizing throughput, minimizing latency, and minimizing network interference while ensuring lossless operation. These requirements necessitate the deployment of effective congestion control mechanisms.
Data Center Quantized Congestion Notification (DCQCN) has become the industry-standard method for end-to-end congestion control in RoCEv2 environments. DCQCN provides mechanisms to adjust traffic rates in response to congestion events without relying on packet drops, striking a balance between reducing traffic rates and maintaining ongoing traffic flow.
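To make this concrete, the following is a simplified conceptual sketch, in Python, of the reaction-point (sender-side) logic that DCQCN defines; the class name, parameter values, and the merged recovery step are illustrative assumptions for this document, not NIC or switch defaults.

# Simplified sketch of DCQCN reaction-point (sender) behavior.
# Parameter names and values are illustrative, not vendor defaults.
class DcqcnSender:
    def __init__(self, line_rate_gbps=400.0, g=1 / 256, rate_ai_gbps=5.0):
        self.current_rate = line_rate_gbps   # RC: current sending rate
        self.target_rate = line_rate_gbps    # RT: rate to recover toward
        self.alpha = 1.0                     # running estimate of congestion extent
        self.g = g                           # alpha smoothing gain
        self.rate_ai = rate_ai_gbps          # additive-increase step

    def on_cnp(self):
        # The receiver saw CE-marked packets and returned a CNP: cut the rate.
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2

    def on_quiet_period(self):
        # No CNP for a timer period: decay alpha and recover the rate
        # (fast recovery toward the last good rate, then additive increase).
        self.alpha = (1 - self.g) * self.alpha
        if self.current_rate >= self.target_rate:
            self.target_rate += self.rate_ai
        self.current_rate = (self.current_rate + self.target_rate) / 2

sender = DcqcnSender()
sender.on_cnp()            # congestion reported: rate is cut multiplicatively
sender.on_quiet_period()   # congestion clears: rate recovers gradually

The key property is that the sending rate is reduced multiplicatively while CNPs keep arriving and recovered gradually once they stop, so traffic keeps flowing rather than being paused or dropped.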
It is important to note that DCQCN is primarily required in the GPU backend fabric, where the majority of AI workload traffic resides. It is generally unnecessary in the Frontend and Storage Backend fabrics.
DCQCN combines two complementary mechanisms to implement flow and congestion control:
- Priority-based Flow Control (PFC)
- Explicit Congestion Notification (ECN)
Priority-Based Flow Control (PFC) mitigates data loss by pausing traffic transmission for specific traffic classes, based on IEEE 802.1p priorities or DSCP markings mapped to queues.
When congestion is detected, PFC operates by sending PAUSE control frames upstream, requesting the sender to halt transmission of traffic associated with a specific priority. The sender completely stops sending traffic for that priority until the congestion subsides or the PAUSE timer expires.
While PFC prevents packet drops and allows the receiver to catch up, it also impacts application performance for traffic using the affected queues. Furthermore, resuming transmission after a pause can lead to sudden traffic surges, potentially re-triggering congestion. For these reasons, PFC should be configured carefully so that it is used only as a last resort.
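The pause/resume behavior can be modeled conceptually as in the sketch below; the XOFF/XON thresholds and class names are assumptions chosen for illustration and do not correspond to the buffer values programmed on the switch.

# Minimal illustrative model of per-priority PFC on an ingress lossless queue.
# Threshold values are arbitrary example numbers, not platform defaults.
XOFF_THRESHOLD = 80_000   # bytes queued before a PAUSE is sent upstream
XON_THRESHOLD = 40_000    # bytes queued below which the sender may resume

class PfcIngressQueue:
    def __init__(self, priority=3):
        self.priority = priority
        self.depth_bytes = 0
        self.paused = False

    def enqueue(self, nbytes):
        self.depth_bytes += nbytes
        if not self.paused and self.depth_bytes >= XOFF_THRESHOLD:
            self.paused = True
            print(f"send PAUSE frame upstream for priority {self.priority}")

    def dequeue(self, nbytes):
        self.depth_bytes = max(0, self.depth_bytes - nbytes)
        if self.paused and self.depth_bytes <= XON_THRESHOLD:
            self.paused = False
            print(f"send resume (zero-quanta PAUSE) for priority {self.priority}")

queue = PfcIngressQueue()
queue.enqueue(90_000)   # congestion builds: the upstream sender is paused
queue.dequeue(60_000)   # the queue drains: the upstream sender may resume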
Explicit Congestion Notification (ECN) offers a proactive congestion signaling mechanism, reducing transmission rates while allowing traffic to continue flowing during congestion periods.
When congestion occurs, ECN bits in the IP header are marked (11), prompting the receiver to generate Congestion Notification Packets (CNPs), which inform the source to throttle its transmission rate. Unlike PFC, ECN aims to gradually reduce congestion without halting traffic completely or triggering packet drops.
When deploying ECN in a VXLAN overlay, it is essential to ensure that ECN markings from the outer VXLAN/IP headers are copied into the inner payload headers. This enables congestion signals detected in the transport layer (the VXLAN network) to correctly propagate to the inner RoCEv2 flows, ensuring that the devices generating the RoCEv2 traffic can be notified of congestion so they can respond accordingly. The QFX5240 switches will perform this function automatically without the need for any additional configuration.
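The sketch below illustrates what this outer-to-inner propagation means at decapsulation; the field and function names are simplified assumptions used only to show the idea, not the switch implementation.

# Conceptual sketch: preserving an ECN Congestion Experienced (CE) mark set on
# the outer VXLAN/IP header when the packet is decapsulated, so the inner
# RoCEv2 packet still carries the congestion signal. Field names are simplified.
from dataclasses import dataclass

ECT0, ECT1, CE = 0b10, 0b01, 0b11   # ECN codepoints in the IP header

@dataclass
class IpHeader:
    dscp: int
    ecn: int

@dataclass
class VxlanPacket:
    outer: IpHeader   # added by the ingress VTEP at encapsulation
    inner: IpHeader   # original RoCEv2/UDP packet header

def decapsulate(pkt: VxlanPacket) -> IpHeader:
    # Strip the outer header, propagating a CE mark applied in transit.
    if pkt.outer.ecn == CE and pkt.inner.ecn in (ECT0, ECT1):
        pkt.inner.ecn = CE
    return pkt.inner

# A transit switch marked the outer header CE; the inner packet was ECT(1).
pkt = VxlanPacket(outer=IpHeader(dscp=26, ecn=CE), inner=IpHeader(dscp=26, ecn=ECT1))
print(decapsulate(pkt).ecn == CE)   # True: the receiving NIC will generate CNPs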
Best Practice: Combining PFC and ECN provides the most effective congestion control strategy in a lossless IP fabric supporting RoCEv2. Their parameters must be carefully tuned so that ECN mechanisms are triggered before PFC.
For more detailed guidance, refer to Introduction to Congestion Control in Juniper AI Networks, which outlines best practices for building lossless fabrics for AI workloads using DCQCN (ECN and PFC) congestion control methods alongside Dynamic Load Balancing (DLB). The document is based on validation against DLRM training models and demonstrates how ECN thresholds, PFC parameters, input drops, and tail drops can be monitored and adjusted to optimize fabric performance for RoCEv2 traffic.
While we provide general recommendations and lab-validated parameters, each AI workload may present distinct traffic patterns. Class of Service (CoS) and load balancing attributes might need to be further tuned to match the specific characteristics of a particular model and cluster environment.
The leaf and spine nodes in this JVD are configured with the CoS parameters that were determined to provide the best performance.
The following configuration is applied uniformly across all devices in the fabric.
Traffic Classification
set class-of-service classifiers dscp fabric-dscp forwarding-class CNP loss-priority low code-points 110000
set class-of-service classifiers dscp fabric-dscp forwarding-class NO-LOSS loss-priority low code-points 011010
set class-of-service interfaces et-* unit * classifiers dscp fabric-dscp
Traffic classification is based on DSCP and implemented using the fabric-dscp classifier, which defines two forwarding classes: NO-LOSS and CNP. This classifier is applied to all et-* unit * logical interfaces.
All incoming traffic with DSCP 011010 (decimal 26) is classified as NO-LOSS, while traffic marked with DSCP 110000 (decimal 48) is classified as CNP. All GPU servers are configured to mark RoCEv2 traffic with DSCP 26 and Congestion Notification Packets (CNPs) with DSCP 48.
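As a quick sanity check, the binary code points configured in the classifier convert to exactly the decimal DSCP values the GPU servers are set to mark; the short snippet below is purely illustrative.

# The classifier's binary code points expressed as decimal DSCP values.
fabric_dscp_classifier = {
    "011010": "NO-LOSS",   # RoCEv2 data traffic
    "110000": "CNP",       # Congestion Notification Packets
}

for bits, forwarding_class in fabric_dscp_classifier.items():
    print(f"DSCP {bits} = {int(bits, 2):2d} -> forwarding class {forwarding_class}")
# DSCP 011010 = 26 -> forwarding class NO-LOSS
# DSCP 110000 = 48 -> forwarding class CNP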
Refer to the Configuring NVIDIA DCQCN – ECN section of the AI Data Center Network with Juniper Apstra, NVIDIA GPUs, and WEKA Storage JVD and the Configuring AMD DCQCN – ECN section of the AI Data Center Network with Juniper Apstra, AMD GPUs, and Vast Storage JVD for details on how to configure DCQCN parameters on the NVIDIA and AMD GPU servers.
set class-of-service forwarding-classes class CNP queue-num 3
set class-of-service forwarding-classes class NO-LOSS queue-num 4
set class-of-service forwarding-classes class NO-LOSS no-loss
set class-of-service forwarding-classes class NO-LOSS pfc-priority 3
CNP traffic is assigned to output queue 3, while NO-LOSS traffic is assigned to output queue 4.
Queue 4 is configured as lossless using the no-loss attribute and is mapped to PFC priority 3. Defining a queue as lossless ensures that packets mapped to this class are not dropped due to congestion, an essential requirement for RoCEv2. Configuring a forwarding class as lossless also impacts buffer allocation on the switch, reserving additional space to support flow control mechanisms such as PFC.
There are two types of buffers:
- Shared Buffer Pool: A global memory space dynamically shared by all ports. It is partitioned between lossy and lossless traffic types. Larger shared buffers help absorb traffic bursts.
- Dedicated Buffer Pool: A reserved portion of memory allocated per port, which is then divided among the queues on that port. Though it can be tuned, a minimum amount is always reserved by the system. A larger dedicated buffer pool means congestion on one port is less likely to affect traffic on another port, because traffic does not need to consume as much shared buffer space. However, the larger the dedicated buffer pool, the less bursty traffic the switch can absorb, because less dynamic shared buffer memory remains available.
The recommended values for the Shared and Dedicated Buffers in this JVD are as follows:
set class-of-service shared-buffer ingress buffer-partition lossless percent 66
set class-of-service shared-buffer ingress buffer-partition lossless dynamic-threshold 10
set class-of-service shared-buffer ingress buffer-partition lossless-headroom percent 24
set class-of-service shared-buffer ingress buffer-partition lossy percent 10
set class-of-service shared-buffer egress buffer-partition lossless percent 66
set class-of-service shared-buffer egress buffer-partition lossy percent 10
Shared buffers:
- Ingress lossless percent 66: Reserves 66% of the ingress shared buffer space for lossless traffic (e.g., RoCEv2).
- Ingress lossless-headroom percent 24: Carves out an additional 24% of ingress buffer space specifically as headroom for burst absorption. This ensures that RoCEv2 flows have sufficient space to accommodate microbursts while waiting for PFC pause frames to take effect.
- Ingress lossy percent 10: Reserves 10% of ingress shared buffer space for lossy traffic.
- Ingress lossless dynamic-threshold 10: Allows the lossless buffer pool to dynamically expand into unused lossy buffer space by up to 10%, providing flexibility under heavy load.
- Egress lossless percent 66: Reserves 66% of egress shared buffer space for lossless traffic.
- Egress lossy percent 10: Allocates 10% for lossy traffic.
Dedicated Buffers (per-port or per-queue):
- Ingress percent 15: Allocates 15% of the total ingress buffer capacity as dedicated buffers. These are not shared and are reserved for specific traffic classes or ports.
- Egress percent 30: Reserves 30% of egress buffer space for dedicated use.
When this buffer space begins to fill, the PFC mechanism sends Ethernet PAUSE frames to the traffic source instructing it to temporarily halt transmission and prevent packet loss.
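As a rough illustration of how the configured percentages translate into pool sizes, the sketch below applies them to the ingress shared-buffer size reported by show class-of-service shared-buffer in the verification section below; the switch's internal rounding may differ by a kilobyte or so.

# Back-of-the-envelope check of the ingress shared-buffer partitioning.
# The shared-buffer size is taken from the switch output shown later in this
# section; the device may round the individual pools slightly differently.
ingress_shared_buffer_kb = 143472

partitions = {
    "lossless": 66,            # RoCEv2 (NO-LOSS) traffic
    "lossless-headroom": 24,   # burst absorption while PFC takes effect
    "lossy": 10,               # everything else
}

for name, percent in partitions.items():
    print(f"{name:18s} ~{ingress_shared_buffer_kb * percent // 100} KB")
# lossless           ~94691 KB
# lossless-headroom  ~34433 KB
# lossy              ~14347 KB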
Since traffic classification is DSCP-based and the interfaces between the GPU servers and leaf nodes are untagged, the PFC implementation is DSCP-based PFC. The congestion-notification-profile pfc, which is applied to all et-* interfaces, defines the PFC operational details.
set class-of-service interfaces et-* congestion-notification-profile pfc
set class-of-service congestion-notification-profile pfc pfc-watchdog
set class-of-service congestion-notification-profile pfc input dscp code-point 011010 pfc
set class-of-service congestion-notification-profile pfc output ieee-802.1 code-point 011 flow-control-queue 4
The name congestion-notification-profile might suggest a relationship to the Congestion Notification Packets used by ECN, and it is sometimes abbreviated as CNP in documentation. However, this profile defines the behavior of PFC, not ECN.
The PFC watchdog function monitors for deadlock or stuck queues caused by persistent PFC pause conditions. If a queue remains paused for too long (indicating possible head-of-line blocking), the watchdog can take corrective actions to avoid traffic stall conditions.
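Conceptually, the watchdog behaves like the sketch below; the detection interval and the recovery action shown are illustrative assumptions rather than the platform defaults.

# Minimal sketch of the PFC watchdog idea: if a queue stays paused longer than
# a deadlock-detection interval, take corrective action so a stuck peer cannot
# stall the fabric indefinitely. Interval and action names are illustrative.
DEADLOCK_INTERVAL_MS = 200

def watchdog_action(paused_since_ms, now_ms):
    if paused_since_ms is None:
        return "none"                      # queue is not paused
    if now_ms - paused_since_ms < DEADLOCK_INTERVAL_MS:
        return "none"                      # normal, short-lived pause
    return "ignore-pause-and-drain"        # break the deadlock for this queue

print(watchdog_action(paused_since_ms=0, now_ms=50))    # none
print(watchdog_action(paused_since_ms=0, now_ms=500))   # ignore-pause-and-drain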
The input dscp code-point 011010 pfc statement specifies that incoming traffic marked with DSCP value 011010 (decimal 26) should trigger PFC when congestion is detected. Essentially, if DSCP 26 (RoCEv2) traffic is experiencing congestion, PFC pause frames for priority 3 are generated toward the upstream senders. Priority 3 is used because the forwarding class NO-LOSS, described previously, maps DSCP 26 traffic to pfc-priority 3.
Figure 51: PFC Pause Frames generation example
In the example shown in Figure 51, the combination of the following commands, applied to interfaces et-0/0/0:0 and et-0/0/1:0, configures the device to classify all inbound traffic with DSCP 26 into the forwarding class NO-LOSS (assigned to queue 4 and mapped to pfc-priority 3), makes queue 4 a no-loss queue, and enables PFC for traffic with DSCP 26:
set class-of-service classifiers dscp fabric-dscp forwarding-class NO-LOSS loss-priority low code-points 011010
set class-of-service forwarding-classes class NO-LOSS queue-num 4
set class-of-service forwarding-classes class NO-LOSS no-loss
set class-of-service forwarding-classes class NO-LOSS pfc-priority 3
set class-of-service congestion-notification-profile pfc input dscp code-point 011010 pfc
The output ieee-802.1 code-point 011 flow-control-queue 4 statement specifies that when pause frames with priority 3 are received, traffic on queue 4 must stop.
Figure 52: PFC received Pause Frames behavior
Traffic Scheduling
set class-of-service interfaces et-* scheduler-map sm1
set class-of-service scheduler-maps sm1 forwarding-class CNP scheduler s2-cnp
set class-of-service scheduler-maps sm1 forwarding-class NO-LOSS scheduler s1
The scheduler map sm1 is applied to all et-* interfaces and defines how traffic for each forwarding class is scheduled.
Two schedulers are included:
- s1 for NO-LOSS traffic (queue 4)
- s2-cnp for CNP traffic (queue 3)
NO-LOSS Traffic Scheduling (Scheduler s1)
set class-of-service schedulers s1 drop-profile-map loss-priority any protocol any drop-profile dp1
set class-of-service schedulers s1 explicit-congestion-notification
set class-of-service drop-profiles dp1 interpolate fill-level 55
set class-of-service drop-profiles dp1 interpolate fill-level 90
set class-of-service drop-profiles dp1 interpolate drop-probability 0
set class-of-service drop-profiles dp1 interpolate drop-probability 100
Scheduler s1 controls how traffic in the NO-LOSS forwarding class (queue 4) is serviced. It applies the drop-profile dp1 and enables Explicit Congestion Notification (ECN) marking using the explicit-congestion-notification statement.
Drop profiles in Junos are commonly used to control how aggressively packets are dropped as the queue buffer fills up. However, when ECN is enabled, the profile is used to mark packets instead of dropping them. Marking packets means setting the Congestion Experienced (CE) bit in the IP header based on the configured thresholds.
Figure 53: ECN profile example
The profile dp1 defines a linear drop curve where:
- At 55% buffer fill, packets are not marked (0% probability).
- At 90% buffer fill, all matching packets are marked (100% probability).
- Between 55% and 90%, the marking probability increases linearly from 0% to 100%.
This approach ensures early congestion feedback to RoCEv2 endpoints while maintaining lossless delivery.
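The marking curve defined by dp1 can be written out as the small sketch below, which simply interpolates between the two configured fill levels.

# ECN marking probability implied by drop profile dp1: 0% at or below 55% queue
# fill, 100% at or above 90%, linear in between.
def ecn_mark_probability(fill_percent: float) -> float:
    low, high = 55.0, 90.0   # interpolate fill-level values from dp1
    if fill_percent <= low:
        return 0.0
    if fill_percent >= high:
        return 100.0
    return (fill_percent - low) / (high - low) * 100.0

for fill in (50, 55, 70, 90, 95):
    print(f"queue {fill:2d}% full -> mark probability {ecn_mark_probability(fill):5.1f}%")
# queue 70% full -> mark probability 42.9%, for example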
CNP Traffic Scheduling (Scheduler s2-cnp)
Scheduler s2-cnp specifies how CNP traffic in queue 3 is serviced. It assigns the queue strict-high priority and reserves 5% of the interface’s bandwidth:
set class-of-service schedulers s2-cnp transmit-rate percent 5
set class-of-service schedulers s2-cnp priority strict-high
Assigning strict-high priority along with a minimum bandwidth ensures that, during congestion, the Congestion Notification Packets (CNPs) required to trigger source-based rate reduction in DCQCN can be transmitted across the fabric.
Strict-high priority queues are always serviced before any other queues—except for other high-priority queues—which could potentially starve lower-priority traffic. However, the risk of starvation in this case is minimal, because CNP traffic is generally very low volume. As a result, there is no need to rate-limit this queue.
Congestion Management and Congestion Control Verification
The show class-of-service interface <interface> command shows the scheduler map, whether congestion notification is enabled and the profile name, as well as the classifier applied to the interface.
jnpr@stripe1-leaf2> show class-of-service interface et-0/0/0:0
Physical interface: et-0/0/0:0, Index: 1292
Maximum usable queues: 12, Queues in use: 5
  Exclude aggregate overhead bytes: disabled
  Logical interface aggregate statistics: disabled
  Scheduler map: sm1
  Congestion-notification: Enabled, Name: cnp, Index: 1
  Logical interface: et-0/0/0:0.0, Index: 1256
    Object                 Name                   Type                    Index
    Classifier             fabric-dscp            dscp                        5
The show class-of-service classifier <classifier-name> command shows the mapping between DSCP values and forwarding classes and can be used to confirm correct assignments (CNP => 48 and NO-LOSS => 26).
jnpr@stripe1-leaf2> show class-of-service classifier name fabric-dscp
Classifier: fabric-dscp, Code point type: dscp, Index: 5
  Code point         Forwarding class                    Loss priority
  011010             NO-LOSS                             low
  110000             CNP                                 low
The show class-of-service forwarding-class command output shows the forwarding-class to queue mapping. It can be used to confirm the correct mapping (CNP => queue 3 and NO-LOSS => queue 4), as well as the no-loss status and PFC priority of the NO-LOSS queue.
jnpr@stripe1-leaf2> show class-of-service forwarding-class
Forwarding class              ID    Queue  Policing priority    No-Loss   PFC priority
  CNP                          1        3             normal   disabled              0
  NO-LOSS                      2        4             normal    enabled              3
  best-effort                  0        0             normal   disabled              0
  mcast                        8        8             normal   disabled              0
  network-control              3        7             normal   disabled              0
The show class-of-service scheduler-map sm1 command output shows the scheduler map sm1 and the schedulers s1 and s2-cnp, including their priority, assigned rate, and whether ECN is enabled.
jnpr@spine1> show class-of-service scheduler-map sm1
Scheduler map: sm1, Index: 2

  Scheduler: s2-cnp, Forwarding class: CNP, Index: 7
    Transmit rate: 5 percent, Rate Limit: none, Buffer size: unspecified,
    Buffer Limit: none, Buffer dynamic threshold: unspecified, Priority: strict-high
    Excess Priority: unspecified, Excess rate: unspecified,
    Explicit Congestion Notification: disable, ECN pfc no assist: disable
    Drop profiles:
      Loss priority   Protocol    Index    Name
      Low             any             0    default-drop-profile
      Medium high     any             0    default-drop-profile
      High            any             0    default-drop-profile

  Scheduler: s1, Forwarding class: NO-LOSS, Index: 6
    Transmit rate: unspecified, Rate Limit: none, Buffer size: unspecified,
    Buffer Limit: none, Buffer dynamic threshold: unspecified, Priority: low
    Excess Priority: unspecified, Excess rate: unspecified,
    Explicit Congestion Notification: enable, ECN pfc no assist: mark
    Drop profiles:
      Loss priority   Protocol    Index    Name
      Low             any             0    dp1
      Medium high     any             0    dp1
      High            any             0    dp1
The show interfaces queue <interface> command, combined with different options and output filters, can help determine whether there have been any packet drops, ECN markings, or PFC pause frames.
jnpr@stripe1-leaf2> show interfaces queue et-0/0/0:0 forwarding-class CNP
Physical interface: et-0/0/0:0, up, Physical link is Up
  Interface index: 1292, SNMP ifIndex: 703
  Description: facing_spine1:et-0/0/1:0
Forwarding classes: 12 supported, 5 in use
Egress queues: 12 supported, 5 in use
Queue: 3, Forwarding classes: CNP
  Queued:
    Packets              :                     0                     0 pps
    Bytes                :                     0                     0 bps
  Transmitted:
    Packets              :                     0                     0 pps
    Bytes                :                     0                     0 bps
    Tail-dropped packets :                     0                     0 pps
    Tail-dropped bytes   :                     0                     0 bps
    RED-dropped packets  :                     0                     0 pps
    RED-dropped bytes    :                     0                     0 bps
    ECN-CE packets       :                     0                     0 pps
    ECN-CE bytes         :                     0                     0 bps
The output shows the number of CNP packets (DSCP 48) that have been queued. Increments in this value indicate congestion has been detected along the path and the receiver is sending CNP packets in response to packets with CE = 1.
jnpr@stripe1-leaf2> show interfaces queue et-0/0/0:0 forwarding-class NO-LOSS
Physical interface: et-0/0/0:0, up, Physical link is Up
  Interface index: 1292, SNMP ifIndex: 703
  Description: facing_spine1:et-0/0/1:0
Forwarding classes: 12 supported, 5 in use
Egress queues: 12 supported, 5 in use
Queue: 4, Forwarding classes: NO-LOSS
  Queued:
    Packets              :            1375227202                     0 pps
    Bytes                :         4236817861328                     0 bps
  Transmitted:
    Packets              :            1375227202                     0 pps
    Bytes                :         4236817861328                     0 bps
    Tail-dropped packets :                     0                     0 pps
    Tail-dropped bytes   :                     0                     0 bps
    RED-dropped packets  :                     0                     0 pps
    RED-dropped bytes    :                     0                     0 bps
    ECN-CE packets       :                     0                     0 pps
    ECN-CE bytes         :                     0                     0 bps
The output shows the number of NO-LOSS packets (DSCP 26) marked with CE=1. If this number is increasing, it indicates that congestion has been detected.
jnpr@stripe1-leaf2> show interfaces et-0/0/0:0 extensive | match ecn
  Resource errors: 0, ECN Marked packets: 0
The output shows the number of packets marked with CE=1 that have been seen on interface et-0/0/0:0.
jnpr@stripe1-leaf2> show interfaces et-0/0/0:0 extensive | find "MAC Priority Flow Control Statistics"
  MAC Priority Flow Control Statistics:
    Priority : 0                            0                    0
    Priority : 1                            0                    0
    Priority : 2                            0                    0
    Priority : 3                            0                    0
    Priority : 4                            0                    0
    Priority : 5                            0                    0
    Priority : 6                            0                    0
    Priority : 7                            0                    0
The output shows the number of PFC pause frames that have been sent/received per priority on interface et-0/0/0:0.
jnpr@stripe1-leaf2> show interfaces et-0/0/0:0 extensive | find " CoS information:"
  CoS information:
    Direction : Output
    CoS transmit queue               Bandwidth               Buffer Priority     Limit
                               %            bps     %          usec
    3 CNP                      5    20000000000     r             0 strict-high   none
    4 NO-LOSS                  r              r     r             0 low           none
The output shows bandwidth allocation, transmit rate, and queue priority for the forwarding classes CNP, and NO-LOSS on interface et-0/0/0:0.
jnpr@stripe1-leaf2> show interfaces queue buffer-occupancy et-0/0/0:0
Physical interface: et-0/0/0:0, Enabled, Physical link is Up
  Interface index: 1292, SNMP ifIndex: 703
Forwarding classes: 12 supported, 5 in use
Egress queues: 12 supported, 5 in use
Queue: 0, Forwarding classes: best-effort
  Queue-depth bytes :
    Peak            :                     0
Queue: 3, Forwarding classes: CNP
  Queue-depth bytes :
    Peak            :                     0
Queue: 4, Forwarding classes: NO-LOSS
  Queue-depth bytes :
    Peak            :                     0
Queue: 7, Forwarding classes: network-control
  Queue-depth bytes :
    Peak            :                   254
Queue: 8, Forwarding classes: mcast
  Queue-depth bytes :
    Peak            :                     0
The output shows peak queue occupancy for each queue on interface et-0/0/0:0.
jnpr@stripe1-leaf2> show class-of-service shared-buffer
Ingress:
  Total Buffer               :  169207 KB
  Dedicated Buffer           :  4627 KB
  Shared Buffer              :  143472 KB
    Lossless                 :  94691 KB
    Lossless Headroom        :  34432 KB
    Lossy                    :  14347 KB
  Lossless dynamic threshold :  10
  Lossy dynamic threshold    :  10
  Lossless Headroom Utilization:
  Node   Device      Total        Used        Free
    0              34432 KB    29235 KB     5197 KB
         ITM0 Headroom Utilization:
                   Total        Used        Free
                   17216 KB    15260 KB     1956 KB
         ITM1 Headroom Utilization:
                   Total        Used        Free
                   17216 KB    13975 KB     3241 KB
Egress:
  Total Buffer               :  169207 KB
  Dedicated Buffer           :  14162 KB
  Shared Buffer              :  143472 KB
    Lossless                 :  94691 KB
    Lossy                    :  14347 KB
  Lossy dynamic threshold    :  7
The output shows the system buffer allocations.
Load Balancing Configuration
The fabric architecture used in this JVD for both the Frontend and Backend fabrics follows a 2-stage Clos design, with every leaf node connected to all available spine nodes via multiple interfaces. As a result, multiple paths are available between the leaf and spine nodes to reach other devices.
AI traffic characteristics may impede optimal link utilization when traditional Equal Cost Multipath (ECMP) Static Load Balancing (SLB) is implemented over these paths. Because the hashing algorithm looks only at specific fields in the packet headers, the low entropy of AI flows causes multiple flows to be mapped onto the same link. Consequently, certain links are favored, and their high utilization may impede the transmission of smaller, low-bandwidth flows, leading to potential collisions, congestion, and packet drops. To improve the distribution of traffic across all available paths, either Dynamic Load Balancing (DLB) or Global Load Balancing (GLB) can be implemented instead.
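The sketch below illustrates the polarization problem with a made-up hash function and a handful of synthetic RoCEv2 flows; it is not the switch's hashing algorithm, only a demonstration of why low-entropy traffic defeats static hashing.

# Illustration of ECMP polarization with low-entropy AI traffic: a few
# long-lived flows whose 5-tuples differ only slightly can hash onto the same
# uplink. The hash function, uplink names, and flows are made up for the example.
import hashlib

uplinks = ["et-0/0/32:0", "et-0/0/33:0", "et-0/0/34:0", "et-0/0/35:0"]

def pick_uplink(src_ip, dst_ip, src_port, dst_port, proto="udp"):
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return uplinks[digest % len(uplinks)]

# Four RoCEv2 elephant flows: same UDP destination port (4791), similar addresses.
flows = [("10.0.1.1", "10.0.2.1", 49152, 4791),
         ("10.0.1.2", "10.0.2.2", 49152, 4791),
         ("10.0.1.3", "10.0.2.3", 49152, 4791),
         ("10.0.1.4", "10.0.2.4", 49152, 4791)]

for flow in flows:
    print(flow, "->", pick_uplink(*flow))
# With so few, similar flows, two or more can land on the same uplink while
# other uplinks sit idle, regardless of how good the hash function is.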
Dynamic Load Balancing (DLB)
Dynamic Load Balancing (DLB) ensures that all paths are utilized more fairly, by not only looking at the packet headers, but also considering real-time link quality based on port load (link utilization) and port queue depth when selecting a path. This method provides better results when multiple long-lived flows moving large amounts of data need to be load balanced.
DLB can be configured in two different modes:
- Per packet mode: packets from the same flow are sprayed across link members of an IP ECMP group, which can cause packets to arrive out of order.
- Flowlet Mode: packets from the same flow are sent across a link member of an IP ECMP group. A flowlet is defined as bursts of the same flow separated by periods of inactivity. If a flow pauses for longer than the configured inactivity timer, it is possible to reevaluate the link members' quality, and for the flow to be reassigned to a different link.
In this JVD, both the leaf and spine nodes are configured to load-balance traffic using Dynamic Load Balancing in flowlet mode, applied to both IPv4 and IPv6 traffic.
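The sketch below illustrates the flowlet idea using the 128-microsecond inactivity interval configured in this JVD; the tracking structure and link-selection logic are simplified assumptions, not the forwarding ASIC implementation.

# Sketch of flowlet-mode DLB: a gap between packets of the same flow larger
# than the inactivity interval starts a new flowlet, at which point the flow
# may be reassigned to the currently best-quality ECMP member without risking
# packet reordering. Flow IDs and link names are made up for the example.
INACTIVITY_INTERVAL_US = 128

class FlowletTracker:
    def __init__(self):
        self.last_seen_us = {}    # flow id -> timestamp of last packet
        self.assigned_link = {}   # flow id -> currently assigned ECMP member

    def on_packet(self, flow_id, now_us, best_link):
        last = self.last_seen_us.get(flow_id)
        if last is None or now_us - last > INACTIVITY_INTERVAL_US:
            self.assigned_link[flow_id] = best_link   # new flowlet: (re)assign
        self.last_seen_us[flow_id] = now_us
        return self.assigned_link[flow_id]

dlb = FlowletTracker()
print(dlb.on_packet("qp-17", now_us=0, best_link="et-0/0/32:0"))     # first flowlet
print(dlb.on_packet("qp-17", now_us=100, best_link="et-0/0/33:0"))   # same flowlet, keeps its link
print(dlb.on_packet("qp-17", now_us=500, best_link="et-0/0/33:0"))   # >128 us gap: new flowlet, new link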
For more information, refer to Load Balancing in the Data Center, which provides a comprehensive deep dive into the various load-balancing mechanisms and their evolution to suit the needs of the data center.
The following example shows the configuration applied on all devices:
jnpr@gpu-backend-rack1-001-leaf2> show configuration forwarding-options | display set
set forwarding-options hash-key family inet layer-3
set forwarding-options hash-key family inet layer-4
set forwarding-options enhanced-hash-key ecmp-dlb flowlet inactivity-interval 128
set forwarding-options enhanced-hash-key ecmp-dlb flowlet flowset-table-size 2048
set forwarding-options enhanced-hash-key ecmp-dlb ether-type ipv4
set forwarding-options enhanced-hash-key ecmp-dlb ether-type ipv6
set forwarding-options enhanced-hash-key ecmp-dlb sampling-rate 1000000
This configuration defines how flows are identified and the conditions for reassigning them to alternate ECMP paths based on real-time congestion and flow characteristics.
The hash-key family inet layer-3 and hash-key family inet layer-4 statements configure the ECMP hashing function to include both IP addresses and TCP/UDP ports, ensuring granular distribution of IPv4 flows across ECMP paths.
The parameters under enhanced-hash-key modify the DLB hashing algorithm for ECMP traffic forwarding, enabling flowlet-based detection and intelligent reassignment. These include:
- ecmp-dlb flowlet inactivity-interval: Specifies the minimum inter-packet gap (in microseconds) used to detect the boundary between flowlets. A new flowlet is recognized when this threshold is exceeded. The recommended value is 128 µsec.
- ecmp-dlb flowlet flowset-table-size: Defines the maximum number of flowset (macroflow) entries that can be stored in the DLB hash table. This controls how many active flows the device can track for dynamic reassignment. This value must be a multiple of 8. The recommended value is 2048.
- ecmp-dlb sampling-rate: Defines the sampling rate used to detect congestion by configuring the QFX forwarding ASIC to sample the port load on the egress ECMP members and update their quality scores. The recommended value is 1,000,000, meaning 1 in every million packets is sampled, which balances overhead and responsiveness.
- ecmp-dlb ether-type ipv4 and ecmp-dlb ether-type ipv6: Enable enhanced ECMP DLB for both IPv4 and IPv6 packets.
Load Balancing Verification
To verify the DLB parameters currently in use, you can use the operational command show forwarding-options enhanced-hash-key. The output shows the values applied by the system for ECMP Dynamic Load Balancing (DLB), including flowlet behavior.
jnpr@stripe1-leaf1> show forwarding-options enhanced-hash-key
Current RTAG7 Settings
-------------------------
Hash-Mode                :layer2-payload
Hash-Seed                :112443776

inet RTAG7 settings:
----------------------
inet packet fields
   protocol                :yes
   Destination IPv4 Addr   :yes
   Source IPv4 Addr        :yes
   destination L4 Port     :yes
   Source L4 Port          :yes
   Vlan id                 :no
   RDMA Queue Pair         :yes
inet non-packet fields
   incoming port           :yes

inet6 RTAG7 settings:
----------------------
inet6 packet fields
   next-header             :yes
   Destination IPv6 Addr   :yes
   Source IPv6 Addr        :yes
   destination L4 Port     :yes
   Source L4 Port          :yes
   Vlan id                 :no
   RDMA Queue Pair         :yes
inet6 non-packet fields
   incoming port           :yes

Hash-Parameter Settings for ECMP:
------------------------------------
   Hash Function    = CRC16_BISYNC
   Hash offset base = 16
   Hash offset      = 5
   Hash preprocess  = 0

Hash-Parameter Settings for LAG:
------------------------------------
   Hash Function    = CRC16_CCITT
   Hash offset base = 0
   Hash offset      = 5
   Hash preprocess  = 0

Ecmp Resilient Hash = Disabled

ECMP DLB Load Balancing Options:
---------------------------------------------------
   Load Balancing Method              : Flowlet
   Inactivity Interval                : 128 (us)
   Flowset Table size                 : 2048 (entries per ECMP)
   Reassignment Probability Threshold : 0
   Reassignment Quality Delta         : 0
   Egress Port Load Weight            : 50
   EgressBytes Min Threshold          : 10
   EgressBytes Max Threshold          : 50
   Sampling Rate                      : 1000000
   Ether Type                         : Ipv4 Ipv6
The Egress Port Load Weight shown in the output defines the weights given to port load and port queue length when calculating the port quality score. The EgressBytes Min and EgressBytes Max Thresholds define quality bands. DLB assigns any egress port with a port load falling below this minimum to the highest quality band (7). Any port load larger than the maximum threshold falls into the lowest quality band (0). DLB divides the remaining port load quantities among quality bands 1 through 6.
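The banding logic described above can be sketched as follows; the exact ASIC calculation may differ, so treat this purely as an illustration of how a measured port load maps onto quality bands 0 through 7.

# Illustrative mapping of egress port load to DLB quality bands using the
# default thresholds shown above (min 10, max 50): loads below the minimum get
# the best band (7), loads above the maximum get the worst band (0), and the
# range in between is split evenly across bands 6 down to 1.
MIN_THRESHOLD = 10
MAX_THRESHOLD = 50

def quality_band(port_load: float) -> int:
    if port_load < MIN_THRESHOLD:
        return 7
    if port_load > MAX_THRESHOLD:
        return 0
    span = (MAX_THRESHOLD - MIN_THRESHOLD) / 6
    return max(1, 6 - int((port_load - MIN_THRESHOLD) // span))

for load in (5, 12, 25, 40, 55):
    print(f"port load {load:2d} -> quality band {quality_band(load)}")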
We recommend maintaining the default values: Egress Port Load Weight (50), EgressBytes Min Threshold (10), and EgressBytes Max Threshold (50). No configuration is needed to use these values.
Figure 54: DLB quality bands