DSCP-based PFC for Layer 3 Untagged Traffic
You can configure DSCP-based PFC to support lossless behavior for untagged traffic across Layer 3 connections to Layer 2 subnetworks for protocols such as Remote Direct Memory Access (RDMA) over Converged Ethernet version 2 (RoCEv2).
Overview
With DSCP-based PFC, pause frames are generated to notify the peer that the link is congested based on a configured 6-bit Differentiated Services code point (DSCP) value in the Layer 3 IP header of incoming traffic, rather than a 3-bit IEEE 802.1p code point in the Layer 2 VLAN header.
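To make the 6-bit value concrete, the following Python sketch shows where the DSCP sits in the 8-bit IPv4 Type of Service (or IPv6 Traffic Class) byte: it occupies the upper six bits, and the lower two bits carry ECN. This is an illustration only, not Junos code, and the function names `dscp_from_tos` and `dscp_bits` are hypothetical.

```python
def dscp_from_tos(tos_byte: int) -> int:
    """Return the 6-bit DSCP from an 8-bit ToS / Traffic Class value."""
    if not 0 <= tos_byte <= 0xFF:
        raise ValueError("ToS / Traffic Class must fit in one byte")
    # The two least-significant bits are the ECN field; shift them away.
    return tos_byte >> 2


def dscp_bits(dscp: int) -> str:
    """Format a DSCP value as the 6-bit string used in code-point options."""
    if not 0 <= dscp <= 0x3F:
        raise ValueError("DSCP must be a 6-bit value (0-63)")
    return format(dscp, "06b")


tos = 0xC0  # binary 1100 0000: DSCP 110000, ECN 00
print(dscp_from_tos(tos), dscp_bits(dscp_from_tos(tos)))  # 48 110000
```

The bit string in the second field is the form in which DSCP code points are written elsewhere in this topic.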
Because PFC can only send pause frames corresponding to PFC priority code points, the configured 6-bit DSCP value must be mapped to a 3-bit PFC priority that the pause frames carry when DSCP-based PFC is triggered. Configuring the mapping involves three pieces: mapping the PFC priority value to a no-loss forwarding class when you map the forwarding class to a queue; defining a congestion notification profile to enable PFC on traffic with the desired DSCP value; and configuring a DSCP classifier to associate the PFC priority-mapped forwarding class (along with the loss priority) with the configured DSCP value on which to trigger PFC pause frames.
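The lookup chain described above can be sketched in Python as three small tables. This models only the relationship between the configuration pieces, not how the device implements it; the table names and the functions `pfc_priority_for` and `class_enable_vector` are hypothetical, and the sample values (DSCP 110000, class fc1, priority 3) are chosen for illustration.

```python
# Hypothetical tables mirroring the three configuration pieces.
FORWARDING_CLASSES = {"fc1": {"queue": 1, "no_loss": True, "pfc_priority": 3}}
CNP_INPUT_DSCP = {"110000"}                # DSCP values with PFC enabled
CLASSIFIER = {"110000": ("fc1", "low")}    # DSCP -> (forwarding class, loss priority)


def pfc_priority_for(dscp_bits: str):
    """Return the PFC priority to pause, or None if this DSCP does not trigger PFC."""
    if dscp_bits not in CNP_INPUT_DSCP:
        return None
    entry = CLASSIFIER.get(dscp_bits)
    if entry is None:
        return None
    fc = FORWARDING_CLASSES.get(entry[0])
    if fc is None or not fc["no_loss"]:
        return None
    return fc["pfc_priority"]


def class_enable_vector(priority: int) -> int:
    """A PFC frame carries a bitmap with one bit per priority (bit n = priority n)."""
    if not 0 <= priority <= 7:
        raise ValueError("PFC priority must be 0-7")
    return 1 << priority


prio = pfc_priority_for("110000")
print(prio, format(class_enable_vector(prio), "08b"))  # 3 00001000
```

A DSCP value that is missing from any of the three tables yields no pause priority, which is why all three pieces must agree on the same code point and forwarding class.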
The peer device should have output PFC and a corresponding flow control queue configured to match the PFC priority configuration on the device.
Use Feature Explorer to confirm platform and release support for specific features.
DSCP-based PFC for Layer 3 Untagged Traffic in AI-ML Data Centers
AI and ML applications are rapidly expanding in data centers. When dealing with AI and ML workloads and large data sets, one critical challenge is handling the size of the data. Offloading the computation to graphics processing units (GPUs) can significantly speed up this task. However, the data size and the model, especially with large language models (LLMs), often exceed the memory capacity of a single GPU. As a result, you commonly require multiple GPUs to achieve reasonable job completion times, especially for training.
The performance of an AI data center depends on the number of GPUs that are used and the efficiency of the network that connects them. Slowdowns in the network can lead to underutilization of GPUs and longer job completion times. Ethernet-based networks are becoming more popular as an alternative to InfiniBand for AI data center networking. One solution is the Remote Direct Memory Access (RDMA) over Converged Ethernet version 2 (RoCEv2) network.
RoCEv2 involves encapsulating RDMA protocol packets within UDP packets for transport over Ethernet networks. The RoCEv2 protocol utilizes priority-based flow control (PFC) to establish a drop-free network, while data center quantized congestion notification (DCQCN) provides end-to-end congestion control for RoCEv2. Junos OS Evolved supports DCQCN by combining explicit congestion notification (ECN) and PFC to enable end-to-end lossless AI Ethernet networking.
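ECN and DSCP share one byte of the IP header, which is why a device can mark congestion with ECN without disturbing the DSCP value that DSCP-based PFC keys on. The Python sketch below illustrates that bit layout using the ECN code points from RFC 3168. It is not Junos code, and the function name `mark_congestion` is hypothetical.

```python
# ECN code points (lower two bits of the ToS / Traffic Class byte), per RFC 3168.
ECN_NOT_ECT = 0b00  # sender is not ECN-capable
ECN_ECT1 = 0b01     # ECN-capable transport, code point ECT(1)
ECN_ECT0 = 0b10     # ECN-capable transport, code point ECT(0)
ECN_CE = 0b11       # congestion experienced


def mark_congestion(tos_byte: int) -> int:
    """Set CE on an ECN-capable packet, leaving the upper six DSCP bits unchanged."""
    if not 0 <= tos_byte <= 0xFF:
        raise ValueError("ToS / Traffic Class must fit in one byte")
    if tos_byte & 0b11 == ECN_NOT_ECT:
        return tos_byte  # not ECN-capable, so it cannot be marked
    return (tos_byte & 0xFC) | ECN_CE


marked = mark_congestion(0xC2)  # DSCP 110000 with ECT(0)
print(format(marked, "08b"), marked >> 2)  # 11000011 48
```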
To support lossless traffic across Layer 3 (L3) connections to Layer 2 (L2) subnetworks, you can configure PFC to operate using 6-bit Differentiated Services code point (DSCP) values from L3 headers of untagged VLAN traffic. You can use PFC with DSCP as an alternative to IEEE 802.1p priority values in L2 VLAN-tagged packet headers. You need DSCP-based PFC to support RoCEv2.
Benefits
- Utilize Ethernet-based networks for AI-ML data center networking.
- Improve network efficiency for large data sets.
- Enable end-to-end lossless AI-ML Ethernet networking.
Configuration
To configure DSCP-based PFC:
Map a lossless forwarding class to a PFC priority, a 3-bit value represented in decimal form (0-7), to use in the PFC pause frames.
You must also assign an output queue to the forwarding class with the queue-num option. The no-loss option is required in this case to support lossless behavior for DSCP-based PFC, and the pfc-priority statement specifies the priority value mapping, as follows:
[edit class-of-service]
user@device# set forwarding-classes class class-name queue-num queue-number no-loss
user@device# set forwarding-classes class class-name pfc-priority pfc-priority
Define an input congestion notification profile to enable PFC on traffic specified by the desired 6-bit DSCP value. Optionally configure the maximum receive unit (MRU) and cable length (used to determine PFC buffer headroom space reserved for the link):
[edit class-of-service]
user@device# set congestion-notification-profile name input dscp code-point code-point-bits pfc mru mru-value
user@device# set congestion-notification-profile name cable-length cable-length-value
Note: You cannot configure both DSCP-based PFC and IEEE 802.1p PFC under the same congestion notification profile.
Set up a DSCP classifier for the configured DSCP value and no-loss forwarding class mapped in the previous steps:
[edit class-of-service]
user@device# set classifiers dscp classifier-name forwarding-class class-name loss-priority level code-points code-point-bits
Assign the classifier and congestion notification profile set up in the previous steps to an interface on which you are enabling DSCP-based PFC:
[edit class-of-service]
user@device# set interfaces interface-name classifiers dscp classifier-name
user@device# set interfaces interface-name congestion-notification-profile profile-name
Review your configuration.
For example, with the following sample commands configuring DSCP-based PFC for interface xe-0/0/1, PFC pause frames will be generated with PFC priority 3 when incoming traffic with DSCP value 110000 becomes congested:
set interfaces xe-0/0/1 unit 0 family inet address 10.1.1.2/24
set class-of-service forwarding-classes class fc1 queue-num 1 no-loss
set class-of-service forwarding-classes class fc1 pfc-priority 3
set class-of-service congestion-notification-profile dpfc-cnp input dscp code-point 110000 pfc
set class-of-service classifiers dscp dpfc forwarding-class fc1 loss-priority low code-points 110000
set class-of-service interfaces xe-0/0/1 congestion-notification-profile dpfc-cnp
set class-of-service interfaces xe-0/0/1 classifiers dscp dpfc
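If you generate such configurations from a script, a small helper can keep the DSCP value, forwarding class, and profile names consistent across the statements. The Python sketch below renders only the class-of-service statements from the sample above (the interface address line is left out). The function name `dscp_pfc_commands` and its parameters are hypothetical, and the sketch does not validate anything against a device.

```python
def dscp_pfc_commands(interface, fc, queue, pfc_priority, dscp_bits,
                      cnp, classifier, loss_priority="low"):
    """Render the class-of-service set commands for DSCP-based PFC on one interface."""
    if len(dscp_bits) != 6 or set(dscp_bits) - {"0", "1"}:
        raise ValueError("DSCP must be written as six binary digits")
    if not 0 <= pfc_priority <= 7:
        raise ValueError("PFC priority must be 0-7")
    cos = "set class-of-service"
    return [
        f"{cos} forwarding-classes class {fc} queue-num {queue} no-loss",
        f"{cos} forwarding-classes class {fc} pfc-priority {pfc_priority}",
        f"{cos} congestion-notification-profile {cnp} input dscp "
        f"code-point {dscp_bits} pfc",
        f"{cos} classifiers dscp {classifier} forwarding-class {fc} "
        f"loss-priority {loss_priority} code-points {dscp_bits}",
        f"{cos} interfaces {interface} congestion-notification-profile {cnp}",
        f"{cos} interfaces {interface} classifiers dscp {classifier}",
    ]


for line in dscp_pfc_commands("xe-0/0/1", "fc1", 1, 3, "110000",
                              "dpfc-cnp", "dpfc"):
    print(line)
```

Passing the DSCP value once and reusing it for both the congestion notification profile and the classifier avoids the mismatch that would otherwise prevent PFC from triggering on the intended traffic.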
Configuration for PTX10000 Series Routers
Verify the configuration.
Check the ingress port.
show interfaces interface-name extensive | match Priority
show interfaces queue interface-name
Display the DSCP-based input congestion notification profile.
show class-of-service congestion-notification-profile cnp name
Display which forwarding classes are mapped to each PFC priority.
show class-of-service forwarding-classes