Understanding PFC Using DSCP at Layer 3 for Untagged Traffic
Protocols such as Remote Direct Memory Access (RDMA) over Converged Ethernet version 2 (RoCEv2) require lossless behavior for traffic across Layer 3 connections to Layer 2 Ethernet subnetworks. Traditionally, priority-based flow control (PFC) prevents traffic loss when congestion occurs on Layer 2 or Layer 3 interfaces for VLAN-tagged traffic. It does so by selectively pausing traffic on any of eight priorities that correspond to the IEEE 802.1p code points in the VLAN headers of incoming traffic on an interface. However, untagged traffic (traffic without VLAN tagging) carries no IEEE 802.1p code points on which to pause traffic.
To support lossless traffic flow at Layer 3 for untagged traffic, we support enabling PFC for Layer 3 interfaces and Layer 2 access interfaces using Differentiated Services code point (DSCP) values in the Layer 3 IP header of incoming traffic, rather than IEEE 802.1p code point values in a Layer 2 VLAN header.
Overview of DSCP-based PFC
PFC is a data center bridging technology operating at Layer 2, and DSCP information is exchanged in IP headers at Layer 3. However, you can configure DSCP-based PFC, which preserves lossless behavior across Layer 3 network connections for untagged traffic.
PFC operates by generating pause frames for traffic identified on configured code points in incoming traffic to notify the peer to pause transmission when the link is congested. With DSCP-based PFC enabled, pause frames are triggered based on a configured 6-bit DSCP value (corresponding to decimal values 0-63) in the Layer 3 IP header of incoming traffic.
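As a rough illustration of where that 6-bit value lives, the sketch below reads the DSCP field from a raw IPv4 header. It relies only on the standard IPv4 layout (DSCP in the upper six bits of the second header byte, ECN in the lower two). The function name `dscp_from_ipv4_header` is hypothetical and is not part of any Junos interface.

```python
def dscp_from_ipv4_header(header: bytes) -> int:
    """Return the 6-bit DSCP value (0-63) from a raw IPv4 header."""
    if len(header) < 20:
        raise ValueError("an IPv4 header is at least 20 bytes long")
    if header[0] >> 4 != 4:
        raise ValueError("not an IPv4 header (version field is not 4)")
    # Byte 1 is the former ToS byte: DSCP in bits 7-2, ECN in bits 1-0.
    return header[1] >> 2


# A minimal 20-byte header whose second byte is 0xC0 (binary 110000 00).
sample = bytes([0x45, 0xC0]) + bytes(18)
print(dscp_from_ipv4_header(sample))  # 48, that is, binary 110000
```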
However, PFC can only send pause frames with a 3-bit PFC priority—one of 8 code points corresponding to decimal values 0-7—which, for VLAN-tagged traffic, usually corresponds to the IEEE 802.1p code points in the incoming traffic VLAN headers. Untagged traffic provides no reference for IEEE 802.1p code point values, so to trigger PFC on a DSCP value, the DSCP value must be mapped explicitly in the configuration to a PFC priority to use in the PFC pause frames sent to the peer when congestion occurs for that code point. You can map traffic on a DSCP value to a PFC priority when you define the no-loss forwarding class with which you want to classify DSCP-based PFC traffic. The forwarding class must also be mapped to an output queue with no-loss behavior.
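To make the role of the PFC priority concrete, here is a minimal sketch that builds a PFC pause frame for one priority. It follows the standard IEEE 802.1Qbb frame layout (MAC Control EtherType 0x8808, opcode 0x0101, a priority-enable vector, and eight 16-bit timers) and says nothing about how Junos builds frames internally. The function name `build_pfc_frame` and its parameters are hypothetical. Note that on the wire the priority appears as a bit position in the enable vector and as an index into the timer array.

```python
import struct

PFC_DEST_MAC = bytes.fromhex("0180c2000001")  # reserved MAC Control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101


def build_pfc_frame(src_mac: bytes, priority: int, quanta: int) -> bytes:
    """Build a PFC frame (without FCS) that pauses a single priority.

    quanta is the pause time in units of 512 bit times; 0 means resume.
    """
    if len(src_mac) != 6:
        raise ValueError("source MAC must be 6 bytes")
    if not 0 <= priority <= 7:
        raise ValueError("PFC priority must be in the range 0-7")
    if not 0 <= quanta <= 0xFFFF:
        raise ValueError("pause quanta must fit in 16 bits")

    timers = [0] * 8
    timers[priority] = quanta
    enable_vector = 1 << priority  # one enable bit per priority
    payload = struct.pack("!HH8H", PFC_OPCODE, enable_vector, *timers)
    header = PFC_DEST_MAC + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    # Pad to the 60-byte Ethernet minimum (64 bytes once the FCS is added).
    return (header + payload).ljust(60, b"\x00")


frame = build_pfc_frame(bytes.fromhex("020000000001"), priority=3, quanta=0xFFFF)
print(frame.hex())
```

Calling the same function with `quanta=0` produces the frame that tells the peer to resume transmission on that priority.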
You cannot assign the same PFC priority to more than one forwarding class because the mapped PFC priority value is used as the forwarding class ID when DSCP-based PFC is configured.
A DSCP classifier (instead of an IEEE 802.1p classifier) is also required to specify that incoming traffic with the above-configured DSCP value belongs to the no-loss forwarding class. Any DSCP values for which DSCP-based PFC is enabled on an interface must be specified in either the default DSCP classifier or in a user-defined DSCP classifier associated with the interface.
To enable DSCP-based PFC on an interface, define an input congestion notification profile with the same DSCP value (and desired buffering parameters), and associate it with the interface.
The peer device should have a matching PFC configuration for the mapped PFC priority code points.
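The relationships described in this overview can be sketched as a small consistency check. This is a simplified model and not Junos code: the `ForwardingClass` type, the dictionary-based classifier, and both function names are assumptions made for illustration. It checks that each forwarding class is no-loss, that no PFC priority is reused, and that every DSCP value enabled for PFC is covered by the classifier.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set


@dataclass(frozen=True)
class ForwardingClass:
    name: str
    queue: int
    no_loss: bool
    pfc_priority: int  # value carried in pause frames, 0-7


def validate_dscp_pfc(classes: Iterable[ForwardingClass],
                      classifier: Dict[int, str],
                      enabled_dscps: Set[int]) -> Dict[str, ForwardingClass]:
    """Check the constraints from the text; return the classes keyed by name."""
    by_name: Dict[str, ForwardingClass] = {}
    used: Dict[int, str] = {}
    for fc in classes:
        if not fc.no_loss:
            raise ValueError(f"{fc.name}: DSCP-based PFC needs a no-loss class")
        if not 0 <= fc.pfc_priority <= 7:
            raise ValueError(f"{fc.name}: PFC priority must be 0-7")
        if fc.pfc_priority in used:
            raise ValueError(
                f"PFC priority {fc.pfc_priority} is already used by {used[fc.pfc_priority]}")
        used[fc.pfc_priority] = fc.name
        by_name[fc.name] = fc
    for dscp in enabled_dscps:
        if not 0 <= dscp <= 63:
            raise ValueError(f"DSCP {dscp} is outside 0-63")
        if classifier.get(dscp) not in by_name:
            raise ValueError(f"DSCP {dscp} is not classified into a no-loss class")
    return by_name


def pause_priority(dscp: int, classifier: Dict[int, str],
                   by_name: Dict[str, ForwardingClass],
                   enabled_dscps: Set[int]) -> Optional[int]:
    """Return the PFC priority to pause for this DSCP, or None if PFC is not enabled."""
    if dscp not in enabled_dscps:
        return None
    return by_name[classifier[dscp]].pfc_priority


classes = [ForwardingClass("fc1", queue=1, no_loss=True, pfc_priority=3)]
classifier = {48: "fc1"}  # DSCP 110000 -> fc1
enabled = {48}            # DSCP values named in the congestion notification profile
by_name = validate_dscp_pfc(classes, classifier, enabled)
print(pause_priority(48, classifier, by_name, enabled))  # 3
print(pause_priority(10, classifier, by_name, enabled))  # None
```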
Limitations of DSCP-based PFC
The following are limitations of DSCP-based PFC:
- You cannot configure both DSCP-based PFC and IEEE 802.1p PFC under the same congestion notification profile, or associate both a DSCP-based congestion notification profile and an IEEE 802.1p congestion notification profile with the same interface.
- DSCP-based PFC is supported on Layer 3 interfaces and Layer 2 access interfaces for untagged traffic only. PFC behavior is unpredictable if VLAN-tagged packets are received on an interface with DSCP-based PFC enabled.
- Each no-loss forwarding class can only be associated with a unique 3-bit PFC priority value from 0 through 7.
Configurable PFC Accounting Thresholds
On supported platforms, there are virtual PFC pause buffers called PFC accounts that
you define within a congestion notification profile (CNP). Each ingress port can
have two such PFC accounts. You can independently set the PFC priority to transmit
pause frames and the XOFF and XON thresholds
for each PFC account.
Consider Figure 1, which shows a typical pause buffer. In this diagram, the buffer starts to fill
from the bottom up due to congestion on the egress port. When the buffer fill
reaches XOFF, a PFC Pause frame is sent upstream to pause traffic
associated with the PFC class. The headroom space allows for in-flight packets and
processing delays so that the upstream device can pause traffic before the buffer
fills completely and begins dropping packets. The system uses the cable length and
the maximum receive unit (MRU) to calculate the amount of buffer headroom reserved
to support PFC. The shorter the cable length and the lower the MRU, the less
headroom buffer space is required for PFC.
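The dependence of headroom on cable length and MRU can be illustrated with a deliberately simplified estimate. This is not the calculation Junos performs, which also uses internally calculated factors. The model below assumes a propagation delay of about 5 ns per meter and counts only the bytes in flight during one round trip plus one maximum-size frame at each end. The function name and all constants are assumptions.

```python
import math

PROPAGATION_NS_PER_M = 5.0  # assumed signal delay per meter of cable


def estimate_headroom_bytes(cable_length_m: float, mru_bytes: int,
                            link_gbps: float) -> int:
    """Rough headroom estimate: round-trip in-flight bytes plus two max-size frames."""
    round_trip_ns = 2 * cable_length_m * PROPAGATION_NS_PER_M
    # 1 Gbps is 1 bit per nanosecond, so Gbps * ns gives bits.
    in_flight_bytes = round_trip_ns * link_gbps / 8
    # One frame may already be leaving the peer, and the pause frame itself
    # may have to wait behind one frame that is being transmitted locally.
    return math.ceil(in_flight_bytes + 2 * mru_bytes)


for length_m, mru in [(100, 9216), (10, 9216), (10, 1500)]:
    print(length_m, mru, estimate_headroom_bytes(length_m, mru, link_gbps=100))
```

Even in this crude model, reducing either the cable length or the MRU reduces the estimate, which matches the behavior described above.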
When congestion reduces and the buffer fill falls under the XON
threshold level, a resume frame is sent upstream to restart the data traffic.
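The XOFF and XON behavior amounts to a hysteresis loop, which the following sketch simulates over a series of buffer fill samples. It is a toy model under stated assumptions: fill levels and thresholds share one arbitrary unit, XON is assumed to be lower than XOFF, and the function name `simulate_pause_signals` is hypothetical.

```python
from typing import Iterable, List, Tuple


def simulate_pause_signals(fill_levels: Iterable[int], xoff: int,
                           xon: int) -> List[Tuple[int, str]]:
    """Return (sample index, signal) pairs for each pause or resume that is sent."""
    if xon >= xoff:
        raise ValueError("this model assumes XON is below XOFF")
    paused = False
    events: List[Tuple[int, str]] = []
    for step, fill in enumerate(fill_levels):
        if not paused and fill >= xoff:
            # Buffer fill reached XOFF: ask the peer to stop sending.
            paused = True
            events.append((step, "pause"))
        elif paused and fill < xon:
            # Buffer fill dropped under XON: ask the peer to resume.
            paused = False
            events.append((step, "resume"))
    return events


fills = [10, 40, 80, 95, 70, 45, 25, 60, 85]
print(simulate_pause_signals(fills, xoff=80, xon=30))
# [(2, 'pause'), (6, 'resume'), (8, 'pause')]
```

The gap between the two thresholds keeps the port from alternating between pause and resume on every small change in buffer fill.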
For PFC to work effectively, you must correctly set XOFF,
XON, and the headroom buffer for each PFC account. Junos
calculates the headroom space based on the defined cable length and other internally
calculated factors.
You define a PFC account for input traffic in a CNP:
- Define one or two PFC accounts. Set a PFC priority for each account, and if necessary, set XOFF and XON for each account.
- Set the code-points that you are using for PFC and assign a PFC account to each code-point.
- Set the correct cable-length for the CNP. The cable length is the distance between the interface and its peer interfaces in meters.
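The pieces of a CNP listed above can be modeled as a small data structure with a validation step. This sketch is not Junos configuration syntax. The class names, field names, and the check that XON is below XOFF are assumptions, and the thresholds are left unitless because the text does not state their units.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

MAX_PFC_ACCOUNTS = 2  # from the text: each ingress port can have two PFC accounts


@dataclass
class PfcAccount:
    pfc_priority: int            # priority used in transmitted pause frames
    xoff: Optional[int] = None   # optional threshold; None means not set explicitly
    xon: Optional[int] = None


@dataclass
class CongestionNotificationProfile:
    cable_length_m: int                                          # distance to the peer, in meters
    accounts: Dict[str, PfcAccount] = field(default_factory=dict)
    code_points: Dict[int, str] = field(default_factory=dict)   # code point -> account name

    def validate(self) -> None:
        if not 1 <= len(self.accounts) <= MAX_PFC_ACCOUNTS:
            raise ValueError("define one or two PFC accounts")
        if self.cable_length_m <= 0:
            raise ValueError("cable length must be a positive number of meters")
        for name, account in self.accounts.items():
            if not 0 <= account.pfc_priority <= 7:
                raise ValueError(f"{name}: PFC priority must be 0-7")
            if (account.xoff is not None and account.xon is not None
                    and account.xon >= account.xoff):
                raise ValueError(f"{name}: XON should be below XOFF")
        for code_point, name in self.code_points.items():
            if name not in self.accounts:
                raise ValueError(f"code point {code_point} refers to unknown account {name}")


cnp = CongestionNotificationProfile(
    cable_length_m=10,
    accounts={"acct-a": PfcAccount(pfc_priority=3, xoff=80, xon=30),
              "acct-b": PfcAccount(pfc_priority=4)},
    code_points={48: "acct-a", 26: "acct-b"},
)
cnp.validate()  # raises ValueError if any rule above is violated
```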
Platform-Specific PFC Behavior
Use Feature Explorer to confirm platform and release support for specific features.
Use the following table to review platform-specific behaviors for your platform.
| Platform | Difference |
|---|---|
| PTX10000 Series | |