PFC Watchdog
PFC Watchdog Overview
Priority-based flow control (PFC) allows independent flow control for each class of service to ensure that congestion does not result in frame loss. PFC pause frames instruct the link partner to halt packet transmission. These frames can propagate through the network, causing traffic on the PFC streams to stop in what is known as a PFC pause storm. Use the PFC watchdog to detect and to resolve PFC pause storms.
The PFC watchdog monitors PFC-enabled ports for PFC pause storms. The PFC watchdog intervenes when a PFC-enabled port receives PFC pause frames for an extended period of time and is unable to schedule any of the data packets on PFC-enabled queues. The PFC watchdog mitigates the situation by disabling the queue where the PFC pause storm was detected for a set length of time. This length of time, called the recovery time, is configurable. After the recovery time passes, the PFC watchdog reenables the affected queue.
You can monitor the number of PFC pause storms that have been detected and recovered, as well as the number of packets that have been dropped, on a particular interface.
In lossless Ethernet fabrics such as those used in AI-ML data centers, the PFC watchdog plays a critical role in keeping the fabric lossless even during congestion.
Use Feature Explorer to confirm platform and release support for specific features.
Benefits
- Quickly detect and resolve PFC pause storms.
- Maintain lossless traffic links.
- Improve link quality.
Understanding PFC Watchdog
The PFC watchdog has three key functions: detection, mitigation, and restoration.
Detection
The PFC watchdog checks the status of PFC queues at regular intervals called polling intervals. If the PFC watchdog finds a PFC queue with a non-zero pause timer, it compares the queue's current transmit counter register to the last recorded value. If the PFC queue has not transmitted any packets since the last polling interval, the PFC watchdog checks if there are any packets in the queue. If there are packets on the queue that are not being transmitted and there are no flow control frames on that port, the PFC watchdog detects a stall condition.
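The per-poll check described above can be sketched in Python. This is a simplified illustration, not the device's implementation: the `QueueSample` fields and the `stalled_this_poll` function are hypothetical names, and the sketch models only the pause-timer, transmit-counter, and queue-depth checks. It leaves out the port-level flow control frame check.

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    """Hypothetical snapshot of one PFC queue at a polling interval."""
    pause_timer: int   # non-zero while the queue is paused by the link partner
    tx_counter: int    # cumulative count of packets transmitted from the queue
    queue_depth: int   # packets currently waiting in the queue


def stalled_this_poll(prev_tx_counter: int, sample: QueueSample) -> bool:
    """Return True if this poll shows the stall symptoms described in the text."""
    if sample.pause_timer == 0:
        return False  # queue is not paused, nothing to check
    if sample.tx_counter != prev_tx_counter:
        return False  # queue transmitted since the last poll
    return sample.queue_depth > 0  # paused, no progress, packets still waiting


# Paused queue, no transmit progress, 12 packets waiting: stall symptom present.
print(stalled_this_poll(100, QueueSample(pause_timer=5, tx_counter=100, queue_depth=12)))
```

A single positive poll is only a symptom. The watchdog acts only after the condition persists for the configured number of polling intervals.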
The PFC watchdog periodically monitors the PFC-enabled queues for continuous PFC pause assertion by the downstream device while the queue is not empty. If this occurs, the PFC watchdog detects a stall condition. The system must detect this stall condition within a specified amount of time, which is determined by how you configure two statements: poll-interval and detection.
Configure the polling interval in milliseconds using the poll-interval statement. The PFC watchdog checks the status of the PFC queues once per polling interval. The default interval is 100 ms. The minimum interval is 100 ms and the maximum is 1000 ms.
The PFC watchdog must detect stall conditions for at least two consecutive polling intervals before it determines that a PFC queue has stalled. Configure the detection statement to control how many polling intervals the PFC watchdog waits before it mitigates the stalled traffic. The default is two polling intervals. The maximum number is 10 polling intervals.
The total detection time is the length of the polling interval multiplied by the number of polling intervals.
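As a worked example of this arithmetic, the following Python sketch computes the total detection time from the two settings. The function name `detection_time_ms` is hypothetical. The ranges and defaults are taken from the text above.

```python
# Limits as stated in the text above.
POLL_INTERVAL_RANGE_MS = (100, 1000)  # poll-interval, in milliseconds
DETECTION_RANGE = (2, 10)             # detection, in polling intervals


def detection_time_ms(poll_interval_ms: int = 100, detection: int = 2) -> int:
    """Total detection time = polling interval x number of polling intervals."""
    low, high = POLL_INTERVAL_RANGE_MS
    if not low <= poll_interval_ms <= high:
        raise ValueError(f"poll interval must be {low}-{high} ms")
    low, high = DETECTION_RANGE
    if not low <= detection <= high:
        raise ValueError(f"detection must be {low}-{high} polling intervals")
    return poll_interval_ms * detection


print(detection_time_ms())        # defaults: 100 ms x 2 = 200
print(detection_time_ms(250, 4))  # 250 ms x 4 = 1000
```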
Mitigation
When the PFC watchdog detects that a PFC queue has stalled, it moves the queue to the mitigation state and disables the queue where it detected the PFC pause storm for a period of time called the recovery time.
Configure the pfc-watchdog-action statement to specify the action that the PFC watchdog takes to mitigate the traffic congestion. The only option is the drop action. It drops all queued packets and all newly arriving packets for the stalled PFC queue. The system monitors all packet drops on the PFC queue during the recovery time.
Restoration
When the recovery time ends, the PFC watchdog collects the ingress drop counters and any other drop counters associated with disabling the PFC queue. The PFC watchdog maintains a count of the packets lost during the last recovery and the total number of lost packets due to PFC mitigation since the device was started. The PFC watchdog then restores the queue and re-enables PFC.
Use the recovery statement to configure how long the PFC watchdog disables the affected queue. The minimum recovery period is 200 ms and the maximum is 10,000 ms. After the recovery time passes, the PFC watchdog re-enables PFC on the affected queues.
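The drop accounting described in this section can be sketched as follows. This is an illustrative model only. The class `QueueWatchdogStats` and its method names are hypothetical, and the field names simply mirror the per-queue counters that the watchdog reports (detected, recovered, last drop count, total drop count).

```python
from dataclasses import dataclass


@dataclass
class QueueWatchdogStats:
    """Hypothetical per-queue counters kept across mitigation cycles."""
    detected: int = 0
    recovered: int = 0
    last_packet_drop_count: int = 0
    total_packet_drop_count: int = 0

    def on_storm_detected(self) -> None:
        # Queue enters the mitigation state.
        self.detected += 1

    def on_recovery_complete(self, drops_during_recovery: int) -> None:
        # Recovery time has ended: record the drops for this cycle,
        # add them to the running total, and count the recovery.
        self.recovered += 1
        self.last_packet_drop_count = drops_during_recovery
        self.total_packet_drop_count += drops_during_recovery


stats = QueueWatchdogStats()
for drops in (40, 15):  # two storms with made-up drop counts
    stats.on_storm_detected()
    stats.on_recovery_complete(drops)

print(stats.last_packet_drop_count, stats.total_packet_drop_count)  # 15 55
```

The last-drop counter is overwritten on each recovery, while the total only grows, which is why the two values diverge after the second storm.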
Configure PFC Watchdog
You can enable the PFC watchdog on all PFC-enabled queues. The PFC watchdog recovery is a global setting, so it requires the same action on all ports to function. When you configure the PFC watchdog on multiple ports, make sure all ports are configured with the same type of action (drop or forward). By default, all ports use the drop action.
Enabling PFC watchdog on the congestion notification profile without configuring other options enables the PFC watchdog with the default values. By default, the polling interval is 100 ms, the detection period is set to 2 (that is, two polling intervals, or 200 ms), and the recovery time is 200 ms.
PFC watchdog only works for PFC queues. To designate a queue as a PFC queue, use the flow-control-queue statement with the queue number. For example:
set class-of-service congestion-notification-profile cnp output ieee-802.1 code-point 011 flow-control-queue 3
set class-of-service congestion-notification-profile cnp output ieee-802.1 code-point 100 flow-control-queue 4
Use the PFC Watchdog for Monitoring
The device records PFC watchdog detection and recovery events in the system log, each with a timestamp. You can identify these events from the following messages:
- CDA PfcWd: PFC Watchdog detection enabled on ifd: et-0/0/16 Poll Interval:100ms Detection Period:200ms Recovery Interval:200ms
  PFC watchdog was enabled on a new port.
- CDA PfcWd: PFC Storm Detected! on ifd:et-0/0/16 Queue: 3 Priority: 3 BLOCKED for AutoRecovery Recovery Time: 200ms
  PFC watchdog detected a stall condition.
- CDA PfcWd: PFC Storm Recovered on Port ifd:et-0/0/16 Queue: 3 Priority: 3 UNBLOCKED after AutoRecovery Recovery Time: 200ms
  PFC watchdog restored the PFC queue, and the queue recovered from the PFC pause storm.
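If you collect these messages off-box, a small parser can turn them into structured events. The Python sketch below is one possible approach, assuming the messages appear exactly in the formats shown above. The function name `parse_pfcwd` and the dictionary keys are hypothetical. The patterns use `search`, so any timestamp or host prefix on the line is ignored.

```python
import re
from typing import Optional

# Patterns derived from the three sample messages above.
ENABLED = re.compile(
    r"PFC Watchdog detection enabled on ifd:\s*(?P<ifd>\S+)\s+"
    r"Poll Interval:\s*(?P<poll>\d+)ms\s+"
    r"Detection Period:\s*(?P<detect>\d+)ms\s+"
    r"Recovery Interval:\s*(?P<recovery>\d+)ms"
)
STORM = re.compile(
    r"PFC Storm (?P<event>Detected!|Recovered) on (?:Port )?ifd:\s*(?P<ifd>\S+)\s+"
    r"Queue:\s*(?P<queue>\d+)\s+Priority:\s*(?P<priority>\d+)\s+"
    r"(?:BLOCKED for|UNBLOCKED after) AutoRecovery Recovery Time:\s*(?P<recovery>\d+)ms"
)


def parse_pfcwd(line: str) -> Optional[dict]:
    """Return a dict describing a PFC watchdog log line, or None if no match."""
    m = ENABLED.search(line)
    if m:
        return {
            "event": "enabled",
            "ifd": m["ifd"],
            "poll_ms": int(m["poll"]),
            "detection_ms": int(m["detect"]),
            "recovery_ms": int(m["recovery"]),
        }
    m = STORM.search(line)
    if m:
        return {
            "event": "detected" if m["event"] == "Detected!" else "recovered",
            "ifd": m["ifd"],
            "queue": int(m["queue"]),
            "priority": int(m["priority"]),
            "recovery_ms": int(m["recovery"]),
        }
    return None


sample = ("CDA PfcWd: PFC Storm Detected! on ifd:et-0/0/16 Queue: 3 "
          "Priority: 3 BLOCKED for AutoRecovery Recovery Time: 200ms")
print(parse_pfcwd(sample))
```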
You can also monitor the PFC watchdog statistics on a particular interface. Use the following command to view the number of PFC pause storms that have been detected and recovered, as well as the number of packets that have been dropped, on the PFC queues on an interface:
user@device> show interfaces interface extensive
...
Priority Flow Control Watchdog Statistics:
            Detected  Recovered  LastPacketDropCount  TotalPacketDropCount
Queue : 0          0          0                    0                     0
Queue : 1          0          0                    0                     0
Queue : 2          0          0                    0                     0
Queue : 3          0          0                    0                     0
Queue : 4          0          0                    0                     0
Queue : 5          0          0                    0                     0
Queue : 6          0          0                    0                     0
Queue : 7          0          0                    0                     0
...
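To track these counters over time, you might extract the per-queue rows from captured command output. The following Python sketch assumes each row has the form `Queue : N` followed by four integers, as in the output above. The function name `parse_watchdog_stats` and the dictionary keys are hypothetical, and the non-zero sample values are made up for illustration.

```python
import re

# One row: "Queue : <n>" followed by the four counters, in column order.
ROW = re.compile(r"Queue\s*:\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_watchdog_stats(output: str) -> dict:
    """Map queue number -> watchdog counters parsed from command output."""
    stats = {}
    for match in ROW.finditer(output):
        queue, detected, recovered, last_drops, total_drops = map(int, match.groups())
        stats[queue] = {
            "detected": detected,
            "recovered": recovered,
            "last_packet_drop_count": last_drops,
            "total_packet_drop_count": total_drops,
        }
    return stats


# Made-up sample: queue 3 has seen two storms, queue 4 none.
captured = """Priority Flow Control Watchdog Statistics:
Queue : 3 2 2 15 55
Queue : 4 0 0 0 0
"""
result = parse_watchdog_stats(captured)
print(result[3]["total_packet_drop_count"])  # 55
```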