PFC Watchdog

PFC Watchdog Overview

Priority-based flow control (PFC) allows independent flow control for each class of service to ensure that congestion does not result in frame loss. PFC pause frames instruct the link partner to halt packet transmission. These frames can propagate through the network, causing traffic on the PFC streams to stop in what is known as a PFC pause storm. Use the PFC watchdog to detect and to resolve PFC pause storms.

The PFC watchdog monitors PFC-enabled ports for PFC pause storms. The PFC watchdog intervenes when a PFC-enabled port receives PFC pause frames for an extended period of time and is unable to schedule any of the data packets on PFC-enabled queues. The PFC watchdog mitigates the situation by disabling the queue where the PFC pause storm was detected for a set length of time. This length of time, called the recovery time, is configurable. After the recovery time passes, the PFC watchdog reenables the affected queue.

You can monitor the number of PFC pause storms that have been detected and recovered, as well as the number of packets that have been dropped, on a particular interface.

In lossless Ethernet fabrics such as those used in AI-ML data centers, the PFC watchdog plays a critical role in keeping the fabric lossless even during congestion.

Use Feature Explorer to confirm platform and release support for specific features.

Benefits

  • Quickly detect and resolve PFC pause storms.

  • Maintain lossless traffic links.

  • Improve link quality.

Understanding PFC Watchdog

The PFC watchdog has three key functions: detection, mitigation, and restoration.

Detection

The PFC watchdog checks the status of PFC queues at regular intervals called polling intervals. If the PFC watchdog finds a PFC queue with a non-zero pause timer, it compares the queue's current transmit counter register to the last recorded value. If the PFC queue has not transmitted any packets since the last polling interval, the PFC watchdog checks if there are any packets in the queue. If there are packets on the queue that are not being transmitted and there are no flow control frames on that port, the PFC watchdog detects a stall condition.
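
The per-poll check described above can be sketched in Python as follows. This is only an illustration, not the device's implementation: the `QueueSnapshot` type and its field names are hypothetical, and the sketch covers just the pause-timer, transmit-counter, and queue-occupancy checks (the port-level flow control frame condition is left out).

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """Hypothetical per-queue readings taken at one polling interval."""
    pause_timer: int   # non-zero means the queue is currently paused by PFC
    tx_counter: int    # cumulative count of packets transmitted from the queue
    queue_depth: int   # packets currently waiting in the queue


def stall_observed(previous: QueueSnapshot, current: QueueSnapshot) -> bool:
    """Return True if this poll looks like a stall, following the checks above."""
    if current.pause_timer == 0:
        return False                     # queue is not paused
    if current.tx_counter != previous.tx_counter:
        return False                     # packets were sent since the last poll
    return current.queue_depth > 0       # packets are waiting but not moving
```

A paused queue whose transmit counter has not moved and that still holds packets is reported as stalled for that poll; any other combination is not.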

The PFC watchdog periodically monitors the PFC-enabled queues for continuous PFC pause assertion by the downstream device when the queue is empty. If this occurs, the PFC watchdog detects a stall condition. The system must detect this stall condition within a specified amount of time, which is determined by how you configure two statements: poll-interval and detection.

Configure the polling interval in milliseconds using the poll-interval statement. The PFC watchdog checks the status of the PFC queues once per polling interval. The default interval is 100 ms, which is also the minimum; the maximum is 1000 ms.

The PFC watchdog must detect stall conditions for at least two consecutive polling intervals before it determines that a PFC queue has stalled. Configure the detection statement to control how many polling intervals the PFC watchdog waits before it mitigates the stalled traffic. The default is two polling intervals. The maximum number is 10 polling intervals.

The total detection time is the length of the polling interval multiplied by the number of polling intervals.
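
As a rough illustration of the two statements working together, the Python sketch below computes the total detection time and counts consecutive stalled polls. The function and class names are hypothetical, and the counter-reset behavior on a healthy poll is an assumption drawn from the word "consecutive" above.

```python
def total_detection_time_ms(poll_interval_ms: int, detection: int) -> int:
    """Polling interval multiplied by the number of polling intervals."""
    return poll_interval_ms * detection


class StallDetector:
    """Declares a queue stalled after `detection` consecutive stalled polls."""

    def __init__(self, detection: int = 2):
        self.detection = detection
        self.consecutive = 0

    def poll(self, stall_seen: bool) -> bool:
        # Assumption: a single healthy poll resets the consecutive count.
        self.consecutive = self.consecutive + 1 if stall_seen else 0
        return self.consecutive >= self.detection
```

With the default values (100 ms polling interval, detection of 2), `total_detection_time_ms(100, 2)` gives 200 ms; at the documented maximums (1000 ms and 10) it gives 10,000 ms.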

Mitigation

When the PFC watchdog detects that a PFC queue has stalled, it moves the queue to the mitigation state. It disables the queue where it detected the PFC pause storm for a period of time called the recovery time.

Configure the pfc-watchdog-action statement to specify the action that the PFC watchdog takes to mitigate the traffic congestion. The only option is the drop action. It drops all queued packets and all newly arriving packets for the stalled PFC queue. The system monitors all packet drops on the PFC queue during the recovery time.

Restoration

When the recovery time ends, the PFC watchdog collects the ingress drop counters and any other drop counters associated with disabling the PFC queue. The PFC watchdog maintains a count of the packets lost during the last recovery and the total number of lost packets due to PFC mitigation since the device was started. The PFC watchdog then restores the queue and re-enables PFC.

Use the recovery statement to configure how long the PFC watchdog disables the affected queue. The minimum recovery period is 200 ms and the maximum is 10,000 ms. After the recovery time passes, the PFC watchdog re-enables PFC on the affected queues.
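
The two drop counts described above (packets lost during the last recovery, and total packets lost to PFC mitigation since startup) can be pictured with a small Python sketch. The class and attribute names are hypothetical and do not correspond to any device counter names.

```python
class MitigationCounters:
    """Tracks drops per recovery period and cumulatively since startup."""

    def __init__(self) -> None:
        self.last_recovery_drops = 0   # drops during the most recent recovery
        self.total_drops = 0           # drops across all recoveries so far

    def record_recovery(self, drops_during_recovery: int) -> None:
        # Called when a recovery period ends and the drop counters are collected.
        self.last_recovery_drops = drops_during_recovery
        self.total_drops += drops_during_recovery
```

After two recovery periods that dropped 120 and 30 packets, the last-recovery count would be 30 and the total would be 150.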

Configure PFC Watchdog

You can enable the PFC watchdog on all PFC-enabled queues. The PFC watchdog recovery is a global setting, so it requires the same action on all ports to function. When you configure the PFC watchdog on multiple ports, make sure all ports are configured with the same type of action (drop or forward). By default, all ports use the drop action.

Enabling PFC watchdog on the congestion notification profile without configuring other options enables the PFC watchdog with the default values. By default, the polling interval is 100 ms, the detection period is set to 2 (that is, two polling intervals, or 200 ms), and the recovery time is 200 ms.
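
To make the defaults and documented ranges easier to see in one place, here is a minimal Python sketch. The `WatchdogSettings` class is hypothetical and is not a Junos API; the ranges are taken from the text above, with the detection minimum of 2 inferred from the "at least two consecutive polling intervals" requirement.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WatchdogSettings:
    poll_interval_ms: int = 100   # default 100 ms; range 100 to 1000 ms
    detection: int = 2            # default 2 polling intervals; maximum 10
    recovery_ms: int = 200        # default 200 ms; range 200 to 10,000 ms

    def validate(self) -> List[str]:
        """Return a list of range problems; an empty list means all values fit."""
        problems = []
        if not 100 <= self.poll_interval_ms <= 1000:
            problems.append("poll-interval must be 100 to 1000 ms")
        if not 2 <= self.detection <= 10:
            problems.append("detection must be 2 to 10 polling intervals")
        if not 200 <= self.recovery_ms <= 10000:
            problems.append("recovery must be 200 to 10000 ms")
        return problems
```

Calling `WatchdogSettings().validate()` returns an empty list, since the defaults sit inside every range.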

PFC watchdog only works for PFC queues. To designate a queue as a PFC queue, use the flow-control-queue statement with the queue number.

To configure the PFC watchdog:

  1. Enable PFC watchdog. Use the pfc-watchdog statement at the [edit class-of-service congestion-notification-profile profile-name] hierarchy level.
  2. Configure the polling interval in milliseconds. The polling interval is how often the PFC watchdog checks the status of PFC queues.
  3. Configure the detection interval number. The detection interval number is how many polling intervals the PFC watchdog waits before it mitigates the stalled traffic.
  4. Specify the action that the PFC watchdog takes to mitigate the traffic congestion.
  5. Configure the recovery time in milliseconds. The recovery time is how long the PFC watchdog keeps the affected queue disabled before it restores PFC.
  6. Verify your configuration with the show class-of-service congestion-notification-profile profile-name command.
    The detection time shown is the polling interval multiplied by the detection interval number. For example, a detection time of 200 ms with a polling interval of 100 ms means that the configured number of detection intervals is two.

Use the PFC Watchdog for Monitoring

You can track PFC watchdog events in the system log. The device logs PFC watchdog detection and recovery events in the system log with a timestamp. You can identify these logs from the following messages:

  • CDA PfcWd: PFC Watchdog detection enabled on ifd: et-0/0/16 Poll Interval:100ms Detection Period:200ms Recovery Interval:200ms—PFC watchdog was enabled on a new port.
  • CDA PfcWd: PFC Storm Detected! on ifd:et-0/0/16 Queue: 3 Priority: 3 BLOCKED for AutoRecovery Recovery Time: 200ms—PFC watchdog detected a stall condition.
  • CDA PfcWd: PFC Storm Recovered on Port ifd:et-0/0/16 Queue: 3 Priority: 3 UNBLOCKED after AutoRecovery Recovery Time: 200ms—PFC watchdog restores the PFC queue and the queue recovers from the PFC pause storm.
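
If you collect these messages off-box, a short script can pull out the interface, queue, and recovery time from the storm events. The Python sketch below is one possible approach; the regular expression is written against the sample messages shown above only, so the exact message format on a given release is an assumption, and `parse_storm_event` is a hypothetical helper name.

```python
import re
from typing import Optional

# Matches the "PFC Storm Detected!" and "PFC Storm Recovered" samples above.
EVENT_RE = re.compile(
    r"PFC Storm (?P<event>Detected|Recovered)\b.*?ifd:\s*(?P<ifd>\S+)"
    r"\s+Queue:\s*(?P<queue>\d+)\s+Priority:\s*(?P<priority>\d+)"
    r".*?Recovery Time:\s*(?P<recovery_ms>\d+)ms"
)


def parse_storm_event(line: str) -> Optional[dict]:
    """Return the event fields as a dict, or None if the line is not a storm event."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("queue", "priority", "recovery_ms"):
        fields[key] = int(fields[key])   # convert numeric fields from text
    return fields
```

Lines that do not contain a storm event, such as the "detection enabled" message, return `None`.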

You can also monitor the PFC watchdog statistics on a particular interface. Use the following command to view the number of PFC pause storms that have been detected and recovered, as well as the number of packets that have been dropped, on the PFC queues on an interface: