Understanding CoS Scheduling Across the QFabric System

Beginning with Junos OS Release 13.1R2, you can configure two-tier hierarchical scheduling on each Node device fabric interface, and beginning with Junos OS Release 14.1X53-D15, you can configure two-tier hierarchical scheduling on Interconnect device fabric interfaces on a QFabric system. Configuring CoS on the fabric interfaces provides increased control over traffic scheduling across the QFabric system, and helps to ensure predictable bandwidth consumption across the fabric path.

You can configure CoS on the following QFabric system interface types:

  • Node device access interfaces (xe interfaces)—Schedule traffic on the output queues of the 10-Gigabit Ethernet access ports using standard Node device CoS scheduling configuration components, as described elsewhere in the QFX Series documentation. You can configure different scheduling for different ports and output queues.

  • Node device fabric interfaces (fte interfaces)—Schedule traffic on the output queues of the 40-Gbps fabric interfaces that connect a Node device to a QFX3008-I or a QFX3600-I Interconnect device using standard Node device CoS scheduling configuration components. You can configure different scheduling for different interfaces and output queues.

  • Interconnect device fabric interfaces (fte interfaces)—Schedule traffic on the output queues of the 40-Gbps fabric interfaces that connect an Interconnect device to a Node device. Configuring schedulers, mapping schedulers to output queues, and applying scheduling to interfaces on Interconnect devices differ in some aspects from scheduling configuration on Node devices. You can configure different scheduling for different interfaces and fabric forwarding class sets (fabric fc-sets).

  • Interconnect device internal Clos fabric interfaces (bfte interfaces)—Schedule traffic on the internal 40-Gbps Clos fabric interfaces that connect the ingress and egress stages of the Interconnect device Clos fabric, using the same scheduling components as the Interconnect device fabric (fte) interfaces. You can configure one Clos fabric interface scheduler, which the system applies to all of the internal Clos fabric interfaces. You cannot configure different schedulers for different Clos fabric interfaces.

Configuring scheduling on Interconnect device fabric interfaces differs from configuring scheduling on Node device interfaces because the Interconnect device is a shared infrastructure that supports traffic from multiple Node devices and CoS configurations.

Note:

On Node device access and fabric interfaces, the hierarchical scheduling you configure is the Junos OS implementation of enhanced transmission selection (ETS, described in IEEE 802.1Qaz). On Interconnect device fabric interfaces, the hierarchical scheduling you configure is not an implementation of ETS, although it functions similarly to ETS in that excess port bandwidth is shared.

If the 40-Gbps fabric links that connect Node devices to Interconnect devices become oversubscribed, you can configure CoS to control how those fabric links allocate bandwidth to traffic, as described in this topic.

CoS Flow Through the QFabric System

Figure 1 shows the CoS flow across the QFabric system.

Figure 1: QFabric System CoS Flow

Packets from the access network enter the QFabric system at the ingress interfaces of a QFabric system Node device, cross the Interconnect device fabric, and then are forwarded to their destination through the egress interfaces of another QFabric system Node device.

Note:

Traffic that uses the same Node device for both traffic ingress and traffic egress does not cross the fabric. CoS for this type of traffic is the same as CoS on a standalone switch.

When a packet enters the QFabric system, it receives CoS treatment at each interface it traverses:

  1. A packet enters the QFabric system on a 10-Gigabit Ethernet access interface on a QFabric Node device. At the Node device ingress interface, the packet is classified into a forwarding class, which groups the packet with other traffic that requires similar CoS treatment and maps the packet to the appropriate output queue. To support lossless traffic delivery, enable PFC on the IEEE 802.1p code points of lossless priorities.

  2. Next, the packet exits the QFabric Node device on a 40-Gbps fabric interface that is connected to the QFabric Interconnect device. At the Node device egress interface, the packet is placed in the correct output queue and receives the configured (or default) CoS scheduling, which determines the bandwidth and priority allocated to the packet for its journey from the Node device to the Interconnect device.

  3. The packet enters the Interconnect device on the 40-Gbps fabric interface connected to the ingress Node device. At the Interconnect device ingress interface, the forwarding class of the packet maps the packet to a fabric fc-set, which groups the packet with other traffic that requires similar CoS treatment and maps the packet to the appropriate output queue. Flow control is applied automatically to traffic in lossless fabric fc-sets to preserve the lossless characteristics of that traffic. (Lossless forwarding classes are mapped to the lossless fabric fc-sets.) Other traffic uses standard packet drop for flow control.

  4. The packet progresses from the Interconnect device ingress interface to the internal, three-stage, 40-Gbps Clos fabric interfaces. At the Clos fabric interfaces, packet flow control to protect lossless traffic is applied automatically to traffic in lossless fabric fc-sets. At the Clos fabric egress interfaces, the packet is placed in the correct output queue and receives CoS scheduling.

    Note:

    If you do not use the default Clos fabric interface scheduling, you can configure one scheduler, which the system applies to all of the internal Clos fabric interfaces.

  5. The packet exits the QFabric Interconnect device on the 40-Gbps fabric interface connected to the egress Node device. At the Interconnect device egress interface, the packet is placed in the correct output queue and receives the configured (or default) CoS scheduling, which determines the bandwidth and priority allocated to the packet for its journey from the Interconnect device egress to the Node device.

  6. The packet enters the egress Node device on the 40-Gbps fabric interface connected to the Interconnect device egress interface. At the Node device fabric interface, the packet forwarding class determines the fc-set in which the packet is placed and the output queue the packet uses. Packet flow control to protect lossless traffic is applied automatically to traffic in lossless fabric fc-sets.

  7. The packet exits the QFabric system from the egress Node device on a 10-Gigabit Ethernet access interface. At the Node device egress interface, the packet is placed in the correct output queue and receives the configured (or default) CoS scheduling, which determines the bandwidth and priority allocated to the packet for its journey from the Node device to the packet destination.

You can use default CoS scheduling or configure CoS scheduling on any or all of the Node device interfaces and on Interconnect device fabric (fte) interfaces. If you configure scheduling on one of these interfaces, you can still use default scheduling on other interfaces. Because you configure one scheduler for all of the Interconnect device Clos fabric interfaces (bfte interfaces), all of the Clos fabric interfaces either use the configured scheduling or the default scheduling, but not a mix of configured and default scheduling.

Note:

To support lossless traffic delivery, you must enable PFC on the IEEE 802.1p code points of lossless priorities (forwarding classes) at the Node device network access ingress interfaces. Flow control is applied to lossless priorities automatically on the fabric (fte and bfte) interfaces.
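
For reference, the following sketch shows what enabling PFC with a congestion notification profile on a Node device ingress interface might look like. The profile and interface names (fcoe-cnp, node0:xe-0/0/25) are hypothetical, and code point 011 (IEEE 802.1p priority 3) is only an example; enable PFC on the code points of your own lossless priorities.

[edit class-of-service]
congestion-notification-profile {
    fcoe-cnp {                       /* hypothetical profile name */
        input {
            ieee-802.1 {
                code-point 011 {     /* example lossless priority (priority 3) */
                    pfc;             /* pause this priority instead of dropping it */
                }
            }
        }
    }
}
interfaces {
    node0:xe-0/0/25 {                /* hypothetical Node device access interface */
        congestion-notification-profile fcoe-cnp;
    }
}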

Hierarchical Scheduling Architecture on QFabric System Node Devices

CoS architecture on Node device access interfaces is the same as CoS architecture on standalone switch access interfaces. CoS architecture on Node device fabric interfaces is also the same as the CoS architecture on the access interfaces. You apply schedulers to queues (priorities), fc-sets (priority groups), and interfaces in the same hierarchical manner as described in Understanding CoS Hierarchical Port Scheduling (ETS).

You configure scheduling on Node device fabric interfaces (fte interfaces) using the same statements and configuration constructs that you use to configure scheduling on Node device access interfaces (xe interfaces). For example, on Node device fabric interfaces you can:

  • Define up to four fc-sets (three unicast, one multidestination)

    Note:

    If the interface handles strict-high priority traffic, you must define a separate fc-set (priority group) for strict-high priority traffic. Strict-high priority traffic cannot be mixed with traffic of other priorities in an fc-set. For example, you might choose to create different fc-sets for best effort, lossless, strict-high priority, and multidestination traffic.

  • Map forwarding classes to fc-sets

  • Configure scheduling for each forwarding class (scheduler)

  • Configure scheduling for each fc-set (traffic control profile)

The only differences in configuring CoS on Node device fabric interfaces compared to configuring CoS on Node device access interfaces are:

  • You specify a Node device fabric interface instead of a Node device access interface when you apply CoS to an interface.

  • You cannot attach classifiers or congestion notification profiles to fabric interfaces.
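
Putting these components together, a minimal ETS sketch for a Node device fabric interface might look like the following. All names except the default forwarding classes are hypothetical, the rates are illustrative only, and node0 stands in for a Node device name.

[edit class-of-service]
forwarding-class-sets {
    fab-lossless-pg {                    /* hypothetical fc-set (priority group) */
        class fcoe;
        class no-loss;
    }
}
schedulers {
    fcoe-sched {
        transmit-rate percent 50;        /* illustrative guaranteed minimum */
    }
    no-loss-sched {
        transmit-rate percent 50;
    }
}
scheduler-maps {
    fab-lossless-map {
        forwarding-class fcoe scheduler fcoe-sched;
        forwarding-class no-loss scheduler no-loss-sched;
    }
}
traffic-control-profiles {
    fab-lossless-tcp {
        scheduler-map fab-lossless-map;
        guaranteed-rate percent 70;      /* illustrative priority group bandwidth */
    }
}
interfaces {
    node0:fte-0/1/0 {                    /* hypothetical Node device fabric interface */
        forwarding-class-set {
            fab-lossless-pg {
                output-traffic-control-profile fab-lossless-tcp;
            }
        }
    }
}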

Default Scheduling on Node Device Fabric Interfaces

Default scheduling on Node device fabric interfaces is the same as default scheduling on Node device access interfaces. Only the default forwarding classes (best-effort, network-control, fcoe, no-loss, and multidestination) receive port bandwidth, based on the default minimum guaranteed bandwidth (transmit rate) scheduler settings for each default forwarding class.

All of the default forwarding classes are placed in one default group and receive port bandwidth based on their default transmit rate settings (weights). Forwarding classes that are not default forwarding classes receive no bandwidth.

Each default forwarding class receives a guaranteed minimum percentage of the port bandwidth based on the default transmit rate. Table 1 shows the default transmit rate for each of the default forwarding classes.

Table 1: Default Node Device Fabric Interface Forwarding Class Scheduler Configuration

Default Forwarding Class    Transmit Rate (Percentage of Class Group Bandwidth)

best-effort                 5%
fcoe                        35%
no-loss                     35%
network-control             5%
mcast                       20%

Bandwidth is divided among the default forwarding classes in a ratio proportional to the default transmit rate for the forwarding class.

Hierarchical CoS Architecture Across a QFabric System Interconnect Device

Because Interconnect devices support traffic from multiple Node devices that have multiple CoS configurations, configuring CoS on Interconnect device fabric interfaces differs from configuring CoS on Node device access and fabric interfaces.

The hierarchical CoS scheduling structure on the Interconnect device interfaces consists of two tiers:

  1. Fabric forwarding class sets—Similar to fc-sets on Node devices, fabric fc-sets group traffic for transport across the Interconnect device fabric. Fabric fc-sets are global and apply to all traffic that crosses the fabric from all Node devices. See Understanding CoS Fabric Forwarding Class Sets for a detailed description of fabric fc-sets.

  2. Class groups—Fabric fc-sets are grouped into class groups for transport across the Interconnect device.

Node devices and Interconnect devices each have a two-tier hierarchical CoS scheduling architecture. The architectures are slightly different, but each scheduling tier performs analogous functions, as shown in Table 2.

Table 2: Bandwidth Scheduler Architecture on Node Devices and Interconnect Devices

  • Port—Entire amount of bandwidth available to traffic on a port.

      Node devices: Access (xe) or fabric (fte) interfaces

      Interconnect devices: Fabric (fte) or Clos fabric (bfte) interfaces

  • Priority group—Group of traffic types that requires similar CoS treatment. Each priority group receives a portion of the total available port bandwidth.

      Node devices: Forwarding class set (fc-set)

      Interconnect devices: Class group

  • Priority—Most granular tier of bandwidth allocation. Each priority receives a portion of the total available priority group bandwidth.

      Node devices: Forwarding class (mapped to output queue)

      Interconnect devices: Fabric fc-set (mapped to output queue)

Fabric FC-Sets

Fabric fc-sets are groups of forwarding classes that receive similar CoS treatment across the Interconnect device. Fabric fc-sets are global to the QFabric system and apply to all traffic that traverses the fabric, from all connected Node devices. The CoS you configure on a fabric fc-set applies to all the traffic that belongs to that fabric fc-set.

For example, a fabric fc-set that includes the best-effort forwarding class handles all of the best-effort traffic from all of the connected Node devices that traverses the Interconnect device fabric.

There are 12 default fabric fc-sets, including five visible fabric fc-sets and seven hidden fabric fc-sets. The five visible fabric fc-sets have forwarding classes mapped to them by default. By default, the seven hidden fabric fc-sets do not carry traffic, but you can map forwarding classes to the hidden fabric fc-sets if you want to use them.

You can configure the forwarding class membership of each fabric fc-set, and you can configure CoS for each fabric fc-set. However, you cannot create new fabric fc-sets, and you cannot delete the 12 default fabric fc-sets.
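
For example, remapping the default fcoe forwarding class into a different lossless fabric fc-set might look like the following minimal sketch, which assumes that the predefined fabric fc-set names are referenced with the same forwarding-class-sets statement used on Node devices.

[edit class-of-service]
forwarding-class-sets {
    fabric_fcset_noloss2 {    /* predefined fabric fc-set; you cannot create or delete these */
        class fcoe;           /* move the default fcoe forwarding class into this fabric fc-set */
    }
}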

Each fabric fc-set is mapped to an output queue. Each fabric interface has 12 output queues, one for each of the 12 fabric fc-sets. The traffic from all of the forwarding classes mapped to a fabric fc-set uses that fabric fc-set’s output queue.

Fabric fc-sets are grouped into class groups for transport across the Interconnect device.

Class Groups for Fabric FC-Sets

To transport traffic across the fabric, the QFabric system organizes the fabric fc-sets into three default class groups. Class groups are not user-configurable. The three class groups are:

  • Strict-high priority—All traffic in the fabric fc-set fabric_fcset_strict_high. This class group includes the traffic in strict-high priority and network-control forwarding classes and in any forwarding classes you create on a Node device that consist of strict-high priority or network-control forwarding class traffic.

  • Unicast—All traffic in the fabric fc-sets fabric_fcset_be, fabric_fcset_noloss1, and fabric_fcset_noloss2. This class group includes the traffic in the best-effort, fcoe, and no-loss forwarding classes, and the traffic in any forwarding classes you create on a Node device that consist of best-effort or lossless traffic. If you use any of the hidden lossless fabric fc-sets (fabric_fcset_noloss3, fabric_fcset_noloss4, fabric_fcset_noloss5, or fabric_fcset_noloss6), that traffic is part of this class group.

  • Multidestination—All traffic in the fabric fc-set fabric_fcset_multicast1. This class group includes the traffic in the mcast forwarding class and in any forwarding classes you create on a Node device that consist of multidestination traffic. If you use any of the hidden multidestination fabric fc-sets (fabric_fcset_multicast2, fabric_fcset_multicast3, or fabric_fcset_multicast4), that traffic is part of this class group.

Default CoS on Interconnect Device Fabric Interfaces

If you do not configure CoS on the Interconnect device fabric interfaces, the Interconnect device interfaces use the default CoS configuration described in this section.

Default Class Group Scheduling

Default class group bandwidth scheduling is analogous to default fc-set (priority group) scheduling on a Node device. Default class group scheduling uses weighted round-robin (WRR) scheduling, in which each class group receives a portion of the total available fabric interface bandwidth, based on the class group’s traffic type, as shown in Table 3:

Table 3: Class Group Default Scheduling Properties and Membership

  • Strict-high priority

      Fabric fc-sets: fabric_fcset_strict_high

      Forwarding classes (default mapping): All strict-high priority forwarding classes, and network-control

      Scheduling properties (weight): Traffic in the strict-high priority class group is served first. This class group receives all of the bandwidth it needs to empty its queues and therefore can starve other types of traffic during periods of high-volume strict priority traffic. Plan carefully and use caution when determining how much traffic to configure as strict-high priority traffic.

  • Unicast

      Fabric fc-sets: fabric_fcset_be, fabric_fcset_noloss1, and fabric_fcset_noloss2, plus the hidden lossless fabric fc-sets (fabric_fcset_noloss3, fabric_fcset_noloss4, fabric_fcset_noloss5, and fabric_fcset_noloss6) if used

      Forwarding classes (default mapping): best-effort, fcoe, and no-loss. (No forwarding classes are mapped to the hidden lossless fabric fc-sets by default.)

      Scheduling properties (weight): Traffic in the unicast class group receives an 80% weight in the weighted round-robin (WRR) calculations. After the strict-high priority class group has been served, the unicast class group receives 80% of the remaining fabric bandwidth. (If more bandwidth is available, the unicast class group can use more bandwidth.)

  • Multidestination

      Fabric fc-sets: fabric_fcset_multicast1, plus the hidden multidestination fabric fc-sets (fabric_fcset_multicast2, fabric_fcset_multicast3, and fabric_fcset_multicast4) if used

      Forwarding classes (default mapping): mcast. (No forwarding classes are mapped to the hidden multidestination fabric fc-sets by default.)

      Scheduling properties (weight): Traffic in the multidestination class group receives a 20% weight in the WRR calculations. After the strict-high priority class group has been served, the multidestination class group receives 20% of the remaining fabric bandwidth. (If more bandwidth is available, the multidestination class group can use more bandwidth.)

If you use the default fabric CoS configuration, only the five visible fabric fc-sets have traffic mapped to them by default. The fabric fc-sets within each class group are weighted by their transmit rates (guaranteed minimum bandwidth), and they receive bandwidth from the class group’s total bandwidth using weighted round-robin (WRR) scheduling.

Default Fabric FC-Set Bandwidth Scheduling

Default fabric fc-set bandwidth scheduling is analogous to default forwarding class (priority) scheduling on a Node device. Each fabric fc-set receives a guaranteed minimum percentage of the port bandwidth that the class group receives. The guaranteed minimum percentage is called the transmit rate.

Table 4 shows the default transmit rate for each of the default fabric fc-sets.

Table 4: Default Fabric FC-Set Scheduler Configuration

Default Fabric FC-Set       Transmit Rate (Percentage of Class Group Bandwidth)

fabric_fcset_strict_high    0%
fabric_fcset_noloss1        35%
fabric_fcset_noloss2        35%
fabric_fcset_be             10%
fabric_fcset_multicast1     20%

Each fabric fc-set belongs to a class group. Each class group receives a portion of the total available port bandwidth. Each fabric fc-set in a class group receives a portion of the total available class group bandwidth based on the transmit rate (weight) of the fabric fc-set.

Traffic in fabric_fcset_strict_high does not have a default transmit rate because fabric_fcset_strict_high receives all of the bandwidth needed to empty its queue before other queues are served. Traffic in the remaining fabric fc-sets receives bandwidth in a ratio proportional to the default transmit rate of each fabric fc-set.

Each of the following hidden fabric fc-sets receives a default scheduling weight of 1 if you do not configure CoS scheduling for it:

  • fabric_fcset_noloss3

  • fabric_fcset_noloss4

  • fabric_fcset_noloss5

  • fabric_fcset_noloss6

  • fabric_fcset_multicast2

  • fabric_fcset_multicast3

  • fabric_fcset_multicast4

You must explicitly map forwarding classes to hidden fabric fc-sets and configure scheduling for that traffic if you want to use the hidden fabric fc-sets. Default scheduling does not use the hidden fabric fc-sets.
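
As an illustration, using a hidden fabric fc-set involves both steps named above. In the following hedged sketch, mcast-video is a hypothetical multidestination forwarding class, fab-smap is a hypothetical map name, and the scheduler-map-forwarding-class-sets child syntax is assumed to parallel the scheduler-maps syntax.

[edit class-of-service]
forwarding-class-sets {
    fabric_fcset_multicast2 {       /* hidden fabric fc-set; unused by default */
        class mcast-video;          /* hypothetical multidestination forwarding class */
    }
}
schedulers {
    mcast-video-sched {
        transmit-rate percent 10;   /* illustrative guaranteed minimum */
    }
}
scheduler-map-forwarding-class-sets {
    fab-smap {
        forwarding-class-set fabric_fcset_multicast2 scheduler mcast-video-sched;
    }
}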

Default Class Group and Fabric FC-Set Scheduling Example

The following example shows how default scheduling allocates the total port bandwidth among the class groups and fabric fc-sets. The example assumes that traffic is mapped to each of the forwarding classes in the five visible fabric fc-sets, and that the strict-high priority class group consumes an average of 10 percent of the 40-Gbps fabric interface bandwidth (4 gigabits), leaving 90 percent of the fabric interface bandwidth (36 gigabits) for the remaining class groups.

In this scenario, by default, the strict-high priority class group includes one fabric fc-set (fabric_fcset_strict_high), the unicast class group includes three fabric fc-sets (fabric_fcset_be, fabric_fcset_noloss1, and fabric_fcset_noloss2), and the multidestination class group includes one fabric fc-set (fabric_fcset_multicast1). Each individual fabric fc-set receives the following treatment:

  • Strict-high priority class group (fabric_fcset_strict_high)—Assumed to average 10 percent (4 gigabits) for the purposes of this example. Because the strict-high priority class group is served first and receives all of the bandwidth it requires to empty its queue, in real networks the amount of required bandwidth fluctuates and affects the amount of bandwidth available to the other class groups.

    Tip:

    To prevent strict-high priority traffic from using too much bandwidth, you can set a maximum bandwidth limit by configuring a scheduler shaping rate for the fabric_fcset_strict_high fabric fc-set, as shown in the sketch after this example.

  • Unicast class group (fabric_fcset_be, fabric_fcset_noloss1, and fabric_fcset_noloss2)—Each of these fabric fc-sets receives a weighted portion of the 80 percent of the total port bandwidth available to the unicast class group after the strict-high traffic has been served. The weight corresponds to the transmit rate of each fabric fc-set. The following calculations show the minimum port bandwidth allocated to each of the unicast class group fabric fc-sets:

    • fabric_fcset_be

      10/(35+35+10) of 80% of the available port bandwidth (12.5% of 80% of port bandwidth)

      The 10 in the numerator of 10/(35+35+10) is the percentage of bandwidth allocated to fabric_fcset_be by the transmit rate weight. The (35+35+10) in the denominator sums the percentages of bandwidth (transmit rate weights) allocated to the three fabric fc-sets in the unicast class group.

      The 80 percent represents 80 percent of the port bandwidth available after serving strict-high priority traffic (36 gigabits).

      The resulting equation is:

      10/(35+35+10) x (0.8 x 36 gigabits) = approximately 3.6 gigabits

    • fabric_fcset_noloss1 and fabric_fcset_noloss2

      The default minimum bandwidth for the two visible lossless fabric fc-sets is the same because both of these fabric fc-sets have the same transmit rate weight.

      35/(35+35+10) of 80% of the port bandwidth (43.75% of 80% of port bandwidth)

      The 35 in the numerator of 35/(35+35+10) is the percentage of bandwidth allocated to each of the noloss fabric fc-sets by the transmit rate weight. The (35+35+10) in the denominator sums the percentages of bandwidth (transmit rate weights) allocated to the three fabric fc-sets in the unicast class group.

      The 80 percent represents 80 percent of the port bandwidth available after serving strict-high priority traffic (36 gigabits).

      The resulting equation is:

      35/(35+35+10) x (0.8 x 36 gigabits) = approximately 12.6 gigabits

  • Multidestination class group (fabric_fcset_multicast1)—Because only one fabric fc-set is configured by default in the multidestination class group, it receives 100 percent of the 20 percent of the total port bandwidth available to the multidestination class group after the strict-high traffic has been served:

    100/100 of 20% of the available port bandwidth (100% of 20% of available port bandwidth)

    The resulting equation is:

    100/100 x (0.2 x 36 gigabits) = approximately 7.2 gigabits
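
As noted in the tip above, a shaping rate caps strict-high priority bandwidth. A minimal sketch follows; the scheduler and map names are hypothetical, the 10 percent cap is illustrative, and the scheduler-map-forwarding-class-sets child syntax is assumed to parallel the scheduler-maps syntax.

[edit class-of-service]
schedulers {
    strict-high-cap {
        shaping-rate percent 10;    /* illustrative maximum for strict-high traffic */
    }                               /* no priority parameter; Interconnect schedulers do not accept one */
}
scheduler-map-forwarding-class-sets {
    fab-smap {
        forwarding-class-set fabric_fcset_strict_high scheduler strict-high-cap;
    }
}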

Default PFC and Lossless Transport Across the Interconnect Device

The Interconnect device incorporates flow control mechanisms to support lossless transport during periods of congestion on the fabric. To support the priority-based flow control (PFC) feature on the Node devices, the Interconnect device fabric supports lossless transport for up to six IEEE 802.1p priorities when the following two configuration constraints are met:

  1. The IEEE 802.1p priority used for the traffic that requires lossless transport is mapped to a lossless forwarding class (a forwarding class configured with the no-loss parameter or the default fcoe or no-loss forwarding class).

  2. The lossless forwarding class must be mapped to one of the lossless fabric fc-sets (fabric_fcset_noloss1, fabric_fcset_noloss2, fabric_fcset_noloss3, fabric_fcset_noloss4, fabric_fcset_noloss5, or fabric_fcset_noloss6). If you do not explicitly map lossless forwarding classes to fabric fc-sets, lossless forwarding classes are mapped by default to lossless fabric fc-sets fabric_fcset_noloss1 and fabric_fcset_noloss2.

When traffic meets these two constraints, the fabric propagates back-pressure from egress queues during periods of congestion. However, to achieve end-to-end lossless transport across the QFabric system, you must also configure a congestion notification profile to enable PFC on the Node device ingress interfaces. To achieve end-to-end lossless transport across the network, you must configure PFC on all of the devices in the lossless traffic path.
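
For example, the following Node device classifier sketch satisfies constraint 1 for IEEE 802.1p priority 5 (code point 101) by mapping the priority to the lossless fcoe forwarding class; the default mapping of fcoe to fabric_fcset_noloss1 then satisfies constraint 2. (Enable PFC on code point 101 with a congestion notification profile, as sketched earlier in this topic.) The classifier and interface names are hypothetical.

[edit class-of-service]
classifiers {
    ieee-802.1 lossless-classifier {            /* hypothetical classifier name */
        forwarding-class fcoe {
            loss-priority low code-points 101;  /* IEEE 802.1p priority 5 */
        }
    }
}
interfaces {
    node0:xe-0/0/25 {                           /* hypothetical access interface */
        unit 0 {
            classifiers {
                ieee-802.1 lossless-classifier;
            }
        }
    }
}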

For all other combinations of IEEE 802.1p priority to forwarding class mapping and all other combinations of forwarding class to fabric fc-set mapping, the default congestion control mechanism is normal packet drop. For example:

  • Case 1—If the IEEE 802.1p priority 5 is mapped to the lossless fcoe forwarding class, and the fcoe forwarding class is mapped to the fabric_fcset_noloss1 fabric fc-set, then the congestion control mechanism is PFC.

  • Case 2—If the IEEE 802.1p priority 5 is mapped to the lossless fcoe forwarding class, and the fcoe forwarding class is mapped to the fabric_fcset_be fabric fc-set, then the congestion control mechanism is packet drop, and the traffic does not receive lossless treatment.

  • Case 3—If the IEEE 802.1p priority 5 is mapped to the lossless no-loss forwarding class, and the no-loss forwarding class is mapped to the fabric_fcset_noloss2 fabric fc-set, then the congestion control mechanism is PFC.

  • Case 4—If the IEEE 802.1p priority 5 is mapped to the lossless no-loss forwarding class, and the no-loss forwarding class is mapped to the fabric_fcset_be fabric fc-set, then the congestion control mechanism is packet drop, and the traffic does not receive lossless treatment.

  • Case 5—If the IEEE 802.1p priority 5 is mapped to the lossy best-effort forwarding class, and the best-effort forwarding class is mapped to the fabric_fcset_be fabric fc-set, then the congestion control mechanism is packet drop.

  • Case 6—If the IEEE 802.1p priority 5 is mapped to the lossy best-effort forwarding class, and the best-effort forwarding class is mapped to the fabric_fcset_noloss1 fabric fc-set, then the congestion control mechanism is packet drop.

Note:

Lossless transport across the fabric must also meet the following two conditions:

  1. The maximum cable length between the Node device and the Interconnect device is 150 meters of fiber cable.

  2. The maximum frame size is 9216 bytes.

If the MTU is 9216 bytes, in some cases the QFabric system supports only five lossless forwarding classes instead of six because of headroom buffer limitations.

The number of IEEE 802.1p priorities (forwarding classes) the QFabric system can support for lossless transport across the Interconnect device fabric depends on several factors:

  • Approximate fiber cable length—The longer the fiber cable that connects Node device fabric (FTE) ports to the Interconnect device fabric ports, the more data the connected ports need to buffer when a pause is asserted. (The longer the fiber cable, the more frames are traversing the cable when a pause is asserted. Each port must be able to store all of the “in transit” frames in the buffer to preserve lossless behavior and avoid dropping frames.)

  • MTU size—The larger the maximum frame sizes the buffer must hold, the fewer frames the buffer can hold. The larger the MTU size, the more buffer space each frame consumes.

  • Total number of Node device fabric ports connected to the Interconnect device—The higher the number of connected fabric ports, the more headroom buffer space the Node device needs on those fabric ports to support the lossless flows that traverse the Interconnect device. Because more buffer space is used on the Node device fabric ports, less buffer space is available for the Node device access ports, and a lower total number of lossless flows are supported.

The QFabric system supports six lossless priorities (forwarding classes) under most conditions. The priority group headroom that remains after allocating headroom to lossless flows is sufficient to support best-effort and multidestination traffic.

Table 5 shows how many lossless priorities the QFabric system supports under different conditions (fiber cable lengths and MTUs) in cases when the QFabric system supports fewer than six lossless priorities. The number of lossless priorities is the same regardless of how many Node device FTE ports are connected to the Interconnect device. However, the higher the number of FTE ports connected to the Interconnect device, the lower the number of total lossless flows supported. In all cases that are not shown in Table 5, the QFabric system supports six lossless priorities.

Note:

The system does not perform a configuration commit check that compares available system resources with the number of lossless forwarding classes configured. If you commit a configuration with more lossless forwarding classes than the system resources can support, frames in lossless forwarding classes might be dropped.

Table 5: Lossless Priority (Forwarding Class) Support for Node Devices When Fewer than Six Lossless Priorities Are Supported

MTU in Bytes    Fiber Cable Length in Meters (Approximate)    Maximum Number of Lossless Priorities (Forwarding Classes) on the Node Device

9216 (9K)       100                                           5
9216 (9K)       150                                           5

Note:

The total number of lossless flows decreases as resource consumption increases. For a Node device, the higher the number of FTE ports connected to the Interconnect device, the larger the MTU, and the longer the fiber cable length, the fewer total lossless flows the QFabric system can support.

Configuring CoS on Interconnect Device Fabric Interfaces

If you do not want to use default CoS scheduling across the Interconnect device fabric, you can configure two-tier hierarchical scheduling on the external 40-Gbps fabric interfaces (fte interfaces) and on the internal 40-Gbps Clos fabric interfaces (bfte interfaces).

This section describes the similarities and differences between Node device scheduling and Interconnect device scheduling, the components you use to configure hierarchical scheduling, and how hierarchical scheduling allocates bandwidth.

Similarities Between Node Device Scheduling and Interconnect Device Scheduling

Configuring two-tier hierarchical scheduling on Interconnect device fabric interfaces follows the same general process as configuring scheduling on Node device interfaces, in that you perform the following actions in both cases:

  • Define drop profiles to control packet loss for lossy traffic; do not use drop profiles on lossless traffic or multidestination traffic. (However, if you configure a drop profile on lossless traffic or on multidestination traffic, the system does not return a commit error.)

  • Define schedulers to configure the bandwidth for different types of traffic.

  • Map schedulers to output queues (by mapping schedulers to forwarding classes on Node devices, and by mapping schedulers to fabric fc-sets on Interconnect devices).

  • Associate hierarchical scheduling with interfaces to apply scheduling to traffic on those interfaces.

Another similarity is that you cannot configure classifiers or congestion notification profiles (to enable PFC) on fabric interfaces. Flow control is applied automatically to lossless queues on fabric interfaces, and packet classification occurs at the Node device ingress interface.

Differences Between Node Device and Interconnect Device Hierarchical Scheduling

Configuring the two-tier scheduling hierarchy on Interconnect device fabric interfaces differs in several important ways from configuring the two-tier scheduling hierarchy on Node device interfaces, as shown in Table 6:

Table 6: Node Device and Interconnect Device Hierarchical Scheduling Differences

  • Priority scheduling hierarchy tier

      Node devices: Each forwarding class is mapped to an output queue, and classifiers map priorities (IEEE 802.1p code points) to forwarding classes. You map schedulers to forwarding classes to provide scheduling for priorities. You can create and delete forwarding classes.

      Interconnect devices: Each fabric fc-set is mapped to an output queue and is mapped internally to priorities (IEEE 802.1p code points). You map schedulers to fabric fc-sets to provide scheduling for priorities. You cannot create or delete fabric fc-sets; only the 12 default fabric fc-sets are available (but you can change the default mapping of forwarding classes to fabric fc-sets).

  • Scheduler mapping to priorities (bandwidth allocation to priorities)

      Node devices: The scheduler-maps statement maps a forwarding class to a scheduler.

      Interconnect devices: The scheduler-map-forwarding-class-sets statement maps a fabric fc-set to a scheduler.

  • Priority group scheduling hierarchy tier

      Node devices: Each fc-set represents a priority group. You associate fc-sets with traffic control profiles to provide scheduling for priority groups. You can create and delete fc-sets.

      Interconnect devices: Each class group represents a priority group. You cannot change the types of traffic associated with a class group (each class group is dedicated to one type of traffic: strict-high priority, unicast, or multidestination traffic), and you cannot create or delete class groups.

  • Priority group bandwidth allocation method

      Node devices: You create traffic control profiles to determine the port scheduling resources assigned to priority groups (fc-sets).

      Interconnect devices: You do not configure priority group (class group) scheduling using a traffic control profile. Instead, the QFabric system uses the sum of the fabric fc-set minimum guaranteed bandwidths (transmit rates) to determine the port scheduling resources for the class group, as described in Hierarchical Scheduling Bandwidth Allocation later in this topic.

  • Scheduler transmit rate, shaping rate, and priority parameters

      Node devices: You can specify either a percentage value or an absolute value for the transmit rate and shaping rate. Scheduling for forwarding classes includes the priority parameter, which sets the scheduling priority to either low or strict-high.

      Interconnect devices: You can specify only a percentage value for the transmit rate and shaping rate, not an absolute value. You cannot specify the priority parameter because the class groups automatically determine the scheduling priority; if you try to map a scheduler that includes a priority setting to a fabric fc-set, the system generates a commit error.

  • Hierarchical scheduler association with interfaces

      Node devices: You specify an access interface or a Node device fabric interface and associate it with an fc-set (which determines the forwarding classes that use the interface) and a traffic control profile (which determines scheduling for both the priority group and the priorities in the priority group).

      Interconnect devices: You specify an Interconnect device fabric interface and associate it with a fabric forwarding class set scheduler map, which determines the fabric fc-sets associated with the interface and their scheduling properties. You can associate one fabric forwarding class set scheduler map with an interface; different interfaces can have different maps.

  • Classifiers and congestion notification profiles (enabling PFC on priorities)

      Node devices: You can attach classifiers and congestion notification profiles to access interfaces, although you cannot attach them to fabric interfaces.

      Interconnect devices: You cannot attach classifiers or congestion notification profiles to fabric interfaces.

Note:

Because the queue scheduler transmit rate is used differently on Node devices and Interconnect devices, and because you cannot specify the scheduler priority parameter on Interconnect devices, you should configure different schedulers for Node device interfaces and Interconnect device interfaces.

Hierarchical Scheduling Configuration Components

Some of the configuration components used for Interconnect device CoS scheduling are similar to the CoS configuration components used for Node device CoS scheduling, but some of the components are different because configuring the two-tier scheduling hierarchy differs in some respects on the two devices. Figure 2 shows a block diagram of the components used to configure hierarchical scheduling on the Interconnect device.

Figure 2: Configuration Components of Interconnect Device Hierarchical Scheduling

Hierarchical Scheduling Bandwidth Allocation

The purpose of hierarchical scheduling is to allocate the available port bandwidth to class groups (priority groups), and then to allocate class group bandwidth to the fabric fc-sets (priorities) that belong to the class group. Hierarchical scheduling provides better port bandwidth utilization and greater flexibility to allocate port resources to queues (priorities) and to groups of queues (priority groups) than flat scheduling.

Note:

Available port bandwidth is the bandwidth that remains after the port services all of its strict-high priority traffic.

You allocate bandwidth to priorities by configuring scheduling for fabric fc-sets. For each fabric fc-set, you can configure a scheduler that defines the guaranteed minimum bandwidth (transmit rate), the maximum bandwidth (shaping rate), and the packet drop profile for lossy unicast traffic assigned to that fabric fc-set. (Lossless fabric fc-sets use flow control to prevent packet loss and do not use drop profiles; multidestination traffic does not use drop profiles.)

Bandwidth is allocated to priority groups (class groups) automatically, based on the minimum guaranteed bandwidth (transmit rate) of the fabric fc-sets that belong to the class group. The sum of the transmit rates of the fabric fc-sets in a class group equals the total minimum guaranteed port bandwidth of that class group.

The QFabric system therefore uses the fabric fc-set transmit rate in two ways to calculate bandwidth allocation:

  1. The transmit rate of a fabric fc-set sets the minimum guaranteed bandwidth allocated to that fabric fc-set from the class group bandwidth pool.

  2. The sum of the fabric fc-set transmit rates in a class group sets the minimum guaranteed port bandwidth allocated to that class group.

The transmit rate percentage that you configure in a fabric fc-set scheduler does not necessarily equal the minimum percentage of available port bandwidth allocated to that fabric fc-set, because port bandwidth is allocated to strict-high priority traffic first, and only the remaining port bandwidth is allocated to the rest of the traffic based on the fabric fc-set transmit rates. In other words, the bandwidth available to a class group after the system services strict-high priority traffic is divided among the fabric fc-sets in that class group in proportion to the transmit rate configured for each fabric fc-set.

Hierarchical scheduling on fabric interfaces allocates guaranteed minimum port bandwidth in the following manner:

  1. The sum of the transmit rates of the fabric fc-sets in a class group determines the amount of available port bandwidth allocated to the class group. For example, a class group that has three fabric fc-sets with transmit rates of 10 percent, 20 percent, and 30 percent receives 60 percent of the available port bandwidth (10+20+30 = 60).

  2. The fabric fc-set transmit rate is used again to determine the proportion of class group bandwidth allocated to the fabric fc-set. For example, in a class group with three fabric fc-sets that have transmit rates of 10 percent, 20 percent, and 30 percent (class group receives 60 percent of available port bandwidth), the fabric fc-set with a transmit rate of 20 percent receives one-third of the class group bandwidth (20 is one-third of 60).

    It is important to understand that this is not one-third of the total available port bandwidth, but one-third of the 60 percent of total available port bandwidth that the class group receives.

Note:

The sum of the transmit rates of all of the fabric fc-sets in the unicast and multidestination class groups cannot exceed 100 percent. That is, you cannot configure more than 100 percent of the port's minimum guaranteed bandwidth across all of the unicast and multidestination fabric fc-sets combined.

Interconnect Device Hierarchical Scheduling (Class Group and Fabric FC-Set) Example

The following example shows how configuring hierarchical scheduling allocates the total port bandwidth among the class groups and fabric fc-sets. The example shows a configuration in which:

  • The strict-high priority class group has no scheduler (transmit rate of fabric_fcset_strict_high is 0 percent and no maximum bandwidth is set, so the strict-high priority traffic can use as much bandwidth as needed).

  • The unicast class group consists of the following three fabric fc-sets and transmit rates:

    • fabric_fcset_be, 25 percent

    • fabric_fcset_noloss1, 15 percent

    • fabric_fcset_noloss2, 20 percent

  • The multidestination class group consists of the following two fabric fc-sets and transmit rates:

    • fabric_fcset_multicast1, 10 percent

    • fabric_fcset_multicast2, 30 percent
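
A hedged sketch of what such a configuration might look like follows; the scheduler and map names are hypothetical, and the scheduler-map-forwarding-class-sets child syntax is assumed to parallel the scheduler-maps syntax.

[edit class-of-service]
schedulers {
    be-sched      { transmit-rate percent 25; }
    noloss1-sched { transmit-rate percent 15; }
    noloss2-sched { transmit-rate percent 20; }
    mcast1-sched  { transmit-rate percent 10; }
    mcast2-sched  { transmit-rate percent 30; }
}
scheduler-map-forwarding-class-sets {
    fab-smap {
        /* no scheduler for fabric_fcset_strict_high, so its transmit rate stays 0 percent */
        forwarding-class-set fabric_fcset_be scheduler be-sched;
        forwarding-class-set fabric_fcset_noloss1 scheduler noloss1-sched;
        forwarding-class-set fabric_fcset_noloss2 scheduler noloss2-sched;
        forwarding-class-set fabric_fcset_multicast1 scheduler mcast1-sched;
        forwarding-class-set fabric_fcset_multicast2 scheduler mcast2-sched;
    }
}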

Total available port bandwidth (port bandwidth remaining after serving strict-high priority traffic) is divided between the unicast and multidestination class groups:

  • Unicast class group—Contains three fabric fc-sets (fabric_fcset_be, fabric_fcset_noloss1, and fabric_fcset_noloss2) with a combined transmit rate of 60 percent (25+15+20). Therefore, the unicast class group receives 60 percent of the total available port bandwidth.

  • Multidestination class group—Contains two fabric fc-sets (fabric_fcset_multicast1 and fabric_fcset_multicast2) with a combined transmit rate of 40 percent (10+30). Therefore, the multidestination class group receives 40 percent of the total available port bandwidth.

The class group bandwidth is divided among the fabric fc-sets based on the transmit rate of each fabric fc-set in relation to the class group bandwidth.

The unicast class group bandwidth is divided among its three fabric fc-sets:

  • fabric_fcset_be

    25/(15+20+25) of 60 percent of the available port bandwidth (41.7 percent of 60 percent of available port bandwidth)

    The 25 in the numerator of 25/(15+20+25) is the percentage of bandwidth allocated to fabric_fcset_be by the transmit rate weight. The (15+20+25) in the denominator sums the percentages of bandwidth (transmit rate weights) allocated to the three fabric fc-sets in the unicast class group.

    The 60 percent represents 60 percent of the port bandwidth available after serving strict-high priority traffic. If no strict-high priority traffic is on the system, the equation results in the following bandwidth allocation to fabric_fcset_be:

    25/(15+20+25) x (0.6 x 40 gigabits) = 10 gigabits

  • fabric_fcset_noloss1

    15/(15+20+25) of 60 percent of the available port bandwidth (25 percent of 60 percent of available port bandwidth)

    The 15 in the numerator of 15/(15+20+25) is the percentage of bandwidth allocated to fabric_fcset_noloss1 by the transmit rate weight. The (15+20+25) in the denominator sums the percentages of bandwidth (transmit rate weights) allocated to the three fabric fc-sets in the unicast class group.

    The 60 percent represents 60 percent of the port bandwidth available after serving strict-high priority traffic. If no strict-high priority traffic is on the system, the equation results in the following bandwidth allocation to fabric_fcset_noloss1:

    15/(15+20+25) x (0.6 x 40 gigabits) = 6 gigabits

  • fabric_fcset_noloss2

    20/(15+20+25) of 60 percent of the available port bandwidth (33.3 percent of 60 percent of available port bandwidth)

    The 20 in the numerator of 20/(15+20+25) is the percentage of bandwidth allocated to fabric_fcset_noloss2 by the transmit rate weight. The (15+20+25) in the denominator sums the percentages of bandwidth (transmit rate weights) allocated to the three fabric fc-sets in the unicast class group.

    The 60 percent represents 60 percent of the port bandwidth available after serving strict-high priority traffic. If no strict-high priority traffic is on the system, the equation results in the following bandwidth allocation to fabric_fcset_noloss2:

    20/(15+20+25) x (0.6 x 40 gigabits) = 8 gigabits

The multidestination class group bandwidth is divided among its two fabric fc-sets:

  • fabric_fcset_multicast1

    10/(10+30) of 40 percent of the available port bandwidth (25 percent of 40 percent of available port bandwidth)

    The 10 in the numerator of 10/(10+30) is the percentage of bandwidth allocated to fabric_fcset_multicast1 by the transmit rate weight. The (10+30) in the denominator sums the percentages of bandwidth (transmit rate weights) allocated to the two fabric fc-sets in the multidestination class group.

    The 40 percent represents 40 percent of the port bandwidth available after serving strict-high priority traffic. If no strict-high priority traffic is on the system, the equation results in the following bandwidth allocation to fabric_fcset_multicast1:

    10/(10+30) x (0.4 x 40 gigabits) = 4 gigabits

  • fabric_fcset_multicast2

    30/(10+30) of 40 percent of the available port bandwidth (75 percent of 40 percent of available port bandwidth)

    The 30 in the numerator of 30/(10+30) is the percentage of bandwidth allocated to fabric_fcset_multicast2 by the transmit rate weight. The (10+30) in the denominator sums the percentages of bandwidth (transmit rate weights) allocated to the two fabric fc-sets in the multidestination class group.

    The 40 percent represents 40 percent of the port bandwidth available after serving strict-high priority traffic. If no strict-high priority traffic is on the system, the equation results in the following bandwidth allocation to fabric_fcset_multicast2:

    30/(10+30) x (0.4 x 40 gigabits) = 12 gigabits

Configuring Scheduling on Node Device Fabric Interfaces

If you do not want to use default CoS scheduling on Node device fabric interfaces, you can configure two-tier hierarchical scheduling (ETS) the same way that you configure ETS on Node device access interfaces.

Similarities Between Node Device Fabric Interface and Access Interface Scheduling

Configuring scheduling on a Node device fabric interface is similar to configuring scheduling on an access interface in many ways. In both cases, you configure:

  • Schedulers to specify the output scheduling for forwarding class traffic

  • Scheduler maps to map schedulers to forwarding classes

  • Forwarding classes (or use the default forwarding classes)

  • Forwarding class sets (groups of forwarding classes that require similar CoS treatment)

  • A separate fc-set for strict-high priority traffic (an fc-set cannot contain a mix of strict-high priority traffic and traffic with a different priority)

  • Traffic control profiles to specify the output scheduling for fc-sets

  • Traffic control profile and fc-set mapping to interfaces

You configure ETS on Node device fabric interfaces in the same way that you configure it on Node device access interfaces, and ETS works the same way on both types of interfaces.

In addition, strict-high priority queues are served first, and then the remaining port bandwidth is allocated to other traffic.

Differences Between Node Device Fabric Interface and Access Interface Scheduling

Configuring scheduling on a Node device fabric interface differs from configuring scheduling on an access interface in several ways. On fabric interfaces:

  • You cannot configure classifiers.

  • You cannot configure congestion notification profiles (flow control is applied automatically to lossless forwarding classes).

  • You specify the interface name differently.

Congestion Management

The Interconnect device is a shared component for all of the connected Node devices. Configuring scheduling on the external fabric interfaces (fte) and the internal Clos fabric interfaces (bfte) enables you to ensure predictable bandwidth usage for traffic flows across the Interconnect device.

Although minimal congestion is expected on the 40-Gbps fabric interfaces, you should configure congestion management to control packet drop during periods of congestion.

Lossy (Best Effort) Unicast Traffic

For unicast traffic that does not require lossless treatment, configure drop profiles (the standard Junos OS packet drop mechanism) to control packet drop during periods of congestion. (Drop profiles are not applied to multidestination traffic.)

A drop profile sets weighted random early detection (WRED) thresholds for dropping packets under different levels of congestion. Congestion levels for packet drop thresholds are fill levels of the output queue. When the output queue fills to a configured threshold, packet drop begins at the configured drop rate. When the output queue fills to a second configured threshold, packet drop reaches the configured maximum drop rate. You can apply different drop profiles to different types of traffic to achieve the desired pattern of packet loss during periods of congestion.

We recommend that you configure a relatively aggressive drop profile for traffic with a high loss priority and a less aggressive drop profile for traffic with a lower loss priority.

To create a drop profile and apply it to traffic of a certain loss priority:

  1. Set loss priorities (low, medium-high, high) for different types of lossy unicast traffic when you configure classifiers and apply them to Node device access interfaces.

  2. For each loss priority, configure at least one drop profile to define the WRED packet drop probability at different queue fill level thresholds. Create a more aggressive drop profile for traffic with a high loss priority, and progressively less aggressive drop profiles for traffic with medium-high and low loss priorities.

  3. As part of scheduler configuration, configure a drop profile map, which maps a drop profile to a loss priority. A scheduler drop profile map can include mapping each loss priority to a drop profile, so you can specify different drop profiles for different traffic loss priorities in one scheduler. The scheduler uses the configured drop profile map to apply different drop profiles to traffic of different loss priorities, and thus control packet drop during periods of congestion.
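
A hedged sketch of steps 2 and 3 follows. The profile and scheduler names, fill levels, and drop probabilities are illustrative only, not recommended values.

[edit class-of-service]
drop-profiles {
    dp-aggressive {                     /* for high loss priority traffic */
        interpolate {
            fill-level [ 25 50 ];       /* queue fill thresholds (percent) */
            drop-probability [ 0 100 ]; /* drop rates at those thresholds */
        }
    }
    dp-gentle {                         /* for low loss priority traffic */
        interpolate {
            fill-level [ 75 95 ];
            drop-probability [ 0 50 ];
        }
    }
}
schedulers {
    be-sched {                          /* hypothetical scheduler for lossy unicast traffic */
        transmit-rate percent 25;
        drop-profile-map loss-priority high protocol any drop-profile dp-aggressive;
        drop-profile-map loss-priority low protocol any drop-profile dp-gentle;
    }
}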

Lossless Traffic

Do not configure drop profiles for lossless traffic. If you intend to map a scheduler to a lossless fabric fc-set, do not configure a drop profile for that scheduler.

The QFabric system automatically applies flow control that is similar to priority-based flow control (PFC) to traffic in lossless fabric fc-sets to prevent packet loss. In addition, you must enable PFC on the IEEE 802.1p code points for the lossless traffic at the Node device ingress interface to support lossless transport across the QFabric system. (You should also configure PFC across the Ethernet network to support lossless transport across the rest of the network.)

Multidestination Traffic

Drop profiles are not supported for multidestination traffic. Do not configure a drop profile in schedulers that you want to use for multidestination traffic.