When an interface is operationally down, or ARP requests to the next-hop gateway are not consistently returned, the path in question will be removed from routing decisions. In concert with dynamic routing protocols such as BGP, network software is able to determine the best path to the destination. However in some deployments, routers can be pre-configured with all potentially available transports, or lacking in dynamic routing protocols to determine if the path to the final destination is available on a given route. As a result, the path selected by the router might effectively black-hole traffic.
A unique advantage of Session Smart Routing is that it can understand patterns and characteristics of the underlying traffic passing through the system. These heuristics can then be leveraged on a number of different parameters in order to make more intelligent routing and load balancing decisions.
- Per Service
- Per Path
- Per transport and application such as TCP, UDP, TLS etc
- Per destination (/32 address only)
- Per traffic-class (high, medium, low, best-effort)
The Session Smart Router (SSR) collects key performance metrics to assist in making load balancing decisions, such as:
- Number of TCP connection errors
- Number of UDP failures due to ICMP unreachable
- Time to first data packet for TCP and TLS
Service Health Learning can operate in a detection mode to gather the metrics listed above, and then enforce path selection based on the experienced values.
Detection Mode is configured on the service-route, and allows you to gather statistics based on the traffic profiles as described in Session Establishment Metrics. This mode defines the criteria the 128T router uses to apply load balancing decisions. Detection mode allows you to visualize the current network and understand the necessary configuration for enforcement.
Configure Detection Mode
Detection Mode is configured per service-route, and gathers statistics using Session Establishment Metrics.
To enable Detection Mode:
- Under Authority, select a router.
- Scroll down to Service Routes.
- Click ADD.
- Enter a Name for the new Service Route, and click SAVE.
- Choose the Service Name.
- Choose the Service Route Type and the Peer (if necessary).
The following is an example configuration from the PCLI.
Reachability Detection is now enabled and will gather service route information. Configure as many detection settings as necessary to gather information about the network connectivity. The next step is to set Enforcement Mode parameters for load balancing.
In Enforcement Mode, you define a traffic profile using the parameters from Detection Mode to apply criteria that determines the health of a path and subsequent selection by the load balancer.
Configure Enforcement Mode
It is easiest to use the Service Routes configured in Detection Mode and simply enable Reachability Detection Enforcement.
- Under Authority, select a Router.
- Scroll down to Service Routes.
- Select the Service Route.
- Under Reachability Detection Settings, enable Reachability Detection Enforcement.
- Define the Enforcement Detection Window (time). The
detection-window(default 5s) determines how often the stats will be aggregated in terms of min, max, and median.
- Define an Enforcement Hold-Down time. This determines how long the path stays down upon determining it is unusable.
- If you have previously configured a Reachability Profile, select the appropriate profile. If there is no existing Reachability Detection Profile, use the procedure below.
- Scroll to Reachability Detection Enforcement Probes and click ADD.
- Enter a name for the probe and click SAVE.
- In the Probe Type window, select a probe type from the drop-down. The default probe type is an ICMP probe.
- Click Validate, and then Commit.
Creating a Reachability Detection Profile
- Go to the Router level (up one level).
- Scroll down to Reachability Profile, and click ADD.
- Under Protocol, click ADD.
- Select the Protocol type from the drop-down (tcp, tls, udp) and click SAVE.
- Under the protocol class label, click ADD.
- Select a traffic class ID and click SAVE. The traffic-class configuration allows you to configure different treatments for different classes of traffic for the same service.
- In the Acceptable Error Threshold field, set the threshold.
- In the
time-to-establishment.labelfield, set Enabled to true. Define the max and mean times as necessary. This configures the time allowed to establish a session for the protocol.
- Return to the Router level.
- Select the Service Route where the Enforcement settings will be in affect.
- Scroll down to Reachability Detection settings. In the Reachability Detection Enforcement Profile select the Reachability Profile you just created.
- Click Validate, and then Commit.
The following is an example Enforcement configuration with a Reachability Profile entered from the PCLI.
When a traffic profile is configured, it allows for the enforcement of path characteristics.
traffic-profileallows the user to configure enforcement parameters for tcp, udp and tls.
traffic-classallows the user to configure different treatments for different classes of traffic for the same service. The same default values (as described below) apply to all the classes.
acceptable-error-threshold(expressed in percentage) is the amount of errors acceptable on the path before taking it offline. For TCP, this includes the session closed before establishment, any ICMP error that constitutes destination unreachable, and session timeout before establishment. For UDP, this includes the destination unreachable class of ICMP errors.
time-to-establishmentallows the user to configure a max and a mean time for a session to be established as defined for the protocol within the configured detection-window. The following table describes the default values.
|Config||Type||Default Value||How to disable?|
|- max||milliseconds||500 ms||0 ms|
|- mean||milliseconds||250 ms||0 ms|
Once a profile is configured, the above defaults are enforced. To turn off the specific enforcement, configure an appropriate value as listed above.
An ICMP probe, as described below is built into the system, providing the health probe method. Other application specific probes (HTTP, TLS, SIP, etc.) are made possible with external plugins utilizing the REST API added for this feature. The two main aspects are:
- Load-balancer API to declare select paths as down
- Mechanism to report activity or inactivity over a path to an external application
Health Probes REST API
The reachability manager provides a REST API for setting, getting, and deleting probe state. Note that the state is cleared when the probe-type becomes disabled.
If any probe sets state to down for the service-path, the path is marked as down. The path may still periodically become up to allow organic traffic for one interval as it exits the hold-down period. This can be seen through the
show service-path command.
For the REST API, either (service-route + service) or (network-interface) must be given. Network-interface may map to more than one service-route. One service-route may map to multiple service-paths.
The default probe is an ICMP based health probe to detect the liveliness of the remote endpoint. Below is a sample config:
The probe-name for the REST API is
The following table describes the configuration values above:
|router > icmp-probe-profile|
|name||string||key||Unique name to identify an ICMP probe profile|
|probe-interval||uint32 (seconds) range: 1-3600||default: 10||Duration (in seconds) of how often to perform a link test to the destination.|
|number-of-attempts||uint8 range: 1-20||default: 4||Number of consecutive ICMP ping requests to be sent within the probe-duration before deciding that destination is unreachable.|
|probe-duration||uint8 (seconds) range: 1-10||default: 1||Duration (in seconds) within which to reach the destination. Each attempt will be made in probe-duration / number-of-attempts interval.|
|max-loss||uint8 (percent)||default: 10%||The amount of acceptable loss on the link. The loss will be determined by sending number-of-attempts ICMP requests and waiting the probe-duration to get the replies back.|
|latency > max||milliseconds||default: 250||Maximum acceptable latency based on the ping test.|
|latency > mean||milliseconds||default: 100||The average latency on a path based on the ping test.|
Troubleshooting the ICMP Probe Profile
If any of the addresses listed in the profile do not meet the SLA requirements, all service-routes referencing this profile are marked as down.
Stats are available for number of failed-to-send, response-timeout, and ping success in total, and for ping exceeded-by-max-loss, exceeded-by-max-latency, and exceeded-by-mean-latency.
Ping state will be detected for each unique combination of probe-profile, service-name, and tenant.
If an ICMP probe fails to send requests, the PCLI ping command and highway logs can be used to debug. If
ping node <node-name> egress-interface <intf-name> gateway <gateway-ip> <address> succeeds where the node and egress-interface match the service-route next-hop and the gateway is from the routing FIB or the interface gateway, then the ICMP probe will be able to send a ping.
Probe-interval and probe-duration should be tuned so that probe-interval + probe-duration is less than the detection-window, allowing the ping to run to completion within the detection window. If the ping interval is longer, when the path is declared down, is held-down, and is brought up for one window, the probe may say the path is down and the service-path will be brought down again, bouncing the path repeatedly.
Custom Health Probe Plugins
The extensibility APIs described above enable plugins to be developed for performing additional probes using TCP, TLS, HTTP, SIP etc. The plugin workflow for these types of probes would be as follows:
- Augment the service-route > reachability-detection > probe choice statement to plugin specific probe configuration. In general, there is no restriction enforced on what and how many plugins run a reachability check in parallel.
- The probe plugin operates independently to determine if the path is up or down. Based on the specific business logic, the plugin triggers the path down using the load balancer APIs.
- The probe plugin uses the probe-status REST API to set service-paths up or down.
The algorithm for tracking errors, establishment times, etc. is run on a per-service, per-path and per-traffic-class basis. In addition, the following destination address are also tracked:
- Any fully qualified address (/32 for IPv4 and /128 for IPv6) configured in service > address is tracked for the above metrics.
- When there is a service-route > nat-target or service-route > next-hop > target-address configured, they are tracked for the metrics above.
Load Balancer Behavior
An algorithm is used to flag and recover path failures.
- The enforcement algorithm calculates the health of each path based on the profile at the configured detection interval.
- When any of the configured thresholds are exceeded, the path is flagged as unusable. The first test to fail causes the path to be taken out of consideration from the load balancer.
- Once the path is deemed to be not healthy: The subsequent sessions are used for routing. In the event that none of the paths are healthy, the best-effort algorithm picks one of the paths for routing sessions. This can be disabled by turning off best-effort configuration in the service-policy.
When a path becomes unavailable, all the existing sessions continue to be routed over this path. Since this is non-SVR, changing a destination or path might have unintended consequences.
The path is kept down for the configured hold-down time before being put back in service. At that time, the tracking and enforcement algorithm restarts.
If the value of the threshold or window is too small, the link will go in and out of service. If this is the case, adjust the threshold value in the configuration.
Alarms and events:
- During enforcement mode, if a path is taken out of service for a given service, protocol, or traffic class, an alarm is generated to indicate the event. The alarm is cleared when the path is put back in service.
show statscommand can be used for capturing the “current” snapshot of the relevant stats.
- The Meets-SLA field has been added to the
show load-balancercommand to indicate the result of the reachability detection.
show service-pathoutput indicates when a non-SVR path is no longer in-service and why. Two fields have been added to the show command: Meets-SLA and Reachability-Probes. Meets-SLA indicates if a service agent meets or satisfies SLA as a result of reachability detection. Reachability-Probes shows the probes status from reachability detection.