Defining and Measuring Network Availability

This topic includes the following sections:

Defining Network Availability

Availability of a service provider’s IP network can be thought of as the reachability between the regional points of presence (POP), as shown in Figure 1.

In this example, when you use a full mesh of measurement points, where every POP measures availability to every other POP, you can calculate the total availability of the service provider’s network. This KPI can also help monitor the service level of the network, and the service provider and its customers can use it to determine whether they are operating within the terms of their service-level agreement (SLA).

Where a POP consists of multiple routers, take measurements to each router, as shown in Figure 2.

Measurements include:

• Path availability—Availability of an egress interface B1 as seen from an ingress interface A1.

• Router availability—Percentage of path availability of all measured paths terminating on the router.

• POP availability—Percentage of router availability between any two regional POPs, A and B.

• Network availability—Percentage of POP availability for all regional POPs in the service provider’s network.
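As a sketch of this rollup, the following Python fragment computes POP availability from path availabilities; the percentages are invented purely for illustration:

```python
# Roll up availability: paths -> routers -> POP.
# All availability figures below are invented for the example.

def mean(values):
    return sum(values) / len(values)

# Path availability (percent), keyed by (ingress, egress) interface.
path_availability = {
    ("A1", "B1"): 99.99,
    ("A1", "B2"): 99.95,
    ("A2", "B1"): 100.0,
    ("A2", "B2"): 99.90,
}

# Router availability: mean of all measured paths terminating on the router.
router_b1 = mean([v for (src, dst), v in path_availability.items() if dst == "B1"])
router_b2 = mean([v for (src, dst), v in path_availability.items() if dst == "B2"])

# POP availability: mean router availability between the two POPs.
pop_a_to_b = mean([router_b1, router_b2])
print(round(pop_a_to_b, 3))
```

Network availability would then be the mean of the POP availability figures across all POP pairs.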

To measure POP availability of POP A to POP B in Figure 2, you must measure the following four paths:

Path A1 => B1
Path A1 => B2
Path A2 => B1
Path A2 => B2

Measuring availability from POP B to POP A would require a further four measurements, and so on.
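The paths in each direction can be enumerated programmatically; this sketch uses the interface names from Figure 2:

```python
from itertools import product

pop_a = ["A1", "A2"]
pop_b = ["B1", "B2"]

# POP A to POP B: every ingress interface in A to every egress in B.
a_to_b = list(product(pop_a, pop_b))
# The reverse direction requires a further four measurements.
b_to_a = list(product(pop_b, pop_a))

print(a_to_b)
print(len(a_to_b) + len(b_to_a))  # eight directional paths in total
```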

A full mesh of availability measurements can generate significant management traffic. From the sample diagram above:

• Each POP has two co-located provider edge (PE) routers, each with 2xSTM1 interfaces, for a total of 18 PE routers and 36xSTM1 interfaces.

• There are six core provider (P) routers, four with 2xSTM4 and 3xSTM1 interfaces each, and two with 3xSTM4 and 3xSTM1 interfaces each.

This makes a total of 68 interfaces. A full mesh of paths between every interface is:

[n x (n - 1)] / 2, which gives [68 x (68 - 1)] / 2 = 2278 paths

To reduce management traffic on the service provider’s network, instead of generating a full mesh of interface availability tests (that is, from each interface to every other interface), you can measure from each router’s loopback address. This reduces the number of availability measurements to one per pair of routers, or:

[n x (n - 1)] / 2, which gives [24 x (24 - 1)] / 2 = 276 measurements

This measures availability from each router to every other router.
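Both counts come from the same full-mesh formula, n x (n - 1) / 2, sketched here:

```python
def full_mesh_paths(n: int) -> int:
    """Bidirectional paths in a full mesh of n measurement points."""
    return n * (n - 1) // 2

print(full_mesh_paths(68))  # 2278 interface-to-interface paths
print(full_mesh_paths(24))  # 276 loopback-to-loopback measurements
```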

Monitoring the SLA and the Required Bandwidth

A typical SLA between a service provider and a customer might state:

A Point of Presence is the connection of two back-to-back provider edge routers to separate core provider routers using different links for resilience. The system is considered to be unavailable when either an entire POP becomes unavailable or for the duration of a Priority 1 fault.

An SLA availability figure of 99.999 percent for a provider’s network corresponds to a downtime of approximately 5 minutes per year. Therefore, to measure this proactively, you must take availability measurements more frequently than once every five minutes. With a standard size of 64 bytes per ICMP ping request, one ping test per minute generates 7680 bytes of traffic per hour per destination, including ping responses. A full mesh of ping tests to 276 destinations generates 2,119,680 bytes per hour, or about 4.7 Kbps, which represents the following:

• On an OC3/STM1 link of 155.52 Mbps, a utilization of approximately 0.003 percent

• On an OC12/STM4 link of 622.08 Mbps, a utilization of approximately 0.0008 percent

With a size of 1500 bytes per ICMP ping request, one ping test per minute generates 180,000 bytes per hour per destination, including ping responses. A full mesh of ping tests to 276 destinations generates 49,680,000 bytes per hour, or about 110.4 Kbps, which represents the following:

• On an OC3/STM1 link, approximately 0.071 percent utilization

• On an OC12/STM4 link, approximately 0.018 percent utilization
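The traffic arithmetic can be derived from first principles with the following sketch, which counts one equal-sized request and response per test and, like the byte counts in the text, ignores IP and ICMP header overhead:

```python
# Ping-traffic arithmetic: request plus equal-sized response per test.

def hourly_bytes(payload_bytes, tests_per_minute, destinations):
    per_test = 2 * payload_bytes  # request + response
    return per_test * tests_per_minute * 60 * destinations

def utilization_percent(bytes_per_hour, link_bps):
    bits_per_second = bytes_per_hour * 8 / 3600
    return 100.0 * bits_per_second / link_bps

OC3_BPS = 155.52e6   # OC3/STM1
OC12_BPS = 622.08e6  # OC12/STM4

small = hourly_bytes(64, 1, 276)    # 2,119,680 bytes per hour
large = hourly_bytes(1500, 1, 276)  # 49,680,000 bytes per hour

print(small, large)
print(round(utilization_percent(small, OC3_BPS), 4),
      round(utilization_percent(large, OC3_BPS), 4))
```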

Each router can record the results for every destination tested. With one test per minute to each destination, a total of 1 x 60 x 24 x 276 = 397,440 tests per day would be performed and recorded by each router. All ping results are stored in the pingProbeHistoryTable (see RFC 2925) and can be retrieved by an SNMP performance reporting application (for example, service performance management software from InfoVista, Inc., or Concord Communications, Inc.) for post processing. This table has a maximum size of 4,294,967,295 rows, which is more than adequate.
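The daily probe count, and the ample headroom in the history table, are easy to check:

```python
# One test per minute to each of the 276 destinations, for 24 hours.
tests_per_day = 1 * 60 * 24 * 276
print(tests_per_day)  # 397440

# pingProbeHistoryTable (RFC 2925) holds up to 2**32 - 1 rows, so
# capacity is not a practical concern at this rate.
assert tests_per_day < 2**32 - 1
```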

Measuring Availability

There are two methods you can use to measure availability:

• Proactive—Availability is automatically measured as often as possible by an operational support system.

• Reactive—Availability is recorded by a Help desk when a fault is first reported by a user or a fault monitoring system.

This section discusses real-time performance monitoring as a proactive monitoring solution.

Real-Time Performance Monitoring

Juniper Networks provides a real-time performance monitoring (RPM) service for monitoring network performance in real time. Use the J-Web Quick Configuration feature to configure the parameters used in RPM tests. (J-Web Quick Configuration is a browser-based GUI that runs on Juniper Networks routers. For more information, see the J-Web Interface User Guide.)

Configuring Real-Time Performance Monitoring

Some of the most common options you can configure for real-time performance monitoring tests are shown in Table 1.

Table 1: Real-Time Performance Monitoring Configuration Options

Request Information

• Probe Type—Type of probe to send as part of the test. The probe types are http-get, icmp-ping, icmp-ping-timestamp, tcp-ping, and udp-ping.

• Interval—Wait time (in seconds) between each probe transmission. The range is 1 to 255 seconds.

• Test Interval—Wait time (in seconds) between tests. The range is 0 to 86400 seconds.

• Probe Count—Total number of probes sent for each test. The range is 1 to 15 probes.

• Destination Port—TCP or UDP port to which probes are sent. Use port 7 (a standard TCP or UDP port) or a port number from 49152 through 65535.

• DSCP Bits—Differentiated Services code point (DSCP) bits. This value must be a valid 6-bit pattern. The default is 000000.

• Data Size—Size (in bytes) of the data portion of the ICMP probes. The range is 0 to 65507 bytes.

• Data Fill—Contents of the data portion of the ICMP probes. Contents must be a hexadecimal value. The range is 1 to 800h.

Maximum Probe Thresholds

• Successive Lost Probes—Total number of probes that must be lost successively to trigger a probe failure and generate a system log message. The range is 0 to 15 probes.

• Lost Probes—Total number of probes that must be lost to trigger a probe failure and generate a system log message. The range is 0 to 15 probes.

• Round Trip Time—Total round-trip time (in microseconds) from the Services Router to the remote server, which, if exceeded, triggers a probe failure and generates a system log message. The range is 0 to 60,000,000 microseconds.

• Jitter—Total jitter (in microseconds) for a test, which, if exceeded, triggers a probe failure and generates a system log message. The range is 0 to 60,000,000 microseconds.

• Standard Deviation—Maximum allowable standard deviation (in microseconds) for a test, which, if exceeded, triggers a probe failure and generates a system log message. The range is 0 to 60,000,000 microseconds.

• Egress Time—Total one-way time (in microseconds) from the router to the remote server, which, if exceeded, triggers a probe failure and generates a system log message. The range is 0 to 60,000,000 microseconds.

• Ingress Time—Total one-way time (in microseconds) from the remote server to the router, which, if exceeded, triggers a probe failure and generates a system log message. The range is 0 to 60,000,000 microseconds.

• Jitter Egress Time—Total outbound-time jitter (in microseconds) for a test, which, if exceeded, triggers a probe failure and generates a system log message. The range is 0 to 60,000,000 microseconds.

• Jitter Ingress Time—Total inbound-time jitter (in microseconds) for a test, which, if exceeded, triggers a probe failure and generates a system log message. The range is 0 to 60,000,000 microseconds.

• Egress Standard Deviation—Maximum allowable standard deviation of outbound times (in microseconds) for a test, which, if exceeded, triggers a probe failure and generates a system log message. The range is 0 to 60,000,000 microseconds.

• Ingress Standard Deviation—Maximum allowable standard deviation of inbound times (in microseconds) for a test, which, if exceeded, triggers a probe failure and generates a system log message. The range is 0 to 60,000,000 microseconds.
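A minimal RPM probe using these options might be configured as follows in the Junos CLI. This is a sketch only; the probe owner p1, test name t1, target address, and threshold values are hypothetical examples:

```
services {
    rpm {
        probe p1 {
            test t1 {
                probe-type icmp-ping;
                target address 192.168.0.1;   /* hypothetical target */
                probe-count 10;
                probe-interval 15;
                test-interval 60;
                data-size 64;
                thresholds {
                    successive-loss 3;
                    total-loss 5;
                    rtt 500000;
                }
            }
        }
    }
}
```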

Displaying Real-Time Performance Monitoring Information

For each real-time performance monitoring test configured on the router, monitoring information includes the round-trip time, jitter, and standard deviation. To view this information, select Monitor > RPM in the J-Web interface, or enter the show services rpm command-line interface (CLI) command.

To display the results of the most recent real-time performance monitoring probes, enter the show services rpm probe-results CLI command:

```
user@host> show services rpm probe-results
    Owner: p1, Test: t1
    Destination interface name: lt-0/0/0.0
    Test size: 10 probes
    Probe results:
      Response received, Sun Jul 10 19:07:34 2005
      Rtt: 50302 usec
    Results over current test:
      Probes sent: 2, Probes received: 1, Loss percentage: 50
      Measurement: Round trip time
      Minimum: 50302 usec, Maximum: 50302 usec, Average: 50302 usec,
      Jitter: 0 usec, Stddev: 0 usec
    Results over all tests:
      Probes sent: 2, Probes received: 1, Loss percentage: 50
      Measurement: Round trip time
      Minimum: 50302 usec, Maximum: 50302 usec, Average: 50302 usec,
      Jitter: 0 usec, Stddev: 0 usec
```