High Availability for Network Monitoring

The type of Junos Space cluster you create determines how high availability for the network monitoring service functions. A Junos Space fabric without Fault Monitoring and Performance Monitoring (FMPM) nodes uses the two high availability (HA) nodes in the cluster to protect the network monitoring service against node failures. However, when a Junos Space fabric includes one or more FMPM nodes, network monitoring functionality is disabled on the Junos Space nodes and enabled on the FMPM nodes.

High-Availability Fabric without FMPM Nodes

When a Junos Space fabric does not include FMPM nodes, the Junos Space cluster employs a hot-standby solution that uses the two high availability (HA) nodes in the cluster to protect the network monitoring service against node failures.

Figure 1 shows how network monitoring runs on two HA nodes in the cluster to protect the service in the event of node failure.

Figure 1: Linux High Availability Cluster

The network monitoring service is automatically installed on all nodes in the cluster. However, at any given time, the service runs only on the node that currently owns the virtual IP (VIP) address, and that instance is responsible for all fault management and performance management functionality for the entire cluster. Network monitoring uses a PostgreSQL 9.1 database for its storage needs. As Figure 1 shows, real-time streaming replication with continuous archiving is set up between the two HA nodes (Node-1 and Node-2 in the cluster), which keeps the network monitoring database on the standby node continuously in sync with the database on the active node. In addition, a cron job runs on the active node once a day at midnight to synchronize the network monitoring file system to the standby node, which ensures that all back-end configuration files that network monitoring uses are also synchronized between the two HA nodes.
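The daily file system synchronization can be pictured as a crontab entry that invokes a copy command. The following Python sketch is illustrative only: the source does not say which tool or paths Junos Space uses, so `rsync`, the directory path, and the standby host name are assumptions chosen for the example.

```python
import shlex

# Cron schedule for "once a day at midnight": minute 0, hour 0, every day.
MIDNIGHT_SCHEDULE = "0 0 * * *"


def build_sync_command(source_dir: str, standby_host: str) -> list[str]:
    """Return an rsync argument list that mirrors source_dir to the standby.

    rsync is an assumption here; the product may use a different mechanism.
    """
    # A trailing slash makes rsync copy the directory contents, not the directory itself.
    src = source_dir.rstrip("/") + "/"
    dest = f"{standby_host}:{src}"
    return ["rsync", "-a", "--delete", src, dest]


def build_cron_line(source_dir: str, standby_host: str) -> str:
    """Render a crontab line that runs the sync at midnight each day."""
    command = shlex.join(build_sync_command(source_dir, standby_host))
    return f"{MIDNIGHT_SCHEDULE} {command}"


if __name__ == "__main__":
    # Hypothetical path and host name, for illustration only.
    print(build_cron_line("/opt/example-monitoring/etc", "node-2"))
```

Because the job runs only once a day, configuration files changed on the active node after midnight would not reach the standby node until the next run, whereas the database is replicated continuously.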

When a VIP failover to the standby node occurs, network monitoring is automatically started on that node. The network monitoring service takes approximately 3 to 5 minutes to complete its initialization before it can perform fault monitoring and performance monitoring for the cluster. Consequently, Junos Space users can expect a network monitoring outage of approximately 3 to 5 minutes.

The watchdog service on the two HA nodes is responsible for ensuring that the network monitoring service is running on the HA node that owns the virtual IP address and is not running on the other (standby) HA node. As already noted, the watchdog service checks the status of all services on the node every second. If the watchdog service detects that the node owns the VIP address but is not running the network monitoring service, the watchdog service starts the network monitoring service and creates the cron job to synchronize fault management and performance management data to the other node. If the watchdog service detects that the node does not own the VIP address but is running the network monitoring service, the watchdog service shuts down the service and removes the cron job entry for data synchronization.
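The watchdog rule described above reduces to a decision on two observations: whether the node owns the VIP address and whether the network monitoring service is running. The following Python sketch models only that decision. The names `Action` and `reconcile` are hypothetical and are not part of Junos Space, and how the real watchdog detects VIP ownership or controls the service is not covered here.

```python
from enum import Enum


class Action(Enum):
    """Possible outcomes of one watchdog check (hypothetical names)."""
    START_SERVICE_AND_ADD_CRON = "start service, create sync cron job"
    STOP_SERVICE_AND_REMOVE_CRON = "stop service, remove sync cron job"
    NONE = "no change"


def reconcile(owns_vip: bool, service_running: bool) -> Action:
    """Decide what a single watchdog check should do on this node."""
    if owns_vip and not service_running:
        # Active node without the service: bring it up and schedule the sync.
        return Action.START_SERVICE_AND_ADD_CRON
    if not owns_vip and service_running:
        # Standby node running the service: shut it down and unschedule the sync.
        return Action.STOP_SERVICE_AND_REMOVE_CRON
    # The state already matches VIP ownership, so there is nothing to do.
    return Action.NONE


if __name__ == "__main__":
    for owns_vip in (True, False):
        for running in (True, False):
            action = reconcile(owns_vip, running)
            print(f"owns_vip={owns_vip} running={running} -> {action.value}")
```

Evaluating the same rule on both HA nodes at each check converges on exactly one running instance, located on whichever node holds the VIP address.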

High-Availability Fabric with FMPM Nodes

If you manage a large or complex network, you might want to dedicate all your performance and network monitoring functionality to a special node called the Fault Monitoring and Performance Monitoring (FMPM) node. When you create a Junos Space fabric with one or more FMPM nodes, network monitoring functionality is disabled on all the Junos Space nodes and enabled on the FMPM nodes. When the first FMPM node is added to the fabric, network monitoring functionality is enabled on this node and the PostgreSQL 9.1 database runs on this node.
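The placement rule can be summarized as follows: if the fabric contains any FMPM node, network monitoring is enabled only on the FMPM nodes; otherwise it is enabled on the two HA nodes. The following Python sketch models that rule. The `Node` type, its fields, and the function name are assumptions made for illustration and do not correspond to any Junos Space API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Simplified fabric member (hypothetical model)."""
    name: str
    is_fmpm: bool = False  # dedicated Fault Monitoring and Performance Monitoring node
    is_ha: bool = False    # one of the two Junos Space HA nodes


def monitoring_enabled_on(fabric: list[Node]) -> list[str]:
    """Return the names of the nodes on which network monitoring is enabled."""
    fmpm_nodes = [node.name for node in fabric if node.is_fmpm]
    if fmpm_nodes:
        # Any FMPM node disables network monitoring on all Junos Space nodes.
        return fmpm_nodes
    # No FMPM nodes: the two HA nodes protect the service.
    return [node.name for node in fabric if node.is_ha]


if __name__ == "__main__":
    fabric = [Node("node-1", is_ha=True), Node("node-2", is_ha=True), Node("node-3")]
    print(monitoring_enabled_on(fabric))
    print(monitoring_enabled_on(fabric + [Node("fmpm-1", is_fmpm=True)]))
```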

When you add a second FMPM node to the fabric, the first FMPM node functions as the primary node, and the second FMPM node functions as the standby node. The network monitoring service is automatically installed on both FMPM nodes in the FMPM team. However, at any given time, the service runs only on the FMPM node that currently owns the VIP address, and that instance is responsible for all fault management (FM) and performance management (PM) functionality for the FMPM team. Network monitoring uses a PostgreSQL 9.1 database for its storage needs.

Real-time streaming replication with continuous archiving is set up between the two FMPM nodes in the team, which ensures that the network monitoring database on the standby node is continuously in sync with the network monitoring database on the active node. In addition, a cron job runs on the active FMPM node once a day at midnight to synchronize the network monitoring file system to the standby FMPM node, which ensures that all back-end configuration files that network monitoring uses are also synchronized between the two FMPM nodes.

When a VIP failover to the standby FMPM node occurs, network monitoring is automatically started on the second FMPM node. The network monitoring service takes approximately 3 to 5 minutes to complete its initialization before it performs all FM and PM functionality for the FMPM team. Consequently, Junos Space users can expect a network monitoring outage to last approximately 3 to 5 minutes.

The watchdog service on the two FMPM nodes is responsible for ensuring that the network monitoring service is running on the FMPM node that owns the virtual IP address and is not running on the other (standby) FMPM node. As already noted, the watchdog service checks the status of all services on the active FMPM node every second. If the watchdog service detects that the FMPM node owns the VIP address but is not running the network monitoring service, the watchdog service starts the network monitoring service and creates the cron job to synchronize fault management and performance management data to the other node. If the watchdog service detects that the FMPM node does not own the VIP address but is running the network monitoring service, the watchdog service shuts down the service and removes the cron job entry for data synchronization.