Understanding Disaster Recovery Failure Scenarios

The following sections explain failure scenarios such as the active or standby site (with automatic failover enabled) going down due to a disaster, loss of connectivity between the sites, and loss of connectivity with arbiter devices. The device arbitration algorithm is used for failure detection.

For the scenarios, assume that the active site is site1 and standby site is site2.
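
The device arbitration logic itself is not spelled out in this topic, but the scenarios below rely on each site checking how many of the configured arbiter devices it can reach and manage. The following bash snippet is a minimal sketch of that idea only; the device addresses, the 50 percent threshold, and the use of ping as the check are illustrative assumptions, not the actual implementation used by the disaster recovery watchdog.

# Illustrative arbiter list and threshold; real values come from the disaster recovery configuration.
ARBITERS=("192.0.2.11" "192.0.2.12" "192.0.2.13")
FAILOVER_THRESHOLD=50   # percentage of arbiter devices the local site must manage

managed=0
for device in "${ARBITERS[@]}"; do
    # Count the arbiter devices this site can reach; the real algorithm checks
    # which site holds the management connection to each device.
    if ping -c 3 -W 2 "$device" > /dev/null 2>&1; then
        managed=$((managed + 1))
    fi
done

percent=$(( 100 * managed / ${#ARBITERS[@]} ))
if [ "$percent" -lt "$FAILOVER_THRESHOLD" ]; then
    echo "This site manages ${percent}% of arbiter devices: become or remain standby"
else
    echo "This site manages ${percent}% of arbiter devices: become or remain active"
fi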

Active Site (site1) Goes Down Due to a Disaster or Is Powered Down

Detection

The disaster recovery watchdog at site2 does not receive replies to successive ping retries to site1. The disaster recovery watchdog at site2 initiates the device arbitration algorithm and finds that arbiter devices (all or most) are not managed by site1.

An e-mail is sent to the administrator with this information.

Impact

MySQL and PgSQL database replication to site2 is stopped. If a file transfer through SCP was scheduled to occur during this period, site2 may lose that version of the configuration and RRD files.

MySQL and PgSQL databases at site2 now contain the latest data that was replicated in real time from site1 before it went down. This includes configuration, inventory, alarm-related data of all managed devices, and data maintained by Junos Space Platform and Junos Space applications. The latest versions of the configuration and RRD files available at site2 are from the most recent file transfer through SCP.

Junos Space users and NBI clients must wait until site2 becomes active, and then use the VIP address of site2 to access all network management services.

Recovery

The disaster recovery watchdog at site2 initiates the process to become active. The complete process may take around 15 to 20 minutes. This can vary depending on the number of devices that are managed on your Junos Space setup.

When the failover is complete, site2 establishes connections with all devices and resynchronizes configuration and inventory data if required. site2 starts receiving alarms and performance management data from managed devices.

Note:

When you rebuild or power on site1, if the disaster recovery configuration is deleted, you must reconfigure disaster recovery between the sites.

No Connectivity Between the Active and Standby Sites and Both Sites Lose Connectivity with Arbiter Devices

Detection

The disaster recovery watchdogs at both sites do not receive replies to successive ping retries. The disaster recovery watchdog at each site initiates the device arbitration algorithm.

An e-mail is sent to the administrator regarding the failure of MySQL and PgSQL replication and file transfer through SCP between sites.

Impact

MySQL and PgSQL database replication to site2 is stopped. If a file transfer through SCP was scheduled to occur during this period, site2 may lose that version of the configuration and RRD files.

Because neither site can connect to the arbiter devices (all or most), neither site can determine the status of the other site. site1 starts to become standby and site2 remains standby to avoid a split-brain situation.

Even if connectivity between the two sites is restored, both sites remain standby because the sites cannot connect to arbiter devices.

The network management services are stopped at both sites until one of the sites becomes active.

If connectivity to arbiter devices is not restored within the grace period (by default, eight hours), automatic failover functionality is disabled at both sites. An e-mail is sent every hour to the administrator with this information.

Recovery

If connectivity to arbiter devices is restored within the grace period (by default, eight hours), site1 becomes active again. site2 remains standby.

If both sites are standby, enable disaster recovery by executing the jmp-dr manualFailover -a command at the VIP node of site1. To enable automatic failover at the sites, execute the jmp-dr toolkit watchdog status --enable-automatic-failover command at the VIP node of site1 and site2.
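
As a reference, the following is a minimal command sketch of this recovery, assuming you are logged in to a shell on the VIP node of the relevant site; the commands are the ones named above, and the comments reflect the behavior described in this section (the -a option is assumed to make the local site active).

# On the VIP node of site1: bring disaster recovery back when both sites are standby
# (the -a option is assumed here to make site1 the active site).
jmp-dr manualFailover -a

# On the VIP node of site1 and again on the VIP node of site2: re-enable automatic failover.
jmp-dr toolkit watchdog status --enable-automatic-failover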

Fix connectivity issues between site1 and site2 to resume MySQL and PgSQL replication and file transfer through SCP.

No Connectivity Between the Active and Standby Sites

Detection

The disaster recovery watchdogs at both sites do not receive replies to successive ping retries. The disaster recovery watchdog at each site initiates the device arbitration algorithm and finds that arbiter devices (all or most) are managed by site1.

An e-mail is sent to the administrator regarding the failure of MySQL and PgSQL database replication and file transfer through SCP between sites.

Impact

MySQL and PgSQL database replication to site2 is stopped. If a file transfer through SCP was scheduled to occur during this period, site2 may lose that version of the configuration and RRD files.

Recovery

site1 remains active and site2 remains standby. Fix connectivity issues between site1 and site2 to resume MySQL and PgSQL database replication and file transfer through SCP.

No Connectivity Between the Active and Standby Sites and the Active Site (site1) Loses Connectivity with Arbiter Devices

Detection

The disaster recovery watchdogs at both sites do not receive replies to successive ping retries. The disaster recovery watchdog at each site initiates the device arbitration algorithm.

An e-mail is sent to the administrator regarding the failure of MySQL and PgSQL database replication and file transfer through SCP between sites.

Impact

MySQL and PgSQL database replication to site2 is stopped. If a file transfer through SCP was scheduled to occur during this period, site2 may lose that version of the configuration and RRD files.

Because site1 cannot connect to arbiter devices, site1 starts to become standby. Because site2 finds that arbiter devices (all or most) are not managed by site1, a failover is initiated. As part of becoming standby, all network management services are stopped at site1.

site2 now contains the latest MySQL and PgSQL data that was replicated in real time from site1. The latest versions of the configuration and RRD files available at site2 are from the most recent file transfer through SCP.

Junos Space users and NBI clients must wait until site2 becomes active, and then use the VIP address of site2 to access all network management services.

Recovery

The disaster recovery watchdog at site2 initiates the process to become active. The complete process may take around 15 to 20 minutes. This can vary depending on the number of devices that are managed on your Junos Space setup.

When the failover is complete, site2 establishes connections with all devices and resynchronizes configuration and inventory data if required. site2 starts receiving alarms and performance management data from managed devices.

Fix connectivity issues between site1 and site2 to resume MySQL and PgSQL database replication and file transfer through SCP.

No Connectivity Between the Active and Standby Sites and the Standby Site (site2) Loses Connectivity with Arbiter Devices

Detection

The disaster recovery watchdogs at both sites do not receive replies to successive ping retries. The disaster recovery watchdog at site1 initiates the device arbitration algorithm and finds that arbiter devices (all or most) are managed by site1. The disaster recovery watchdog at site2 initiates the device arbitration algorithm.

An e-mail is sent to the administrator regarding the failure of MySQL and PgSQL replication and file transfer through SCP between sites.

Impact

MySQL and PgSQL database replication to site2 is stopped. If a file transfer through SCP was scheduled to occur during this period, site2 may lose that version of the configuration and RRD files.

Because site2 cannot connect to arbiter devices (all or most), site2 remains standby.

site2 retries to connect to arbiter devices, but does not become active even if it can connect to enough arbiter devices within eight hours. During these eight hours, site2 requests disaster recovery runtime information from the remote site to ensure that the remote site is active and not in the process of a failover. If site2 cannot connect to enough arbiter devices within eight hours, site2 disables automatic failover until you manually enable it again. An e-mail is sent every hour to the administrator with this information.

Recovery

Fix connectivity issues between site1 and site2 to resume MySQL and PgSQL database replication and file transfer through SCP.

To enable automatic failover at the standby site, execute the jmp-dr toolkit watchdog status --enable-automatic-failover command at the VIP node of site2.
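
As a minimal sketch, assuming you are logged in to a shell on the VIP node of site2 (running the status command without the option is assumed here to report the current watchdog state, and is shown only as an optional check):

# Optional: check the current disaster recovery watchdog state at site2 (assumed usage).
jmp-dr toolkit watchdog status

# Re-enable automatic failover at site2.
jmp-dr toolkit watchdog status --enable-automatic-failover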

Standby Site (site2) Goes Down Due to a Disaster or Is Powered Down

Detection

The disaster recovery watchdog at site1 does not receive replies to successive ping retries to site2. The disaster recovery watchdog at site1 initiates the device arbitration algorithm and finds that arbiter devices (all or most) are managed by site1.

An e-mail is sent to the administrator regarding the failure of MySQL and PgSQL replication and file transfer through SCP between sites.

Impact

MySQL and PgSQL database replication to site2 is stopped. If a file transfer through SCP was scheduled to occur during this period, site2 may lose that version of the configuration and RRD files.

Recovery

site1 remains active. When you power on site2, site2 becomes standby. If site2 was powered down, or if the disaster recovery configuration was not deleted from site2, MySQL and PgSQL database replication and file transfer through SCP are initiated.

Note:

When you rebuild or power on site2, if the disaster recovery configuration is deleted, you must reconfigure disaster recovery between both sites.

No Connectivity Between the Active Site (site1) and Arbiter Devices

Detection

The arbiterMonitor service of the disaster recovery watchdog at site1 detects that the percentage of reachable arbiter devices is below the configured warning threshold. An e-mail is sent to the administrator with this information.

Impact

There is no impact on the disaster recovery solution until the percentage of reachable arbiter devices goes below the failover threshold.
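
For example, with ten arbiter devices and illustrative thresholds of 70 percent for warnings and 50 percent for failover (these values are examples, not defaults), a warning e-mail is sent when site1 can reach only six of the devices, but the disaster recovery solution is unaffected until fewer than five devices are reachable.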

Recovery

No recovery is required because network management services are available from site1.

No Connectivity Between the Standby Site (site2) and Arbiter Devices

Detection

The arbiterMonitor service of the disaster recovery watchdog at site2 detects that the percentage of reachable arbiter devices is below the configured warning threshold. An e-mail is sent to the administrator with this information.

Impact

There is no impact on the disaster recovery solution.

Recovery

No recovery is required because network management services are available from site1.

Both Active and Standby Sites Are Active and Do Not Lose Connectivity with Arbiter Devices

Detection

The disaster recovery watchdog at each site receives replies to successive ping retries from the other site after recovery from an automatic disaster recovery failover.

Impact

  • Both the active and standby sites connect to all or most of the arbiter devices.

  • Neither site can determine the status of the other site. site1 starts as an active site, whereas site2 remains active.

  • Even when connectivity between the active and standby sites is restored, both sites remain active.

  • The network management services run at both sites until one of the sites becomes standby.

Recovery

If connectivity to arbiter devices is restored within the grace period (by default, eight hours), site1 becomes active and site2 remains standby.

If both sites are active, enable disaster recovery by executing the jmp-dr manualFailover -s command at the VIP node of site1.

To enable automatic failover at the sites, execute the jmp-dr toolkit watchdog status --enable-automatic-failover command at the VIP node of site1 and site2.
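
As a reference, the following is a minimal command sketch of this recovery, using only the commands named above and assuming you are logged in to a shell on the VIP node of the relevant site:

# On the VIP node of site1: resolve the condition in which both sites are active,
# as described in the procedure above.
jmp-dr manualFailover -s

# On the VIP node of site1 and again on the VIP node of site2: re-enable automatic failover.
jmp-dr toolkit watchdog status --enable-automatic-failover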