High Availability and Failure Handling

High availability is the system's ability to provide redundancy at both the software and hardware levels so that it can handle hardware or software failures gracefully, that is, with minimal loss of state.

The JUNOS system continues to expand its high availability capabilities with initiatives such as graceful Routing Engine switchover (GRES), nonstop active routing (NSR) and unified in-service software upgrade (ISSU).

The JUNOS SDK provides an infrastructure that allows third-party applications to use JUNOS high availability features. It provides monitoring functionality both on the Routing Engine and on the data and control CPUs on Multiservices PICs. On the PIC, you can monitor data threads as well as control threads. The monitoring information is aggregated and provided to the Routing Engine.

Summary of SDK High Availability Functionality

The JUNOS SDK provides the following functionality for high availability and failure handling.

Types of Failures

The JUNOS SDK allows you to address the following types of failures.

Failures and time to detection are summarized in the following table:

Type of Failure | Time to Detection

PIC hardware failures | Some PIC hardware failures are detected by the FPC driver in less than 1 second. Failures caused by memory issues can take up to the watchdog interval (3.5 seconds) to detect.

Kernel failure / crash | The FPC driver, which checks for PIC watchdog strobes, detects a kernel lockup in a maximum of 3.5 seconds.

PIC packet path lockup (prolonged flow-control asserted by the PIC to the Packet Forwarding Engine) | Detected by the hot-standby infrastructure; the system waits 3 seconds to avoid false alarms.

FPC failure | Detected within 3 seconds through rdp keepalive loss.

PIC application lockup | Individual applications can use the Health Monitor and provide an application-specific way to detect that the application is functional. The Health Monitor process detects application crashes. The detection time depends on the values configured for keepalive_interval and no_of_missed_keepalive for the application: a failure is detected only after (keepalive_interval * no_of_missed_keepalive) has elapsed.

MAC stuck | Detected in 3 seconds.

mspmand failure | Detected in approximately 2 seconds.
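
The detection time for a PIC application lockup can be worked through with a minimal sketch. This is not the Health Monitor API: the type hm_keepalive_cfg_t and the functions hm_detection_time() and hm_is_failed() are hypothetical names introduced here for illustration. Only keepalive_interval and no_of_missed_keepalive come from the table above, and treating them as whole seconds is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical container for the two configured values named in the table.
 * The unit (seconds) is an assumption; the source does not state it.
 */
typedef struct {
    uint32_t keepalive_interval;     /* assumed: seconds between keepalives */
    uint32_t no_of_missed_keepalive; /* misses tolerated before failure */
} hm_keepalive_cfg_t;

/* Detection time: keepalive_interval * no_of_missed_keepalive. */
static uint32_t
hm_detection_time(const hm_keepalive_cfg_t *cfg)
{
    return cfg->keepalive_interval * cfg->no_of_missed_keepalive;
}

/*
 * Returns true once the time elapsed since the last keepalive has reached
 * the detection time. Both timestamps use the same assumed unit.
 */
static bool
hm_is_failed(const hm_keepalive_cfg_t *cfg, uint32_t last_keepalive,
             uint32_t now)
{
    return (now - last_keepalive) >= hm_detection_time(cfg);
}
```

With keepalive_interval set to 2 and no_of_missed_keepalive set to 3, for example, hm_detection_time() yields 6. An application whose last keepalive arrived at time 100 would still be considered healthy at time 105 and would be reported as failed at time 106. Smaller values shorten detection but leave less tolerance for a keepalive that is merely late.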

2007-2009 Juniper Networks, Inc. All rights reserved. The information contained herein is confidential information of Juniper Networks, Inc., and may not be used, disclosed, distributed, modified, or copied without the prior written consent of Juniper Networks, Inc. in an express license. This information is subject to change by Juniper Networks, Inc. Juniper Networks, the Juniper Networks logo, and JUNOS are registered trademarks of Juniper Networks, Inc. in the United States and other countries. All other trademarks, service marks, registered trademarks, or registered service marks are the property of their respective owners.
Generated on Sun May 30 20:26:47 2010 for Juniper Networks Partner Solution Development Platform JUNOS SDK 10.2R1 by Doxygen 1.4.5