High availability is the ability of the system to provide redundancy at both the software and hardware levels, so that it handles hardware or software failures gracefully (with minimal loss of state).
The JUNOS system continues to expand its high availability capabilities with initiatives such as graceful Routing Engine switchover (GRES), nonstop active routing (NSR) and unified in-service software upgrade (ISSU).
The JUNOS SDK provides an infrastructure to allow third-party applications to use JUNOS high availability features. The JUNOS SDK provides monitoring functionality both on the Routing Engine and on the data and control CPUs on Multiservices PICs. On the PIC, you can monitor data as well as control threads. The information is aggregated and provided to the Routing Engine.
The JUNOS SDK provides the following functionality for high availability and failure handling:
- The Health Monitor APIs let applications monitor the heartbeat between the Multiservices PIC and the Flexible PIC Concentrator (FPC) at the hardware level, as well as the heartbeat between the application and the system at the software level. For details, see Health Monitoring.
- Data plane redundancy (hot standby) lets you configure primary and secondary (backup) Multiservices PICs to run simultaneously. Data is mirrored to the secondary PIC in real time, so that both contain identical information. APIs allow applications to trigger a switchover between a master and backup Multiservices PIC. For details, see Data-Plane Redundancy.
- Some GENCFG functions that communicate with the Kernel Communication (KCOM) system allow you to monitor the status of interfaces. For sample code that uses these functions, see KCOM Monitoring Example.
- Data Replication allows applications running on two different Routing Engines or Multiservices PICs to copy their runtime data from a master to a backup by placing the data in the libjunos-sync subsystem. For details, see Synchronizing Runtime Data.
- Reliable Configuration Download gives applications running on the Multiservices PIC access to the system configuration values that were set on the Routing Engine. For details, see Reliable Configuration Download.
The JUNOS SDK allows you to address the following types of failures:
- Routing Engine switchover. The junos_set_app_cb_mastership_switch() function specifies a callback function to notify your application of a Routing Engine switchover. For sample code that uses this functionality, see the hellopics-mgmt_main.c file for the HelloPics sample application in your development sandbox.
- Multiservices PIC reboot, which translates to physical interface down.
- Bringing the PIC offline, which translates to physical interface delete, or online, which translates to physical interface add.
- Bringing the FPC offline, which translates to physical interface delete, or online, which translates to physical interface add.
- Failures in the PIC itself, which translate to physical interface or health status down events.
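The mapping from failure type to the event an application observes can be written down as a small lookup. The following is a minimal sketch only: the enum and function names are hypothetical, chosen for illustration, and are not JUNOS SDK types or APIs.

```c
#include <stddef.h>

/* Hypothetical names for illustration only; not part of the JUNOS SDK. */
enum failure_event {
    PIC_REBOOT,
    PIC_OFFLINE,
    PIC_ONLINE,
    FPC_OFFLINE,
    FPC_ONLINE,
    PIC_INTERNAL_FAILURE
};

/*
 * Return a description of the interface-level event that the given
 * failure or state change translates to, per the list above.
 */
const char *
observed_event(enum failure_event ev)
{
    switch (ev) {
    case PIC_REBOOT:
        return "physical interface down";
    case PIC_OFFLINE:
    case FPC_OFFLINE:
        return "physical interface delete";
    case PIC_ONLINE:
    case FPC_ONLINE:
        return "physical interface add";
    case PIC_INTERNAL_FAILURE:
        return "physical interface or health status down";
    }
    return NULL; /* value outside the enum */
}
```

An application that tracks interface events could use a table like this to decide, for example, that a delete followed by an add indicates the PIC or FPC was taken offline and brought back online.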
Failures and time to detection are summarized in the following table:
|Type of Failure |Time to Detection |
|PIC hardware failures |Some PIC hardware failures are detected by the FPC driver in less than 1 second. Memory problems can take up to the watchdog interval (3.5 seconds) to detect. |
|Kernel failure / crash |The FPC driver, which checks for PIC watchdog strobes, detects a kernel lockup in a maximum of 3.5 seconds. |
|PIC packet path lockup (prolonged flow-control asserted by the PIC to the Packet Forwarding Engine) |Detected by the hot-standby infrastructure; the system waits 3 seconds to avoid false alarms. |
|FPC failure |The system detects this within 3 seconds through RDP keepalive loss. |
|PIC application lockup: Individual applications can use the Health Monitor and provide an application-specific way to detect that their application is functional. |The Health Monitor process detects application crashes. The detection time depends on the configured values for keepalive_interval and no_of_missed_keepalive for the application. That is, a failure is detected only after (keepalive_interval * no_of_missed_keepalive). |
|MAC stuck |3 seconds. |
|mspmand failure |Approximately 2 seconds. |
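For the PIC application lockup case, the detection time follows directly from the two configured values. The sketch below only illustrates that arithmetic; the function name is hypothetical and not an SDK API, and it assumes both values are plain non-negative integers with the interval expressed in seconds.

```c
/*
 * Hypothetical helper (not an SDK function): time after which the
 * Health Monitor reports an application failure, computed as
 * keepalive_interval * no_of_missed_keepalive.
 * Assumption: keepalive_interval is in seconds.
 */
unsigned int
hm_detection_delay(unsigned int keepalive_interval,
                   unsigned int no_of_missed_keepalive)
{
    return keepalive_interval * no_of_missed_keepalive;
}
```

For example, under these assumptions hm_detection_delay(2, 3) yields 6: with a 2-second keepalive interval and 3 missed keepalives allowed, a locked-up application is reported only after 6 seconds. Lower values shorten detection time at the cost of more frequent keepalive traffic and a greater chance of false alarms.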
Generated on Sun May 30 20:26:47 2010 for Juniper Networks Partner Solution Development Platform JUNOS SDK 10.2R1 by Doxygen 1.4.5