The auto-recovery feature detects failure of an IDP engine and buffers packets while it shuts down and attempts to restart the IDP engine. Auto-recovery is enabled by default.
In the default implementation, the auto-recovery process stops the NIC bypass watchdog process, thereby temporarily changing state for all enabled forwarding interfaces to either Bypass state or NICs OFF state based on the configuration set with ACM. When the auto-recovery process restarts the IDP engine, it also brings up the NIC interfaces and restarts the NIC bypass watchdog. Link flapping can occur when these processes are repeated frequently in a brief period.
If the default implementation of the auto-recovery feature causes link flapping issues in your network, consider the alternative implementation described here. In the alternative implementation, you configure the auto-recovery feature to try to recover once without entering NIC bypass. In this implementation, the newly started IDP engine forwards traffic. The traffic is forwarded uninspected until the IDP policy is reloaded. A second IDP engine failure within the specified threshold period triggers NIC bypass. Note that the number of failed restarts for this implementation is different from the default auto-recovery behavior, which attempts to restart the IDP engine six times before settling into the NIC bypass state.
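The decision rule described above can be modeled in a few lines of shell. This is only a sketch of the documented behavior, not the script that runs on the device: the function name `next_action` and its arguments are hypothetical, and the strict less-than comparison is an assumption based on the wording of the bypass log message ("less than the configured value").

```shell
# Illustrative model only; next_action is a hypothetical name, not an IDP command.
# $1 = seconds since the previous engine restart (empty if there was none)
# $2 = threshold period in seconds
next_action() {
    since_last=$1
    threshold=$2
    if [ -n "$since_last" ] && [ "$since_last" -lt "$threshold" ]; then
        # Second failure inside the threshold period: stop and enter NIC bypass.
        echo "nic-bypass"
    else
        # First failure, or a failure outside the threshold period: try to recover.
        echo "restart-without-bypass"
    fi
}
```

For example, `next_action 140 300` prints `nic-bypass`, while `next_action "" 300` and `next_action 400 300` both print `restart-without-bypass`.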
With the alternative auto-recovery implementation, there is a delay in throughput before the newly started IDP engine is ready to forward traffic. The delay varies, even among deployments of the same platform, and depends on the number of virtual routers enabled, the device configuration, load, and other factors. Table 1 lists delays observed during testing. We recommend that you run tests to determine the delay for your devices so that you can judge whether this auto-recovery implementation is right for your network.
Table 1: Throughput Delay Examples
| Platform | Virtual Routers | Delay (Seconds) |
|---|---|---|
| IDP75 | 1 | 20 |
| IDP250 | 4 | 15 |
| IDP800 | 5 | 19 |
| IDP8200 | 8 | 20 |
Note: If auto-recovery attempts do not resolve the issue, the device enters a Bypass or NICs OFF state, depending on the setting you configured with ACM. At that point, you must take manual steps to diagnose and resolve the issue and bring the IDP Series device back online.
If you decide to use the alternative implementation of auto-recovery, you must configure two values in the user_funcs file.
To enable the alternative implementation of auto-recovery:
Open the /usr/idp/device/bin/user_funcs file in a text editor, such as vi.
Implementing the alternative auto-recovery method requires that you set pktprocess_afterpolicyload=0. If you have not already done so, locate the following line and change the value from 1 to 0:
export pktprocess_afterpolicyload=1
To enable the alternative auto-recovery implementation, locate the following lines:
# Support to enable/disable the logic of stopping and then later starting
# nicBypass.sh when autorecovery feature detects idpengine has exited.
# Setting the variable to 0(default) implies the behavior is unchanged.
# Setting it to any non-zero value would enable the new behavior. The
# non-zero value also indicates the threshold time(in minutes) within
# which two successive autorecovery detections must not happen. If it so
# happens, then IDP service would be stopped in the device.
# Note: This configuration is bound to the configuration optoin
# pktprocess_afterpolicyload. Hence, this configuration will not work if
# the configuration option pktprocess_afterpolicyload is set to a non-zero value.
export autorecovery_bypass=0
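The default autorecovery_bypass=0 indicates the feature is disabled. When disabled, the standard auto-recovery process is followed. To enable the alternative implementation, change the value to the threshold period in minutes. For example, the following setting specifies a threshold of 5 minutes (300 seconds):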
export autorecovery_bypass=5
Save the file, and then restart the IDP engine:
[root@defaulthost admin]# idp.sh restart
Restarting the IDP engine can take several moments.
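When troubleshooting, it can be useful to confirm that both values in user_funcs are set the way the alternative method requires. The following is a minimal sketch, assuming each setting appears on its own `export NAME=value` line; the helper names `get_export` and `check_autorecovery_settings` are hypothetical and are not part of the IDP software.

```shell
# Hypothetical helpers (not shipped with the IDP software).

# Print the value from the last "export NAME=value" line for NAME in FILE.
# $1 = file, $2 = variable name
get_export() {
    sed -n "s/^[[:space:]]*export[[:space:]][[:space:]]*$2=\([^[:space:]#]*\).*/\1/p" "$1" | tail -n 1
}

# Report whether FILE has the two values the alternative method needs.
# Returns 0 only when pktprocess_afterpolicyload=0 and autorecovery_bypass
# is a non-zero number of minutes.
check_autorecovery_settings() {
    file=$1
    after=$(get_export "$file" pktprocess_afterpolicyload)
    bypass=$(get_export "$file" autorecovery_bypass)

    if [ "$after" != "0" ]; then
        echo "pktprocess_afterpolicyload must be 0 (found: ${after:-unset})"
        return 1
    fi
    case $bypass in
        ''|*[!0-9]*)
            echo "autorecovery_bypass is not a number (found: ${bypass:-unset})"
            return 1 ;;
    esac
    if [ "$bypass" -eq 0 ]; then
        echo "autorecovery_bypass=0: default auto-recovery in effect"
        return 1
    fi
    echo "alternative auto-recovery enabled; threshold $bypass min ($((bypass * 60)) s)"
}
```

A typical call would be `check_autorecovery_settings /usr/idp/device/bin/user_funcs`. The check only reads the file; it does not tell you whether the IDP engine has been restarted since the file was edited.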
Review the debug log messages in /usr/idp/device/var/sysinfo/logs/idpinit.date.
The following example shows the debug log messages for IDP75, a single-core platform, when recovery is successful. Logs are similar for all single-core platforms.
IDP75 Autorecovery Log: Successful Recovery
Fri Aug 26 06:39:46 PDT 2011:Detected 0 to be terminated
[06:39:47] ../../idpinit.c:idpinit_signals:91: SIGUSR2 delivered, will stop network outage monitoring
[06:39:49] sc_network_outage_monitor:560: one or more engine is terminated or hung
Adding recovered engine with pid 19844 into cgroup DP share
Fri Aug 26 06:40:00 PDT 2011:Indicating signal-jnet to kjnetd
IDP instance 0 successfully recovered
Restarting all CP processes
[06:45:08] ../../idpinit.c:idpinit_signals:84: SIGUSR1 delivered, will start network outage monitoring
Fri Aug 26 06:45:09 PDT 2011: Done
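To compare recovery cycles across devices, you can compute the elapsed time between each "Detected ... to be terminated" line and the next "Done" line. The sketch below assumes the log has one message per line and that dated lines use the layout shown in the example (weekday, month, day, time, time zone, year); the function name `recovery_durations` is hypothetical. The result is the length of the logged recovery cycle, which is not the same thing as the throughput delay listed in Table 1.

```shell
# Hypothetical helper: print the elapsed seconds for each completed recovery cycle.
# $1 = path to an idpinit log file
recovery_durations() {
    awk '
    # Convert HH:MM:SS to seconds since midnight.
    function to_secs(t,    p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }

    # Field 4 is the time of day; field 7 is the engine instance number.
    /Detected [0-9]+ to be terminated/ {
        start = to_secs($4); inst = $7; pending = 1
        next
    }
    pending && $NF == "Done" {
        elapsed = to_secs($4) - start
        if (elapsed < 0) elapsed += 86400   # cycle crossed midnight
        printf "instance=%s elapsed=%ds\n", inst, elapsed
        pending = 0
    }
    ' "$1"
}
```

Run against the IDP75 example above, this prints `instance=0 elapsed=323s` (06:39:46 to 06:45:09).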
The following example shows the debug log messages when a second terminated IDP engine instance is detected within the threshold period. The IDP service is stopped, triggering NIC bypass.
IDP75 Autorecovery Log: Bypass Triggered Upon Second Failure Within Threshold
Fri Aug 26 06:53:59 PDT 2011:Detected 0 to be terminated
[06:53:59] ../../idpinit.c:idpinit_signals:91: SIGUSR2 delivered, will stop network outage monitoring
Adding recovered engine with pid into cgroup DP share
Fri Aug 26 06:54:02 PDT 2011:Indicating signal-jnet to kjnetd
IDP instance 0 successfully recovered
Restarting all CP processes
[06:54:03] sc_network_outage_monitor:560: one or more engine is terminated or hung
[06:55:38] ../../idpinit.c:idpinit_signals:84: SIGUSR1 delivered, will start network outage monitoring
Fri Aug 26 06:55:38 PDT 2011: Done
Fri Aug 26 06:55:39 PDT 2011:Detected 0 to be terminated
[06:55:39] ../../idpinit.c:idpinit_signals:91: SIGUSR2 delivered, will stop network outage monitoring
[06:55:40] Triggering nicBypass since IDP instance 0 restarted again in 140 seconds
(less than the configured value of 300 seconds), Stopping IDP service
[06:55:40] ../../idpinit.c:idpinit_signals:91: SIGUSR2 delivered, will stop network outage monitoring
[06:55:43] sc_network_outage_monitor:560: one or more engine is terminated or hung
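To find out quickly whether a device entered NIC bypass because of a second failure, you can search the log for the "Triggering nicBypass" message and pull out the instance, the interval, and the configured threshold. The following is a sketch under the assumption that the message text matches the example; the function name `bypass_events` is hypothetical. It accepts the message whether it is written on one line or wrapped onto a second line.

```shell
# Hypothetical helper: summarize each "Triggering nicBypass" event in a log.
# $1 = path to an idpinit log file
bypass_events() {
    awk '
    /Triggering nicBypass since IDP instance/ {
        for (i = 1; i <= NF; i++) {
            if ($i == "instance") inst = $(i + 1)
            if ($i == "in" && $(i + 2) == "seconds") secs = $(i + 1)
        }
        pending = 1
    }
    # The threshold can be on the same line or on the following line.
    pending && /configured value of/ {
        for (i = 1; i <= NF; i++)
            if ($i == "of" && $(i + 2) ~ /^seconds/) limit = $(i + 1)
        printf "instance=%s interval=%ss threshold=%ss\n", inst, secs, limit
        pending = 0
    }
    ' "$1"
}
```

For the IDP75 bypass example, the output is `instance=0 interval=140s threshold=300s`. No output means that the log contains no bypass event of this kind.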
The following example shows the debug log messages for IDP800, a dual-core platform, when recovery is successful. Logs are similar for all dual-core platforms.
IDP800 Autorecovery Log: Successful Recovery
Thu Aug 25 05:02:04 IST 2011:Detected 0 to be terminated
[05:02:04] ../../idpinit.c:idpinit_signals:91: SIGUSR2 delivered, will stop network outage monitoring
[05:02:06] sc_network_outage_monitor:560: one or more engine is terminated or hung
Thu Aug 25 05:02:12 IST 2011:Indicating signal-jnet to kjnetd
IDP instance 0 successfully recovered
Restarting all CP processes
[05:03:07] ../../idpinit.c:idpinit_signals:84: SIGUSR1 delivered, will start network outage monitoring
Thu Aug 25 05:03:07 IST 2011: Done
The following example shows the debug log messages when a second terminated IDP engine instance is detected within the threshold period. The IDP service is stopped, triggering NIC bypass.
IDP800 Autorecovery Log: Bypass Triggered Upon Second Failure Within Threshold
Thu Aug 25 05:13:32 IST 2011:Detected 0 to be terminated
[05:13:32] ../../idpinit.c:idpinit_signals:91: SIGUSR2 delivered, will stop network outage monitoring
Thu Aug 25 05:13:35 IST 2011:Indicating signal-jnet to kjnetd
IDP instance 0 successfully recovered
Restarting all CP processes
[05:13:35] sc_network_outage_monitor:560: one or more engine is terminated or hung
[05:15:02] ../../idpinit.c:idpinit_signals:84: SIGUSR1 delivered, will start network outage monitoring
Thu Aug 25 05:15:02 IST 2011: Done
Thu Aug 25 05:15:03 IST 2011:Detected 0 to be terminated
[05:15:03] ../../idpinit.c:idpinit_signals:91: SIGUSR2 delivered, will stop network outage monitoring
[05:15:03] Triggering nicBypass since IDP instance 0 restarted again in 171 seconds
(less than the configured value of 300 seconds), Stopping IDP service
[05:15:05] ../../idpinit.c:idpinit_signals:91: SIGUSR2 delivered, will stop network outage monitoring
[05:15:05] sc_network_outage_monitor:560: one or more engine is terminated or hung
The following example shows the debug log messages for IDP8200, a multi-core platform, when recovery is successful.
IDP8200 Autorecovery Log: Successful Recovery
Thu Aug 25 06:01:17 IST 2011:Detected 0 to be terminated
[06:01:17] ../../idpinit.c:idpinit_signals:91: SIGUSR2 delivered, will stop network outage monitoring
[06:01:18] sc_network_outage_monitor:560: one or more engine is terminated or hung
Thu Aug 25 06:01:32 IST 2011:Indicating signal-jnet to kjnetd
mecd: Thu Aug 25 06:01:33 IST 2011: start idpengine_0
IDP instance 0 successfully recovered
Restarting all CP processes
[06:02:43] ../../idpinit.c:idpinit_signals:84: SIGUSR1 delivered, will start network outage monitoring
Thu Aug 25 06:02:43 IST 2011: Done
mecd: Thu Aug 25 06:02:43 IST 2011: Received SIGUSR1. Exiting now...
Note the mecd messages. The mecd service is the multi-engine crash detector, which is particular to IDP8200. While one idpengine instance is being recovered, the mecd service monitors the other idpengine instances. If the mecd service detects a second idpengine instance crash while the first is still recovering, the auto-recovery script stops the NIC bypass watchdog in order to trigger NIC bypass. (That event does not appear in the example.)
The final example shows the IDP8200 debug log messages when idpengine_1 fails twice within the threshold period. The IDP service is stopped, triggering NIC bypass.
IDP8200 Autorecovery Log: Bypass Triggered Upon Second Failure Within Threshold
Thu Aug 25 06:56:51 IST 2011:Detected 1 to be terminated
Thu Aug 25 06:56:54 IST 2011:Indicating signal-jnet to kjnetd
IDP instance 1 successfully recovered
Restarting all CP processes
Thu Aug 25 06:57:35 IST 2011: Done
Thu Aug 25 06:57:37 IST 2011:Detected 1 to be terminated
[06:57:37] ../../idpinit.c:idpinit_signals:91: SIGUSR2 delivered, will stop network outage monitoring
[06:57:37] Triggering nicBypass since IDP instance 1 restarted again in 87 seconds
(less than the configured value of 900 seconds), Stopping IDP service
[06:57:40] sc_network_outage_monitor:560: one or more engine is terminated or hung