Risks to Uptime within the Architecture
Risks can be classified under the following three categories:
Resources on which SBRC depends (such as external databases) can fail, forcing SBRC to do additional work to recover and extending the service outage until the entire system reaches a steady state again. For example, with a very large number of worker threads, a high transaction rate, and high disconnect or reconnect rates, SBRC can take longer to process the backlogged transactions after a target back-end server recovers from failure than the duration of the initial outage.
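The relationship between the outage and the recovery time can be made concrete with a simple queueing estimate. The sketch below is a hypothetical back-of-the-envelope model, not an SBRC tool; the function name and all rates are illustrative assumptions. It shows why recovery can take much longer than the outage itself when the server's spare capacity is small.

```python
# Hypothetical model (not an SBRC utility): estimate how long a RADIUS
# server needs to drain the transaction backlog that accumulates while a
# back-end server is down. All rates are illustrative.

def drain_time_seconds(outage_s, arrival_rate, service_rate):
    """Seconds needed to clear the backlog after the back end recovers.

    The backlog grows at arrival_rate during the outage; after recovery
    it shrinks at (service_rate - arrival_rate), the spare capacity.
    """
    if service_rate <= arrival_rate:
        return float("inf")  # no spare capacity: the backlog never drains
    backlog = arrival_rate * outage_s
    return backlog / (service_rate - arrival_rate)

# A 5-minute outage at 900 transactions/s, with 1000 transactions/s of
# service capacity, leaves only 100 tx/s of headroom for catch-up:
print(drain_time_seconds(300, 900, 1000))  # 2700.0 seconds (45 minutes)
```

In this illustration a 5-minute outage produces a 45-minute recovery, which matches the warning above: the closer the steady-state transaction rate is to the server's capacity, the longer the backlog takes to drain.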
Partial hardware-based failures can cause unexpectedly broad problems. For example, network problems among the D nodes that consistently delay transactions, especially point-to-point or peer-to-peer traffic (caused by an incorrect firewall rule or a router reconfiguration), can cause a temporary cluster outage.
Most software errors are recovered automatically and affect only one node; however, because SBRC is a complex system, a few forms of software error can cause an extended outage. To manage this risk, open a case with JTAC as soon as a transient error occurs, supplying the information referenced in the troubleshooting section, so that the problem does not recur or worsen.
You can lessen these sorts of errors by tracking system uptime, monitoring and diagnosing transient rejects and errors, and implementing the recommended final failover in the emergency NOC. This could be an SBRC server running with full logging enabled by default to trap unexpected packets that have failed other processing.
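Monitoring and diagnosing transient rejects and errors can start with something as simple as scanning logs for known failure signatures. The sketch below is a hypothetical example, not an SBRC utility: the message patterns are assumptions, and you should substitute the patterns your deployment actually emits.

```python
# Hypothetical monitoring sketch: count transient reject/error signatures
# in RADIUS log lines so operators can spot emerging problems early.
# The patterns below are illustrative assumptions, not SBRC log formats.
import re
from collections import Counter

PATTERNS = {
    "reject": re.compile(r"Access-Reject", re.IGNORECASE),
    "timeout": re.compile(r"timed? ?out", re.IGNORECASE),
    "db_error": re.compile(r"database (error|unavailable)", re.IGNORECASE),
}

def summarize_errors(lines):
    """Return a Counter of how often each error pattern appears."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

sample = [
    "2024-01-01 00:00:01 Access-Reject sent to 10.0.0.5",
    "2024-01-01 00:00:02 request timed out waiting for back end",
    "2024-01-01 00:00:03 database unavailable, retrying",
]
print(summarize_errors(sample))
```

Feeding such counts into an alerting threshold (for example, rejects per minute) turns the transient errors described above into an early-warning signal rather than a post-outage discovery.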