Risks to Uptime within the Architecture

 

Risks can be classified under the following three categories:

  • Resources on which SBRC depends (such as external databases) can fail, forcing SBRC to do additional work to recover and extending the service outage until the entire system returns to a stable state. For example, with a very large number of worker threads, a high transaction rate, and high disconnect or reconnect rates, SBRC can take longer to process the backlogged transactions after a target back-end server recovers than the initial outage lasted.

  • Partial hardware-based failures can cause unexpected broader problems. For example, in the networking among the D nodes, transactions with a consistent latency of certain values, especially point-to-point or peer-to-peer transactions (caused by an incorrect firewall rule or by router reconfiguration or update operations), can cause a temporary cluster outage.

  • Software errors:

    • Most software errors are recovered automatically and affect only one node. However, because SBRC is a complex system, a few forms of software error could cause an extended outage. To manage this risk, open cases for transient errors with JTAC as soon as possible, including the appropriate information referenced in the troubleshooting section, so that the problem does not recur or become worse.

    • The most dramatic software errors are often termed a “packet of death.” Invariably, and in concert with a complex level of functionality, this form of software error involves a packet that is well-formed from a RADIUS perspective but malformed from a processing perspective. Items such as zero-length or single-byte null fields can cause poorly defended customer-defined SQL statements, stored procedures, or JavaScripts to fail seriously. Other items might trigger unexpected side effects within packet processing. Because a great deal of processing is done and the range of use cases is vast, this form of software error is still possible even though many corner cases are tested within Juniper Networks.

    • You can lessen these sorts of errors by keeping track of the uptime of the system, monitoring and diagnosing transient rejects and errors, and implementing the recommended final failover emergency NOC. This could be an SBRC running with full logging by default, to trap the unexpected packets that have failed other processing.
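
The first risk category above notes that clearing the backlog after a back-end server recovers can take longer than the outage itself. The following Python sketch illustrates why, using a deliberately simplified model that is an assumption for illustration and not a description of SBRC internals: every request that arrives during the outage is queued rather than dropped, new requests keep arriving at the same rate after recovery, and the back end drains the queue at a fixed rate. The function name and all rates are hypothetical.

```python
def backlog_drain_seconds(arrival_rate, service_rate, outage_seconds):
    """Estimate the seconds needed to clear a backlog after recovery.

    Simplified fluid model (an assumption, not SBRC behavior):
      arrival_rate   -- requests per second, constant before and after the outage
      service_rate   -- requests per second the recovered back end can process
      outage_seconds -- duration of the back-end outage
    """
    if arrival_rate < 0 or service_rate <= 0 or outage_seconds < 0:
        raise ValueError("rates and duration must be non-negative, service_rate > 0")
    if service_rate <= arrival_rate:
        # No spare capacity: the backlog never shrinks.
        return float("inf")
    backlog = arrival_rate * outage_seconds      # requests queued during the outage
    spare_capacity = service_rate - arrival_rate # capacity left for the backlog
    return backlog / spare_capacity


# A 60-second outage at 80% load takes four times as long to clear.
print(backlog_drain_seconds(800, 1000, 60))  # 240.0
# The same outage at 40% load clears in less time than the outage lasted.
print(backlog_drain_seconds(400, 1000, 60))  # 40.0
```

Under these assumptions, recovery takes longer than the outage whenever the arrival rate exceeds half of the service rate. Retries and reconnects triggered by the outage would raise the effective arrival rate and lengthen recovery further.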