SBRC Architecture Overview

 

Two key principles are central to a high availability system:

  • Redundancy—The function of each network component and subsystem should be fully replicated, in terms of both functionality and throughput. In the event of a failure, half of the SBRC system or subsystem must be able to carry the work of the entire system.

  • Separation—Components that can fail individually should be permitted to fail individually. Thus, NOC Area 1 and NOC Area 2, which represent two different NOCs or two different sections of the same NOC (see Figure 5), should function independently when an environmental failure occurs.

Figure 5: SSR Redundancy Architecture

RADIUS traffic can be routed using reliable networking techniques at either OSI Layer 2 or Layer 3. Because UDP traffic is retransmitted by the application layer at a comparatively slow rate, it is resilient to routing changes and to packets dropped while the network reconverges.
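
The following minimal sketch illustrates why this is so. It is a generic UDP client retransmit loop in the style of a RADIUS client, not SBRC code; the server address, timeout, and retry count are assumed example values. A routing change that reconverges within a few seconds is simply absorbed by the next retransmission.

    import socket

    SERVER = ("192.0.2.10", 1812)   # hypothetical RADIUS server address
    RETRANSMIT_TIMEOUT = 3.0        # assumed interval between retries, in seconds
    MAX_RETRIES = 3                 # assumed retry count

    def send_with_retransmit(payload):
        """Send a datagram and retransmit until a reply arrives or retries run out."""
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(RETRANSMIT_TIMEOUT)
            for _attempt in range(MAX_RETRIES):
                sock.sendto(payload, SERVER)
                try:
                    reply, _addr = sock.recvfrom(4096)
                    return reply
                except socket.timeout:
                    # No answer yet; a brief routing reconvergence or a dropped
                    # packet is covered by the next retransmission.
                    continue
        return None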

In general, the Transmission Control Protocol (TCP) and the Stream Control Transmission Protocol (SCTP) also tolerate the additional latency caused by routing changes reasonably well, so traffic to the back-end servers (from the SBRC front-end application to the authentication and accounting databases, or to the SSR) may be routed, although switching generally provides the lowest latency.

However, intra-D node traffic, which keeps two copies of the same data in lock-step, requires both low latency and high reliability, and should be carried over Layer 2.

IPMP, which configures a single virtual address that floats between the two physical adapters, is recommended. IPMP is generally used in active-standby mode with active probing; the failure detection time (typically 200 to 400 milliseconds) can be tuned against the relevant heartbeat timeouts in the case of NDB, and against RADIUS retransmit semantics or database timeouts for other back ends.
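
As a rough way to reason about these timeouts, the sketch below checks that an assumed IPMP detection window fits inside the heartbeat and retransmit budgets it must not exceed. All numbers are illustrative examples, not SBRC or NDB defaults; substitute the values configured in your deployment.

    # All values are assumed examples for illustration only.
    IPMP_DETECTION_MS = 400           # upper end of the probe-based detection window
    NDB_HEARTBEAT_INTERVAL_MS = 1500  # assumed NDB heartbeat interval
    NDB_MISSED_HEARTBEATS = 4         # assumed misses before a node is declared dead
    RADIUS_RETRANSMIT_MS = 3000       # assumed RADIUS client retransmit interval

    ndb_budget_ms = NDB_HEARTBEAT_INTERVAL_MS * NDB_MISSED_HEARTBEATS

    # An IPMP failover should complete well before either budget is exhausted,
    # so that an adapter failure looks like a brief stall rather than a node failure.
    assert IPMP_DETECTION_MS < ndb_budget_ms, "failover would trip NDB heartbeat detection"
    assert IPMP_DETECTION_MS < RADIUS_RETRANSMIT_MS, "failover would outlast one RADIUS retransmit"

    print(f"IPMP detection {IPMP_DETECTION_MS} ms fits within the NDB budget of "
          f"{ndb_budget_ms} ms and the RADIUS retransmit interval of {RADIUS_RETRANSMIT_MS} ms")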

See the Oracle whitepaper Highly Available and Scalable Oracle RAC Networking with Oracle Solaris 10 IPMP (http://www.oracle.com/technetwork/articles/systems-hardware-architecture/ha-rac-networking-ipmp-168440.pdf) for more information.

In a red-black network that separates the outward-facing components from the inner “back-end” components, each SBRC S node has a single virtual address on two physical adapters connecting to the two externally facing network devices. Each SBRC S node also has a second virtual address on two physical adapters connecting to the back ends (the SSR or the authentication/accounting databases); the back-end connections should generally be switched (and may provide Layer 2 routed access to a failover server in another NOC).
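
The per-node addressing scheme can be summarized as follows. This is a hypothetical sketch only; node names, interface names, and addresses are invented for illustration and do not correspond to SBRC configuration syntax.

    # Hypothetical addressing map: one floating virtual address over the two
    # outward-facing adapters and one over the two back-end adapters per S node.
    S_NODE_ADDRESSING = {
        "sbr-s1": {
            "front_end": {"virtual_ip": "203.0.113.11", "adapters": ["net0", "net1"]},
            "back_end":  {"virtual_ip": "10.10.0.11",   "adapters": ["net2", "net3"]},
        },
        "sbr-s2": {
            "front_end": {"virtual_ip": "203.0.113.12", "adapters": ["net0", "net1"]},
            "back_end":  {"virtual_ip": "10.10.0.12",   "adapters": ["net2", "net3"]},
        },
    }

    for node, networks in S_NODE_ADDRESSING.items():
        for role, cfg in networks.items():
            print(f"{node} {role}: {cfg['virtual_ip']} floating over {' and '.join(cfg['adapters'])}")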

The emergency NOC can be placed in a separate location as a server of last resort, to lessen the impact of critical multi-system failures. How promiscuous this server should be is a policy decision: in many cases, when the primary system does not respond at all (as opposed to responding with a reject), a carrier may, from a business point of view, prefer to grant temporary access freely rather than reject subscribers because of these central failures.
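
The sketch below captures that policy choice in the simplest possible form; the function and its result values are hypothetical and are not part of SBRC.

    from enum import Enum

    class PrimaryResult(Enum):
        ACCEPT = "accept"
        REJECT = "reject"
        TIMEOUT = "timeout"   # no response at all from the primary system

    def emergency_decision(primary, permit_on_central_failure=True):
        """Decide what the server of last resort returns, given the primary outcome."""
        if primary is PrimaryResult.ACCEPT:
            return "accept"
        if primary is PrimaryResult.REJECT:
            return "reject"   # an explicit reject from the primary is always honored
        # Silence from the primary is treated as a central failure; business
        # policy decides whether temporary access is granted freely.
        return "accept" if permit_on_central_failure else "reject"

    print(emergency_decision(PrimaryResult.TIMEOUT))   # -> accept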

The third M node used by the SSR must have at least low-bandwidth, medium-latency (less than 100 milliseconds round trip) routed connectivity, possibly through a firewall, to each D node and to the other M nodes. This M node serves as the final arbiter that determines which side of the cluster is considered ‘alive’ if both fiber connections between the two NOCs are severed. If there are only two M nodes in two different NOCs, then on a cluster split one set of nodes survives and the other fails, but the surviving NOC might itself be suffering a connectivity failure to the rest of the system at the same time. The third M node should therefore be collocated with the most important exterior devices, if any, with which you want to preferentially maintain connectivity in the event of certain major network failures.
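
The arbitration role of the third M node can be illustrated with a simple majority rule. This is a deliberate simplification, not the actual SSR or NDB arbitration algorithm, and the node names are invented; it shows why, with only two M nodes, a clean split leaves neither side with a clear claim, whereas the third site breaks the tie in favor of the partition that can still reach it.

    # Simplified majority-arbitration sketch (not the real NDB arbitrator logic).
    M_NODES = {"m-noc1", "m-noc2", "m-third-site"}   # hypothetical M node names

    def partition_survives(reachable_m_nodes):
        """A partition keeps running only if it can reach a strict majority of M nodes."""
        return len(reachable_m_nodes & M_NODES) > len(M_NODES) // 2

    # Both fiber links between the NOCs are cut, but NOC 1 still reaches the third site:
    print(partition_survives({"m-noc1", "m-third-site"}))   # True  -> this side stays alive
    print(partition_survives({"m-noc2"}))                   # False -> this side shuts down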