Supported SBR Carrier SSR Cluster Configurations

If all add-on products are added to the Starter Kit cluster, the maximum size of a data cluster is four data nodes, three management nodes, and up to 20 SBR Carrier nodes, as shown in Table 71.

Caution

Setting up an unsupported configuration can put data and equipment at risk and is not supported by Juniper Networks.

Also, note the latency limitation in Table 70. We do not support cluster configurations with latency between nodes that exceeds 20 ms, as errors can occur if servers are set up to spread a cluster across widely separated locations.

Table 71: Supported SBR Carrier SSR Cluster Configurations

Starter Kit

  • SBR Carrier Nodes (S): Two nodes.

  • Management Nodes (M): Two nodes.

  • Data Nodes (D): Two nodes, each on a discrete server. Two servers required.

  You may install the four Starter Kit nodes on four discrete servers (a s/s/m/m configuration), or combine one node of each type on two servers (a sm/sm configuration).

Starter Kit and one Data Expansion Kit

  • SBR Carrier Nodes (S) and Management Nodes (M): You may install the four Starter Kit nodes on four discrete servers (a s/s/m/m configuration), or combine one node of each type on two servers (a sm/sm configuration).

  • Data Nodes (D): Four nodes, each on a discrete server. Four servers required.

Either of the previously listed configurations and a Management Node Expansion Kit

  • SBR Carrier Nodes (S) and Management Nodes (M): Either Starter Kit configuration (s/s/m/m or sm/sm) may be enhanced with a third management node. We recommend that the third management node be installed on a discrete server.

  • Data Nodes (D): Either two or four nodes, each on a discrete server.

Any of the previously listed configurations and additional SBR Carrier Nodes (front ends)

  • SBR Carrier Nodes (S): One node on a discrete server. Up to 18 additional servers supported, for a total of 20.

  • Management Nodes (M): Any of the previously listed configurations.

  • Data Nodes (D): Either two or four nodes, each on a discrete server.
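
For planning purposes, the limits above reduce to a short checklist: 2 to 20 SBR Carrier nodes, two or three management nodes, two or four data nodes, and no more than 20 ms of latency between any two nodes. The following Python sketch illustrates that checklist only; the function and parameter names are hypothetical, and it is not a product tool.

  # Illustrative only: checks a proposed SSR cluster layout against the limits
  # summarized in Table 70 and Table 71. All names are hypothetical.
  MAX_SBR_CARRIER_NODES = 20   # Starter Kit front ends plus up to 18 more
  MAX_LATENCY_MS = 20          # latency between nodes must not exceed 20 ms

  def check_cluster(sbr_nodes, mgmt_nodes, data_nodes, worst_latency_ms):
      """Return a list of problems with a proposed layout (empty if supported)."""
      problems = []
      if not 2 <= sbr_nodes <= MAX_SBR_CARRIER_NODES:
          problems.append("SBR Carrier (s) nodes must number between 2 and 20")
      if mgmt_nodes not in (2, 3):
          problems.append("management (m) nodes must number 2 or 3")
      if data_nodes not in (2, 4):
          problems.append("data (d) nodes must number 2 or 4")
      if worst_latency_ms > MAX_LATENCY_MS:
          problems.append("latency between nodes must not exceed 20 ms")
      return problems

  # Example: Starter Kit plus one Data Expansion Kit and two extra front ends.
  print(check_cluster(sbr_nodes=4, mgmt_nodes=2, data_nodes=4, worst_latency_ms=5))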

Best Practice

To minimize the chance and impact of a service interruption, increase the number of data nodes. A Starter Kit can be vulnerable because the loss of one data node and one management node takes the cluster down to the critical 50 percent level. A Starter Kit with a Data Expansion Kit (creating two node groups) is much less likely to lose two or more of its data nodes simultaneously.
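
The arithmetic behind this recommendation is simple vote counting (illustration only, not product behavior):

  # Data (d) and management (m) nodes each carry one vote.
  # Starter Kit: 2 d + 2 m = 4 votes; lose one d node and one m node:
  print((4 - 2) / 4 * 100)   # 50.0 -- the critical 50 percent level
  # Starter Kit + Data Expansion Kit: 4 d + 2 m = 6 votes; same two losses:
  print((6 - 2) / 6 * 100)   # about 66.7 -- still above 50 percent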

Failover Overview

To continue functioning without a service interruption after a component failure, a cluster requires at least 50 percent of its data and management nodes to be functional. If more than 50 percent of the data nodes fail, expect a service interruption, although the nodes that remain available continue to operate.

Because SBR Carrier nodes function as front ends to the data cluster, they are not involved in any failover operations performed by the data cluster. However, as an administrator, you need to ensure that the front end environment is configured, usually through a load balancer, so that it can survive the loss of SBR Carrier nodes.

A data cluster prepares for failover automatically when the cluster starts. During startup, two events occur:

  • One of the data nodes (usually the node with the lowest node ID) becomes the primary of the node group. The primary node stores the authoritative copy of the database.

  • One data or management node is elected arbitrator. The arbitrator is responsible for conducting elections among the survivors to determine roles in case of node failures.
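
As a sketch of these two startup events (hypothetical node names and IDs; the actual election is performed internally by the data cluster software):

  # Hypothetical sketch of the startup role assignment described above;
  # the real election is internal to the SSR data cluster software.
  data_nodes = {10: "data node A", 11: "data node B"}            # node ID -> node
  mgmt_nodes = {1: "management node 1", 2: "management node 2"}

  # The data node with the lowest node ID usually becomes the primary and
  # stores the authoritative copy of the database.
  primary = data_nodes[min(data_nodes)]

  # One data or management node is elected arbitrator; a management node is
  # selected as the initial arbitrator (which one is an internal detail;
  # the lowest ID is assumed here purely for illustration).
  arbitrator = mgmt_nodes[min(mgmt_nodes)]

  print("primary:", primary, "| arbitrator:", arbitrator)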

In a cluster, each management and data node is allocated a vote that is used during this startup election and during failover operations. One management node is selected as the initial arbitrator of failover problems and of the elections that result from them.

Within the cluster, data and management nodes monitor each other to detect communications loss and heartbeat failure. When either type of failure is detected, as long as nodes with more than 50 percent of the votes are alive, there is instantaneous failover and no service interruption. If exactly 50 percent of nodes and votes are lost, and if a data node is one of the lost nodes, the cluster decides which half of the database is to remain in operation. The half with the arbitrator (which usually includes the primary node) stays up and the other half shuts down to prevent each node or node group from updating information independently.
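
The decision rule in this paragraph can be summarized in a short sketch (a simplification of the behavior described here, not the actual cluster algorithm):

  # Simplified sketch of the failover decision described above; the real
  # logic is implemented by the data cluster software, not by this code.
  def partition_outcome(votes_alive, votes_total, partition_has_arbitrator):
      """Decide whether a surviving partition of the cluster keeps running."""
      if votes_alive * 2 > votes_total:
          return "keep running: instantaneous failover, no service interruption"
      if votes_alive * 2 == votes_total:
          # Exactly half the votes survive: only the half that holds the
          # arbitrator (usually with the primary node) stays up.
          return "keep running" if partition_has_arbitrator else "shut down"
      return "shut down: fewer than 50 percent of the votes remain"

  # Starter Kit split in half (2 of 4 votes on each side): the arbitrator decides.
  print(partition_outcome(votes_alive=2, votes_total=4, partition_has_arbitrator=True))
  print(partition_outcome(votes_alive=2, votes_total=4, partition_has_arbitrator=False))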

When a failed data node (or nodes) returns to service, the working nodes resynchronize the current data with the restored nodes so that all data nodes are up to date. How quickly this takes place depends on the current load on the cluster, the length of time the nodes were offline, the number of sessions stored, the I/O bandwidth on the data nodes, and other factors.

Failover Examples

The following examples are based on the basic Starter Kit deployment setup with the recommended redundant network as shown in Figure 247. The cluster is set up in a single data center on a fully switched, redundant, layer 2 network. Each of the nodes is connected to two switches using the Solaris IP-multipathing feature for interface failover. The switches have a back-to-back connection.

Figure 247: Starter Kit SSR Cluster with Redundant Network

Possible Failure Scenarios

With these basic configurations, a high level of redundancy is supported. So long as one data node is available to one SBR Carrier node, the cluster is viable and functional.

  • If either SBR Carrier Server 1 or 2, which also run the cluster’s management nodes (s1 and m1 or s2 and m2), goes down, the effect on the facility and cluster is:

    • No AAA service impact.

    • NAS devices (depending on the failover mechanism in the device) switch to their secondary targets—the remaining SBR Carrier Server. Recovery of the NAS device when the SBR Carrier Server returns to service depends on NAS device implementation.

  • If either data node A or B goes down, the effect is:

    • No AAA service impact; both SBR Carrier nodes continue operation using the surviving data node.

    • The management nodes and surviving data node detect that one data node has gone down, but no action is required because failover is automatic.

    • When the data node returns to service, it synchronizes its NDB data with the surviving node and resumes operation.

  • If both management nodes (m1 and m2) go down, the effect is:

    • No AAA service impact because all the s and d nodes are still available. The data nodes continue to update themselves.

  • If both data nodes go down, the effect on:

    • The management nodes is minimal. They detect that the data nodes are offline, but can only monitor them.

    • The SBR Carrier nodes varies:

      • Authentication and accounting for users that do not require shared resources, such as the IP address pool or concurrency, continue uninterrupted.

      • Users that require shared resources are rejected.

      The carrier nodes continue to operate this way until the data cluster comes back online; when it does, the cluster automatically resumes normal AAA operation using the data cluster.

  • If one half of the cluster (SBR Carrier Server 1, management node 1, and data node A, or SBR Carrier Server 2, management node 2, and data node B) goes down, the effect is:

    • No AAA service impact because an SBR Carrier node, a management node, and a data node are all still in service. NAS devices using the failed SBR Carrier Server fail over to the surviving SBR Carrier Server.

    • When the failed data node returns to service, it synchronizes and updates its NDB data with the surviving node and resumes operation.

    • When the failed SBR Carrier Server returns to service, the NAS devices assigned to use it as their primary resource return to using it, depending on the NAS device implementation.
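
The scenarios above follow from a small set of rules, summarized in the sketch below for the basic Starter Kit deployment (a simplification of the behavior described in this section, not product code; the function name is hypothetical):

  # Illustrative summary of the failure scenarios above; a simplification,
  # not product code.
  def aaa_impact(sbr_nodes_up, data_nodes_up):
      """Describe the AAA service impact for a basic Starter Kit cluster."""
      if sbr_nodes_up == 0:
          return "no SBR Carrier front ends available: AAA service is interrupted"
      if data_nodes_up == 0:
          return ("authentication and accounting that do not need shared resources "
                  "continue; requests that need shared resources (IP address pool, "
                  "concurrency) are rejected until the data cluster returns")
      return "no AAA service impact"

  print(aaa_impact(sbr_nodes_up=1, data_nodes_up=1))   # one half of the cluster lost
  print(aaa_impact(sbr_nodes_up=2, data_nodes_up=0))   # both data nodes lost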