Supported SSR Cluster Configurations

For small-scale deployments, the minimum SSR configuration requires three C Series Controllers set up in the two-shared-data-node geometry. In this configuration, there are two data-client nodes and two management servers. One management server must run on the third C Series Controller; the other can reside on either data-client node. Each C Series Controller acts as a backup for the other. This solution requires you to configure the database memory size, which is necessary only in the two-shared-data-node cluster geometry. In two-data-node and four-data-node cluster geometry deployments, all available memory is allocated to the data node process; in the two-shared-data-node cluster geometry, memory is shared by all components.

For larger-scale deployments, we recommend you use dedicated C Series Controllers to host each node type. The minimum configuration for a redundant SSR cluster is two client nodes, one of which must host a management server, and two data nodes. A maximum of twenty-four client nodes is supported. For redundancy, at least two client nodes should be configured to host management server components. Data nodes must be added in pairs. The maximum number of data nodes in a cluster is four.

Table 20 lists the possible configurations.

Caution: Setting up an unsupported configuration can put data and equipment at risk and is not supported by Juniper Networks.

Also note the latency limitation described in SSR Cluster Network Requirements. We do not support cluster configurations in which the latency between nodes exceeds 20 ms, as can happen when a cluster is spread across widely separated locations.
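Before committing to a topology, you can get a rough sense of inter-node latency with a simple probe. The following Python sketch is illustrative only and is not an SRC tool; the host names and probe port are placeholders for your own management addresses.

```python
# Rough pre-deployment latency probe (illustrative; not an SRC tool).
# Run it from each candidate node in turn. PEERS and PROBE_PORT are
# placeholders; substitute your own node addresses and an open TCP port.
import socket
import time

PEERS = ["dn1.example.net", "dn2.example.net", "cn1.example.net"]
PROBE_PORT = 22      # any TCP port the peer accepts connections on
LIMIT_MS = 20.0      # maximum supported inter-node latency
SAMPLES = 5

def median_rtt_ms(host, port, timeout=2.0):
    """Median TCP connect time to host:port, in milliseconds."""
    samples = []
    for _ in range(SAMPLES):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            pass
        samples.append((time.perf_counter() - start) * 1000.0)
    return sorted(samples)[len(samples) // 2]

for peer in PEERS:
    rtt = median_rtt_ms(peer, PROBE_PORT)
    verdict = "OK" if rtt <= LIMIT_MS else "exceeds the 20 ms limit"
    print(f"{peer}: {rtt:.1f} ms ({verdict})")
```

A TCP connect time includes connection setup overhead, so treat the result as an upper bound on the round-trip latency.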

Table 20: Supported Cluster Configurations

Data nodes: Two data-client nodes (two-shared-data-node geometry)

Each C Series Controller is configured as a data-client node.

  • A data-client node runs a data node component, a client node component, and a management server. You must set the cluster geometry to two-shared-data-node for this configuration.
  • You can add up to 22 additional client nodes.
  • You must configure the database memory when running the two-shared-data-node cluster geometry because all components share the memory.
  • You cannot mix data-client nodes with data nodes.

Data nodes: Two, each running on its own C Series Controller

Client nodes: Up to 24

  • Each running on its own C Series Controller
  • Up to three configured to run management servers
  • Each hosting SSR client components such as the SIC, the SAE, and so on
  • Minimum configuration is one client node/management server with one SSR client component (no redundancy)

Data nodes: Four, each running on its own C Series Controller

Client nodes: Up to 24

  • Each running on its own C Series Controller
  • Up to three configured to run management servers
  • Each hosting SSR client components such as the SIC, the SAE, and so on
  • Minimum configuration is one client node/management server with one SSR client component (no redundancy)
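The constraints above can be summarized in a quick sanity check. The following Python sketch is purely illustrative (the function and its rule set are not part of the SSR software) and covers only the dedicated-node geometries, not the two-shared-data-node case.

```python
# Illustrative check of a proposed dedicated-node cluster layout against
# the supported-configuration rules described above. Not an SRC tool.

def check_cluster(data_nodes, client_nodes, mgmt_servers):
    """Return (errors, warnings) for a dedicated-node (non-shared) layout."""
    errors, warnings = [], []
    if data_nodes not in (2, 4):
        errors.append("data nodes are added in pairs; 2 or 4 are supported")
    if not 1 <= client_nodes <= 24:
        errors.append("1 to 24 client nodes are supported")
    if not 1 <= mgmt_servers <= 3:
        errors.append("up to 3 client nodes can run management servers")
    if mgmt_servers > client_nodes:
        errors.append("management servers must run on client nodes")
    if client_nodes < 2 or mgmt_servers < 2:
        warnings.append("no redundancy: use at least 2 client nodes, "
                        "at least 2 of them hosting management servers")
    return errors, warnings

# Minimum redundant cluster described above: 2 data nodes, 2 client nodes.
print(check_cluster(data_nodes=2, client_nodes=2, mgmt_servers=2))
# Minimum non-redundant cluster: one client node/management server.
print(check_cluster(data_nodes=2, client_nodes=1, mgmt_servers=1))
```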

Failover Overview

To continue functioning without a service interruption after a component failure, a cluster requires at least 50 percent of its data nodes, and at least 50 percent of its client nodes running the management server component, to remain functional. If more than 50 percent of the data nodes fail, expect a service interruption, although the available nodes continue to operate.
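As a rough illustration of this 50 percent rule (a sketch under simplified assumptions, not SSR's internal logic), the following check reports whether a cluster keeps running without interruption after some nodes fail.

```python
# Rough illustration of the 50 percent availability rule described above;
# this is not the SSR failover implementation.

def survives_without_interruption(data_total, data_up, mgmt_total, mgmt_up):
    """True if at least half of the data nodes and at least half of the
    client nodes running management servers are still functional."""
    return data_up * 2 >= data_total and mgmt_up * 2 >= mgmt_total

# Two data nodes and two management servers: losing one of each is tolerated.
print(survives_without_interruption(2, 1, 2, 1))   # True
# Losing both data nodes means a service interruption.
print(survives_without_interruption(2, 0, 2, 2))   # False
```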

Because SSR client components function as front ends to the back-end data storage portion of the cluster, they are not involved in any failover operations performed by the back-end data components. However, as an administrator, you need to ensure that the front-end environment is configured so that it can survive the loss of components.

A data cluster prepares for failover automatically when the cluster starts. During startup, two events occur:

  • Each management server and data node is allocated a vote, which is used during this startup election and during failover operations. One management server is selected as the initial arbitrator of failover problems and of the elections that result from them.
  • Within the cluster, data nodes and any client nodes hosting management servers begin monitoring each other to detect node failure or loss of communications.

When either type of failure is detected, failover is instantaneous and there is no service interruption, as long as nodes holding more than 50 percent of the votes are still operating. If exactly 50 percent of the nodes and votes are lost, and a data node is among the lost nodes, the cluster determines which half of the database remains in operation. The half with the arbitrator (which usually includes the master node) stays up, and the other half shuts down to prevent each node or node group from updating information independently.
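The vote-and-arbitrator behavior can be sketched as follows. This is a simplified model under assumed vote counts, not the actual cluster protocol.

```python
# Simplified model of the voting behavior described above: each data node
# and management server holds one vote, and when exactly half of the votes
# are lost, the partition containing the arbitrator stays up.
# Illustrative only; not the actual SSR cluster protocol.

def partition_outcome(side_votes, other_votes, side_has_arbitrator):
    """Decide what one side of a split cluster does."""
    total = side_votes + other_votes
    if side_votes * 2 > total:
        return "stays up (holds a majority of the votes)"
    if side_votes * 2 == total:
        if side_has_arbitrator:
            return "stays up (holds the arbitrator)"
        return "shuts down (even split, no arbitrator)"
    return "shuts down (minority)"

# Even split of a cluster with 2 data nodes and 2 management servers (4 votes):
print(partition_outcome(side_votes=2, other_votes=2, side_has_arbitrator=True))
print(partition_outcome(side_votes=2, other_votes=2, side_has_arbitrator=False))
```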

When a failed data node (or nodes) returns to service, the working nodes resynchronize the current data with the restored nodes so all data nodes are up to date. How quickly this takes place depends on the current load on the cluster, the length of time the nodes were offline, and other factors.

Failover Examples

The following examples are based on the deployment of two client nodes each hosting a management server, and two data nodes set up with the recommended redundant network as shown in Figure 60. Each client node is running a client component. The cluster is set up in a single data center on a fully switched, redundant, layer 2 network. Each of the nodes is connected to two switches using Ethernet bonding for interface failover. The switches have a back-to-back connection.

Figure 60: SSR Cluster with Redundant Network


Possible Failure Scenarios

This basic configuration provides a high level of redundancy. As long as one data node is available to one client node/management server, the cluster remains viable and functional.

Distributed Cluster Failure and Recovery

You can divide a cluster and separate two equal halves between two data centers. In this case, the interconnection is made by dedicated communications links (shown as red lines in Figure 61 and Figure 62) that may be either:

However, separating the cluster like this creates a configuration that is vulnerable to a catastrophic failure that severs the two halves of a dispersed cluster. We recommend adding a third client node/management server at a location that has a separate alternative communication route to each half. A third client node/management server:

With a third client node/management server in place, failover in the dispersed cluster is well managed because one side of the cluster does not have to determine what role to assume. Recovery is likely to be quicker when the data nodes are reunited because each node’s status is more likely to have been monitored by at least one management server that is in communication with each segment.
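To see why the third location helps, consider how the votes split when the inter-site links fail. The sketch below uses assumed per-site vote counts and is not part of the SSR software.

```python
# Why a tertiary management server helps: with votes spread across three
# locations, losing the links between the two main sites no longer produces
# an even split. Vote counts per site are assumptions for illustration.

sites = {"site_a": 2, "site_b": 2, "site_c": 1}   # votes held at each site
total_votes = sum(sites.values())

# The links between site_a and site_b are severed, but site_c (the tertiary
# client node/management server) can still reach site_a.
segments = [
    {"site_a": sites["site_a"], "site_c": sites["site_c"]},
    {"site_b": sites["site_b"]},
]

for segment in segments:
    votes = sum(segment.values())
    if votes * 2 > total_votes:
        verdict = "majority, stays up"
    else:
        verdict = "minority, shuts down"
    print(f"{sorted(segment)}: {votes} of {total_votes} votes ({verdict})")
```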

Figure 61: SSR Cluster Divided Between Two Sites with Tertiary Management Server


Without a third client node/management server, the configuration shown in Figure 62 is vulnerable to data loss if both communication links are severed or if the nodes in the master half of the cluster all go offline simultaneously.

Figure 62: SSR Cluster Evenly Divided Between Two Sites


If either of those calamities occurs, the secondary side is left with exactly half of the cluster's nodes and votes. If the master nodes are operating on one or both sides, the cluster continues to function. However, the secondary side cannot determine whether the master side is really no longer available, because it holds only two votes. It can take 10 to 15 minutes for the secondary side of the cluster to automatically restart, promote itself to master status, and resume cluster operations.

The SSR client nodes connecting to the secondary side continue to work. However, modifications made to the SSR database may create a divergence between the two copies of the database. The longer the cluster is split, the greater the divergence, and the longer it takes to resolve when recovery takes place.

To eliminate these problems, we recommend a proven alternative: add another client node with a third management server at a third location that can communicate with each half of the dispersed cluster. Without the tertiary client node/management server, a dispersed cluster that suffers a catastrophic failure can experience downtime.

If you cannot add a third client node/management server, we recommend that you configure the secondary side of the cluster not to restart automatically, but to go out of service as soon as it is disconnected from the master-side nodes. You can then determine the best course of action: keep the cluster offline, or promote the secondary side of the cluster, relink the client node/management servers, and plan for reconciling the divergence as part of the recovery procedure.

When the cluster is reunited and goes into recovery mode, the master and secondary data nodes attempt to reconcile the divergence that occurred during separation. The moment they come in contact, transitory failures appear on the client nodes because the cluster configuration has changed; any transactions that are pending at that moment are aborted. The client nodes retry those transactions because they are classified as temporary failures; in most situations they are accepted on the first retry.
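Client applications that access the SSR database through the client nodes typically rely on this retry behavior. The following sketch shows the general retry-on-temporary-failure pattern; the exception type and back-off values are assumptions for illustration, not part of the SSR client API.

```python
# Generic retry-on-temporary-failure pattern matching the behavior described
# above: a transaction aborted during a cluster reconfiguration is retried
# and usually succeeds on the first retry.
# TemporaryClusterError and the back-off values are illustrative; they are
# not part of the SSR client API.
import time

class TemporaryClusterError(Exception):
    """Placeholder for a transient, retryable transaction failure."""

def run_transaction(txn, retries=3, backoff_s=0.2):
    """Run txn(), retrying it if it fails with a temporary error."""
    for attempt in range(retries + 1):
        try:
            return txn()
        except TemporaryClusterError:
            if attempt == retries:
                raise
            time.sleep(backoff_s * (attempt + 1))   # brief, growing back-off

# Usage (hypothetical operation): wrap the call so that a transient abort
# during cluster reconfiguration is retried transparently.
# result = run_transaction(lambda: write_subscriber_session(session))
```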

Related Documentation