Load Balancing on RADIUS Front-End Applications and Downstream Proxy Devices

In high-end deployments with multiple front-end SBRCs to one SSR, or multiple SAs operating in parallel, it is important that the load balancers in an SBRC deployment (whether commercial load-balancing devices or instances of SBR SA running proxy and doing round-robin) have the properties described in the following sections.

Stateful and RADIUS-Aware

Load balancers should be stateful and continue to send packets of the same transaction cycle to the same SBRC front-end application. In the case of EAP over RADIUS, a single EAP stream carried in RADIUS Access-Request and Access-Challenge pairs must stay on the same front-end application until that application returns a success or a reject, or exhibits a failure. Failing over an authentication to another SBRC in the middle of an EAP negotiation results in a reject, and the user must retry the connection.

The State attribute in the RADIUS request and response identifies which front-end application processed the request. Alternatively, the NAS-IP-Address and port are unique identifiers for most devices, and the Calling-Station-Id can be used in some use cases.

In most cases, accounting should be processed on the same front-end application as authentication. If the load-balancing device can parse attributes, the Class attribute can be used to route starts, interims, and stops to the originating front-end application.
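For illustration, the routing rules above can be condensed into a short sketch. The following Python is illustrative only, not SBRC or load-balancer code; the backend names, the packet representation, and the assumption that the Class attribute embeds the front end's name are all hypothetical.

```python
import hashlib

BACKENDS = ["fe-1", "fe-2", "fe-3"]      # hypothetical front-end applications
affinity = {}                            # conversation key -> front end

def conversation_key(pkt):
    # Prefer the State attribute, which ties the Challenge-Response pairs
    # of one EAP conversation together; fall back to NAS-IP-Address + port,
    # then to Calling-Station-Id.
    if "State" in pkt:
        return ("state", pkt["State"])
    if "NAS-IP-Address" in pkt and "NAS-Port" in pkt:
        return ("nas", pkt["NAS-IP-Address"], pkt["NAS-Port"])
    return ("csid", pkt.get("Calling-Station-Id"))

def route_auth(pkt):
    # Keep every packet of one transaction cycle on the same front end.
    key = conversation_key(pkt)
    if key not in affinity:
        digest = hashlib.md5(repr(key).encode()).digest()
        affinity[key] = BACKENDS[digest[0] % len(BACKENDS)]
    return affinity[key]

def route_acct(pkt):
    # Route starts, interims, and stops to the front end that handled
    # authentication, using the Class attribute echoed by the NAS
    # (assumes Class embeds the front end's name).
    cls = pkt.get("Class", b"")
    for fe in BACKENDS:
        if fe.encode() in cls:
            return fe
    return route_auth(pkt)               # fall back to affinity hashing
```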

In some cases, multiple SBRC SA front-end applications might be doing heavy authentication work (for example, TTLS with separate inner authentication proxied to SBRC cluster front-end applications). In that case, the devices can send accounting directly to the SBRC cluster front-end application, bypassing the unnecessary proxy step.

While TTLS may be cluster-enabled, and reauthentication thus optimized across multiple nodes, TLS reauthentication optimization for a device should attempt to use the front-end application that the device used previously, either by adding return attributes for server biasing, or by using the Calling-Station-Id or the User-Name if it represents the device ID. If that front-end application fails, this reauthentication optimization need not occur; a full authentication cycle against any working front-end application can take its place.
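A minimal sketch of this biasing logic, assuming a hypothetical routing function and that the previously used front end is already known (for example, from a return attribute or the Calling-Station-Id):

```python
def route_tls_reauth(previous_fe, healthy_fes):
    # Bias TLS reauthentication toward the front end that holds the
    # cached TLS session state, but never depend on it: if that front
    # end has failed, run a full authentication on any working one.
    if previous_fe in healthy_fes:
        return previous_fe, "resume"     # optimized TLS reauthentication
    return sorted(healthy_fes)[0], "full-auth"

# fe-2 held the cached session but has failed:
print(route_tls_reauth("fe-2", {"fe-1", "fe-3"}))  # -> ('fe-1', 'full-auth')
```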

FastFail is a failover mechanism for proxy in SBRC SA. It fails transactions over from one downstream target to another until a strobe of the failed target succeeds. This can be used to maintain performance under conditions of temporary failure. If failover can occur to SBRC front-end applications in a different NOC, scope your system so that each NOC can handle the work of the entire system.
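The failover-and-strobe pattern might be sketched as follows; the probe interval and helper names are assumptions for illustration, not SBRC's actual FastFail implementation.

```python
import time

STROBE_INTERVAL = 10.0                   # assumed probe period, in seconds

class Target:
    def __init__(self, name):
        self.name = name
        self.failed_at = None            # None means healthy

def pick_target(targets, probe):
    # Send to the first healthy downstream target; periodically strobe
    # failed targets and restore one as soon as a probe succeeds.
    now = time.monotonic()
    for t in targets:
        if t.failed_at is not None and now - t.failed_at >= STROBE_INTERVAL:
            if probe(t):                 # strobe the failed target
                t.failed_at = None       # target is back; use it again
            else:
                t.failed_at = now        # still down; strobe again later
        if t.failed_at is None:
            return t
    raise RuntimeError("no downstream target available")
```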

Network Configuration for SSR

A key source of contention within the SSR is the lock that must be held to keep the two mirrored D nodes (data nodes) operating in lock step, a consequence of the two-phase commit required by all redundant SQL systems. If a database node needs to update a piece of data in a row (for example, in the SSR CurrentSessions table), it must request that its mirrored pair halt any other changes to that row, update both copies of the row in parallel, and then release the lock. The time taken to place the lock and update the data is partly a function of the CPU and partly of the latency of the total operation from the first lock to the last release. Aggregate operations, such as deleting multiple sessions or allocating an IP address, must hold one lock across several ndb-level operations, and each such operation is slower due to latency.
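To make that cost concrete, the model below counts D-D round trips for one locked update; the figures are illustrative assumptions, not measured SSR numbers.

```python
def locked_update_ms(dd_latency_ms, ops_under_lock=1, cpu_ms=0.1):
    # Acquire the lock across the mirrored D-D pair, apply each
    # operation to both copies, then release: every step costs at
    # least one D-D round trip, so latency dominates lock hold time.
    round_trips = 1 + ops_under_lock + 1     # lock + updates + release
    return round_trips * dd_latency_ms + ops_under_lock * cpu_ms

print(locked_update_ms(0.2))                     # single update, LAN latency
print(locked_update_ms(5.0, ops_under_lock=10))  # aggregate op, high latency
```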

Why Latency Kills Performance: D-D and S-D

Latency between the data nodes (D nodes) significantly reduces the overall throughput of the SSR. In any system that requires the highest levels of performance and reliability, even a few milliseconds of latency cannot be recommended. Simple cases that look up by primary key, do not perform database updates via interims, or have no phantom sessions are more resilient to latency. Because each use case is different, testing for sustained throughput in a lab environment with representative latency, while measuring the performance of the system, is strongly recommended.

Similarly, latency between S and D nodes (SBR and data nodes) decreases the throughput of the node because of the present NDB limitation of 120 ndb handles per API node. Each handle carries one outstanding request, and those outstanding requests take longer to complete when latency is high. In this case, additional S nodes are required to compensate for the latency and sustain the same throughput.
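The 120-handle limit turns S-D latency directly into a per-node throughput ceiling (Little's law: throughput = concurrency / latency). A quick calculation under assumed round-trip times:

```python
import math

NDB_HANDLES = 120                 # outstanding requests per API node

def max_tps_per_s_node(sd_rtt_ms):
    # Little's law: at most 120 requests in flight, each taking one RTT.
    return NDB_HANDLES / (sd_rtt_ms / 1000.0)

def s_nodes_needed(target_tps, sd_rtt_ms):
    return math.ceil(target_tps / max_tps_per_s_node(sd_rtt_ms))

print(max_tps_per_s_node(1.0))       # 1 ms S-D RTT  -> 120,000 TPS ceiling
print(max_tps_per_s_node(10.0))      # 10 ms S-D RTT -> 12,000 TPS ceiling
print(s_nodes_needed(50_000, 10.0))  # -> 5 S nodes to sustain 50k TPS
```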

Subscriber Scaling

Scalability for the SSR and SBRC front-end application is calculated based on the following dimensions:

  • Inserts, deletes, and searches in the SSR indexes become slower as simultaneous sessions are added, by a factor of approximately O(log₂ n).

  • The number of transactions per second (TPS) that the system as a whole must carry to sustain the user base, especially at peak rates. The number of authentications, starts, and stops varies widely in volume and timing between mobile users and always-on devices, for which such events may be rare.

The number of interim accounting transactions scales linearly with the number of simultaneously connected subscribers and inversely with the interim period, and it causes a significant load, especially when the interim period is short. Halving the interim period doubles the number of packets that must be managed; doubling it halves that number.
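That scaling is easy to quantify. A back-of-envelope calculation with assumed numbers:

```python
def interim_tps(connected_sessions, interim_period_s):
    # Each connected session generates one interim per interim period.
    return connected_sessions / interim_period_s

print(interim_tps(1_000_000, 3600))  # 1 h interims  -> ~278 interims/s
print(interim_tps(1_000_000, 1800))  # halve period  -> ~556 interims/s
print(interim_tps(1_000_000, 7200))  # double period -> ~139 interims/s
```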

Devices that reconnect at set times (for example, at midnight or 2:00 a.m.), or that otherwise become synchronized in time (due to, for example, an SGSN failure or a GGSN reset), can produce periodic spikes of traffic: both the reconnection traffic itself and all of the future interims it schedules line up in time. The result is often much closer to peak usage than to load spread evenly over the hour, which can prove an issue in deployment. Various solutions can be implemented, but the biggest benefit comes from decreasing the interims that the SSR sees, followed by injecting a time skew into the reconnection process, either through the device or through control points in SBRC.
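A sketch of the time-skew idea, whether applied in the device or at a control point in SBRC; the one-hour window is an assumed value:

```python
import random

def skewed_reconnect_delay(base_delay_s=0.0, skew_window_s=3600.0):
    # Spread otherwise-synchronized reconnections (and therefore the
    # future interims they schedule) uniformly across a window instead
    # of letting them land in a single spike.
    return base_delay_s + random.uniform(0.0, skew_window_s)

# 10,000 devices that would all reconnect at midnight now land
# uniformly across the following hour.
delays = [skewed_reconnect_delay() for _ in range(10_000)]
```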

Why Are Interims Hard, and When Can They Be Thrown Away?

Interim accounting is an update transaction managed by the SSR. In most cases, a stop needs only a primary-key search and a delete operation, with limited data transmitted to or from the SSR. An interim transaction includes the work of a primary-key search and the database row update of an accounting start, plus the return of most or all of the data in that row to reliably update any indexed fields that changed, all under a row lock to keep the data consistent. Of all the per-RADIUS-transaction operations against the SSR, the interim packet requires the most work to process.

Accounting interims may have been implemented as a stop-loss for accounting purposes, in case the RADIUS server does not receive a stop: you can reliably bill the usage (for usage-based plans) for which you have received an interim, even when the stop was lost due to a failure. In this case, the SSR does not need to be updated by every interim, and the UpdateOnInterim parameter in radius.ini can be set to zero.

In certain cases, accounting interims are used to extend the session life. For example, if devices reauthenticate daily, setting the Session-Timeout to a little over 24 hours avoids using interims to extend the session life, and the UpdateOnInterim parameter can again be set to zero. Interims remain available for accounting-level processing, but need not touch the CST (current sessions table).
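A small calculation of the Session-Timeout sizing described above; the 30-minute margin is an assumed value:

```python
DAY_S = 24 * 3600

def session_timeout_s(margin_s=1800):
    # Set Session-Timeout a little past 24 hours so that a daily
    # reauthentication always renews the session before it expires;
    # interims are then not needed to keep the session alive, and
    # UpdateOnInterim can be set to zero.
    return DAY_S + margin_s

print(session_timeout_s())   # 88200 seconds (24 h + 30 min)
```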