Upstream RADIUS devices (such as NAS devices, load balancers, or SBRC instances sending to downstream proxies) must retry the same RADIUS transaction against the same SBRC front-end application at least three times (with a 5-second retry interval) before trying another front-end application. A common alternative is five retries at a 3-second interval before trying another front-end application.
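The retry order described above can be sketched as a simple schedule. This is a minimal illustration, not SBRC or NAS configuration: the function name `retry_schedule`, its parameters, and the front-end names are hypothetical, and it assumes that "three times" means three transmissions in total to each front-end application, spaced one interval apart.

```python
from typing import Iterator, List, Tuple


def retry_schedule(
    front_ends: List[str],
    attempts_per_front_end: int = 3,   # assumed: 3 transmissions per front end
    interval_s: float = 5.0,           # assumed: 5 seconds between transmissions
) -> Iterator[Tuple[float, str]]:
    """Yield (seconds since first send, front end) for each transmission.

    All attempts go to one front end before the client moves on to the next.
    """
    offset = 0.0
    for front_end in front_ends:
        for _ in range(attempts_per_front_end):
            yield (offset, front_end)
            offset += interval_s


if __name__ == "__main__":
    # Hypothetical front-end names, used only for illustration.
    for when, target in retry_schedule(["fe1", "fe2"]):
        print(f"t+{when:>4.1f}s  send to {target}")
```

With the defaults, the client keeps sending to the first front end for the first 15 seconds and only then switches to the second. Passing `attempts_per_front_end=5, interval_s=3.0` gives the 5x3 variant.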
RADIUS is a UDP-based protocol with application-level retransmit semantics that can easily be configured and tuned in the client, unlike TCP retransmissions, which are generally tuned at the system level.
Many network devices and operating systems preferentially drop UDP packets under heavy traffic. Specifically, both Linux and Solaris may drop UDP packets when one of the buffers in their network stacks is nearing full capacity. In a lab environment under heavy load, you will often see some packet loss at the OS level; this is consistent with the protocol design.
Retransmit settings of three retries x 5 seconds each (3x5) or five retries x 3 seconds each (5x3) give a request a good chance of outlasting transient traffic spikes on the network. They also provide a reasonable amount of resilience: the underlying network protocols have time to reroute traffic, and failure recovery mechanisms have time to complete under failure conditions.
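As a rough worked example of the arithmetic, both 3x5 and 5x3 keep the client on one front-end application for about 15 seconds. The sketch below compares that window with a recovery time; the helper names and the 10-second recovery figure are hypothetical and are not taken from SBRC documentation.

```python
def retry_window_s(attempts: int, interval_s: float) -> float:
    """Approximate time spent on one front end before failing over."""
    return attempts * interval_s


def outlasts_recovery(attempts: int, interval_s: float, recovery_s: float) -> bool:
    """True if the retry window is at least as long as the recovery time."""
    return retry_window_s(attempts, interval_s) >= recovery_s


if __name__ == "__main__":
    recovery_s = 10.0  # hypothetical reroute/recovery time, for illustration only
    for attempts, interval_s in [(3, 5.0), (5, 3.0), (3, 0.2)]:
        window = retry_window_s(attempts, interval_s)
        ok = outlasts_recovery(attempts, interval_s, recovery_s)
        print(f"{attempts} x {interval_s}s = {window:.1f}s window, outlasts recovery: {ok}")
```

Under that assumed recovery time, 3x5 and 5x3 both outlast the recovery, while three retries at 200 milliseconds give up long before it completes.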
A retransmit to one SBRC front-end application activates the RADIUS packet cache, which short-circuits the processing of the in-process or recently processed transactions. This means that a server receiving a second request generally returns the result immediately (with less overhead) without repeating the work, if the result has been returned already.
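The effect of a packet cache can be illustrated with a generic sketch. This is not SBRC's implementation: the class name `PacketCache`, the 30-second lifetime, and the choice of key (client address, client port, RADIUS Identifier, Request Authenticator) are assumptions made for illustration, and requests that are still in process are not modeled.

```python
import time
from typing import Callable, Dict, Tuple

# Assumed duplicate-detection key: (client IP, client port, Identifier,
# Request Authenticator). A real server may use a different key.
RequestKey = Tuple[str, int, int, bytes]


class PacketCache:
    """Remember recent replies so a retransmit does not repeat the work."""

    def __init__(self, ttl_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s      # hypothetical cache lifetime
        self._clock = clock      # injectable so the cache can be tested
        self._replies: Dict[RequestKey, Tuple[float, bytes]] = {}

    def handle(self, key: RequestKey, process: Callable[[], bytes]) -> bytes:
        now = self._clock()
        cached = self._replies.get(key)
        if cached is not None and now - cached[0] < self._ttl_s:
            return cached[1]     # retransmit: return the earlier reply
        reply = process()        # first sighting (or expired): do the work
        self._replies[key] = (now, reply)
        return reply
```

A retransmit that arrives at the same front end hits the cache and costs almost nothing. The same request sent to a different front end has no cache entry there, so the full work is repeated.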
A retransmit interval in the hundreds of milliseconds conflicts with even the most aggressive failure recovery settings. It turns what would otherwise be a recoverable failure scenario (for example, the outage of part of a cluster, or the loss of connectivity on one network adapter) into an unrecoverable one as far as the customer experience is concerned. Such short intervals also cause the system to do more work, because the same transaction might be attempted on multiple front-end applications, and conflict resolution adds overhead with each retry.
Three to four retries of 1 to 2 seconds each before trying a second front-end application (and likewise before trying a third) is probably safe under most circumstances, given sub-millisecond network latency and back-end database latency of a few milliseconds. However, lower values are entirely unsupported. Larger values are better, up to the point where a failure would seriously impact the user. In any event, frequent failures of this kind should be diagnosed.
In summary, the retry settings are a function of the system resilience and should be set higher rather than lower. Low settings can make a transient problem appear to be a more permanent one, for instance by incorrectly failing an authentication, and can allow a failure situation to become worse in the process.
Thus, three retries of 5 seconds each are nominal, human-level speeds that will permit wide latitude for successful failover scenarios.