The performance of a transaction processing system such as SBRC, and each component of a system, can be characterized by various metrics. Using these metrics, performance can be classified under the following areas of analysis:
Throughput—Calculated based on the rate of system input and output. Often referred to as bandwidth, throughput is measured in work items per second.
Utilization—Calculated based on the impact of a given throughput on the internal components of a system, which can be in terms of CPU utilization, I/O utilization, or memory utilization. These metrics are used for scalability analysis to determine which components of the system are the most impacted for a given workload.
Latency—Measured as the difference in time between input and the corresponding output.
Throughput and latency are often linked through memory utilization (or, in some cases, thread utilization), because memory acts as a buffer that stores the state of transactions waiting to be processed. For example, consider a processor that is idle for either 1 or 10 seconds before returning a result (such as when the user has entered an invalid password). The system does the same work in either case, but the result arrives either 1 or 10 seconds after the request. Throughput and CPU utilization remain the same, while latency increases in the 10-second case. While the transaction is outstanding, the system must store information about the work that remains to be done when the transaction becomes active again, which increases the amount of memory used to buffer that information while the transaction is idle.
Throughput and utilization are the key metrics used for scalability analysis and planning. Latency is not an important factor in analyzing the performance of SBRC itself, except in managing failure conditions at the RADIUS level. Latency is, however, a critical factor in analyzing the performance of external systems, and it directly impacts memory utilization and thread utilization levels. SBRC is affected by the latency of external systems such as the network connecting to back-ends or between SSR nodes, external authentication or accounting databases, and proxy targets.
Performance Theory: Threads
Before the advent of multitasking operating systems (OSs), a single CPU ran one program to handle one task. If the program required an input (disk I/O or user input), the CPU waited for the input and did not perform any other task.
With the advent of multitasking OSs, the concept of processes arose. Resources of a CPU are time-sliced so that multiple processes can run at approximately the same time. While one process is waiting for an input (such as disk or network I/O), another process performs another task. Each process is encapsulated, so that one process does not see the memory used by other processes without special system calls to share it.
In recent years, processes have been split into smaller units sharing the same memory space. The term used for such a unit is “thread,” also known as a lightweight process (LWP). Threads within one process run the same program instructions (although they may be executing at different places at the same time on different CPUs). Multiple threads execute simultaneously and manipulate the same memory, which can give rise to memory contention. (See Performance Theory on Contention for more information.)
A thread can be thought of as managing a part of the total state of the system. Because all threads in a process share the same memory space, passing items from one thread to another is quick and does not require inefficient cross-process calls, which would be needed when crossing a process boundary.
For example, a thread with all its state makes an external request for information it needs (perhaps from the network) and goes idle. Another thread reads efficiently from the network and dispatches a task back to the requester thread.
Three thread models are typically used in SBRC.
Dispatcher—One thread manages a resource and many threads wait for it to dispatch a request and return a result. This model is used, for example, in reading and writing to the network or the disk.
Buffering—If part of a task can complete later, and you do not need its result before continuing to process the rest of the transaction, that part of the task is placed in a queue (buffer) and the rest of the task continues. This is how logging works: multiple threads write to the log buffer in memory, and another thread drains the buffer and writes it to disk.
Worker thread—Worker threads are managed by a manager. When a thread needs some other thread to do a particular task (for example, handling a new authentication request), the authentication worker thread manager assigns the task to a thread in an idle state, prompting the thread to execute the authentication code. The thread executes the authentication code while keeping the existing state of the transaction. If the manager cannot identify an idle thread, it creates a new thread and assigns the task. This model is often used in SBRC in a variety of cases.
The problem with too many threads is that only a certain number of tasks can be completed on a system in a given period of time. Starting new threads takes time and memory, and many threads means that a lot of state is being allocated. Too many threads can lead to high memory and CPU overhead. This results in memory contention, not enough CPU available to do the actual work, and tasks that are not processed in time, which would not occur if the system had simply refused to start more threads and dropped the request, waiting for a retransmit.
The proper number of threads for an SBRC installation varies and depends on the amount of external latency each thread must manage. In a lab environment, you can utilize an entire SBRC machine (to the limits of the CPU or disk speeds) with only 200 threads doing normal user authentication and regular accounting. However, if you are using an external database that takes 200 ms to return, you might need 1000 threads or more to manage the work given that latency and the failover behavior.
Like other operating systems, both Linux and Solaris provide a method to set the relative priority of threads. Relative priority determines the size and frequency of the time slices allotted for executing individual threads. In certain cases, setting the priority improves performance. The overall range of priority values and the types of scheduler modes vary by OS.
On Solaris 10, there are 160 priorities (0 through 159). The two key ranges of thread priority are the SYS range (60 through 99), in which most OS threads run, including network and socket buffer handlers; and the RealTime (RT) range (100 through 159). In the SYS range, a thread's time quantum is effectively infinite: it runs until it is interrupted or goes idle. In the RT range, threads are scheduled more frequently and preempt SYS threads; they operate on a fixed time quantum, although they typically run until they explicitly return to an idle state. It is best to have only a few RT threads on a multi-CPU system, because starving SYS threads of CPU time may cause your network or disk to stop processing traffic into buffers for use by the rest of the system.
On Linux, the real-time priority range is 0 through 99, with the non-real-time “nice” range of -20 through 19 mapping into the priority range 100 through 139.
In SBRC, you can set three threads to RT priority (see Thread Control): the threads that receive network buffers from sockets, populate the packet cache, and dispatch the requests to other threads (the authentication, accounting, and proxy threads). These can be configured in the [Configuration] section of the radius.ini or Proxy.ini file, respectively. See the Juniper Networks Steel-Belted Radius Carrier Reference Guide for more information.
Locking Threads to a Processor
SBRC does not require that a thread be locked to a processor, because it depends primarily on external sources of data, such as the network and disk I/O, and few threads are critical to the performance of the system. The occasional processor drift will not impact the overall performance of SBRC.
However, for performance, ndbmtd does require locking threads to processors, which is relevant to SBRC with an SSR configuration. Locking mitigates latency during processing for very high memory-bandwidth workloads, because only a few threads do the bulk of the work. This is controlled by the LockExecuteThreadToCPU directive of the config.ini file, stored on each M node.
An increase in performance should only be relevant when the CPU for one thread on the ndbmtd node is within 20 percent of the maximum available CPU usage. On an M3000 with half the virtual CPUs disabled, prstat -L will report a thread at 20 percent out of the 25 percent maximum (which represents a fourth of the total cores on the M3000 processor). Other architectures vary radically. For instance, on Linux, “Irix” mode in top is similar to prstat’s default mode. See prstat for more information about the prstat and top utilities.
Performance Theory on Contention
On a uniprocessor system, you can analyze the system performance with only a few variables (CPU speed, network bandwidth, memory speed, and disk I/O bandwidth). You can look at the metrics of a running system (CPU utilization percentage, bits per second on the network, or bytes per second to disk) and have a reliable idea of how much extra capacity will result in an overall performance increase. For example, if you are not constrained by the network bandwidth, memory speed, and disk I/O bandwidth, then doubling the CPU speed will give you approximately twice as much throughput, until you reach other bandwidth limits.
A well-known mode of contention is the “von Neumann bottleneck” (in the von Neumann architecture that most modern systems use): the fast CPU and memory subsystems are separated by a hardware bus of limited data rate. This bottleneck is mitigated by the addition of CPU caches at different levels of the hardware architecture. It is one source of memory contention.
With the advent of multiprocessing machines (with multiple CPUs, multiple cores per CPU, and even multiple virtual CPUs per core), what were single-use resources (such as a memory location, or waiting for the cache to fill from memory) can now be contended by several processors at once.
For example, two CPUs might want to operate on the same memory location at the same time. If both CPUs perform the operation simultaneously, then when the results are flushed from the CPU caches back to memory, the update from one CPU takes effect and the update from the other is lost. This leaves the two processors with different views of the underlying data. The same holds true for two processes attempting to update the same disk location, or for two database systems attempting to stay in sync while both update the same row.
There are two ways to handle this at a system level. The first is to force operations to occur in order. This is usually done by locking, or by using compare-and-swap (CAS) instructions and retrying if the value changes between the start and end of the CAS instruction. The second is to restructure the work into operations that can be applied out of order with a known end result, such as a transaction that can be rolled back and re-executed, or a single-instruction locked increment or decrement. Most systems (including SBRC) use locking, because you usually want to execute several operations against a consistent set of data, not just one.
A lock is an in-memory data structure that uses special instructions to tell the CPU to start the negotiation with other CPUs for locking, or to wait for a lock to be released. The lock instruction permits only one processor to execute a set of instructions, or operate against a single, consistent in-memory data structure at a time. Under the lock you read, calculate, and write the result, then release the lock so other processors can perform their tasks.
A novel way of dealing with contention (widely used in the cloud computing paradigm) is eventual consistency. That is, for certain operations that do not need to be entirely accurate on a read of a single node, optimizations can permit a system to perform multiple inaccurate operations that will eventually produce an accurate result. SBRC does not presently rely on these semantics.
Consider a case of a very large SMP (symmetric multiprocessing) machine with 128 virtual CPUs running a loop of 1000 instructions that must execute under a lock and 2000 instructions that do not. Processor 1 completes its 2000 unlocked instructions, then takes the lock, completes its 1000 locked instructions, and releases the lock. During this time, processor 2 completes its 2000 unlocked instructions, then takes the lock and completes its 1000 locked instructions. Processors 3 through 128 also complete their 2000 unlocked instructions and wait for processor 2 to release the lock so that processor 3 can proceed; meanwhile, processors 4 through 128, and now processor 1 again, are also waiting for the lock. In effect, operations under the lock are executed as if there were only one CPU on the system. This is why faster CPUs are generally preferable to more CPUs.
In most cases, a processor under lock completes a few instructions, releases the lock, and spends most of its time performing unlocked work or locking other minimally contended resources. While one processor holds the lock, other processors can do other tasks on other parts of the system, even if they are taking a different lock. However, you can still run into contention on small resources that are heavily used (for example, a global statistic that is incremented relatively often) and on long operations (for example, a locked read or write to a file system, or an update or insert of an SQL row by two SBRCs managing IP addresses on a cluster).
In SBRC, there are many different ways to manage contention. Locking is limited, in most cases, to short duration operations. For large operations—for example, during a search of an internally indexed table—you can do a certain number of operations, release the lock and respond to the OS with the task completion status so the OS can schedule other tasks, and then reacquire the lock and continue with the longer operation. This permits other threads that have queued up to run with a much shorter delay.
For the various SBRC use cases, you can run performance testing and attempt to optimize out the greatest sources of contention. One mitigation technique is caching. For profiles, SBRC can be configured to keep an in-memory cache of profile information from the database, since profiles are used more frequently than user information and there are relatively fewer profiles than users.
SBRC also uses sharding, or arranging large data structures so they can be split apart. The stripes of a RAID disk or NDB D nodes are examples of sharding. In the case of a four-D node system, two D nodes are mirrors of the other two for reliability, but the data is striped between the first two, such that every other row is reliably stored on one or the other D node (and mirrored to its pair). This means that when you are doing an insert, update, or get by the primary key, you are only involving one of the two D nodes. However, when a query is executed, all the D nodes are queried as to whether the row fitting the criteria exists and each D node must do some work to handle the query. Sharding makes some items much faster in a tradeoff that may increase the overall system performance.
On both Solaris and Linux (Intel), the optimal thread count is defined based on the following cases:
Case 1—When threads are not waiting for an external network event, a thread count of two to three times the (real, not virtual) core count limits the amount of contention each thread experiences. Examples of such workloads are the regular Standalone CST with local accounting to file, native user authentication, and Max-Acct-Threads in proxy cases with block=0. Enabling the appropriate flood queues permits SBR Carrier to respond to spikes in a timely manner.
Case 2—When threads are waiting for an external network event (such as authentication from or accounting to an external database, any case that uses database accessors, SSR cases, and proxy cases with block=1), the number of threads can be much higher, taking downstream latency into account.
In a case of 10 milliseconds round-trip latency (inclusive of downstream processing), each thread can handle at most slightly under 100 transactions per second. To handle 5,000 transactions per second, you would certainly need in excess of 50 threads, and probably closer to 100.
If the latency is 50 milliseconds in a proxy case, to handle 5,000 transactions a second, you need upwards of 1,000 threads.
In cases where there is an excessive number of threads, decreasing the WorkerThreadStackSize might be required to avoid reaching the total memory utilization limit.
Several cases are presented in the following sections.
On Linux (Intel), optimal CPU utilization is achieved within multiple cores of one CPU (for example, the Intel Xeon E7-4870). The penalty for off-CPU thread execution is at least 12% of the throughput. For basic work such as native user authentication or local accounting to file on a 10-core CPU, enabling an 11th core on a second CPU decreases throughput by almost 40% owing to off-chip access. Enabling a 12th core recovers to a 10% decrease from the 10-core performance. Enabling further cores decreases throughput mildly until adding more cores has no effect, but throughput never returns to the maximum achieved with 10 cores on one socket.
In the TTLS/TLS and proxy cases, enabling two sockets' worth of cores does give higher total throughput, with the increase coming from the addition of more sockets.
Case 1: More CPU versus More GHz per CPU
If you compare a 4-core (quad-core) machine running at 3 GHz (for a total of 12 GHz) with a 12-core machine running at 1 GHz per core (also a total of 12 GHz), you will generally see better performance on the machine with the higher GHz per core.
For example, consider the case for executing 1000 locked instructions. The machine running at 3 GHz executes these 1000 instructions three times faster than the machine running more CPUs at 1 GHz. The amount of time under lock will be one third, so another thread waiting for the resource will wait one-third of the time to do its locked work compared to the 1 GHz CPU.
Case 2: Virtual CPU versus Physical Cores
The virtual CPU mechanism (known as CoolThreads in the Oracle SPARC Solaris-based system, controlled by psrinfo and psradm; and Hyper-Threading in an Intel-based system), in the broadest terms, is a way of getting more actual work done while waiting (for example, for memory latency) in much the same way threads wait and store state for disk and I/O latency. The state of a thread, in terms of registers and localized memory, is kept in a Layer 1 cache on the physical CPU while it is waiting for the memory to respond with the values from a particular location. This means that the pipeline inside a physical core that does the actual processing is kept as full as possible, at a minimum (but not zero) overhead. Switching registers in and out of the Layer 1 cache also takes time and sometimes threads migrate to different virtual processors, which takes longer, when the OS decides that a particular processor is underutilized. Threads usually stay with one processor unless there is a good reason for the OS to move them.
On Linux, the overhead for thread processing (which is accounted for by %SY time) is somewhat higher than for Solaris for a similar workload.
Unless you are constrained by memory lookups and cache misses, you will not be able to utilize the extra time well. SBRC, as a highly multithreaded system in which much more time is spent waiting for disk and network resources, sees only slightly increased performance from utilizing the virtual CPUs of each core. When handling high-CPU-utilization authentication types (for example, high-encryption cases such as EAP-TTLS/TLS) with very limited user sets, you might not see a major improvement in performance (unless you get more total GHz in terms of cores).
In real-time transaction processing, especially in the ndbmtd threads, you might want to lock a particular thread to a physical CPU and not permit it to be swapped out for another thread, to avoid the associated latency. On both Solaris and Linux, the OS provides mechanisms to lock certain important threads to a virtual CPU.
In the case of ndbmtd, which is very real-time sensitive and optimized to use the CPU effectively, turning off the virtual CPUs gives a dramatic increase in performance as it decreases potential latency. Further locking the threads to certain CPUs gives an additional minor improvement in performance.
For NDB, you are presently limited to approximately eight cores (with no virtual CPUs), with the execute threads taking up two cores (such as the M3000 with quad core CPUs) or four cores (such as the M4000 with two quad core CPUs); additional cores beyond eight do not provide a benefit for NDB versions 7.1 and earlier. A future NDB version may provide additional threading capabilities.
In general, more total GHz from your cores (with or without virtual processors enabled) is better, whether they come from many slow cores or fewer fast cores, until you reach a particular contention threshold for your use case, at which time faster is better.
Performance Theory on Memory Data Structures
The two most common underlying memory data structures are stack and heap.
Stack—A stack is a continuous chunk of memory that is preallocated and utilized linearly. Every time a procedure call is made, more space is used on the stack to save the data from the registers and automatic storage. Recursive procedures, for example, are stack-intensive: each time a recursive procedure calls itself, the stack grows. In general, you must preallocate enough stack space for your workload. You must also limit the amount of stack memory for any given thread, so that with many threads the stacks do not consume too much memory and cause an out-of-memory error.
Stack memory does not leak. When you return from a procedure, the stack is unwound.
Heap—A heap is a pool of memory from which blocks of arbitrary size are allocated and freed on demand. On Solaris, libumem and libmtmalloc both do a good job of managing multiple threads allocating and freeing memory within the heap. On Linux, the default malloc offered by glibc is reasonably efficient, especially when the number of threads contending on allocation exceeds the number of CPUs.
However, additional efficiencies might be available with specially crafted memory allocation schemes. For instance, Professor Emery Berger's Hoard memory allocator, tcmalloc, and the commercially available Lockless, Inc. memory allocators all show promise for various use cases.
Heap memory can leak. If you allocate a piece of memory with malloc in C, or new in C++, and you lose the pointer to it by not storing it or never referencing it again, the memory remains allocated and unusable. The memory is reclaimed only when the process exits.
Memory used by a process is rarely returned to the system until the process exits. After you start SBRC and run load against it, the SBRC process takes more memory for stacks as the threads reach deeper into their call trees, until a steady state is reached and the amount of memory allocated is sufficient to handle the load.
An increase in memory usage is often normal. For instance, as you add sessions and if you do not delete them with an accounting stop, you will use more memory to store the session information.
However, sometimes memory use continues to increase linearly even in a steady-state test, often by just a few bytes per transaction. Sometimes it is even less, increasing perhaps only on an error condition; these cases are more difficult to detect. This is an indication of a memory leak. In this case, you must contact JTAC and the engineering team for support.
The mechanisms for detecting these sorts of leaks are diverse and often difficult to implement. Enabling the libumem “transaction” debug mode (or the equivalent for other allocation schemes), followed by obtaining multiple core files over a period of time, may be the only option in certain cases.
Virtual and physical memory—In all SBRC servers and SSR nodes, the working set of SBRC and the ndbmtd should fit well within main memory and should not be at risk of being swapped out to virtual paging due to other activities on the system. Main memory sizes that are smaller than the working set size of any process are not supported.