Understanding the Linux load average

I've seen many people (Computer Science majors included) using the Linux load average incorrectly when evaluating the performance of a Linux system or cluster. There are some misconceptions about what the values reported by the /proc/loadavg pseudo-file mean and how they are calculated. The most common misconception is that those values are percentages of how much the system is loaded. In this post, I'll try to summarize what these values actually mean, and how they should be used from a scientific point of view.

The proc filesystem is a pseudo-filesystem whose files provide an interface to kernel data structures, and is usually mounted as /proc. Instantaneous information about the system can be obtained from the files in this filesystem, and some can be changed by writing to these same files.

/proc/loadavg provides the current calculated average load of the system. When the file is read, it returns something like:

9.30 10.14 12.10 12/256 21132

These values represent the average system load in the last 1, 5 and 15 minutes, the number of active and total scheduling entities (tasks), and the PID of the last created process in the system.
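For example, a minimal Python sketch (assuming a Linux system with the proc filesystem mounted at /proc) that reads and unpacks these five fields could look like this:

#!/usr/bin/env python3
# Minimal sketch: read /proc/loadavg and unpack its five fields.

with open("/proc/loadavg") as f:
    fields = f.read().split()

load1, load5, load15 = (float(x) for x in fields[:3])    # 1, 5 and 15 minute averages
running, total = (int(x) for x in fields[3].split("/"))  # active/total scheduling entities
last_pid = int(fields[4])                                # PID of the last created process

print(f"load averages: {load1} {load5} {load15}")
print(f"tasks: {running} active of {total}, last PID {last_pid}")

Python's standard os.getloadavg() returns the first three values directly; reading the file itself is only needed to get the task counts and the last PID.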

It is worth noting that the current explanation in the proc(5) manual page (as of man-pages version 3.21, March 2009) is wrong. It describes the first number of the fourth field as the number of currently executing scheduling entities, and so implies it cannot be greater than the number of CPUs. That doesn't match the actual implementation, where this value is the current number of runnable threads.

The load of a Linux system is measured using the number of active scheduling entities, or tasks. Tasks are basically process threads. A task may be in one of several states, such as running, sleeping or blocked. "Running" tasks are tasks that are either actually executing on a CPU at that moment or waiting in the run queue; to avoid confusion, these are better called active or ready tasks.

[Image: GKrellM CPU meter]

The system has a number of CPUs available for servicing tasks. For the sake of simplicity, consider cores (in multi-core systems) and CPU threads (in hyper-threading systems) as individual CPUs. Each CPU may serve at most a single task at a time. Each CPU has only two possible instantaneous states: either it is idle, or it is processing. This is important because a lot of people misunderstand the information provided by common system monitoring software and think that CPU usage at any given instant is a percentage, when it is really this binary busy-or-idle state.

The first important system load metric available is the instant load. The instant load of a system is the number of tasks in the system task queue at the very instant it is measured. This is the first number of the fourth field of the contents returned by /proc/loadavg:

9.30 10.14 12.10 12/256 21132

Take into account that when this value is measured, the process reading it is itself counted as running. In other words, under normal conditions this value should never be less than one. Unless your monitoring process sits in a tight loop reading this value (which is a very bad idea), remember to subtract one from the result in order to ignore the activity of the monitoring process itself.

A good metric of instantaneous system load is the ratio of the number of active tasks to the number of available CPUs.
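A minimal Python sketch of that ratio, subtracting one to discount the reading process itself as noted above (os.cpu_count() is assumed here to give the number of available CPUs):

import os

# Sketch: instantaneous load ratio = (active tasks - 1) / number of CPUs.
# The "- 1" discounts this process, which is runnable while reading the file.
with open("/proc/loadavg") as f:
    active = int(f.read().split()[3].split("/")[0])

cpus = os.cpu_count() or 1
ratio = (active - 1) / cpus
print(f"{active - 1} active tasks on {cpus} CPUs: instant load ratio {ratio:.2f}")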

Generally speaking, if the number of active tasks is smaller than the number of CPUs available, the system is not using its full processing power, and more tasks may be activated in order to make use of it. If this number is equal to the number of CPUs available, the system is at its optimal load: all processors are continuously processing and all tasks are being serviced close to 100% of the time. If this number is greater than the number of CPUs available, the system is above its optimal load: tasks are no longer being serviced 100% of the time, so they take longer than needed to process the same amount of work.

It should be clear that the assertions above are highly hypothetical, describing a nonexistent "perfect" architecture. In practice, hyper-threading, cache invalidation, context switching and other issues will start to degrade system performance before the number of active tasks reaches the number of available CPUs.

The kernel calculates three averages of this number of active tasks, leading to the second important system load metric, known as the load averages. These values are recalculated periodically (every 5 seconds) using a moving average (or, more precisely, an exponential decay function) over their previous values and the current number of active tasks. They represent the average number of active tasks over the last 1, 5 and 15 minutes. Being exponentially-damped moving averages, recent measurements carry more weight than older ones.
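The kernel does this update with fixed-point arithmetic, but the idea can be sketched in floating point: every 5 seconds each average decays by a factor of e^(-5/T), with T = 60, 300 or 900 seconds, and the current number of active tasks fills in the rest. The Python below is an illustration of the recurrence, not the kernel's exact code:

import math

INTERVAL = 5.0                        # seconds between updates
PERIODS = (60.0, 300.0, 900.0)        # 1, 5 and 15 minutes
DECAY = [math.exp(-INTERVAL / p) for p in PERIODS]

def update(averages, active_tasks):
    # One 5-second tick: decay the old averages toward the current task count.
    return [avg * d + active_tasks * (1.0 - d)
            for avg, d in zip(averages, DECAY)]

# e.g. update([9.30, 10.14, 12.10], 12) advances the three averages by one tick.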

9.30 10.14 12.10 12/256 21132

These averages are much more useful for system workload and performance evaluation, but they need some caution when interpreted: it does not mean, for example, that after one minute with exactly one task in a tight loop the one-minute load average will be at 1.0. It takes more than three times that long for the one-minute average to stabilize near 1.0. The explanation is long, and there is some speculation about why Linux does things this way. A series of articles by Dr. Neil Gunther is a good read on this issue.
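To illustrate with the recurrence sketched above: starting from an idle system and keeping exactly one task runnable, the one-minute average reaches only about 0.63 after one minute, and needs a few more minutes to approach 1.0. A short, self-contained simulation (again a sketch, not kernel code):

import math

d = math.exp(-5.0 / 60.0)             # per-tick decay factor of the 1-minute average
avg = 0.0                             # system idle before the loop starts
for tick in range(1, 61):             # simulate 60 ticks of 5 s = 5 minutes
    avg = avg * d + 1 * (1.0 - d)     # one task constantly runnable
    if tick * 5 in (60, 180, 300):
        print(f"after {tick * 5:3d}s: 1-minute load average = {avg:.2f}")
# after  60s: 1-minute load average = 0.63
# after 180s: 1-minute load average = 0.95
# after 300s: 1-minute load average = 0.99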

Finally, a third system load metric is the CPU usage ratio. This is the metric usually displayed by common system monitoring software. It shows how much of a short time interval (usually one second) each CPU of the system spent processing. This value is usually expressed as a percentage, with 0% meaning an idle system and 100% a fully busy one.

The CPU usage ratio has very little use for system load and performance evaluation. Whether you have one thread or a hundred threads in tight loops on a single-CPU system, this metric will report 100% CPU usage in both cases, even though in the latter case the system is a hundred times more loaded.

This value is calculated from values obtained from /proc/stat. This file has this format:

cpu  32324408 289613 6872365 314122880 2353643 387624 3666716 0 0
cpu0 17308810 171514 3515221 150168483 1838418 385903 3613224 0 0
cpu1 15015597 118099 3357143 163954396 515225 1721 53492 0 0
...

The first line presents counters for global system CPU usage, while the following lines present counters for each individual CPU. Other system stats follow. All numbers represent the amount of CPU time in units of USER_HZ, which is usually 1/100th of a second. The numbers mean, in sequence:

  • user: Time spent executing user applications (user mode).
  • nice: Time spent executing user applications with low priority (nice).
  • system: Time spent executing system calls (system mode).
  • idle: Idle time.
  • iowait: Time waiting for I/O operations to complete.
  • irq: Time spent servicing interrupts.
  • softirq: Time spent servicing soft-interrupts.
  • steal, guest: Time related to virtualization (time stolen by the hypervisor and time spent running guest CPUs, respectively).

All these times are accumulated since system boot. In order to use these metrics, you have to collect at least two samples over an interval and subtract each previous value from its new instance. The sum of all the deltas is the total CPU time elapsed in the sampling interval. There are two common ways to calculate the CPU usage, both illustrated in the sketch below:

  • usage = (user + nice + system) / total
  • usage = (total - idle) / total
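As a rough Python sketch of both formulas, one can take two snapshots of the aggregate "cpu" line one second apart and work on the deltas (only the first four fields are unpacked here; the remaining ones only contribute to the total):

import time

def read_cpu_counters():
    # First line of /proc/stat is the aggregate "cpu" line; drop the label.
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

before = read_cpu_counters()
time.sleep(1)                                   # sampling interval
after = read_cpu_counters()

delta = [b - a for a, b in zip(before, after)]  # counters are cumulative since boot
user, nice, system, idle = delta[:4]
total = sum(delta)

print(f"(user+nice+system)/total: {100.0 * (user + nice + system) / total:.1f}%")
print(f"(total-idle)/total:       {100.0 * (total - idle) / total:.1f}%")

The two formulas differ mainly in how iowait, irq, softirq and steal time are counted: the first effectively treats them as idle, while the second counts everything except pure idle time as busy.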

These are the most common metrics used when evaluating performance on Linux systems.

More information
