Updated: July 17, 2019
Good system troubleshooting tools are everything. Great tools, though, are harder to find. Luckily, Linux comes with a wealth of excellent programs and utilities that let you profile, analyze and resolve system behavior problems, from application bottlenecks to misconfigurations and even bugs. It all starts with a tool that can grab the necessary metrics and give you the data you need.
Health-check is a neat program that can monitor and profile processes, so you can identify and resolve excess resource usage - or associated problems. Where it stands out from the crowd is that it aims to offer many useful facets of system data simultaneously, so you can more easily inspect your systems, troubleshoot performance issues and fix configuration mishaps in your environment. Rather than having to run five tools at the same time, or do five runs to get all the info you need, you just use health-check, and Bob's your distant relative. Good. All right, ready? Proceed.
Health-check in action
Before we run the utility in anger, a few small notes. One, you need sudo privileges to run this tool, although you can have the profiled application run in the context of another user on the system (with the -u flag). Two, you need some understanding of how Linux works to make use of the results - I have a whole bunch of articles on this topic, linked further below.
In essence, as I've outlined just a moment ago, health-check blends functionality from a variety of programs under one umbrella. It nicely combines elements that you would get if you ran netstat, lsof, vmstat, iostat, and examined various entries under /proc and /sys. This is somewhat like dstat, which combines the power of vmstat, iostat and ifstat. You can start with the simple run (-b flag for brief output):
sudo ./health-check -u "user" -b "binary"
There's going to be a lot of output, even in this "brief" mode, something like:
CPU usage (in terms of 1 CPU):
User: 34.24%, System: 13.30%, Total: 47.54% (high load)
First, you get basic CPU figures, normalized to a single core (100% = 1 full core). Health-check has internal thresholds by which it will indicate whether this is low, moderate or high load. This is just to give you a sense of what you should expect. The specifics will depend on the type of application and workload you're profiling. There will be differences between GUI and command-line tools, software that reads from a database versus software that does not, software with a large number of shared libraries, the use of hardware, and so on.
PID Process Minor/sec Major/sec Total/sec
1043 google-chrome 16156.46 0.25 16156.71
We talked about page faults at length in the past (links in the more reading section below). If you don't know what your application is supposed to be doing, the numbers won't tell you much on their own. But they can be very useful for comparative studies, like two different programs of the same type, or two different versions of the same program, or the same program running on two different platforms.
13687.53 context switches/sec (very high)
The context switch value indicates how often the kernel switches between tasks on the runqueue. For interactive processes (like the browser), you actually want frequent context switches - each task runs for as short a stretch as possible - because you don't want these tasks hogging the processor. Long uninterrupted computation is the hallmark of batch jobs. Here, having few context switches could be an indication of an issue with an interactive (GUI) application like the browser.
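The kernel exports these counters per process, so you can sanity-check health-check's figures yourself. A minimal sketch, using the current shell's PID ($$) as a stand-in for whatever process you are actually profiling:

```shell
# Cumulative context switches for a process, straight from /proc.
# Substitute the PID of the process you are profiling for $$.
grep ctxt /proc/$$/status
# voluntary_ctxt_switches:    the task yielded (blocked on I/O, slept)
# nonvoluntary_ctxt_switches: the scheduler preempted it
```

Sampling these counters twice and dividing the difference by the interval gives you the per-second rate that health-check reports.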
File I/O operations:
I/O Operations per second: 312.80 open, 283.49 close, 768.71 read,
The I/O values are useful if you have a baseline, and they also depend on the underlying I/O stack, including the hardware, the bus, the driver, the filesystem choice, and any other disk operations running at the same time.
Polling system call analysis:
google-chrome (1043), poll:
1555 immediate timed out calls with zero timeout (non-blocking peeks)
1 repeated timed out polled calls with non-zero timeouts
1125 repeated immediate timed out polled calls with zero timeouts
(heavy polling peeks)
This section is another indicator of possible user-interactiveness of the profiled binary. Polling system calls are system calls that wait for file descriptors to become ready to perform I/O operations. Typically, this will indicate network connections. We will examine this in more detail when we do a full run.
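If you want to see the individual polling calls rather than the aggregate counts, strace can capture them live. A sketch, assuming strace is installed and using the Chrome PID from the output above as a placeholder:

```shell
# Trace only the polling-family system calls of a running process.
# A flood of poll(..., timeout=0) lines matches health-check's
# "non-blocking peeks" diagnostic. Detach with Ctrl-C.
sudo strace -f -e trace=poll,ppoll,epoll_wait -p 1043
```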
Change in memory (K/second):
PID Process Type Size RSS PSS
1043 google-chrome Stack 32.51 27.59 27.59 (growing
1043 google-chrome Heap 67550.94 9092.51 9112.70 (growing very
1043 google-chrome Mapped 102764.33 27296.24 18781.80 (growing very
For most people and most application types, memory operations will not be that interesting. Memory-intensive workloads are not that usual in desktop software. They can be quite important for databases and complex computations, something you'd normally do on a server-class system. But this set of numbers can be used to examine differences between platforms, kernels, and software versions.
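You can pull the same RSS/PSS figures from the kernel directly. On kernels 4.14 and later, /proc/<pid>/smaps_rollup sums them for the whole process - again using the shell's own PID as a placeholder:

```shell
# Whole-process memory summary. Pss divides shared pages among all
# their users, which is why it is smaller than Rss for processes
# with many shared mappings (like the Chrome example above).
grep -E '^(Rss|Pss):' /proc/$$/smaps_rollup
```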
Open Network Connections:
PID Process Proto Send Receive Address
1043 google-chrome UNIX 531.52 K 35.16 K /run/user/1000/bus
1043 google-chrome UNIX 0.00 B 88.56 K /run/systemd/journal/...
1043 google-chrome 64.51 K 0.00 B socket:
1043 google-chrome 30.55 K 0.00 B socket:
1043 google-chrome 3.98 K 0.00 B socket:
The network connections set of numbers gives you results that are similar to what netstat and lsof do, but you also get the send/receive values, which can be quite useful. If you know what the program is meant to be doing network-wise, you can profile its execution, and look for possible misconfigurations in the network stack.
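Under the hood, this data comes from the process's file descriptor table and the kernel's socket tables, which you can cross-check by hand. A sketch, with the shell's PID standing in for the profiled process:

```shell
# Sockets appear in /proc/<pid>/fd as symlinks named socket:[inode];
# /proc/net/unix (and tcp, udp) map those inodes back to endpoints.
pid=$$    # substitute the PID you are profiling
ls -l /proc/$pid/fd 2>/dev/null | grep socket || echo "no sockets open"
head -3 /proc/net/unix
```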
Longer (detailed) run
You can also opt for more statistics (for instance, with the -c -f flags instead of -b). You will get extended results for each of the sections we've discussed earlier, and this can give you additional insight into how your software is behaving. If you're tracing child processes and forks, you can see the sequence of execution. CPU statistics will be listed by usage, with the highest offenders at the top.
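Putting those flags together, a detailed run looks like this (same placeholders as the brief example earlier; check health-check's help output for the exact flag semantics in your version):

```shell
# -c and -f replace -b: follow child processes and forks,
# and produce the extended (non-brief) report.
sudo ./health-check -u "user" -c -f "binary"
```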
CPU usage (in terms of 1 CPU):
PID Process USR% SYS% TOTAL% Duration
1715 vlc 47.04 8.17 55.21 14.71 (high load)
1720 vlc 46.91 7.96 54.87 14.43 (high load)
1723 vlc 46.77 7.96 54.74 14.35 (high load)
1721 vlc 1.69 1.08 2.77 1.21 (light load)
1722 vlc 0.20 0.07 0.27 0.16 (very light load)
1726 vlc 0.07 0.00 0.07 0.02 (very light load)
1742 vlc 0.00 0.00 0.00 0.07 (idle)
1732 vlc 0.00 0.00 0.00 0.02 (idle)
1728 vlc 0.00 0.00 0.00 0.06 (idle)
1719 vlc 0.00 0.00 0.00 0.06 (idle)
Total 971.80 161.72 1133.52 (CPU fully loaded)
In the example above, running VLC (with an HD clip playback of about 14 seconds), the Total row reports 1,133% of CPU time. That sounds like a lot, but this figure naively sums the percentages of all threads, including many short-lived ones. The three busy threads at roughly 55% each tell the real story: only about 1.5 of the system's eight cores (threads) were actually used for the video. It would also be interesting to know which cores were used.
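Health-check doesn't report core placement, but ps can: the PSR column shows which CPU each thread last ran on. A quick sketch, using the current shell as the target:

```shell
# One line per thread: process ID, thread ID, last CPU it ran on,
# CPU usage, and command name. Replace $$ with the PID of the
# process you are interested in.
ps -Lo pid,tid,psr,pcpu,comm -p $$
```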
PID Process Voluntary Involuntary Total
Ctxt Sw/Sec Ctxt Sw/Sec Ctxt Sw/Sec
1744 vlc 2500.09 1.15 2501.24 (high)
1723 vlc 1493.47 1.82 1495.29 (high)
1740 vlc 1224.03 3.31 1227.33 (high)
1717 vlc 947.43 0.40 947.84 (quite high)
1731 vlc 736.37 0.81 737.18 (quite high)
For page faults, there isn't anything new. With context switches, we also get a breakdown into voluntary and involuntary switches. The latter can be an indication of tasks exceeding their allocated timeslice, which would then result in them having a lower dynamic priority the next time they run (not good for interactive processes).
File I/O operations:
PID Process Count Op Filename
1715 vlc 176 R /home/roger/developers.webm
1715 vlc 48 C /etc/ld.so.cache
1715 vlc 48 O /etc/ld.so.cache
1715 vlc 34 R /usr/share/X11/locale/locale.alias
The File I/O section now also shows the number of operations per process, the type of operation, as well as the filename. This does not have to be an actual file on the disk - it could also be a socket or a bus. The available operations are printed at the bottom of this section. How exactly the read and write operations are done depends on many factors.
1715 vlc 1 C /lib/x86_64-linux-gnu/libnss_systemd.so.2
1715 vlc 1 OR /usr/bin/vlc
Op: O=Open, R=Read, W=Write, C=Close
You also get the frequency of these I/O operations:
File I/O Operations per second:
PID Process Open Close Read Write
1715 vlc 100.57 96.72 89.77 1.75
1719 vlc 3.24 4.99 3.04 0.00
The next section is all about system calls, and it's very detailed. The output is similar to strace. You will have the process ID, the process name, the system call, the count, rate, total time (in us), and the percentage each system call took out of the total execution time. You cannot interpret these numbers unless you know what the application is meant to be doing, or you can compare to a baseline.
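For comparison, strace's -c mode produces the same kind of count/time/percentage table. A sketch, assuming strace is installed and attaching to the same hypothetical VLC process:

```shell
# Attach to a running process and collect a per-syscall summary;
# press Ctrl-C to detach and print the table.
sudo strace -c -f -p "$(pidof -s vlc)"
```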
System calls traced:
PID Process Syscall Count Rate/Sec Total uSecs % Call Time
1715 vlc stat 429 28.9555 3004 0.0011
1715 vlc mmap 240 16.1989 4912 0.0017
1715 vlc mprotect 203 13.7015 4739 0.0017
The information is more useful when you look at polling system calls. For brevity, I've slightly edited the output below. The last four fields all indicate timeouts, e.g.: Zero Timeout, Minimum Timeout, etc. Essentially, they give you an indicator of how long it took for these system calls to finish. The Infinite count field refers to system calls that had an infinite wait (for the duration of the application run). The information is also shown as a histogram per process, from zero to infinite, bucketed logarithmically: up to 10 us, 10-99 us, 100-999 us, and so forth.
Top polling system calls:
PID Process Syscall Rate/Sec Inf Zero Min Max Avg
1715 vlc poll 3.2398 45 1 0.0 s 25.0 s 1.0 s
1715 vlc rt_sigtimed 0.1350 2 0 0.0 s 0.0 s 0.0 s
1717 vlc poll 124.7312 5 1 0.0 s 30.0 s 958.4 ms
The more detailed output log will also have the filesystem sync data.
PID fdatasync fsync sync syncfs total total (Rate)
1723 0 2 0 0 2 0.13
PID syscall # sync's filename
1723 fdatasync 1 /home/roger/.../vlc-qt-interface.conf.lock
1723 fdatasync 1 /home/roger/.../vlc-qt-interface.conf.XM1715
Lastly, the detailed output will also have memory and network connection information, but the main difference will be the results shown per process. As we've discussed earlier, the former will not typically be useful for most desktop workloads (unless you're the developer of the program), while the latter can be useful in finding issues in the network stack.
Health-check creates a pretty large set of results, but it does provide a lot of insight into how your applications behave. You can combine its usage with other software to get a full analysis of your software and troubleshoot any performance issues. Health-check can also profile running tasks (-p flag), which makes it quite handy as an addition to your problem solving toolbox.
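An attach-to-running-process invocation might look like this - pidof and the duration flag are assumptions on my part, so check health-check's help output for the exact options in your version:

```shell
# Profile an already-running browser for a fixed window of time;
# -d (duration, seconds) is assumed here, -p is the documented
# attach flag.
sudo ./health-check -d 60 -p "$(pidof -s google-chrome)"
```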
If you're not happy with the version available in the repos, you can compile manually. Another reason to do this is to work around any possible problems that older versions may have, like for instance the timer_stats error, whereby the tool tries to access /proc/timer_stats, but this interface is no longer exposed in recent kernels:
Cannot open /proc/timer_stats.
Indeed, if you check, you get:
cat: /proc/timer_stats: No such file or directory
To compile, run:
git clone https://kernel.ubuntu.com/git/cking/health-check.git/
You may see the following error:
json.h:25:10: fatal error: json-c/json.h: No such file or directory
This means you're missing the development package for JSON, which the tool needs to successfully compile. The actual name of the package will vary from one distribution to another, but in my test on Kubuntu, the following resolved the compilation error:
sudo apt-get install libjson-c-dev
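With the dependency in place, the full build sequence is short. This assumes the source tree uses a plain Makefile, which was the case in my test:

```shell
git clone https://kernel.ubuntu.com/git/cking/health-check.git
cd health-check
make
# optionally, if the Makefile provides an install target:
# sudo make install
```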
If you're interested in additional tools on system troubleshooting, then:
Linux system debugging super tutorial
Linux cool hacks - parts one through four - linking just the last one.
Last but not least, my problem solving book!
Health-check is a very useful, practical tool. It does not replace strace or netstat or perf, but it can sure help you get a very accurate multi-dimensional snapshot of whatever you're profiling. This is a very good first step that can point you in the right direction. You can then select a utility that specifically examines the relevant facet of the software run (maybe Wireshark for network or Valgrind for memory). In a way, this makes health-check a Jack of All Trades.
You do need some understanding of how Linux systems work - and the application you're running. But even if you don't have that knowledge, health-check can be used for comparative studies and troubleshooting of performance bottlenecks. If you know something isn't running quite as well as it should, you can trace it once on a good system, once on a bad (affected) system, and then compare the two. The many types of data that health-check provides will greatly assist in solving the issue. And that brings us to the end of this tutorial. With some luck, you have learned something new, and it was an enjoyable ride, too. Take care.