Updated: July 17, 2019
Good system troubleshooting tools are everything. Great tools, though, are harder to find. Luckily, Linux comes with a wealth of excellent programs and utilities that let you profile, analyze and resolve system behavior problems, from application bottlenecks to misconfigurations and even bugs. It all starts with a tool that can grab the necessary metrics and give you the data you need.
Health-check is a neat program that can monitor and profile processes, so you can identify and resolve excess resource usage - or associated problems. Where it stands out from the crowd is that it aims to offer many useful facets of system data simultaneously, so you can more easily examine your systems, troubleshoot performance issues and fix configuration mishaps in your environment. Rather than having to run five tools at the same time, or do five runs to get all the info you need, you just use health-check, and Bob's your distant relative. Good. All right, ready? Proceed.
Health-check in action
Before we run the utility in anger, a few small notes. One, you need sudo privileges to run this tool, although you can run the actual profiled application in the context of another user on the system (with the -u flag). Two, you need some understanding of how Linux works to make use of the results - I have a whole bunch of articles on this topic, linked further below.
In essence, as I've outlined just a moment ago, health-check blends functionality from a variety of programs under one umbrella. It nicely combines elements that you would get if you ran netstat, lsof, vmstat, iostat, and examined various entries under /proc and /sys. This is somewhat like dstat, which combines the power of vmstat, iostat and ifstat. You can start with the simple run (-b flag):
sudo ./health-check -u "user" -b "binary"
There's going to be a lot of output, even in this "brief" mode, something like:
CPU usage (in terms of 1 CPU):
User: 34.24%, System: 13.30%, Total: 47.54% (high load)
First, you get basic CPU figures, normalized per core (100% = 1 full core). Health-check has internal thresholds by which it will indicate whether this is low, moderate or high load. This is just to give you a sense of what you should expect. The specifics will depend on the type of application and workload you're profiling. There will be differences between GUI and command-line tools, software that reads from a database to one that does not, software with a large number of shared libraries, the use of hardware, etc.
Page Faults:
  PID   Process         Minor/sec   Major/sec   Total/sec
  1043  google-chrome   16156.46    0.25        16156.71
We talked about page faults at length in the past (links in the more reading section below). If you don't know what your application is supposed to be doing, the numbers won't tell you much on their own. But they can be very useful for comparative studies, like two different programs of the same type, or two different versions of the same program, or the same program running on two different platforms.
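If you want to grab comparable fault counters for such a study without running a full profile, the kernel keeps cumulative per-process totals in /proc. A minimal sketch, using the current shell's PID as a stand-in for the process you would actually examine (note: fields 10 and 12 of /proc/&lt;pid&gt;/stat are the cumulative minor and major fault counts):

```shell
# Cumulative minor/major page faults for a process, straight from the kernel.
# Caveat: field numbers shift if the process name (field 2) contains spaces.
pid=$$
awk '{printf "pid %s: minor faults=%s, major faults=%s\n", $1, $10, $12}' "/proc/$pid/stat"
```

Sample the counters before and after a workload (or on two systems) and diff the numbers for a quick comparative check.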
Context Switches:
13687.53 context switches/sec (very high)
The context switch value indicates how often the kernel switches between tasks on the runqueue. For interactive processes (like the browser), which have a user-facing component, you actually want plenty of context switches (each task runs for as short a time as possible), because you don't want these tasks hogging the processor; long, uninterrupted computation is the hallmark of batch jobs. Here, having few context switches could be an indication of an issue with an interactive (GUI) application like the browser.
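You can approximate a per-second figure like the one above without health-check, by sampling the kernel's cumulative counters twice. A small sketch, profiling the current shell as a placeholder target:

```shell
# Approximate a process's context-switch rate by sampling
# /proc/<pid>/status one second apart (voluntary + involuntary counters).
pid=$$
a=$(awk '/ctxt_switches/ {s += $2} END {print s}' "/proc/$pid/status")
sleep 1
b=$(awk '/ctxt_switches/ {s += $2} END {print s}' "/proc/$pid/status")
echo "$((b - a)) context switches/sec"
```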
File I/O operations:
I/O Operations per second: 312.80 open, 283.49 close, 768.71 read, 410.83 write
The I/O values are useful if you have a baseline, and they also depend on the underlying I/O stack, including the hardware, the bus, the driver, the filesystem choice, and any other disk operations running at the same time.
Polling system call analysis:
google-chrome (1043), poll:
  1555 immediate timed out calls with zero timeout (non-blocking peeks)
     1 repeated timed out polled calls with non-zero timeouts (light polling)
  1125 repeated immediate timed out polled calls with zero timeouts (heavy polling peeks)
This section is another indicator of possible interactivity in the profiled binary. Polling system calls are system calls that wait for file descriptors to become ready to perform I/O operations. Typically, this will indicate network connections. We will examine this in more detail when we do a full run.
Memory:
Change in memory (K/second):
  PID   Process         Type     Size        RSS        PSS
  1043  google-chrome   Stack    32.51       27.59      27.59     (growing moderately fast)
  1043  google-chrome   Heap     67550.94    9092.51    9112.70   (growing very fast)
  1043  google-chrome   Mapped   102764.33   27296.24   18781.80  (growing very fast)
For most people and most application types, memory operations will not be that interesting. Memory-intensive workloads are not that common in desktop software. They can be quite important for databases and complex computations, something you'd normally run on a server-class system. But this set of numbers can be used to examine differences between platforms, kernels, and software versions.
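If you want to sanity-check the RSS/PSS figures, the kernel exposes per-process summed values directly, in /proc/&lt;pid&gt;/smaps_rollup (available since kernel 4.14). A quick sketch, again using the current shell as a stand-in:

```shell
# Summed RSS and PSS for one process, in kB, as accounted by the kernel.
# Requires Linux 4.14 or later (older kernels lack smaps_rollup).
pid=$$
grep -E '^(Rss|Pss):' "/proc/$pid/smaps_rollup"
```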
Open Network Connections:
  PID   Process         Proto   Send       Receive   Address
  1043  google-chrome   UNIX    531.52 K   35.16 K   /run/user/1000/bus
  1043  google-chrome   UNIX    0.00 B     88.56 K   /run/systemd/journal/...
  1043  google-chrome           64.51 K    0.00 B    socket:[2737924]
  1043  google-chrome           30.55 K    0.00 B    socket:[2746558]
  1043  google-chrome           3.98 K     0.00 B    socket:[2742865]
  ...
The network connections set of numbers gives you results that are similar to what netstat and lsof do, but you also get the send/receive values, which can be quite useful. If you know what the program is meant to be doing network-wise, you can profile its execution, and look for possible misconfigurations in the network stack.
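You can cross-check the UNIX-socket rows against the kernel's own table in /proc/net/unix, which lists every UNIX domain socket on the system, with the bound path (where one exists) in the last column. A small sketch that tallies the most common socket paths:

```shell
# Most frequently bound UNIX socket paths on the system.
# /proc/net/unix columns: Num RefCount Protocol Flags Type St Inode Path;
# the path (field 8) is only present for named sockets, hence NF >= 8.
awk 'NR > 1 && NF >= 8 {print $8}' /proc/net/unix | sort | uniq -c | sort -rn | head
```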
Longer (detailed) run
You can also opt for more statistics (for instance, with the -c -f flags, without -b). You will get extended results for each of the sections we've discussed earlier, and this can give you additional insight into how your software is behaving. If you're tracing child processes and forks, then you can see the sequence of execution. CPU statistics will be listed by usage, with the highest offenders at the top.
CPU usage (in terms of 1 CPU):
  PID    Process   USR%     SYS%     TOTAL%    Duration
  1715   vlc       47.04    8.17     55.21     14.71     (high load)
  1720   vlc       46.91    7.96     54.87     14.43     (high load)
  1723   vlc       46.77    7.96     54.74     14.35     (high load)
  ...
  1721   vlc       1.69     1.08     2.77      1.21      (light load)
  1722   vlc       0.20     0.07     0.27      0.16      (very light load)
  1726   vlc       0.07     0.00     0.07      0.02      (very light load)
  1742   vlc       0.00     0.00     0.00      0.07      (idle)
  1732   vlc       0.00     0.00     0.00      0.02      (idle)
  1728   vlc       0.00     0.00     0.00      0.06      (idle)
  1719   vlc       0.00     0.00     0.00      0.06      (idle)
  Total            971.80   161.72   1133.52             (CPU fully loaded)
In the example above, running VLC (with an HD clip playback of about 14 seconds), we utilized 1,133% of CPU time, which naively translates into 11.33 CPU cores. That sounds like a lot, but the TOTAL% figures are summed regardless of how long each process actually ran; weight each one by its duration and divide by the roughly 14.7-second wall time, and only about 1.5 cores were effectively busy with the video. It would also be interesting to know which cores were used.
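As for which cores were used: health-check does not report that, but standard tools can get you close. The PSR column of ps shows the processor each thread last ran on. A sketch, using the current shell's PID as a placeholder for the vlc one:

```shell
# Show the CPU (PSR) each thread of a process last ran on,
# along with its per-thread CPU usage.
pid=$$
ps -Lo pid,tid,psr,pcpu -p "$pid"
```

Sampling this repeatedly during playback would give a rough picture of how the work spreads across cores.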
Context Switches:
  PID    Process   Voluntary     Involuntary   Total
                   Ctxt Sw/Sec   Ctxt Sw/Sec   Ctxt Sw/Sec
  1744   vlc       2500.09       1.15          2501.24       (high)
  1723   vlc       1493.47       1.82          1495.29       (high)
  1740   vlc       1224.03       3.31          1227.33       (high)
  1717   vlc       947.43        0.40          947.84        (quite high)
  1731   vlc       736.37        0.81          737.18        (quite high)
For page faults, there isn't anything new. With context switches, we also get the voluntary and involuntary counts. The latter can be an indication of tasks exceeding their allocated time slice, which would then result in them having a lower dynamic priority the next time they run (not good for interactive processes).
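The same voluntary/involuntary split is available per process under /proc, should you want to check a live task without a full profiling run (the current shell serves as the example target here):

```shell
# Cumulative voluntary and non-voluntary (involuntary) context
# switches for a process, as counted by the kernel.
pid=$$
grep -E '^(voluntary|nonvoluntary)_ctxt_switches' "/proc/$pid/status"
```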
File I/O operations:
  PID    Process   Count   Op   Filename
  1715   vlc       176     R    /home/roger/developers.webm
  1715   vlc       48      C    /etc/ld.so.cache
  1715   vlc       48      O    /etc/ld.so.cache
  1715   vlc       34      R    /usr/share/X11/locale/locale.alias
The File I/O section now also shows the number of operations per process, the type of operation, as well as the filename. This does not have to be an actual file on the disk; it could also be a bus. The available operations are printed at the bottom of this section. How exactly the read and write operations are done depends on many factors.
...
  1715   vlc       1       C    /lib/x86_64-linux-gnu/libnss_systemd.so.2
  1715   vlc       1       OR   /usr/bin/vlc
  Total            4352
Op: O=Open, R=Read, W=Write, C=Close
You also get the frequency of these I/O operations:
File I/O Operations per second:
  PID    Process   Open     Close   Read    Write
  1715   vlc       100.57   96.72   89.77   1.75
  1719   vlc       3.24     4.99    3.04    0.00
...
The next section is all about system calls, and it's very detailed. The output is similar to strace. You get the process ID, the process name, the system call, the count, the rate, the total time (in microseconds), and the percentage each system call took out of the total execution time. You cannot interpret these numbers unless you know what the application is meant to be doing, or unless you can compare against a baseline.
System calls traced:
  PID    Process   Syscall    Count   Rate/Sec   Total uSecs   % Call Time
  1715   vlc       stat       429     28.9555    3004          0.0011
  1715   vlc       mmap       240     16.1989    4912          0.0017
  1715   vlc       mprotect   203     13.7015    4739          0.0017
...
The information is more useful when you look at polling system calls. For brevity, I've slightly edited the output below. The last four fields all indicate timeouts, e.g. Zero Timeout, Minimum Timeout, etc. Essentially, they give you an indicator of how long it took for these system calls to finish. The Infinite count field refers to system calls that had an infinite wait (for the duration of the application run). The information is also shown as a histogram per process, from zero to infinite, bucketed logarithmically: up to 10 us, 10-99 us, 100-999 us, and so forth.
Top polling system calls:
  PID    Process   Syscall       Rate/Sec   Inf   Zero   Min     Max      Avg
  1715   vlc       poll          3.2398     45    1      0.0 s   25.0 s   1.0 s
  1715   vlc       rt_sigtimed   0.1350     2     0      0.0 s   0.0 s    0.0 s
  1717   vlc       poll          124.7312   5     1      0.0 s   30.0 s   958.4 ms
...
The more detailed output log will also have the filesystem sync data.
Filesystem Syncs:
  PID    fdatasync   fsync   sync   syncfs   total   total (Rate)
  1723   0           2       0      0        2       0.13
Files Sync'd:
  PID    syscall     # sync's   filename
  1723   fdatasync   1          /home/roger/.../vlc-qt-interface.conf.lock
  1723   fdatasync   1          /home/roger/.../vlc-qt-interface.conf.XM1715
Lastly, the detailed output will also have memory and network connection information, but the main difference will be the results shown per process. As we've discussed earlier, the former will not typically be useful for most desktop workloads (unless you're the developer of the program), while the latter can be useful in finding issues in the network stack.
Health-check creates a pretty large set of results, but it does provide a lot of insight into how your applications behave. You can combine its usage with other software to get a full analysis of your software and troubleshoot any performance issues. Health-check can also profile running tasks (-p flag), which makes it quite handy as an addition to your problem solving toolbox.
Manual setup
If you're not happy with the version available in the repos, you can compile manually. Another reason to do this is to work around any possible problems that older versions may have, like for instance the timer_stats error, whereby the tool tries to access /proc/timer_stats, but this file is no longer exposed in recent kernels:
Cannot open /proc/timer_stats.
Indeed, if you check, you get:
cat /proc/timer_stats
cat: /proc/timer_stats: No such file or directory
To compile, run:
git clone https://kernel.ubuntu.com/git/cking/health-check.git/
cd health-check
make
You may see the following error:
json.h:25:10: fatal error: json-c/json.h: No such file or directory
#include <json-c/json.h>
This means you're missing the development package for JSON, which the tool needs to successfully compile. The actual name of the package will vary from one distribution to another, but in my test on Kubuntu, the following resolved the compilation error:
sudo apt-get install libjson-c-dev
More reading
If you're interested in additional tools on system troubleshooting, then:
Linux super-duper admin tools: strace
Linux super-duper admin tools: lsof
Linux super-duper admin tools: gdb
Slow system? Perf to the rescue!
Linux system debugging super tutorial
Linux cool hacks - parts one through four - linking just the last one.
Last but not least, my problem solving book!
Conclusion
Health-check is a very useful, practical tool. It does not replace strace or netstat or perf, but it can sure help you get a very accurate multi-dimensional snapshot of whatever you're profiling. This is a very good first step that can point you in the right direction. You can then select a utility that specifically examines the relevant facet of the software run (maybe Wireshark for network or Valgrind for memory). In a way, this makes health-check a Jack of All Trades.
You do need some understanding of how Linux systems work - and the application you're running. But even if you don't have that knowledge, health-check can be used for comparative studies and troubleshooting of performance bottlenecks. If you know something isn't running quite as well as it should, you can trace it once on a good system, once on a bad (affected) system, and then compare the two. The many types of data that health-check provides will greatly assist in solving the issue. And that brings us to the end of this tutorial. With some luck, you have learned something new, and it was an enjoyable ride, too. Take care.
Cheers.