Linux system debugging super tutorial

Updated: June 1, 2012

In the last two years, I have introduced you to a number of so-called super-duper system administration tools. We learned how to properly analyze a range of problems by carefully studying symptoms and then isolating and fixing the root cause. We learned about strace and lsof, both immensely powerful tools. We focused on system profiling and auditing. We also worshiped the omnipotent GNU Debugger (gdb). Now it is time to combine the powers of all these into one mega tutorial.

Today, we will use all of these tools, plus the slew of tips and tricks listed in my hacking guides, all three of them, to try to resolve a seemingly impossible problem. We will attack the issue from several angles, each time using a different tool, each time exposing a new facet of the phenomenon until we resolve it like champs. This is not going to be easy, but it will probably be the best, most comprehensive and most unique all-purpose system administration debugging guide you have read in a long while, if not ever. Did I say ever?

Teaser

The problem we have

All right, let's assume that the startup of applications is slow. You wait 5-10 seconds for application windows to launch. Sometimes it works fast, but you're not really sure why. Usually, the problems occur after you log out of your session or following a reboot. The most annoying thing is that if you try to fire up those same programs with sudo or as root, there are no startup delays and all seems to run fine. So how do you debug a problem like this one?

Rules!

If you want to be an awesome system debugger, then there are several would-be rules you must follow whenever you are trying to resolve a difficult problem. If it's something trivial you can fix in two minutes, fine, but if this is a brand new issue that has stumped you and your colleagues for more than an hour, then you should pause, rethink and be methodical about it.

Be methodical

Having a formulaic approach may portray you as a noob - or a very efficient man. You choose. You should have a strategy that is uniform, generic and all-encompassing for every problem that may come up. This means that you will be able to handle the situation whether you're fixing airplanes or hacking the kernel.

Always begin with simple tools

In theory, you can try to fix any problem with gdb right away, but that's not a smart use of either the tools at hand or your intellect and time. You should always begin with simple things. Check the system logs for weird errors. Check top to assess the basic system health. Make sure you're not out of memory. Examine the disk and CPU load.

Comparison to a healthy system

This may not always be applicable, but if you have a machine with an identical hardware and software configuration that does not exhibit the same problems, or two workload scenarios that produce different results, then you should compare between them to try to find what might be causing the bad system or the component to misbehave. This may not always be easy, but component search is the most powerful tool in statistical engineering and it applies well for mechanical parts as well as software.

Now, we're ready to dig in.

Symptoms

Our healthy system fires up xclock, our sample graphical tool, in less than a second. The bad system does the same, except it takes 10 times longer. So we know something is slowed down but not necessarily broken.

Xclock launched

Strace

At this stage, running strace seems like a good idea. Rather than tracing every system call through endless logs, we will only do the summary run, specified with -c flag. We want to see if there's anything significant in the tool's run.

Strace log, good system

Let's see what strace gives us:

Strace, good machine

Strace log, bad system

The same thing for the misbehaving case:

Strace, bad machine

Strace comparison

I've placed the two summaries side by side. The font may be a little small, but it's identical to what you've see above. We can see that there's a significant difference in the behavior of the two cases.

On the good system, the top four system calls are read, mprotect, open, and fstat. We get system calls in the range of several tens to several hundreds, with some errors, probably stemming from seeking files in the $PATH. Another important element is the execution time, which takes ~1-2 milliseconds for each of the four system calls.

On the bad system, the top hitters are open, munmap, mmap, and stat. Moreover, we have system calls in the range of thousands, approx. 10-20x times more than previously. We also have a huge number of open errors. The execution time is in the range of 8-12 miliseconds, which is approx. 10 times longer than before.

Strace comparison, zoomed

So we know something is wrong, but we don't know what exactly.

Ltrace

We will now use ltrace and perhaps discover something else. System calls shows a marked difference in the behavior, but we do not know what is causing the problem. Maybe we will be able to determine the issue by looking at libraries, which is exactly what ltrace does. Again, we will execute the tool with -c flag, to see the summary.

Ltrace log, good system

A whole new plate of information. The most significant library function is the creation of the xclock widget, which takes 20 milliseconds and almost 40% of the total execution time. Second on the list is the XftFontOpenName.

Ltrace, good machine

Ltrace log, bad system

Here, we see something else. The XftFontOpenName is the leading function with a total of 18 seconds of execution time. Then, we have the creation of the widget, with no less impressive 16 seconds. After that, nothing else matters.

Ltrace, bad system

Ltrace comparison

We can clearly see that there's something wrong with fonts, it seems. On the good system, the relevant library call took a fraction of the same, while on the bad system, it lasted almost 20 seconds. Now, if we cross-reference the strace logs with these two, we start to get a clearer picture. The long duration of the function responsible for creating the widget definitely explain the slowness we experience. Moreover, it aligns nicely with the huge number of system calls and errors on the bad system. We now have a clue.

Ltrace comparison

Additional analysis - with vmstat

Some people could decide they already have enough information - something seems to be wrong with fonts. But let's say you are still not 100% convinced and you want to learn more. Let's try using vmstat.

Good system

vmstat provides useful information on the memory and CPU usage. We have discussed the tool in the past, so I will not be repeating myself. Today, the fields of interests are CPU metrics in the last five columns, specifically the two labeled us and sy, as well as the cs and in fields under system.

The us and sy labels indicate the percentage of CPU usage by user and system, respectively. The cs and in labels refers to context switches and interrupts. The first metric tells us roughly how often the running process had to relinquish its place in the runqueue in favor of another waiting process. In other words, there was a context switch from one process to another. Interrupts indicates the number of times the running process had to access the underlying hardware, for example access the disk to read data from the filesystem, poll the screen or the keyboard, etc.

Let's see what happens when we launch the xclock program. Ignore the first line, as it display the average values since the last uptime. We can see that there's a brief spike of user CPU, also reflected in the high number of interrupts and context switches. xclock is launched by the time the fourth line is displayed.

vmstat, good machine

Bad system

Here, the situation is a little different. We can see a significant increase in user CPU for almost 5-7 seconds. However, there's less context switches than before and many more interrupts. What does this mean? We can see that the launching of xclock, which is an interactive application and should be therefore treated as an interactive process, is behaving like a computational (batch) process.

For interactive processes, we want as much responsiveness as possible, which means lots of context switches and little CPU activity so that we get dynamic priority from the scheduler and spend as little time thinking, giving control to the user. However, in our case, the process does fewer context switches than before and uses a lot more CPU, which is typical behavior for batch processes. This means there's some thinking going on. What is happening is that our system is reading fonts and creating the cache while we're waiting for xclock. This also explains the increase in IO activity and some wa CPU values, as we have to do this operation against the local filesystem.

vmstat, bad machine

Now, it is important to emphasize that this kind of analysis is meaningless if you do not know how your process is supposed to behave and what it's supposed to do. However, if you have an inkling of an idea, then even a boring list of numbers as output by vmstat could provide you with a wealth of information.

Problem resolution

I believe we have sufficient information to solve this. Something is wrong with fonts. The fact applications take so long to start probably indicates the font cache is being rebuilt every time, which in turn indicates the programs started by user cannot read the system font cache. This would also explain why sudo and root are not affected.

Now, I know I'm being smart in retrospect, after this problem was identified and debugged and the tutorial written at leisure with cocky style and smartness. But the thing is, all the pieces are there.

The application is slow. Strace logs indicate open errors. We could now run a full strace session and learn the exact paths of these errors. We would use -e flag against open system call, to reduce verbosity, and we would increase the string length using -s flag so we could see all of the information about failing paths. We would learn about fonts not being readable by the user.

But we do not need to do that. We cross-reference our information with ltrace, which points out the problem to be in the opening of fonts. This library call blocks the creation of the widget, which is why the applications appear to be slow in launching. vmstat runs provide additional information, which help narrow down the issue further.

And if we check the system font cache, located under /var/cache/fontconfig, we learn that fonts are created with 600 permissions, i.e. they are only readable by root. If we change the values to 644, we resolve the problem and all our applications start quickly. We also save disk space, as we do not require each user to create its own copy of fonts in their own home directory.

Font cache

lsof comes to rescue

Now, I will briefly demonstrate another problem. This one is completely different from the previous example. Moreover, in this case strace and ltrace are completely useless, because the process is stuck in a D state. We do not really know what might be wrong, but all of the tools above are of no help. Enter lsof.

OK, let's assume you have some Java process that is stuck and not responding. Now, Java is always ugly and comes with hideous exception traces, but we must debug the issue and the conventional tools do not provide any useful information. In fact, this is what you see with strace:

futex(0x562d0be8, FUTEX_WAIT, 18119, NULL

But if you check the same process with lsof, you get something like this:

java 11511 user 32u ipv4 1920022313 TCP rogerbox:44332->rogerbox:22122 (SYN_SENT)

Ta-dam. We can see our Java is using two IPv4 ports to communicate internally. It has sent a packet from one port to another, initiating a TCP connection. This is indicated by the SYN_SENT state, the first sequence in the three-point TCP handshake. For some reason, the server, listening on port 22122, is not responding.

At this stage, you may want to check why the packet is not being rejected or accepted by the server. Do we have a routing problem? On a local machine, this probably does not apply, but maybe the destination is somewhere else? Do we have a firewall problem? Can you connect to the relevant machine and port using telnet? At this point, you still do not have all the tools to fix the issue, but you are fully aware of what is causing Java to misbehave. For all practical purposes, the problem is solved.

Conclusion

I told you you will like this tutorial, and if you do not, then I have utterly failed you as a human being. But I think this article has it all - rational thinking, methodical approach to problem solving, multi-dimensional use of system debugging tools to create a clear picture of the issue, including comparing system calls and library functions between a good and a bad system, use of vmstat to understand the computational behavior of processes, and more. As a bonus, we also learn about lsof usefulness for network-related stuff.

From now on, your system administration will never be the same. Seemingly complex problems will become simple, if you stick to the discipline of sorting out the basic checklist first, then slowly advancing to intermediate and advance tools and methods. If you ever get stuck, you can always refer to my hacking guides, which detail some 30+ utilities and scripts that should help you in your journey into the heart of the system. Well, that would be all. More goodness coming soon.

Cheers.