Updated: April 15, 2009
Having found the available information on system analysis rather sparse, or written in a way that is of little use to anyone but the authors of the documents themselves, I have decided to write a series of articles on Linux system analysis, kernel crash debugging in particular.
Naturally, these articles will be highly specialized and will probably appeal to a very small group of my readers. However, this should not deter you from reading, even if you do not feel an immediate benefit from this topic. So, if you're interested in kernel crashes, either professionally or out of pure curiosity, then this series of articles might be of good use to you.
As always, the tutorials will be crisp and clear, step-by-step and accompanied by plenty of screenshots. There's also a downloadable PDF version of this tutorial available. However, please note the web article will always be the most up to date one.
The series will cover several crash collection tools, including LKCD and Kdump, collection of crashes on a local system and across the network and the analysis and interpretation of crash dumps using specialized utilities, like lcrash, crash and gdb. We will also talk about other useful tools, like strace, ltrace, cscope, objdump, and others.
Our first candidate is LKCD.
LKCD stands for Linux Kernel Crash Dump. This tool allows the Linux system to write the contents of its memory when a crash occurs, so that they can be later analyzed for the root cause of the crash.
Ideally, kernels never crash. In reality, crashes sometimes occur, for whatever reason. It is in the best interest of people using the plagued machines to be able to recover from the problem as quickly as possible while collecting as much data as possible. The most relevant piece of information for system administrators is the memory dump, taken at the moment of the kernel crash.
You won't notice LKCD in your daily work. Only when a kernel crash occurs will LKCD kick into action. The kernel crash may result from a kernel panic or an oops or it may be user-triggered. Whatever the case, this is when LKCD begins working, provided it has been configured correctly.
LKCD works in two stages:
This is the stage when the kernel crashes. Or more correctly, a crash is requested, either due to a panic, an oops or a user-triggered dump. When this happens, LKCD kicks into action, provided it has been enabled during the boot sequence.
LKCD copies the contents of the memory to a temporary storage device, called the dump device, which is usually a swap partition, but it may also be a dedicated crash dump collection partition.
After this stage is completed, the system is rebooted.
Once the system boots back online, LKCD is initiated. Different systems use different startup scripts for this. For instance, on a RedHat machine, LKCD is run by the /etc/rc.sysinit script.
Next, LKCD runs two commands. The first command is lkcd config, which we will review in more detail later. This command prepares the system for the next crash. The second command is lkcd save, which copies the crash dump data from its temporary storage on the dump device to the permanent storage directory, called the dump directory.
Along with the dump core, an analysis file and a map file are created and copied; we'll talk about these separately when we review the crash analysis.
A completion of this two-stage cycle signifies a successful LKCD crash dump.
Here's an illustration:
LKCD is a somewhat old utility. It may not work well with newer kernels.
All right, now that we know what we're talking about, let us set up and configure LKCD.
You will have to forgive me, but I will NOT demonstrate the LKCD installation. There are several reasons for this untactical evasion on my behalf. I do not expect you to forgive me, but I do hope you will listen to my points:
The LKCD installation requires kernel compilation. This is a lengthy and complex procedure that takes quite a bit of time. It is impossible to explain how LKCD can be installed without showing the entire kernel compilation in detail. For now, I will have to skip this step, but I promise you a tutorial on kernel compilation. Furthermore, the official LKCD documentation does cover this step. In fact, the supplied IBM tutorial is rather good. However, like most advanced technical papers geared toward highly experienced system administrators, it lacks actual usage examples.
Therefore, I will assume you have a working system compiled with LKCD. So the big question is, what now? How do you use this thing?
This tutorial will try to answer the questions in a linear fashion, explaining how to configure LKCD for local and network dumping of the memory core.
Most home users will probably not be able to meet this demand. On the other hand, when you think about it, the collection and analysis of kernel crashes is something you will rarely do at home. For home users, kernel crashes, if they ever occur within the limited scope of desktop usage, are just an occasional nuisance, the open-source world equivalent of the BSOD.
However, if you're running a business, having your mission-critical systems go down can have a negative business impact. This means that you should be running the "right" kind of operating system in your work environment, configured to suit your needs.
LKCD dumps the system memory to a device. This device can be a local partition or a network server. We will discuss both options.
The host must have the lkcdutils package installed.
The LKCD configuration is located under /etc/sysconfig/dump. Back this up before making any changes! We will have to make several adjustments to this file before we can use LKCD. So let us begin.
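A minimal sketch of the backup step, assuming a plain copy next to the original is good enough. The function name and the .bak suffix are arbitrary choices made for this example; only the /etc/sysconfig/dump path comes from the text.

```shell
# Sketch: keep a copy of the LKCD configuration before editing it.
# backup_config and the .bak suffix are illustrative, not part of LKCD.
backup_config() {
    conf="${1:-/etc/sysconfig/dump}"
    [ -f "$conf" ] || { echo "no such file: $conf" >&2; return 1; }
    # -p preserves mode, ownership and timestamps where possible
    cp -p "$conf" "$conf.bak" && echo "$conf.bak"
}
```

Run as root, backup_config with no arguments copies the default file and prints the name of the copy.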
To be able to use LKCD when crashes occur, you must activate it.
You should be very careful when configuring this directive. If you choose the wrong device, its contents will be overwritten when a crash is saved to it, causing data loss.
Therefore, you must make sure that DUMPDEV is linked to the correct dump device. In most cases, this will be a swap partition, although you can use any block device whose contents you can afford to overwrite. Incidentally, this partially explains the somewhat nebulous and historic requirement for a swap partition to be 1.5x the size of RAM.
What you need to do is define a DUMPDEV device and then link it to a physical block device; for example, /dev/sdb1. Let's use the LKCD default, which calls for the DUMPDEV directive to be set to /dev/vmdump.
Now, please check that /dev/vmdump points to the right physical device. Example:
/dev/sda5 should be your swap partition or a disposable crash partition. If the symbolic link does not exist, LKCD will create one the first time it is run and will link /dev/vmdump to the first swap partition found in the /etc/fstab configuration file. Therefore, if you do not want to use the first swap partition, you will have to manually create a symbolic link for the device configured under the DUMPDEV directive.
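Creating the link by hand could look roughly like the sketch below. The helper name is made up for this example; /dev/vmdump and /dev/sda5 are the names used in the text, and you should substitute your own partition.

```shell
# Sketch: point the dump device link at a chosen partition and show the result.
# link_dump_device is an illustrative helper, not an LKCD command.
link_dump_device() {
    target="$1"                  # e.g. /dev/sda5 (swap or a disposable partition)
    link="${2:-/dev/vmdump}"     # default link name from the text
    # -f replaces an existing link, -n avoids following it if it points to a directory
    ln -sfn "$target" "$link" || return 1
    readlink "$link"             # prints the device the link now points to
}
```

For example, link_dump_device /dev/sda5 run as root should print /dev/sda5. Double-check the target first, since whatever is on that device will be overwritten by the next dump.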
This is where the memory images saved previously to the dump device will be copied and kept for later analysis. You should make sure the directory resides on a partition with enough free space to contain the memory image, especially if you're saving all of it. This means 2GB RAM = 2GB space or more.
In our example, we will use /tmp/dump. The default is set to /var/log/dump.
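The space requirement can be checked before the first crash ever happens. The sketch below compares the free space on the dump directory's filesystem with the installed RAM; the helper names are invented for this example, and the check assumes an uncompressed, full-memory dump.

```shell
# Sketch: is there at least as much free space as RAM in the dump directory?
# All function names here are illustrative.
enough_space() {                 # enough_space AVAILABLE_KB NEEDED_KB
    [ "$1" -ge "$2" ]
}
ram_kb() {                       # installed RAM in KB, from /proc/meminfo
    awk '/^MemTotal:/ {print $2}' /proc/meminfo
}
free_kb() {                      # free KB on the filesystem holding directory $1
    df -Pk "$1" | awk 'NR==2 {print $4}'
}
check_dump_dir() {
    dir="${1:-/tmp/dump}"
    if enough_space "$(free_kb "$dir")" "$(ram_kb)"; then
        echo "ok"
    else
        echo "too small"
    fi
}
```

Calling check_dump_dir /tmp/dump prints ok or too small; with compression enabled the real requirement may be lower.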
And a screenshot of the configuration file in action, just to make you feel comfortable:
This directive defines what part of the memory you wish to save. Bear in mind your space restrictions. However, the more you save, the better when it comes to analyzing the crash root cause.
The flags define what type of dump is going to be saved. For now, you need to know that there are two basic dump device types: local and network.
Later, we will also use the network option. For now, we need local.
You can keep the dumps uncompressed or use RLE or GZIP to compress them. It's up to you.
I would call the settings above the "must-have" set. You must make sure these directives are configured properly for the LKCD to function. Pay attention to the devices you intend to use for saving the crash dumps.
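Put together, the must-have set might look like the fragment below. Treat it as a sketch: DUMPDEV, /dev/vmdump, /tmp/dump and /var/log/dump come from the text, while the other directive names and all numeric values are assumptions based on the stock LKCD configuration file, so verify them against the comments in your own copy.

```shell
# /etc/sysconfig/dump, local dump essentials (sketch, values are assumptions)
DUMP_ACTIVE=1            # assumed name: 1 enables LKCD
DUMPDEV=/dev/vmdump      # link to the dump device (default from the text)
DUMPDIR=/tmp/dump        # assumed name: permanent storage; text default is /var/log/dump
DUMP_LEVEL=8             # assumed: 8 saves all memory
DUMP_FLAGS=0x80000000    # assumed: local block device
DUMP_COMPRESS=2          # assumed: 0 = none, 1 = RLE, 2 = GZIP
```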
There are several other directives listed in the configuration file. These other directives are all set to the configuration defaults. You can find a brief explanation on each below. If you find the section inadequate, please email me and I'll elaborate.
In general, we're ready to use LKCD. So let's do it.
The first step we need to do is enable the core dump capturing. In other words, we need to sort of source the configuration file so the LKCD utility can use the values set in it. This is done by running the lkcd config command, followed by lkcd query command, which allows you to see the configuration settings.
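As a sketch, the two commands can be wrapped in a small helper that refuses to run when the lkcd utility is missing. The helper name is made up; only the config and query sub-commands named above are used.

```shell
# Sketch: load the configuration into LKCD, then print what was loaded.
# apply_lkcd_config is an illustrative wrapper, not part of lkcdutils.
apply_lkcd_config() {
    command -v lkcd >/dev/null 2>&1 || { echo "lkcd not found" >&2; return 1; }
    # query only runs if config succeeded
    lkcd config && lkcd query
}
```

Run apply_lkcd_config as root after every change to /etc/sysconfig/dump.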
The output is as follows:
To work properly, the LKCD must run on boot. On RedHat machines, you can use the chkconfig utility to achieve this:
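A hedged sketch of the chkconfig step follows. The service name is deliberately left as a parameter, because the init script name depends on how LKCD was packaged on your system; check /etc/init.d for the actual name.

```shell
# Sketch: enable a service at boot and show its runlevel settings.
# enable_at_boot is an illustrative helper; the service name is yours to supply.
enable_at_boot() {
    service="$1"     # hypothetical: the init script name your LKCD package installs
    [ -n "$service" ] || { echo "usage: enable_at_boot SERVICE" >&2; return 1; }
    command -v chkconfig >/dev/null 2>&1 || { echo "chkconfig not found" >&2; return 1; }
    chkconfig "$service" on && chkconfig --list "$service"
}
```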
After the reboot, your machine is ready for crashing ... I mean crash dumping. We can begin testing the functionality. However ...
Disk-based dumping may not succeed in all panic situations. For instance, dumping on hung systems is a best-effort attempt. Furthermore, LKCD does not seem to like md RAID devices, which adds another problem to the equation. To overcome the potentially troublesome situations where you may end up with failed crash collections to local disks, you may want to consider using the network dumping option. So, before we demonstrate the LKCD functionality, we'll study the netdump option first.
The netdump procedure differs from the local dump in that two machines are involved in the process. One is the host itself that will suffer kernel crashes and whose memory image we want to collect and analyze. This is the client machine. The only difference from a host configured for local dump is that this machine will use another machine for storage of the crash dump.
The storage machine is the netdump server. Like any server, this host will run a service and listen on a port for incoming network traffic, particular to the LKCD netdump. When crashes are sent, they will be saved to the local block device on the server. The relationship between the netdump client and the server can also be described in terms of source and target, if you will: the client is the source, the machine that generates the information; the server is the target, the destination where the information is sent.
We will begin with the server configuration.
The server must have the following two packages installed: lkcdutils and lkcdutils-netdump-server.
The configuration file is the same one, located under /etc/sysconfig/dump. Again, back this file up before making any changes. Next, we will review the changes you need to make in the file for the netdump to work. Most of the directives will remain unchanged, so we'll take a look only at those specific to netdump procedure, on the server side.
This directive defines what kind of dump is going to be saved to the dump directory. Earlier, we used the local block device flag. Now, we need to change it. The appropriate flag for network dump is 0x40000000.
This is a new directive we have not seen or used before. This directive defines on which port the server should listen for incoming connections from hosts trying to send LKCD dumps. The default port is 6688. When configured, this directive effectively turns a host into a server - provided the relevant service is running, of course.
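On the server, the two changes described so far might look like this fragment. The flag value and the port number come from the text; the name SOURCE_PORT for the listening port is an assumption based on the stock configuration file, so check the comments in your own copy.

```shell
# /etc/sysconfig/dump on the netdump server (sketch)
DUMP_FLAGS=0x40000000    # network dump flag, as given in the text
SOURCE_PORT=6688         # assumed directive name; 6688 is the default port from the text
```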
This directive is extremely important. It defines the ability of the netdump service to write to the partitions / directories on the server. The netdump server runs as the netdump user. We need to make sure this user can write to the desired destination (dump) directory. In our case:
You may also want to ls the destination directory and check the owner:group. It should be netdump:dump. Example:
You may also try getting away with manually chowning and chmoding the destination to see what happens.
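The manual route could be sketched as below. The netdump:dump owner and group come from the text; the helper names and the 0750 mode are assumptions made for this example, and stat -c is the GNU coreutils syntax.

```shell
# Sketch: hand the dump directory to the netdump account and confirm the result.
# dir_owner and prepare_dump_dir are illustrative helpers.
dir_owner() {                    # prints owner:group of a path (GNU stat)
    stat -c '%U:%G' "$1"
}
prepare_dump_dir() {
    dir="${1:-/tmp/dump}"
    mkdir -p "$dir" &&
    chown netdump:dump "$dir" &&   # owner:group named in the text
    chmod 0750 "$dir" &&           # assumed mode: owner writes, group reads
    dir_owner "$dir"
}
```

Run prepare_dump_dir as root on the server; the last line of output should read netdump:dump.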
We need to configure the netdump service to run on startup. Using chkconfig to demonstrate:
Now, we need to start the server and check that it's running properly. This includes both checking the status and the network connections to see that the server is indeed listening on port 6688.
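For the network part of the check, a rough filter over netstat output is enough. The sketch below only looks for the port number in the address columns, so it cannot tell a listening socket from an established connection; the helper name is invented for this example.

```shell
# Sketch: look for a given port in "netstat -an" style output read from stdin.
# has_port is an illustrative helper.
has_port() {                     # has_port PORT
    # match ":PORT" or ".PORT" followed by whitespace, so 16688 does not match 6688
    grep -Eq "[:.]$1[[:space:]]"
}
# Usage on the server:
#   netstat -an | has_port 6688 && echo "port 6688 is in use"
```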
Everything seems to be in order. This concludes the server-side configurations.
The client is the machine (which can also be a server of some kind) that we want to collect kernel crashes for. When the kernel crashes for whatever reason on this machine, we want it to send its core to the netdump server. Again, we need to edit the /etc/sysconfig/dump configuration file. Once again, most of the directives are identical to previous configurations.
In fact, by changing just a few directives, a host configured to save local dumps can be converted for netdump.
Earlier, we configured our clients to dump their core to the /dev/vmdump device. However, network dump requires an active network interface. There are other considerations in place as well, but we will review them later.
The target host is the netdump server, as mentioned before. In our case, it's the server machine we configured above. To configure this directive - and the one after - we need to go back to our server and collect some information, the output from the ifconfig command, listing the IP address and the MAC address. For example:
Therefore, our target host directive is set to:
Alternatively, it is also possible to use hostnames, but this requires the use of hosts file, DNS, NIS or other name resolution mechanisms properly set and working.
If this directive is not set, the LKCD will send a broadcast to the entire neighborhood, possibly inducing a traffic load. In our case, we need to set this directive to the MAC address of our server:
Please note that the netdump functionality is currently limited to the same subnet that the server runs on. In our case, this means /24 subnet. We'll see an example for this shortly.
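A quick way to check the /24 restriction before configuring anything is to compare the first three octets of the two addresses. This is a sketch that only handles dotted-quad IPv4 and a /24 mask; the helper name and the addresses in the usage line are placeholders from the documentation ranges.

```shell
# Sketch: are two IPv4 addresses in the same /24 subnet?
# same_slash24 is an illustrative helper.
same_slash24() {                 # same_slash24 IP1 IP2
    # ${var%.*} strips the last dot and everything after it, i.e. the last octet
    [ "${1%.*}" = "${2%.*}" ]
}
# Usage:
#   same_slash24 192.0.2.10 192.0.2.200 && echo "same subnet"
```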
We need to set this option to what we configured earlier for our server. This means port 6688.
Lastly, we need to configure the port the client will use to send dumps over network. Again, the default is 6688.
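The client-side changes can be summarized in one fragment. This is a sketch under several assumptions: the directive names other than DUMPDEV are taken from the stock configuration file rather than from the text, the IP and MAC addresses are placeholders from the documentation ranges, and eth0 stands for whatever active interface your client uses.

```shell
# /etc/sysconfig/dump on the netdump client (sketch, names partly assumed)
DUMP_FLAGS=0x40000000          # assumption: the client needs the network flag too
DUMPDEV=eth0                   # an active network interface instead of /dev/vmdump
TARGET_HOST=192.0.2.10         # placeholder: IP address of the netdump server
ETH_ADDRESS=00:00:5E:00:53:01  # placeholder: MAC address of the netdump server
TARGET_PORT=6688               # port the server listens on
SOURCE_PORT=6688               # port the client sends from (default)
```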
And an image example:
This concludes the changes to the configuration file.
Perform the same steps we did during the local dump configuration: run the lkcd config and lkcd query commands and check the setup.
The output is as follows:
Once again, the usual procedure:
Start the utility by running the /etc/init.d/lkcd-netdump script.
Watch the console for a successful configuration message. Something like this:
This means you have successfully configured the client and can proceed to test the functionality.
To test the functionality, we will force a panic on our kernel. This is something you should be careful about doing, especially on your production systems. Make sure you back up all critical data before experimenting.
To be able to create panic, you will have to enable the System Request (SysRq) functionality on the desired clients, if it has not already been set:
And then force the panic:
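Both steps go through the standard kernel SysRq interface under /proc. The sketch below wraps them in a function that refuses to do anything unless you pass an explicit confirmation flag; the function name and the flag are invented for this example.

```shell
# Sketch: enable SysRq and force a crash. DESTRUCTIVE: the machine goes down.
# force_panic and its confirmation flag are illustrative, not part of LKCD.
force_panic() {
    [ "$1" = "--yes-crash-this-machine" ] || {
        echo "refusing: pass --yes-crash-this-machine to confirm" >&2
        return 1
    }
    echo 1 > /proc/sys/kernel/sysrq     # enable all SysRq functions
    echo c > /proc/sysrq-trigger        # 'c' asks the kernel to crash
}
```

Run it as root on a test client only, with all critical data backed up.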
Watch the console. The system should reboot after a while, indicating a successful recovery from the panic. Furthermore, you need to check the dump directory on the netdump server for the newly created core, indicating a successful network dump. Indeed, checking the destination directory, we can see the memory core was successfully saved. And now we can proceed to analyze it.
As mentioned before, the netdump functionality seems limited to the same subnet. Trying to send the dump to a machine on a different subnet results in an error (see screenshot below). I have tested this functionality for several different subnets, without success. If anyone has a solution, please email it to me.
Here's a screenshot:
LKCD is a very useful application, although it has its limitations.
On one hand, it provides the critical ability to perform in-depth forensics on crashed systems post-mortem. The netdump functionality is particularly useful in allowing system administrators to save memory images after kernel crashes without relying on the internal hard disk space or the hard disk configuration. This can be particularly useful for machines with very large RAM, when dumping the entire contents of the memory to local partitions might be problematic. Furthermore, since LKCD is unable to work with md partitions, the netdump functionality overcomes this problem and allows LKCD to be used on hosts configured with RAID.
However, the restriction to a single network segment severely limits the ability to mass-deploy netdump in large environments. It would be extremely useful if a workaround or patch were available so that centralized netdump servers could be used without relying on a specific network topology.
Lastly, LKCD is a somewhat old utility and might not work well on modern kernels. In general, it is fairly safe to say it has been replaced by the more flexible Kdump, which we will review in the next article.
This tutorial is a part of my Linux Kernel Crash Book. The book is available for free download, in PDF format. Please check the article linked in the image below for more details.