Updated: August 14, 2012
Here's a tricky tutorial. I've contemplated for a long while whether to write it at all, the chief reason being that troubleshooting hardware-related issues is probably the most difficult part of the domestic computer experience. Even highly experienced users will sometimes face a bumpy ride when trying to resolve a delicate, erratic, weird, not fully diagnosed mismatch between hardware and software. Hit the Web, and you'll find over 9,000 unrelated cases, dying forever alone in empty forum threads.
However, me being me, a highly pretentious geek self-deluded in my own importance and ability to write most cunningly, I will try to teach a handful of tips and methods that can help you understand, pinpoint and hopefully resolve hardware problems, on your Linux box no less. This tutorial is not guaranteed to be a 100% success, and the material will most likely be somewhat hard to follow, but you just might learn a few useful things. Follow me.
Before you even begin diagnosing hardware issues, it is important to align on expectations, as well as be fully aware of different types of hardware problems that you may encounter. Finally, you need to understand how hardware problems manifest.
A dead piece of electronic equipment is usually the simplest case. However, not always. If your power supply is gone, the machine will not turn on. A no-brainer. But then, you may have a bad graphics card, a bad audio card, or maybe a faulty memory stick. In this case, the system might get past the BIOS self-test and boot into the operating system, with varying degrees of success. In some cases, you may see the problem manifest, like the screen resolution reverting to a low setting because the graphics driver is no longer being used, or perhaps no sound when trying to listen to music. In some cases, the operating system may throw visible error messages.
This is probably the most difficult, most elusive type of problem. If you have a hardware part that is throwing errors only once in a while, you may not end up having sufficient data to correlate between these separate events and draw the right conclusion. Moreover, you may see several, seemingly unrelated symptoms affect your machine, even though they could be stemming from one source.
Some kinds of errors may not cause a functionality problem, but they may cause data corruption, performance degradation or other phenomena that you might blame on your operating system or software. For example, what do you do if there's a handful of bad cells in your memory stick, which might trigger segfaults in your browser when those cells are accessed and used? What if you experience a kernel crash that seems to blame some software, but it is in fact caused by a memory glitch or a bus error on the motherboard? A good example is the Wireless/laptop case issue I faced on an older T61 machine some three years back. Another classic one would be the case ground wiring issue on my monster gaming desktop.
Driver problems will usually appear similar to hardware malfunctions, although you may get a more consistent experience. Normally, you will hit the same problem in the software every time. In some cases, you will have bad drivers that won't communicate with the hardware at all. In others, you will be running a buggy driver that will cause your machine to misbehave in an unpredictable fashion. Sometimes, the problem may escalate into a total loss of functionality, like kernel crashes, black screens, white screens, all kinds of weird effects.
You must also realize that some systems will have locked-down BIOS preventing you from making full use of hardware components or features, while others may have these components disabled on purpose. We will discuss BIOS in a bit more detail later.
Finally, you may be using different hardware components specifically not designed to work together. Some vendors produce hardware with only certain operating systems in mind, thus you will never have official drivers for your software. Other hardware yet might be usable by trying substitute generic drivers, e.g. Lexmark printers and PCL drivers.
There are literally hundreds of ways you can approach any given hardware problem and try to resolve it. It is extremely easy to get lost or overwhelmed with Internet examples, which are almost always one man's woes. My sincere advice is that you approach every single hardware problem in a very methodical way in order to minimize false positives and distractions.
All right, let's assume you have a hardware problem. In reality, you may or may not have one, but we will work under the premise that you are convinced your hardware is buggy for some reason.
The first, most critical step would be to back up your data. You do not want to lose any precious personal stuff if your machine decides to go haywire at any moment, especially if you plan on tinkering.
The second step is to fully update your machine. In Linux, this means downloading all available system updates, which may include important firmware and driver fixes for your particular hardware. You may get a new kernel with better support for your hardware, too. For example, SSD TRIM commands are only available since kernel 2.6.33. Likewise, Sandy Bridge support is only available in more modern releases of various distros. Nvidia driver 290.XX might contain some extra features or critical fixes that were not available in earlier versions.
If you have a dead or dying or hiccuping piece of metal in your box, you might want to see whether there's some problem during the boot sequence. To that end, you should consult your distro's boot log. In most cases, boot logs are kept under /var/log and named boot.log or boot.msg or similar.
Don't randomly search for errors!
You may see some red failed messages and yellow warnings in the boot log. Ignore them for now. They may or may not be related to your problem, but the fact you see some should not distract you from what you're trying to do. You want to resolve a specific problem related to your hardware. For the time being, you should only look for errors that clearly mention your hardware in some way. Otherwise, skip and leave for later. In fact, in some cases, errors are perfectly normal and even expected.
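If you want to be strict about looking only at lines that mention your hardware, a tiny filter can help. The following is a minimal sketch, not a definitive tool: the function names lines_mentioning and read_log are my own invention, and the log path you feed it depends on your distro.

```python
import re


def lines_mentioning(log_text, keywords):
    """Return (line_number, line) pairs that mention any keyword.

    Matching is case-insensitive. The keywords are whatever identifies
    your hardware, for example a module or vendor name (hypothetical
    choices, pick your own).
    """
    if not keywords:
        return []
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if pattern.search(line)
    ]


def read_log(path):
    """Read a log file as text, replacing any undecodable bytes."""
    with open(path, errors="replace") as handle:
        return handle.read()
```

For instance, lines_mentioning(read_log("/var/log/boot.log"), ["nvidia", "snd"]) would return only the boot lines that mention those two names, assuming the file exists and you are allowed to read it.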
Another extremely valuable log is the kernel ring buffer. It is usually read using the dmesg command. Sometimes the log is also kept in a file of the same name under /var/log. This command will display all kernel messages in the buffer, some of which may also be written by the syslog facility to the standard system log, /var/log/messages.
Among other things, dmesg will display a wealth of hardware initialization messages, so you might want to look there for possible problems or conflicts. Again, there might be a ton of weird stuff written, so you should not read too deeply for now. However, what you do want to pay attention to are the names of modules and the hardware addresses, strings of numbers and letters delimited by colons.
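To make the hardware addresses easier to spot, here is a rough sketch that groups kernel messages by PCI address. It assumes the usual domain:bus:device.function notation, such as 0000:01:00.0, and the name group_by_pci_address is hypothetical. It takes the text of the log as input, so you decide where that text comes from.

```python
import re
from collections import defaultdict

# domain:bus:device.function, for example 0000:01:00.0
PCI_ADDRESS = re.compile(
    r"\b[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b"
)


def group_by_pci_address(kernel_log_text):
    """Map each PCI address found in the text to the lines mentioning it."""
    groups = defaultdict(list)
    for line in kernel_log_text.splitlines():
        # A set avoids adding the same line twice for a repeated address.
        for address in set(PCI_ADDRESS.findall(line)):
            groups[address.lower()].append(line)
    return dict(groups)
```

You could feed it the output of dmesg, captured for example with subprocess.run(["dmesg"], capture_output=True, text=True).stdout. On some systems, reading the kernel buffer requires root.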
In the example below, we can see the initialization of the Nvidia module, which also happens to taint the kernel as the module is not GPL-ed, and we have the initialization of the sound card.
We have seen lsmod used on numerous occasions before. This command reports which modules are loaded into the kernel and their use count. Before you dig deeper, you should check that you have rudimentary driver support for your device. For example, if you're wondering why your Nvidia card might not be working, please check that the driver is loaded, and that there are no conflicts. Now, as to why it may not be loaded, you might have to continue your education, but at least you will know at what stage the problem occurs.
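The list that lsmod prints comes from /proc/modules, so the same check can be scripted. Below is a minimal sketch that parses that file's text; parse_proc_modules and is_loaded are names I made up for illustration.

```python
def parse_proc_modules(text):
    """Parse the contents of /proc/modules into a dict keyed by module name.

    Each line holds: name, size, use count, dependent modules, state, address.
    """
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        name, size, use_count, used_by = fields[:4]
        modules[name] = {
            "size": int(size),
            # The count is "-" on kernels built without module unloading.
            "use_count": int(use_count) if use_count.isdigit() else None,
            # The dependents are comma separated, or "-" when there are none.
            "used_by": [] if used_by == "-" else [m for m in used_by.split(",") if m],
        }
    return modules


def is_loaded(modules, name):
    """Check for a module, treating dashes and underscores as equivalent."""
    return name.replace("-", "_") in modules
```

A quick check would then be is_loaded(parse_proc_modules(open("/proc/modules").read()), "nvidia"). If this returns False, you know the problem occurs before the driver ever gets into the kernel.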
Now we come to the really juicy part. The combination of boot messages and dmesg might give you some basic indication what might be wrong. But if they don't, you will want to look directly into the kernel structure and examine the loaded drivers.
In Linux, due to its monolithic architecture, components can be compiled directly into the kernel or made available as dynamically loadable modules. Modules that communicate with hardware are called drivers. In both cases, they eventually reside inside the kernel, which, for all practical purposes, is an abstract piece of software that the user cannot directly control.
However, partial indirect control is made possible by exposing some parts of kernel structures using pseudo-filesystems /proc and /sys. You can manipulate seemingly ordinary files to issue on-the-fly changes to kernel structures, causing a change in the behavior. The latter filesystem allows you to manipulate hardware as well as kernel modules.
Now, if you recall the numbers from before, we can now put them to some good use. We can browse through the directory tree under /sys/devices and examine the various hardware components connected to the listed interfaces.
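Browsing the tree by hand gets tedious, so here is a hedged sketch that does the walking for you. It assumes the common layout in which /sys/bus/pci/devices holds one entry per PCI address, each pointing into /sys/devices and carrying vendor, device and class files plus a driver symlink when a driver is bound. The function names are hypothetical.

```python
import os


def read_attribute(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


def list_pci_devices(sysfs_root="/sys"):
    """List PCI devices with their IDs and the name of the bound driver."""
    base = os.path.join(sysfs_root, "bus", "pci", "devices")
    try:
        addresses = sorted(os.listdir(base))
    except OSError:
        return []
    devices = []
    for address in addresses:
        device_dir = os.path.join(base, address)
        driver_link = os.path.join(device_dir, "driver")
        # The driver entry is a symlink; its target's last part is the name.
        driver = None
        if os.path.islink(driver_link):
            driver = os.path.basename(os.readlink(driver_link))
        devices.append({
            "address": address,
            "vendor": read_attribute(os.path.join(device_dir, "vendor")),
            "device": read_attribute(os.path.join(device_dir, "device")),
            "class": read_attribute(os.path.join(device_dir, "class")),
            "driver": driver,
        })
    return devices
```

A device whose driver comes back as None is one that the kernel sees but no driver has claimed, which is often a useful hint on its own.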
Some of the modules will have writable parameters that allow root to make changes to how the hardware behaves. For example, the USB5 device connected to the PCI slot on my LG laptop has a writable authorized parameter. If you write different values into this file, you will enable/disable access to the particular USB port.
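Before changing anything, it is safer to just look. The sketch below only reads the authorized files and never writes to them. It assumes USB devices are listed under /sys/bus/usb/devices and that the file contains 1 when the device is authorized and 0 when it is not. The function name usb_authorization is my own.

```python
import os


def usb_authorization(sysfs_root="/sys"):
    """Map each USB device name to True (authorized) or False (not)."""
    base = os.path.join(sysfs_root, "bus", "usb", "devices")
    try:
        names = sorted(os.listdir(base))
    except OSError:
        return {}
    states = {}
    for name in names:
        path = os.path.join(base, name, "authorized")
        try:
            with open(path) as handle:
                value = handle.read().strip()
        except OSError:
            # Entries without a readable authorized file are skipped.
            continue
        states[name] = value == "1"
    return states
```

Actually flipping the value means writing 0 or 1 into the file as root, which I deliberately left out of the sketch.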
Now, in practice, being able to navigate /sys takes a lot of experience and knowledge, more so when you are trying to debug hardware problems. You may also not be familiar with different parameters and values. However, you should be aware that /sys can provide a lot of useful information.
There's a simpler way of scanning through your connected hardware components and their corresponding drivers. The system command lspci will list all devices connected to the PCI bus, although you will see all devices, including legacy hardware.
The question is, where does lspci get all its information? Well, if you really want to know, we can run the tool with strace and find out. Not surprisingly, lspci scans the /sys tree for all connected devices, including the connection port, vendor ID, device type and class, etc.
Finally, lspci consults /usr/share/hwdata/pci.ids file containing a static list of hardware vendors, which translates vendor ID numbers into names, allowing you to see the human-readable entries in the lspci output.
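The ID-to-name translation is simple enough to sketch. The parser below assumes the usual pci.ids layout: vendor lines start with a four-digit hex ID followed by two spaces and the name, device lines are indented with one tab, subsystem lines with two tabs, and the class section at the end starts with lines beginning with "C ". The names parse_pci_ids and describe are hypothetical, and this is not how lspci itself is implemented.

```python
def parse_pci_ids(text):
    """Parse pci.ids text into {vendor_id: {"name": ..., "devices": {...}}}."""
    vendors = {}
    current = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        if line.startswith("C "):
            break  # the device class section follows the vendor list
        if line.startswith("\t\t"):
            continue  # subsystem entries are ignored in this sketch
        if line.startswith("\t"):
            if current is not None:
                device_id, _, name = line.strip().partition("  ")
                vendors[current]["devices"][device_id.lower()] = name.strip()
        else:
            vendor_id, _, name = line.partition("  ")
            current = vendor_id.lower()
            vendors[current] = {"name": name.strip(), "devices": {}}
    return vendors


def _normalize(hex_id):
    """Lowercase an ID and drop the 0x prefix used by sysfs files."""
    hex_id = hex_id.lower()
    return hex_id[2:] if hex_id.startswith("0x") else hex_id


def describe(vendors, vendor_id, device_id):
    """Return (vendor_name, device_name), with None for unknown IDs."""
    vendor = vendors.get(_normalize(vendor_id))
    if vendor is None:
        return (None, None)
    return (vendor["name"], vendor["devices"].get(_normalize(device_id)))
```

Since the vendor and device files under /sys hold values like 0x10de, you can pass them straight to describe after parsing /usr/share/hwdata/pci.ids, assuming your distro keeps the file there.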
Some distributions also have graphical frontends for the lspci command, allowing you to see your system information in a manner similar to Windows. But it is easier to consult the command line output, especially if you're debugging.
Last but not least, we can also consult the system log. Again, you should look for errors that are relevant to your particular issue. To demonstrate, let's insert a thumb drive and see what the system has to tell us. We can check the messages in real time by using the tail command.
Pay attention to the enumeration. In this particular case, we can see that the system recognizes the drive properly. However, this does not mean that we can use it. You may discover the drive is not auto-mounted, that you do not have permissions to use it, or any number of other problems. Or the thumb drive might be faulty. But you will know that the device is correctly identified by the kernel, so you can focus your efforts elsewhere.
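If you missed the moment and the messages have already scrolled by, a one-shot look at the end of the log works, too. This is a small sketch and not a live follow like tail: the function name last_matching_lines is hypothetical, and the log path depends on your distro.

```python
from collections import deque


def last_matching_lines(path, keyword, count=10):
    """Return the last `count` lines of a file that contain the keyword.

    Matching is case-insensitive. The file is read once, front to back,
    and only the most recent matches are kept in memory.
    """
    needle = keyword.lower()
    recent = deque(maxlen=count)
    with open(path, errors="replace") as handle:
        for line in handle:
            if needle in line.lower():
                recent.append(line.rstrip("\n"))
    return list(recent)
```

For example, last_matching_lines("/var/log/messages", "usb") would show the most recent USB-related entries, provided you have permission to read the file.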
Let's see a few cases where this knowledge can be put to some good use. A suspected case of a hardware/driver problem I encountered recently was with the Nvidia card in Ubuntu, where the system complained the driver was activated but not in use.
On a few occasions, I was unable to use my hardware, Wireless drivers to be more exact, because the relevant firmware was not included in the kernel due to licensing and ideological conflicts. This happened both with Debian and Trisquel distros.
If and only if you've exhausted all of the options above should you go about the Internet, prowling, searching for answers. However, while there's absolutely no guarantee that any of the tools mentioned earlier will give you any indication what the problem might be, there's a decent chance that they might.
Now, you should check online resources and compare to your problem. In most cases, you will be able to dismiss the irrelevant topics the moment you glance upon them. If someone faces the same error but different hardware, walk away. If someone uses a different flavor of Linux, look elsewhere. Do not focus on error messages. Focus on your hardware. Sometimes, multiple issues may narrow down to the same kernel errors, because after all, the number of messages the kernel can print is finite. You are bound to see similar symptoms caused by many different problems.
Some useful resources where you might find answers to your woes: Phoronix, where they be testing and benchmarking, but there's a forum, too; Linux drivers is a useful compilation portal; and there's the linuxquestions.org site, with some very decent stuff available.
This is entirely up to you. But if all else fails, you may want to flash your BIOS. In general, this procedure should be safe, but if it goes wrong, your box will turn into a brick. This is the main reason why I left the BIOS upgrade as the last resort. Naturally, you should make sure you've fully exhausted all other options, like testing hardware compatibility with other distributions or operating systems.
BIOS changes may also include enabling/disabling features, like FireWire, Bluetooth, RAID controllers, and other parts, which will directly impact how the operating system behaves and what hardware it can see or use.
The following articles are also quite important and should teach you much more about system management and administration. In particular, how to work with sources and compile kernel modules, how to change system behavior, how to trace problems, and more.
Linux system debugging super tutorial (see all my super-duper admin guides)
Linux hacking tutorial part 4 (another three parts waiting for you out there)
Linux hardware troubleshooting requires a fair amount of knowledge and familiarity with the command line and system messages. But then, the relatively high entry level comes with a comfortable degree of flexibility and useful information that may not be available on proprietary operating systems. Your ability to tinker with the kernel space can help with the diagnosis and resolution of hardware-related problems.
Today, you've sort of learned how to use a wide range of tools and utilities, and how to work methodically. You understand different types of problems, you can consult system logs, you know how to run lspci and lsmod. On top of that, we also dabbled a little into BIOS, drivers and system debugging. Hopefully, your next hardware woe will be much easier to fix. However, never forget that despite your best efforts, you may never solve the problem. Sometimes, it could just be bad hardware, as simple as that. Take it easy and have fun.