Updated: September 6, 2017
If you've been reading Dedoimedo for the past elevenish years, you know that I'm not too fond of security software. But I do test security products, just to see how they behave in the wider scope of things. Such a selfless practice also allows me to compare and judge software, especially against the golden benchmark of Windows security programs, the most glorious, useful and no-nonsense EMET. There.
Anyway, a few months ago, I ran a scan with MalwareBytes Anti-Malware (MBAM) on a Windows 7 machine, and in the middle of the scan, I had a Blue Screen of Death (BSOD). Not good. Upon recovery, I started the mother of all investigations. Follow me.
Before we start figuring out what happened, let me give you some more info. Otherwise, this whole exercise is pointless. First, BSOD. In general, Windows is extremely stable in the home environment - servers are a separate topic - and there's really no reason why you should ever experience a critical system crash due to an internal Windows kernel bug.
In approximately 15 years of heavy Windows use, spanning roughly a dozen different systems, I've had BSOD only a few times - caused by a graphics card overheating, caused by buggy graphics drivers, a sad episode we talked about at length, and once when I connected a smartphone via USB cable. That's about it.
Indeed, going back to my previous claim, the only time you should see a kernel crash is when you have hardware errors or faulty drivers, really, which aligns perfectly with my usage history. This also goes hand in hand with my rather extensive work experience with Linux, which I've documented in my Linux crash book. In essence, hardware, bad system calls, or pure kernel bugs. That's all there is, and the last are the least likely.
But now, I was facing an issue with ataport.sys driver - which is a Microsoft driver, and which should not be having any bugs. Then, do we have a hardware problem at hand? And if so, what kind? Still, before we rush forward, I'd like to draw your attention to the art of problem solving and how it must be done, slowly, carefully, methodically.
All right. Let's understand what happened. We know how to analyze BSOD, because I've shown you how to do that in my detailed BSOD guide. Using one of the available tools presented in the tutorial, we can analyze the crash dump. I opted for Nirsoft's BlueScreenView program. It turns out the crash was caused by:
And the specific error is KERNEL_DATA_INPAGE_ERROR. Checking some of the other arguments available in the memory core, this points to a hardware error during a kernel operation, which inevitably led to a system crash. So far so simple.
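If you want to go one step beyond the error name, the bug check parameters in the dump carry more detail. Here is a minimal sketch of how you could decode them, assuming the second parameter holds the I/O status code, which is how Microsoft's bug check reference describes the common case for this stop code. The function name and the lookup table are mine, the table is deliberately incomplete, and the sample value at the bottom is hypothetical, not the one from my crash.

```python
# Bug check code for KERNEL_DATA_INPAGE_ERROR.
KERNEL_DATA_INPAGE_ERROR = 0x0000007A

# A few NTSTATUS values that may show up as the second bug check parameter,
# with their usual interpretation. Assumption: this is a partial list,
# written from memory of Microsoft's bug check reference.
STATUS_HINTS = {
    0xC000000E: "STATUS_NO_SUCH_DEVICE: drive unavailable (disk, controller or configuration)",
    0xC000009A: "STATUS_INSUFFICIENT_RESOURCES: lack of nonpaged pool resources",
    0xC000009C: "STATUS_DEVICE_DATA_ERROR: bad blocks on the disk",
    0xC000009D: "STATUS_DEVICE_NOT_CONNECTED: loose cabling or controller cannot see the disk",
    0xC000016A: "STATUS_DISK_OPERATION_FAILED: bad blocks on the disk",
    0xC0000185: "STATUS_IO_DEVICE_ERROR: I/O device error (cabling, termination, controller)",
}


def describe_inpage_error(bug_check_code, param2):
    """Return a human-readable hint for the status code in parameter 2."""
    if bug_check_code != KERNEL_DATA_INPAGE_ERROR:
        raise ValueError(f"not a KERNEL_DATA_INPAGE_ERROR: 0x{bug_check_code:08X}")
    return STATUS_HINTS.get(param2, f"unrecognized status 0x{param2:08X}")


# Hypothetical parameter value, for illustration only.
print(describe_inpage_error(0x7A, 0xC000009D))
```

Most of these hints point at the storage stack rather than at Windows itself, which is why a crash like this sends you looking at disks, cables and controllers first.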
But why did the crash occur during a MBAM scan?
This is an interesting question. Indeed, we now need to figure out whether:
To ascertain whether claim a) or b) is correct, we need to repeat the circumstances of the crash. A few more details here: A Windows 7 machine, no extra security software other than EMET, with a mounted TrueCrypt container. This is important and relevant, because this is another piece of software that uses storage layer drivers and could potentially affect the outcome of this situation.
I ran a second scan - but this time the system did not crash, TrueCrypt container and all. However, I did see an error in the Event Log, an Event ID 11 reading:
The driver detected a controller error on \Device\Ide\IdePort1.
I have never seen this type of error before, and the timing with the MBAM scan is quite interesting. Now that we have additional info at hand, we must figure out the next step in our investigation.
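The timing question can be checked rather than eyeballed. Below is a small sketch, assuming you have exported the relevant System log entries and noted when each scan started and finished. The tuple layout, the function names and all the timestamps are hypothetical. Only the Event ID and the message text come from the actual log entry.

```python
from datetime import datetime


def controller_errors(events, event_id=11):
    """Keep only the events that match the ID and mention a controller error."""
    return [e for e in events if e[1] == event_id and "controller error" in e[2]]


def events_during(events, windows):
    """Return the events whose timestamp falls inside any (start, end) window."""
    hits = []
    for ts, event_id, message in events:
        if any(start <= ts <= end for start, end in windows):
            hits.append((ts, event_id, message))
    return hits


# Hypothetical data: (timestamp, event ID, message) per log entry.
events = [
    (datetime(2017, 3, 1, 10, 15), 11,
     "The driver detected a controller error on \\Device\\Ide\\IdePort1."),
    (datetime(2017, 3, 1, 10, 40), 7036, "Some unrelated service message."),
    (datetime(2017, 3, 4, 21, 5), 11,
     "The driver detected a controller error on \\Device\\Ide\\IdePort1."),
]

# Hypothetical scan windows: (start, end) of each MBAM run.
scans = [
    (datetime(2017, 3, 1, 10, 0), datetime(2017, 3, 1, 10, 30)),
    (datetime(2017, 3, 4, 21, 0), datetime(2017, 3, 4, 21, 20)),
]

errors = controller_errors(events)
inside = events_during(errors, scans)
print(f"{len(inside)} of {len(errors)} controller errors occurred during a scan")
```

If every controller error lands inside a scan window and none show up during a week of heavy I/O otherwise, the scanner becomes a far more likely suspect than the cable or the disk.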
If you search for this specific error, you will find a whole range of entries, from people asking the same question. In the end, it boils down to three major issues - bad SATA cable, bad disk, or bad chipset controller, which effectively means a new motherboard. Sounds quite sinister. However, the issue manifested only during a MBAM scan. And at this point, we need to understand if we're dealing with an isolated incident or a system problem.
I decided to examine the status of the machine on several levels, including hardware checks. Do note that you can never be 100% sure. Even if a specific check comes out clean, technically, your hardware can die the next day. Therefore, you cannot seek solace and guarantee in these checks. What they do tell you is that for the time being, in the best case, there are no symptoms that indicate a wider problem.
Even though I itched to do something, I decided not to do anything and let the computer run for a whole week, including stressful activities like - lots of games with heavy IO, data backups, system imaging, Windows Updates, and more. In all of these would-be tests, the system behaved predictably, in the same manner it had months and years earlier.
At this point, I decided that there is no imminent health issue with the hardware, at the very least, temporal statistics and superstition notwithstanding. This led me to believe that the problem was caused by MBAM. And it aligns well with my experience that it is never Windows that crashes, it's always something else that does it.
Now, we focus on this security software. Again, if you search online, you will find an endless amount of entries, with these major themes recurring: a) malware, like DUH, including some trojan that prevents security software from running and causes crashes of this type b) claims by users that MBAM 3 causes crashes whereas MBAM 2 did not c) that there might be an incompatibility with TrueCrypt that could lead to a BSOD.
I decided to test these hypotheses and see if they are true. First, regarding c), this may have been an issue once, but it certainly isn't anymore. In my experience, I always had TrueCrypt volumes mounted and they never caused a crash before.
Regarding a), this is nonsense, but I decided to follow a few quick recommendations to see if they check out. Of course they don't. Furthermore, MBAM did complete a subsequent scan without any BSOD but with the relevant controller errors, which narrows down the issue to the software.
Now, we can really search for relevant information. There is only a handful of entries in the official forum, and their story is very similar to mine. Several someones had a crash, and they figured out this was caused by a clash between MBAM and Intel Rapid Storage Technology (RST) drivers. Aha. I installed the latest supported drivers and rebooted. I discovered several things along the way:
And now, I ran a fresh new MBAM scan, and it completed without any issues. Nor were there any errors in the Event Log. So it does turn out that this was a problem caused by third-party software, and not Microsoft's fault, nor a hardware issue. P.S. Other MBAM folks reported the issue going away AFTER a MBAM version upgrade.
Let me add a disclaimer: as far as you can tell, the errors are not caused by Windows or the underlying platforms. But it's not within a user's reasonable ability to know this for certain, because on my old desktop, as you can read in my happy article on this topic, a disk with a spotless SMART data sheet died without any prior warning, while another, with a supposedly highly critical problem that should have announced imminent disk death, survived happily for many months after the error showed up. Statistics work well until they turn against you.
But for all practical purposes, right here, right then, it was a 100% MBAM issue, and nothing to do with hardware. A clash between a security scanner with low-level system privileges and storage drivers. The question is why?
Without access to MBAM code sources, you can't really know for sure, but I do have a theory. Malware tries to hide itself on the disk sometimes using all sorts of clever schemes. One way of doing it is to force the system to use bogus I/O drivers, so that they report a clean disk. This means that a malware scanner may not be able to trust the system's functions to return a true value.
I believe that MBAM implements its own disk access functions, including controller commands, as well as its own seek, read, unlink, and other system calls and functions. For some reason, one of these commands clashed with Intel's RST, triggering an I/O error, which the system interpreted as a controller error. Hence, the BSOD or the system event. This is my theory, and it may be wrong, but it makes sense.
Since you can't ever be sure that your systems won't fail, you should actually plan for their failure. In other words, embrace the painful moments, be prepared for when they happen so that you can recover quickly, with minimal losses. For instance, I always have spare hardware lying about, including at least 2-3 fresh hard disks.
I also diligently back up data and perform system images, so if there's a need to replace hardware, I can quickly go back to productivity. And so it happens, several months ago, there I was, facing this sudden disk failure on a different desktop. No worries. I was back in action within about an hour, without any loss of data or even system configuration.
Finally, I ranted about problem solving and how it should be done methodically. This is the essence of troubleshooting, especially hardware and software issues. If you do it correctly, you're less likely to make a mistake, saving time and money. Going about changing things blindly is never a good idea. I've got a whole book on this, too.
This was a tough, complicated problem, with multiple factors weighing in. At one point, we had possible hardware issues on several fronts, TrueCrypt, malware, and software bugs all vying for attention. Resolving this isn't easy. And yet, with slow, careful work, we were able to fully understand the problem, isolate the root cause, validate claims, and test potential resolutions. All without any drastic system changes.
I believe this is a valuable lesson. It happened to me, but I want you to take away the findings. Whenever something bad occurs, the Internet will blast you with garbage. Hardware, malware, take your pick. Everyone will have your problem, and yet, it will be ever so slightly different and not quite applicable. You can go mad this way. There is a strong temptation to try what all these people did. But the best way forward is to very carefully examine the symptoms, thoroughly analyze them and apply fixes, reversible, fully quantifiable fixes, starting with the simplest, least intrusive ones.
Which is how we went about this issue. Do we know what happened? Can we reproduce it? Does the newfound information make sense? Can we validate the possible causes? Can we disqualify some of them? With the ones we're left with, a fresh round of searches. New claims, new hypotheses, new checks. Clear, reproducible results. Resolution. Knowledge. Fun. I hope you enjoyed this. Remember, don't blame Windows, and be skeptical about what the Internet has to say about your hardware. It's all doom and gloom. But it does not have to be so. Happy computing.