Updated: April 2, 2018
My hardware arsenal, used for Linux distro testing, is quite varied. It includes some new machines as well as several relics. The oldest among them happens to be a 2009 LG RD510 dual-core box, the first one I ever bought for the sake of proper Linux testing. If you're in for a bit of a nostalgia trip, check the original report. Since then, it's had its uses, and recently, I brought it back into the game as the test mule.
The laptop comes with an old Nvidia 9600M GS card, and what's special about it is that no Linux distro installed on it has ever really been able to resume from suspend successfully. I've decided to tackle the issue with my recent set of distro checks, just to figure out why and what the underlying cause might be. For those coming through search engines - this article is the fix for suspend & resume problems in Linux on machines with Nvidia cards, caused by ATA link resets for devices running in AHCI mode. Now, shall we?
Press the power button, let the laptop slumber itself. Then, when it wakes, observe the errors that come up. In my case, a dark screen for the graphical console, and lots of scrolling text for the other virtual consoles, reading words like exception emask, ATA, read-only, filesystem error, inode, etc. Lots of textual garbage. Specifically:
exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
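If you want to fish these lines out of a long kernel log rather than squint at scrolling text, a small filter helps. Below is a minimal sketch, assuming the log has been saved or piped as plain text; the function name find_ata_exceptions is my own, and the pattern is modeled only on the single line shown above, so other kernels may format it differently.

```python
import re

# Pattern modeled on the error line quoted in the article, e.g.
# "exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen".
# Real kernel logs usually prefix it with a port name such as "ata1:",
# which search() tolerates.
EXCEPTION_RE = re.compile(
    r"exception Emask (0x[0-9a-fA-F]+) SAct (0x[0-9a-fA-F]+) "
    r"SErr (0x[0-9a-fA-F]+) action (0x[0-9a-fA-F]+)( frozen)?"
)


def find_ata_exceptions(log_text):
    """Return one dict per matching exception line found in log_text."""
    found = []
    for line in log_text.splitlines():
        match = EXCEPTION_RE.search(line)
        if match is None:
            continue
        emask, sact, serr, action, frozen = match.groups()
        found.append({
            "emask": int(emask, 16),
            "sact": int(sact, 16),
            "serr": int(serr, 16),
            "action": int(action, 16),
            "frozen": frozen is not None,
        })
    return found


if __name__ == "__main__":
    # The sample is the exact line from this article.
    sample = "exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen"
    for entry in find_ata_exceptions(sample):
        print(entry)
```

Feed it the text of your kernel log after a failed resume. If the list comes back empty, your suspend problem is probably a different one from mine.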
It takes a few quick seconds to search for some of the choicer keywords and come across 2007-2008 threads about a similar problem "resolved" with "won't fix" by letting it just go away. Like when you call tech support, they drag you over the coals, and when you get tired of their nonsense, they close the ticket saying "user no longer has the problem" or "user did not follow up" or similar crap.
On Nvidia forums and through various bugzillas, you will learn that in some cases, Nvidia and eSATA/SATA controllers may not support hotplug and such, but the Linux kernel assumes they do, and it may reset your ATA devices when the computer wakes, resulting in many a problem.
Several proposed workarounds you will find - which most likely will not solve your issue - include adding ACPI and queuing changes as boot parameters. Specifically, libata.noncq and libata.noacpi=1. If you're lucky, these could help, but I do not think you will see a proper solution this way. You can verify you have booted with the correct parameters by checking the kernel command line:
BOOT_IMAGE=/boot/vmlinuz-4.13.0-16-generic root=UUID=46f914cb-1a38-4db6-8831-6705ba3eca1f ro quiet splash libata.noncq libata.noacpi=1
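If you would rather not eyeball that line after every reboot, the check is easy to script. Here is a minimal sketch, assuming the running kernel exposes its command line in /proc/cmdline, as Linux normally does; the helper names are my own, and the sample string is the command line shown above.

```python
def missing_kernel_params(cmdline, expected):
    """Return the expected parameters that do not appear in cmdline."""
    present = set(cmdline.split())
    return [param for param in expected if param not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()


if __name__ == "__main__":
    # Sample taken from this article; on a live system, use read_cmdline().
    sample = (
        "BOOT_IMAGE=/boot/vmlinuz-4.13.0-16-generic "
        "root=UUID=46f914cb-1a38-4db6-8831-6705ba3eca1f "
        "ro quiet splash libata.noncq libata.noacpi=1"
    )
    wanted = ["libata.noncq", "libata.noacpi=1"]
    print("missing:", missing_kernel_params(sample, wanted))
```

The comparison is on whole tokens, so libata.noacpi=1 will not be mistaken for libata.noacpi=0. It only tells you the parameters reached the kernel, not that they did anything useful.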
All right, we know what the problem is, let's work around it.
Go into BIOS and change the hard disk controller interface from AHCI to IDE. This is not ideal, as AHCI offers various technological and practical performance advantages with modern hardware, but for the sake of this exercise, do it, boot your Linux and then check if the problem occurs again after you suspend and resume.
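Before you suspend, it's worth confirming the BIOS change actually took. One way, sketched below, is to ask sysfs which PCI devices are bound to the ahci driver; after a switch to IDE mode, I would expect the list to be empty for the disk controller. This assumes the usual /sys/bus/pci/drivers layout, and the function name is my own invention. Which legacy driver takes over instead depends on your chipset, so the sketch does not guess at it.

```python
import os
import re

# PCI addresses in sysfs look like "0000:00:1f.2" (domain:bus:device.function).
PCI_ADDRESS_RE = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")


def devices_bound_to(driver, drivers_root="/sys/bus/pci/drivers"):
    """Return sorted PCI addresses bound to the named driver.

    Returns an empty list if the driver directory does not exist,
    for instance when the driver is not loaded at all.
    """
    driver_dir = os.path.join(drivers_root, driver)
    if not os.path.isdir(driver_dir):
        return []
    # The directory also holds control files such as bind and unbind,
    # so keep only the entries that look like PCI addresses.
    return sorted(
        name for name in os.listdir(driver_dir)
        if PCI_ADDRESS_RE.match(name)
    )


if __name__ == "__main__":
    bound = devices_bound_to("ahci")
    if bound:
        print("Still in AHCI mode, controller(s):", ", ".join(bound))
    else:
        print("No PCI device is bound to the ahci driver.")
```

Run it once before the BIOS change and once after, and compare. If the same address still shows up under ahci, the setting did not stick.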
In my case, this little trick fixed a decade-old problem, and I can think of half a dozen different implementation types that Linux distros - and the kernel - could adopt to more gracefully handle systems with similar hardware and/or conditions without triggering resets and making the desktop unusable.
Caution: Changing to the IDE mode may not work well on dual-boot systems with Windows, and you will have to change registry values to allow Windows to be able to boot with the correct drivers - otherwise, you may end up with BSOD. Contrary to legends, you will not need to reinstall Windows if you change the controller type. But still, be aware and extra careful when you do this. Of course, you can always go back to AHCI without any data loss.
I tested this with Ubuntu 17.10 and Kubuntu 17.10, and indeed, this fixed the wake issue. Real-life performance differences aren't huge unless you start doing heavy stuff. Then again, this is an old laptop. Lastly, it's not really about AHCI vs IDE, it's about showing that problems of this type can be resolved if there's so much as ten minutes of care. I guess the overall development community couldn't summon that much good will in the past decade. It's not as if the issues are unknown.
A good side effect is somewhat lower CPU noise in day to day operations. Memory usage remains identical, but the processor eats fewer cycles doing nothing. Aardvark Gnome crunched something like 10% on idle, and with this change, it's down to about 6-7%. Perhaps this is incidental, and I haven't done an in-depth investigation into this. However, I'm reporting on all my observations from the testing. On one hand, less CPU noise, on the other, slightly less speed in the IDE mode. You win some, you lose some.
If you're interested, some additional use cases and possible suggestions:
When SATA fails under heavy load (kind of a reverse issue from what we have)
Well, here we are. Validation is the key to everything. Input and output. I don't think there's any serious mechanism in Linux to actually verify that devices have gracefully and correctly resumed operation after being suspended. Such checks, including non-volatile traces, would offer more robustness, allowing distro teams to tackle hardware issues and produce better systems, with higher quality and user satisfaction. It's definitely preferable to the blame evasion and shifting that exists today.
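To make the idea a bit more concrete, here is one tiny check of the kind I have in mind, as a sketch only. After a botched resume, my filesystems ended up read-only, so a post-resume script could scan the mount table for disk-backed filesystems that carry the ro option. The function name and the sample mount table are made up for illustration; on a live system you would read the text from /proc/mounts. Note that some /dev-backed mounts are read-only by design (loop-mounted images, for instance), so a hit is a hint, not proof of a failed resume.

```python
def readonly_block_mounts(mounts_text):
    """Return (device, mount_point) pairs for /dev-backed mounts that are read-only.

    mounts_text is expected in the /proc/mounts layout:
    device mount_point fstype options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        device, mount_point, _fstype, options = fields[:4]
        # Compare whole options so that e.g. "errors=remount-ro" on a
        # read-write mount does not count as read-only.
        if device.startswith("/dev/") and "ro" in options.split(","):
            result.append((device, mount_point))
    return result


if __name__ == "__main__":
    # Made-up mount table, for illustration only.
    sample = (
        "/dev/sda1 / ext4 ro,relatime,errors=remount-ro 0 0\n"
        "/dev/sda2 /home ext4 rw,relatime 0 0\n"
        "proc /proc proc rw,nosuid,nodev,noexec 0 0\n"
    )
    for device, mount_point in readonly_block_mounts(sample):
        print(device, "is mounted read-only on", mount_point)
```

A real validation mechanism would need far more than this, but even a crude check that leaves a trace on disk would beat a dark screen and no record of what went wrong.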
I find it hard to accept the "use friendly hardware" or "blame X vendor" line as a lazy excuse for not developing state-of-the-art drivers and business logic to allow seamless operations. It does not matter who's at fault if you have the technical knowhow and capacity to identify and maybe stop the problems from manifesting. In my case, the sad reality is the ancient problem will remain around until the RD510 machine is no more, and then it will no longer be a problem. But that mentality won't make the Linux desktop into a perfect product. Lastly, for those of you who have come here for technical guidance and not philosophy, if you have an Nvidia card and resumes are botched, try IDE vs AHCI, just to see what gives. After that, there might be some tweaks and workarounds to help mitigate the issue, but at least you will know where you stand. Hopefully, this was of use. See you around.