Updated: January 9, 2020
Remember how I told you about a messed-up laptop? Well, let's elaborate, shall we? I was doing some testing with imaging & recovery software, and once I was done, I wanted to see how well the process had gone. Not well, it turned out. GRUB was there, but initially, no entry in the menu worked. Once I had that promptly fixed, I saw that Windows 10 wouldn't boot and wouldn't auto-repair, and half of the eight distros in the multi-boot setup wouldn't start either, dropping into emergency mode instead. All sorts of distros, take your pick.
Now, the GRUB recovery was quite tricky - none of the methods I could think of worked, and I ended up installing a test distro just to get the bootloader configured properly. Then, I started one of the distros that DID work, and noticed there was no data loss. Everything was there, all the partitions were sane and whole, and the files were in their right place, Linux and Windows included. In this article, I'd like to show you how I went about this problem, and how I fixed it - and in the sequel, we shall do the same for Windows 10. A useful exercise. Follow me.
Problem in more detail
I did the following exercise with KDE neon as a starter, but it applies to pretty much every distro released since roughly 2011-2012. So what happens is, KDE neon starts booting and then drops into the emergency shell. The system told me to run journalctl -xb to see the startup log and figure out what had gone wrong. Now, this ain't the first time we've encountered this. I handled a similar issue with Fedora not that long ago.
But here, the issue was slightly different. Yes, the boot log showed problems with /boot/efi, but the reason it happened stemmed from another issue. If you're wondering how I obtained and copied the logs from the emergency session, you can dump the contents to a file and then retrieve it through a live session (with a redirect > or the tee command), or once you've fixed the problem, you can dump the log of the previous boot with journalctl -x --boot=-1. That's modern technology for you.
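A minimal sketch of that retrieval flow; the file path is an arbitrary example (in the emergency shell, where you are root, /root works just as well):

```shell
# Inside the emergency shell: save the current boot log to a file,
# which you can later pull out through a live session:
journalctl -xb > /tmp/bootlog.txt

# tee keeps the log on screen while also saving a copy:
journalctl -xb | tee /tmp/bootlog.txt

# After the fix and a successful reboot, the previous (failed) boot
# is available as boot number -1; || true only so this sketch also
# runs on systems without journald:
journalctl -x --boot=-1 || true
```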
Dec 06 12:57:09 tester systemd: Dependency failed for File System Check on
-- Subject: Unit systemd-fsck@dev-disk-by\x2duuid-641C\x2d39CE.service has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
-- Unit systemd-fsck@dev-disk-by\x2duuid-641C\x2d39CE.service has failed.
-- The result is RESULT.
Dec 06 12:57:09 tester systemd: Dependency failed for /boot/efi.
-- Subject: Unit boot-efi.mount has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
-- Unit boot-efi.mount has failed.
I was wondering what could have gone wrong. The whole dev-disk-by\x2duuid string was a big hint. Remember when I showed you how to fix slow boot issues after a swap partition UUID got changed? This looked quite similar, except /boot/efi is critical to the system startup process.
I opened /etc/fstab, and indeed, the listed UUID for the swap partition (/dev/sda10) was NOT correct. The file even carries a comment saying the entry used to be a plain device node, before it was changed to the supposedly "modern" and "helpful" UUID nonsense that only causes problems.
# swap was on /dev/sda10 during installation
UUID=8140c8d0-1e33-42c1-8c3c-828449adfe08 none swap swap 0 0
I changed the UUID entry back to /dev/sda10, rebooted, and everything was peachy!
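For the record, a before/after sketch of the fix, using the values from this particular machine (your device node will differ; the installer comment in the file tells you which one it was):

```
# /etc/fstab - before, with the stale UUID (system would not boot):
UUID=8140c8d0-1e33-42c1-8c3c-828449adfe08 none swap swap 0 0

# /etc/fstab - after, back to the plain device node:
/dev/sda10 none swap swap 0 0
```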
Commands you will need to do this
Now, let's slow down a second. This is how you can verify whether your system is using the correct device nodes or identifiers. First, you can list the partitions with fdisk -l. This command will give you an overview of all the different partitions and their filesystem types, so you get a basic understanding of your system layout. Notably, you need the root partition (/); you may have a separate /boot or /boot/efi partition, the latter being the norm on UEFI systems; you might be using a separate /home; and you may also have an optional, separate swap partition. These will be marked with device nodes, like /dev/sda2, /dev/sda8, etc.
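That overview step boils down to a command or two. The lsblk alternative is my addition here; it ships with util-linux on pretty much every distro:

```shell
# Partition table overview - needs root privileges:
sudo fdisk -l

# Compact view of block devices with filesystem types, UUIDs and
# mount points in one table:
lsblk -f
```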
The problem starts with how more recent releases of Linux distributions map devices in /etc/fstab, the file parsed on system startup for information on which devices should be auto-mounted. In the past, devices were referenced the way fdisk lists them, /dev/XdYZ, where X signifies the type of disk (usually the letter h or s), Y is the device letter (as ordered by your system BIOS, e.g. a, b and so on), and Z is the partition number. For instance, /dev/sdb3 means the third partition on the second SATA/SCSI (or USB) disk.
Indeed, the "modern" distributions use UUID - a human-meaningless string of numbers and letters, something like a45f-cc9a, meant to uniquely identify partitions, so that if the disk order somehow changes, the system can still boot. Perhaps that makes sense in the enterprise, but in the home environment, this is absolute and complete nonsense - like pretty much EVERY "modern" solution, like systemd, the new network interface naming convention, the new network management tools, and so forth. More on philosophy later.
Now, if you have a wrong UUID listed in /etc/fstab, the system cannot mount these partitions - the mechanism is super flaky, because it has no ability to check, search or heal a broken configuration - you may end up with an unbootable system. This situation is also impossible to troubleshoot quickly, because the UUID strings are totally useless to humans.
You can verify the list of device UUIDs with the blkid command, for example:
/dev/mapper/sda3_crypt: UUID="TpZGKB-31Lq-U1Ap-BZJCcX" TYPE="LVM2_member"
/dev/mapper/kubuntu--vg-root: UUID="dcae17fd-7cfe-c0b721e" TYPE="ext4"
You need to visually compare the two, figure out what's wrong or missing, and then manually fix /etc/fstab.
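If eyeballing long UUID strings sounds error-prone, you can script the comparison. A minimal sketch with canned sample data so the logic stays visible; on a real system you would feed it live blkid output and your actual /etc/fstab, and the stale UUID below is made up:

```shell
#!/bin/sh
# UUIDs the system actually has, one per line. On a real box you
# would populate this with: known=$(blkid -s UUID -o value)
known="8140c8d0-1e33-42c1-8c3c-828449adfe08
641C-39CE"

# A sample fstab entry pointing at a stale (made up) UUID:
entry="UUID=0000aaaa-1111-2222-3333-444455556666 none swap swap 0 0"

# Extract the UUID= value from the fstab entry:
want=$(printf '%s\n' "$entry" | sed -n 's/^UUID=\([^ ]*\).*/\1/p')

# Whole-line, fixed-string match against the known UUIDs:
if printf '%s\n' "$known" | grep -qxF "$want"; then
    echo "OK: $want exists on this system"
else
    echo "MISSING: $want is not a known UUID - fix /etc/fstab"
fi
```

Run it per UUID= line in fstab, and any MISSING hit is your broken mount waiting to happen.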
Modern, you keep using that word ...
The simple reality is as follows: I've been working with Linux for more than 15 years. At some point in life, I was handling a beautiful HPC setup with roughly 50,000 physical servers distributed across some 40 data centers. I had my share of business and home systems, with old and new technologies.
I have encountered broken systems mostly since we moved from the old BIOS + MBR + GRUB + init to the new UEFI + GPT + GRUB2 + systemd. There's nothing magical, resilient or improved in these solutions. They do solve some real technical limitations in the industry, true. But they also introduce a far less robust setup that is infinitely harder to debug and troubleshoot by humans. For example, I have NEVER EVER seen a busted system because the disk order got messed up. But I have seen MANY examples of systems busted by the use of UUID instead of the simple device number notation.
Fixing GRUB used to be a trivial matter of copying 512 bytes of data here or there in the worst case, and editing a text file. Now, accessing the new interfaces and working with EFI and such is almost a scientific religion. Fixing a broken Linux boot wasn't an issue, and now, I'm having to parse binary logs into text, and then figure out why my systems may be wonky, only to discover it's all because I'm forced to use "solutions" to problems that do not exist. For instance, the UUID thing. The ip versus ifconfig thing. Why is enp0s0, or whatever the new network card identifier may be, better or smarter or more intuitive than eth0? It used to be logical: ethernet = eth.
What's the scenario where home users frequently add and remove hard disks or play with BIOS settings? There isn't one. So why do home systems come with solutions that are inadequate for them? Because Linux is inherently not meant to be a home solution, and I'm left feeling like an idiot for expecting otherwise.
And this is only the beginning. As time goes by, we will have more abstraction, abstraction, abstraction, automation, machine-optimized crap, and you will have to depend on the mercy of whatever entity is pumping out its latest Whatever as a Service. There will come a point when this whole nonsense explodes, because it cannot be debugged. Or it will be left to machines to manage on their own, using ugly protocols that humans were never meant to see or experience in the first place. Talk about poor design.
Conclusion
I hope you find this article useful. In most cases where a Linux system isn't booting, you are probably facing an issue with /boot or /boot/efi. If that happens, you should be somewhat confident reading the logs, and then try to figure out whether you have missing or broken components, like I've shown you with the initrd and initramfs bugs (linked above), or whether the system configuration is wrong and referencing non-existent resources. In our case, as with the slow boot issue, the use of a bogus UUID for a partition was the culprit.
This should be solved at the system level. But as I've noted many times before, validation just isn't a thing in Linux. Developers do their thing and move on. No one bothers to think about the bigger picture, about philosophy. But that's software thinking: function -> output. No one cares what lives outside the function and encapsulates the functionality.
A robust system would actually examine the existing devices and filesystems and try to figure out whether there's a problem in the configuration. A robust system would keep a backup of the critical files and try to reference it. A robust system would attempt things in several different ways and correlate the results before failing in a meaningful way. None of that exists today, not in Linux or any other system, because it's cheaper to maintain a horde of poor technicians than to invent smart, self-healing mechanisms. And if this happens to you at home, well, you're just collateral. So when people talk about freedom and open source, they are talking about the wrong thing. What's open source good for if it's used to develop obfuscated solutions? I'm sad. Off I go to cry.