Linux, AMD GPU, black screen on boot

Updated: December 23, 2021

Recently, I encountered a brand new hardware-related problem with a Linux distro. In Linux Mint 20.2, while booting on battery power, i.e., no wall socket juice, the boot process would stop at some point, with an unresponsive black screen shown. The only resolution was to reboot, or to power the host on with the charger plugged in.

What is interesting is that this happened on a relatively new IdeaPad 3 laptop with AMD Vega 8 graphics. And it annoyed me a lot, because there always seems to be some problem with hardware. Wireless on this machine, graphics on this one, I/O control here, camera there, and so on. Always problems, always excuses. Well, let's see what we can do here, and how to fix this.

Problem in more detail

I encountered the issue with Linux Mint. But I suspect the problem affects a much wider base. Indeed, if you search for "AMD boot black screen", you will get tons of results for forum threads, be they Ubuntu, Mint, Arch, Manjaro, or Gentoo, dating back to 2019, with tons of recommendations and very few actual solutions. Why? Because fixing issues with drivers takes expertise, and if your kernel and/or drivers don't offer the right kind of functionality, there isn't much you can do. This also brings into focus the question of open-source versus closed-source drivers, as if that makes any difference. It doesn't, because expertise is expertise.

Mini-rant aside, the IdeaPad 3 machine has a triple-boot configuration, which also includes MX-21 KDE and Windows. Since these other two systems work without any problem, I could rule out a hardware issue, and focus on what's specifically wrong (and different) with the Mint boot sequence.

To that end, I took the dmesg, kern.log, X.org.log and system log files from Mint and MX-21 and compared them, side by side, doing actual diffs. The only real diff is in the kernel log, where Mint stops booting while the other distro merrily continues. The error reads as follows:

...
kernel: [] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 790 thread Xorg:cs0 pid 824
kernel: [] amdgpu 0000:03:00.0: GPU reset begin!
kernel: [] amdgpu 0000:03:00.0: GPU reset succeeded, trying to resume
kernel: [] [drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
kernel: [] [drm] PSP is resuming...
kernel: [] [drm] reserve 0x400000 from 0xf47f800000 for PSP TMR
kernel: [] [drm] psp command failed and response status is (0x7)
kernel: [] [drm] VCN decode and encode initialized successfully (under SPG Mode).
kernel: [] amdgpu 0000:03:00.0: ring gfx uses VM inv eng 0 on hub 0
...

Eventually, the GPU reset succeeds, but it doesn't help. The screen remains black. Now, let me show you how you can resolve or work around the problem. We have a few options at our disposal.
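As an aside, the log comparison described above is easier if the timestamps are stripped first, since they differ on every boot. Here is a minimal sketch, assuming dmesg-style lines that begin with a bracketed timestamp. The function names and the file names in the usage comments are my own placeholders, not something from the distros.

```shell
# Strip the leading "[   12.345678] " timestamp from dmesg-style lines so that
# two boots can be diffed on message content rather than on timing.
strip_timestamps() {
    sed -E 's/^\[ *[0-9]+\.[0-9]+\] //'
}

# Keep only the lines that mention the amdgpu driver or the DRM subsystem.
gpu_lines() {
    grep -E 'amdgpu|\[drm'
}

# Example usage (placeholder file names for logs saved from each distro):
#   strip_timestamps < mint-dmesg.txt | gpu_lines > mint-gpu.txt
#   strip_timestamps < mx-dmesg.txt   | gpu_lines > mx-gpu.txt
#   diff -u mx-gpu.txt mint-gpu.txt
```

Lines taken from kern.log carry an extra syslog prefix (date, hostname, "kernel:"), so the pattern would need adjusting for those.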

Solutions

OK, so here's what you can do:

Install a new kernel (if available)

Upgrade the system kernel and/or firmware. In Linux Mint, which normally pins kernels, you can manually download a new one through the System Update utility. It will warn you, and then you can select the desired version and configure it. For Mint 20.2 Uma, you can go up from kernel 5.4 to kernel 5.13.

When I installed the new kernel and looked at the configuration output, I also noticed a set of warning messages during the generation of the initramfs file:

...
W: Possible missing firmware /lib/firmware/amdgpu/vangogh_vcn.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/navy_flounder_vcn.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/navi12_vcn.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/aldebaran_vcn.bin for module amdgpu
...

You can ignore these IF your AMD GPU architecture does not show in this list. In my case, Vega 8 was correctly supported (i.e., not in this list). How does one know? Well, you can run the command lspci -v, which will list all your different hardware components. You need the entry whose kernel driver in use is amdgpu.

03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso (rev c2) (prog-if 00 [VGA controller])
Subsystem: Lenovo Picasso
...

This way, I discovered that my Vega 8 graphics actually corresponds to an architecture model called Picasso. I guess that explains the names used in the firmware file names, in general. The initramfs warnings are just untidy noise, telling you that firmware files for certain other GPU models are not present on the system. Again, this opens a wider question of Linux backward compatibility and such, but we're not going to discuss that now. Reboot, and this should, hopefully, do the job.
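If you'd rather not eyeball the warning list, the check can be scripted. This is only a sketch: the helper name codename_in_warnings is made up for illustration, and it assumes the firmware files follow the codename_something.bin naming seen in the warnings above.

```shell
# Hypothetical helper: read initramfs warning lines on stdin and exit with
# status 0 if any of them refers to a firmware file for the given codename.
codename_in_warnings() {
    grep -qiF -- "/amdgpu/${1}_"
}

# Example usage (not run here, output depends on your hardware):
#   lspci -k | grep -i -A 3 'vga'        # find the codename and "Kernel driver in use"
#   sudo update-initramfs -u 2>&1 | codename_in_warnings picasso \
#       && echo "firmware for this GPU may be missing" \
#       || echo "warnings do not concern this GPU"
```

The match is a plain substring test, so a short codename that is a prefix of a longer one would also match. Treat a hit as a reason to look closer, not as a verdict.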

Start the host with power plugged in

This is annoying, but it is a simple workaround if you're not comfortable with making any system changes, or if you do not want to do anything special until your Linux distro fixes the problem. However, the issue does highlight one (small) downside of Mint's kernel policy, and a generic, wider phenomenon of hardware support in Linux. Because, if your distro doesn't have an updated kernel available, you can't do much.

The reason this "trick" works is that a system under full power (as opposed to battery power) uses different power profiles. If you're really savvy, you can play with your BIOS power performance options, if they are available, or tweak the GPU power settings, but this is only meant as a temporary stopgap measure.
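To confirm which power state the machine is actually in when you test, you can read the kernel's power supply entries in sysfs. The sketch below assumes the charger shows up with type "Mains", which is common for laptop AC adapters but not guaranteed (some USB-C chargers report a different type). The function name is my own.

```shell
# Return 0 if any "Mains" power supply under the given directory reports
# online=1. The directory argument defaults to the real sysfs location.
is_on_ac() {
    root=${1:-/sys/class/power_supply}
    for supply in "$root"/*; do
        [ -r "$supply/type" ] || continue
        if [ "$(cat "$supply/type")" = "Mains" ] &&
           [ "$(cat "$supply/online" 2>/dev/null)" = "1" ]; then
            return 0
        fi
    done
    return 1
}

# Example usage:
#   is_on_ac && echo "charger plugged in" || echo "running on battery"
# The amdgpu driver also exposes its current performance level in sysfs; the
# card index (card0 here) is an assumption and may differ on your machine:
#   cat /sys/class/drm/card0/device/power_dpm_force_performance_level
```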

Change the boot parameters

Continuing what I mentioned just earlier, you can start the system by passing on a range of different parameters to the AMD GPU (amdgpu) kernel module. You can check what sort of parameters and options the module supports by running the modinfo command:

modinfo amdgpu

filename:       /lib/modules/5.13.0-22-generic/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko
license:        GPL and additional rights
description:    AMD GPU
author:         AMD linux driver team
...
parm:           audio:Audio enable (-1 = auto, 0 = disable, 1 = enable) (int)
parm:           disp_priority:Display Priority (0 = auto, 1 = normal, 2 = high) (int)
parm:           hw_i2c:hw i2c engine enable (0 = disable) (int)
parm:           pcie_gen2:PCIE Gen2 mode (-1 = auto, 0 = disable, 1 = enable) (int)
parm:           msi:MSI support (1 = enable, 0 = disable, -1 = auto) (int)
...
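The full modinfo output is long, so it can help to reduce it to just the parameter names before you start reading descriptions. A small sketch, with a helper name of my own choosing:

```shell
# Reduce modinfo output on stdin to the bare parameter names, one per line.
parm_names() {
    sed -n -E 's/^parm:[[:space:]]+([A-Za-z0-9_]+):.*$/\1/p'
}

# Example usage:
#   modinfo amdgpu | parm_names
# modinfo can also print only the parm fields (names plus descriptions):
#   modinfo -F parm amdgpu
```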

For instance, here are some of the available options you can try - but do NOT unless you understand what you're doing!

amdgpu.noretry=0
amdgpu.dc=1

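These need to be appended to the kernel boot line in the boot menu. To make the change permanent on most recent Linux distributions that use the GRUB2 bootloader, add the options to the kernel command line variable in the GRUB defaults file (typically GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub), and then regenerate the GRUB configuration: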

sudo update-grub

Or, on systems that do not use the wrapper script above:

sudo grub2-mkconfig -o /boot/grub2/grub.cfg
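Before running either command, the parameter has to be present in the GRUB defaults file. Here is a minimal sketch of scripting that edit, assuming the file contains a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line, as Ubuntu-based systems like Mint typically do. The function name is hypothetical, and it takes the file path as an argument so you can try it on a copy first.

```shell
# Append one kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT in the given file,
# unless it is already there. A backup of the file is kept as <file>.bak.
# Assumes a double-quoted value and a parameter without "|" characters.
add_kernel_param() {
    file=$1
    param=$2
    if grep -E '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qwF -- "$param"; then
        return 0
    fi
    sed -i.bak -E "s|^(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*)\"|\1 ${param}\"|" "$file"
}

# Example usage, on a copy first:
#   cp /etc/default/grub /tmp/grub.test
#   add_kernel_param /tmp/grub.test amdgpu.dc=1
#   diff /etc/default/grub /tmp/grub.test
```

Only once the diff looks right would you apply the same change to the real file (with sudo) and regenerate the configuration.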

Reboot your system and see if your issue is resolved. You can check how the system booted by examining the kernel command line - or rather, by seeing whether it boots under battery just fine, ha ha!

cat /proc/cmdline
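If you're testing several options in a row, a yes/no check is handier than reading the whole line. A small sketch, with a function name of my own invention; the optional second argument exists only so the check can be pointed at a saved copy of the command line.

```shell
# Exit with status 0 if the given parameter appears as a whole token in the
# kernel command line. The second argument defaults to /proc/cmdline.
booted_with() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# Example usage:
#   booted_with amdgpu.dc=1 && echo "option is active" || echo "option not set"
# With the amdgpu module loaded, the live parameter values can usually be read
# from sysfs as well, for instance:
#   cat /sys/module/amdgpu/parameters/dc
```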

Now, the big question is, which amdgpu options should you add?

There is no simple answer to this, I'm afraid. In most cases, short of an actual kernel/firmware fix, you will be guessing, based on the error message you see in the kernel log, and hope that the specific option can do the trick. This is because error messages are often generic, and without expertise in the graphics stack and the particular driver, you can't really nail it down with a handful of kernel module options.

Making these edits can potentially lead to additional problems and complications, which is why you shouldn't blindly apply them, or just copy any old suggestion from a forum. My testing shows that no option really makes any big difference. The two listed above are just for reference. Still, if kernel updates don't work, and you must be able to use the laptop under battery power, then I guess you have nothing to lose, and you might as well experiment and see what gives.

Conclusion

There we go. Hopefully, your AMD-graphics laptop running Linux is now behaving correctly, and you're no longer seeing the black screen issue on boot while using battery power (or in any other scenario). My tutorial outlines three major approaches: kernel upgrade, power usage workaround, and some hackery with kernel module parameters, which is risky and most likely won't give you the best results, but hey.

I don't like these kinds of problems. They always remind me of how fragile Linux is. Yes, it runs on tons of hardware, and that's commendable, but it's always 95% or 91%, never 100% through and through. And that's annoying. Well, anyway, that's it. Now, off I go to my next Tuxy hurdle. See you around.

Cheers.
