As you may have already guessed, I have been building a new Kernel for my PC (an md8818), and while doing so I ran into a strange problem that I didn’t quite understand.
Of course, the first thing to do is to ask Google. I found out that I wasn’t the only one having this trouble, but it also seemed that no one really had a solution for it (other than suppressing the symptoms). It took me 11 Kernel builds until I finally eliminated the problem, so to spare you from going through all this mess yourself, I’d like to present the problem plus its solution.
The initial problems
The problems I encountered are as follows:
- When choosing and starting the Kernel with grub, the first thing I get is a message that the chosen VGA mode is not supported, and I am given the choice between 80×25 and variations of it. This appeared regardless of whether I used nvidiafb or uvesafb, but nvidiafb managed to switch to a higher screen resolution later in the boot process.
- I got a lot of funny error messages right during the boot process, and even after logging in my VTs were occasionally flooded with this message (it always occurred in packs of several lines):
[ 11.560908] do_IRQ: 0.107 No irq handler for vector (irq -1)
[ 11.560935] do_IRQ: 0.107 No irq handler for vector (irq -1)
[ 11.560964] do_IRQ: 0.107 No irq handler for vector (irq -1)
- When searching dmesg for that error, I saw that early in the boot process another error message was flooding the log, filling pages with the following:
[ 0.101560] ACPI: EC: Look up EC in DSDT
[ 0.102151] ACPI Exception: AE_ERROR, Returned by Handler for [PCI_Config] 20090521 evregion-424
[ 0.102317] ACPI Error (psparse-0537): Method parse/execution failed [_SB_.PCI0.SATA._STA] (Node f70116f0), AE_ERROR
[ 0.102557] ACPI Error (uteval-0256): Method execution failed [_SB_.PCI0.SATA._STA] (Node f70116f0), AE_ERROR
- From time to time my terminal would simply freeze, i.e. I wasn’t able to use keyboard or mouse (gpm) anymore and the screen just stayed as it was. Interestingly, this was limited to the screen (I guess even keyboard and mouse were in fact still reacting). If you logged in via ssh from another machine, you’d find that processes were running, the system was responding, and everything seemed to be working fine (even those do_IRQ messages didn’t show up over ssh).
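To get a feeling for how bad the flooding is, it can help to count the do_IRQ lines per vector instead of scrolling through dmesg. Here is a minimal sketch in Python; the function name count_unhandled_vectors is just something I made up for illustration, and I am assuming that the two numbers in "0.107" are the CPU and the interrupt vector, which is how I read the Kernel's message format.

```python
import re
from collections import Counter

# Matches lines such as:
# [ 11.560908] do_IRQ: 0.107 No irq handler for vector (irq -1)
# Assumption: the first number is the CPU, the second the vector.
DO_IRQ = re.compile(r"do_IRQ: (\d+)\.(\d+) No irq handler for vector")


def count_unhandled_vectors(log_text):
    """Return a Counter mapping (cpu, vector) to the number of messages."""
    counts = Counter()
    for line in log_text.splitlines():
        match = DO_IRQ.search(line)
        if match:
            cpu, vector = int(match.group(1)), int(match.group(2))
            counts[(cpu, vector)] += 1
    return counts


# Sample taken from the messages quoted above.
sample = """\
[ 11.560908] do_IRQ: 0.107 No irq handler for vector (irq -1)
[ 11.560935] do_IRQ: 0.107 No irq handler for vector (irq -1)
[ 11.560964] do_IRQ: 0.107 No irq handler for vector (irq -1)
"""

if __name__ == "__main__":
    for (cpu, vector), n in count_unhandled_vectors(sample).items():
        print(f"cpu {cpu}, vector {vector}: {n} messages")
```

On a real system you would feed it the saved output of dmesg instead of the sample string.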
First of all let me start with my hardware, because this is what the problem depends on. The problem originated in having made software choices (while making the configurations for the Kernel) that didn’t meet the hardware needs. So if you encounter similar problems while having similar hardware, you may have run into the same problem as I did.
So let us start with lspci (I stripped off all uninteresting parts of it):
telperion pygospa # lspci
00:00.0 Host bridge: VIA Technologies, Inc. P4M890 Host Bridge
02:00.0 VGA compatible controller: nVidia Corporation GeForce 7650 GS (rev a1)
04:00.0 PCI bridge: VIA Technologies, Inc. VT8251 PCIE Root Port
So we are dealing with a VIA Technologies chipset, model P4M890. There’s a PCIe port, again provided by a VIA Technologies chip, model VT8251, and there is a graphics card connected to it, an nVidia Corporation GeForce 7650 GS.
Now uname will tell us that it is an Intel Core 2 Duo, but I guess that is uninteresting.
telperion pygospa # uname -a
Linux telperion 22.214.171.124-20091011-16 #14 SMP Mon Oct 26 02:36:16 CET 2009 i686 Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz GenuineIntel GNU/Linux
More interesting is the current Kernel version (as this problem doesn’t seem to occur with older Kernel versions), which is 126.96.36.199, directly from The Linux Kernel Archives without any additional patching.
While searching the web for a possible solution, the only things I could find were references to bug reports, as well as some pseudo cures that, for example, suppressed the do_IRQ message from appearing on the screen.
So solutions such as changing the boot parameters to pci=nomsi, nosmp, pci=routeirq, noacpi, or any combination of these did in fact have an effect to some degree. Still, they did not fix the problem. Some suppressed the do_IRQ message but still showed the ACPI error message in dmesg, or still froze the machine. One of them just changed the error message into:
[ 10.893459] do_IRQ: 0.75 No irq handler for vector (irq -1)
[ 10.893486] do_IRQ: 0.75 No irq handler for vector (irq -1)
[ 10.893515] do_IRQ: 0.75 No irq handler for vector (irq -1)
[ 10.893544] do_IRQ: 0.75 No irq handler for vector (irq -1)
So it seemed that either this really was a bug, or no solution had been found yet. I somehow dismissed the bug theory, as I already had the system running some years ago. Also, grml, which has a quite current Kernel (2.6.28) as well, didn’t show any problems, whereas the solutions suggested for this problem were to change your Kernel to something like 2.6.1?
Finding the cure
My first thought was that it had to be a problem with my graphics card. This was backed not only by the fact that the system only wanted to boot in 80×25, but also by the fact that it seemed to be only the screen that froze. With that in mind, my first idea was to play around with the frame buffer settings. I first had uvesafb and switched to nvidiafb, but as that did not change the problem, I finally played around with the general settings:
--- Support for frame buffer devices
[*] Enable firmware EDID
[ ] Framebuffer foreign endianness support --->
-*- Enable Video Mode Handling Helpers
[ ] Enable Tile Blitting Support
But that didn’t make me happy either. Nor did trying every possible combination of the nvidia settings bring me any luck.
<M> nVidia Framebuffer Support
Enable DDC Support (NEW)
[ ] Lots of debug output (NEW)
[*] Support for backlight control (NEW)
Even the standard drivers did not bring any cure.
< > VGA 16-color graphics support
[ ] VESA VGA graphics support
So it couldn’t be the graphics card frame buffer driver, could it? I switched back to uvesafb and really thoroughly followed the instructions as described by spock. Of course that didn’t do me any good, but at least I could rule it out as the culprit.
Now the second clue I had was ACPI. Of course I knew that my machine had ACPI capabilities; still, you never know. Playing around with that didn’t show any effect either. Now the last thing I could hold on to was the IRQ handling. There are a lot of different IRQ options to choose from, and finally I found one that made a difference:
[ ] Message Signaled Interrupts (MSI and MSI-X)
Deactivating this suppressed the IRQ message entirely (I think I had a similar effect when using pci=nomsi as a boot option; makes sense, doesn’t it 😉 ). Of course this made working much easier for me, as my VTs stayed clear now. Still, this just suppressed the IRQ message; the error was still there, and the ACPI message stayed, for example. But I got a new message, and one that was somewhat more readable (at least to me):
[ 11.455532] +------ PCI-Express Device Error ------+
[ 11.455542] Error Severity : Uncorrected (Non-Fatal)
[ 11.455544] PCIE Bus Error type : Transaction Layer
[ 11.455545] Flow Control Protocol : First
[ 11.455547] Receiver ID : 0010
[ 11.455548] VendorID=1106h, DeviceID=a327h, Bus=00h, Device=02h, Function=00h
[ 11.455551] pcieport-driver 0000:00:02.0: broadcast error_detected message
[ 11.455553] pcieport-driver 0000:00:02.0: broadcast mmio_enabled message
[ 11.455555] pcieport-driver 0000:00:02.0: broadcast resume message
[ 11.455563] pcieport-driver 0000:00:02.0: AER driver successfully recovered
[ 11.455576] pcieport-driver 0000:00:02.0: can't find device of ID0010
Now this actually looks like a PCIe error (surprise, surprise), and so the circle closes. After all, what is connected to my PCIe port? Exactly! My graphics card, the very thing that gave me my first suspicions.
After playing around a bit with the Kernel options for this device, I suddenly managed to get an even stranger error (changing the PCI access mode to ‘Any’ got my machine flooded with messages at boot time, as if it were stuck in an infinite loop). That encouraged me to keep at it, and then there we go:
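If you want to look up the reporting device yourself, the Bus/Device/Function fields of the error report can be turned into the bus:device.function notation that lspci prints. The following is only a small sketch; the helper name aer_to_lspci_address is made up, and it assumes the hexadecimal "Bus=..h, Device=..h, Function=..h" layout seen in my log. Vendor ID 1106h is VIA, by the way, so the report comes from the chipset's PCIe port.

```python
import re

# Assumes the field layout from the error report quoted above:
# "Bus=00h, Device=02h, Function=00h" with hexadecimal values.
AER_LOCATION = re.compile(
    r"Bus=([0-9a-fA-F]+)h, Device=([0-9a-fA-F]+)h, Function=([0-9a-fA-F]+)h"
)


def aer_to_lspci_address(line):
    """Convert the location fields of an AER line to lspci notation.

    Returns None if the line does not contain the three fields.
    """
    match = AER_LOCATION.search(line)
    if match is None:
        return None
    bus, device, function = (int(value, 16) for value in match.groups())
    return f"{bus:02x}:{device:02x}.{function:x}"


line = ("[ 11.455548] VendorID=1106h, DeviceID=a327h, "
        "Bus=00h, Device=02h, Function=00h")

if __name__ == "__main__":
    print(aer_to_lspci_address(line))  # 00:02.0
```

The result, 00:02.0, is the same address the pcieport-driver lines in the report carry (0000:00:02.0, with the PCI domain in front), so you can pass it to lspci -s to see what sits there.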
If you encounter problems such as those described at the beginning of this entry, you may solve them by just changing your Kernel option from
PCI access mode (MMConfig) --->
or
PCI access mode (Any) --->
to
PCI access mode (Direct) --->
and you’ll not only end up with a neat and clean dmesg output, but your system will also boot straight into the desired and configured screen resolution using your framebuffer. It took me 11 Kernel builds, so I hope this saves you some time. If so, I’d be glad about a comment 😉
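If you are not sure what your current Kernel was built with, you can also check the .config directly instead of clicking through menuconfig. Below is a minimal sketch; the symbol names (CONFIG_PCI_GOBIOS, CONFIG_PCI_GOMMCONFIG, CONFIG_PCI_GODIRECT, CONFIG_PCI_GOANY, CONFIG_PCI_MSI) are the ones I know from the 32-bit x86 Kconfig, so please verify them against your own tree. The function names and the sample config text are made up for illustration.

```python
# Kconfig symbols behind the "PCI access mode" choice on 32-bit x86,
# as far as I know them; verify against your own Kernel sources.
PCI_ACCESS_SYMBOLS = {
    "CONFIG_PCI_GOBIOS": "BIOS",
    "CONFIG_PCI_GOMMCONFIG": "MMConfig",
    "CONFIG_PCI_GODIRECT": "Direct",
    "CONFIG_PCI_GOANY": "Any",
}


def enabled_symbols(config_text):
    """Return the set of symbols set to y or m in .config-style text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set"
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        if value in ("y", "m"):
            enabled.add(name)
    return enabled


def pci_access_mode(config_text):
    """Return the selected PCI access mode label, or None if not found."""
    enabled = enabled_symbols(config_text)
    for symbol, label in PCI_ACCESS_SYMBOLS.items():
        if symbol in enabled:
            return label
    return None


def msi_enabled(config_text):
    """Return True if Message Signaled Interrupts are built in."""
    return "CONFIG_PCI_MSI" in enabled_symbols(config_text)


# Hypothetical excerpt, showing the settings described in this entry.
sample_config = """\
# CONFIG_PCI_GOBIOS is not set
# CONFIG_PCI_GOMMCONFIG is not set
CONFIG_PCI_GODIRECT=y
# CONFIG_PCI_GOANY is not set
# CONFIG_PCI_MSI is not set
"""

if __name__ == "__main__":
    print("PCI access mode:", pci_access_mode(sample_config))
    print("MSI enabled:", msi_enabled(sample_config))
```

To check a real build, read the .config from your Kernel source directory into a string and pass that in instead of the sample.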