[Linux] “0.107 No irq handler for vector” aka “Problems with my PCIe Port”


As you may have already guessed, I have been building a new kernel for my PC (an md8818), and while doing so I ran into a strange problem that I didn’t quite understand.

Of course, the first thing to do is to ask Google. I found out that I wasn’t the only one having this trouble, but it also seemed that no one really had a solution (apart from suppressing the symptoms). It took me 11 kernel builds until I finally eliminated the problem, so to spare you the same mess, I’d like to present the problem and its solution.


The initial problems

The problems I encountered are as follows:

  1. When I select and boot the kernel from GRUB, the first thing I get is a message that the chosen VGA mode is not supported, and I am offered 80×25 and variations of it. This happened with both nvidiafb and uvesafb, although nvidiafb managed to switch to a higher screen resolution later during boot.
  2. I got a lot of odd error messages right from the start of the boot process, and even after logging in my VTs were occasionally flooded with them (they always arrived in bursts of several lines):
    [   11.560908] do_IRQ: 0.107 No irq handler for vector (irq -1)
    [   11.560935] do_IRQ: 0.107 No irq handler for vector (irq -1)
    [   11.560964] do_IRQ: 0.107 No irq handler for vector (irq -1)
  3. When searching dmesg for that error, I saw that another error message had flooded pages of the screen earlier during boot:
    [    0.101560] ACPI: EC: Look up EC in DSDT
    [    0.102151] ACPI Exception: AE_ERROR, Returned by Handler for [PCI_Config] 20090521 evregion-424
    [    0.102317] ACPI Error (psparse-0537): Method parse/execution failed [_SB_.PCI0.SATA._STA] (Node f70116f0), AE_ERROR
    [    0.102557] ACPI Error (uteval-0256): Method execution failed [_SB_.PCI0.SATA._STA] (Node f70116f0), AE_ERROR
  4. From time to time my terminal would simply freeze, i.e. I could no longer use the keyboard or mouse (gpm) and the screen just stayed as it was. Interestingly, this was limited to the screen (I guess the keyboard and mouse were in fact still responding). When I logged in via ssh from another machine, processes were running, the system was responsive, and everything seemed to be working fine (even the do_IRQ messages didn’t show up over ssh).
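
To see which vectors are affected and how often, the do_IRQ lines can be tallied from a saved dmesg dump. The following is only a minimal sketch in Python: the function name `count_unhandled_vectors` is made up for this example, and reading the two numbers in "0.107" as CPU number and interrupt vector is an assumption based on how the kernel appears to format this message.

```python
import re
from collections import Counter

# Matches lines such as:
#   [   11.560908] do_IRQ: 0.107 No irq handler for vector (irq -1)
# Assumption: the two numbers are "<cpu>.<vector>".
DO_IRQ_RE = re.compile(r"do_IRQ: (\d+)\.(\d+) No irq handler for vector")


def count_unhandled_vectors(log_text):
    """Return a Counter mapping (cpu, vector) to the number of matching lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = DO_IRQ_RE.search(line)
        if match:
            cpu = int(match.group(1))
            vector = int(match.group(2))
            counts[(cpu, vector)] += 1
    return counts
```

Feeding it the text of a saved `dmesg` output and printing `counts.most_common()` shows at a glance whether a boot parameter really removed the messages or just moved them to another vector (as happened with 0.75 further down).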

My hardware

Let me start with my hardware, because that is what the problem depends on. The problem came from software choices (made while configuring the kernel) that didn’t match what the hardware needs. So if you see similar symptoms on similar hardware, you may have run into the same problem as I did.

So let us start with lspci (I stripped off all uninteresting parts of it):

telperion pygospa # lspci
00:00.0 Host bridge: VIA Technologies, Inc. P4M890 Host Bridge
02:00.0 VGA compatible controller: nVidia Corporation GeForce 7650 GS (rev a1)
04:00.0 PCI bridge: VIA Technologies, Inc. VT8251 PCIE Root Port

So we are dealing with a VIA Technologies chipset, model P4M890, and a PCIe port that again belongs to a VIA Technologies chipset, model VT8251. And there seems to be a graphics card connected to it, an nVidia Corporation GeForce 7650 GS.

uname will tell us that the CPU is an Intel Core 2 Duo, but I guess that is not very interesting.

telperion pygospa # uname -a
Linux telperion 2.6.31.3-20091011-16 #14 SMP Mon Oct 26 02:36:16 CET 2009 i686 Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz GenuineIntel GNU/Linux

More interesting is the current kernel version (as this problem doesn’t seem to occur with older kernel versions): 2.6.31.3, taken directly from The Linux Kernel Archives without any additional patching.


Suggested solutions

While searching the web for a possible solution to this problem, the only things I could find were references to bug reports and some pseudo-cures that, for example, kept the do_IRQ message from appearing on the screen.

Suggestions such as changing the boot parameters to pci=nomsi, nosmp, pci=routeirq, noacpi, or any combination of these did in fact have some effect, but they did not solve the problem. Some suppressed the message but still showed the ACPI error in dmesg, or still froze the machine. One of them just changed the error message into:

[   10.893459] do_IRQ: 0.75 No irq handler for vector (irq -1)
[   10.893486] do_IRQ: 0.75 No irq handler for vector (irq -1)
[   10.893515] do_IRQ: 0.75 No irq handler for vector (irq -1)
[   10.893544] do_IRQ: 0.75 No irq handler for vector (irq -1)

So it seemed that either this really was a bug, or no solution had been found yet. I dismissed the bug theory, as I had already had this system running some years ago. Also, grml showed no problems despite having a fairly current kernel (2.6.28) as well, whereas the suggested solutions recommended changing the kernel to something like 2.6.1?


Finding the cure

My first thought was that it had to be a problem with my graphics card. This was backed not only by the fact that the system would only boot in 80×25, but also by the fact that only the screen seemed to freeze. With that in mind, my first idea was to play around with the framebuffer settings. I started with uvesafb and switched to nvidiafb, but as that did not change anything, I finally played around with the general settings:

--- Support for frame buffer devices
[*]   Enable firmware EDID
[ ]   Framebuffer foreign endianness support  --->
-*-   Enable Video Mode Handling Helpers
[ ]   Enable Tile Blitting Support

But that didn’t help. Trying every possible combination of the nvidia settings brought me no luck either.

<M>   nVidia Framebuffer Support
Enable DDC Support (NEW)
[ ]     Lots of debug output (NEW)
[*]     Support for backlight control (NEW)

Even the standard drivers did not bring any cure.

< >   VGA 16-color graphics support
[ ]   VESA VGA graphics support

So it couldn’t be the graphics card framebuffer driver, could it? I switched back to uvesafb and very thoroughly followed the instructions as described by spock. Of course that didn’t do me any good, but at least I could rule it out as the culprit.

The second clue I had was ACPI. Of course I knew that my machine had ACPI capabilities, but you never know. Playing around with that didn’t have any effect either. The last thing I could hold on to was the IRQ handling. There are a lot of different IRQ options to choose from, and finally I found one that made a difference:

[ ] Message Signaled Interrupts (MSI and MSI-X)

Deactivating this suppressed the IRQ message completely (I think I had a similar effect when using pci=nomsi as a boot option; makes sense, doesn’t it 😉 ). Of course this made working much easier for me, as my VTs now stayed clear. Still, this only suppressed the IRQ message; the error was still there, e.g. the ACPI message stayed. But I got a new message, and one that was somewhat more readable (at least to me):

[   11.455532] +------ PCI-Express Device Error ------+
[   11.455542] Error Severity		: Uncorrected (Non-Fatal)
[   11.455544] PCIE Bus Error type	: Transaction Layer
[   11.455545] Flow Control Protocol 	: First
[   11.455547] Receiver ID		: 0010
[   11.455548] VendorID=1106h, DeviceID=a327h, Bus=00h, Device=02h, Function=00h
[   11.455551] pcieport-driver 0000:00:02.0: broadcast error_detected message
[   11.455553] pcieport-driver 0000:00:02.0: broadcast mmio_enabled message
[   11.455555] pcieport-driver 0000:00:02.0: broadcast resume message
[   11.455563] pcieport-driver 0000:00:02.0: AER driver successfully recovered
[   11.455576] pcieport-driver 0000:00:02.0: can't find device of ID0010
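
The VendorID/DeviceID/Bus/Device/Function line in this report is enough to locate the reporting device in lspci. A minimal sketch, assuming the message format is exactly as in the log above (the helper name `aer_device` is hypothetical):

```python
import re

# Matches the identification line of the AER report, whether or not the
# log viewer wrapped it before "Function=".
AER_ID_RE = re.compile(
    r"VendorID=([0-9a-fA-F]{4})h, DeviceID=([0-9a-fA-F]{4})h, "
    r"Bus=([0-9a-fA-F]{2})h, Device=([0-9a-fA-F]{2})h,\s+"
    r"Function=([0-9a-fA-F]{2})h"
)


def aer_device(log_text):
    """Return (bus:device.function, vendor:device) or None if no report is found."""
    match = AER_ID_RE.search(log_text)
    if match is None:
        return None
    vendor, device, bus, dev, func = match.groups()
    # Build the address in the form lspci prints, e.g. "00:02.0".
    address = f"{int(bus, 16):02x}:{int(dev, 16):02x}.{int(func, 16):x}"
    return address, f"{vendor.lower()}:{device.lower()}"
```

For the log above this yields the address 00:02.0 and the ID 1106:a327, which can then be passed to `lspci -s 00:02.0` or `lspci -d 1106:a327` to see which device reported the error. The address matches the 0000:00:02.0 that pcieport-driver prints in the following lines.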

Now this actually looks like a PCIe error (surprise, surprise), and so the circle closes. After all, what is connected to my PCIe port? Exactly! My graphics card, the thing that raised my first suspicions.
After playing around a bit with the kernel options for this device, I suddenly managed to get an even stranger error (changing the PCI access mode to ‘Any’ flooded my machine with messages at boot time, as if it were stuck in an infinite loop). That encouraged me to keep at it, and then there we go:


The solution

If you encounter problems like those described at the beginning of this entry, you may be able to solve them by simply changing your kernel option from

      PCI access mode (MMConfig)  ---> 

or

      PCI access mode (Any)  ---> 

into

      PCI access mode (Direct)  ---> 

and you’ll not only end up with a neat and clean dmesg output, but your system will also boot straight into the configured screen resolution using your framebuffer. It took me 11 kernel builds, so I hope this saves you some time. If so, I’d be glad about a comment 😉
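
Before rebuilding, it can help to check which access mode an existing kernel configuration selects. The sketch below reads the text of a .config file. It assumes the menu choice is stored under the x86 symbols CONFIG_PCI_GOBIOS, CONFIG_PCI_GOMMCONFIG, CONFIG_PCI_GODIRECT and CONFIG_PCI_GOANY, which is how 32-bit x86 kernels of this era appear to name them; the function name `pci_access_mode` is made up for this example.

```python
# Assumed mapping from .config symbols to the labels shown in menuconfig.
PCI_ACCESS_SYMBOLS = {
    "CONFIG_PCI_GOBIOS": "BIOS",
    "CONFIG_PCI_GOMMCONFIG": "MMConfig",
    "CONFIG_PCI_GODIRECT": "Direct",
    "CONFIG_PCI_GOANY": "Any",
}


def pci_access_mode(config_text):
    """Return the selected PCI access mode label, or None if none is set."""
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as "# CONFIG_PCI_GOANY is not set".
        if line.startswith("#") or "=" not in line:
            continue
        symbol, value = line.split("=", 1)
        if symbol in PCI_ACCESS_SYMBOLS and value == "y":
            return PCI_ACCESS_SYMBOLS[symbol]
    return None
```

Called with the contents of the .config in your kernel source tree, it should report "Direct" once the change above has been made. A result of None most likely means the configuration does not offer this choice at all, which I would expect for a 64-bit build, though that is an assumption worth checking in your own menuconfig.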


8 thoughts on “[Linux] “0.107 No irq handler for vector” aka “Problems with my PCIe Port”

  1. I had the same problem as you and found an easier way to solve it:

    My problem was caused by the ATI hardware driver, which filled the syslog with these messages. Instead of compiling the kernel many times, I used boot options for the kernel in GRUB:

    pci=nomsi,noaer

    I haven’t checked for a possible impact on performance yet, but my system runs as expected. So “Good Luck!”

  2. Hey Andur,

    first of all, thanks for your input.

    As mentioned, the only change that finally fixed the problem is the kernel option described under “The solution” (everything before that just explains how I found it. I think that might be interesting for anyone who experiences similar problems for different reasons, as it shows a more or less structured way to locate such problems).

    When it comes to kernel boot parameters, what you do is suppress something that you built in in the first place. Needing “nomsi”, for example, means that the kernel tries to use MSI because your kernel configuration tells it to.

    Of course you end up with the same result, but it still doesn’t address the root cause. I’d say it just suppresses the symptoms rather than solving the problem. That’s why I’d rather have a neat and clean kernel 😉

    But of course, if it works for you and you are happy with that solution, there you go 😉

  3. WOW! Great, special thanks for this well-documented solution. I had exactly the same problem, and thanks to your explanations I figured it out very quickly. The kernel config fixed it, but it was good to read that andor had tested the appropriate kernel boot parameters (pci=nomsi,noaer), so I was able to use them and compile the new kernel afterwards. Before that it was not possible to compile; most of the processes were crashing….

    Summary:
    Many thanks for your explanations, this saved me a lot of time!

    Regards Randolf Balasus

  4. Hi,

    I have the same IRQ issue here; in fact I have the same box (the Medion). This puppy ran for 3 years (Debian etch), and I have now reinstalled it after some disk crashes, since I decided it needed an overhaul asap.

    I notice you don’t seem to be running 64-bit on that machine. Do you realize that this machine is 64-bit capable? I’ve installed Debian amd64 on it and it seems to run like a champ. Under 64-bit Debian I do NOT get the ACPI error, but I do get those do_IRQ messages.

    Also, maybe a nice need-to-know: it’s only normal that you don’t see any console error messages in your remote sessions (ssh). Those only get sent to the root user logged in directly on the console (which is the screen/keyboard here). Standard installs do not send them to a remotely logged-in root user. This is defined in your syslog configuration file (or its equivalent if you use another one like ksyslogd).

    I don’t have a use for PCIe or VGA; this is a server (and a good one, actually). You should also replace your power supply soon (mine broke this week after 3 years). It’s only a 300 watt supply anyway, which is just not enough to ‘serve’. I put a 500 watt version in there and the machine works again.

    Very nice post, and thank you for taking the time to jot this down. It explains why my ‘etch’ install didn’t show those messages but ‘squeeze’ does. I’m about to test some grub parameters now 🙂

  5. Hi,
    Thanks for this post, it’s the only thing I can find with a solution for this problem. But… I’m quite new to Linux/Ubuntu Server and I can’t find out how and where to change the PCI access mode. Can you help me a bit further?
    Thanks!

