Opened 12 years ago
Closed 12 years ago
#63 closed defect (fixed)
Instability of UFO Server
Reported by: | Suren A. Chilingaryan | Owned by: | Suren A. Chilingaryan |
---|---|---|---|
Priority: | critical | Milestone: | |
Component: | Infrastructure | Version: | |
Keywords: | Cc: | Matthias Balzer, Suren A. Chilingaryan, Tomas Farago, Patrik Vagovic, Tomy Rolo, thomas.vandekamp@gmx.de, Matthias Vogelgesang, WMEXNER |
Description (last modified by )
There are reports that the UFO server is unstable and periodically hangs.
I can't really reproduce the problem. I ran a set of tests to verify the CPUs, GPUs, memory, RAID storage, and network, and found no problems. PyHST and the UFO framework ran continuously for a week without any problems. So one of the following holds:
1) The problem was due to a malfunctioning PSU and everything is fine now.
2) The problem occurs under some external condition (overheating?).
3) A misbehaving application causes a crash or brings the system into an unstable state, or some of the applications in use misbehave under certain circumstances (for instance, with certain options set or with a specific data set).
In any case, to investigate this problem I need detailed bug reports. If hangs or crashes occur again, I need as much information as possible: which application caused the crash, along with its options and data set (i.e. how I can execute it); which applications were running in parallel; what was done before the crash; whether there were any error reports before the crash; and which environment you were using (SSH or NX session).
Attachments (0)
Change History (28)
comment:1 Changed 12 years ago by
comment:2 Changed 12 years ago by
Thanks, Matthias. Did the system crash or hang immediately, or only afterwards? Or did you just get wrong values?
Please let me know when you have time, and we can recheck this.
comment:3 Changed 12 years ago by
No, it was neither crashing nor hanging. Only the CameraLink communication was affected.
comment:4 Changed 12 years ago by
According to IPMI, the power supply delivers approx. -11.6 V instead of -12 V. I don't know how harmful this is, but on ipepdvcompute1 all voltages are quite precise. +12 V, +5 V, etc. are also precise on ufosrv1.
Does anybody know what variance is acceptable?
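For a rough sense of scale, the reading can be expressed as a relative deviation from the nominal rail voltage. The sketch below is only an illustration: the ±10% limit for the -12 V rail is an assumption taken from the usual ATX power supply design guide figures and should be checked against the documentation of the actual PSU. The helper names are made up for this example.

```python
def rail_deviation_percent(nominal_v: float, measured_v: float) -> float:
    """Return the deviation of a measured rail voltage from nominal, in percent."""
    return abs(measured_v - nominal_v) / abs(nominal_v) * 100.0


def within_tolerance(nominal_v: float, measured_v: float, tolerance_percent: float) -> bool:
    """True if the measured voltage is inside nominal +/- tolerance_percent."""
    return rail_deviation_percent(nominal_v, measured_v) <= tolerance_percent


# ASSUMPTION: +/-10% for the -12 V rail (typical ATX figure, not from this ticket).
ASSUMED_MINUS_12V_TOLERANCE_PERCENT = 10.0

deviation = rail_deviation_percent(-12.0, -11.6)
print(f"-12 V rail deviation: {deviation:.1f}%")  # prints 3.3%
print("within assumed tolerance:",
      within_tolerance(-12.0, -11.6, ASSUMED_MINUS_12V_TOLERANCE_PERCENT))
```

Under that assumption the -11.6 V reading would be well inside the allowed range, so it is probably not the cause by itself.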
comment:5 Changed 12 years ago by
ufosrv1 kernel: [77620.431743] BUG: unable to handle kernel NULL pointer dereference at (null)
comment:6 Changed 12 years ago by
It looks like nvidia-smi can crash the UFO server if it is run continuously for 1-2 days. I have upgraded the NVIDIA drivers to test whether the crash is version dependent.
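A reproduction loop of this kind could look roughly like the sketch below. It is not the script that was actually used on ufosrv1; the function name, the iteration count, and the delay are illustrative assumptions. Each iteration prints a timestamped line so that the last successful call is visible in the log after a crash.

```python
import subprocess
import time
from typing import List, Sequence


def run_repeatedly(command: Sequence[str], iterations: int, delay_s: float = 0.0) -> List[int]:
    """Run `command` `iterations` times and return the list of exit codes."""
    exit_codes = []
    for i in range(iterations):
        started = time.time()
        result = subprocess.run(
            command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        exit_codes.append(result.returncode)
        # Flush immediately: after a kernel panic buffered output would be lost.
        print(f"{started:.0f} iteration={i} rc={result.returncode}", flush=True)
        time.sleep(delay_s)
    return exit_codes
```

On the server this would be called as `run_repeatedly(["nvidia-smi"], iterations=200000, delay_s=1.0)`, with the output redirected to a file on a disk that survives the crash.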
comment:7 Changed 12 years ago by
ufosrv1 was running 295.41 until Friday. I tried this version on ipepdvcompute1 this weekend and it crashed as well. Both systems are now updated to the CUDA5 beta driver.
comment:9 Changed 12 years ago by
Fatal exception in interrupt, Bad RIP value, Backtrace:
do_softirq irq_exit smp_apic_timer_interrupt apic_timer_interrupt intel_idle __atomic_notifier_call_chain cpu_idle
Booted with "noapic" flag
comment:10 Changed 12 years ago by
- Kernel upgrade to 3.4
- Reverted to 295.xx driver family due to kernel incompatibility (295.59)
comment:11 Changed 12 years ago by
The same crash occurs with the new kernel (and without the frame grabber module loaded). Here is the current list of modules:
binfmt_misc 17540 1
nfs 411631 1
lockd 85545 1 nfs
fscache 61840 1 nfs
auth_rpcgss 45721 1 nfs
nfs_acl 12883 1 nfs
sunrpc 261456 16 nfs,lockd,auth_rpcgss,nfs_acl
af_packet 39810 0
ipmi_devintf 17707 0
ipmi_si 53468 0
ipmi_msghandler 50349 2 ipmi_devintf,ipmi_si
w83795 52252 0
w83627ehf 43321 0
hwmon_vid 12827 1 w83627ehf
lm75 13701 0
jc42 13947 0
cpufreq_conservative 13821 0
cpufreq_userspace 13162 0
cpufreq_powersave 12618 0
dm_mod 101260 0
xfs 926900 2
nvidia 12358288 0
joydev 17606 0
acpi_cpufreq 18857 1
mperf 12667 1 acpi_cpufreq
coretemp 13692 0
crc32c_intel 12858 0
snd_hda_codec_hdmi 40651 24
ixgbe 220311 0
microcode 35998 0
pcspkr 12718 0
serio_raw 13371 0
sg 36594 0
snd_hda_codec_realtek 87227 1
i2c_i801 18013 0
iTCO_wdt 18039 0
iTCO_vendor_support 13718 1 iTCO_wdt
e1000e 218340 0
snd_hda_intel 33874 0
snd_hda_codec 141096 3 snd_hda_codec_hdmi,snd_hda_codec_realtek,snd_hda_intel
snd_hwdep 13613 1 snd_hda_codec
snd_pcm 110316 3 snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec
button 13949 0
snd_timer 34085 1 snd_pcm
i7core_edac 28102 0
snd 91780 7 snd_hda_codec_hdmi,snd_hda_codec_realtek,snd_hda_intel,snd_hda_codec,snd_hwdep,snd_pcm,snd_timer
ioatdma 58876 64
edac_core 57842 4 i7core_edac
soundcore 15091 1 snd
dca 15232 2 ixgbe,ioatdma
mdio 13770 1 ixgbe
snd_page_alloc 14476 2 snd_hda_intel,snd_pcm
autofs4 43331 2
raid456 74241 0
async_raid6_recov 17348 1 raid456
async_pq 13429 2 raid456,async_raid6_recov
raid6_pq 88307 2 async_raid6_recov,async_pq
async_xor 13082 3 raid456,async_raid6_recov,async_pq
xor 12894 1 async_xor
async_memcpy 12650 2 raid456,async_raid6_recov
async_tx 13470 5 raid456,async_raid6_recov,async_pq,async_xor,async_memcpy
raid10 39640 0
raid0 17969 0
raid1 40002 3
ata_piix 35206 6
processor 45839 1 acpi_cpufreq
thermal_sys 25053 1 processor
ata_generic 12937 0
arcmsr 41605 4
The following extra modules are loaded compared to ipepdvcompute1:
async_memcpy async_pq async_raid6_recov async_tx async_xor ata_generic ata_piix auth_rpcgss dm_mod fscache ixgbe jc42 lockd mdio nfs nfs_acl raid0 raid1 raid10 raid456 raid6_pq sunrpc xfs xor
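A comparison like the one above can be produced by taking the set difference of the module names from `lsmod` on both machines. The following is a minimal sketch, not the method actually used here. The function names are illustrative, and the ipepdvcompute1 sample is hypothetical, since the ticket only lists the difference.

```python
from typing import List, Set


def module_names(lsmod_output: str) -> Set[str]:
    """Extract module names (first column) from `lsmod` output, skipping the header."""
    names = set()
    for line in lsmod_output.splitlines():
        fields = line.split()
        if not fields or fields[0] == "Module":
            continue
        names.add(fields[0])
    return names


def extra_modules(host_a: str, host_b: str) -> List[str]:
    """Modules loaded on host A but not on host B, sorted by name."""
    return sorted(module_names(host_a) - module_names(host_b))


# Excerpt of the ufosrv1 list above.
UFOSRV1 = """Module Size Used by
nfs 411631 1
xfs 926900 2
nvidia 12358288 0
"""
# HYPOTHETICAL sample for ipepdvcompute1 (its full list is not in this ticket).
IPEPDVCOMPUTE1 = """Module Size Used by
nvidia 12358288 0
"""

print(extra_modules(UFOSRV1, IPEPDVCOMPUTE1))  # prints ['nfs', 'xfs']
```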
comment:12 Changed 12 years ago by
[67686.060886] BUG: unable to handle kernel NULL pointer dereference at (null)
[67686.068722] IP: [< (null)>] (null)
[67686.073770] PGD 0
[67686.075792] Oops: 0010 [#1] PREEMPT SMP
[67686.079753] CPU 0
[67686.081584] Modules linked in: binfmt_misc nvidia(PO) nfs lockd fscache auth_rpcgss nfs_acl sunrpc af_packet ipmi_devintf ipmi_si ipmi_msghandler w83795 w83627ehf hwmon_vid lm75 jc42 cpufreq_conservative cpufreq_userspace cpufreq_powersave dm_mod xfs joydev snd_hda_codec_hdmi sg snd_hda_codec_realtek snd_hda_intel snd_hda_codec acpi_cpufreq snd_hwdep mperf snd_pcm iTCO_wdt coretemp snd_timer crc32c_intel microcode snd serio_raw pcspkr i7core_edac i2c_i801 iTCO_vendor_support e1000e ioatdma edac_core ixgbe soundcore button snd_page_alloc dca mdio autofs4 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid0 raid1 ata_piix processor thermal_sys ata_generic arcmsr [last unloaded: nvidia]
[67686.146730]
[67686.148219] Pid: 0, comm: swapper/0 Tainted: P O 3.4.4-2-desktop #1 Supermicro X8DTG-QF/X8DTG-QF
[67686.157891] RIP: 0010:[<0000000000000000>] [< (null)>] (null)
[67686.165359] RSP: 0018:ffff880cbfc03e20 EFLAGS: 00010082
[67686.170651] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000000027fa
[67686.177757] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff880c77e02008
[67686.184862] RBP: ffff880c696373b0 R08: 0000000000000001 R09: ffff880c76e1b494
[67686.191967] R10: 0000000000000000 R11: 0000000000000001 R12: ffff880c77e02008
[67686.199072] R13: ffff880c777d2008 R14: 0000000000000000 R15: ffff880c81390008
[67686.206179] FS: 0000000000000000(0000) GS:ffff880cbfc00000(0000) knlGS:0000000000000000
[67686.214236] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[67686.219960] CR2: 0000000000000000 CR3: 0000000001c0b000 CR4: 00000000000007f0
[67686.227074] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[67686.234180] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[67686.241294] Process swapper/0 (pid: 0, threadinfo ffffffff81c00000, task ffffffff81c13020)
[67686.249523] Stack:
[67686.251529] ffffffffa1800a72 ffff880c777d2008 ffff880c696343b8 ffff880cbfc03eb4
[67686.258964] 0000000000000010 0000000000000000 ffffffffa180b8bd 0000000100000086
[67686.266396] 0000000000000046 0000000000000010 ffff881877a18800 0000000000000000
[67686.273832] Call Trace:
[67686.276270] Inexact backtrace:
[67686.276271]
[67686.280799] <IRQ>
[67686.283009] [<ffffffffa1800a72>] ? _nv014592rm+0x8d/0xe4 [nvidia]
[67686.289241] [<ffffffffa180b8bd>] ? rm_isr+0x12d/0x236 [nvidia]
[67686.295211] [<ffffffffa182bdf1>] ? nv_kern_isr+0x21/0x70 [nvidia]
[67686.301370] [<ffffffff810cc8e5>] ? handle_irq_event_percpu+0x75/0x2a0
[67686.307876] [<ffffffff810ccb57>] ? handle_irq_event+0x47/0x70
[67686.313694] [<ffffffff810cfcc0>] ? handle_fasteoi_irq+0x60/0x100
[67686.319773] [<ffffffff810041e8>] ? handle_irq+0x18/0x30
[67686.325071] [<ffffffff81003e63>] ? do_IRQ+0x53/0xd0
[67686.330025] [<ffffffff815a4dea>] ? common_interrupt+0x6a/0x6a
[67686.335841] <EOI>
[67686.337952] [<ffffffff8100af78>] ? poll_idle+0x48/0x2b0
[67686.343251] [<ffffffff8100bd76>] ? cpu_idle+0x96/0xf0
[67686.348379] [<ffffffff81cbfbbd>] ? start_kernel+0x39e/0x3a9
[67686.354022] [<ffffffff81cbf6c2>] ? repair_env_string+0x57/0x57
[67686.359926] [<ffffffff81cbf140>] ? early_idt_handlers+0x140/0x140
[67686.366089] [<ffffffff81cbf433>] ? x86_64_start_kernel+0xd1/0xe0
[67686.372165] Code: Bad RIP value.
[67686.375500] RIP [< (null)>] (null)
[67686.380635] RSP <ffff880cbfc03e20>
[67686.384111] CR2: 0000000000000000
[67686.387740] ---[ end trace 4ac351a70eb9c3ec ]---
[67686.392343] Kernel panic - not syncing: Fatal exception in interrupt
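To see quickly which module the frames of such a trace belong to, the call trace can be parsed and grouped by module. The sketch below assumes the usual `[<address>] ? symbol+0xoff/0xlen [module]` layout with standard square brackets. The regular expression and function names are my own, and the sample is an excerpt of the interrupt part of the trace above.

```python
import re
from collections import Counter
from typing import List, Tuple

# One stack frame: address, optional "?" marker, symbol+offset/size, optional [module].
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\]\s+(?:\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:[ \t]+\[(?P<module>[A-Za-z_]\w*)\])?"
)


def parse_frames(oops_text: str) -> List[Tuple[str, str]]:
    """Return (symbol, module) pairs; built-in symbols are reported as 'kernel'."""
    return [
        (m.group("symbol"), m.group("module") or "kernel")
        for m in FRAME_RE.finditer(oops_text)
    ]


SAMPLE = """\
[67686.283009] [<ffffffffa1800a72>] ? _nv014592rm+0x8d/0xe4 [nvidia]
[67686.289241] [<ffffffffa180b8bd>] ? rm_isr+0x12d/0x236 [nvidia]
[67686.295211] [<ffffffffa182bdf1>] ? nv_kern_isr+0x21/0x70 [nvidia]
[67686.301370] [<ffffffff810cc8e5>] ? handle_irq_event_percpu+0x75/0x2a0
[67686.307876] [<ffffffff810ccb57>] ? handle_irq_event+0x47/0x70
"""

frames = parse_frames(SAMPLE)
per_module = Counter(module for _, module in frames)
print(frames[0])         # ('_nv014592rm', 'nvidia')
print(dict(per_module))  # {'nvidia': 3, 'kernel': 2}
```

In this excerpt the innermost frames are all in the nvidia interrupt handler (nv_kern_isr, rm_isr), which is where the jump to the NULL address happens.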
comment:13 Changed 12 years ago by
- Updated BIOS 2.0a to 2.0c
- With the new BIOS, the system appears to be affected by kernel bug #43282. Passing "ghes.disable=1" to the kernel is proposed as a temporary solution. However, for me the bug went away after selecting Fail-safe settings and ACPI3 compatibility in the BIOS.
- All NVIDIA GPUs share IRQ16; on ipepdvcompute1 some of the GPUs use IRQ16 and others IRQ18. Some time ago, the following bug was fixed (NVIDIA Changelog):
Fixed an interrupt handling deficiency that could lead to performance and stability problems when many NVIDIA GPUs shared few IRQs.
- Similar problems on the net 1 2
- Booted with irqpoll kernel parameter
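The IRQ sharing mentioned above can be checked from `/proc/interrupts`, where the handlers registered on one IRQ line are listed comma-separated at the end of the row. The sketch below counts how many handlers of a given driver sit on each IRQ. The function name is illustrative and the sample text is hypothetical (the real counters from ufosrv1 are not in this ticket).

```python
import re
from typing import Dict


def driver_irq_usage(proc_interrupts: str, driver: str) -> Dict[str, int]:
    """Map IRQ number -> number of handlers registered by `driver` on that IRQ."""
    usage = {}
    for line in proc_interrupts.splitlines():
        irq, sep, rest = line.partition(":")
        irq = irq.strip()
        if not sep or not irq.isdigit():
            continue  # skip the CPU header and named rows such as NMI or LOC
        handlers = [tok for tok in re.split(r"[,\s]+", rest) if tok == driver]
        if handlers:
            usage[irq] = len(handlers)
    return usage


# HYPOTHETICAL sample in the /proc/interrupts layout; the counts are invented.
SAMPLE = """\
           CPU0       CPU1
 16:       5120          0   IO-APIC-fasteoi   nvidia, nvidia, nvidia
 18:         42          0   IO-APIC-fasteoi   nvidia
NMI:          0          0   Non-maskable interrupts
"""

print(driver_irq_usage(SAMPLE, "nvidia"))  # {'16': 3, '18': 1}
```

On the real system the input would be the contents of `/proc/interrupts`; a single IRQ carrying all GPU handlers corresponds to the situation described for ufosrv1.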
comment:14 Changed 12 years ago by
- Enabled native support of PCI Express in BIOS
- Updated nvidia driver to 304.22
comment:15 Changed 12 years ago by
- Documentation/PCI/pcieaer-howto.txt
- We sporadically get the following errors reported (no crash)
[43908.631425] pcieport 0000:86:08.0: [ 0] Receiver Error (First)
[46313.504492] pcieport 0000:80:03.0: AER: Corrected error received: id=0000
[46313.511344] pcieport 0000:86:08.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=8640(Receiver ID)
[46313.521519] pcieport 0000:86:08.0: device [10b5:8648] error status/mask=00000001/00002000
- 10b5:8648 is PLX Technology, Inc. PEX 8648 48-lane, 12-Port PCI Express Gen 2 (5.0 GT/s) Switch, i.e. external GPU box
- pcieport 0000:86:08.0 refers to one of the bridges mentioned above
- pcieport 0000:80:03.0 refers to Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 [8086:340a]
- These bug reports seem related: 1, 2
- The 2nd in particular shows very similar behaviour. The problem was solved by moving the card to a different PCIe slot.
- I.e. the problem may lie in PCIe communication. Possibilities are:
- A problem in the NVIDIA drivers preventing devices from operating correctly in certain slots.
- Damaged PCIe slot on motherboard
- Damaged PCIe cable
- Damaged slot in external box
- Damaged PCIe interface of GPU card
- According to the PCIe tree, all the problems are registered at one specific PCIe slot / GPU device. So it seems that either one of the GPU box slots or one of the GPUs may be causing the problems.
| +-03.0-[83-8c]----00.0-[84-8c]--+-04.0-[85-88]----00.0-[86-88]--+ | | | | | | | \-08.0-[88]--+-00.0 nVidia Corporation GF110 [GeForce GTX 580] [10de:1080]
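The mapping from the AER messages to a position in the tree can be done mechanically. The `id=8640` field is a standard PCI requester ID (bus in the upper 8 bits, device in the next 5, function in the lowest 3), so it decodes to the same port that the message names. A minimal sketch follows. The regular expression and function names are my own, and the sample lines are the AER messages quoted above with standard square brackets.

```python
import re
from collections import Counter

AER_RE = re.compile(
    r"pcieport (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+),"
)


def count_aer_errors(log_text: str) -> Counter:
    """Count AER bus errors per (device, severity, layer)."""
    counts = Counter()
    for m in AER_RE.finditer(log_text):
        counts[(m.group("bdf"), m.group("severity"), m.group("layer"))] += 1
    return counts


def decode_requester_id(rid: int) -> str:
    """Decode a 16-bit PCI requester ID into bus:device.function notation."""
    return f"{rid >> 8:02x}:{(rid >> 3) & 0x1f:02x}.{rid & 0x7}"


SAMPLE = (
    "[46313.504492] pcieport 0000:80:03.0: AER: Corrected error received: id=0000\n"
    "[46313.511344] pcieport 0000:86:08.0: PCIe Bus Error: severity=Corrected, "
    "type=Physical Layer, id=8640(Receiver ID)\n"
)

print(dict(count_aer_errors(SAMPLE)))
print(decode_requester_id(0x8640))  # 86:08.0
```

Run over the full kernel log, the counter shows whether the corrected errors really concentrate on a single port, as the tree excerpt above suggests.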
PCIe Tree
-+-[0000:ff]-+-00.0 Intel Corporation Xeon 5600 Series QuickPath Architecture Generic Non-core Registers [8086:2c70] | +-00.1 Intel Corporation Xeon 5600 Series QuickPath Architecture System Address Decoder [8086:2d81] | +-02.0 Intel Corporation Xeon 5600 Series QPI Link 0 [8086:2d90] | +-02.1 Intel Corporation Xeon 5600 Series QPI Physical 0 [8086:2d91] | +-02.2 Intel Corporation Xeon 5600 Series Mirror Port Link 0 [8086:2d92] | +-02.3 Intel Corporation Xeon 5600 Series Mirror Port Link 1 [8086:2d93] | +-02.4 Intel Corporation Xeon 5600 Series QPI Link 1 [8086:2d94] | +-02.5 Intel Corporation Xeon 5600 Series QPI Physical 1 [8086:2d95] | +-03.0 Intel Corporation Xeon 5600 Series Integrated Memory Controller Registers [8086:2d98] | +-03.1 Intel Corporation Xeon 5600 Series Integrated Memory Controller Target Address Decoder [8086:2d99] | +-03.2 Intel Corporation Xeon 5600 Series Integrated Memory Controller RAS Registers [8086:2d9a] | +-03.4 Intel Corporation Xeon 5600 Series Integrated Memory Controller Test Registers [8086:2d9c] | +-04.0 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Control [8086:2da0] | +-04.1 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Address [8086:2da1] | +-04.2 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Rank [8086:2da2] | +-04.3 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Thermal Control [8086:2da3] | +-05.0 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Control [8086:2da8] | +-05.1 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Address [8086:2da9] | +-05.2 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Rank [8086:2daa] | +-05.3 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Thermal Control [8086:2dab] | +-06.0 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Control [8086:2db0] | +-06.1 Intel 
Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Address [8086:2db1] | +-06.2 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Rank [8086:2db2] | \-06.3 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Thermal Control [8086:2db3] +-[0000:fe]-+-00.0 Intel Corporation Xeon 5600 Series QuickPath Architecture Generic Non-core Registers [8086:2c70] | +-00.1 Intel Corporation Xeon 5600 Series QuickPath Architecture System Address Decoder [8086:2d81] | +-02.0 Intel Corporation Xeon 5600 Series QPI Link 0 [8086:2d90] | +-02.1 Intel Corporation Xeon 5600 Series QPI Physical 0 [8086:2d91] | +-02.2 Intel Corporation Xeon 5600 Series Mirror Port Link 0 [8086:2d92] | +-02.3 Intel Corporation Xeon 5600 Series Mirror Port Link 1 [8086:2d93] | +-02.4 Intel Corporation Xeon 5600 Series QPI Link 1 [8086:2d94] | +-02.5 Intel Corporation Xeon 5600 Series QPI Physical 1 [8086:2d95] | +-03.0 Intel Corporation Xeon 5600 Series Integrated Memory Controller Registers [8086:2d98] | +-03.1 Intel Corporation Xeon 5600 Series Integrated Memory Controller Target Address Decoder [8086:2d99] | +-03.2 Intel Corporation Xeon 5600 Series Integrated Memory Controller RAS Registers [8086:2d9a] | +-03.4 Intel Corporation Xeon 5600 Series Integrated Memory Controller Test Registers [8086:2d9c] | +-04.0 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Control [8086:2da0] | +-04.1 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Address [8086:2da1] | +-04.2 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Rank [8086:2da2] | +-04.3 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Thermal Control [8086:2da3] | +-05.0 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Control [8086:2da8] | +-05.1 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Address [8086:2da9] | +-05.2 Intel Corporation Xeon 
5600 Series Integrated Memory Controller Channel 1 Rank [8086:2daa] | +-05.3 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Thermal Control [8086:2dab] | +-06.0 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Control [8086:2db0] | +-06.1 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Address [8086:2db1] | +-06.2 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Rank [8086:2db2] | \-06.3 Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Thermal Control [8086:2db3] +-[0000:80]-+-00.0-[81]-- | +-01.0-[82]-- | +-03.0-[83-8c]----00.0-[84-8c]--+-04.0-[85-88]----00.0-[86-88]--+-04.0-[87]--+-00.0 nVidia Corporation GF110 [GeForce GTX 580] [10de:1080] | | | | \-00.1 nVidia Corporation GF110 High Definition Audio Controller [10de:0e09] | | | \-08.0-[88]--+-00.0 nVidia Corporation GF110 [GeForce GTX 580] [10de:1080] | | | \-00.1 nVidia Corporation GF110 High Definition Audio Controller [10de:0e09] | | \-08.0-[89-8c]----00.0-[8a-8c]--+-04.0-[8b]--+-00.0 nVidia Corporation GF110 [GeForce GTX 580] [10de:1080] | | | \-00.1 nVidia Corporation GF110 High Definition Audio Controller [10de:0e09] | | \-08.0-[8c]--+-00.0 nVidia Corporation GF110 [GeForce GTX 580] [10de:1080] | | \-00.1 nVidia Corporation GF110 High Definition Audio Controller [10de:0e09] | +-07.0-[8d]----00.0 Areca Technology Corp. 
ARC-1880 8/12 port PCIe/PCI-X to SAS/SATA II RAID Controller [17d3:1880] | +-13.0 Intel Corporation 5520/5500/X58 I/O Hub I/OxAPIC Interrupt Controller [8086:342d] | +-14.0 Intel Corporation 5520/5500/X58 I/O Hub System Management Registers [8086:342e] | +-14.1 Intel Corporation 5520/5500/X58 I/O Hub GPIO and Scratch Pad Registers [8086:3422] | +-14.2 Intel Corporation 5520/5500/X58 I/O Hub Control Status and RAS Registers [8086:3423] | +-14.3 Intel Corporation 5520/5500/X58 I/O Hub Throttle Registers [8086:3438] | +-16.0 Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3430] | +-16.1 Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3431] | +-16.2 Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3432] | +-16.3 Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3433] | +-16.4 Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3429] | +-16.5 Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:342a] | +-16.6 Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:342b] | \-16.7 Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:342c] \-[0000:00]-+-00.0 Intel Corporation 5520 I/O Hub to ESI Port [8086:3406] +-01.0-[01]----00.0 Intel Corporation 82598EB 10-Gigabit AF Network Connection [8086:10c7] +-03.0-[02]--+-00.0 nVidia Corporation GF110 [GeForce GTX 580] [10de:1080] | \-00.1 nVidia Corporation GF110 High Definition Audio Controller [10de:0e09] +-07.0-[03]--+-00.0 nVidia Corporation GF110 [GeForce GTX 580] [10de:1080] | \-00.1 nVidia Corporation GF110 High Definition Audio Controller [10de:0e09] +-13.0 Intel Corporation 5520/5500/X58 I/O Hub I/OxAPIC Interrupt Controller [8086:342d] +-14.0 Intel Corporation 5520/5500/X58 I/O Hub System Management Registers [8086:342e] +-16.0 Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3430] +-16.1 Intel Corporation 
5520/5500/X58 Chipset QuickData Technology Device [8086:3431] +-16.2 Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3432] +-16.3 Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3433] +-16.4 Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3429] +-16.5 Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:342a] +-16.6 Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:342b] +-16.7 Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:342c] +-1a.0 Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4 [8086:3a37] +-1a.1 Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #5 [8086:3a38] +-1a.2 Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6 [8086:3a39] +-1a.7 Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2 [8086:3a3c] +-1c.0-[04]----00.0 Silicon Software GmbH microEnable IV-FULL x4 [1ae8:0a44] +-1c.4-[05]----00.0 Intel Corporation 82574L Gigabit Network Connection [8086:10d3] +-1c.5-[06]----00.0 Intel Corporation 82574L Gigabit Network Connection [8086:10d3] +-1d.0 Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1 [8086:3a34] +-1d.1 Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2 [8086:3a35] +-1d.2 Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3 [8086:3a36] +-1d.7 Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1 [8086:3a3a] +-1e.0-[07]----01.0 Matrox Graphics, Inc. MGA G200eW WPCM450 [102b:0532] +-1f.0 Intel Corporation 82801JIR (ICH10R) LPC Interface Controller [8086:3a16] +-1f.2 Intel Corporation 82801JI (ICH10 Family) 4 port SATA IDE Controller #1 [8086:3a20] +-1f.3 Intel Corporation 82801JI (ICH10 Family) SMBus Controller [8086:3a30] \-1f.5 Intel Corporation 82801JI (ICH10 Family) 2 port SATA IDE Controller #2 [8086:3a26]
comment:16 Changed 12 years ago by
- 5 days without crashes with GPU box disconnected, no PCIe problems reported
- The PCIe cable is bent at extreme angles. I think this could cause the problems on the bus.
- Booted with the cable bending fixed.
comment:17 Changed 12 years ago by
Problems on the PCIe bus are no longer registered. However, the server crashed again with RIP=NULL in interrupt, so these appear to be unrelated problems:
[81650.651295] BUG: unable to handle kernel NULL pointer dereference at (null) [81650.659133] IP: [< (null)>] (null) [81650.664182] PGD 187855b067 PUD 187fbdc067 PMD 0 [81650.668832] Oops: 0010 [#1] PREEMPT SMP [81650.672793] CPU 0 [81650.674625] Modules linked in: binfmt_misc nfs lockd fscache auth_rpcgss nfs_acl sunrpc af_packet ipmi_devintf ipmi_si ipmi_msghandler w83795 w83627ehf hwmon_vid lm75 jc42 cpufreq_conservative cpufreq_userspace cpufreq_powersave dm_mod xfs joydev nvidia(PO) snd_hda_codec_hdmi sg snd_hda_intel snd_hda_codec acpi_cpufreq mperf snd_hwdep snd_pcm coretemp snd_timer crc32c_intel pcspkr serio_raw snd iTCO_wdt microcode i2c_i801 iTCO_vendor_support e1000e i7core_edac soundcore ixgbe ioatdma button snd_page_alloc edac_core dca mdio autofs4 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid0 raid1 ata_piix processor thermal_sys ata_generic arcmsr [81650.735761] [81650.737246] Pid: 28536, comm: nvidia-smi Tainted: P O 3.4.4-2-desktop #1 Supermicro X8DTG-QF/X8DTG-QF [81650.747343] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [81650.754812] RSP: 0018:ffff880cbfc03e30 EFLAGS: 00010082 [81650.760103] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000008779 [81650.767210] RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffff880c709d0008 [81650.774314] RBP: ffff880c70e0f730 R08: 0000000000000001 R09: ffff881880e2769c [81650.781420] R10: 0000000000000000 R11: 0000000000000001 R12: ffff880c709d0008 [81650.788526] R13: ffff880c71806008 R14: 0000000000000000 R15: ffff880c70a02008 [81650.795632] FS: 00007f94c19c9700(0000) GS:ffff880cbfc00000(0000) knlGS:0000000000000000 [81650.803689] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [81650.809413] CR2: 0000000000000000 CR3: 000000187fc35000 CR4: 00000000000007f0 [81650.816517] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [81650.823623] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [81650.830730] Process 
nvidia-smi (pid: 28536, threadinfo ffff880c72404000, task ffff880c73a02500) [81650.839392] Stack: [81650.841396] ffffffffa13e9d2b ffff880c71806008 ffff880c70e0c738 ffff880cbfc03eb4 [81650.848831] 0000000000000010 0000000000000005 ffffffffa13f2a93 0000000000000010 [81650.856266] ffff88187b0a4800 0000000000000000 ffff880c724058e8 0000000000000009 [81650.863701] Call Trace: [81650.866138] Inexact backtrace: [81650.866138] [81650.870668] <IRQ> [81650.872881] [<ffffffffa13e9d2b>] ? _nv014839rm+0x8d/0xe2 [nvidia] [81650.879094] [<ffffffffa13f2a93>] ? rm_isr+0xb8/0x152 [nvidia] [81650.884962] [<ffffffffa14103f1>] ? nv_kern_isr+0x21/0x70 [nvidia] [81650.891126] [<ffffffff810cc8e5>] ? handle_irq_event_percpu+0x75/0x2a0 [81650.897632] [<ffffffff810ccb57>] ? handle_irq_event+0x47/0x70 [81650.903440] [<ffffffff810cfcc0>] ? handle_fasteoi_irq+0x60/0x100 [81650.909510] [<ffffffff810041e8>] ? handle_irq+0x18/0x30 [81650.914799] [<ffffffff81003e63>] ? do_IRQ+0x53/0xd0 [81650.919745] [<ffffffff815a4dea>] ? common_interrupt+0x6a/0x6a [81650.925560] <EOI> [81650.927715] [<ffffffffa0e21b21>] ? _nv014625rm+0x1c1/0x1c2 [nvidia] [81650.934087] [<ffffffffa0e2178c>] ? _nv014620rm+0x2b/0x8d [nvidia] [81650.940293] [<ffffffffa0e217c2>] ? _nv014620rm+0x61/0x8d [nvidia] [81650.946500] [<ffffffffa0e217c2>] ? _nv014620rm+0x61/0x8d [nvidia] [81650.952698] [<ffffffffa0e21875>] ? _nv014622rm+0x32/0x9e [nvidia] [81650.958896] [<ffffffffa0e21919>] ? _nv014616rm+0x38/0x46 [nvidia] [81650.965147] [<ffffffffa12927e3>] ? _nv004062rm+0x1be/0x1607 [nvidia] [81650.971664] [<ffffffffa1292640>] ? _nv004062rm+0x1b/0x1607 [nvidia] [81650.978112] [<ffffffffa11b6b22>] ? _nv004040rm+0x1289/0xae92 [nvidia] [81650.984734] [<ffffffffa11b623e>] ? _nv004040rm+0x9a5/0xae92 [nvidia] [81650.991269] [<ffffffffa11b9595>] ? _nv004040rm+0x3cfc/0xae92 [nvidia] [81650.997890] [<ffffffffa11b6096>] ? _nv004040rm+0x7fd/0xae92 [nvidia] [81651.004345] [<ffffffffa0dee9f7>] ? 
_nv009855rm+0x175/0x278 [nvidia] [81651.010735] [<ffffffffa13f7abb>] ? _nv014821rm+0x21a/0x389 [nvidia] [81651.017123] [<ffffffffa13f9061>] ? _nv001090rm+0xac/0x65e [nvidia] [81651.023423] [<ffffffffa13f1e24>] ? rm_init_adapter+0xac/0x146 [nvidia] [81651.030070] [<ffffffffa1411f0c>] ? nv_kern_open+0x45c/0x810 [nvidia] [81651.036493] [<ffffffff8116650d>] ? chrdev_open+0x9d/0x1b0 [81651.041963] [<ffffffff81166470>] ? cdev_put+0x30/0x30 [81651.047089] [<ffffffff8116006a>] ? __dentry_open+0x25a/0x330 [81651.052813] [<ffffffff811711b8>] ? do_last+0x408/0x750 [81651.058025] [<ffffffff8116dd05>] ? path_init+0x315/0x400 [81651.063409] [<ffffffff81171619>] ? path_openat+0xd9/0x400 [81651.068873] [<ffffffff81171a65>] ? do_filp_open+0x45/0xb0 [81651.074335] [<ffffffff8116d511>] ? getname_flags+0x31/0xf0 [81651.079888] [<ffffffff8117e11b>] ? alloc_fd+0xcb/0x120 [81651.085099] [<ffffffff81161608>] ? do_sys_open+0xf8/0x1d0 [81651.090563] [<ffffffff815abc39>] ? system_call_fastpath+0x16/0x1b [81651.096717] Code: Bad RIP value. 
[81651.100052] RIP [< (null)>] (null) [81651.105189] RSP <ffff880cbfc03e30> [81651.108662] CR2: 0000000000000000 [81651.112291] ---[ end trace d1e853d782dd9d8e ]--- [81651.116892] Kernel panic - not syncing: Fatal exception in interrupt [81651.442356] ------------[ cut here ]------------ [81651.446965] WARNING: at /home/abuild/rpmbuild/BUILD/kernel-desktop-3.4.4/linux-3.4/arch/x86/kernel/smp.c:120 update_process_times+0x65/0x80() [81651.459609] Hardware name: X8DTG-QF [81651.463084] Modules linked in: binfmt_misc nfs lockd fscache auth_rpcgss nfs_acl sunrpc af_packet ipmi_devintf ipmi_si ipmi_msghandler w83795 w83627ehf hwmon_vid lm75 jc42 cpufreq_conservative cpufreq_userspace cpufreq_powersave dm_mod xfs joydev nvidia(PO) snd_hda_codec_hdmi sg snd_hda_intel snd_hda_codec acpi_cpufreq mperf snd_hwdep snd_pcm coretemp snd_timer crc32c_intel pcspkr serio_raw snd iTCO_wdt microcode i2c_i801 iTCO_vendor_support e1000e i7core_edac soundcore ixgbe ioatdma button snd_page_alloc edac_core dca mdio autofs4 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid0 raid1 ata_piix processor thermal_sys ata_generic arcmsr [81651.524030] Pid: 28536, comm: nvidia-smi Tainted: P D O 3.4.4-2-desktop #1 [81651.531566] Call Trace: [81651.534010] [<ffffffff810043fa>] dump_trace+0xaa/0x2b0 [81651.539221] [<ffffffff8158b1b1>] dump_stack+0x69/0x6f [81651.544349] [<ffffffff8104010b>] warn_slowpath_common+0x7b/0xc0 [81651.550338] [<ffffffff81050525>] update_process_times+0x65/0x80 [81651.556328] [<ffffffff81093a1b>] tick_sched_timer+0x5b/0xc0 [81651.561974] [<ffffffff8106602e>] __run_hrtimer+0x6e/0x240 [81651.567444] [<ffffffff810667e5>] hrtimer_interrupt+0xe5/0x200 [81651.573264] [<ffffffff81021a93>] smp_apic_timer_interrupt+0x63/0xa0 [81651.579598] [<ffffffff815ac73a>] apic_timer_interrupt+0x6a/0x70 [81651.585583] [<ffffffff8158dd57>] panic+0x18f/0x1d2 [81651.590448] [<ffffffff815a5ccf>] oops_end+0xef/0xf0 [81651.595402] [<ffffffff815a8042>] 
do_page_fault+0x402/0x530 [81651.600960] [<ffffffff815a5075>] page_fault+0x25/0x30 [81651.606086] ---[ end trace d1e853d782dd9d8f ]---
comment:18 Changed 12 years ago by
- Removed irqpoll from boot line
- Upgraded nvidia driver to 304.33 (driver from CUDA5 RC)
- The backtrace is quite different from the old ones. It has similarities with Debian bug #667884, which references a discussion on the NVIDIA forum
- Another report from OpenSuSE, Gentoo
- NVIDIA forums: 1, 2, 3
- There is a recommendation to try booting with pcie_aspm=off, and multiple people confirm it helps.
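Since several boot parameters were tried in this ticket (noapic, irqpoll, ghes.disable=1, pcie_aspm=off), it helps to verify after each reboot which of them are actually active. The sketch below parses a kernel command line as found in `/proc/cmdline`. The function name is illustrative, the sample command line is hypothetical, and quoted values containing spaces are not handled.

```python
from typing import Dict, Optional


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {parameter: value}; flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# HYPOTHETICAL command line; on a running system read /proc/cmdline instead.
SAMPLE_CMDLINE = "root=/dev/md0 quiet pcie_aspm=off"

params = parse_cmdline(SAMPLE_CMDLINE)
print("pcie_aspm:", params.get("pcie_aspm"))    # off
print("irqpoll active:", "irqpoll" in params)   # False
```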
comment:19 Changed 12 years ago by
[10744.359268] BUG: unable to handle kernel NULL pointer dereference at (null) [10744.367103] IP: [< (null)>] (null) [10744.372152] PGD c77c72067 PUD c787c3067 PMD 0 [10744.376629] Oops: 0010 [#1] PREEMPT SMP [10744.380590] CPU 0 [10744.382420] Modules linked in: binfmt_misc nfs lockd fscache auth_rpcgss nfs_acl sunrpc af_packet ipmi_devintf ipmi_si ipmi_msghandler w83795 w83627ehf hwmon_vid lm75 cpufreq_conservative jc42 cpufreq_userspace cpufreq_powersave dm_mod snd_hda_codec_hdmi snd_hda_intel snd_hda_codec xfs ixgbe nvidia(PO) ioatdma acpi_cpufreq snd_hwdep snd_pcm snd_timer e1000e snd sg iTCO_wdt serio_raw joydev i7core_edac dca mperf button coretemp pcspkr iTCO_vendor_support i2c_i801 edac_core soundcore crc32c_intel snd_page_alloc mdio microcode autofs4 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid0 raid1 ata_piix processor thermal_sys ata_generic arcmsr [10744.443557][10744.445043] Pid: 15063, comm: nvidia-smi Tainted: P O 3.4.4-2-desktop #1 Supermicro X8DTG-QF/X8DTG-QF [10744.455139] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [10744.462609] RSP: 0018:ffff880cbfc03e30 EFLAGS: 00010082 [10744.467900] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000e293 [10744.475006] RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffff880c7589c008 [10744.482112] RBP: ffff880c7534b430 R08: 0000000000000001 R09: ffff880c73bb7b1c [10744.489216] R10: 0000000000000000 R11: 0000000000000010 R12: ffff880c7589c008 [10744.496323] R13: ffff880c77a88008 R14: 0000000000000000 R15: ffff880c74b6d008 [10744.503428] FS: 00007fef894e2700(0000) GS:ffff880cbfc00000(0000) knlGS:0000000000000000 [10744.511486] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [10744.517209] CR2: 0000000000000000 CR3: 0000000c8186c000 CR4: 00000000000007f0 [10744.524314] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [10744.531419] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [10744.538527] Process nvidia-smi 
(pid: 15063, threadinfo ffff880c75582000, task ffff880c7552a3c0) [10744.547187] Stack: [10744.549193] ffffffffa1e5193b ffff880c77a88008 ffff880c75348438 ffff880cbfc03eb4 [10744.556627] 0000000000000010 0000000000000005 ffffffffa1e5a6d3 0000000000000010 [10744.564061] ffff88187fe7a000 0000000000000000 ffff880c75583918 0000000000000009 [10744.571496] Call Trace: [10744.573934] Inexact backtrace: [10744.573935] [10744.578464] <IRQ> [10744.580646] [<ffffffffa1e5193b>] ? _nv014871rm+0x8d/0xe2 [nvidia] [10744.586864] [<ffffffffa1e5a6d3>] ? rm_isr+0xb8/0x152 [nvidia] [10744.592730] [<ffffffffa1e78031>] ? nv_kern_isr+0x21/0x70 [nvidia] [10744.598894] [<ffffffff810cc8e5>] ? handle_irq_event_percpu+0x75/0x2a0 [10744.605403] [<ffffffff810ccb57>] ? handle_irq_event+0x47/0x70 [10744.611221] [<ffffffff810cfcc0>] ? handle_fasteoi_irq+0x60/0x100 [10744.617298] [<ffffffff810041e8>] ? handle_irq+0x18/0x30 [10744.622595] [<ffffffff81003e63>] ? do_IRQ+0x53/0xd0 [10744.627542] [<ffffffff815a4dea>] ? common_interrupt+0x6a/0x6a [10744.633358] <EOI> [10744.635470] [<ffffffff8114d532>] ? cache_grow+0x202/0x2b0 [10744.640985] [<ffffffffa1886821>] ? _nv014644rm+0x1ed/0x1ed [nvidia] [10744.647364] [<ffffffffa1886882>] ? _nv014650rm+0x61/0x8d [nvidia] [10744.653570] [<ffffffffa1886882>] ? _nv014650rm+0x61/0x8d [nvidia] [10744.659768] [<ffffffffa1886935>] ? _nv014652rm+0x32/0x9e [nvidia] [10744.665966] [<ffffffffa18869d9>] ? _nv014646rm+0x38/0x46 [nvidia] [10744.672215] [<ffffffffa1cf9e03>] ? _nv004066rm+0x1be/0x1607 [nvidia] [10744.678725] [<ffffffffa1cf9c60>] ? _nv004066rm+0x1b/0x1607 [nvidia] [10744.685173] [<ffffffffa1c1ca92>] ? _nv004044rm+0x1259/0xae8b [nvidia] [10744.691795] [<ffffffffa1c1c1b1>] ? _nv004044rm+0x978/0xae8b [nvidia] [10744.698330] [<ffffffffa1c1f505>] ? _nv004044rm+0x3ccc/0xae8b [nvidia] [10744.704951] [<ffffffffa1c1c009>] ? _nv004044rm+0x7d0/0xae8b [nvidia] [10744.711406] [<ffffffffa1853ab7>] ? _nv009864rm+0x175/0x278 [nvidia] [10744.717795] [<ffffffffa1e5f6fb>] ? 
_nv014853rm+0x21a/0x389 [nvidia] [10744.724184] [<ffffffffa1e60ca1>] ? _nv001095rm+0xac/0x65e [nvidia] [10744.730485] [<ffffffffa1e59a64>] ? rm_init_adapter+0xac/0x146 [nvidia] [10744.737131] [<ffffffffa1e79b4c>] ? nv_kern_open+0x45c/0x810 [nvidia] [10744.743553] [<ffffffff8116650d>] ? chrdev_open+0x9d/0x1b0 [10744.749023] [<ffffffff81166470>] ? cdev_put+0x30/0x30 [10744.754142] [<ffffffff8116006a>] ? __dentry_open+0x25a/0x330 [10744.759864] [<ffffffff811711b8>] ? do_last+0x408/0x750 [10744.765068] [<ffffffff8116dd05>] ? path_init+0x315/0x400 [10744.770445] [<ffffffff81171619>] ? path_openat+0xd9/0x400 [10744.775908] [<ffffffff81171a65>] ? do_filp_open+0x45/0xb0 [10744.781371] [<ffffffff8116d511>] ? getname_flags+0x31/0xf0 [10744.786924] [<ffffffff8117e11b>] ? alloc_fd+0xcb/0x120 [10744.792135] [<ffffffff81161608>] ? do_sys_open+0xf8/0x1d0 [10744.797607] [<ffffffff815abc39>] ? system_call_fastpath+0x16/0x1b [10744.803768] Code: Bad RIP value. [10744.807113] RIP [< (null)>] (null) [10744.812250] RSP <ffff880cbfc03e30> [10744.815724] CR2: 0000000000000000 [10744.819340] ---[ end trace 03a974d4317ca792 ]--- [10744.823947] Kernel panic - not syncing: Fatal exception in interrupt [10745.150950] ------------[ cut here ]------------ [10745.155556] WARNING: at /home/abuild/rpmbuild/BUILD/kernel-desktop-3.4.4/linux-3.4/arch/x86/kernel/smp.c:120 update_process_times+0x65/0x80() [10745.168200] Hardware name: X8DTG-QF [10745.171676] Modules linked in: binfmt_misc nfs lockd fscache auth_rpcgss nfs_acl sunrpc af_packet ipmi_devintf ipmi_si ipmi_msghandler w83795 w83627ehf hwmon_vid lm75 cpufreq_conservative jc42 cpufreq_userspace cpufreq_powersave dm_mod snd_hda_codec_hdmi snd_hda_intel snd_hda_codec xfs ixgbe nvidia(PO) ioatdma acpi_cpufreq snd_hwdep snd_pcm snd_timer e1000e snd sg iTCO_wdt serio_raw joydev i7core_edac dca mperf button coretemp pcspkr iTCO_vendor_support i2c_i801 edac_core soundcore crc32c_intel snd_page_alloc mdio microcode autofs4 raid456 async_raid6_recov 
async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid0 raid1 ata_piix processor thermal_sys ata_generic arcmsr [10745.232621] Pid: 15063, comm: nvidia-smi Tainted: P D O 3.4.4-2-desktop #1 [10745.240159] Call Trace: [10745.242602] [<ffffffff810043fa>] dump_trace+0xaa/0x2b0 [10745.247811] [<ffffffff8158b1b1>] dump_stack+0x69/0x6f [10745.252931] [<ffffffff8104010b>] warn_slowpath_common+0x7b/0xc0 [10745.258920] [<ffffffff81050525>] update_process_times+0x65/0x80 [10745.264912] [<ffffffff81093a1b>] tick_sched_timer+0x5b/0xc0 [10745.270556] [<ffffffff8106602e>] __run_hrtimer+0x6e/0x240 [10745.276028] [<ffffffff810667e5>] hrtimer_interrupt+0xe5/0x200 [10745.281848] [<ffffffff81021a93>] smp_apic_timer_interrupt+0x63/0xa0 [10745.288181] [<ffffffff815ac73a>] apic_timer_interrupt+0x6a/0x70 [10745.294165] [<ffffffff8158dd57>] panic+0x18f/0x1d2 [10745.299031] [<ffffffff815a5ccf>] oops_end+0xef/0xf0 [10745.303985] [<ffffffff815a8042>] do_page_fault+0x402/0x530 [10745.309542] [<ffffffff815a5075>] page_fault+0x25/0x30 [10745.314667] ---[ end trace 03a974d4317ca793 ]---
- Booted with pcie_aspm=off iommu=soft
comment:20 Changed 12 years ago by
[20331.519822] IPMI message handler: BMC returned incorrect response, expected netfn 7 cmd 1, got netfn 29 cmd 11 [21927.342079] BUG: unable to handle kernel NULL pointer dereference at (null) [21927.349921] IP: [< (null)>] (null) [21927.354969] PGD c7c488067 PUD c8181b067 PMD 0 [21927.359448] Oops: 0010 [#1] PREEMPT SMP [21927.363407] CPU 0 [21927.365238] Modules linked in: binfmt_misc nfs lockd fscache auth_rpcgss nfs_acl sunrpc af_packet ipmi_devintf ipmi_si ipmi_msghandler w83795 w83627ehf hwmon_vid lm75 cpufreq_conservative jc42 cpufreq_userspace cpufreq_powersave dm_mod xfs snd_hda_codec_hdmi nvidia(PO) acpi_cpufreq mperf joydev coretemp crc32c_intel snd_hda_intel snd_hda_codec microcode snd_hwdep sg serio_raw pcspkr snd_pcm snd_timer iTCO_wdt ixgbe i2c_i801 i7core_edac e1000e iTCO_vendor_support snd ioatdma button soundcore dca edac_core mdio snd_page_alloc autofs4 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid0 raid1 ata_piix processor thermal_sys ata_generic arcmsr [21927.426375] [21927.427861] Pid: 25135, comm: nvidia-smi Tainted: P O 3.4.4-2-desktop #1 Supermicro X8DTG-QF/X8DTG-QF [21927.437958] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [21927.445427] RSP: 0018:ffff880cbfc03e30 EFLAGS: 00010082 [21927.450718] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000c047 [21927.457823] RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffff880c6e81c008 [21927.464929] RBP: ffff880c81c4b5b0 R08: 0000000000000001 R09: ffff880c6e84f35c [21927.472034] R10: 0000000000000000 R11: 0000000000000010 R12: ffff880c6e81c008 [21927.479140] R13: ffff880c807a4008 R14: 0000000000000000 R15: ffff880c7bc3e008 [21927.486248] FS: 00007fd67c12e700(0000) GS:ffff880cbfc00000(0000) knlGS:0000000000000000 [21927.494304] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [21927.500027] CR2: 0000000000000000 CR3: 0000000c80683000 CR4: 00000000000007f0 [21927.507132] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
[21927.514239] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [21927.521344] Process nvidia-smi (pid: 25135, threadinfo ffff880c81800000, task ffff880c829a8840) [21927.530006] Stack: [21927.532010] ffffffffa1fb093b ffff880c807a4008 ffff880c81c485b8 ffff880cbfc03eb4 [21927.539446] 0000000000000010 0000000000000004 ffffffffa1fb96d3 0000000000000010 [21927.546880] ffff881882e3e000 0000000000000000 ffff880c818018f8 0000000000000009 [21927.554315] Call Trace: [21927.556752] Inexact backtrace: [21927.556753] [21927.561283] <IRQ> [21927.563478] [<ffffffffa1fb093b>] ? _nv014871rm+0x8d/0xe2 [nvidia] [21927.569692] [<ffffffffa1fb96d3>] ? rm_isr+0xb8/0x152 [nvidia] [21927.575558] [<ffffffffa1fd7031>] ? nv_kern_isr+0x21/0x70 [nvidia] [21927.581722] [<ffffffff810cc8e5>] ? handle_irq_event_percpu+0x75/0x2a0 [21927.588229] [<ffffffff810ccb57>] ? handle_irq_event+0x47/0x70 [21927.594046] [<ffffffff810cfcc0>] ? handle_fasteoi_irq+0x60/0x100 [21927.600117] [<ffffffff810041e8>] ? handle_irq+0x18/0x30 [21927.605413] [<ffffffff81003e63>] ? do_IRQ+0x53/0xd0 [21927.610359] [<ffffffff815a4dea>] ? common_interrupt+0x6a/0x6a [21927.616166] <EOI> [21927.618322] [<ffffffffa19e58a5>] ? _nv014650rm+0x84/0x8d [nvidia] [21927.624527] [<ffffffffa19e5882>] ? _nv014650rm+0x61/0x8d [nvidia] [21927.630725] [<ffffffffa19e5882>] ? _nv014650rm+0x61/0x8d [nvidia] [21927.636925] [<ffffffffa19e5935>] ? _nv014652rm+0x32/0x9e [nvidia] [21927.643131] [<ffffffffa19e59d9>] ? _nv014646rm+0x38/0x46 [nvidia] [21927.649380] [<ffffffffa1e58e03>] ? _nv004066rm+0x1be/0x1607 [nvidia] [21927.655888] [<ffffffffa1e58c60>] ? _nv004066rm+0x1b/0x1607 [nvidia] [21927.662338] [<ffffffffa1d7ba92>] ? _nv004044rm+0x1259/0xae8b [nvidia] [21927.668960] [<ffffffffa1d7b1b1>] ? _nv004044rm+0x978/0xae8b [nvidia] [21927.675495] [<ffffffffa1d7e505>] ? _nv004044rm+0x3ccc/0xae8b [nvidia] [21927.682117] [<ffffffffa1d7b009>] ? _nv004044rm+0x7d0/0xae8b [nvidia] [21927.688571] [<ffffffffa19b2ab7>] ? 
_nv009864rm+0x175/0x278 [nvidia] [21927.694959] [<ffffffffa1fbe6fb>] ? _nv014853rm+0x21a/0x389 [nvidia] [21927.701349] [<ffffffffa1fbfca1>] ? _nv001095rm+0xac/0x65e [nvidia] [21927.707649] [<ffffffffa1fb8a64>] ? rm_init_adapter+0xac/0x146 [nvidia] [21927.714295] [<ffffffffa1fd8b4c>] ? nv_kern_open+0x45c/0x810 [nvidia] [21927.720718] [<ffffffff8116650d>] ? chrdev_open+0x9d/0x1b0 [21927.726188] [<ffffffff81166470>] ? cdev_put+0x30/0x30 [21927.731305] [<ffffffff8116006a>] ? __dentry_open+0x25a/0x330 [21927.737029] [<ffffffff811711b8>] ? do_last+0x408/0x750 [21927.742233] [<ffffffff8116dd05>] ? path_init+0x315/0x400 [21927.747609] [<ffffffff81171619>] ? path_openat+0xd9/0x400 [21927.753073] [<ffffffff81171a65>] ? do_filp_open+0x45/0xb0 [21927.758535] [<ffffffff8116d511>] ? getname_flags+0x31/0xf0 [21927.764086] [<ffffffff8117e11b>] ? alloc_fd+0xcb/0x120 [21927.769291] [<ffffffff81161608>] ? do_sys_open+0xf8/0x1d0 [21927.774754] [<ffffffff815abc39>] ? system_call_fastpath+0x16/0x1b [21927.780917] Code: Bad RIP value. 
[21927.784262] RIP [< (null)>] (null) [21927.789396] RSP <ffff880cbfc03e30> [21927.792871] CR2: 0000000000000000 [21927.796488] ---[ end trace 5d3827f96f798ce2 ]--- [21927.801092] Kernel panic - not syncing: Fatal exception in interrupt [21928.126593] ------------[ cut here ]------------ [21928.131200] WARNING: at /home/abuild/rpmbuild/BUILD/kernel-desktop-3.4.4/linux-3.4/arch/x86/kernel/smp.c:120 update_process_times+0x65/0x80() [21928.143843] Hardware name: X8DTG-QF [21928.147318] Modules linked in: binfmt_misc nfs lockd fscache auth_rpcgss nfs_acl sunrpc af_packet ipmi_devintf ipmi_si ipmi_msghandler w83795 w83627ehf hwmon_vid lm75 cpufreq_conservative jc42 cpufreq_userspace cpufreq_powersave dm_mod xfs snd_hda_codec_hdmi nvidia(PO) acpi_cpufreq mperf joydev coretemp crc32c_intel snd_hda_intel snd_hda_codec microcode snd_hwdep sg serio_raw pcspkr snd_pcm snd_timer iTCO_wdt ixgbe i2c_i801 i7core_edac e1000e iTCO_vendor_support snd ioatdma button soundcore dca edac_core mdio snd_page_alloc autofs4 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid0 raid1 ata_piix processor thermal_sys ata_generic arcmsr [21928.208263] Pid: 25135, comm: nvidia-smi Tainted: P D O 3.4.4-2-desktop #1 [21928.215801] Call Trace: [21928.218245] [<ffffffff810043fa>] dump_trace+0xaa/0x2b0 [21928.223455] [<ffffffff8158b1b1>] dump_stack+0x69/0x6f [21928.228583] [<ffffffff8104010b>] warn_slowpath_common+0x7b/0xc0 [21928.234572] [<ffffffff81050525>] update_process_times+0x65/0x80 [21928.240563] [<ffffffff81093a1b>] tick_sched_timer+0x5b/0xc0 [21928.246209] [<ffffffff8106602e>] __run_hrtimer+0x6e/0x240 [21928.251679] [<ffffffff810667e5>] hrtimer_interrupt+0xe5/0x200 [21928.257498] [<ffffffff81021a93>] smp_apic_timer_interrupt+0x63/0xa0 [21928.263833] [<ffffffff815ac73a>] apic_timer_interrupt+0x6a/0x70 [21928.269826] [<ffffffff8158dd57>] panic+0x18f/0x1d2 [21928.274691] [<ffffffff815a5ccf>] oops_end+0xef/0xf0 [21928.279646] [<ffffffff815a8042>] 
do_page_fault+0x402/0x530 [21928.285202] [<ffffffff815a5075>] page_fault+0x25/0x30 [21928.290319] ---[ end trace 5d3827f96f798ce3 ]---
comment:21 Changed 12 years ago by
This time I applied several changes at once to speed up the tests:
- Hardware: the configuration now replicates ipepdvcompute1
- Slots used by the extender card, SATA controller, GPUs...
- The 10 GBit Ethernet card and the frame grabber are removed
- A few changes to the BIOS were made
- Disabled IRQ19 Capture (probably doesn't matter)
- Disabled PCIe slot containing SATA controller to preserve ROM space
- Enabled PnP OS (this may alter distribution of interrupt numbers and prevent crashes due to IRQ conflicts)
- Enabled APIC ACPI SCI IRQ (this option somehow affects distribution of interrupt numbers as well)
- It is not possible to disable the APIC controller altogether; otherwise, only a single core is available to Linux. It is also not possible to disable ACPI: there is some conflict with the IPMI BMC and even GRUB does not appear.
- The NVIDIA driver is instructed to use MSI interrupts
- NVreg_EnableMSI=1 is passed to the module
- In this mode NVIDIA uses the following interrupts:
124: 63 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IR-PCI-MSI-edge nvidia
125: 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IR-PCI-MSI-edge nvidia
126: 0 0 0 0 0 0 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IR-PCI-MSI-edge nvidia
127: 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IR-PCI-MSI-edge nvidia
128: 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IR-PCI-MSI-edge nvidia
129: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IR-PCI-MSI-edge nvidia
- In standard mode, the interrupts are:
16: 32057 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IR-IO-APIC-fasteoi uhci_hcd:usb3, arcmsr, nvidia, nvidia, nvidia, nvidia, nvidia, nvidia
- On ipepdvcompute1:
16: 5806099 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb3, arcmsr, arcmsr, nvidia, nvidia, nvidia, nvidia, nvidia
18: 44 0 0 0 0 2315994 0 0 0 0 0 0 0 0 0 0 IO-APIC-fasteoi ehci_hcd:usb1, uhci_hcd:usb8, nvidia, nvidia, nvidia, nvidia
This configuration seems rather stable. I have not done the full test, but there were no hangs during a full 24-hour run. Currently, I am checking whether
- It gets worse with non-MSI interrupts
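The interrupt listings above can be compared mechanically. The following is a minimal Python sketch, not part of the original investigation, that takes text in the format of /proc/interrupts and reports, for each IRQ used by nvidia, whether it is an MSI interrupt and which devices share it. The function name `nvidia_irqs` is hypothetical, and the parser assumes the controller name is a single token, as in the listings in this comment.

```python
def nvidia_irqs(interrupts_text):
    """Map IRQ number -> details for every interrupt row that mentions nvidia."""
    result = {}
    for line in interrupts_text.splitlines():
        fields = line.split()
        # Interrupt rows start with "<number>:"; skip headers and named rows.
        if not fields or not fields[0].endswith(":") or not fields[0][:-1].isdigit():
            continue
        if "nvidia" not in line:
            continue
        # The per-CPU counters are the run of numeric fields after the IRQ number.
        i = 1
        while i < len(fields) and fields[i].isdigit():
            i += 1
        if i >= len(fields):
            continue  # malformed row: no controller name
        controller = fields[i]
        devices = [d.strip() for d in " ".join(fields[i + 1:]).split(",")]
        result[int(fields[0][:-1])] = {
            "total": sum(int(f) for f in fields[1:i]),
            "controller": controller,
            "msi": "MSI" in controller,
            "devices": devices,
        }
    return result
```

On a live system the input could come from `open("/proc/interrupts").read()`. An entry with `msi` false and several `nvidia` devices corresponds to the shared IRQ16 case shown above.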
Later actions
- If there are no crashes, we will check whether there are conflicts with the removed hardware (first with standard interrupts and then with MSI)
- It seems the main difference between ipepdvcompute1 and ufosrv1 is the running X session. Because of VirtualGL, ipepdvcompute1 runs Xorg. As a result, the nvidia driver is always in an active state and there is no initialization penalty for nvidia-smi. If we run any CUDA application on ufosrv1, nvidia-smi runs as fast as on ipepdvcompute1 while the application is executing. It also seems that the hangs are triggered by driver initialization. Otherwise, it is not clear why nvidia-smi triggers crashes significantly more reliably than heavy long-running CUDA applications. As a replacement for the X session, we may try to run the GPUs in persistent mode. In this case, there should be no initialization and hopefully no crashes.
- As a last resort, we will get the GPU box for investigation and install a 3rd GTX580 into ufosrv1. In the long run, the GTX580 may be replaced with a GTX590. Then we will get a configuration more or less comparable to the current state.
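Since the crashes appear to be tied to repeated driver initialization by nvidia-smi, a reproduction loop may help when comparing configurations. The Python sketch below is an illustration under assumptions, not the procedure actually used in this ticket: `repeat_command` and `SELF_TEST_CMD` are hypothetical names, and the loop simply runs a command many times and records its return code and duration.

```python
import subprocess
import sys
import time

# Harmless command for trying the loop on a machine without GPUs.
SELF_TEST_CMD = [sys.executable, "-c", "pass"]


def repeat_command(cmd, runs, pause=0.0):
    """Run cmd `runs` times and return a list of (return_code, seconds) tuples."""
    results = []
    for _ in range(runs):
        start = time.monotonic()
        proc = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        results.append((proc.returncode, time.monotonic() - start))
        if pause:
            time.sleep(pause)  # optional delay between invocations
    return results
```

For example, `repeat_command(["nvidia-smi"], runs=1000)` would exercise the path discussed above. Consistently long durations would suggest that the driver is initialized on every invocation, while short ones would suggest it stays active.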
comment:22 Changed 12 years ago by
[94324.326043] BUG: unable to handle kernel NULL pointer dereference at (null) [94324.333905] IP: [< (null)>] (null) [94324.338971] PGD c7e6b8067 PUD c81d1e067 PMD 0 [94324.343476] Oops: 0010 [#1] PREEMPT SMP [94324.347461] CPU 0 [94324.349301] Modules linked in: nvidia(PO) nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipmi_devintf ipmi_si ipmi_msghandler w83795 w83627ehf hwmon_vid lm75 cpufreq_conservative jc42 cpufreq_userspace cpufreq_powersave dm_mod snd_hda_codec_hdmi xfs snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd e1000e soundcore iTCO_wdt acpi_cpufreq mperf snd_page_alloc joydev coretemp serio_raw iTCO_vendor_support ioatdma pcspkr sg i2c_i801 i7core_edac crc32c_intel microcode edac_core dca button autofs4 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid0 raid1 ata_piix processor thermal_sys ata_generic arcmsr [last unloaded: nvidia] [94324.410142] [94324.411640] Pid: 5100, comm: nvidia-smi Tainted: P O 3.4.4-2-desktop #1 Supermicro X8DTG-QF/X8DTG-QF [94324.421683] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [94324.429170] RSP: 0018:ffff880cbfc03e30 EFLAGS: 00010082 [94324.434468] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000f1e8 [94324.441583] RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffff880c7a78a008 [94324.448698] RBP: ffff880c79a4f8f0 R08: 0000000000000001 R09: ffff880c79b5bd5c [94324.455812] R10: 0000000000000000 R11: 0000000000000010 R12: ffff880c7a78a008 [94324.462927] R13: ffff880c8016a008 R14: 0000000000000000 R15: ffff880c7f756008 [94324.470041] FS: 00007f828267d700(0000) GS:ffff880cbfc00000(0000) knlGS:0000000000000000 [94324.478106] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [94324.483839] CR2: 0000000000000000 CR3: 0000000c7e5b2000 CR4: 00000000000007f0 [94324.490953] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [94324.498068] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [94324.505181] Process nvidia-smi (pid: 
5100, threadinfo ffff880c80b4c000, task ffff880c80852180) [94324.513764] Stack: [94324.515781] ffffffffa153393b ffff880c8016a008 ffff880c79a4c8f8 ffff880cbfc03eb4 [94324.523249] 0000000000000010 0000000000000005 ffffffffa153c6d3 0000000000000010 [94324.530709] ffff881881b21800 0000000000000000 ffff880c80b4d8e8 0000000000000009 [94324.538169] Call Trace: [94324.540617] Inexact backtrace: [94324.540618] [94324.545164] <IRQ> [94324.547373] [<ffffffffa153393b>] ? _nv014871rm+0x8d/0xe2 [nvidia] [94324.553589] [<ffffffffa153c6d3>] ? rm_isr+0xb8/0x152 [nvidia] [94324.559456] [<ffffffffa155a031>] ? nv_kern_isr+0x21/0x70 [nvidia] [94324.565619] [<ffffffff810cc8e5>] ? handle_irq_event_percpu+0x75/0x2a0 [94324.572127] [<ffffffff810ccb57>] ? handle_irq_event+0x47/0x70 [94324.577946] [<ffffffff810cfcc0>] ? handle_fasteoi_irq+0x60/0x100 [94324.584022] [<ffffffff810041e8>] ? handle_irq+0x18/0x30 [94324.589321] [<ffffffff81003e63>] ? do_IRQ+0x53/0xd0 [94324.594275] [<ffffffff815a4dea>] ? common_interrupt+0x6a/0x6a [94324.600090] <EOI> [94324.602263] [<ffffffffa0f68bed>] ? _nv014649rm+0xb/0x21 [nvidia] [94324.608383] [<ffffffffa0f68869>] ? _nv014650rm+0x48/0x8d [nvidia] [94324.614589] [<ffffffffa0f68882>] ? _nv014650rm+0x61/0x8d [nvidia] [94324.620795] [<ffffffffa0f68882>] ? _nv014650rm+0x61/0x8d [nvidia] [94324.626993] [<ffffffffa0f68935>] ? _nv014652rm+0x32/0x9e [nvidia] [94324.633201] [<ffffffffa0f689d9>] ? _nv014646rm+0x38/0x46 [nvidia] [94324.639458] [<ffffffffa13dbe03>] ? _nv004066rm+0x1be/0x1607 [nvidia] [94324.645976] [<ffffffffa13dbc60>] ? _nv004066rm+0x1b/0x1607 [nvidia] [94324.652434] [<ffffffffa12fea92>] ? _nv004044rm+0x1259/0xae8b [nvidia] [94324.659062] [<ffffffffa12fe1b1>] ? _nv004044rm+0x978/0xae8b [nvidia] [94324.665607] [<ffffffffa1301505>] ? _nv004044rm+0x3ccc/0xae8b [nvidia] [94324.672239] [<ffffffffa12fe009>] ? _nv004044rm+0x7d0/0xae8b [nvidia] [94324.678702] [<ffffffffa0f35ab7>] ? _nv009864rm+0x175/0x278 [nvidia] [94324.685090] [<ffffffffa15416fb>] ? 
_nv014853rm+0x21a/0x389 [nvidia] [94324.691478] [<ffffffffa1542ca1>] ? _nv001095rm+0xac/0x65e [nvidia] [94324.697779] [<ffffffffa153ba64>] ? rm_init_adapter+0xac/0x146 [nvidia] [94324.704425] [<ffffffffa155bb4c>] ? nv_kern_open+0x45c/0x810 [nvidia] [94324.710848] [<ffffffff8116650d>] ? chrdev_open+0x9d/0x1b0 [94324.716318] [<ffffffff81166470>] ? cdev_put+0x30/0x30 [94324.721444] [<ffffffff8116006a>] ? __dentry_open+0x25a/0x330 [94324.727178] [<ffffffff811711b8>] ? do_last+0x408/0x750 [94324.732390] [<ffffffff8116dd05>] ? path_init+0x315/0x400 [94324.737775] [<ffffffff81171619>] ? path_openat+0xd9/0x400 [94324.743246] [<ffffffff81171a65>] ? do_filp_open+0x45/0xb0 [94324.748718] [<ffffffff8116d511>] ? getname_flags+0x31/0xf0 [94324.754279] [<ffffffff8117e11b>] ? alloc_fd+0xcb/0x120 [94324.759490] [<ffffffff81161608>] ? do_sys_open+0xf8/0x1d0 [94324.764963] [<ffffffff815abc39>] ? system_call_fastpath+0x16/0x1b [94324.771124] Code: Bad RIP value. [94324.774479] RIP [< (null)>] (null) [94324.779631] RSP <ffff880cbfc03e30> [94324.783115] CR2: 0000000000000000 [94324.786988] ---[ end trace 623a6040af9ab866 ]--- [94324.791596] Kernel panic - not syncing: Fatal exception in interrupt
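The traces posted in this ticket are long, so a small script may help to check whether they share the same call path. The Python sketch below is only an illustration: the names `summarize_oops` and `common_symbols` are hypothetical, and the regular expressions assume the frame format seen in these logs (`[<address>] ? symbol+0xoff/0xsize [module]`).

```python
import re

# One stack frame: address, optional "?", symbol+offset/size, optional [module].
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?"
)
COMM_RE = re.compile(r"comm:\s+(\S+)")


def summarize_oops(text):
    """Return the crashing command name and the (symbol, module) frames."""
    comm = COMM_RE.search(text)
    frames = [(m.group(1), m.group(2) or "kernel") for m in FRAME_RE.finditer(text)]
    return {"comm": comm.group(1) if comm else None, "frames": frames}


def common_symbols(*oops_texts):
    """Return the symbols that appear in every one of the given traces."""
    sets = [{sym for sym, _ in summarize_oops(t)["frames"]} for t in oops_texts]
    return set.intersection(*sets) if sets else set()
```

Feeding the traces from this ticket to `common_symbols` should list the frames they have in common, such as `rm_isr` and `rm_init_adapter`, which appear in each of the logs above.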
comment:23 Changed 12 years ago by
Actually, the GPUBox has interesting diagnostic signals. In both normal and hung states, no errors are reported and a correct link to the server is indicated. However, all GPU LEDs are blinking (long on / short off). According to the documentation, this means that the cards are operating at 2.5 GBit/s per lane instead of 5 GBit/s.
- bandwidthTest reports 5700 MB/s to device and 6355 MB/s from device, which is quite good for a full-speed Gen2 x16 link.
- In the hung state, the first two lights stop blinking and the second pair continues.
- On ipepdvcompute1 no blinking is registered.
- When diagnosed with OSS SysMon, no alarms are indicated.
- However, the -12V voltage is reported as -13.51V. But this value never changes and it is exactly the same on ipepdvcompute1. I guess it is a sensor or SysMon error.
- The only other difference from ipepdvcompute1: Out 6 is green on ufosrv1 while it is not highlighted on ipepdvcompute1. However, there is no information about the meaning of Out 6.
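The bandwidthTest figures can be checked against the theoretical PCIe limits. The sketch below is a back-of-the-envelope calculation in Python, assuming both measured figures are in MB/s and using the standard 8b/10b encoding of PCIe Gen1 (2.5 GT/s per lane) and Gen2 (5 GT/s per lane). The function names are hypothetical.

```python
# Raw signalling rate per lane in GT/s for the two generations discussed here.
GENERATION_GT = {1: 2.5, 2: 5.0}


def pcie_peak_mb_per_s(gt_per_s, lanes, payload_bits=8, line_bits=10):
    """Peak payload rate in MB/s for a link using payload_bits/line_bits encoding."""
    gbit_payload = gt_per_s * lanes * payload_bits / line_bits
    return gbit_payload / 8 * 1000  # Gbit/s -> GB/s -> MB/s


def min_generation(measured_mb_per_s, lanes=16):
    """Lowest generation whose peak rate can carry the measured rate, else None."""
    for gen in sorted(GENERATION_GT):
        if measured_mb_per_s <= pcie_peak_mb_per_s(GENERATION_GT[gen], lanes):
            return gen
    return None
```

Under these assumptions a Gen1 x16 link peaks at 4000 MB/s and a Gen2 x16 link at 8000 MB/s. Both measured rates exceed the Gen1 ceiling, so `min_generation(5700)` and `min_generation(6355)` return 2, which suggests the link really runs at Gen2 speed despite the blinking LEDs.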
comment:24 Changed 12 years ago by
- With the hardware back in the system, there are no changes to the list of devices using IRQ16.
16: 5479 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IR-IO-APIC-fasteoi uhci_hcd:usb3, arcmsr, nvidia, nvidia, nvidia, nvidia, nvidia
- OK. Supermicro is apparently not able to work with more than 9 GPU cores due to the limitation of ROM space. Could there also be a card limit? Maybe the current driver is not able to properly handle more than 5 cards sharing the interrupt. As can be seen on ipepdvcompute1, in the case of double-core cards, one core uses IRQ16 and the other IRQ18.
- So, all hardware is back and a long test is running with MSI-style interrupts to check whether this really solves the problems.
comment:25 Changed 12 years ago by
No problems during 5 days. The following changes to configuration are required to avoid problems:
- pcie_aspm=off kernel parameter disables PCIe power management and fixes the errors on the PCIe bus.
- NVreg_EnableMSI=1 parameter of the nvidia module enforces usage of MSI-style interrupts and prevents crashes in the IRQ handler
- nvidia-smi -pm 1 enables persistent mode
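The first two settings can be verified after a reboot with a small script. This is a minimal Python sketch under assumptions, not a tool used in this ticket: it expects the kernel parameter to be spelled `pcie_aspm` (as in the earlier boot line), expects the module option to be set through a modprobe configuration line of the form `options nvidia NVreg_EnableMSI=1`, and all function names are hypothetical.

```python
def parse_tokens(tokens):
    """Turn ['key=value', 'flag'] into {'key': 'value', 'flag': True}."""
    params = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


def parse_cmdline(cmdline):
    """Parse the contents of /proc/cmdline."""
    return parse_tokens(cmdline.split())


def parse_modprobe_options(conf_text, module):
    """Collect 'options <module> ...' settings from modprobe configuration text."""
    opts = {}
    for line in conf_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop trailing comments
        if len(fields) >= 3 and fields[0] == "options" and fields[1] == module:
            opts.update(parse_tokens(fields[2:]))
    return opts


def missing_settings(cmdline, conf_text):
    """Return a list of settings from this comment that are not in place."""
    problems = []
    if parse_cmdline(cmdline).get("pcie_aspm") != "off":
        problems.append("kernel parameter pcie_aspm=off is not set")
    if parse_modprobe_options(conf_text, "nvidia").get("NVreg_EnableMSI") != "1":
        problems.append("nvidia module option NVreg_EnableMSI=1 is not set")
    return problems
```

The kernel command line would typically be read from /proc/cmdline and the module options from a file under /etc/modprobe.d/. The third setting, `nvidia-smi -pm 1`, is a runtime command and is not covered by this check.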
Call for testing…
comment:26 Changed 12 years ago by
Description: | modified (diff) |
---|---|
ticket_due: | → 31/08/2012 |
comment:27 Changed 12 years ago by
Description: | modified (diff) |
---|
comment:28 Changed 12 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
No crashes or complaints during 2 weeks. I'm closing the ticket. Please re-open if further stability problems arise.
The last annoying problem I had was with the menable frame grabber together with pco cameras; however, it is not clear to me whether it's a server or a frame grabber problem.
Summary: Most values read from both pco.edge and pco.4000 were corrupted (e.g. 40k by 40k sensor size) but consistent, some were still correct (5ms exposure time, 0ms delay time).
Reproduce: Run the diagnose tool from libpco or grab from libuca. Both failed miserably.
Solution: We connected both cameras to two different PCs. They were working flawlessly.