Opened 12 years ago

Closed 12 years ago

#63 closed defect (fixed)

Instability of UFO Server

Reported by: Suren A. Chilingaryan Owned by: Suren A. Chilingaryan
Priority: critical Milestone:
Component: Infrastructure Version:
Keywords: Cc: Matthias Balzer, Suren A. Chilingaryan, Tomas Farago, Patrik Vagovic, Tomy Rolo, thomas.vandekamp@gmx.de, Matthias Vogelgesang, WMEXNER

Description (last modified by Suren A. Chilingaryan)

There are reports that the UFO server behaves unstably and periodically hangs.

I can't really reproduce the problem. I ran a set of tests to verify the CPUs, GPUs, memory, RAID storage, and network, and found no problems. PyHST and the UFO framework ran continuously for a week without any problems. So either:
1) The problem was due to a malfunctioning PSU and everything is fine now.
2) The problem occurs under some external condition (overheating?).
3) There is a misbehaving application that causes the crash or brings the system into an unstable state, or some of the applications in use misbehave under certain circumstances (for instance, with certain options set or with a specific data set).

In any case, to investigate this problem I need detailed bug reports. If hangs/crashes occur again, I need as much information as possible: which application caused the crash, along with its options and data set (i.e. how I can execute it); which applications were running in parallel; what was done before the crash, etc. Were there any error reports before the crash? Which environment were you using: SSH or an NX session?

Attachments (0)

Change History (28)

comment:1 Changed 12 years ago by Matthias Vogelgesang

The last annoying problem I had was with the menable frame grabber together with pco cameras; however, it is not clear to me whether it's a server or a frame grabber problem.

Summary: Most values read from both pco.edge and pco.4000 were corrupted (e.g. 40k by 40k sensor size) but consistent, some were still correct (5ms exposure time, 0ms delay time).

Reproduce: Run the diagnose tool from libpco or grab from libuca. Both failed miserably.

Solution: We connected both cameras to two different PCs. They were working flawlessly.

comment:2 Changed 12 years ago by Suren A. Chilingaryan

Thanks, Matthias. Did the system crash/hang immediately or afterwards? Or did you just get the wrong values?

Please let me know when you have time, and we can recheck this.

comment:3 Changed 12 years ago by Matthias Vogelgesang

No, it was not crashing nor hanging. Just the CameraLink communication was affected.

comment:4 Changed 12 years ago by Suren A. Chilingaryan

According to IPMI, the power supply delivers approx. -11.6 V instead of -12 V. I don't know how harmful this is, but on ipepdvcompute1 all voltages are quite precise. +12 V, +5 V, etc. are also precise on ufosrv1.

Does anybody know what the acceptable variance is?
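
As a rough check of the reading above: assuming the usual ATX12V tolerances apply to this supply (±10% on the -12 V rail, ±5% on +12 V, +5 V and +3.3 V), the deviation can be computed as in the sketch below. The tolerance table comes from the ATX design guide, not from the Supermicro documentation, and the function name is made up for illustration.

```python
# Rail tolerances assumed from the ATX12V design guide (not from the board manual).
ATX_TOLERANCE = {-12.0: 0.10, 12.0: 0.05, 5.0: 0.05, 3.3: 0.05}

def rail_deviation(nominal, measured):
    """Return (relative deviation, within_tolerance) for one supply rail."""
    deviation = abs(measured - nominal) / abs(nominal)
    return deviation, deviation <= ATX_TOLERANCE[nominal]

# IPMI reading from this ticket: about -11.6 V on the -12 V rail.
deviation, ok = rail_deviation(-12.0, -11.6)
print(f"-12V rail: {deviation:.1%} off nominal, within tolerance: {ok}")
```

Under that assumption the rail is about 3.3% off nominal, which is inside the ±10% band, so the reading alone would not explain the crashes.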

comment:5 Changed 12 years ago by Suren A. Chilingaryan

ufosrv1 kernel: [77620.431743] BUG: unable to handle kernel NULL pointer dereference at           (null)

comment:6 Changed 12 years ago by Suren A. Chilingaryan

It looks like nvidia-smi can crash the UFO server if run continuously for 1-2 days. I have upgraded the NVIDIA drivers to test whether the crash is version dependent.
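
A reproduction loop for this kind of test could look like the sketch below. It is only an assumption about how such a run might be scripted, not the procedure actually used here; `run_repeatedly` and the log file are hypothetical. Each iteration is synced to disk before the command starts, so after a panic the last log line shows roughly how long the machine survived.

```python
import os
import subprocess
import time

def run_repeatedly(cmd, iterations, pause=0.0, log_path=None):
    """Run cmd `iterations` times and return [(iteration, exit_code)] for failures."""
    failures = []
    for i in range(iterations):
        if log_path is not None:
            # Record progress and force it to disk before touching the driver.
            with open(log_path, "a") as log:
                log.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} iteration {i}\n")
                log.flush()
                os.fsync(log.fileno())
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            failures.append((i, result.returncode))
        time.sleep(pause)
    return failures

# Example (not executed here): query the GPUs once per second for about two days.
# run_repeatedly(["nvidia-smi", "-q"], iterations=172800, pause=1.0,
#                log_path="nvidia-smi-loop.log")
```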

comment:7 Changed 12 years ago by Suren A. Chilingaryan

ufosrv1 was running 295.41 until Friday. I tried this version on ipepdvcompute1 this weekend and it was also crashing. Both systems are now updated to the CUDA5 beta driver.

comment:8 Changed 12 years ago by Matthias Vogelgesang

It's broken again.

comment:9 Changed 12 years ago by Suren A. Chilingaryan

Fatal exception in interrupt, Bad RIP value, Backtrace:

do_softirq
irq_exit
smp_apic_timer_interrupt
apic_timer_interrupt
intel_idle
__atomic_notifier_call_chain
cpu_idle

Booted with "noapic" flag
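
Since several boot parameters are tried in the course of this ticket (noapic, irqpoll, ghes.disable=1), it may help to record which ones were active for each run. A minimal sketch, assuming the text comes from /proc/cmdline; the helper name and the sample command line are made up, and quoted values containing spaces are not handled.

```python
def boot_flags(cmdline):
    """Parse a kernel command line into {name: value}; bare flags map to True."""
    flags = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        flags[name] = value if sep else True
    return flags

# Hypothetical command line; on a live system use open("/proc/cmdline").read().
sample = "root=/dev/sda1 noapic ghes.disable=1"
flags = boot_flags(sample)
print("noapic" in flags, flags.get("ghes.disable"), "irqpoll" in flags)
```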

comment:10 Changed 12 years ago by Suren A. Chilingaryan

  • Kernel upgrade to 3.4
  • Reverted to 295.xx driver family due to kernel incompatibility (295.59)

comment:11 Changed 12 years ago by Suren A. Chilingaryan

The same crash occurs with the new kernel (and without the frame grabber module loaded). Here is the current list of modules:

binfmt_misc            17540  1 
nfs                   411631  1 
lockd                  85545  1 nfs
fscache                61840  1 nfs
auth_rpcgss            45721  1 nfs
nfs_acl                12883  1 nfs
sunrpc                261456  16 nfs,lockd,auth_rpcgss,nfs_acl
af_packet              39810  0 
ipmi_devintf           17707  0 
ipmi_si                53468  0 
ipmi_msghandler        50349  2 ipmi_devintf,ipmi_si
w83795                 52252  0 
w83627ehf              43321  0 
hwmon_vid              12827  1 w83627ehf
lm75                   13701  0 
jc42                   13947  0 
cpufreq_conservative    13821  0 
cpufreq_userspace      13162  0 
cpufreq_powersave      12618  0 
dm_mod                101260  0 
xfs                   926900  2 
nvidia              12358288  0 
joydev                 17606  0 
acpi_cpufreq           18857  1 
mperf                  12667  1 acpi_cpufreq
coretemp               13692  0 
crc32c_intel           12858  0 
snd_hda_codec_hdmi     40651  24 
ixgbe                 220311  0 
microcode              35998  0 
pcspkr                 12718  0 
serio_raw              13371  0 
sg                     36594  0 
snd_hda_codec_realtek    87227  1 
i2c_i801               18013  0 
iTCO_wdt               18039  0 
iTCO_vendor_support    13718  1 iTCO_wdt
e1000e                218340  0 
snd_hda_intel          33874  0 
snd_hda_codec         141096  3 snd_hda_codec_hdmi,snd_hda_codec_realtek,snd_hda_intel
snd_hwdep              13613  1 snd_hda_codec
snd_pcm               110316  3 snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec
button                 13949  0 
snd_timer              34085  1 snd_pcm
i7core_edac            28102  0 
snd                    91780  7 snd_hda_codec_hdmi,snd_hda_codec_realtek,snd_hda_intel,snd_hda_codec,snd_hwdep,snd_pcm,snd_timer
ioatdma                58876  64 
edac_core              57842  4 i7core_edac
soundcore              15091  1 snd
dca                    15232  2 ixgbe,ioatdma
mdio                   13770  1 ixgbe
snd_page_alloc         14476  2 snd_hda_intel,snd_pcm
autofs4                43331  2 
raid456                74241  0 
async_raid6_recov      17348  1 raid456
async_pq               13429  2 raid456,async_raid6_recov
raid6_pq               88307  2 async_raid6_recov,async_pq
async_xor              13082  3 raid456,async_raid6_recov,async_pq
xor                    12894  1 async_xor
async_memcpy           12650  2 raid456,async_raid6_recov
async_tx               13470  5 raid456,async_raid6_recov,async_pq,async_xor,async_memcpy
raid10                 39640  0 
raid0                  17969  0 
raid1                  40002  3 
ata_piix               35206  6 
processor              45839  1 acpi_cpufreq
thermal_sys            25053  1 processor
ata_generic            12937  0 
arcmsr                 41605  4 

The following extra modules are loaded compared to ipepdvcompute1:

async_memcpy
async_pq
async_raid6_recov
async_tx
async_xor
ata_generic
ata_piix
auth_rpcgss
dm_mod
fscache
ixgbe
jc42
lockd
mdio
nfs
nfs_acl
raid0
raid1
raid10
raid456
raid6_pq
sunrpc
xfs
xor
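
A comparison like the one above can be scripted by diffing the `lsmod` output of the two machines. The sketch below is one way to do it, not necessarily how the list was produced; the function names are illustrative and the reference listing is a made-up stand-in, since the module list of ipepdvcompute1 is not shown in this ticket.

```python
def loaded_modules(lsmod_text):
    """Return the set of module names from `lsmod`-style output."""
    names = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module  Size  Used by" header.
        if not fields or fields[0] == "Module":
            continue
        names.add(fields[0])
    return names

def extra_modules(host_text, reference_text):
    """Modules loaded on the host but not on the reference machine, sorted."""
    return sorted(loaded_modules(host_text) - loaded_modules(reference_text))

# Three lines taken from the ufosrv1 listing above.
ufosrv1 = """\
nfs                   411631  1
xfs                   926900  2
nvidia              12358288  0
"""
# Hypothetical reference listing, for illustration only.
reference = """\
Module                  Size  Used by
nvidia              12358288  0
"""
print(extra_modules(ufosrv1, reference))
```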

comment:12 Changed 12 years ago by Suren A. Chilingaryan

[67686.060886] BUG: unable to handle kernel NULL pointer dereference at           (null)
[67686.068722] IP: [<          (null)>]           (null)
[67686.073770] PGD 0 
[67686.075792] Oops: 0010 [#1] PREEMPT SMP 
[67686.079753] CPU 0 
[67686.081584] Modules linked in: binfmt_misc nvidia(PO) nfs lockd fscache auth_rpcgss nfs_acl sunrpc af_packet ipmi_devintf ipmi_si ipmi_msghandler w83795 w83627ehf hwmon_vid lm75 jc42 cpufreq_conservative cpufreq_userspace cpufreq_powersave dm_mod xfs joydev snd_hda_codec_hdmi sg snd_hda_codec_realtek snd_hda_intel snd_hda_codec acpi_cpufreq snd_hwdep mperf snd_pcm iTCO_wdt coretemp snd_timer crc32c_intel microcode snd serio_raw pcspkr i7core_edac i2c_i801 iTCO_vendor_support e1000e ioatdma edac_core ixgbe soundcore button snd_page_alloc dca mdio autofs4 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid0 raid1 ata_piix processor thermal_sys ata_generic arcmsr [last unloaded: nvidia]
[67686.146730] 
[67686.148219] Pid: 0, comm: swapper/0 Tainted: P           O 3.4.4-2-desktop #1 Supermicro X8DTG-QF/X8DTG-QF
[67686.157891] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
[67686.165359] RSP: 0018:ffff880cbfc03e20  EFLAGS: 00010082
[67686.170651] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000000027fa
[67686.177757] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff880c77e02008
[67686.184862] RBP: ffff880c696373b0 R08: 0000000000000001 R09: ffff880c76e1b494
[67686.191967] R10: 0000000000000000 R11: 0000000000000001 R12: ffff880c77e02008
[67686.199072] R13: ffff880c777d2008 R14: 0000000000000000 R15: ffff880c81390008
[67686.206179] FS:  0000000000000000(0000) GS:ffff880cbfc00000(0000) knlGS:0000000000000000
[67686.214236] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[67686.219960] CR2: 0000000000000000 CR3: 0000000001c0b000 CR4: 00000000000007f0
[67686.227074] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[67686.234180] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[67686.241294] Process swapper/0 (pid: 0, threadinfo ffffffff81c00000, task ffffffff81c13020)
[67686.249523] Stack:
[67686.251529]  ffffffffa1800a72 ffff880c777d2008 ffff880c696343b8 ffff880cbfc03eb4
[67686.258964]  0000000000000010 0000000000000000 ffffffffa180b8bd 0000000100000086
[67686.266396]  0000000000000046 0000000000000010 ffff881877a18800 0000000000000000
[67686.273832] Call Trace:
[67686.276270] Inexact backtrace:
[67686.276271] 
[67686.280799]  <IRQ> 
[67686.283009]  [<ffffffffa1800a72>] ? _nv014592rm+0x8d/0xe4 [nvidia]
[67686.289241]  [<ffffffffa180b8bd>] ? rm_isr+0x12d/0x236 [nvidia]
[67686.295211]  [<ffffffffa182bdf1>] ? nv_kern_isr+0x21/0x70 [nvidia]
[67686.301370]  [<ffffffff810cc8e5>] ? handle_irq_event_percpu+0x75/0x2a0
[67686.307876]  [<ffffffff810ccb57>] ? handle_irq_event+0x47/0x70
[67686.313694]  [<ffffffff810cfcc0>] ? handle_fasteoi_irq+0x60/0x100
[67686.319773]  [<ffffffff810041e8>] ? handle_irq+0x18/0x30
[67686.325071]  [<ffffffff81003e63>] ? do_IRQ+0x53/0xd0
[67686.330025]  [<ffffffff815a4dea>] ? common_interrupt+0x6a/0x6a
[67686.335841]  <EOI> 
[67686.337952]  [<ffffffff8100af78>] ? poll_idle+0x48/0x2b0
[67686.343251]  [<ffffffff8100bd76>] ? cpu_idle+0x96/0xf0
[67686.348379]  [<ffffffff81cbfbbd>] ? start_kernel+0x39e/0x3a9
[67686.354022]  [<ffffffff81cbf6c2>] ? repair_env_string+0x57/0x57
[67686.359926]  [<ffffffff81cbf140>] ? early_idt_handlers+0x140/0x140
[67686.366089]  [<ffffffff81cbf433>] ? x86_64_start_kernel+0xd1/0xe0
[67686.372165] Code:  Bad RIP value.
[67686.375500] RIP  [<          (null)>]           (null)
[67686.380635]  RSP <ffff880cbfc03e20>
[67686.384111] CR2: 0000000000000000
[67686.387740] ---[ end trace 4ac351a70eb9c3ec ]---
[67686.392343] Kernel panic - not syncing: Fatal exception in interrupt

comment:13 Changed 12 years ago by Suren A. Chilingaryan

  • Updated BIOS 2.0a to 2.0c
  • With the new BIOS, the system appears to be affected by kernel bug #43282. Passing "ghes.disable=1" to the kernel is proposed as a temporary solution. However, for me the bug disappeared after selecting Fail-safe settings and ACPI3 compatibility in the BIOS.
  • All NVIDIA GPUs share IRQ16; on ipepdvcompute1 some of the GPUs use IRQ16 and others IRQ18. Some time ago, the following bug was fixed (NVIDIA Changelog):
    Fixed an interrupt handling deficiency that could lead to performance 
    and stability problems when many NVIDIA GPUs shared few IRQs.
    
  • Similar problems on the net 1 2
  • Booted with irqpoll kernel parameter
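
The IRQ sharing mentioned above can be checked from /proc/interrupts. The sketch below parses that file's text and reports the IRQ lines on which a driver is registered more than once. The function names are illustrative and the sample text, including its counters, is invented to mirror the situation described (all GPUs on IRQ 16); it is not a capture from ufosrv1.

```python
from itertools import dropwhile

def irq_users(interrupts_text):
    """Map IRQ number -> list of handler names from /proc/interrupts text."""
    users = {}
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or not label.strip().isdigit():
            continue  # CPU header, NMI/LOC summary rows, etc.
        # Drop the per-CPU counters, keep "<chip type> name1, name2, ...".
        tokens = list(dropwhile(str.isdigit, rest.split()))
        chunks = " ".join(tokens).split(",")
        users[int(label)] = [c.split()[-1] for c in chunks if c.split()]
    return users

def shared_irqs(interrupts_text, driver="nvidia"):
    """IRQs on which `driver` is registered more than once, with the count."""
    return {irq: names.count(driver)
            for irq, names in irq_users(interrupts_text).items()
            if names.count(driver) > 1}

# Invented sample in the /proc/interrupts layout.
SAMPLE = """\
           CPU0       CPU1
 16:      90210       1044   IO-APIC-fasteoi   nvidia, nvidia, nvidia, nvidia
 18:        512          0   IO-APIC-fasteoi   ata_piix
NMI:          3          2   Non-maskable interrupts
"""
print(shared_irqs(SAMPLE))
```

On a live system the same call would be `shared_irqs(open("/proc/interrupts").read())`.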

comment:14 Changed 12 years ago by Suren A. Chilingaryan

  • Enabled native support of PCI Express in BIOS
  • Updated nvidia driver to 304.22

comment:15 Changed 12 years ago by Suren A. Chilingaryan

  • Documentation/PCI/pcieaer-howto.txt
  • We got the following errors reported sporadically (no crash)
    [43908.631425] pcieport 0000:86:08.0:    [ 0] Receiver Error         (First)
    [46313.504492] pcieport 0000:80:03.0: AER: Corrected error received: id=0000
    [46313.511344] pcieport 0000:86:08.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=8640(Receiver ID)
    [46313.521519] pcieport 0000:86:08.0:   device [10b5:8648] error status/mask=00000001/00002000
    
  • 10b5:8648 is PLX Technology, Inc. PEX 8648 48-lane, 12-Port PCI Express Gen 2 (5.0 GT/s) Switch, i.e. external GPU box
  • pcieport 0000:86:08.0 refers to one of the bridges mentioned above
  • pcieport 0000:80:03.0 refers to Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 [8086:340a]
  • These bug reports seem related: 1, 2
  • The 2nd in particular shows very similar behaviour. The problem was solved by moving the card to a different PCIe slot.
  • I.e. the problem may lie in PCIe communication. Possibilities are:
    • Problem in NVIDIA drivers preventing devices from correct operation in certain slots.
    • Damaged PCIe slot on motherboard
    • Damaged PCIe cable
    • Damaged slot in external box
    • Damaged PCIe interface of GPU card
  • According to the PCIe tree, all the problems are registered at one specific PCIe slot / GPU device. So it seems that either one of the GPU box slots or one of the GPUs may be causing the problems.
     |           +-03.0-[83-8c]----00.0-[84-8c]--+-04.0-[85-88]----00.0-[86-88]--+
     |           |                               |                               |
     |           |                               |                               \-08.0-[88]--+-00.0  nVidia Corporation GF110 [GeForce GTX 580] [10de:1080]
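
The numbers in the AER messages above can be decoded by hand. The sketch below splits the 16-bit `id` into bus, device and function, and names the bits of the correctable error status/mask words. The bit names follow the PCIe AER Correctable Error Status register layout as I understand it, only the commonly seen bits are listed, and the helper names are made up.

```python
# Correctable Error Status/Mask bits of the PCIe AER capability (subset).
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}

def decode_requester_id(req_id):
    """Split a 16-bit PCIe ID into a 'bus:dev.fn' string."""
    bus = (req_id >> 8) & 0xFF
    dev = (req_id >> 3) & 0x1F
    fn = req_id & 0x7
    return f"{bus:02x}:{dev:02x}.{fn}"

def decode_bits(value):
    """Names of the known correctable-error bits set in a status or mask word."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if value & (1 << bit)]

print(decode_requester_id(0x8640))  # id=8640 from the log
print(decode_bits(0x00000001))      # error status
print(decode_bits(0x00002000))      # error mask
```

With these assumptions, id=8640 decodes to 86:08.0, the status word has only the Receiver Error bit set (consistent with the "Physical Layer" type in the log), and the mask only suppresses Advisory Non-Fatal errors.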
    


PCIe Tree

-+-[0000:ff]-+-00.0  Intel Corporation Xeon 5600 Series QuickPath Architecture Generic Non-core Registers [8086:2c70]
 |           +-00.1  Intel Corporation Xeon 5600 Series QuickPath Architecture System Address Decoder [8086:2d81]
 |           +-02.0  Intel Corporation Xeon 5600 Series QPI Link 0 [8086:2d90]
 |           +-02.1  Intel Corporation Xeon 5600 Series QPI Physical 0 [8086:2d91]
 |           +-02.2  Intel Corporation Xeon 5600 Series Mirror Port Link 0 [8086:2d92]
 |           +-02.3  Intel Corporation Xeon 5600 Series Mirror Port Link 1 [8086:2d93]
 |           +-02.4  Intel Corporation Xeon 5600 Series QPI Link 1 [8086:2d94]
 |           +-02.5  Intel Corporation Xeon 5600 Series QPI Physical 1 [8086:2d95]
 |           +-03.0  Intel Corporation Xeon 5600 Series Integrated Memory Controller Registers [8086:2d98]
 |           +-03.1  Intel Corporation Xeon 5600 Series Integrated Memory Controller Target Address Decoder [8086:2d99]
 |           +-03.2  Intel Corporation Xeon 5600 Series Integrated Memory Controller RAS Registers [8086:2d9a]
 |           +-03.4  Intel Corporation Xeon 5600 Series Integrated Memory Controller Test Registers [8086:2d9c]
 |           +-04.0  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Control [8086:2da0]
 |           +-04.1  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Address [8086:2da1]
 |           +-04.2  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Rank [8086:2da2]
 |           +-04.3  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Thermal Control [8086:2da3]
 |           +-05.0  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Control [8086:2da8]
 |           +-05.1  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Address [8086:2da9]
 |           +-05.2  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Rank [8086:2daa]
 |           +-05.3  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Thermal Control [8086:2dab]
 |           +-06.0  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Control [8086:2db0]
 |           +-06.1  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Address [8086:2db1]
 |           +-06.2  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Rank [8086:2db2]
 |           \-06.3  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Thermal Control [8086:2db3]
 +-[0000:fe]-+-00.0  Intel Corporation Xeon 5600 Series QuickPath Architecture Generic Non-core Registers [8086:2c70]
 |           +-00.1  Intel Corporation Xeon 5600 Series QuickPath Architecture System Address Decoder [8086:2d81]
 |           +-02.0  Intel Corporation Xeon 5600 Series QPI Link 0 [8086:2d90]
 |           +-02.1  Intel Corporation Xeon 5600 Series QPI Physical 0 [8086:2d91]
 |           +-02.2  Intel Corporation Xeon 5600 Series Mirror Port Link 0 [8086:2d92]
 |           +-02.3  Intel Corporation Xeon 5600 Series Mirror Port Link 1 [8086:2d93]
 |           +-02.4  Intel Corporation Xeon 5600 Series QPI Link 1 [8086:2d94]
 |           +-02.5  Intel Corporation Xeon 5600 Series QPI Physical 1 [8086:2d95]
 |           +-03.0  Intel Corporation Xeon 5600 Series Integrated Memory Controller Registers [8086:2d98]
 |           +-03.1  Intel Corporation Xeon 5600 Series Integrated Memory Controller Target Address Decoder [8086:2d99]
 |           +-03.2  Intel Corporation Xeon 5600 Series Integrated Memory Controller RAS Registers [8086:2d9a]
 |           +-03.4  Intel Corporation Xeon 5600 Series Integrated Memory Controller Test Registers [8086:2d9c]
 |           +-04.0  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Control [8086:2da0]
 |           +-04.1  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Address [8086:2da1]
 |           +-04.2  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Rank [8086:2da2]
 |           +-04.3  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Thermal Control [8086:2da3]
 |           +-05.0  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Control [8086:2da8]
 |           +-05.1  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Address [8086:2da9]
 |           +-05.2  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Rank [8086:2daa]
 |           +-05.3  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Thermal Control [8086:2dab]
 |           +-06.0  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Control [8086:2db0]
 |           +-06.1  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Address [8086:2db1]
 |           +-06.2  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Rank [8086:2db2]
 |           \-06.3  Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Thermal Control [8086:2db3]
 +-[0000:80]-+-00.0-[81]--
 |           +-01.0-[82]--
 |           +-03.0-[83-8c]----00.0-[84-8c]--+-04.0-[85-88]----00.0-[86-88]--+-04.0-[87]--+-00.0  nVidia Corporation GF110 [GeForce GTX 580] [10de:1080]
 |           |                               |                               |            \-00.1  nVidia Corporation GF110 High Definition Audio Controller [10de:0e09]
 |           |                               |                               \-08.0-[88]--+-00.0  nVidia Corporation GF110 [GeForce GTX 580] [10de:1080]
 |           |                               |                                            \-00.1  nVidia Corporation GF110 High Definition Audio Controller [10de:0e09]
 |           |                               \-08.0-[89-8c]----00.0-[8a-8c]--+-04.0-[8b]--+-00.0  nVidia Corporation GF110 [GeForce GTX 580] [10de:1080]
 |           |                                                               |            \-00.1  nVidia Corporation GF110 High Definition Audio Controller [10de:0e09]
 |           |                                                               \-08.0-[8c]--+-00.0  nVidia Corporation GF110 [GeForce GTX 580] [10de:1080]
 |           |                                                                            \-00.1  nVidia Corporation GF110 High Definition Audio Controller [10de:0e09]
 |           +-07.0-[8d]----00.0  Areca Technology Corp. ARC-1880 8/12 port PCIe/PCI-X to SAS/SATA II RAID Controller [17d3:1880]
 |           +-13.0  Intel Corporation 5520/5500/X58 I/O Hub I/OxAPIC Interrupt Controller [8086:342d]
 |           +-14.0  Intel Corporation 5520/5500/X58 I/O Hub System Management Registers [8086:342e]
 |           +-14.1  Intel Corporation 5520/5500/X58 I/O Hub GPIO and Scratch Pad Registers [8086:3422]
 |           +-14.2  Intel Corporation 5520/5500/X58 I/O Hub Control Status and RAS Registers [8086:3423]
 |           +-14.3  Intel Corporation 5520/5500/X58 I/O Hub Throttle Registers [8086:3438]
 |           +-16.0  Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3430]
 |           +-16.1  Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3431]
 |           +-16.2  Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3432]
 |           +-16.3  Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3433]
 |           +-16.4  Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3429]
 |           +-16.5  Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:342a]
 |           +-16.6  Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:342b]
 |           \-16.7  Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:342c]
 \-[0000:00]-+-00.0  Intel Corporation 5520 I/O Hub to ESI Port [8086:3406]
             +-01.0-[01]----00.0  Intel Corporation 82598EB 10-Gigabit AF Network Connection [8086:10c7]
             +-03.0-[02]--+-00.0  nVidia Corporation GF110 [GeForce GTX 580] [10de:1080]
             |            \-00.1  nVidia Corporation GF110 High Definition Audio Controller [10de:0e09]
             +-07.0-[03]--+-00.0  nVidia Corporation GF110 [GeForce GTX 580] [10de:1080]
             |            \-00.1  nVidia Corporation GF110 High Definition Audio Controller [10de:0e09]
             +-13.0  Intel Corporation 5520/5500/X58 I/O Hub I/OxAPIC Interrupt Controller [8086:342d]
             +-14.0  Intel Corporation 5520/5500/X58 I/O Hub System Management Registers [8086:342e]
             +-16.0  Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3430]
             +-16.1  Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3431]
             +-16.2  Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3432]
             +-16.3  Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3433]
             +-16.4  Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:3429]
             +-16.5  Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:342a]
             +-16.6  Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:342b]
             +-16.7  Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device [8086:342c]
             +-1a.0  Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4 [8086:3a37]
             +-1a.1  Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #5 [8086:3a38]
             +-1a.2  Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6 [8086:3a39]
             +-1a.7  Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2 [8086:3a3c]
             +-1c.0-[04]----00.0  Silicon Software GmbH microEnable IV-FULL x4 [1ae8:0a44]
             +-1c.4-[05]----00.0  Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
             +-1c.5-[06]----00.0  Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
             +-1d.0  Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1 [8086:3a34]
             +-1d.1  Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2 [8086:3a35]
             +-1d.2  Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3 [8086:3a36]
             +-1d.7  Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1 [8086:3a3a]
             +-1e.0-[07]----01.0  Matrox Graphics, Inc. MGA G200eW WPCM450 [102b:0532]
             +-1f.0  Intel Corporation 82801JIR (ICH10R) LPC Interface Controller [8086:3a16]
             +-1f.2  Intel Corporation 82801JI (ICH10 Family) 4 port SATA IDE Controller #1 [8086:3a20]
             +-1f.3  Intel Corporation 82801JI (ICH10 Family) SMBus Controller [8086:3a30]
             \-1f.5  Intel Corporation 82801JI (ICH10 Family) 2 port SATA IDE Controller #2 [8086:3a26]

comment:16 Changed 12 years ago by Suren A. Chilingaryan

  • 5 days without crashes with the GPU box disconnected, no PCIe problems reported
  • The PCIe cable is bent at extreme angles. I think this could cause the problems on the bus.
  • Booted with the cable bending fixed.

comment:17 Changed 12 years ago by Suren A. Chilingaryan

Problems on the PCIe bus are no longer registered. However, the server crashed again with RIP=NULL in interrupt. So these appear to be unrelated problems:

[81650.651295] BUG: unable to handle kernel NULL pointer dereference at           (null)
[81650.659133] IP: [<          (null)>]           (null)
[81650.664182] PGD 187855b067 PUD 187fbdc067 PMD 0 
[81650.668832] Oops: 0010 [#1] PREEMPT SMP 
[81650.672793] CPU 0 
[81650.674625] Modules linked in: binfmt_misc nfs lockd fscache auth_rpcgss nfs_acl sunrpc af_packet ipmi_devintf ipmi_si ipmi_msghandler w83795 w83627ehf hwmon_vid lm75 jc42 cpufreq_conservative cpufreq_userspace cpufreq_powersave dm_mod xfs joydev nvidia(PO) snd_hda_codec_hdmi sg snd_hda_intel snd_hda_codec acpi_cpufreq mperf snd_hwdep snd_pcm coretemp snd_timer crc32c_intel pcspkr serio_raw snd iTCO_wdt microcode i2c_i801 iTCO_vendor_support e1000e i7core_edac soundcore ixgbe ioatdma button snd_page_alloc edac_core dca mdio autofs4 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid0 raid1 ata_piix processor thermal_sys ata_generic arcmsr
[81650.735761] 
[81650.737246] Pid: 28536, comm: nvidia-smi Tainted: P           O 3.4.4-2-desktop #1 Supermicro X8DTG-QF/X8DTG-QF
[81650.747343] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
[81650.754812] RSP: 0018:ffff880cbfc03e30  EFLAGS: 00010082
[81650.760103] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000008779
[81650.767210] RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffff880c709d0008
[81650.774314] RBP: ffff880c70e0f730 R08: 0000000000000001 R09: ffff881880e2769c
[81650.781420] R10: 0000000000000000 R11: 0000000000000001 R12: ffff880c709d0008
[81650.788526] R13: ffff880c71806008 R14: 0000000000000000 R15: ffff880c70a02008
[81650.795632] FS:  00007f94c19c9700(0000) GS:ffff880cbfc00000(0000) knlGS:0000000000000000
[81650.803689] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[81650.809413] CR2: 0000000000000000 CR3: 000000187fc35000 CR4: 00000000000007f0
[81650.816517] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[81650.823623] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[81650.830730] Process nvidia-smi (pid: 28536, threadinfo ffff880c72404000, task ffff880c73a02500)
[81650.839392] Stack:
[81650.841396]  ffffffffa13e9d2b ffff880c71806008 ffff880c70e0c738 ffff880cbfc03eb4
[81650.848831]  0000000000000010 0000000000000005 ffffffffa13f2a93 0000000000000010
[81650.856266]  ffff88187b0a4800 0000000000000000 ffff880c724058e8 0000000000000009
[81650.863701] Call Trace:
[81650.866138] Inexact backtrace:
[81650.866138] 
[81650.870668]  <IRQ> 
[81650.872881]  [<ffffffffa13e9d2b>] ? _nv014839rm+0x8d/0xe2 [nvidia]
[81650.879094]  [<ffffffffa13f2a93>] ? rm_isr+0xb8/0x152 [nvidia]
[81650.884962]  [<ffffffffa14103f1>] ? nv_kern_isr+0x21/0x70 [nvidia]
[81650.891126]  [<ffffffff810cc8e5>] ? handle_irq_event_percpu+0x75/0x2a0
[81650.897632]  [<ffffffff810ccb57>] ? handle_irq_event+0x47/0x70
[81650.903440]  [<ffffffff810cfcc0>] ? handle_fasteoi_irq+0x60/0x100
[81650.909510]  [<ffffffff810041e8>] ? handle_irq+0x18/0x30
[81650.914799]  [<ffffffff81003e63>] ? do_IRQ+0x53/0xd0
[81650.919745]  [<ffffffff815a4dea>] ? common_interrupt+0x6a/0x6a
[81650.925560]  <EOI> 
[81650.927715]  [<ffffffffa0e21b21>] ? _nv014625rm+0x1c1/0x1c2 [nvidia]
[81650.934087]  [<ffffffffa0e2178c>] ? _nv014620rm+0x2b/0x8d [nvidia]
[81650.940293]  [<ffffffffa0e217c2>] ? _nv014620rm+0x61/0x8d [nvidia]
[81650.946500]  [<ffffffffa0e217c2>] ? _nv014620rm+0x61/0x8d [nvidia]
[81650.952698]  [<ffffffffa0e21875>] ? _nv014622rm+0x32/0x9e [nvidia]
[81650.958896]  [<ffffffffa0e21919>] ? _nv014616rm+0x38/0x46 [nvidia]
[81650.965147]  [<ffffffffa12927e3>] ? _nv004062rm+0x1be/0x1607 [nvidia]
[81650.971664]  [<ffffffffa1292640>] ? _nv004062rm+0x1b/0x1607 [nvidia]
[81650.978112]  [<ffffffffa11b6b22>] ? _nv004040rm+0x1289/0xae92 [nvidia]
[81650.984734]  [<ffffffffa11b623e>] ? _nv004040rm+0x9a5/0xae92 [nvidia]
[81650.991269]  [<ffffffffa11b9595>] ? _nv004040rm+0x3cfc/0xae92 [nvidia]
[81650.997890]  [<ffffffffa11b6096>] ? _nv004040rm+0x7fd/0xae92 [nvidia]
[81651.004345]  [<ffffffffa0dee9f7>] ? _nv009855rm+0x175/0x278 [nvidia]
[81651.010735]  [<ffffffffa13f7abb>] ? _nv014821rm+0x21a/0x389 [nvidia]
[81651.017123]  [<ffffffffa13f9061>] ? _nv001090rm+0xac/0x65e [nvidia]
[81651.023423]  [<ffffffffa13f1e24>] ? rm_init_adapter+0xac/0x146 [nvidia]
[81651.030070]  [<ffffffffa1411f0c>] ? nv_kern_open+0x45c/0x810 [nvidia]
[81651.036493]  [<ffffffff8116650d>] ? chrdev_open+0x9d/0x1b0
[81651.041963]  [<ffffffff81166470>] ? cdev_put+0x30/0x30
[81651.047089]  [<ffffffff8116006a>] ? __dentry_open+0x25a/0x330
[81651.052813]  [<ffffffff811711b8>] ? do_last+0x408/0x750
[81651.058025]  [<ffffffff8116dd05>] ? path_init+0x315/0x400
[81651.063409]  [<ffffffff81171619>] ? path_openat+0xd9/0x400
[81651.068873]  [<ffffffff81171a65>] ? do_filp_open+0x45/0xb0
[81651.074335]  [<ffffffff8116d511>] ? getname_flags+0x31/0xf0
[81651.079888]  [<ffffffff8117e11b>] ? alloc_fd+0xcb/0x120
[81651.085099]  [<ffffffff81161608>] ? do_sys_open+0xf8/0x1d0
[81651.090563]  [<ffffffff815abc39>] ? system_call_fastpath+0x16/0x1b
[81651.096717] Code:  Bad RIP value.
[81651.100052] RIP  [<          (null)>]           (null)
[81651.105189]  RSP <ffff880cbfc03e30>
[81651.108662] CR2: 0000000000000000
[81651.112291] ---[ end trace d1e853d782dd9d8e ]---
[81651.116892] Kernel panic - not syncing: Fatal exception in interrupt
[81651.442356] ------------[ cut here ]------------
[81651.446965] WARNING: at /home/abuild/rpmbuild/BUILD/kernel-desktop-3.4.4/linux-3.4/arch/x86/kernel/smp.c:120 update_process_times+0x65/0x80()
[81651.459609] Hardware name: X8DTG-QF
[81651.463084] Modules linked in: binfmt_misc nfs lockd fscache auth_rpcgss nfs_acl sunrpc af_packet ipmi_devintf ipmi_si ipmi_msghandler w83795 w83627ehf hwmon_vid lm75 jc42 cpufreq_conservative cpufreq_userspace cpufreq_powersave dm_mod xfs joydev nvidia(PO) snd_hda_codec_hdmi sg snd_hda_intel snd_hda_codec acpi_cpufreq mperf snd_hwdep snd_pcm coretemp snd_timer crc32c_intel pcspkr serio_raw snd iTCO_wdt microcode i2c_i801 iTCO_vendor_support e1000e i7core_edac soundcore ixgbe ioatdma button snd_page_alloc edac_core dca mdio autofs4 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid0 raid1 ata_piix processor thermal_sys ata_generic arcmsr
[81651.524030] Pid: 28536, comm: nvidia-smi Tainted: P      D    O 3.4.4-2-desktop #1
[81651.531566] Call Trace:
[81651.534010]  [<ffffffff810043fa>] dump_trace+0xaa/0x2b0
[81651.539221]  [<ffffffff8158b1b1>] dump_stack+0x69/0x6f
[81651.544349]  [<ffffffff8104010b>] warn_slowpath_common+0x7b/0xc0
[81651.550338]  [<ffffffff81050525>] update_process_times+0x65/0x80
[81651.556328]  [<ffffffff81093a1b>] tick_sched_timer+0x5b/0xc0
[81651.561974]  [<ffffffff8106602e>] __run_hrtimer+0x6e/0x240
[81651.567444]  [<ffffffff810667e5>] hrtimer_interrupt+0xe5/0x200
[81651.573264]  [<ffffffff81021a93>] smp_apic_timer_interrupt+0x63/0xa0
[81651.579598]  [<ffffffff815ac73a>] apic_timer_interrupt+0x6a/0x70
[81651.585583]  [<ffffffff8158dd57>] panic+0x18f/0x1d2
[81651.590448]  [<ffffffff815a5ccf>] oops_end+0xef/0xf0
[81651.595402]  [<ffffffff815a8042>] do_page_fault+0x402/0x530
[81651.600960]  [<ffffffff815a5075>] page_fault+0x25/0x30
[81651.606086] ---[ end trace d1e853d782dd9d8f ]---

comment:18 Changed 12 years ago by Suren A. Chilingaryan

Last edited 12 years ago by Suren A. Chilingaryan (previous) (diff)

comment:19 Changed 12 years ago by Suren A. Chilingaryan

[10744.359268] BUG: unable to handle kernel NULL pointer dereference at           (null)
[10744.367103] IP: [<          (null)>]           (null)
[10744.372152] PGD c77c72067 PUD c787c3067 PMD 0 
[10744.376629] Oops: 0010 [#1] PREEMPT SMP 
[10744.380590] CPU 0 
[10744.382420] Modules linked in: binfmt_misc nfs lockd fscache auth_rpcgss nfs_acl sunrpc af_packet ipmi_devintf ipmi_si ipmi_msghandler w83795 w83627ehf hwmon_vid lm75 cpufreq_conservative jc42 cpufreq_userspace cpufreq_powersave dm_mod snd_hda_codec_hdmi snd_hda_intel snd_hda_codec xfs ixgbe nvidia(PO) ioatdma acpi_cpufreq snd_hwdep snd_pcm snd_timer e1000e snd sg iTCO_wdt serio_raw joydev i7core_edac dca mperf button coretemp pcspkr iTCO_vendor_support i2c_i801 edac_core soundcore crc32c_intel snd_page_alloc mdio microcode autofs4 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid0 raid1 ata_piix processor thermal_sys ata_generic arcmsr
[10744.443557] 
[10744.445043] Pid: 15063, comm: nvidia-smi Tainted: P           O 3.4.4-2-desktop #1 Supermicro X8DTG-QF/X8DTG-QF
[10744.455139] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
[10744.462609] RSP: 0018:ffff880cbfc03e30  EFLAGS: 00010082
[10744.467900] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000e293
[10744.475006] RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffff880c7589c008
[10744.482112] RBP: ffff880c7534b430 R08: 0000000000000001 R09: ffff880c73bb7b1c
[10744.489216] R10: 0000000000000000 R11: 0000000000000010 R12: ffff880c7589c008
[10744.496323] R13: ffff880c77a88008 R14: 0000000000000000 R15: ffff880c74b6d008
[10744.503428] FS:  00007fef894e2700(0000) GS:ffff880cbfc00000(0000) knlGS:0000000000000000
[10744.511486] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10744.517209] CR2: 0000000000000000 CR3: 0000000c8186c000 CR4: 00000000000007f0
[10744.524314] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10744.531419] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[10744.538527] Process nvidia-smi (pid: 15063, threadinfo ffff880c75582000, task ffff880c7552a3c0)
[10744.547187] Stack:
[10744.549193]  ffffffffa1e5193b ffff880c77a88008 ffff880c75348438 ffff880cbfc03eb4
[10744.556627]  0000000000000010 0000000000000005 ffffffffa1e5a6d3 0000000000000010
[10744.564061]  ffff88187fe7a000 0000000000000000 ffff880c75583918 0000000000000009
[10744.571496] Call Trace:
[10744.573934] Inexact backtrace:
[10744.573935] 
[10744.578464]  <IRQ> 
[10744.580646]  [<ffffffffa1e5193b>] ? _nv014871rm+0x8d/0xe2 [nvidia]
[10744.586864]  [<ffffffffa1e5a6d3>] ? rm_isr+0xb8/0x152 [nvidia]
[10744.592730]  [<ffffffffa1e78031>] ? nv_kern_isr+0x21/0x70 [nvidia]
[10744.598894]  [<ffffffff810cc8e5>] ? handle_irq_event_percpu+0x75/0x2a0
[10744.605403]  [<ffffffff810ccb57>] ? handle_irq_event+0x47/0x70
[10744.611221]  [<ffffffff810cfcc0>] ? handle_fasteoi_irq+0x60/0x100
[10744.617298]  [<ffffffff810041e8>] ? handle_irq+0x18/0x30
[10744.622595]  [<ffffffff81003e63>] ? do_IRQ+0x53/0xd0
[10744.627542]  [<ffffffff815a4dea>] ? common_interrupt+0x6a/0x6a
[10744.633358]  <EOI> 
[10744.635470]  [<ffffffff8114d532>] ? cache_grow+0x202/0x2b0
[10744.640985]  [<ffffffffa1886821>] ? _nv014644rm+0x1ed/0x1ed [nvidia]
[10744.647364]  [<ffffffffa1886882>] ? _nv014650rm+0x61/0x8d [nvidia]
[10744.653570]  [<ffffffffa1886882>] ? _nv014650rm+0x61/0x8d [nvidia]
[10744.659768]  [<ffffffffa1886935>] ? _nv014652rm+0x32/0x9e [nvidia]
[10744.665966]  [<ffffffffa18869d9>] ? _nv014646rm+0x38/0x46 [nvidia]
[10744.672215]  [<ffffffffa1cf9e03>] ? _nv004066rm+0x1be/0x1607 [nvidia]
[10744.678725]  [<ffffffffa1cf9c60>] ? _nv004066rm+0x1b/0x1607 [nvidia]
[10744.685173]  [<ffffffffa1c1ca92>] ? _nv004044rm+0x1259/0xae8b [nvidia]
[10744.691795]  [<ffffffffa1c1c1b1>] ? _nv004044rm+0x978/0xae8b [nvidia]
[10744.698330]  [<ffffffffa1c1f505>] ? _nv004044rm+0x3ccc/0xae8b [nvidia]

[10744.704951]  [<ffffffffa1c1c009>] ? _nv004044rm+0x7d0/0xae8b [nvidia]
[10744.711406]  [<ffffffffa1853ab7>] ? _nv009864rm+0x175/0x278 [nvidia]
[10744.717795]  [<ffffffffa1e5f6fb>] ? _nv014853rm+0x21a/0x389 [nvidia]
[10744.724184]  [<ffffffffa1e60ca1>] ? _nv001095rm+0xac/0x65e [nvidia]
[10744.730485]  [<ffffffffa1e59a64>] ? rm_init_adapter+0xac/0x146 [nvidia]
[10744.737131]  [<ffffffffa1e79b4c>] ? nv_kern_open+0x45c/0x810 [nvidia]
[10744.743553]  [<ffffffff8116650d>] ? chrdev_open+0x9d/0x1b0
[10744.749023]  [<ffffffff81166470>] ? cdev_put+0x30/0x30
[10744.754142]  [<ffffffff8116006a>] ? __dentry_open+0x25a/0x330
[10744.759864]  [<ffffffff811711b8>] ? do_last+0x408/0x750
[10744.765068]  [<ffffffff8116dd05>] ? path_init+0x315/0x400
[10744.770445]  [<ffffffff81171619>] ? path_openat+0xd9/0x400
[10744.775908]  [<ffffffff81171a65>] ? do_filp_open+0x45/0xb0
[10744.781371]  [<ffffffff8116d511>] ? getname_flags+0x31/0xf0
[10744.786924]  [<ffffffff8117e11b>] ? alloc_fd+0xcb/0x120
[10744.792135]  [<ffffffff81161608>] ? do_sys_open+0xf8/0x1d0
[10744.797607]  [<ffffffff815abc39>] ? system_call_fastpath+0x16/0x1b
[10744.803768] Code:  Bad RIP value.
[10744.807113] RIP  [<          (null)>]           (null)
[10744.812250]  RSP <ffff880cbfc03e30>
[10744.815724] CR2: 0000000000000000
[10744.819340] ---[ end trace 03a974d4317ca792 ]---
[10744.823947] Kernel panic - not syncing: Fatal exception in interrupt
[10745.150950] ------------[ cut here ]------------
[10745.155556] WARNING: at /home/abuild/rpmbuild/BUILD/kernel-desktop-3.4.4/linux-3.4/arch/x86/kernel/smp.c:120 update_process_times+0x65/0x80()
[10745.168200] Hardware name: X8DTG-QF
[10745.171676] Modules linked in: binfmt_misc nfs lockd fscache auth_rpcgss nfs_acl sunrpc af_packet ipmi_devintf ipmi_si ipmi_msghandler w83795 w83627ehf hwmon_vid lm75 cpufreq_conservative jc42 cpufreq_userspace cpufreq_powersave dm_mod snd_hda_codec_hdmi snd_hda_intel snd_hda_codec xfs ixgbe nvidia(PO) ioatdma acpi_cpufreq snd_hwdep snd_pcm snd_timer e1000e snd sg iTCO_wdt serio_raw joydev i7core_edac dca mperf button coretemp pcspkr iTCO_vendor_support i2c_i801 edac_core soundcore crc32c_intel snd_page_alloc mdio microcode autofs4 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid0 raid1 ata_piix processor thermal_sys ata_generic arcmsr
[10745.232621] Pid: 15063, comm: nvidia-smi Tainted: P      D    O 3.4.4-2-desktop #1
[10745.240159] Call Trace:
[10745.242602]  [<ffffffff810043fa>] dump_trace+0xaa/0x2b0
[10745.247811]  [<ffffffff8158b1b1>] dump_stack+0x69/0x6f
[10745.252931]  [<ffffffff8104010b>] warn_slowpath_common+0x7b/0xc0
[10745.258920]  [<ffffffff81050525>] update_process_times+0x65/0x80
[10745.264912]  [<ffffffff81093a1b>] tick_sched_timer+0x5b/0xc0
[10745.270556]  [<ffffffff8106602e>] __run_hrtimer+0x6e/0x240
[10745.276028]  [<ffffffff810667e5>] hrtimer_interrupt+0xe5/0x200
[10745.281848]  [<ffffffff81021a93>] smp_apic_timer_interrupt+0x63/0xa0
[10745.288181]  [<ffffffff815ac73a>] apic_timer_interrupt+0x6a/0x70
[10745.294165]  [<ffffffff8158dd57>] panic+0x18f/0x1d2
[10745.299031]  [<ffffffff815a5ccf>] oops_end+0xef/0xf0
[10745.303985]  [<ffffffff815a8042>] do_page_fault+0x402/0x530
[10745.309542]  [<ffffffff815a5075>] page_fault+0x25/0x30
[10745.314667] ---[ end trace 03a974d4317ca793 ]---
  • Booted with pcie_aspm=off iommu=soft

comment:20 Changed 12 years ago by Suren A. Chilingaryan

[20331.519822] IPMI message handler: BMC returned incorrect response, expected netfn 7 cmd 1, got netfn 29 cmd 11
[21927.342079] BUG: unable to handle kernel NULL pointer dereference at           (null)
[21927.349921] IP: [<          (null)>]           (null)
[21927.354969] PGD c7c488067 PUD c8181b067 PMD 0 
[21927.359448] Oops: 0010 [#1] PREEMPT SMP 
[21927.363407] CPU 0 
[21927.365238] Modules linked in: binfmt_misc nfs lockd fscache auth_rpcgss nfs_acl sunrpc af_packet ipmi_devintf ipmi_si ipmi_msghandler w83795 w83627ehf hwmon_vid lm75 cpufreq_conservative jc42 cpufreq_userspace cpufreq_powersave dm_mod xfs snd_hda_codec_hdmi nvidia(PO) acpi_cpufreq mperf joydev coretemp crc32c_intel snd_hda_intel snd_hda_codec microcode snd_hwdep sg serio_raw pcspkr snd_pcm snd_timer iTCO_wdt ixgbe i2c_i801 i7core_edac e1000e iTCO_vendor_support snd ioatdma button soundcore dca edac_core mdio snd_page_alloc autofs4 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid0 raid1 ata_piix processor thermal_sys ata_generic arcmsr
[21927.426375] 
[21927.427861] Pid: 25135, comm: nvidia-smi Tainted: P           O 3.4.4-2-desktop #1 Supermicro X8DTG-QF/X8DTG-QF
[21927.437958] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
[21927.445427] RSP: 0018:ffff880cbfc03e30  EFLAGS: 00010082
[21927.450718] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000c047
[21927.457823] RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffff880c6e81c008
[21927.464929] RBP: ffff880c81c4b5b0 R08: 0000000000000001 R09: ffff880c6e84f35c
[21927.472034] R10: 0000000000000000 R11: 0000000000000010 R12: ffff880c6e81c008
[21927.479140] R13: ffff880c807a4008 R14: 0000000000000000 R15: ffff880c7bc3e008
[21927.486248] FS:  00007fd67c12e700(0000) GS:ffff880cbfc00000(0000) knlGS:0000000000000000
[21927.494304] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[21927.500027] CR2: 0000000000000000 CR3: 0000000c80683000 CR4: 00000000000007f0
[21927.507132] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[21927.514239] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[21927.521344] Process nvidia-smi (pid: 25135, threadinfo ffff880c81800000, task ffff880c829a8840)
[21927.530006] Stack:
[21927.532010]  ffffffffa1fb093b ffff880c807a4008 ffff880c81c485b8 ffff880cbfc03eb4
[21927.539446]  0000000000000010 0000000000000004 ffffffffa1fb96d3 0000000000000010
[21927.546880]  ffff881882e3e000 0000000000000000 ffff880c818018f8 0000000000000009
[21927.554315] Call Trace:
[21927.556752] Inexact backtrace:
[21927.556753] 
[21927.561283]  <IRQ> 
[21927.563478]  [<ffffffffa1fb093b>] ? _nv014871rm+0x8d/0xe2 [nvidia]
[21927.569692]  [<ffffffffa1fb96d3>] ? rm_isr+0xb8/0x152 [nvidia]
[21927.575558]  [<ffffffffa1fd7031>] ? nv_kern_isr+0x21/0x70 [nvidia]
[21927.581722]  [<ffffffff810cc8e5>] ? handle_irq_event_percpu+0x75/0x2a0
[21927.588229]  [<ffffffff810ccb57>] ? handle_irq_event+0x47/0x70
[21927.594046]  [<ffffffff810cfcc0>] ? handle_fasteoi_irq+0x60/0x100
[21927.600117]  [<ffffffff810041e8>] ? handle_irq+0x18/0x30
[21927.605413]  [<ffffffff81003e63>] ? do_IRQ+0x53/0xd0
[21927.610359]  [<ffffffff815a4dea>] ? common_interrupt+0x6a/0x6a
[21927.616166]  <EOI> 
[21927.618322]  [<ffffffffa19e58a5>] ? _nv014650rm+0x84/0x8d [nvidia]
[21927.624527]  [<ffffffffa19e5882>] ? _nv014650rm+0x61/0x8d [nvidia]
[21927.630725]  [<ffffffffa19e5882>] ? _nv014650rm+0x61/0x8d [nvidia]
[21927.636925]  [<ffffffffa19e5935>] ? _nv014652rm+0x32/0x9e [nvidia]
[21927.643131]  [<ffffffffa19e59d9>] ? _nv014646rm+0x38/0x46 [nvidia]
[21927.649380]  [<ffffffffa1e58e03>] ? _nv004066rm+0x1be/0x1607 [nvidia]
[21927.655888]  [<ffffffffa1e58c60>] ? _nv004066rm+0x1b/0x1607 [nvidia]
[21927.662338]  [<ffffffffa1d7ba92>] ? _nv004044rm+0x1259/0xae8b [nvidia]
[21927.668960]  [<ffffffffa1d7b1b1>] ? _nv004044rm+0x978/0xae8b [nvidia]
[21927.675495]  [<ffffffffa1d7e505>] ? _nv004044rm+0x3ccc/0xae8b [nvidia]
[21927.682117]  [<ffffffffa1d7b009>] ? _nv004044rm+0x7d0/0xae8b [nvidia]
[21927.688571]  [<ffffffffa19b2ab7>] ? _nv009864rm+0x175/0x278 [nvidia]
[21927.694959]  [<ffffffffa1fbe6fb>] ? _nv014853rm+0x21a/0x389 [nvidia]
[21927.701349]  [<ffffffffa1fbfca1>] ? _nv001095rm+0xac/0x65e [nvidia]
[21927.707649]  [<ffffffffa1fb8a64>] ? rm_init_adapter+0xac/0x146 [nvidia]
[21927.714295]  [<ffffffffa1fd8b4c>] ? nv_kern_open+0x45c/0x810 [nvidia]
[21927.720718]  [<ffffffff8116650d>] ? chrdev_open+0x9d/0x1b0
[21927.726188]  [<ffffffff81166470>] ? cdev_put+0x30/0x30
[21927.731305]  [<ffffffff8116006a>] ? __dentry_open+0x25a/0x330
[21927.737029]  [<ffffffff811711b8>] ? do_last+0x408/0x750
[21927.742233]  [<ffffffff8116dd05>] ? path_init+0x315/0x400
[21927.747609]  [<ffffffff81171619>] ? path_openat+0xd9/0x400
[21927.753073]  [<ffffffff81171a65>] ? do_filp_open+0x45/0xb0
[21927.758535]  [<ffffffff8116d511>] ? getname_flags+0x31/0xf0
[21927.764086]  [<ffffffff8117e11b>] ? alloc_fd+0xcb/0x120
[21927.769291]  [<ffffffff81161608>] ? do_sys_open+0xf8/0x1d0
[21927.774754]  [<ffffffff815abc39>] ? system_call_fastpath+0x16/0x1b
[21927.780917] Code:  Bad RIP value.
[21927.784262] RIP  [<          (null)>]           (null)
[21927.789396]  RSP <ffff880cbfc03e30>
[21927.792871] CR2: 0000000000000000
[21927.796488] ---[ end trace 5d3827f96f798ce2 ]---
[21927.801092] Kernel panic - not syncing: Fatal exception in interrupt
[21928.126593] ------------[ cut here ]------------
[21928.131200] WARNING: at /home/abuild/rpmbuild/BUILD/kernel-desktop-3.4.4/linux-3.4/arch/x86/kernel/smp.c:120 update_process_times+0x65/0x80()
[21928.143843] Hardware name: X8DTG-QF
[21928.147318] Modules linked in: binfmt_misc nfs lockd fscache auth_rpcgss nfs_acl sunrpc af_packet ipmi_devintf ipmi_si ipmi_msghandler w83795 w83627ehf hwmon_vid lm75 cpufreq_conservative jc42 cpufreq_userspace cpufreq_powersave dm_mod xfs snd_hda_codec_hdmi nvidia(PO) acpi_cpufreq mperf joydev coretemp crc32c_intel snd_hda_intel snd_hda_codec microcode snd_hwdep sg serio_raw pcspkr snd_pcm snd_timer iTCO_wdt ixgbe i2c_i801 i7core_edac e1000e iTCO_vendor_support snd ioatdma button soundcore dca edac_core mdio snd_page_alloc autofs4 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid0 raid1 ata_piix processor thermal_sys ata_generic arcmsr
[21928.208263] Pid: 25135, comm: nvidia-smi Tainted: P      D    O 3.4.4-2-desktop #1
[21928.215801] Call Trace:
[21928.218245]  [<ffffffff810043fa>] dump_trace+0xaa/0x2b0
[21928.223455]  [<ffffffff8158b1b1>] dump_stack+0x69/0x6f
[21928.228583]  [<ffffffff8104010b>] warn_slowpath_common+0x7b/0xc0
[21928.234572]  [<ffffffff81050525>] update_process_times+0x65/0x80
[21928.240563]  [<ffffffff81093a1b>] tick_sched_timer+0x5b/0xc0
[21928.246209]  [<ffffffff8106602e>] __run_hrtimer+0x6e/0x240
[21928.251679]  [<ffffffff810667e5>] hrtimer_interrupt+0xe5/0x200
[21928.257498]  [<ffffffff81021a93>] smp_apic_timer_interrupt+0x63/0xa0
[21928.263833]  [<ffffffff815ac73a>] apic_timer_interrupt+0x6a/0x70
[21928.269826]  [<ffffffff8158dd57>] panic+0x18f/0x1d2
[21928.274691]  [<ffffffff815a5ccf>] oops_end+0xef/0xf0
[21928.279646]  [<ffffffff815a8042>] do_page_fault+0x402/0x530
[21928.285202]  [<ffffffff815a5075>] page_fault+0x25/0x30
[21928.290319] ---[ end trace 5d3827f96f798ce3 ]---

comment:21 Changed 12 years ago by Suren A. Chilingaryan

This time I applied several changes at once to speed up the tests:

  • Hardware: the configuration now replicates ipepdvcompute1
    • slots used by the extender card, SATA controller, GPUs...
    • the 10 GBit Ethernet card and the frame grabber are removed
  • A few changes were made to the BIOS
    • Disabled IRQ19 Capture (probably doesn't matter)
    • Disabled the PCIe slot containing the SATA controller to preserve ROM space
    • Enabled PnP OS (this may alter the distribution of interrupt numbers and prevent crashes due to IRQ conflicts)
    • Enabled APIC ACPI SCI IRQ (this option somehow affects the distribution of interrupt numbers as well)
    • It is not possible to disable the APIC controller altogether: otherwise only a single core is available to Linux. Nor is it possible to disable ACPI: there is some conflict with the IPMI BMC and even GRUB does not appear.
  • The NVIDIA driver is instructed to use MSI interrupts
    • NVreg_EnableMSI=1 is passed to the module
    • In this mode the NVIDIA driver uses the following interrupts:
       124:         63          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      nvidia
       125:         29          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      nvidia
       126:          0          0          0          0          0          0         17          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      nvidia
       127:          0          0          0          0          0          0          5          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      nvidia
       128:          0          0          0          0          0          0          3          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      nvidia
       129:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      nvidia
      
    • In standard mode, the interrupts are:
        16:      32057          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-IO-APIC-fasteoi   uhci_hcd:usb3, arcmsr, nvidia, nvidia, nvidia, nvidia, nvidia, nvidia
      
    • On ipepdvcompute1:
        16:    5806099          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3, arcmsr, arcmsr, nvidia, nvidia, nvidia, nvidia, nvidia
        18:         44          0          0          0          0    2315994          0          0          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb8, nvidia, nvidia, nvidia, nvidia
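
The listings above can be compared mechanically. The following is a minimal sketch, not part of the original investigation, that parses text in the /proc/interrupts layout shown here and reports, for each line with an nvidia handler, whether it is an MSI line, how many nvidia handlers are registered on it, and which other drivers share it. The names `parse_interrupts` and `nvidia_irq_report` are hypothetical, and the parser assumes the type column is a single token as in the listings above (newer kernels may print it differently).

```python
from typing import Dict, List, NamedTuple


class IrqLine(NamedTuple):
    irq: str            # IRQ number or label, e.g. "16" or "124"
    total: int          # interrupt count summed over all CPUs
    kind: str           # type column, e.g. "IR-PCI-MSI-edge"
    devices: List[str]  # handlers registered on this line


def parse_interrupts(text: str) -> List[IrqLine]:
    """Parse /proc/interrupts-style text (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        if not sep or not head.strip():
            continue  # the CPU header line has no "label:" prefix
        fields = rest.split()
        # The leading numeric fields are the per-CPU counters.
        n = 0
        while n < len(fields) and fields[n].isdigit():
            n += 1
        if n == 0 or n >= len(fields):
            continue  # no counters or no type column
        # Assumption: the type column is exactly one token.
        names = " ".join(fields[n + 1:]).split(",")
        devices = [name.strip() for name in names if name.strip()]
        total = sum(int(f) for f in fields[:n])
        rows.append(IrqLine(head.strip(), total, fields[n], devices))
    return rows


def nvidia_irq_report(text: str) -> Dict[str, dict]:
    """Summarise every interrupt line that has an nvidia handler."""
    report = {}
    for row in parse_interrupts(text):
        gpus = row.devices.count("nvidia")
        if gpus:
            report[row.irq] = {
                "msi": "MSI" in row.kind,
                "gpus": gpus,
                "shared_with": [d for d in row.devices if d != "nvidia"],
            }
    return report
```

Feeding it the contents of /proc/interrupts, for example `nvidia_irq_report(open("/proc/interrupts").read())`, should make the difference between the MSI and the shared IRQ16 configurations visible at a glance.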
      

This configuration seems rather stable. I have not done the full test, but there were no hangs during a full 24-hour run. Currently, I am checking whether

  • it gets worse with non-MSI interrupts

Later actions

  • If there are no crashes, we will check whether there are conflicts with the removed hardware (first with standard interrupts and then with MSI).
  • It seems the main difference between ipepdvcompute1 and ufosrv1 is the running X session. Because of VirtualGL, ipepdvcompute1 runs Xorg. As a result, the nvidia driver is always in the active state and there is no initialization penalty for nvidia-smi. If we run any CUDA application on ufosrv1, nvidia-smi runs as fast as on ipepdvcompute1 while the application is executing. It also seems that the hangs are triggered by driver initialization. Otherwise, it is not clear why nvidia-smi triggers crashes significantly more reliably than heavy long-running CUDA applications. As a replacement for the X session, we may try to run the GPUs in persistence mode. In this case, there should be no initialization and hopefully no crashes.
  • As a last resort, we will take the GPU box for investigation and install a third GTX580 into ufosrv1. In the long run, the GTX580 may be replaced with a GTX590. Then we will get a configuration more or less comparable to the current state.
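
The initialization penalty mentioned above can be observed by timing repeated invocations. Below is a minimal sketch under stated assumptions: `time_runs` is a hypothetical helper, not a tool used in this ticket, and the command to run is passed in by the caller (bare `nvidia-smi` is the obvious candidate). It only measures wall-clock time per run and cannot detect or survive a kernel panic.

```python
import subprocess
import time
from typing import List, Sequence


def time_runs(cmd: Sequence[str], runs: int = 10,
              timeout: float = 60.0) -> List[float]:
    """Run cmd `runs` times and return the wall-clock duration of each run.

    Output is discarded and the exit status is ignored; a run that exceeds
    `timeout` seconds raises subprocess.TimeoutExpired.
    """
    durations = []
    for _ in range(runs):
        start = time.monotonic()
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL,
                       timeout=timeout, check=False)
        durations.append(time.monotonic() - start)
    return durations


# Example (not executed here):
#   durations = time_runs(["nvidia-smi"], runs=20)
#   print(min(durations), max(durations))
```

If the hypothesis holds, the per-run time should drop noticeably once the driver stays initialized (a running CUDA application, an X session, or persistence mode).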

comment:22 Changed 12 years ago by Suren A. Chilingaryan

[94324.326043] BUG: unable to handle kernel NULL pointer dereference at           (null)
[94324.333905] IP: [<          (null)>]           (null)
[94324.338971] PGD c7e6b8067 PUD c81d1e067 PMD 0 
[94324.343476] Oops: 0010 [#1] PREEMPT SMP 
[94324.347461] CPU 0 
[94324.349301] Modules linked in: nvidia(PO) nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipmi_devintf ipmi_si ipmi_msghandler w83795 w83627ehf hwmon_vid lm75 cpufreq_conservative jc42 cpufreq_userspace cpufreq_powersave dm_mod snd_hda_codec_hdmi xfs snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd e1000e soundcore iTCO_wdt acpi_cpufreq mperf snd_page_alloc joydev coretemp serio_raw iTCO_vendor_support ioatdma pcspkr sg i2c_i801 i7core_edac crc32c_intel microcode edac_core dca button autofs4 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid0 raid1 ata_piix processor thermal_sys ata_generic arcmsr [last unloaded: nvidia]
[94324.410142] 
[94324.411640] Pid: 5100, comm: nvidia-smi Tainted: P           O 3.4.4-2-desktop #1 Supermicro X8DTG-QF/X8DTG-QF
[94324.421683] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
[94324.429170] RSP: 0018:ffff880cbfc03e30  EFLAGS: 00010082
[94324.434468] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000f1e8
[94324.441583] RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffff880c7a78a008
[94324.448698] RBP: ffff880c79a4f8f0 R08: 0000000000000001 R09: ffff880c79b5bd5c
[94324.455812] R10: 0000000000000000 R11: 0000000000000010 R12: ffff880c7a78a008
[94324.462927] R13: ffff880c8016a008 R14: 0000000000000000 R15: ffff880c7f756008
[94324.470041] FS:  00007f828267d700(0000) GS:ffff880cbfc00000(0000) knlGS:0000000000000000
[94324.478106] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[94324.483839] CR2: 0000000000000000 CR3: 0000000c7e5b2000 CR4: 00000000000007f0
[94324.490953] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[94324.498068] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[94324.505181] Process nvidia-smi (pid: 5100, threadinfo ffff880c80b4c000, task ffff880c80852180)
[94324.513764] Stack:
[94324.515781]  ffffffffa153393b ffff880c8016a008 ffff880c79a4c8f8 ffff880cbfc03eb4
[94324.523249]  0000000000000010 0000000000000005 ffffffffa153c6d3 0000000000000010
[94324.530709]  ffff881881b21800 0000000000000000 ffff880c80b4d8e8 0000000000000009
[94324.538169] Call Trace:
[94324.540617] Inexact backtrace:
[94324.540618] 
[94324.545164]  <IRQ> 
[94324.547373]  [<ffffffffa153393b>] ? _nv014871rm+0x8d/0xe2 [nvidia]
[94324.553589]  [<ffffffffa153c6d3>] ? rm_isr+0xb8/0x152 [nvidia]
[94324.559456]  [<ffffffffa155a031>] ? nv_kern_isr+0x21/0x70 [nvidia]
[94324.565619]  [<ffffffff810cc8e5>] ? handle_irq_event_percpu+0x75/0x2a0
[94324.572127]  [<ffffffff810ccb57>] ? handle_irq_event+0x47/0x70
[94324.577946]  [<ffffffff810cfcc0>] ? handle_fasteoi_irq+0x60/0x100
[94324.584022]  [<ffffffff810041e8>] ? handle_irq+0x18/0x30
[94324.589321]  [<ffffffff81003e63>] ? do_IRQ+0x53/0xd0
[94324.594275]  [<ffffffff815a4dea>] ? common_interrupt+0x6a/0x6a
[94324.600090]  <EOI> 
[94324.602263]  [<ffffffffa0f68bed>] ? _nv014649rm+0xb/0x21 [nvidia]
[94324.608383]  [<ffffffffa0f68869>] ? _nv014650rm+0x48/0x8d [nvidia]
[94324.614589]  [<ffffffffa0f68882>] ? _nv014650rm+0x61/0x8d [nvidia]
[94324.620795]  [<ffffffffa0f68882>] ? _nv014650rm+0x61/0x8d [nvidia]
[94324.626993]  [<ffffffffa0f68935>] ? _nv014652rm+0x32/0x9e [nvidia]
[94324.633201]  [<ffffffffa0f689d9>] ? _nv014646rm+0x38/0x46 [nvidia]
[94324.639458]  [<ffffffffa13dbe03>] ? _nv004066rm+0x1be/0x1607 [nvidia]
[94324.645976]  [<ffffffffa13dbc60>] ? _nv004066rm+0x1b/0x1607 [nvidia]
[94324.652434]  [<ffffffffa12fea92>] ? _nv004044rm+0x1259/0xae8b [nvidia]
[94324.659062]  [<ffffffffa12fe1b1>] ? _nv004044rm+0x978/0xae8b [nvidia]
[94324.665607]  [<ffffffffa1301505>] ? _nv004044rm+0x3ccc/0xae8b [nvidia]
[94324.672239]  [<ffffffffa12fe009>] ? _nv004044rm+0x7d0/0xae8b [nvidia]
[94324.678702]  [<ffffffffa0f35ab7>] ? _nv009864rm+0x175/0x278 [nvidia]
[94324.685090]  [<ffffffffa15416fb>] ? _nv014853rm+0x21a/0x389 [nvidia]
[94324.691478]  [<ffffffffa1542ca1>] ? _nv001095rm+0xac/0x65e [nvidia]
[94324.697779]  [<ffffffffa153ba64>] ? rm_init_adapter+0xac/0x146 [nvidia]
[94324.704425]  [<ffffffffa155bb4c>] ? nv_kern_open+0x45c/0x810 [nvidia]
[94324.710848]  [<ffffffff8116650d>] ? chrdev_open+0x9d/0x1b0
[94324.716318]  [<ffffffff81166470>] ? cdev_put+0x30/0x30
[94324.721444]  [<ffffffff8116006a>] ? __dentry_open+0x25a/0x330
[94324.727178]  [<ffffffff811711b8>] ? do_last+0x408/0x750
[94324.732390]  [<ffffffff8116dd05>] ? path_init+0x315/0x400
[94324.737775]  [<ffffffff81171619>] ? path_openat+0xd9/0x400
[94324.743246]  [<ffffffff81171a65>] ? do_filp_open+0x45/0xb0
[94324.748718]  [<ffffffff8116d511>] ? getname_flags+0x31/0xf0
[94324.754279]  [<ffffffff8117e11b>] ? alloc_fd+0xcb/0x120
[94324.759490]  [<ffffffff81161608>] ? do_sys_open+0xf8/0x1d0
[94324.764963]  [<ffffffff815abc39>] ? system_call_fastpath+0x16/0x1b
[94324.771124] Code:  Bad RIP value.
[94324.774479] RIP  [<          (null)>]           (null)
[94324.779631]  RSP <ffff880cbfc03e30>
[94324.783115] CR2: 0000000000000000
[94324.786988] ---[ end trace 623a6040af9ab866 ]---
[94324.791596] Kernel panic - not syncing: Fatal exception in interrupt

comment:23 Changed 12 years ago by Suren A. Chilingaryan

Actually, the GPUBox has interesting diagnostic signals. In both the normal and the hung state, no errors are reported and a correct link to the server is indicated. However, all GPU LEDs are blinking (long on / short off). According to the documentation, this means that the cards are operating at 2.5 GBit/s per lane instead of 5 GBit/s.

  • bandwidthTest reports 5700 MB/s to the device and 6355 MB/s from it, which is fine for a full-speed Gen2 x16 link.
  • In the hung state, the first two lights stop blinking and the second pair continues.
  • On ipepdvcompute1 no blinking is registered.
  • If diagnosed with OSS SysMon, no alarm is indicated.
    • However, the voltage for -12V is reported as -13.51V. This value never changes and is exactly the same on ipepdvcompute1, so I guess it is a sensor or SysMon error.
    • The only other difference to ipepdvcompute1: Out 6 is green on ufosrv1 while it is not highlighted on ipepdvcompute1. However, there is no information about the meaning of Out 6.
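
The bandwidth figures can be checked against the per-lane rates with simple arithmetic. The sketch below assumes an x16 link with 8b/10b line coding (as used by PCIe Gen1 and Gen2) and reads both bandwidthTest figures as MB/s. It computes the theoretical one-direction ceiling, which real transfers never reach because of protocol overhead; the function name is hypothetical.

```python
LANES = 16
DATA_BITS, LINE_BITS = 8, 10  # 8b/10b coding: 8 payload bits per 10 line bits


def link_ceiling_mb_s(gbit_per_lane: float, lanes: int = LANES) -> float:
    """Theoretical one-direction ceiling of a PCIe link in MB/s."""
    usable_gbit = gbit_per_lane * lanes * DATA_BITS / LINE_BITS
    return usable_gbit * 1000 / 8  # Gbit/s -> MB/s


ceiling_2_5 = link_ceiling_mb_s(2.5)  # 2.5 GBit/s per lane
ceiling_5_0 = link_ceiling_mb_s(5.0)  # 5 GBit/s per lane
measured_mb_s = (5700, 6355)          # figures reported by bandwidthTest

for value in measured_mb_s:
    print(value,
          "exceeds the 2.5 GBit/s ceiling:", value > ceiling_2_5,
          "fits under the 5 GBit/s ceiling:", value <= ceiling_5_0)
```

Both measured values lie above the 4000 MB/s ceiling of a 2.5 GBit/s x16 link and below the 8000 MB/s ceiling of a 5 GBit/s one, so at least during the transfers the link must have been running at the higher rate, whatever the LEDs indicate.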

comment:24 Changed 12 years ago by Suren A. Chilingaryan

  • With the hardware back in the system, there are no changes to the list of devices using IRQ16.
      16:       5479          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-IO-APIC-fasteoi   uhci_hcd:usb3, arcmsr, nvidia, nvidia, nvidia, nvidia, nvidia
    
  • OK. The Supermicro board is apparently not able to work with more than 9 GPU cores due to the limited ROM space. Could there also be a limit on the number of cards? Maybe the current driver is not able to properly handle more than 5 cards sharing one interrupt. As can be seen on ipepdvcompute1, in the case of dual-core cards, one core uses IRQ16 and the other IRQ18.
  • So, all hardware is back and a long test is running with MSI-style interrupts to check whether this really solves the problems.

comment:25 Changed 12 years ago by Suren A. Chilingaryan

No problems for 5 days. The following configuration changes are required to avoid problems:

  • The pcie_aspm=off kernel parameter disables PCIe power management and fixes the errors on the PCIe bus.
  • The NVreg_EnableMSI=1 parameter of the nvidia module enforces the use of MSI-style interrupts and prevents crashes in the IRQ handler.
  • nvidia-smi -pm 1 enables persistence mode.
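
The two boot-time settings can be verified from text alone. This is a minimal sketch under stated assumptions: `check_settings` is a hypothetical helper, the kernel parameter is taken in the kernel's own spelling `pcie_aspm` (as in the boot line of comment 19), and the module option is assumed to be set through a modprobe.d-style `options nvidia ...` line. Persistence mode is runtime state set by `nvidia-smi -pm 1` and is not checked here.

```python
from typing import Dict


def parse_params(text: str) -> Dict[str, str]:
    """Split a whitespace-separated "key=value" string into a mapping."""
    params = {}
    for token in text.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


def check_settings(cmdline: str, modprobe_conf: str) -> Dict[str, bool]:
    """Report whether the two boot-time settings from this ticket are present.

    cmdline       -- kernel command line, e.g. the contents of /proc/cmdline
    modprobe_conf -- text of a modprobe.d file (its location is an assumption)
    """
    kernel = parse_params(cmdline)
    nvidia: Dict[str, str] = {}
    for line in modprobe_conf.splitlines():
        words = line.split()
        if words[:2] == ["options", "nvidia"]:
            nvidia.update(parse_params(" ".join(words[2:])))
    return {
        "pcie_aspm=off": kernel.get("pcie_aspm") == "off",
        "NVreg_EnableMSI=1": nvidia.get("NVreg_EnableMSI") == "1",
    }
```

A typical call would pass `open("/proc/cmdline").read()` as the first argument and the text of the relevant modprobe configuration file as the second.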

Call for testing…

Last edited 12 years ago by Suren A. Chilingaryan (previous) (diff)

comment:26 Changed 12 years ago by Matthias Balzer

Description: modified (diff)
ticket_due: 31/08/2012

comment:27 Changed 12 years ago by Suren A. Chilingaryan

Description: modified (diff)

comment:28 Changed 12 years ago by Suren A. Chilingaryan

Resolution: fixed
Status: new → closed

No crashes or complaints for 2 weeks. I'm closing the ticket. Please re-open it if further stability problems arise.
