Opened 10 years ago
Last modified 9 years ago
#229 reopened defect
UFO2 Server is crashing under the load
Reported by: | Suren A. Chilingaryan | Owned by: | |
---|---|---|---|
Priority: | critical | Milestone: | |
Component: | Infrastructure | Version: | |
Keywords: | | Cc: | |
Description
Just start '/root/tests-v4/run.sh' and wait for a while...
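The contents of run.sh are not shown in the ticket; as a rough sketch only (assuming a CUDA workload such as the matrixMulCUBLAS sample used later in this thread), a comparable all-GPU stress loop could look like this:

#!/bin/bash
# Hypothetical stand-in for /root/tests-v4/run.sh: keep every GPU busy
# with a CUDA sample workload until interrupted.
NGPU=$(nvidia-smi --list-gpus | wc -l)
for ((i = 0; i < NGPU; i++)); do
    ( while true; do CUDA_VISIBLE_DEVICES=$i ./matrixMulCUBLAS >/dev/null; done ) &
done
wait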
Attachments (0)
Change History (23)
comment:1 Changed 10 years ago by
comment:3 follow-up: 4 Changed 10 years ago by
The system has also crashed with only the 2 GPUs installed directly in the server box in use.
comment:4 Changed 10 years ago by
Replying to csa:
The system has also crashed with only the 2 GPUs installed directly in the server box in use.
Did you physically disconnect all the GPUs except for those two, or did you just disable them / leave them unused in software?
comment:5 Changed 10 years ago by
Nope, I am just stress testing certain subsets of GPUs. Actually, it seems GPU1 (counting from 0) is the problem. I ran the task on GPU0 for half a day and all was fine. Currently I'm stress testing 4 GPUs in the box and there have been no crashes so far.
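Presumably the subsets are selected per process via CUDA_VISIBLE_DEVICES, as in the matrix-multiplication runs later in the ticket; a hypothetical invocation that loads only GPU0 and GPU1 at the same time:

# Hypothetical: load only GPU0 and GPU1 simultaneously; the other GPUs stay idle.
( while true; do CUDA_VISIBLE_DEVICES=0 ./matrixMulCUBLAS >/dev/null; done ) &
( while true; do CUDA_VISIBLE_DEVICES=1 ./matrixMulCUBLAS >/dev/null; done ) &
wait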
comment:7 Changed 10 years ago by
OK. It seems the problem really appears only if GPU0 and GPU1 are used simultaneously. There are no crashes if GPU0 is excluded and all other resources are loaded.
There are the following complaints in the logs; however, this may be unrelated.
[74094.473877] {3}[Hardware Error]: APEI generic hardware error status
[74094.481968] {3}[Hardware Error]: severity: 2, corrected
[74094.481969] {3}[Hardware Error]: section: 0, severity: 2, corrected
[74094.481971] {3}[Hardware Error]: flags: 0x01
[74094.481975] {3}[Hardware Error]: primary
[74094.481978] {3}[Hardware Error]: fru_text: CorrectedErr
[74094.481979] {3}[Hardware Error]: section_type: PCIe error
[74094.481980] {3}[Hardware Error]: port_type: 0, PCIe end point
[74094.481981] {3}[Hardware Error]: version: 0.0
[74094.481982] {3}[Hardware Error]: command: 0xffff, status: 0xffff
[74094.481983] {3}[Hardware Error]: device_id: 0000:00:02.3
[74094.481983] {3}[Hardware Error]: slot: 0
[74094.481984] {3}[Hardware Error]: secondary_bus: 0x00
[74094.481985] {3}[Hardware Error]: vendor_id: 0xffff, device_id: 0xffff
[74094.481986] {3}[Hardware Error]: class_code: ffffff
[75485.883599] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[75485.893606] {4}[Hardware Error]: APEI generic hardware error status
[75485.901216] {4}[Hardware Error]: severity: 2, corrected
[75485.901218] {4}[Hardware Error]: section: 0, severity: 2, corrected
[75485.901219] {4}[Hardware Error]: flags: 0x01
[75485.901220] {4}[Hardware Error]: primary
[75485.901221] {4}[Hardware Error]: fru_text: CorrectedErr
[75485.901222] {4}[Hardware Error]: section_type: PCIe error
[75485.901223] {4}[Hardware Error]: port_type: 0, PCIe end point
[75485.901223] {4}[Hardware Error]: version: 0.0
[75485.901224] {4}[Hardware Error]: command: 0xffff, status: 0xffff
[75485.901224] {4}[Hardware Error]: device_id: 0000:00:02.3
[75485.901225] {4}[Hardware Error]: slot: 0
[75485.901225] {4}[Hardware Error]: secondary_bus: 0x00
[75485.901225] {4}[Hardware Error]: vendor_id: 0xffff, device_id: 0xffff
[75485.901226] {4}[Hardware Error]: class_code: ffffff
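The device_id 0000:00:02.3 in these records points at a specific PCIe function; it can be identified on the running system (the commands below are a generic lspci sketch, not output captured from this server):

# Show which device sits at the address named in the APEI records above.
lspci -s 00:02.3
# Dump its full config space, including the AER capability/status registers.
sudo lspci -vvv -s 00:02.3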
comment:8 Changed 10 years ago by
The SMBIOS log reports: Smbios 0x0A Bus00(DevFn18)
which is a PCI system error.
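Assuming the DevFn value is hexadecimal, 0x18 would decode to device 3, function 0, i.e. 0000:00:03.0, one of the root ports carrying a Titan; this decoding is an assumption, not something confirmed in the thread:

# Decode a PCI DevFn byte into device.function form (here 0x18 -> 00:03.0).
printf '00:%02x.%x\n' $((0x18 >> 3)) $((0x18 & 7))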
comment:9 Changed 10 years ago by
I have enabled PCIe error reporting in the BIOS (PERR, SERR). This is what I now get continuously:
[ 359.418369] nvidia 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0200(Transmitter ID)
[ 359.418370] nvidia 0000:02:00.0: device [10de:1005] error status/mask=00001000/0000a000
[ 359.418371] nvidia 0000:02:00.0: [12] Replay Timer Timeout
[ 359.431184] pcieport 0000:00:02.0: AER: Corrected error received: id=0200
[ 359.431187] nvidia 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0200(Transmitter ID)
[ 359.431188] nvidia 0000:02:00.0: device [10de:1005] error status/mask=00001000/0000a000
[ 359.431191] nvidia 0000:02:00.0: [12] Replay Timer Timeout
[ 359.432548] pcieport 0000:00:02.0: AER: Multiple Corrected error received: id=0010
[ 359.432558] pcieport 0000:00:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0010(Receiver ID)
[ 359.432559] pcieport 0000:00:02.0: device [8086:0e04] error status/mask=00000040/00002000
[ 359.432560] pcieport 0000:00:02.0: [ 6] Bad TLP
[ 359.449173] pcieport 0000:00:02.0: AER: Corrected error received: id=0010
[ 359.449181] pcieport 0000:00:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0010(Transmitter ID)
[ 359.449182] pcieport 0000:00:02.0: device [8086:0e04] error status/mask=00001000/00002000
[ 359.449183] pcieport 0000:00:02.0: [12] Replay Timer Timeout
[ 359.456220] pcieport 0000:00:02.0: AER: Corrected error received: id=0200
[ 359.456223] nvidia 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0200(Transmitter ID)
[ 359.456224] nvidia 0000:02:00.0: device [10de:1005] error status/mask=00001000/0000a000
[ 359.456225] nvidia 0000:02:00.0: [12] Replay Timer Timeout
[ 360.685258] pcieport 0000:00:03.0: AER: Corrected error received: id=0300
[ 360.685267] nvidia 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0300(Transmitter ID)
[ 360.685268] nvidia 0000:03:00.0: device [10de:1005] error status/mask=00001000/0000a000
[ 360.685269] nvidia 0000:03:00.0: [12] Replay Timer Timeout
[ 361.321229] pcieport 0000:00:03.0: AER: Corrected error received: id=0300
[ 361.321233] nvidia 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0300(Transmitter ID)
[ 361.321234] nvidia 0000:03:00.0: device [10de:1005] error status/mask=00001000/0000a000
[ 361.321234] nvidia 0000:03:00.0: [12] Replay Timer Timeout
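To gauge how fast these corrected errors accumulate while the load is running, one can simply follow the kernel log; a minimal sketch using standard tools, nothing specific to this server:

# Follow the kernel log and filter AER / PCIe bus error messages as they arrive
# ('dmesg -w' needs a reasonably recent util-linux; 'journalctl -kf' also works).
dmesg -w | grep -iE 'AER|PCIe Bus Error'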
comment:10 Changed 10 years ago by
[ 533.758487] pcieport 0000:8c:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=8c00(Receiver ID)
[ 533.778803] pcieport 0000:8c:00.0: device [10b5:8749] error status/mask=00000001/0000e000
[ 533.797718] pcieport 0000:8c:00.0: [ 0] Receiver Error (First)
[ 536.887138] pcieport 0000:80:02.0: AER: Corrected error received: id=8c00
[ 536.903296] pcieport 0000:8c:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=8c00(Receiver ID)
[ 536.923931] pcieport 0000:8c:00.0: device [10b5:8749] error status/mask=00000001/0000e000
[ 536.942301] pcieport 0000:8c:00.0: [ 0] Receiver Error (First)
[ 543.582006] pcieport 0000:80:02.0: AER: Corrected error received: id=8c00
[ 543.598644] pcieport 0000:8c:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=8c00(Receiver ID)
[ 543.619015] pcieport 0000:8c:00.0: device [10b5:8749] error status/mask=00000001/0000e000
[ 543.637437] pcieport 0000:8c:00.0: [ 0] Receiver Error (First)
comment:11 Changed 10 years ago by
- Reducing the speed to PCIe gen2 seems to prevent the problem (a way to verify the negotiated link speed is sketched after this list)
- Disabling I/OAT does not help
- Playing with the Ageing Timer Rollover does not help
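After forcing gen2 in the BIOS, the negotiated link speed of the affected slots can be double-checked from Linux; a sketch using the bus addresses from the logs above:

# LnkCap shows the maximum supported speed, LnkSta the currently negotiated one
# (8GT/s = gen3, 5GT/s = gen2, 2.5GT/s = gen1).
sudo lspci -vv -s 02:00.0 | grep -E 'LnkCap:|LnkSta:'
sudo lspci -vv -s 03:00.0 | grep -E 'LnkCap:|LnkSta:'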
comment:13 Changed 10 years ago by
- The system crashes even without the GPUBox connected
- Replacing both GPU0 and GPU1 does not prevent the crashes (with the GPUBox disconnected)
comment:14 Changed 10 years ago by
- The problem has been reported to Tooltec:

Dear Florian,

Thanks a lot for fixing the GPU-Box; it now works without problems. Unfortunately, we still have other issues with the system. The system crashes if the 2 Titan GPUs at 0000:02:00.0 and 0000:03:00.0 are heavily loaded at the same time (they are installed in the PCIe CPU1 slots 2 & 4, on the edge of the board farthest from the CPUs). It only happens when both of these GPUs are loaded simultaneously. An idle system is stable, and if either of these GPUs is loaded while the other is idle, the system is also stable. In fact, we can load 6 GPUs in the system, excluding either one of these two, and the system works perfectly stably; it only crashes when both of them are loaded. The problem also persists with the GPUBox disconnected. I have also tried replacing the GPUs in these two slots, but the system still crashed.

With PCIe logging enabled, I get the following messages from the kernel at a VERY high rate until the system crashes (see dmesg.txt attached). If I enforce PCIe gen2 in the BIOS, the problem goes away, or at least takes much longer to appear. I still get PCIe errors in the log, but at a significantly slower rate: about 1-2 per hour instead of 10 per second.

regards,
Suren
comment:15 Changed 10 years ago by
Supermicro
Please check the following: test with each MMCFG setting available in the BIOS. This MMCFG setting changes the memory addressing for PCI resources, which can fix this kind of issue. Regretfully, it is impossible to tell which exact setting it is, so you will have to test each setting one by one. MMCFG setting location: Advanced\Chipset Configuration\Northbridge\I/O Configuration\MMCFG Base: test with each setting (the default is 0x8xxxx).
The system still crashes with all possible MMCFG settings.
comment:16 Changed 10 years ago by
Supermicro
We have found out the following: the power distributor of your server's power supply is revision 1.2. To support newer GPU cards with a higher TDP (power usage) you require power distributor revision 1.3. We recommend requesting an RMA for your power distributor PDB-PT747-4648 and asking for revision 1.3. I cannot tell for certain whether this is possible; if the RMA request is refused, you will have to purchase a new PDB-PT747-4648 rev. 1.3. You need to request the RMA via your supplier, who can check this directly with Supermicro RMA.
I have forwarded this to Tooltec
comment:17 Changed 10 years ago by
Besides, one GPU is significantly slower than the others:
ankaimageufo2:# CUDA_VISIBLE_DEVICES="5" ./matrixMulCUBLAS
MatrixA(320,640), MatrixB(320,640), MatrixC(320,640)
Performance= 489.91 GFlop/s, Time= 0.268 msec, Size= 131072000 Ops
ankaimageufo2:# CUDA_VISIBLE_DEVICES="4" ./matrixMulCUBLAS
MatrixA(320,640), MatrixB(320,640), MatrixC(320,640)
Performance= 1379.06 GFlop/s, Time= 0.095 msec, Size= 131072000 Ops
- This behavior is stable; a reboot does not help
- Besides card 5, all other cards are fine
- The temperature of the card is fine as well (a clock/throttle check is sketched after this list)
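One way to confirm whether card 5 is really running at lower clocks, and whether the driver reports a throttle reason, is to query it with nvidia-smi and compare against a healthy card (the exact fields shown depend on the driver version in use):

# Compare the clocks of the slow card (5) against a healthy one (4).
nvidia-smi -i 5 -q -d CLOCK
nvidia-smi -i 4 -q -d CLOCK
# Show clock throttle reasons for the slow card, if the driver exposes them.
nvidia-smi -i 5 -q -d PERFORMANCE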
comment:20 Changed 10 years ago by
New mail to Tooltec
Dear Christian,

Thanks for your help with this matter. Unfortunately, we still have problems with the server. There are lots of errors on the PCIe bus when PERR/SERR error reporting is enabled in the BIOS, and the system freezes when the first 2 GPUs are loaded simultaneously. From Supermicro I got a few more suggestions for BIOS settings to try. They also provided a new unofficial BIOS (X9DRGQF5.116), but that makes no difference: the errors are still there and the system is still crashing. This happens regardless of whether the GPUBox is connected or disconnected; however, with the external GPUBox connected, more errors are reported on the PCIe bus.

I have found another problem with one of the GPUs in the external box. Its performance is significantly slower; it seems the GPU is under-clocked for some reason. Just running the standard matrix multiplication from the CUDA samples, I get about 1400 GFlop/s on all of the cards except GPU5. Here is the output:

ankaimageufo2:# CUDA_VISIBLE_DEVICES="5" ./matrixMulCUBLAS
MatrixA(320,640), MatrixB(320,640), MatrixC(320,640)
Performance= 489.91 GFlop/s, Time= 0.268 msec, Size= 131072000 Ops
ankaimageufo2:# CUDA_VISIBLE_DEVICES="4" ./matrixMulCUBLAS
MatrixA(320,640), MatrixB(320,640), MatrixC(320,640)
Performance= 1379.06 GFlop/s, Time= 0.095 msec, Size= 131072000 Ops

There is no overheating problem; nvidia-smi reports temperatures below 40C for all cards. Timo has also noticed that the cooling system in the box behaves a bit strangely: the red "!" (exclamation mark) LED on the box was blinking, and the coolers kept spinning up and then down again at about 3-5 second intervals.

regards,
Suren
comment:22 Changed 9 years ago by
Cc: | Timo Dritschler removed |
---|---|
Resolution: | fixed |
Status: | closed → reopened |
comment:23 Changed 9 years ago by
It seems there are still problems:
Tue Mar 15 15:30:10 2016
+------------------------------------------------------+
| NVIDIA-SMI 340.96     Driver Version: 346.72         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TITAN   On   | 0000:02:00.0     Off |                  N/A |
| 30%   25C    P8    14W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TITAN   On   | 0000:03:00.0     Off |                  N/A |
| 30%   25C    P8    15W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  ERR!                ERR! | ERR!            ERR! |                 ERR! |
|ERR!  ERR!  ERR!   ERR! / ERR! |     15MiB /  6143MiB |     ERR!        ERR! |
+-------------------------------+----------------------+----------------------+
|   3  ERR!                ERR! | ERR!            ERR! |                 ERR! |
|ERR!  ERR!  ERR!   ERR! / ERR! |     15MiB /  6143MiB |     ERR!        ERR! |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX TITAN   On   | 0000:8E:00.0     Off |                  N/A |
| 30%   26C    P8    12W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX TITAN   On   | 0000:8F:00.0     Off |                  N/A |
| 30%   25C    P8    13W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX TITAN   On   | 0000:90:00.0     Off |                  N/A |
| 30%   26C    P8    14W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    2            ERROR: GPU is lost                                          |
|    3            ERROR: GPU is lost                                          |
+-----------------------------------------------------------------------------+
dmesg:
[181456.020607] NVRM: GPU at 0000:8a:00.0 has fallen off the bus.
[181456.020616] NVRM: GPU is on Board 0324013039203.
[181456.020637] NVRM: A GPU crash dump has been created. If possible, please run
                NVRM: nvidia-bug-report.sh as root to collect this data before
                NVRM: the NVIDIA kernel module is unloaded.
[181456.020678] NVRM: GPU at 0000:8b:00.0 has fallen off the bus.
[181456.020683] NVRM: GPU is on Board 0324013038076.
[181456.020758] pciehp 0000:87:08.0:pcie24: Card not present on Slot(8)
[181484.391098] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[181484.402618] {1}[Hardware Error]: APEI generic hardware error status
[181484.411826] {1}[Hardware Error]: severity: 2, corrected
[181484.411827] {1}[Hardware Error]: section: 0, severity: 2, corrected
[181484.411829] {1}[Hardware Error]: flags: 0x01
[181484.411831] {1}[Hardware Error]: primary
[181484.411832] {1}[Hardware Error]: fru_text: CorrectedErr
[181484.411833] {1}[Hardware Error]: section_type: PCIe error
[181484.411835] {1}[Hardware Error]: port_type: 0, PCIe end point
[181484.411835] {1}[Hardware Error]: version: 0.0
[181484.411837] {1}[Hardware Error]: command: 0xffff, status: 0xffff
[181484.411838] {1}[Hardware Error]: device_id: 0000:80:02.3
[181484.411839] {1}[Hardware Error]: slot: 0
[181484.411840] {1}[Hardware Error]: secondary_bus: 0x00
[181484.411841] {1}[Hardware Error]: vendor_id: 0xffff, device_id: 0xffff
[181484.411842] {1}[Hardware Error]: class_code: ffffff
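The NVRM messages identify the lost GPUs only by bus address and board serial; a way to map those back to nvidia-smi device indices, and to collect the dump the driver asks for, might be (standard nvidia-smi queries, not output from this server):

# Map nvidia-smi device indices to PCI bus addresses and board serial numbers.
nvidia-smi --query-gpu=index,pci.bus_id,serial --format=csv
# Collect the crash dump requested by the NVRM messages above (run as root).
nvidia-bug-report.sh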
Updated NVIDIA drivers from 340.32 to 343.22