Opened 10 years ago

Last modified 9 years ago

#229 reopened defect

UFO2 Server is crashing under load

Reported by: Suren A. Chilingaryan      Owned by:
Priority:    critical                   Milestone:
Component:   Infrastructure             Version:
Keywords:                               Cc:

Description

Just start '/root/tests-v4/run.sh' and wait for a while...
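The script itself is not attached; as a rough sketch only, a stress run of this kind amounts to loading all GPUs in parallel in an endless loop (the binary name stress_kernel below is a placeholder, not the actual test):

#!/bin/bash
# Hypothetical sketch of a GPU stress loop; 'stress_kernel' stands in for the
# real CUDA test binary invoked by run.sh.
while true; do
    for gpu in 0 1 2 3 4 5 6; do
        CUDA_VISIBLE_DEVICES=$gpu ./stress_kernel &   # load every GPU in parallel
    done
    wait   # start the next round only after all GPUs have finished
done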

Attachments (0)

Change History (23)

comment:1 Changed 10 years ago by Suren A. Chilingaryan

Updated NVIDIA drivers from 340.32 to 343.22

comment:2 Changed 10 years ago by Suren A. Chilingaryan

Still crashing, and I get nothing in the logs...

comment:3 Changed 10 years ago by Suren A. Chilingaryan

The system has also crashed when only the 2 GPUs installed directly in the server box were used.

comment:4 in reply to:  3 Changed 10 years ago by Timo Dritschler

Replying to csa:

The system has also crashed when only the 2 GPUs installed directly in the server box were used.

Did you physically disconnect all the GPUs except for those two? Or did you just disable them / not use them in software?

comment:5 Changed 10 years ago by Suren A. Chilingaryan

Nope. I am just stress testing certain subsets of GPUs. Actually, it seems GPU1 (counting from 0) is the problem. I ran the task on GPU0 for half a day and everything was fine. Currently, I'm stress testing 4 GPUs in the box and there have been no crashes so far.
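For reference, this is how a run is restricted to a subset of GPUs; the binary name is again a placeholder:

# Load only GPU0:
CUDA_VISIBLE_DEVICES=0 ./stress_kernel
# Load GPUs 2-6, leaving GPU0 and GPU1 idle:
CUDA_VISIBLE_DEVICES=2,3,4,5,6 ./stress_kernel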

comment:6 Changed 10 years ago by Suren A. Chilingaryan

OK. All GPUs except GPU1 worked fine overnight.

comment:7 Changed 10 years ago by Suren A. Chilingaryan

OK. It seems the problem really occurs only if GPU0 and GPU1 are used simultaneously. There are no crashes if GPU0 is excluded and all other resources are loaded.

There are the following complaints in the logs. However, this may be unrelated.

[74094.473877] {3}[Hardware Error]: APEI generic hardware error status
[74094.481968] {3}[Hardware Error]: severity: 2, corrected
[74094.481969] {3}[Hardware Error]: section: 0, severity: 2, corrected
[74094.481971] {3}[Hardware Error]: flags: 0x01
[74094.481975] {3}[Hardware Error]: primary
[74094.481978] {3}[Hardware Error]: fru_text: CorrectedErr
[74094.481979] {3}[Hardware Error]: section_type: PCIe error
[74094.481980] {3}[Hardware Error]: port_type: 0, PCIe end point
[74094.481981] {3}[Hardware Error]: version: 0.0
[74094.481982] {3}[Hardware Error]: command: 0xffff, status: 0xffff
[74094.481983] {3}[Hardware Error]: device_id: 0000:00:02.3
[74094.481983] {3}[Hardware Error]: slot: 0
[74094.481984] {3}[Hardware Error]: secondary_bus: 0x00
[74094.481985] {3}[Hardware Error]: vendor_id: 0xffff, device_id: 0xffff
[74094.481986] {3}[Hardware Error]: class_code: ffffff
[75485.883599] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[75485.893606] {4}[Hardware Error]: APEI generic hardware error status
[75485.901216] {4}[Hardware Error]: severity: 2, corrected
[75485.901218] {4}[Hardware Error]: section: 0, severity: 2, corrected
[75485.901219] {4}[Hardware Error]: flags: 0x01
[75485.901220] {4}[Hardware Error]: primary
[75485.901221] {4}[Hardware Error]: fru_text: CorrectedErr
[75485.901222] {4}[Hardware Error]: section_type: PCIe error
[75485.901223] {4}[Hardware Error]: port_type: 0, PCIe end point
[75485.901223] {4}[Hardware Error]: version: 0.0
[75485.901224] {4}[Hardware Error]: command: 0xffff, status: 0xffff
[75485.901224] {4}[Hardware Error]: device_id: 0000:00:02.3
[75485.901225] {4}[Hardware Error]: slot: 0
[75485.901225] {4}[Hardware Error]: secondary_bus: 0x00
[75485.901225] {4}[Hardware Error]: vendor_id: 0xffff, device_id: 0xffff
[75485.901226] {4}[Hardware Error]: class_code: ffffff

comment:8 Changed 10 years ago by Suren A. Chilingaryan

SMBIOS reports "Smbios 0x0A Bus00(DevFn18)", which is a PCI system error.
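Such events can also be cross-checked in the BMC system event log (assuming ipmitool is installed and a local BMC interface is available):

# Show the most recent entries of the BMC system event log; PCI SERR/PERR
# events reported via the platform firmware should show up here as well:
ipmitool sel elist | tail -n 20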

comment:9 Changed 10 years ago by Suren A. Chilingaryan

I have enabled PCIe error reporting in the BIOS (PERR, SERR). That's what I get continuously now:

[  359.418369] nvidia 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0200(Transmitter ID)
[  359.418370] nvidia 0000:02:00.0:   device [10de:1005] error status/mask=00001000/0000a000
[  359.418371] nvidia 0000:02:00.0:    [12] Replay Timer Timeout
[  359.431184] pcieport 0000:00:02.0: AER: Corrected error received: id=0200
[  359.431187] nvidia 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0200(Transmitter ID)
[  359.431188] nvidia 0000:02:00.0:   device [10de:1005] error status/mask=00001000/0000a000
[  359.431191] nvidia 0000:02:00.0:    [12] Replay Timer Timeout
[  359.432548] pcieport 0000:00:02.0: AER: Multiple Corrected error received: id=0010
[  359.432558] pcieport 0000:00:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0010(Receiver ID)
[  359.432559] pcieport 0000:00:02.0:   device [8086:0e04] error status/mask=00000040/00002000
[  359.432560] pcieport 0000:00:02.0:    [ 6] Bad TLP
[  359.449173] pcieport 0000:00:02.0: AER: Corrected error received: id=0010
[  359.449181] pcieport 0000:00:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0010(Transmitter ID)
[  359.449182] pcieport 0000:00:02.0:   device [8086:0e04] error status/mask=00001000/00002000
[  359.449183] pcieport 0000:00:02.0:    [12] Replay Timer Timeout
[  359.456220] pcieport 0000:00:02.0: AER: Corrected error received: id=0200
[  359.456223] nvidia 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0200(Transmitter ID)
[  359.456224] nvidia 0000:02:00.0:   device [10de:1005] error status/mask=00001000/0000a000
[  359.456225] nvidia 0000:02:00.0:    [12] Replay Timer Timeout
[  360.685258] pcieport 0000:00:03.0: AER: Corrected error received: id=0300
[  360.685267] nvidia 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0300(Transmitter ID)
[  360.685268] nvidia 0000:03:00.0:   device [10de:1005] error status/mask=00001000/0000a000
[  360.685269] nvidia 0000:03:00.0:    [12] Replay Timer Timeout
[  361.321229] pcieport 0000:00:03.0: AER: Corrected error received: id=0300
[  361.321233] nvidia 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0300(Transmitter ID)
[  361.321234] nvidia 0000:03:00.0:   device [10de:1005] error status/mask=00001000/0000a000
[  361.321234] nvidia 0000:03:00.0:    [12] Replay Timer Timeout
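For completeness, the latched AER state of the GPU endpoint and of its root port can also be read directly from the extended capability registers (needs root for the full dump):

# Dump the Advanced Error Reporting capability of the GPU endpoint and of the
# root port it hangs off; UESta/CESta show which error bits are currently set:
lspci -s 02:00.0 -vvv | grep -A 6 'Advanced Error Reporting'
lspci -s 00:02.0 -vvv | grep -A 6 'Advanced Error Reporting'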

comment:10 Changed 10 years ago by Suren A. Chilingaryan

[  533.758487] pcieport 0000:8c:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=8c00(Receiver ID)
[  533.778803] pcieport 0000:8c:00.0:   device [10b5:8749] error status/mask=00000001/0000e000
[  533.797718] pcieport 0000:8c:00.0:    [ 0] Receiver Error         (First)
[  536.887138] pcieport 0000:80:02.0: AER: Corrected error received: id=8c00
[  536.903296] pcieport 0000:8c:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=8c00(Receiver ID)
[  536.923931] pcieport 0000:8c:00.0:   device [10b5:8749] error status/mask=00000001/0000e000
[  536.942301] pcieport 0000:8c:00.0:    [ 0] Receiver Error         (First)
[  543.582006] pcieport 0000:80:02.0: AER: Corrected error received: id=8c00
[  543.598644] pcieport 0000:8c:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=8c00(Receiver ID)
[  543.619015] pcieport 0000:8c:00.0:   device [10b5:8749] error status/mask=00000001/0000e000
[  543.637437] pcieport 0000:8c:00.0:    [ 0] Receiver Error         (First)
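The port at 0000:8c:00.0 reporting the receiver errors is the [10b5:8749] device, presumably a PLX PCIe switch port of the external GPUBox; it can be identified with:

# Show vendor/device of the port that reports the receiver errors:
lspci -nn -s 8c:00.0
# List all PLX devices (vendor ID 10b5) in the system:
lspci -nn -d 10b5: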

comment:11 Changed 10 years ago by Suren A. Chilingaryan

  • Reducing the speed to PCIe gen2 seems to prevent the problem (the negotiated link speed can be verified as shown below).
  • Disabling I/OAT does not help.
  • Playing with the Ageing Timer Rollover does not help.
Last edited 10 years ago by Suren A. Chilingaryan
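A quick way to confirm which rate the links of the two suspect GPUs actually negotiated after the BIOS change (needs root for the full capability output):

# LnkCap shows the maximum supported speed, LnkSta the currently negotiated
# one (5 GT/s = gen2, 8 GT/s = gen3):
lspci -s 02:00.0 -vv | grep -E 'LnkCap:|LnkSta:'
lspci -s 03:00.0 -vv | grep -E 'LnkCap:|LnkSta:'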

comment:12 Changed 10 years ago by Suren A. Chilingaryan

  • NVreg_EnablePCIeGen3=0 seems to be ignored by the recent driver
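For the record, the option was passed as an nvidia module parameter, e.g. via a modprobe.d entry like the following; whether the current driver still honours it is exactly what is in question here:

# /etc/modprobe.d/nvidia-pcie.conf (module must be reloaded afterwards)
options nvidia NVreg_EnablePCIeGen3=0

Whether the option was picked up at all can be cross-checked in /proc/driver/nvidia/params, although not every registry key is listed there in every driver version.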

comment:13 Changed 10 years ago by Suren A. Chilingaryan

  • The system is crashing even without the GPUBox connected
  • Replacing both GPU0 and GPU1 does not prevent the crashes (without the GPUBox)

comment:14 Changed 10 years ago by Suren A. Chilingaryan

  • The problem has been reported to Tooltec:
    Dear Florian,
    
    Thanks a lot for fixing the GPU-Box. It works now without problems.
    Unfortunately, we still have other issues with the system. The system is
    crashing if 2 Titan GPUs (0000:02:00.0 and 0000:03:00.0) are heavily
    loaded at the same time (the GPUs are installed in the PCIe CPU1 Slots 2 &
    4, on the edge of the board farthest away from the CPUs).
    
    It only happens if both of these 2 GPUs are loaded simultaneously. An idle
    system is stable. If either of these GPUs is loaded and the other is idle,
    the system is still stable. Actually, we can load 6 GPUs in the system,
    excluding either one of these 2, and the system works perfectly stable. It
    only crashes if both of these GPUs are loaded.
    
    The problem also persists with the GPUBox disconnected. I have also tried
    to replace the GPUs in these two slots, but the system was still crashing.
    
    With PCIe logging enabled, I got the following messages from the kernel
    at a VERY high rate until the system crashed (see dmesg.txt attached).
    
    If I enforce PCIe gen2 in the BIOS, the problem goes away or at least
    takes much longer to happen. I still get PCIe errors in the log, but at
    a significantly slower rate: about 1-2 per hour instead of 10 per second.
    
    regards,
    Suren
    

comment:15 Changed 10 years ago by Suren A. Chilingaryan

Supermicro:

Please check the following:

Test with each MMCFG setting available in the BIOS.

This MMCFG setting changes the memory addressing for PCI resources, which can fix this kind of issue.

Regretfully, it is impossible to tell which exact setting it is, so you will have to test with each setting one by one.

MMCFG setting location:

Advanced\Chipset Configuration\Northbridge\I/O Configuration\MMCFG Base: test with each setting (default is 0x8xxxx)


The system is still crashing with all possible MMCFG settings.

comment:16 Changed 10 years ago by Suren A. Chilingaryan

Supermicro:

We have found out the following:

The power supply power distributor of your server is revision 1.2.

For support of new GPU cards with a higher TDP (power usage), you require power distributor revision 1.3.

We recommend requesting an RMA for your power distributor PDB-PT747-4648 and requesting revision 1.3.

I cannot tell for certain whether this is possible or not; otherwise you will have to purchase a new PDB-PT747-4648 rev. 1.3 if the RMA request is refused.

You need to request the RMA via your supplier, who can check this directly with Supermicro RMA.

I have forwarded this to Tooltec.

comment:17 Changed 10 years ago by Suren A. Chilingaryan

Besides that, one GPU is significantly slower than the others:

ankaimageufo2:# CUDA_VISIBLE_DEVICES="5" ./matrixMulCUBLAS 
MatrixA(320,640), MatrixB(320,640), MatrixC(320,640)
Performance= 489.91 GFlop/s, Time= 0.268 msec, Size= 131072000 Ops

ankaimageufo2:# CUDA_VISIBLE_DEVICES="4" ./matrixMulCUBLAS 
MatrixA(320,640), MatrixB(320,640), MatrixC(320,640)
Performance= 1379.06 GFlop/s, Time= 0.095 msec, Size= 131072000 Ops
  • This behavior is stable; a reboot does not help.
  • Apart from GPU5, all other cards are fine.
  • The temperature of the card is fine as well.
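A way to confirm the under-clocking directly is to query the clocks and throttle state of the slow card (output sections may differ slightly between driver versions):

# Current and maximum SM/memory clocks of GPU5:
nvidia-smi -i 5 -q -d CLOCK
# Active clock throttle reasons (thermal, power cap, HW slowdown, ...):
nvidia-smi -i 5 -q -d PERFORMANCE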

comment:18 Changed 10 years ago by Suren A. Chilingaryan

  • BIOS updated to X9DRGQF5.116, still crashing

comment:19 Changed 10 years ago by Suren A. Chilingaryan

  • Just checked whether 64-bit decoding could be the cause. It is not.
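For reference, whether the GPU BARs actually end up above 4G can be seen from the lspci resource listing:

# '64-bit, prefetchable' regions mapped above 0xffffffff indicate that
# above-4G (64-bit) decoding is in effect for this GPU:
lspci -s 02:00.0 -v | grep -i 'memory at'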

comment:20 Changed 10 years ago by Suren A. Chilingaryan

New mail to Tooltec:

Dear Christian,

Thanks for your help with this matter.

Unfortunately, we still have problems with the server. There are lots of errors on the PCIe bus if PERR/SERR error reporting is enabled in the BIOS, and the system freezes when the first 2 GPUs are loaded simultaneously. From Supermicro, I got a few more suggestions to play with BIOS settings. They also provided a new unofficial BIOS to try (X9DRGQF5.116), but that makes no difference: the errors are still there and the system is still crashing. This happens independently of whether the GPUBox is connected or disconnected. However, with the external GPUBox connected, more errors are reported on the PCIe bus.

I have found another problem with one of the GPUs in the external box. Its performance is significantly lower; it seems the GPU is under-clocked for some reason.

Just running the standard matrix multiplication from the CUDA samples, I get about 1400 GFlop/s on all of the cards except GPU5. Here is the output:

ankaimageufo2:# CUDA_VISIBLE_DEVICES="5" ./matrixMulCUBLAS 
MatrixA(320,640), MatrixB(320,640), MatrixC(320,640)
Performance= 489.91 GFlop/s, Time= 0.268 msec, Size= 131072000 Ops

ankaimageufo2:# CUDA_VISIBLE_DEVICES="4" ./matrixMulCUBLAS 
MatrixA(320,640), MatrixB(320,640), MatrixC(320,640)
Performance= 1379.06 GFlop/s, Time= 0.095 msec, Size= 131072000 Ops

There is no overheating problem; nvidia-smi reports temperatures below 40°C for all cards. Timo has also noticed that the cooling system in the box behaves a bit oddly: the red "!" (exclamation mark) LED on the box was blinking, and the coolers kept spinning up and then down again at about 3-5 second intervals.

regards,
Suren

comment:21 Changed 10 years ago by Suren A. Chilingaryan

Resolution: fixed
Status: new → closed

Looks fixed.

comment:22 Changed 9 years ago by Suren A. Chilingaryan

Cc: Timo Dritschler removed
Resolution: fixed
Status: closed → reopened

comment:23 Changed 9 years ago by Suren A. Chilingaryan

It seems there are still problems:

Tue Mar 15 15:30:10 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 340.96     Driver Version: 346.72         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TITAN   On   | 0000:02:00.0     Off |                  N/A |
| 30%   25C    P8    14W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TITAN   On   | 0000:03:00.0     Off |                  N/A |
| 30%   25C    P8    15W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  ERR!               ERR!  | ERR!            ERR! |                 ERR! |
|ERR!  ERR! ERR!    ERR! / ERR! |     15MiB /  6143MiB |    ERR!         ERR! |
+-------------------------------+----------------------+----------------------+
|   3  ERR!               ERR!  | ERR!            ERR! |                 ERR! |
|ERR!  ERR! ERR!    ERR! / ERR! |     15MiB /  6143MiB |    ERR!         ERR! |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX TITAN   On   | 0000:8E:00.0     Off |                  N/A |
| 30%   26C    P8    12W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX TITAN   On   | 0000:8F:00.0     Off |                  N/A |
| 30%   25C    P8    13W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX TITAN   On   | 0000:90:00.0     Off |                  N/A |
| 30%   26C    P8    14W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    2            ERROR: GPU is lost                                          |
|    3            ERROR: GPU is lost                                          |
+-----------------------------------------------------------------------------+

dmesg:

[181456.020607] NVRM: GPU at 0000:8a:00.0 has fallen off the bus.
[181456.020616] NVRM: GPU is on Board 0324013039203.
[181456.020637] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[181456.020678] NVRM: GPU at 0000:8b:00.0 has fallen off the bus.
[181456.020683] NVRM: GPU is on Board 0324013038076.
[181456.020758] pciehp 0000:87:08.0:pcie24: Card not present on Slot(8)
[181484.391098] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[181484.402618] {1}[Hardware Error]: APEI generic hardware error status
[181484.411826] {1}[Hardware Error]: severity: 2, corrected
[181484.411827] {1}[Hardware Error]: section: 0, severity: 2, corrected
[181484.411829] {1}[Hardware Error]: flags: 0x01
[181484.411831] {1}[Hardware Error]: primary
[181484.411832] {1}[Hardware Error]: fru_text: CorrectedErr
[181484.411833] {1}[Hardware Error]: section_type: PCIe error
[181484.411835] {1}[Hardware Error]: port_type: 0, PCIe end point
[181484.411835] {1}[Hardware Error]: version: 0.0
[181484.411837] {1}[Hardware Error]: command: 0xffff, status: 0xffff
[181484.411838] {1}[Hardware Error]: device_id: 0000:80:02.3
[181484.411839] {1}[Hardware Error]: slot: 0
[181484.411840] {1}[Hardware Error]: secondary_bus: 0x00
[181484.411841] {1}[Hardware Error]: vendor_id: 0xffff, device_id: 0xffff
[181484.411842] {1}[Hardware Error]: class_code: ffffff
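When GPUs have "fallen off the bus" like this, the usual data to collect (as the NVRM message itself suggests) is:

# Collect the driver crash dump referenced by the NVRM messages above:
sudo nvidia-bug-report.sh
# Check which GPUs the driver can still enumerate:
nvidia-smi -L
# Check whether the NVIDIA devices are still visible on the PCIe bus at all:
lspci -d 10de: -nn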
