Opened 10 years ago

Last modified 9 years ago

#229 reopened defect

UFO2 Server is crashing under load

Reported by: Suren A. Chilingaryan      Owned by:
Priority:    critical                   Milestone:
Component:   Infrastructure             Version:
Keywords:                               Cc:

Description

Just start '/root/tests-v4/run.sh' and wait for a while...
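The script itself is not attached; as a rough sketch only, a stress run of this kind amounts to loading all GPUs in parallel in an endless loop (the binary name stress_kernel below is a placeholder, not the actual test):

#!/bin/bash
# Hypothetical sketch of a GPU stress loop; 'stress_kernel' stands in for the
# real CUDA test binary invoked by run.sh.
while true; do
    for gpu in 0 1 2 3 4 5 6; do
        CUDA_VISIBLE_DEVICES=$gpu ./stress_kernel &   # load every GPU in parallel
    done
    wait   # start the next round only after all GPUs have finished
done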

Attachments (0)

Change History (23)

comment:1 Changed 10 years ago by Suren A. Chilingaryan

Updated NVIDIA drivers from 340.32 to 343.22

comment:2 Changed 10 years ago by Suren A. Chilingaryan

Still crashing, and I get nothing in the logs...

comment:3 Changed 10 years ago by Suren A. Chilingaryan

The system has also crashed when only the 2 GPUs installed directly in the server box were used.

comment:4 in reply to:  3 Changed 10 years ago by Timo Dritschler

Replying to csa:

The system has also crashed when only the 2 GPUs installed directly in the server box were used.

Did you physically disconnect all the GPUs except for those two? Or did you just disable them / not use them in software?

comment:5 Changed 10 years ago by Suren A. Chilingaryan

Nope. I am just stress testing certain subsets of GPUs. Actually, it seems GPU1 (counting from 0) is the problem. I ran the task on GPU0 for half a day and everything was fine. Currently, I'm stress testing 4 GPUs in the box and there have been no crashes so far.
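For reference, this is how a run is restricted to a subset of GPUs; the binary name is again a placeholder:

# Load only GPU0:
CUDA_VISIBLE_DEVICES=0 ./stress_kernel
# Load GPUs 2-6, leaving GPU0 and GPU1 idle:
CUDA_VISIBLE_DEVICES=2,3,4,5,6 ./stress_kernel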

comment:6 Changed 10 years ago by Suren A. Chilingaryan

OK. All GPUs except GPU1 worked fine overnight.

comment:7 Changed 10 years ago by Suren A. Chilingaryan

OK. It seems the problem really occurs only if GPU0 and GPU1 are used simultaneously. There are no crashes if GPU0 is excluded and all other resources are loaded.

There are the following complaints in the logs. However, this may be unrelated.

[74094.473877] {3}[Hardware Error]: APEI generic hardware error status
[74094.481968] {3}[Hardware Error]: severity: 2, corrected
[74094.481969] {3}[Hardware Error]: section: 0, severity: 2, corrected
[74094.481971] {3}[Hardware Error]: flags: 0x01
[74094.481975] {3}[Hardware Error]: primary
[74094.481978] {3}[Hardware Error]: fru_text: CorrectedErr
[74094.481979] {3}[Hardware Error]: section_type: PCIe error
[74094.481980] {3}[Hardware Error]: port_type: 0, PCIe end point
[74094.481981] {3}[Hardware Error]: version: 0.0
[74094.481982] {3}[Hardware Error]: command: 0xffff, status: 0xffff
[74094.481983] {3}[Hardware Error]: device_id: 0000:00:02.3
[74094.481983] {3}[Hardware Error]: slot: 0
[74094.481984] {3}[Hardware Error]: secondary_bus: 0x00
[74094.481985] {3}[Hardware Error]: vendor_id: 0xffff, device_id: 0xffff
[74094.481986] {3}[Hardware Error]: class_code: ffffff
[75485.883599] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[75485.893606] {4}[Hardware Error]: APEI generic hardware error status
[75485.901216] {4}[Hardware Error]: severity: 2, corrected
[75485.901218] {4}[Hardware Error]: section: 0, severity: 2, corrected
[75485.901219] {4}[Hardware Error]: flags: 0x01
[75485.901220] {4}[Hardware Error]: primary
[75485.901221] {4}[Hardware Error]: fru_text: CorrectedErr
[75485.901222] {4}[Hardware Error]: section_type: PCIe error
[75485.901223] {4}[Hardware Error]: port_type: 0, PCIe end point
[75485.901223] {4}[Hardware Error]: version: 0.0
[75485.901224] {4}[Hardware Error]: command: 0xffff, status: 0xffff
[75485.901224] {4}[Hardware Error]: device_id: 0000:00:02.3
[75485.901225] {4}[Hardware Error]: slot: 0
[75485.901225] {4}[Hardware Error]: secondary_bus: 0x00
[75485.901225] {4}[Hardware Error]: vendor_id: 0xffff, device_id: 0xffff
[75485.901226] {4}[Hardware Error]: class_code: ffffff

comment:8 Changed 10 years ago by Suren A. Chilingaryan

SMBIOS reports "Smbios 0x0A Bus00(DevFn18)", which is a PCI system error.
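Such events can also be cross-checked in the BMC system event log (assuming ipmitool is installed and a local BMC interface is available):

# Show the most recent entries of the BMC system event log; PCI SERR/PERR
# events reported via the platform firmware should show up here as well:
ipmitool sel elist | tail -n 20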

comment:9 Changed 10 years ago by Suren A. Chilingaryan

I have enabled PCIe error reporting in the BIOS (PERR, SERR). That's what I get continuously now:

[  359.418369] nvidia 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0200(Transmitter ID)
[  359.418370] nvidia 0000:02:00.0:   device [10de:1005] error status/mask=00001000/0000a000
[  359.418371] nvidia 0000:02:00.0:    [12] Replay Timer Timeout
[  359.431184] pcieport 0000:00:02.0: AER: Corrected error received: id=0200
[  359.431187] nvidia 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0200(Transmitter ID)
[  359.431188] nvidia 0000:02:00.0:   device [10de:1005] error status/mask=00001000/0000a000
[  359.431191] nvidia 0000:02:00.0:    [12] Replay Timer Timeout
[  359.432548] pcieport 0000:00:02.0: AER: Multiple Corrected error received: id=0010
[  359.432558] pcieport 0000:00:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0010(Receiver ID)
[  359.432559] pcieport 0000:00:02.0:   device [8086:0e04] error status/mask=00000040/00002000
[  359.432560] pcieport 0000:00:02.0:    [ 6] Bad TLP
[  359.449173] pcieport 0000:00:02.0: AER: Corrected error received: id=0010
[  359.449181] pcieport 0000:00:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0010(Transmitter ID)
[  359.449182] pcieport 0000:00:02.0:   device [8086:0e04] error status/mask=00001000/00002000
[  359.449183] pcieport 0000:00:02.0:    [12] Replay Timer Timeout
[  359.456220] pcieport 0000:00:02.0: AER: Corrected error received: id=0200
[  359.456223] nvidia 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0200(Transmitter ID)
[  359.456224] nvidia 0000:02:00.0:   device [10de:1005] error status/mask=00001000/0000a000
[  359.456225] nvidia 0000:02:00.0:    [12] Replay Timer Timeout
[  360.685258] pcieport 0000:00:03.0: AER: Corrected error received: id=0300
[  360.685267] nvidia 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0300(Transmitter ID)
[  360.685268] nvidia 0000:03:00.0:   device [10de:1005] error status/mask=00001000/0000a000
[  360.685269] nvidia 0000:03:00.0:    [12] Replay Timer Timeout
[  361.321229] pcieport 0000:00:03.0: AER: Corrected error received: id=0300
[  361.321233] nvidia 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0300(Transmitter ID)
[  361.321234] nvidia 0000:03:00.0:   device [10de:1005] error status/mask=00001000/0000a000
[  361.321234] nvidia 0000:03:00.0:    [12] Replay Timer Timeout
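For completeness, the latched AER state of the GPU endpoint and of its root port can also be read directly from the extended capability registers (needs root for the full dump):

# Dump the Advanced Error Reporting capability of the GPU endpoint and of the
# root port it hangs off; UESta/CESta show which error bits are currently set:
lspci -s 02:00.0 -vvv | grep -A 6 'Advanced Error Reporting'
lspci -s 00:02.0 -vvv | grep -A 6 'Advanced Error Reporting'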

comment:10 Changed 10 years ago by Suren A. Chilingaryan

[  533.758487] pcieport 0000:8c:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=8c00(Receiver ID)
[  533.778803] pcieport 0000:8c:00.0:   device [10b5:8749] error status/mask=00000001/0000e000
[  533.797718] pcieport 0000:8c:00.0:    [ 0] Receiver Error         (First)
[  536.887138] pcieport 0000:80:02.0: AER: Corrected error received: id=8c00
[  536.903296] pcieport 0000:8c:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=8c00(Receiver ID)
[  536.923931] pcieport 0000:8c:00.0:   device [10b5:8749] error status/mask=00000001/0000e000
[  536.942301] pcieport 0000:8c:00.0:    [ 0] Receiver Error         (First)
[  543.582006] pcieport 0000:80:02.0: AER: Corrected error received: id=8c00
[  543.598644] pcieport 0000:8c:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=8c00(Receiver ID)
[  543.619015] pcieport 0000:8c:00.0:   device [10b5:8749] error status/mask=00000001/0000e000
[  543.637437] pcieport 0000:8c:00.0:    [ 0] Receiver Error         (First)
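The port at 0000:8c:00.0 reporting the receiver errors is the [10b5:8749] device, presumably a PLX PCIe switch port of the external GPUBox; it can be identified with:

# Show vendor/device of the port that reports the receiver errors:
lspci -nn -s 8c:00.0
# List all PLX devices (vendor ID 10b5) in the system:
lspci -nn -d 10b5: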

comment:11 Changed 10 years ago by Suren A. Chilingaryan

  • Reducing the speed to PCIe gen2 seems to prevent the problem (the negotiated link speed can be verified as shown below).
  • Disabling I/OAT does not help.
  • Playing with the Ageing Timer Rollover does not help.
Last edited 10 years ago by Suren A. Chilingaryan
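A quick way to confirm which rate the links of the two suspect GPUs actually negotiated after the BIOS change (needs root for the full capability output):

# LnkCap shows the maximum supported speed, LnkSta the currently negotiated
# one (5 GT/s = gen2, 8 GT/s = gen3):
lspci -s 02:00.0 -vv | grep -E 'LnkCap:|LnkSta:'
lspci -s 03:00.0 -vv | grep -E 'LnkCap:|LnkSta:'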

comment:12 Changed 10 years ago by Suren A. Chilingaryan

  • NVreg_EnablePCIeGen3=0 seems to be ignored by the recent driver
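For the record, the option was passed as an nvidia module parameter, e.g. via a modprobe.d entry like the following; whether the current driver still honours it is exactly what is in question here:

# /etc/modprobe.d/nvidia-pcie.conf (module must be reloaded afterwards)
options nvidia NVreg_EnablePCIeGen3=0

Whether the option was picked up at all can be cross-checked in /proc/driver/nvidia/params, although not every registry key is listed there in every driver version.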

comment:13 Changed 10 years ago by Suren A. Chilingaryan

  • The system is crashing even without the GPUBox connected
  • Replacing both GPU0 and GPU1 does not prevent the crashes (without the GPUBox)

comment:14 Changed 10 years ago by Suren A. Chilingaryan

  • The problem has been reported to Tooltec:
    Dear Florian,
    
    Thanks a lot for fixing the GPU-Box. It works now without problems.
    Unfortunately, we still have other issues with the system. The system is
    crashing if 2 Titan GPUs (0000:02:00.0 and 0000:03:00.0) are heavily
    loaded at the same time (the GPUs are installed in the PCIe CPU1 Slots 2 &
    4, on the edge of the board farthest away from the CPUs).
    
    It only happens if both of these 2 GPUs are loaded simultaneously. An idle
    system is stable. If either of these GPUs is loaded and the other is idle,
    the system is still stable. Actually, we can load 6 GPUs in the system,
    excluding either one of these 2, and the system works perfectly stable. It
    only crashes if both of these GPUs are loaded.
    
    The problem also persists with the GPUBox disconnected. I have also tried
    to replace the GPUs in these two slots, but the system was still crashing.
    
    With PCIe logging enabled, I got the following messages from the kernel
    at a VERY high rate until the system crashed (see dmesg.txt attached).
    
    If I enforce PCIe gen2 in the BIOS, the problem goes away or at least
    takes much longer to happen. I still get PCIe errors in the log, but at
    a significantly slower rate: about 1-2 per hour instead of 10 per second.
    
    regards,
    Suren
    

comment:15 Changed 10 years ago by Suren A. Chilingaryan

Supermicro:

Please check the following:

Test with each MMCFG setting available in the BIOS.

This MMCFG setting changes the memory addressing for PCI resources, which can fix this kind of issue.

Regretfully, it is impossible to tell which exact setting it is, so you will have to test with each setting one by one.

MMCFG setting location:

Advanced\Chipset Configuration\Northbridge\I/O Configuration\MMCFG Base: test with each setting (default is 0x8xxxx)


The system is still crashing with all possible MMCFG settings.

comment:16 Changed 10 years ago by Suren A. Chilingaryan

Supermicro:

We have found out the following:

The power supply power distributor of your server is revision 1.2.

For support of new GPU cards with a higher TDP (power usage), you require power distributor revision 1.3.

We recommend requesting an RMA for your power distributor PDB-PT747-4648 and requesting revision 1.3.

I cannot tell for certain whether this is possible or not; otherwise you will have to purchase a new PDB-PT747-4648 rev. 1.3 if the RMA request is refused.

You need to request the RMA via your supplier, who can check this directly with Supermicro RMA.

I have forwarded this to Tooltec.

comment:17 Changed 10 years ago by Suren A. Chilingaryan

Besides that, one GPU is significantly slower than the others:

ankaimageufo2:# CUDA_VISIBLE_DEVICES="5" ./matrixMulCUBLAS 
MatrixA(320,640), MatrixB(320,640), MatrixC(320,640)
Performance= 489.91 GFlop/s, Time= 0.268 msec, Size= 131072000 Ops

ankaimageufo2:# CUDA_VISIBLE_DEVICES="4" ./matrixMulCUBLAS 
MatrixA(320,640), MatrixB(320,640), MatrixC(320,640)
Performance= 1379.06 GFlop/s, Time= 0.095 msec, Size= 131072000 Ops
  • This behavior is stable; a reboot does not help.
  • Apart from GPU5, all other cards are fine.
  • The temperature of the card is fine as well.
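A way to confirm the under-clocking directly is to query the clocks and throttle state of the slow card (output sections may differ slightly between driver versions):

# Current and maximum SM/memory clocks of GPU5:
nvidia-smi -i 5 -q -d CLOCK
# Active clock throttle reasons (thermal, power cap, HW slowdown, ...):
nvidia-smi -i 5 -q -d PERFORMANCE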

comment:18 Changed 10 years ago by Suren A. Chilingaryan

  • BIOS updated to X9DRGQF5.116, still crashing

comment:19 Changed 10 years ago by Suren A. Chilingaryan

  • Just checked whether 64-bit decoding could be the cause. It is not.
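For reference, whether the GPU BARs actually end up above 4G can be seen from the lspci resource listing:

# '64-bit, prefetchable' regions mapped above 0xffffffff indicate that
# above-4G (64-bit) decoding is in effect for this GPU:
lspci -s 02:00.0 -v | grep -i 'memory at'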

comment:20 Changed 10 years ago by Suren A. Chilingaryan

New mail to Tooltec:

Dear Christian,

Thanks for your help with this matter.

Unfortunately, we still have problems with the server. There are lots of errors on the PCIe bus if PERR/SERR error reporting is enabled in the BIOS, and the system freezes when the first 2 GPUs are loaded simultaneously. From Supermicro, I got a few more suggestions to play with BIOS settings. They also provided a new unofficial BIOS to try (X9DRGQF5.116), but that makes no difference: the errors are still there and the system is still crashing. This happens independently of whether the GPUBox is connected or disconnected. However, with the external GPUBox connected, more errors are reported on the PCIe bus.

I have found another problem with one of the GPUs in the external box. Its performance is significantly lower; it seems the GPU is under-clocked for some reason.

Just running the standard matrix multiplication from the CUDA samples, I get about 1400 GFlop/s on all of the cards except GPU5. Here is the output:

ankaimageufo2:# CUDA_VISIBLE_DEVICES="5" ./matrixMulCUBLAS 
MatrixA(320,640), MatrixB(320,640), MatrixC(320,640)
Performance= 489.91 GFlop/s, Time= 0.268 msec, Size= 131072000 Ops

ankaimageufo2:# CUDA_VISIBLE_DEVICES="4" ./matrixMulCUBLAS 
MatrixA(320,640), MatrixB(320,640), MatrixC(320,640)
Performance= 1379.06 GFlop/s, Time= 0.095 msec, Size= 131072000 Ops

There is no overheating problem; nvidia-smi reports temperatures below 40°C for all cards. Timo has also noticed that the cooling system in the box behaves a bit oddly: the red "!" (exclamation mark) LED on the box was blinking, and the coolers kept spinning up and then down again at about 3-5 second intervals.

regards,
Suren

comment:21 Changed 10 years ago by Suren A. Chilingaryan

Resolution: fixed
Status: new → closed

Looks fixed.

comment:22 Changed 9 years ago by Suren A. Chilingaryan

Cc: Timo Dritschler removed
Resolution: fixed
Status: closed → reopened

comment:23 Changed 9 years ago by Suren A. Chilingaryan

It seems there are still problems:

Tue Mar 15 15:30:10 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 340.96     Driver Version: 346.72         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TITAN   On   | 0000:02:00.0     Off |                  N/A |
| 30%   25C    P8    14W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TITAN   On   | 0000:03:00.0     Off |                  N/A |
| 30%   25C    P8    15W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  ERR!               ERR!  | ERR!            ERR! |                 ERR! |
|ERR!  ERR! ERR!    ERR! / ERR! |     15MiB /  6143MiB |    ERR!         ERR! |
+-------------------------------+----------------------+----------------------+
|   3  ERR!               ERR!  | ERR!            ERR! |                 ERR! |
|ERR!  ERR! ERR!    ERR! / ERR! |     15MiB /  6143MiB |    ERR!         ERR! |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX TITAN   On   | 0000:8E:00.0     Off |                  N/A |
| 30%   26C    P8    12W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX TITAN   On   | 0000:8F:00.0     Off |                  N/A |
| 30%   25C    P8    13W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX TITAN   On   | 0000:90:00.0     Off |                  N/A |
| 30%   26C    P8    14W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    2            ERROR: GPU is lost                                          |
|    3            ERROR: GPU is lost                                          |
+-----------------------------------------------------------------------------+

dmesg:

[181456.020607] NVRM: GPU at 0000:8a:00.0 has fallen off the bus.
[181456.020616] NVRM: GPU is on Board 0324013039203.
[181456.020637] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[181456.020678] NVRM: GPU at 0000:8b:00.0 has fallen off the bus.
[181456.020683] NVRM: GPU is on Board 0324013038076.
[181456.020758] pciehp 0000:87:08.0:pcie24: Card not present on Slot(8)
[181484.391098] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[181484.402618] {1}[Hardware Error]: APEI generic hardware error status
[181484.411826] {1}[Hardware Error]: severity: 2, corrected
[181484.411827] {1}[Hardware Error]: section: 0, severity: 2, corrected
[181484.411829] {1}[Hardware Error]: flags: 0x01
[181484.411831] {1}[Hardware Error]: primary
[181484.411832] {1}[Hardware Error]: fru_text: CorrectedErr
[181484.411833] {1}[Hardware Error]: section_type: PCIe error
[181484.411835] {1}[Hardware Error]: port_type: 0, PCIe end point
[181484.411835] {1}[Hardware Error]: version: 0.0
[181484.411837] {1}[Hardware Error]: command: 0xffff, status: 0xffff
[181484.411838] {1}[Hardware Error]: device_id: 0000:80:02.3
[181484.411839] {1}[Hardware Error]: slot: 0
[181484.411840] {1}[Hardware Error]: secondary_bus: 0x00
[181484.411841] {1}[Hardware Error]: vendor_id: 0xffff, device_id: 0xffff
[181484.411842] {1}[Hardware Error]: class_code: ffffff
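When GPUs have "fallen off the bus" like this, the usual data to collect (as the NVRM message itself suggests) is:

# Collect the driver crash dump referenced by the NVRM messages above:
sudo nvidia-bug-report.sh
# Check which GPUs the driver can still enumerate:
nvidia-smi -L
# Check whether the NVIDIA devices are still visible on the PCIe bus at all:
lspci -d 10de: -nn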
