Random reboots on disk load

redroom

Aug 31, 2021
Hello,
Our server reboots randomly. Here is what we have found so far.

Specs
AMD Ryzen 9 5950X
128 GB RAM
2x3.84 TB NVME

Filesystem
ZFS mirror (RAID1)
We have only one VM with a 3 TB disk (VirtIO SCSI, write back cache, discard on); the guest OS is Windows Server 2019.

ARC is limited to 64GB
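
For reference, an ARC cap like this is normally given in bytes. A minimal sketch, assuming 64 GiB is meant and that the limit is set through the `zfs_arc_max` module option in `/etc/modprobe.d/zfs.conf` (the post does not say how it was set). The function names are only illustrative:

```shell
# Convert an ARC cap in GiB to the byte count that zfs_arc_max expects.
arc_max_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print the module option line for a given cap (assumed to live in
# /etc/modprobe.d/zfs.conf; nothing is written by this function).
arc_conf_line() {
    echo "options zfs zfs_arc_max=$(arc_max_bytes "$1")"
}

arc_conf_line 64
```

The value currently in effect can be read on the host from `/sys/module/zfs/parameters/zfs_arc_max`.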

There is nothing in the log files.

We ran some stress tests and found that the host reboots only when there is load on the virtual disk. RAM and CPU stress tests have no effect. RAM memtest passed without errors (1 pass).
smartctl reports no errors. Output from some utilities is below.

Some observations during disk stress test:
1. Cache enabled (write back), discard on: the host reboots almost immediately at 150-200 MB/s of load.
2. Cache disabled, discard off: at 100-200 MB/s it runs for 2-4 minutes, then reboots. On the other hand, downloading something from a file share at 100 MB/s works fine and the download finishes without a reboot.

Writing random data with dd on the Proxmox host itself (dd if=/dev/urandom of=file bs=1M count=100000) doesn't affect the host, so the issue is possibly VM-related only.

Can someone advise how to fix this behavior?

Code:
proxmox-ve: 6.4-1 (running kernel: 5.4.128-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-5
pve-kernel-helper: 6.4-5
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.5-pve1~bpo10+1


 smartctl -a /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.128-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       MTFDHBE3T8TDF
Serial Number:                      21122E525ACE
Firmware Version:                   95420260
PCI Vendor/Subsystem ID:            0x1344
IEEE OUI Identifier:                0x00a075
Total NVM Capacity:                 3,840,755,982,336 [3.84 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          3,840,755,982,336 [3.84 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            00a075 012e525ace
Local Time is:                      Tue Aug 31 04:00:48 2021 MSK
Firmware Updates (0x17):            3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x001e):   Format Frmw_DL NS_Mngmt Self_Test
Optional NVM Commands (0x0034):     DS_Mngmt Sav/Sel_Feat Resv
Log Page Attributes (0x02):         Cmd_Eff_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     72 Celsius
Critical Comp. Temp. Threshold:     75 Celsius
Namespace 1 Features (0x08):        No_ID_Reuse

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    12.00W       -        -    0  0  0  0        0       0
 1 +    11.00W       -        -    1  1  1  1        0       0
 2 +    10.00W       -        -    2  2  2  2        0       0
 3 +     9.00W       -        -    3  3  3  3        0       0
 4 +     8.00W       -        -    4  4  4  4        0       0
 5 +     7.00W       -        -    5  5  5  5        0       0
 6 +     6.00W       -        -    6  6  6  6        0       0
 7 +     5.00W       -        -    7  7  7  7        0       0

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
 1 -    4096       0         0
 2 -     512       8         0
 3 -    4096       8         0
 4 -    4096      64         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        38 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    0%
Data Units Read:                    3,418,086 [1.75 TB]
Data Units Written:                 17,534,102 [8.97 TB]
Host Read Commands:                 131,348,990
Host Write Commands:                164,119,906
Controller Busy Time:               5,505
Power Cycles:                       8
Power On Hours:                     949
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               47 Celsius
Temperature Sensor 2:               40 Celsius

Error Information (NVMe Log 0x01, 16 of 63 entries)
No Errors Logged

smartctl -a /dev/nvme1n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.128-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       MTFDHBE3T8TDF
Serial Number:                      21122E525AD0
Firmware Version:                   95420260
PCI Vendor/Subsystem ID:            0x1344
IEEE OUI Identifier:                0x00a075
Total NVM Capacity:                 3,840,755,982,336 [3.84 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          3,840,755,982,336 [3.84 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            00a075 012e525ad0
Local Time is:                      Tue Aug 31 04:01:21 2021 MSK
Firmware Updates (0x17):            3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x001e):   Format Frmw_DL NS_Mngmt Self_Test
Optional NVM Commands (0x0034):     DS_Mngmt Sav/Sel_Feat Resv
Log Page Attributes (0x02):         Cmd_Eff_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     72 Celsius
Critical Comp. Temp. Threshold:     75 Celsius
Namespace 1 Features (0x08):        No_ID_Reuse

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    12.00W       -        -    0  0  0  0        0       0
 1 +    11.00W       -        -    1  1  1  1        0       0
 2 +    10.00W       -        -    2  2  2  2        0       0
 3 +     9.00W       -        -    3  3  3  3        0       0
 4 +     8.00W       -        -    4  4  4  4        0       0
 5 +     7.00W       -        -    5  5  5  5        0       0
 6 +     6.00W       -        -    6  6  6  6        0       0
 7 +     5.00W       -        -    7  7  7  7        0       0

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
 1 -    4096       0         0
 2 -     512       8         0
 3 -    4096       8         0
 4 -    4096      64         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        37 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    0%
Data Units Read:                    3,417,186 [1.74 TB]
Data Units Written:                 17,535,651 [8.97 TB]
Host Read Commands:                 131,925,516
Host Write Commands:                164,458,041
Controller Busy Time:               5,506
Power Cycles:                       8
Power On Hours:                     945
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               47 Celsius
Temperature Sensor 2:               39 Celsius

Error Information (NVMe Log 0x01, 16 of 63 entries)
No Errors Logged
 
I would check the journal/dmesg when such an error occurs...
Also, if the computer spontaneously 'hard resets', it is most likely faulty hardware, be it SSD, mainboard, CPU or PSU.
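
A sketch of what that check could look like on a systemd host. Logs from the boot that crashed are only available if the journal is persistent, which is an assumption here (it needs `/var/log/journal` to exist). The function names are made up for illustration:

```shell
# Succeeds if a persistent journal directory exists (default path assumed).
journal_is_persistent() {
    [ -d "${1:-/var/log/journal}" ]
}

# Show the tail of the kernel log from the previous boot, i.e. the boot
# that ended in the unexpected reset. $1 = number of lines (default 50).
prev_boot_kernel_log() {
    if journal_is_persistent; then
        journalctl -k -b -1 --no-pager | tail -n "${1:-50}"
    else
        echo "journal is not persistent; no logs from the previous boot" >&2
        return 1
    fi
}
```

Running `prev_boot_kernel_log 100` right after an unexpected reboot would show the last 100 kernel messages before the reset. A hard reset often leaves nothing behind at all, which in itself points towards hardware rather than software.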
 
Checked and didn't notice anything suspicious; it only contains data from the initial boot.
File attached

The problem is that it only resets in the scenario described, in no other cases (the host survives heavy disk load generated inside the Proxmox server itself).
 

I assume this is a Hetzner AX101; the specs look familiar. Get in touch with tech support, have them check the PCIe BIOS settings. You'll get lower throughput on the drives (in my case, 2.7GB/s instead of around 3.5GB/s), but the drives will be stable.

Also, possibly do a CPU stress test (I had to have one of my servers swapped out for a faulty CPU). To do that, you can use s-tui, for example.
 
Yep, it's an AX101.
You mean changing the GEN setting for PCIe? It's currently set to auto.

The CPU stress test ran without issues inside the guest VM.
 
OK, I meant doing the stress test on the host, so that all actual CPU cores get stress-tested.
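
One way to load every core on the host is `stress-ng` rather than the interactive s-tui. That swap, the 10-minute default, and the helper name are assumptions for illustration. The helper only prints the command, so it can be reviewed before running:

```shell
# Print a stress-ng command that loads every logical CPU on the host.
# $1 = duration in minutes (default 10), $2 = worker count (default: nproc).
host_stress_cmd() {
    minutes="${1:-10}"
    workers="${2:-$(nproc)}"
    echo "stress-ng --cpu $workers --timeout ${minutes}m --metrics-brief"
}

host_stress_cmd
```

On a 5950X with SMT enabled, `nproc` on the host should report 32 logical CPUs. The printed command is meant to be run on the Proxmox host itself, not inside the guest.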

I am not sure which PCIe settings support changed, but I expect it was the GEN setting (I didn't ask for a KVM to check myself). Try setting it to a fixed value and, if in doubt, set it down one generation and test again to see if that helps.
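
To see what the drives actually negotiated before and after a BIOS change, the link speed can be read from sysfs. A minimal sketch, assuming the standard PCI sysfs attributes are exposed for the NVMe controllers; `pcie_link_info` is a hypothetical helper name:

```shell
# Print negotiated vs. maximum PCIe link speed and width for a PCI device.
# $1 = the device's sysfs directory, e.g. /sys/class/nvme/nvme0/device
pcie_link_info() {
    for attr in current_link_speed max_link_speed current_link_width max_link_width; do
        if [ -r "$1/$attr" ]; then
            echo "$attr: $(cat "$1/$attr")"
        fi
    done
}

# Report every NVMe controller the kernel knows about.
for ctrl in /sys/class/nvme/nvme*; do
    [ -d "$ctrl/device" ] || continue
    echo "== ${ctrl##*/}"
    pcie_link_info "$ctrl/device"
done
```

A speed of 8 GT/s corresponds to Gen3 and 16 GT/s to Gen4; the exact string format varies by kernel version. If the current value sits below the maximum after forcing a lower generation, the BIOS setting took effect.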
 
I'll try, thanks for the suggestion.
 
If it doesn't help, try running

# hdparm -tT --direct /dev/nvme*n1

and/or do dd writes to all the drives. I've had some drives on some servers be slow, and not all of it was down to PCIe settings; I actually had a faulty drive, too. All in all, make sure you check the hardware thoroughly with these boxes.

If a

dd if=/dev/zero of=/dev/nvme<num>n1 bs=1M status=progress

shows very obviously broken speeds (in my case, it went down to 50MB/s while on OK drives it was 2.5 to 3.5 GB/s), that's something to contact tech support about and have them check it.

WARNING: this overwrites the drive, of course, but until your hardware issues are sorted, it's best to reinstall anyway. Who knows what corrupt data might have been written otherwise.
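
If wiping a drive is not an option yet, a read-only pass can give a first hint. This is a swapped-in variant, not the write test described above: it only measures reads, so it may miss a problem that shows up only on writes. The function name and default size are illustrative:

```shell
# Read $2 MiB (default 4096) from a device or file and print dd's summary line.
# $3 = optional extra dd operand, e.g. iflag=direct to bypass the page cache.
# Nothing is written to the source.
read_speed() {
    LC_ALL=C dd if="$1" of=/dev/null bs=1M count="${2:-4096}" ${3:+"$3"} 2>&1 | tail -n 1
}
```

As root, `read_speed /dev/nvme0n1 4096 iflag=direct` reads 4 GiB from the first drive and prints the throughput line; repeating it for each drive makes an outlier easy to spot.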
 
update

Changing settings in the BIOS didn't help.
Replacing the disks didn't help.

After we replaced the whole server, the issue was resolved.
That was definitely a hardware issue.
 
I have the same problem as you.
What do you mean by replacing the whole server?
Replacing the motherboard? Or the CPU?
 
