Hello,
We are experiencing the following problem on our server: it reboots randomly. Here is what we have found so far.
Specs
AMD Ryzen 9 5950X
128 GB RAM
2x3.84 TB NVME
Filesystem
ZFS in raid1
We have only one VM, with a 3 TB disk (VirtIO SCSI, write-back cache, discard on), running Windows Server 2019.
ARC is limited to 64GB
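For reference, a limit like this is usually expressed in bytes through the zfs_arc_max module parameter. The sketch below assumes that "64GB" means 64 GiB and that the option lives in /etc/modprobe.d/zfs.conf, which is the common location but not confirmed for this host. It only prints the line and changes nothing.

```shell
#!/bin/sh
# Assumption: the 64 GB ARC limit is 64 GiB, expressed in bytes.
arc_gib=64
arc_bytes=$((arc_gib * 1024 * 1024 * 1024))

# Print the modprobe option line (commonly placed in /etc/modprobe.d/zfs.conf).
echo "options zfs zfs_arc_max=${arc_bytes}"

# The value currently in effect can be read on the host with:
#   cat /sys/module/zfs/parameters/zfs_arc_max
```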
There is nothing in the log files.
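One thing that affects how to read the empty logs: with journald's default Storage=auto setting, messages from before a reboot are kept only if /var/log/journal exists. Whether this host keeps a persistent journal is an assumption that still needs verifying. A minimal sketch of the check, with hypothetical helper names:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a persistent journal directory exists.
# With journald's default Storage=auto, logs survive a reboot only in that case.
journal_is_persistent() {
    [ -d "${1:-/var/log/journal}" ]
}

# Hypothetical helper: list known boots, then show the kernel log
# of the boot before the current one. Not called automatically.
show_previous_boot() {
    journalctl --list-boots
    journalctl -k -b -1 --no-pager
}

if journal_is_persistent; then
    echo "persistent journal found"
else
    echo "no persistent journal; pre-reboot messages are lost"
fi
```

If the journal is persistent, calling show_previous_boot right after an unexpected reboot would show the last kernel messages that reached disk before the host went down.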
We ran some stress tests and found that the host reboots only when there is load on the virtual disk. RAM and CPU stress tests have no effect. Memtest passed without errors (1 pass).
smartctl reports no errors. Output from some utilities is below.
Observations during the disk stress test:
1. Cache enabled (write back), discard on: the host reboots almost immediately at a load of 150-200 MB/s.
2. Cache disabled, discard off: at 100-200 MB/s it works for 2-4 minutes and then reboots. However, downloading something from a file share at 100 MB/s works fine, and the download finishes without a reboot.
Writing random data with dd on the Proxmox host itself (dd if=/dev/urandom of=file bs=1M count=100000) does not affect the host, so the issue is possibly VM-related only.
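One caveat with the dd run above is that it is a buffered write that gets flushed asynchronously, so it may not stress the disks the way the VM does. A minimal sketch of a variant that forces the data to disk before dd exits, using the standard GNU dd flags conv=fsync and iflag=fullblock. The write_test helper and its arguments are hypothetical, and this is only a closer approximation, not a reproduction of the VM's I/O pattern.

```shell
#!/bin/sh
# Hypothetical helper: write COUNT MiB of random data to TARGET
# and fsync the file before dd returns.
#   $1 = target file path
#   $2 = number of 1 MiB blocks to write
write_test() {
    target="$1"
    count_mib="$2"
    # iflag=fullblock avoids short reads from /dev/urandom;
    # conv=fsync flushes file data and metadata at the end.
    dd if=/dev/urandom of="$target" bs=1M count="$count_mib" \
        iflag=fullblock conv=fsync 2>/dev/null
}
```

For example, write_test ./file 100000 would repeat the original 100000 MiB run with a final flush; the target path here is a placeholder.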
Can someone advise how to fix this behavior?
Code:
proxmox-ve: 6.4-1 (running kernel: 5.4.128-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-5
pve-kernel-helper: 6.4-5
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.5-pve1~bpo10+1
smartctl -a /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.128-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: MTFDHBE3T8TDF
Serial Number: 21122E525ACE
Firmware Version: 95420260
PCI Vendor/Subsystem ID: 0x1344
IEEE OUI Identifier: 0x00a075
Total NVM Capacity: 3,840,755,982,336 [3.84 TB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 3,840,755,982,336 [3.84 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 00a075 012e525ace
Local Time is: Tue Aug 31 04:00:48 2021 MSK
Firmware Updates (0x17): 3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x001e): Format Frmw_DL NS_Mngmt Self_Test
Optional NVM Commands (0x0034): DS_Mngmt Sav/Sel_Feat Resv
Log Page Attributes (0x02): Cmd_Eff_Lg
Maximum Data Transfer Size: 32 Pages
Warning Comp. Temp. Threshold: 72 Celsius
Critical Comp. Temp. Threshold: 75 Celsius
Namespace 1 Features (0x08): No_ID_Reuse
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    12.00W       -        -    0  0  0  0        0       0
 1 +    11.00W       -        -    1  1  1  1        0       0
 2 +    10.00W       -        -    2  2  2  2        0       0
 3 +     9.00W       -        -    3  3  3  3        0       0
 4 +     8.00W       -        -    4  4  4  4        0       0
 5 +     7.00W       -        -    5  5  5  5        0       0
 6 +     6.00W       -        -    6  6  6  6        0       0
 7 +     5.00W       -        -    7  7  7  7        0       0
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
 1 -    4096       0         0
 2 -     512       8         0
 3 -    4096       8         0
 4 -    4096      64         0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 38 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 0%
Data Units Read: 3,418,086 [1.75 TB]
Data Units Written: 17,534,102 [8.97 TB]
Host Read Commands: 131,348,990
Host Write Commands: 164,119,906
Controller Busy Time: 5,505
Power Cycles: 8
Power On Hours: 949
Unsafe Shutdowns: 0
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 47 Celsius
Temperature Sensor 2: 40 Celsius
Error Information (NVMe Log 0x01, 16 of 63 entries)
No Errors Logged
smartctl -a /dev/nvme1n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.128-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: MTFDHBE3T8TDF
Serial Number: 21122E525AD0
Firmware Version: 95420260
PCI Vendor/Subsystem ID: 0x1344
IEEE OUI Identifier: 0x00a075
Total NVM Capacity: 3,840,755,982,336 [3.84 TB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 3,840,755,982,336 [3.84 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 00a075 012e525ad0
Local Time is: Tue Aug 31 04:01:21 2021 MSK
Firmware Updates (0x17): 3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x001e): Format Frmw_DL NS_Mngmt Self_Test
Optional NVM Commands (0x0034): DS_Mngmt Sav/Sel_Feat Resv
Log Page Attributes (0x02): Cmd_Eff_Lg
Maximum Data Transfer Size: 32 Pages
Warning Comp. Temp. Threshold: 72 Celsius
Critical Comp. Temp. Threshold: 75 Celsius
Namespace 1 Features (0x08): No_ID_Reuse
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    12.00W       -        -    0  0  0  0        0       0
 1 +    11.00W       -        -    1  1  1  1        0       0
 2 +    10.00W       -        -    2  2  2  2        0       0
 3 +     9.00W       -        -    3  3  3  3        0       0
 4 +     8.00W       -        -    4  4  4  4        0       0
 5 +     7.00W       -        -    5  5  5  5        0       0
 6 +     6.00W       -        -    6  6  6  6        0       0
 7 +     5.00W       -        -    7  7  7  7        0       0
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
 1 -    4096       0         0
 2 -     512       8         0
 3 -    4096       8         0
 4 -    4096      64         0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 37 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 0%
Data Units Read: 3,417,186 [1.74 TB]
Data Units Written: 17,535,651 [8.97 TB]
Host Read Commands: 131,925,516
Host Write Commands: 164,458,041
Controller Busy Time: 5,506
Power Cycles: 8
Power On Hours: 945
Unsafe Shutdowns: 0
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 47 Celsius
Temperature Sensor 2: 39 Celsius
Error Information (NVMe Log 0x01, 16 of 63 entries)
No Errors Logged