[SOLVED] Hypervisor and all VMs hang during high IO load

tomaslederer

New Member
Mar 17, 2022
Hi all,

I am trying to troubleshoot an issue I have been having with my homelab for some time.

Server HW:
  • CPU: AMD Ryzen 2400G
  • RAM: 32GB
  • System/VM disk: 2x 256GB NVMe in a ZFS mirror (one in the onboard slot, the other in a PCIe enclosure)
  • Data disk: 10TB WD Gold
The 10TB hard drive is passed through to one of the VMs, where it is mounted and formatted as ext4. The issue I am having is that during heavy writes/reads to this disk the system almost always periodically hangs - the VMs are stuck, I can't access the web GUI, nothing works. After a few minutes everything gets back to normal.

I have troubleshot a lot of things. I saw almost full RAM, so I upgraded from 16 GB to 32... Later I noticed "deadman" events from the system ZFS pool in the logs (syslog at the end of the post), which according to the documentation point to "flaky hardware or drivers". The errors are the same for each occurrence. However, they happen randomly on both disks in the array, and both are relatively new (half a year) - SMART in the screenshot below. It is basically the same for both drives, as they have only ever been used in this server, in the mirror.

[Screenshot attachment: Snímek obrazovky 2022-03-17 v 11.39.32.png - SMART values of the drives]

I would think it is a faulty disk if the errors came from only one, but from both? Any ideas what I might have done wrong, or what I might try?
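For reference, these are the basic checks for a situation like this - pool health, the ZFS event log where the deadman events are recorded, and SMART on both mirror members. Just a sketch; the pool name rpool comes from the syslog below, and the NVMe device names are examples only:
Code:
# Pool health plus any read/write/checksum errors ZFS has counted
zpool status -v rpool

# Kernel ZFS event history - the deadman/delay events show up here with the affected vdev
zpool events -v | less

# Deadman thresholds in milliseconds; a deadman event is posted roughly when a single
# IO has been outstanding longer than zfs_deadman_ziotime_ms
cat /sys/module/zfs/parameters/zfs_deadman_ziotime_ms
cat /sys/module/zfs/parameters/zfs_deadman_synctime_ms

# SMART health of both mirror members (adjust the device names)
smartctl -a /dev/nvme0
smartctl -a /dev/nvme1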


Here is the syslog, starting from the time when things start to go wrong until I successfully log in to the admin console:
Code:
Mar 16 18:27:24 HomeServer-hypervisor pvescheduler[564969]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Mar 16 18:27:24 HomeServer-hypervisor pvestatd[2723]: status update time (15.999 seconds)
Mar 16 18:27:24 HomeServer-hypervisor pve-firewall[2724]: firewall update time (15.818 seconds)
Mar 16 18:27:24 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/101: -1
Mar 16 18:27:24 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/102: -1
Mar 16 18:27:24 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/103: -1
Mar 16 18:27:24 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/100: -1
Mar 16 18:27:24 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/HomeServer-hypervisor/local-zfs: -1
Mar 16 18:27:24 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/HomeServer-hypervisor/local: -1
Mar 16 18:53:38 HomeServer-hypervisor pve-firewall[2724]: firewall update time (74.129 seconds)
Mar 16 18:53:52 HomeServer-hypervisor pve-ha-lrm[2779]: loop take too long (88 seconds)
Mar 16 18:53:52 HomeServer-hypervisor pve-ha-crm[2764]: loop take too long (76 seconds)
Mar 16 18:53:52 HomeServer-hypervisor pvescheduler[703528]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Mar 16 18:53:52 HomeServer-hypervisor pvestatd[2723]: status update time (88.347 seconds)
Mar 16 18:53:52 HomeServer-hypervisor pve-firewall[2724]: firewall update time (13.790 seconds)
Mar 16 18:53:52 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/HomeServer-hypervisor/local: -1
Mar 16 18:53:52 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/HomeServer-hypervisor/local-zfs: -1
Mar 16 18:55:03 HomeServer-hypervisor pvedaemon[2754]: <root@pam> successful auth for user 'root@pam'
Mar 16 18:55:39 HomeServer-hypervisor pvedaemon[2753]: <root@pam> successful auth for user 'root@pam'
Mar 16 18:59:22 HomeServer-hypervisor zed[2153]: Missed 26 events
Mar 16 18:59:22 HomeServer-hypervisor zed[2153]: Bumping queue length to 67108864
Mar 16 18:59:22 HomeServer-hypervisor zed[2153]: Missed 89 events
Mar 16 18:59:22 HomeServer-hypervisor zed[2153]: Bumping queue length to 134217728
Mar 16 18:59:22 HomeServer-hypervisor zed[2153]: Missed 46 events
Mar 16 18:59:22 HomeServer-hypervisor zed[2153]: Bumping queue length to 268435456
Mar 16 18:59:22 HomeServer-hypervisor zed[722309]: eid=172 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710919168 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557663
Mar 16 18:59:22 HomeServer-hypervisor zed[722306]: eid=167 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710960128 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557668
Mar 16 18:59:22 HomeServer-hypervisor zed[722315]: eid=166 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710968320 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557669
Mar 16 18:59:22 HomeServer-hypervisor zed[722327]: eid=175 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710894592 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557660
Mar 16 18:59:22 HomeServer-hypervisor zed[722326]: eid=169 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710943744 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557666
Mar 16 18:59:22 HomeServer-hypervisor zed[722322]: eid=168 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710951936 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557667
Mar 16 18:59:22 HomeServer-hypervisor zed[722340]: eid=181 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710845440 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557654
Mar 16 18:59:22 HomeServer-hypervisor zed[722338]: eid=174 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710902784 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557661
Mar 16 18:59:22 HomeServer-hypervisor zed[722341]: eid=170 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710935552 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557665
Mar 16 18:59:22 HomeServer-hypervisor zed[722342]: eid=173 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710910976 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557662
Mar 16 18:59:22 HomeServer-hypervisor zed[722354]: eid=180 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710853632 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557655
Mar 16 18:59:22 HomeServer-hypervisor zed[722339]: eid=176 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710886400 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557659
Mar 16 18:59:22 HomeServer-hypervisor zed[722355]: eid=171 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710927360 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557664
Mar 16 18:59:22 HomeServer-hypervisor zed[722344]: eid=177 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710878208 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557658
Mar 16 18:59:22 HomeServer-hypervisor zed[722349]: eid=179 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710861824 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557656
Mar 16 18:59:22 HomeServer-hypervisor zed[722353]: eid=183 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710829056 priority=3 err=0 flags=0x380880 bookmark=66527:1:0:14557652
Mar 16 18:59:22 HomeServer-hypervisor zed[722358]: eid=182 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710837248 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557653
Mar 16 18:59:22 HomeServer-hypervisor zed[722376]: eid=178 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710870016 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557657
Mar 16 18:59:22 HomeServer-hypervisor zed[724326]: eid=185 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710820864 priority=3 err=0 flags=0x380880 bookmark=66527:1:0:14557651
Mar 16 18:59:22 HomeServer-hypervisor zed[724324]: eid=184 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=131072 offset=166710706176 priority=3 err=0 flags=0x40080c80 delay=8574092ms
Mar 16 19:04:18 HomeServer-hypervisor pve-firewall[2724]: firewall update time (145.874 seconds)
Mar 16 19:04:18 HomeServer-hypervisor pvestatd[2723]: status update time (146.324 seconds)
Mar 16 19:04:18 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/HomeServer-hypervisor/local: -1
Mar 16 19:04:18 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/HomeServer-hypervisor/local: /var/lib/rrdcached/db/pve2-storage/HomeServer-hypervisor/local: illegal attempt to update using time 1647453858 when last update time is 1647453858 (minimum one second step)
Mar 16 19:04:18 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/HomeServer-hypervisor/local-zfs: -1
Mar 16 19:04:18 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/HomeServer-hypervisor/local-zfs: /var/lib/rrdcached/db/pve2-storage/HomeServer-hypervisor/local-zfs: illegal attempt to update using time 1647453858 when last update time is 1647453858 (minimum one second step)
Mar 16 19:04:23 HomeServer-hypervisor pve-ha-lrm[2779]: loop take too long (156 seconds)
Mar 16 19:04:23 HomeServer-hypervisor pve-ha-crm[2764]: loop take too long (151 seconds)
Mar 16 19:08:54 HomeServer-hypervisor pvedaemon[2753]: <root@pam> successful auth for user 'root@pam'


Thank you all for your suggestions!
 
Alright... should I give up on ZFS, or are consumer SSDs unusable with Proxmox? This is a server, but not an enterprise one - a homelab with a few VMs for smart home, NAS, media, etc...
 
Alright... should I give up on ZFS, or are consumer SSDs unusable with Proxmox? This is a server, but not an enterprise one - a homelab with a few VMs for smart home, NAS, media, etc...
Consumer/prosumer SSDs are not built for copy-on-write filesystems like ZFS. You may get better performance with ext4.
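If you want to see how the drives cope with the sync-heavy IO that ZFS produces, a small fio run with an fsync after every write usually makes the difference to enterprise SSDs obvious. A rough sketch only; the target file, size and runtime are arbitrary examples, so point it at a scratch file:
Code:
# 4k random writes, fsync after each one - cheap consumer SSDs often drop to a few
# hundred IOPS here, enterprise SSDs with power-loss protection stay in the thousands
fio --name=syncwrite --filename=/rpool/fio-test.bin --size=1G \
    --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --numjobs=1 \
    --fsync=1 --runtime=60 --time_based --group_reporting

# remove the scratch file afterwards
rm /rpool/fio-test.bin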
 
On Friday I reinstalled Proxmox on ext4. All VMs are running from ext4 too. Now, after a few days of testing, overall performance seems much better, and no hang of the VMs or Proxmox has happened since.

Thanks for your advice!
 
I have the same error, except I have 4x ata-WDC_WD2005FBYZ in a ZFS RAID10.

High IO load resets the hypervisor; it does not just hang for a while :(

Code:
Jul 22 03:22:03 proxmox2 zed: eid=16 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E7255V-part3 size=8192 offset=1073236172800 priority=0 err=0 flags=0x180880 delay=30499ms bookmark=3589:1:0:61613330
Jul 22 03:22:03 proxmox2 zed: eid=9 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E9FXZY-part3 size=65536 offset=350642769920 priority=0 err=0 flags=0x40080c80 delay=32473ms
Jul 22 03:22:03 proxmox2 zed: eid=11 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E1R1NL-part3 size=8192 offset=140229066752 priority=0 err=0 flags=0x180880 delay=31620ms bookmark=3589:1:0:61613819
Jul 22 03:22:03 proxmox2 zed: eid=8 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E9FXZY-part3 size=1044480 offset=350641725440 priority=0 err=0 flags=0x40080c80 delay=32473ms
Jul 22 03:22:03 proxmox2 zed: eid=15 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E10Z19-part3 size=753664 offset=140228165632 priority=0 err=0 flags=0x40080c80 delay=30340ms
Jul 22 03:22:03 proxmox2 zed: eid=12 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E7255V-part3 size=24576 offset=350644924416 priority=0 err=0 flags=0x40080c80 delay=31620ms
Jul 22 03:22:03 proxmox2 zed: eid=10 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E7255V-part3 size=1044480 offset=350643879936 priority=0 err=0 flags=0x40080c80 delay=32473ms
Jul 22 03:22:03 proxmox2 zed: eid=14 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E7255V-part3 size=4096 offset=1175542091776 priority=0 err=0 flags=0x180880 delay=31620ms bookmark=3589:1:0:61613678
Jul 22 03:22:03 proxmox2 zed: eid=7 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E9FXZY-part3 size=8192 offset=1116908425216 priority=0 err=0 flags=0x180880 delay=32473ms bookmark=3589:1:0:61613679
Jul 22 03:22:03 proxmox2 zed: eid=20 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E10Z19-part3 size=61440 offset=86045593600 priority=0 err=0 flags=0x180880 delay=34105ms bookmark=139:50069:0:0
Jul 22 03:22:03 proxmox2 zed: eid=13 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E7255V-part3 size=24576 offset=1116777095168 priority=0 err=0 flags=0x40080c80 delay=31620ms
Jul 22 03:22:03 proxmox2 zed: eid=17 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E10Z19-part3 size=4096 offset=1032840855552 priority=0 err=0 flags=0x180880 delay=30838ms bookmark=3589:1:0:61613097
Jul 22 03:22:03 proxmox2 zed: eid=6 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E9FXZY-part3 size=8192 offset=1116762341376 priority=0 err=0 flags=0x180880 delay=32473ms bookmark=3589:1:0:61613446
Jul 22 03:22:03 proxmox2 zed: eid=19 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E1R1NL-part3 size=909312 offset=140227256320 priority=0 err=0 flags=0x40080c80 delay=30858ms
Jul 22 03:22:03 proxmox2 pve-firewall[4266]: firewall update time (64.510 seconds)
Jul 22 03:22:03 proxmox2 zed: eid=18 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E1R1NL-part3 size=1044480 offset=140226211840 priority=0 err=0 flags=0x40080c80 delay=30858ms
Jul 22 03:22:05 proxmox2 pvestatd[4276]: status update time (72.567 seconds)
Jul 22 03:22:06 proxmox2 pmxcfs[4147]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/proxmox2: -1

Any advice, please?
 
But there are four WD Gold Editions; I thought a Gold Edition is good, is it not?
 
But I don't care about performance; I care that Proxmox resets when high IO load happens.
 
But I don't care about performance; I care that Proxmox resets when high IO load happens.
In my book, performance is how well a system can handle high IO load, so increasing it would help.

Enterprise disks normally use SAS, which those Gold ones don't. They are prosumer disks and not suitable for real enterprise usage. Real enterprise disks normally have SAS (instead of SATA) and special firmware that ensures IO response times of less than a second IN ANY CASE. Your disks take over 32 s to respond, and that is NOT GOOD. Have you checked the SMART values? A response time over 32 s (even over 1 s) is totally unacceptable.
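A quick sketch of what to look at in SMART for SATA disks that are slow but not (yet) failing; the device names are examples:
Code:
# Full SMART report per RAID10 member
smartctl -a /dev/sda

# Attributes that usually explain long response times or cabling trouble:
#   5  Reallocated_Sector_Ct  - remapped sectors
# 197  Current_Pending_Sector - sectors waiting to be remapped (long internal retries)
# 199  UDMA_CRC_Error_Count   - transfer errors, typically cable/backplane/controller
smartctl -A /dev/sda | grep -E 'Reallocated_Sector|Current_Pending|UDMA_CRC'

# Long self-test; read the result later with: smartctl -l selftest /dev/sda
smartctl -t long /dev/sda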
 
But there are four WD Gold Editions; I thought a Gold Edition is good, is it not?

It is not bad, but it is and always will be an HDD. HDDs have terrible IOPS compared to SSDs, and you are hammering them with probably several OSes (VMs/LXCs) on top of a CoW filesystem.

But I don't care about performance; I care that Proxmox resets when high IO load happens.

Yeah, probably because your storage reaches a point where it cannot handle anything anymore, so things time out and crash.
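You can watch that happen live by looking at per-vdev latency while the load runs. A sketch (the 5-second interval is arbitrary; zpool iostat -l needs a reasonably current OpenZFS):
Code:
# Per-vdev bandwidth and latency every 5 seconds - wait times climbing into
# hundreds of ms or seconds on single disks means they are the bottleneck
zpool iostat -v -l 5

# Block-layer view of the same (sysstat package): %util stuck at 100 and
# await in the seconds range means the disks are saturated
iostat -x 5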
 
I have about six Proxmox servers and I have no trouble at all with those Gold Enterprise drives, except on this one.
Anyway, slow disks should not reset the whole Proxmox server.
SMART looks clean to me.
It hangs one hour after the backup was made.
At that time I am not seeing any process that should cause high IO.
I would be OK if it hung during a backup, but one our later?
 
I would be OK if it hung during a backup, but one our later?
"later" is a bit unspecific. Normally, the backup is not written synchronously, so if writes into the buffer cache until this is full and then the syncing on the disk starts. Yet in general, proxmox should not crash but as @Neobin already wrote, it the disks are busy, what else should the OS do? What about the SATA controller? Maybe a problem there or with the cables?
 
"later" is a bit unspecific. Normally, the backup is not written synchronously, so if writes into the buffer cache until this is full and then the syncing on the disk starts. Yet in general, proxmox should not crash but as @Neobin already wrote, it the disks are busy, what else should the OS do? What about the SATA controller? Maybe a problem there or with the cables?
It's a typo, it should be one hour later :)

But now I found a process in the VM which starts (and writes 35 GB of data) at the same time as the Proxmox crash. So maybe I have the perpetrator.

@tomaslederer in the end changed from ZFS to ext4. But I know he has a different situation; his Proxmox only hangs, it does not crash.
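To pin down which process produces the burst, accumulated per-process IO counters are usually enough. A sketch with standard tools (both have to be installed separately):
Code:
# Accumulated IO per process: -o only active ones, -P per process, -a accumulated, -b batch mode
iotop -obPa

# Per-process disk statistics every 5 seconds (sysstat package)
pidstat -d 5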
 
But I know he has a different situation; his Proxmox only hangs, it does not crash.

Maybe because he uses low-end consumer NVMe SSDs, which perform somewhat better than HDDs...
But still not good enough, as you can read here.
 
Adding more RAM solved my problem with crashing. No need to change the HDDs.
ZFS eats a lot of RAM.
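For reference, OpenZFS lets the ARC grow to roughly half of the RAM by default, so on a tight box it can also be capped instead of (or in addition to) adding RAM. A sketch with an example limit of 4 GiB; pick a value that fits your machine:
Code:
# Current ARC size and hit rates
arc_summary | head -n 40

# Cap the ARC at 4 GiB at runtime (the ARC may only shrink gradually)
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max

# Make it persistent and rebuild the initramfs so it applies at boot
echo "options zfs zfs_arc_max=4294967296" >> /etc/modprobe.d/zfs.conf
update-initramfs -u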
 
