[SOLVED] Hypervisor and all VMs hang during high IO load

tomaslederer

New Member
Mar 17, 2022
Hi all,

I am trying to troubleshoot an issue I have been having with my homelab for some time.

Server HW:
  • CPU: AMD Ryzen 2400G
  • RAM: 32GB
  • System/VM disk: 2x 256GB NVMe in a ZFS mirror (one in the onboard slot, the other in a PCIe enclosure)
  • Data disk: 10TB WD Gold
The 10TB hard drive is passed through to one of the VMs, where it is mounted and formatted as ext4. The issue I am having is that during heavy writes/reads to this disk the system almost always periodically hangs - the VMs are stuck, I can't access the web GUI, nothing works. After a few minutes everything gets back to normal.

I have troubleshot a lot of things. I saw almost full RAM, so I upgraded from 16 GB to 32... Later I noticed "deadman" events from the system ZFS pool in the logs (syslog at the end of the post), which according to the documentation point to "flaky hardware or drivers". The errors are the same for each occurrence. However, they happen randomly on both disks in the array, and both are relatively new (half a year) - SMART in the screenshot below. It is basically the same for both drives, as they have only ever been used in this server, in the mirror.

[Screenshot attachment: Snímek obrazovky 2022-03-17 v 11.39.32.png - SMART values of the drives]

I would think it is a faulty disk if the errors came from only one, but from both? Any ideas what I might have done wrong, or what I might try?
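For reference, these are the basic checks for a situation like this - pool health, the ZFS event log where the deadman events are recorded, and SMART on both mirror members. Just a sketch; the pool name rpool comes from the syslog below, and the NVMe device names are examples only:
Code:
# Pool health plus any read/write/checksum errors ZFS has counted
zpool status -v rpool

# Kernel ZFS event history - the deadman/delay events show up here with the affected vdev
zpool events -v | less

# Deadman thresholds in milliseconds; a deadman event is posted roughly when a single
# IO has been outstanding longer than zfs_deadman_ziotime_ms
cat /sys/module/zfs/parameters/zfs_deadman_ziotime_ms
cat /sys/module/zfs/parameters/zfs_deadman_synctime_ms

# SMART health of both mirror members (adjust the device names)
smartctl -a /dev/nvme0
smartctl -a /dev/nvme1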


Here is the syslog, starting from the time when things start to go wrong until I successfully log in to the admin console:
Code:
Mar 16 18:27:24 HomeServer-hypervisor pvescheduler[564969]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Mar 16 18:27:24 HomeServer-hypervisor pvestatd[2723]: status update time (15.999 seconds)
Mar 16 18:27:24 HomeServer-hypervisor pve-firewall[2724]: firewall update time (15.818 seconds)
Mar 16 18:27:24 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/101: -1
Mar 16 18:27:24 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/102: -1
Mar 16 18:27:24 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/103: -1
Mar 16 18:27:24 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/100: -1
Mar 16 18:27:24 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/HomeServer-hypervisor/local-zfs: -1
Mar 16 18:27:24 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/HomeServer-hypervisor/local: -1
Mar 16 18:53:38 HomeServer-hypervisor pve-firewall[2724]: firewall update time (74.129 seconds)
Mar 16 18:53:52 HomeServer-hypervisor pve-ha-lrm[2779]: loop take too long (88 seconds)
Mar 16 18:53:52 HomeServer-hypervisor pve-ha-crm[2764]: loop take too long (76 seconds)
Mar 16 18:53:52 HomeServer-hypervisor pvescheduler[703528]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Mar 16 18:53:52 HomeServer-hypervisor pvestatd[2723]: status update time (88.347 seconds)
Mar 16 18:53:52 HomeServer-hypervisor pve-firewall[2724]: firewall update time (13.790 seconds)
Mar 16 18:53:52 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/HomeServer-hypervisor/local: -1
Mar 16 18:53:52 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/HomeServer-hypervisor/local-zfs: -1
Mar 16 18:55:03 HomeServer-hypervisor pvedaemon[2754]: <root@pam> successful auth for user 'root@pam'
Mar 16 18:55:39 HomeServer-hypervisor pvedaemon[2753]: <root@pam> successful auth for user 'root@pam'
Mar 16 18:59:22 HomeServer-hypervisor zed[2153]: Missed 26 events
Mar 16 18:59:22 HomeServer-hypervisor zed[2153]: Bumping queue length to 67108864
Mar 16 18:59:22 HomeServer-hypervisor zed[2153]: Missed 89 events
Mar 16 18:59:22 HomeServer-hypervisor zed[2153]: Bumping queue length to 134217728
Mar 16 18:59:22 HomeServer-hypervisor zed[2153]: Missed 46 events
Mar 16 18:59:22 HomeServer-hypervisor zed[2153]: Bumping queue length to 268435456
Mar 16 18:59:22 HomeServer-hypervisor zed[722309]: eid=172 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710919168 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557663
Mar 16 18:59:22 HomeServer-hypervisor zed[722306]: eid=167 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710960128 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557668
Mar 16 18:59:22 HomeServer-hypervisor zed[722315]: eid=166 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710968320 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557669
Mar 16 18:59:22 HomeServer-hypervisor zed[722327]: eid=175 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710894592 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557660
Mar 16 18:59:22 HomeServer-hypervisor zed[722326]: eid=169 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710943744 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557666
Mar 16 18:59:22 HomeServer-hypervisor zed[722322]: eid=168 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710951936 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557667
Mar 16 18:59:22 HomeServer-hypervisor zed[722340]: eid=181 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710845440 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557654
Mar 16 18:59:22 HomeServer-hypervisor zed[722338]: eid=174 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710902784 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557661
Mar 16 18:59:22 HomeServer-hypervisor zed[722341]: eid=170 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710935552 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557665
Mar 16 18:59:22 HomeServer-hypervisor zed[722342]: eid=173 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710910976 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557662
Mar 16 18:59:22 HomeServer-hypervisor zed[722354]: eid=180 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710853632 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557655
Mar 16 18:59:22 HomeServer-hypervisor zed[722339]: eid=176 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710886400 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557659
Mar 16 18:59:22 HomeServer-hypervisor zed[722355]: eid=171 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710927360 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557664
Mar 16 18:59:22 HomeServer-hypervisor zed[722344]: eid=177 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710878208 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557658
Mar 16 18:59:22 HomeServer-hypervisor zed[722349]: eid=179 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710861824 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557656
Mar 16 18:59:22 HomeServer-hypervisor zed[722353]: eid=183 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710829056 priority=3 err=0 flags=0x380880 bookmark=66527:1:0:14557652
Mar 16 18:59:22 HomeServer-hypervisor zed[722358]: eid=182 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710837248 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557653
Mar 16 18:59:22 HomeServer-hypervisor zed[722376]: eid=178 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710870016 priority=3 err=0 flags=0x180880 bookmark=66527:1:0:14557657
Mar 16 18:59:22 HomeServer-hypervisor zed[724326]: eid=185 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=8192 offset=166710820864 priority=3 err=0 flags=0x380880 bookmark=66527:1:0:14557651
Mar 16 18:59:22 HomeServer-hypervisor zed[724324]: eid=184 class=deadman pool='rpool' vdev=nvme-Apacer_AS2280P4_256GB_L09938R002700-part3 size=131072 offset=166710706176 priority=3 err=0 flags=0x40080c80 delay=8574092ms
Mar 16 19:04:18 HomeServer-hypervisor pve-firewall[2724]: firewall update time (145.874 seconds)
Mar 16 19:04:18 HomeServer-hypervisor pvestatd[2723]: status update time (146.324 seconds)
Mar 16 19:04:18 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/HomeServer-hypervisor/local: -1
Mar 16 19:04:18 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/HomeServer-hypervisor/local: /var/lib/rrdcached/db/pve2-storage/HomeServer-hypervisor/local: illegal attempt to update using time 1647453858 when last update time is 1647453858 (minimum one second step)
Mar 16 19:04:18 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/HomeServer-hypervisor/local-zfs: -1
Mar 16 19:04:18 HomeServer-hypervisor pmxcfs[2484]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/HomeServer-hypervisor/local-zfs: /var/lib/rrdcached/db/pve2-storage/HomeServer-hypervisor/local-zfs: illegal attempt to update using time 1647453858 when last update time is 1647453858 (minimum one second step)
Mar 16 19:04:23 HomeServer-hypervisor pve-ha-lrm[2779]: loop take too long (156 seconds)
Mar 16 19:04:23 HomeServer-hypervisor pve-ha-crm[2764]: loop take too long (151 seconds)
Mar 16 19:08:54 HomeServer-hypervisor pvedaemon[2753]: <root@pam> successful auth for user 'root@pam'


Thank you all for your suggestions!
 
Alright... should I give up on ZFS, or are consumer SSDs unusable with Proxmox? This is a server, but not an enterprise one - a homelab with a few VMs for smart home, NAS, media, etc...
 
Alright... should I give up on ZFS, or are consumer SSDs unusable with Proxmox? This is a server, but not an enterprise one - a homelab with a few VMs for smart home, NAS, media, etc...
Consumer/prosumer SSDs are not built for copy-on-write filesystems like ZFS. You may get better performance with ext4.
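If you want to see how the drives cope with the sync-heavy IO that ZFS produces, a small fio run with an fsync after every write usually makes the difference to enterprise SSDs obvious. A rough sketch only; the target file, size and runtime are arbitrary examples, so point it at a scratch file:
Code:
# 4k random writes, fsync after each one - cheap consumer SSDs often drop to a few
# hundred IOPS here, enterprise SSDs with power-loss protection stay in the thousands
fio --name=syncwrite --filename=/rpool/fio-test.bin --size=1G \
    --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --numjobs=1 \
    --fsync=1 --runtime=60 --time_based --group_reporting

# remove the scratch file afterwards
rm /rpool/fio-test.bin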
 
On Friday I reinstalled Proxmox on ext4. All VMs are running from ext4 too. Now, after a few days of testing, overall performance seems much better, and no hang of the VMs or Proxmox has happened since.

Thanks for your advice!
 
I have the same error, except I have 4x ata-WDC_WD2005FBYZ in a ZFS RAID10.

High IO load resets the hypervisor; it does not just hang for a while :(

Code:
Jul 22 03:22:03 proxmox2 zed: eid=16 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E7255V-part3 size=8192 offset=1073236172800 priority=0 err=0 flags=0x180880 delay=30499ms bookmark=3589:1:0:61613330
Jul 22 03:22:03 proxmox2 zed: eid=9 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E9FXZY-part3 size=65536 offset=350642769920 priority=0 err=0 flags=0x40080c80 delay=32473ms
Jul 22 03:22:03 proxmox2 zed: eid=11 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E1R1NL-part3 size=8192 offset=140229066752 priority=0 err=0 flags=0x180880 delay=31620ms bookmark=3589:1:0:61613819
Jul 22 03:22:03 proxmox2 zed: eid=8 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E9FXZY-part3 size=1044480 offset=350641725440 priority=0 err=0 flags=0x40080c80 delay=32473ms
Jul 22 03:22:03 proxmox2 zed: eid=15 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E10Z19-part3 size=753664 offset=140228165632 priority=0 err=0 flags=0x40080c80 delay=30340ms
Jul 22 03:22:03 proxmox2 zed: eid=12 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E7255V-part3 size=24576 offset=350644924416 priority=0 err=0 flags=0x40080c80 delay=31620ms
Jul 22 03:22:03 proxmox2 zed: eid=10 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E7255V-part3 size=1044480 offset=350643879936 priority=0 err=0 flags=0x40080c80 delay=32473ms
Jul 22 03:22:03 proxmox2 zed: eid=14 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E7255V-part3 size=4096 offset=1175542091776 priority=0 err=0 flags=0x180880 delay=31620ms bookmark=3589:1:0:61613678
Jul 22 03:22:03 proxmox2 zed: eid=7 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E9FXZY-part3 size=8192 offset=1116908425216 priority=0 err=0 flags=0x180880 delay=32473ms bookmark=3589:1:0:61613679
Jul 22 03:22:03 proxmox2 zed: eid=20 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E10Z19-part3 size=61440 offset=86045593600 priority=0 err=0 flags=0x180880 delay=34105ms bookmark=139:50069:0:0
Jul 22 03:22:03 proxmox2 zed: eid=13 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E7255V-part3 size=24576 offset=1116777095168 priority=0 err=0 flags=0x40080c80 delay=31620ms
Jul 22 03:22:03 proxmox2 zed: eid=17 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E10Z19-part3 size=4096 offset=1032840855552 priority=0 err=0 flags=0x180880 delay=30838ms bookmark=3589:1:0:61613097
Jul 22 03:22:03 proxmox2 zed: eid=6 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E9FXZY-part3 size=8192 offset=1116762341376 priority=0 err=0 flags=0x180880 delay=32473ms bookmark=3589:1:0:61613446
Jul 22 03:22:03 proxmox2 zed: eid=19 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E1R1NL-part3 size=909312 offset=140227256320 priority=0 err=0 flags=0x40080c80 delay=30858ms
Jul 22 03:22:03 proxmox2 pve-firewall[4266]: firewall update time (64.510 seconds)
Jul 22 03:22:03 proxmox2 zed: eid=18 class=delay pool='rpool' vdev=ata-WDC_WD2005FBYZ-01YCBB1_WD-WMC6N0E1R1NL-part3 size=1044480 offset=140226211840 priority=0 err=0 flags=0x40080c80 delay=30858ms
Jul 22 03:22:05 proxmox2 pvestatd[4276]: status update time (72.567 seconds)
Jul 22 03:22:06 proxmox2 pmxcfs[4147]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/proxmox2: -1

Any advice, please?
 
But there are four WD Gold Editions; I thought a Gold Edition is good, is it not?
 
But I don't care about performance; I care that Proxmox resets when high IO load happens.
 
But I don't care about performance; I care that Proxmox resets when high IO load happens.
In my book, performance is how well a system can handle high IO load, so increasing it would help.

Enterprise disks normally use SAS, which those Gold ones don't. They are prosumer disks and not suitable for real enterprise usage. Real enterprise disks normally have SAS (instead of SATA) and special firmware that ensures IO response times of less than a second IN ANY CASE. Your disks take over 32 s to respond, and that is NOT GOOD. Have you checked the SMART values? A response time over 32 s (even over 1 s) is totally unacceptable.
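A quick sketch of what to look at in SMART for SATA disks that are slow but not (yet) failing; the device names are examples:
Code:
# Full SMART report per RAID10 member
smartctl -a /dev/sda

# Attributes that usually explain long response times or cabling trouble:
#   5  Reallocated_Sector_Ct  - remapped sectors
# 197  Current_Pending_Sector - sectors waiting to be remapped (long internal retries)
# 199  UDMA_CRC_Error_Count   - transfer errors, typically cable/backplane/controller
smartctl -A /dev/sda | grep -E 'Reallocated_Sector|Current_Pending|UDMA_CRC'

# Long self-test; read the result later with: smartctl -l selftest /dev/sda
smartctl -t long /dev/sda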
 
But there are four WD Gold Editions; I thought a Gold Edition is good, is it not?

It is not bad, but it is and always will be an HDD. HDDs have terrible IOPS compared to SSDs, and you are hammering them with probably several OSes (VMs/LXCs) on top of a CoW filesystem.

But I don't care about performance; I care that Proxmox resets when high IO load happens.

Yeah, probably because your storage reaches a point where it cannot handle anything anymore, so things time out and crash.
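You can watch that happen live by looking at per-vdev latency while the load runs. A sketch (the 5-second interval is arbitrary; zpool iostat -l needs a reasonably current OpenZFS):
Code:
# Per-vdev bandwidth and latency every 5 seconds - wait times climbing into
# hundreds of ms or seconds on single disks means they are the bottleneck
zpool iostat -v -l 5

# Block-layer view of the same (sysstat package): %util stuck at 100 and
# await in the seconds range means the disks are saturated
iostat -x 5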
 
I have about six Proxmox servers and I have no trouble at all with those Gold Enterprise drives, except on this one.
Anyway, slow disks should not reset the whole Proxmox server.
SMART looks clean to me.
It hangs one hour after the backup was made.
At that time I am not seeing any process that should cause high IO.
I would be OK if it hung during a backup, but one our later?
 
I would be OK if it hung during a backup, but one our later?
"later" is a bit unspecific. Normally, the backup is not written synchronously, so if writes into the buffer cache until this is full and then the syncing on the disk starts. Yet in general, proxmox should not crash but as @Neobin already wrote, it the disks are busy, what else should the OS do? What about the SATA controller? Maybe a problem there or with the cables?
 
"later" is a bit unspecific. Normally, the backup is not written synchronously, so if writes into the buffer cache until this is full and then the syncing on the disk starts. Yet in general, proxmox should not crash but as @Neobin already wrote, it the disks are busy, what else should the OS do? What about the SATA controller? Maybe a problem there or with the cables?
It's a typo, it should be one hour later :)

But now I found a process in the VM which starts (and writes 35 GB of data) at the same time as the Proxmox crash. So maybe I have the perpetrator.

@tomaslederer in the end changed from ZFS to ext4. But I know he has a different situation; his Proxmox only hangs, it does not crash.
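To pin down which process produces the burst, accumulated per-process IO counters are usually enough. A sketch with standard tools (both have to be installed separately):
Code:
# Accumulated IO per process: -o only active ones, -P per process, -a accumulated, -b batch mode
iotop -obPa

# Per-process disk statistics every 5 seconds (sysstat package)
pidstat -d 5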
 
But I know he has a different situation; his Proxmox only hangs, it does not crash.

Maybe because he uses low-end consumer NVMe SSDs, which perform somewhat better than HDDs...
But still not good enough, as you can read here.
 
Adding more RAM solved my problem with crashing. No need to change the HDDs.
ZFS eats a lot of RAM.
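For reference, OpenZFS lets the ARC grow to roughly half of the RAM by default, so on a tight box it can also be capped instead of (or in addition to) adding RAM. A sketch with an example limit of 4 GiB; pick a value that fits your machine:
Code:
# Current ARC size and hit rates
arc_summary | head -n 40

# Cap the ARC at 4 GiB at runtime (the ARC may only shrink gradually)
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max

# Make it persistent and rebuild the initramfs so it applies at boot
echo "options zfs zfs_arc_max=4294967296" >> /etc/modprobe.d/zfs.conf
update-initramfs -u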
 
