Proxmox VE 8.2.2 - High IO delay

I just toggle between the two kernels with proxmox-boot-tool. The upgrade to Proxmox 8.2 was done ages ago, so no upgrade was involved in this switching. My default kernel is 6.2; to switch I run sudo proxmox-boot-tool kernel pin 6.8.8-2-pve --next-boot and then reboot the server. I've been thinking about this on and off since I first saw the problem and haven't come up with an answer of my own, so other people's thoughts are appreciated :)
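
For anyone wanting to reproduce the toggle, this is roughly the sequence I use (the kernel version string is just what happens to be installed on my nodes; check yours with the list command first):

Code:
# list the kernels proxmox-boot-tool knows about
proxmox-boot-tool kernel list
# boot 6.8 once, then fall back to the default kernel on the following boot
proxmox-boot-tool kernel pin 6.8.8-2-pve --next-boot
reboot
# remove any pin again if needed
proxmox-boot-tool kernel unpin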

I've enclosed two screenshots of the same machine, one running the 6.2 kernel (my default) and the other running 6.8. The machine is part of a very small home cluster, so I can do pretty much what I like to it whenever I like, within reason. Each screenshot was taken after rebooting and then leaving the workload to settle right down. The machine doesn't have very much running on it and the workload doesn't change very much. The workload is mostly home automation via Home Assistant, TV management via Emby, an AdGuard server, a simple file server and no doubt other bits and bobs. I can produce a list if that helps.

The boot drive is a two-disk ZFS mirror and holds all the VMs and LXCs. The data drive is a larger two-disk ZFS mirror for TV programs, home document storage and so on. No machines are run from this data drive. All drives are spinning rust, no SSDs.
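
For reference, the pool layout can be confirmed with the standard ZFS tooling; nothing fancy here, and the output will obviously depend on your own pool and dataset names:

Code:
# overall health and vdev layout of both mirrors
zpool status
# capacity overview per pool
zpool list
# datasets holding the VMs/LXCs and the data shares
zfs list -o name,used,available,mountpoint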

I saw the same issue with 6.5 when I upgraded to it when it first came out. Do let me know of anything else you'd like me to add or grab hold of, as I'd like to be on 6.8.
The only feedback I got from the OpenZFS IRC Channel is that Kernel 6.8 changed MANY THINGS. Not very specific, I know, but that's all I was told.

Not sure if the Issue was already present on Kernel 6.5.x. Granted, my Workload might have changed since then (and was probably fairly light on Kernel 6.5.x), so the Issue maybe didn't manifest that obviously in my Case on that Kernel.
 
Interestingly, as a test I moved back to a 6.5 kernel on both nodes (6.5.13-5-pve) and the IO delay is fine once again. My guess is that when I first upgraded to a 6.5 kernel ages ago, there was some other factor which, combined with the 6.5 kernel, resulted in the high IO delays. Perhaps ZFS. Whatever it was, that no longer seems to be the case, although I'll leave 6.5 running for a couple of days to be sure. If it's fine then, the issue would be restricted to just the 6.8 kernel...
 
Any further Discoveries ?

I'm quite disappointed that the Proxmox VE Team and other Users all say "Do NOT use Consumer SSDs", when the Issue arises after a Package (Kernel and/or ZFS) Upgrade ...

I guess I could maybe take the Kernel 6.5.x Config from Proxmox VE, download the Kernel 6.6.41 Sources, then run make olddefconfig or something like that and build it from Source. It would just be easier if the Proxmox VE Team would acknowledge the Issue though ...
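
If I go down that Road, the rough Sequence I have in Mind is the following (untested Sketch; the kernel.org URL follows the usual Pattern and the Config Path is the standard Debian one, so treat it as an Assumption until tried):

Code:
# fetch the mainline 6.6.41 Sources (assumes the usual Kernel Build Dependencies are installed)
wget https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.6.41.tar.xz
tar xf linux-6.6.41.tar.xz && cd linux-6.6.41
# start from the Config of the running Proxmox 6.5.x Kernel, take Defaults for new Options
cp /boot/config-$(uname -r) .config
make olddefconfig
# build Debian Packages and install the Image
make -j$(nproc) bindeb-pkg
dpkg -i ../linux-image-6.6.41*.deb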
 
Yes, some progress with my non-SSD based system. I ran a 6.5 kernel very well for 2-3 days without any instances of high IO delays, but then one node would no longer display the GUI details for the VMs and LXCs. I was just seeing ? marks and timeout messages in the GUI. The other node was fine, and the LXCs and VMs themselves were also operating fine. There were various suggestions on the forum, none of which breathed any life back into the GUI for me, so I reverted back to the old reliable 6.2 kernel and all is well once more.
 
Weird ... The GUI being picky about the Kernel Version :oops: ? I mean, if the Kernel is good enough to run VMs/CTs, then it should also be good enough for the GUI (IMHO).

I don't see how the Kernel Version would break the GUI in that Regard ... Did you check the Services pvestatd and pveproxy ?
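
Checking those is just the standard systemd Stuff, nothing Proxmox-specific beyond the Unit Names, e.g.:

Code:
systemctl status pvestatd pveproxy
journalctl -u pvestatd -u pveproxy --since "1 hour ago"
# pvestatd is usually the one behind the grey '?' Icons in the GUI
systemctl restart pvestatd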

I think I'll try a Custom Kernel Build this Afternoon. Too bad, because one of the big Selling Points of Proxmox VE was that it ships the Kernel together with the ZFS Kernel Modules. Going the Custom Kernel Build Route would mean that I need to install the zfs-dkms Package and probably blacklist the Proxmox VE zfs* Packages (so that I pull everything from the Debian Backports Repository instead). Not sure if that would work or if APT will prevent me from doing that.
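
What I have in Mind there looks roughly like the following (completely untested Sketch; whether bookworm-backports carries a matching zfs-dkms, and whether the Pin survives a PVE Upgrade, are open Questions):

Code:
# enable backports and pull the DKMS ZFS Modules from there
echo "deb http://deb.debian.org/debian bookworm-backports main contrib" \
    > /etc/apt/sources.list.d/backports.list
apt update && apt install -t bookworm-backports zfs-dkms zfsutils-linux

# keep APT from switching back to the Proxmox-built zfs Packages
cat > /etc/apt/preferences.d/zfs-backports <<'EOF'
Package: zfs-dkms zfsutils-linux libzfs* libzpool*
Pin: release n=bookworm-backports
Pin-Priority: 900
EOF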

Last Question: you are NOT doing something "weird" (like I am ;)) with regard to ZFS ? Like creating/having a ZFS Pool (in the Guest VM) on top of a ZVOL (on the Host) ?

On the #openzfs IRC Channel somebody reported an even weirder Thing: he created a ZFS Pool on the HOST out of 2 x ZVOLs (again on the HOST). He reported that ZFS would deadlock the Pool. While not the same Situation as his, maybe something similar is going on with my Podman Servers (the only Place where I have a ZFS [single Disk] Pool on top of a ZVOL), made much worse since Kernel 6.8.x.
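
To be explicit about the Layering I mean on those Podman Servers, it is roughly this (Names are made up, shown only to illustrate the Setup, not to recommend it):

Code:
# on the Proxmox VE Host: a ZVOL that is handed to the Guest as its Disk
zfs create -V 100G rpool/data/vm-101-disk-1

# inside the Guest: that virtual Disk becomes the sole Member of another Pool
zpool create podman-pool /dev/sdb
zfs create podman-pool/containers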
 
The services pve-cluster, corosync, pvestatd, pveproxy and pvedaemon were all restarted, which I know is very heavy-handed, but that seemed to be what other people had used to kick the GUI back into life. Not for me, though :)
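
For completeness, the heavy-handed restart amounted to nothing more than:

Code:
systemctl restart pve-cluster corosync pvestatd pveproxy pvedaemon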

The only weird thing I can think of with ZFS is that from time to time, with any kernel, the free memory needs to be compacted. If you try to restart a VM or LXC when this condition arises, it will fail, with errors in the log indicating there isn't enough contiguous free memory for things like the VM/LXC's network devices to be created. Many people have had this issue, and the best workaround I have found so far is:

echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory

During testing, I found that just stopping one VM/LXC and starting another would be fine, but that's not ideal. So I have the above as an hourly cron script and it clears this problem right up on my home server. When I run a new kernel, I try this out just to see that all is well, and if the kernel stays up for a few days I remove the hourly job. But so far this elusive ZFS-related issue remains.
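
The hourly job itself is nothing clever; mine is essentially just the two echoes wrapped in a script (the filename under /etc/cron.hourly is arbitrary):

Code:
#!/bin/sh
# e.g. saved as /etc/cron.hourly/compact-memory and made executable
# flush page cache, dentries and inodes, then ask the kernel to defragment free memory
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory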

If you get that sort of issue, you can use

cat /proc/buddyinfo

to see the amount of free memory in the various page-order buckets before and after the 'hack'. Someone did suggest I stop using ZFS, but I simply frowned at them :)
 
Uhm, I never got that Memory Issue. But that's also probably because I force ZFS to do what I want instead of leaving it on a very loose leash ;).

I reduced the amount of ARC I allow ZFS to use. Maximum 4GB for the Proxmox VE Host on a 32GB System (otherwise ZFS can eat up to 50%, i.e. 16GB RAM!).

/etc/modprobe.d/zfs.conf:
Code:
# Set Max ARC size => 4GB == 4294967296 Bytes
options zfs zfs_arc_max=4294967296
 
# Set Min ARC size => 1GB == 1073741824
options zfs zfs_arc_min=1073741824

In the Podman Guests I allow much less ARC (1 GB max); maybe that's also partly to blame for the low Performance. The other Issue is that some VMs are swapping quite aggressively even though they barely use 30% of their RAM, so I need to set vm.swappiness=1 (do NOT swap unless ABSOLUTELY required).
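
Applying that inside the affected VMs would just be the usual sysctl Route (the Filename under /etc/sysctl.d is arbitrary):

Code:
# apply immediately
sysctl -w vm.swappiness=1
# and make it persistent across Reboots
echo "vm.swappiness=1" > /etc/sysctl.d/99-swappiness.conf
sysctl -p /etc/sysctl.d/99-swappiness.conf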
 
Thanks for that reminder - I have my max ARC a bit higher than yours and I'm happy with the performance. The cluster doesn't have a high load, either CPU or IO. Maybe when the next kernel or ZFS version comes out I'll review the ZFS settings again.
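
When that review happens, the current limits and live ARC usage can be checked without a reboot (arc_summary comes with zfsutils-linux; the paths below are the standard ZFS module parameters):

Code:
# configured limits in bytes
cat /sys/module/zfs/parameters/zfs_arc_max /sys/module/zfs/parameters/zfs_arc_min
# live ARC size, target and hit rates
arc_summary | head -n 40
# or the raw counters
grep -E '^(size|c_max|c_min) ' /proc/spl/kstat/zfs/arcstats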
 
I am observing some very high (>40%, sometimes 80%) IO Delay on Proxmox VE 8.2.2 with pve-no-subscription Repository.

I mean, the new kernel may well have an issue and maybe even contribute up to 20% IO delay (to be generous), but 40%-80% IO delay is more than likely an SSD I/O issue. I have never seen anywhere close to those numbers, even while backing up/restoring VMs. The highest peak I see is 15%, and I don't even have an NVMe SSD, it's just a lowly SATA (enterprise) SSD. I got these numbers on PVE 8.2.4.
 
I don't use SSDs at all. These are older-style spinning rust mirrors. I agree that it would seem odd that moving from 6.2 to 6.8 with the same version of ZFS introduces this increase in IO delay, but that is what I see. Ho hum.
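
To narrow down where the wait actually comes from, rather than just the single GUI figure, per-device numbers help (iostat needs the sysstat package; the PSI file is present on recent kernels):

Code:
# per-vdev latency and throughput for each pool, refreshed every 5 seconds
zpool iostat -v 5
# per-disk utilisation and await times
iostat -x 5
# overall IO pressure as the kernel sees it
cat /proc/pressure/io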
 
Well, the EXTREME IOwait (60%-80%) seems to be mostly on those Podman Systems (ZFS on top of a ZVOL), so maybe it's a similar Issue to what the User reported on #openzfs (ZFS Pool Deadlock), although not to the same Extent.

On the other Systems it might indeed be lower, but still somewhere around 20% or so.
 
I migrated my Podman Data from ZFS on top of a ZVOL to EXT4 on top of a ZVOL.
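
Roughly what that Migration looked like from inside the Guest (the Device Name and Paths are illustrative only; the new Disk is still a ZVOL on the Host underneath):

Code:
# new virtual Disk formatted with ext4 instead of a nested ZFS Pool
mkfs.ext4 /dev/sdc
mkdir -p /mnt/podman-ext4
mount /dev/sdc /mnt/podman-ext4
# copy the Podman Data, preserving Hardlinks, ACLs and xattrs
rsync -aHAX /var/lib/containers/ /mnt/podman-ext4/
# afterwards the old single-disk Pool gets exported/destroyed and the Mountpoints swapped over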

I tried zpool trim rpool on the Proxmox VE Host to see if that would improve Things. It didn't, unfortunately.

The Issue still persists. I also tested on Kernel 6.5.x, same thing. IOWait can jump above 80% very easily.

Last Attempt is to upgrade the Crucial MX500 Firmware.

Otherwise the Conclusion is: this Drive SUCKS BIG TIME.
 
