Proxmox VE 8.2.2 - High IO delay

silverstone

Well-Known Member
I am observing very high (>40%, sometimes 80%) IO Delay on Proxmox VE 8.2.2 with the pve-no-subscription Repository.

Looking at some Posts on this Forum, this may be due to not using Enterprise-Grade SSDs, although to be honest I don't necessarily "buy" this justification.

I am using a Crucial MX500 that, while being a Consumer (Budget) SSD, still features (Partial) Power Loss Protection and is still based on TLC NAND, not QLC.

The weird thing is that I am observing this Issue on TWO fairly-recent (relatively speaking) Nodes based on Supermicro X11SSL-F Motherboard with 64GB of RAM and Intel Xeon E3 1240 v5 CPU. Disks are ZFS mirrors of 2 x 1000GB.

Even weirder is that I am NOT observing this Issue on other, older Nodes based on Supermicro X10SLL-F / X10SLM(+)-F with 32GB of RAM and Intel Xeon E3-1230v3/1240v3/1231v3 CPUs. Similar Disks, potentially even smaller (Disks are ZFS mirrors of 2 x 500GB).

I can see quite a few things in dmesg (see attached File, cannot paste here, message is too long).

I can only assume this is an issue with Kernel 6.8.x and/or ZFS 2.2.3, as previous Versions of Proxmox VE didn't have (as far as I remember) this Issue :( .

ZFS Version Info
Code:
zfs-2.2.3-pve2
zfs-kmod-2.2.3-pve2

ZFS 2.2.4 was recently released; maybe the upgrade would fix this Issue as well?
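(I guess I can at least check whether the newer Packages have already landed on the Repository; as far as I know the ZFS userland on PVE lives in zfsutils-linux:)
Code:
apt update
apt-cache policy zfsutils-linux    # installed vs. candidate version from the repository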

Sometimes I am stuck on "Writing to file" while trying to save a (very small) file I am editing with nano; this definitely does NOT feel normal :(.

EDIT 1
Not sure if a Crucial MX500 FW Update is required/relevant; I didn't update the old Systems (if it ain't broken, don't fix it).

On the High IO-delay Newer Systems I am running (apparently) FW M3CR043, while on the older Systems I am running (apparently) FW M3CR022.

At least on one of the new/old systems that is (I didn't check ALL systems).

This is also because on the Newer Systems the Crucial MX500 Drive itself is newer (Manufacturing Date maybe 2022-2024), vs. < 2020 or so on the older ones ...

EDIT 2
On the Low IO-delay Systems I'm running Kernel 6.5.x:
Code:
Linux XXXXX 6.5.13-5-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.13-5 (2024-04-05T11:03Z) x86_64 GNU/Linux

On the High IO-delay Systems I'm running Kernel 6.8.x:
Code:
Linux YYYYY 6.8.4-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-2 (2024-04-10T17:36Z) x86_64 GNU/Linux
 

Attachments

  • 20240530_proxmox_ve_io_delay_dmesg.txt
    145.6 KB
I have much the same issue with both my nodes in a cluster, but I'm not using SSDs, just plain mirrored spinning disks. I see high IO delay with both the 6.8 and 6.5 kernels, but if I flick back to kernel 6.2, all is well.
 
Any update from the Proxmox Developers would be appreciated. I am experiencing this on SEVERAL Servers. And it cannot be that I need NVMe Drives for the very limited amount of work that I am currently doing :rolleyes: .
 
I use the latest ZFS available on the non-production repositories and the problem persists. I think it's a kernel issue, as 6.2 is fine, but 6.5 and 6.8 both show the same high IO for the same work.
 
I don't recall 6.5 being *THIS* problematic. You could be right though, as I'm doing more work now than I used to do back then ...

Definitely, 6.8 is an Issue.

But having to install Kernel 6.2 on the latest Proxmox VE ... that really seems like a Hack.

Did you have to recompile the Kernel? I don't even think it's safe to use that Version (maybe 6.6 or possibly 6.1, since those are LTS).
 
I was running 6.2 as that was the standard Proxmox kernel back then (sorry, no exact date), but when the standard kernel moved to 6.5 I saw the problem and reverted back. Ditto on 6.8.x.
 
Looking at some Posts on this Forum, this may be due to not using Enterprise-Grade SSDs, although to be honest I don't necessarily "buy" this justification.
Indeed, storing vDisks on ZFS requires real datacenter flash drives, because ZFS has massive write amplification and is slow on consumer flash drives, which can't guarantee fast sync writes.
It's not just a few posts about these facts; there are daily posts reminding people of them.
Burned/failed/replaced consumer flash drives on ZFS come up in monthly posts.
BTW, consumer flash drives like your MX500 are fine as a ZFS-mirrored PVE OS boot device, but VM storage will burn through their TBW.
It's better to repurpose your consumer drives as PBS storage, as PBS never writes the same data twice.

still features (Partial) Power Loss Protection
That has nothing to do with write-cache safety.
It just means the controller doesn't lose its FTL after a power loss; some old SSDs required a format after a power loss.
 
If it helps, I'm not using SSDs on my old but serviceable home lab. All spinning rust, but the issue is easy for me to reproduce by just doing a one-off reboot into (say) 6.8. The IO delay goes from its typical 1-3% up to 40-50% for as long as I leave it running. If I reboot back to the older kernel, normal IO rates are restored.
 
Indeed, storing vDisks on ZFS requires real datacenter flash drives, because ZFS has massive write amplification and is slow on consumer flash drives, which can't guarantee fast sync writes.
It's not just a few posts about these facts; there are daily posts reminding people of them.
Burned/failed/replaced consumer flash drives on ZFS come up in monthly posts.
BTW, consumer flash drives like your MX500 are fine as a ZFS-mirrored PVE OS boot device, but VM storage will burn through their TBW.
It's better to repurpose your consumer drives as PBS storage, as PBS never writes the same data twice.


That has nothing to do with write-cache safety.
It just means the controller doesn't lose its FTL after a power loss; some old SSDs required a format after a power loss.
Again you seem to miss the Point ... It is a new occurrence with Proxmox 8 and Kernel 6.8.x (at least for me); it didn't happen before. Never before was I getting hangups when saving a 100KB file with nano!

As for the Write Amplification I can sort of agree. I looked at the TBW and I was VERY shocked: 120TB written on some 5-year-old Systems and, more worryingly, 80TB written on a 1-year-old System!
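(For Reference, this is roughly how I checked the written Totals; the Device Name is just an Example:)
Code:
# MX500 reports host writes in SMART attribute 246 (Total_LBAs_Written, 512-byte sectors as far as I know)
smartctl -A /dev/sda | grep -iE 'Total_LBAs_Written|Percent_Lifetime'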

A few things to do ASAP: turn off atime (set noatime for guests / mounts), possibly install a SLOG/ZIL (I have a 32GB SLC SSD I could use for that).
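Something along these lines, assuming the default rpool name (to be adjusted to the actual Pool and Device of course):
Code:
# disable access-time updates pool-wide (inherited by child datasets)
zfs set atime=off rpool
# add the spare SLC SSD as a separate log device (SLOG); only helps sync writes
zpool add rpool log /dev/disk/by-id/ata-EXAMPLE_32GB_SLC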

As for replacing tens of SSDs with Enterprise ones like you are suggesting, that's absolutely NOT an option; I don't have the Money for that!

And whether this is reported daily etc., that could very well be. But then explain why it didn't occur with older Kernels. And why, as @dmshimself said, it seems to work just fine with Kernel 6.2.x (in his Case at least).
 
So leave ZFS out for your VMs; keep ZFS only on a 30 GB mirrored partition to boot the PVE OS.
What exactly do you mean there? I don't want to change my entire Infrastructure just for the fun of it whenever a new BUG pops up ...
 
You can limit TBW and boost performance by making partitions:
ZFS-mirrored partitions for the PVE OS, around 30 GB.
The free space of the first disk as an LVM-Thin datastore.
The free space of the second disk as an ext4 PBS datastore.
With a twice-daily backup it's no longer a mirror, but the TBW stays reasonable and performance is restored.
If you keep VMs on ZFS on consumer disks, they will get even slower after many TBW ...
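A rough sketch of that layout on a blank pair of disks could look like this (device names, sizes and storage IDs are just examples, not a tested recipe):
Code:
# ~30 GB partition for the PVE OS mirror, rest of each disk as a second partition
sgdisk -n1:0:+30G -n2:0:0 /dev/sda
sgdisk -n1:0:+30G -n2:0:0 /dev/sdb
# (the ZFS mirror for the OS on sda1/sdb1 is normally created by the PVE installer)

# LVM-Thin datastore on the remainder of the first disk
pvcreate /dev/sda2
vgcreate vmdata /dev/sda2
lvcreate --type thin-pool -l 90%FREE -n thinpool vmdata   # leave headroom for metadata
pvesm add lvmthin local-thin --vgname vmdata --thinpool thinpool

# ext4 filesystem on the remainder of the second disk (e.g. for a PBS datastore)
mkfs.ext4 /dev/sdb2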
change my entire infrastructure just for the fun
infrastructure based on ZFS on consumer disks is for fun.
 
You can limit TBW and boost performance by making partitions:
ZFS-mirrored partitions for the PVE OS, around 30 GB.
The free space of the first disk as an LVM-Thin datastore.
The free space of the second disk as an ext4 PBS datastore.
With a twice-daily backup it's no longer a mirror, but the TBW stays reasonable and performance is restored.
If you keep VMs on ZFS on consumer disks, they will get even slower after many TBW ...
LVM is an absolute PITA to manage. I tried to recover previous systems. Never again :rolleyes: !

Why are you so focused on PBS ? I am talking about Proxmox VE, not Proxmox Backup Server.

infrastructure based on ZFS on consumer disks is for fun.
So is having to change a Partition Layout when you already have Data on it ...

Let alone having to set up Backups: instead of a simple zfs send | ssh zfs receive Solution, this would turn into a nightmare with 300 different Variations, Partition Layouts, Filesystems and whatnot!

I'd like to move on with my Lab and Life, not have to go backwards and redo every single thing from Scratch on SEVERAL Systems.

And again, this did NOT happen with previous Versions of Proxmox VE / Kernel ...
 
Again, just to emphasize: while I'm using ZFS for boot and data pools, I'm not using SSDs. The issue (for me) was present the first time I tried 6.5 and also occurs with 6.8. I suspect that if there were a solution for my rig, that same solution would work for people with SSD drives too.
 
I bet the usage is different on your previous systems.
I don't think it's related to the PVE / Kernel version (sometimes there are glitches, but that's not the root cause).
Well, you might have a point, at least to some extent; I am not debating that usage plays a role. I just think that if a new Issue shows up on 3-4 of my Servers after an Update, it's a BUG, not a Feature. What I am debating is that usage is the only cause.

And while I tend to agree that the Issue seems more predominant on Servers where I also run Podman ("Docker" alternative) in a Virtual Machine, with ZFS on top of a ZVOL (with "coordinated" compression/snapshots [NOT both turned on on the HOST AND the GUEST] and autotrim enabled in the Guest), I am pretty sure it also occurs on other Hosts. Probably because Podman/Docker deal with a huge number of very small Files (overlayfs/overlay2 for each Container's "root" "partition" [chroot] that keeps changing all the time, plus Logging of course). On those VMs I see (just had a quick look right now with iostat) around 800 KB/s - 1500 KB/s of writes, plus of course [huge] spikes. And 1.5 MB/s * 3600 s/h * 24 h/day = 129'600 MB/day ~ 130 GB/day. And that's just INSIDE the VM, before any Write Amplification on the HOST/NAND side.
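(For the record, I'm just watching it with something like:)
Code:
iostat -xm 5    # extended per-device stats in MB/s, refreshed every 5 seconds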

And of course all Hosts have ZFS on top of LUKS (and some of them previously were NOT encrypted), so that also contributes to write amplification and I/O.

I am not familiar with LVM Thin, so it's not a Personal attack against you; sorry if it sounded rude yesterday. I am just tired (read: exhausted) of dealing with Issue after Issue and not being able to move forward o_O .

I just gave it a try with all of the pvcreate, lvcreate, vgcreate etc. back then. I just find it EXTREMELY messy (at least to me, who is not used to working with it). It's like pulling teeth ... An mdadm RAID1 Mirror [without LVM] I can use (for /boot and EFI Partitions, formatted ext4 and vfat respectively), and while I still find it more complicated than it needs to be compared to the ZFS command line, it's still manageable. LVM on the other Hand ... I don't know how to explain ... I couldn't figure it out at all. Whether it's just because there is no equivalent of zfs list, or because LVM on top of mdadm (or mdadm on top of LVM) was extremely messy, I cannot remember. It just didn't "fly".


I would be open to doing that on a test System. I have a few [old] SSDs I could set up on a Playground Server and then run some Benchmarks on them (how??? I saw @Dunuin's Posts, e.g. on https://forum.proxmox.com/threads/how-to-best-benchmark-ssds.93543/ - is there something more "automated"?), but I would need a very good Guide (whether it's a Tutorial, a GitHub Repo, a GitHub Gist or something like that).
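(From what I gather in that Thread, a first quick Test would be something along these lines with fio against one of the spare SSDs; Parameters are just a Starting Point:)
Code:
# WARNING: writes directly to the device and destroys any data on it!
# 4k sync random writes for 60 s (device name is an example)
fio --name=syncwrite --filename=/dev/sdX --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=1 --direct=1 --sync=1 \
    --runtime=60 --time_based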

And ... the other step is: how do you back it up? If you run LVM-Thin with ZFS inside the Guest, I guess it could just be zfs send | ssh zfs receive. More complicated (need to configure ~20 SSH Keys and sanoid/etc. + a Backup Server, but doable). Otherwise ... would the Backup be done with dd or what? I'm pretty sure LVM supports Snapshots, as that is one of its advertised Features, but how do you manage them (take a Snapshot and replicate it to a Remote Server)?
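For the ZFS Route, what I mean is basically something like this (Dataset, Snapshot and Host Names are just Placeholders):
Code:
zfs snapshot tank/vm-100-disk-0@2024-05-31
zfs send tank/vm-100-disk-0@2024-05-31 | ssh backuphost zfs receive backup/vm-100-disk-0
# subsequent runs: incremental with "zfs send -i @previous-snapshot ..."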
 
And ... the other step is: how do you back it up? If you run LVM-Thin with ZFS inside the Guest, I guess it could just be zfs send | ssh zfs receive. More complicated (need to configure ~20 SSH Keys and sanoid/etc. + a Backup Server, but doable). Otherwise ... would the Backup be done with dd or what?
Simply use the Proxmox backup feature? I don't use any storage-level snapshot feature.


I'm pretty sure LVM supports Snapshots, as that is one of its advertised Features, but how do you manage them (take a Snapshot and replicate it to a Remote Server)?
You can't export/import LVM snapshots the way you can with ZFS.
(AFAIK, only ZFS and Ceph RBD support this feature.)
 
I bet the usage is different on your previous systems.
I don't think it's related to the PVE / Kernel version (sometimes there are glitches, but that's not the root cause).
On my particular system the workloads with a 6.2 kernel are the same as with the same system running a 6.8 kernel. But the IO delays are much higher with 6.8.
 
On my particular system the workloads with a 6.2 kernel are the same as with the same system running a 6.8 kernel. But the IO delays are much higher with 6.8.
Just to be clear: this is reproducible by changing the kernel back and forth or was this measured before and after an upgrade?
 
Just to be clear: this is reproducible by changing the kernel back and forth or was this measured before and after an upgrade?
I just toggle between the two kernels with proxmox-boot-tool. The upgrades to Proxmox 8.2 were done ages ago, so no upgrade was involved in this switching. My default kernel is 6.2; I then use sudo proxmox-boot-tool kernel pin 6.8.8-2-pve --next-boot and reboot the server. I've been thinking about this on and off since I first saw the problem and I haven't come up with my own answer, so other people's thoughts are appreciated :)
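For reference, the whole toggle is just this (version strings obviously depend on what's installed):
Code:
proxmox-boot-tool kernel list                          # show installed kernels and any pin
proxmox-boot-tool kernel pin 6.8.8-2-pve --next-boot   # boot this kernel once
reboot
# pinning without --next-boot makes it permanent; "proxmox-boot-tool kernel unpin" reverts it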

I've enclosed 2 screenshots of the same machine, one running the 6.2 kernel (my default) and the other running 6.8. The machine is part of a very small home cluster, so I can do pretty much whatever I like to it whenever I like, within reason. Each screenshot was taken after rebooting and then leaving the workload to settle right down. The machine doesn't have very much running on it and the workload doesn't change very much. The workload is mostly home automation via Home Assistant, TV management via Emby, an AdGuard server, a simple file server and no doubt other bits and bobs. I can produce a list if that helps.

The boot drive is a 2-disk ZFS mirror and holds all the VMs and LXCs. The data drive is a larger 2-disk ZFS mirror for TV programs, home document storage and so on. No machines run from this data drive. All drives are spinning rust, no SSDs.

I saw the same issue with 6.5 when I upgraded to it when it first came out. Do let me know of anything else you'd like me to add or grab hold of, as I'd like to be on 6.8.

*update* Let me run through stopping all the VMs and LXCs and stopping other things gradually and see if the load drops off.
 

Attachments

  • 6.8 IO Delay.png
    46.2 KB
  • 6.2 IO Delay.png
    51.7 KB
