Proxmox VE 8.2.2 - High IO delay

silverstone · May 30, 2024

I am observing some very high (>40%, sometimes 80%) IO Delay on Proxmox VE 8.2.2 with pve-no-subscription Repository.

Looking at some Posts over this Forum, this may be due to not using Enterprise-Grade SSD, although to be honest I don't necessarily "buy" this justification.

I am using Crucial MX500 that, while being a Consumer (Budget) SSD, still features (Partial) Power Loss Protection and is still based on TLC NAND, not QLC.

The weird thing is that I am observing this Issue on TWO fairly-recent (relatively speaking) Nodes based on Supermicro X11SSL-F Motherboard with 64GB of RAM and Intel Xeon E3 1240 v5 CPU. Disks are ZFS mirrors of 2 x 1000GB.

Even weirder is that I am NOT observing this issue on other older Nodes based on Supermicro X10SLL-F / X10SLM(+)-F with 32GB of RAM and Intel Xeon E3-1230v3/1240v3/1231v3 CPUs. Similar Disks, even potentially smaller (Disks are ZFS mirrors of 2 x 500GB).

I can see quite a few things in dmesg (see attached File, cannot paste here, message is too long).

I can only assume this is an issue with Kernel 6.8.x and/or ZFS 2.2.3, as previous Versions of Proxmox VE didn't have (as far as I remember) this Issue.

ZFS Version Info

Code:

zfs-2.2.3-pve2
zfs-kmod-2.2.3-pve2

ZFS 2.2.4 was recently released, maybe the upgrade would fix this Issue as well ?

Sometimes I am stuck on "Writing to file" while trying to save a (very small) file I am editing with nano, this definitely does NOT feel normal.

EDIT 1
Not sure if a Crucial MX500 FW Update is required/relevant; I didn't update the old Systems (if it ain't broken, don't fix it).

On the High IO-delay Newer Systems I am running (apparently) FW M3CR043, while on the older Systems I am running (apparently) FW M3CR022.

At least on one of the new/old systems that is (I didn't check ALL systems).

This is also because on the Newer Systems the Crucial MX500 Drive is also newer (Manufacturing Date maybe 2022-2024), vs < 2020 or so ...

EDIT 2
On the Low IO-delay Systems I'm running Kernel 6.5.x:

Code:

Linux XXXXX 6.5.13-5-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.13-5 (2024-04-05T11:03Z) x86_64 GNU/Linux

On the High IO-delay Systems I'm running Kernel 6.8.x:

Code:

Linux YYYYY 6.8.4-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-2 (2024-04-10T17:36Z) x86_64 GNU/Linux

dmshimself · Jun 16, 2024

I have much the same issue with both my nodes in a cluster, but I'm not using SSD, just plain mirrored spinning disks. I see high IO delay with both the 6.8 and 6.5 kernels, but if I flick back to kernel 6.2, all is well.

silverstone · Jul 11, 2024

Any update from Proxmox Developers would be appreciated. I am experiencing this on SEVERAL Servers. And it cannot be that I need NVMe Drives for the very limited amount of work that I am currently doing.

dmshimself · Jul 11, 2024

I use the latest ZFS available on the non-production repositories and the problem persists. I think it's a kernel issue as 6.2 is fine, but 6.5 and 6.8 both show the same high IO for the same work.

silverstone · Jul 11, 2024

I don't recall 6.5 being *THIS* Problematic. You could be right though, as I'm doing some more work now than I used to do back then ...

Definitely 6.8 is an Issue.

But having to install Kernel 6.2 on the latest Proxmox VE ... That seems really a Hack.

Did you have to recompile the Kernel ? I don't even think it's safe to use that Version (maybe 6.6 or possibly 6.1 since those are LTS).

dmshimself · Jul 11, 2024

I was running 6.2 as that was the standard proxmox kernel back then (sorry no exact date), but when the standard kernel moved to 6.5 I saw the problem and reverted back. Ditto on 6.8.x

_gabriel · Jul 11, 2024

silverstone said:
Looking at some Posts over this Forum, this may be due to not using Enterprise-Grade SSD, although to be honest I don't necessarily "buy" this justification.

Indeed, storing vDisks on ZFS requires real datacenter flash disks, because ZFS has massive write amplification and because ZFS is slow on consumer flash drives, as they can't guarantee fast writes.
It's not a few posts about these facts; there are daily posts as a reminder of them.
Burned/failed/replaced consumer flash drives on ZFS come up in posts every month.
BTW, consumer flash drives like your MX500 are OK as a ZFS mirror for booting the PVE OS, but VM storage will burn through their TBW.
It's better to convert your consumer drives to PBS storage, as PBS never writes the same data twice.

silverstone said:
still features (Partial) Power Loss Protection

That has nothing to do with write-cache safety.
It just ensures the controller doesn't lose its FTL after power loss. Some old SSDs required a format after power loss.

dmshimself · Jul 11, 2024

If it helps, I'm not using SSD on my old but serviceable home lab. All spinning rust, but the issue is easy for me to reproduce by just doing a one-off reboot into (say) 6.8. The IO delay goes from its typical 1-3% up to 40-50% for as long as I leave it running. If I reboot back to the older kernel, normal IO rates are restored.
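A comparison like this can be put into numbers instead of screenshots. Below is a minimal sketch, assuming the "IO delay" figure roughly corresponds to the iowait share of CPU time reported in the aggregate `cpu` line of /proc/stat. The function names and the two sample lines are made up for illustration; on a real host you would read the first line of /proc/stat twice, some seconds apart, under each kernel.

```python
# Sketch: estimate the iowait share of CPU time between two /proc/stat samples.
# Assumption: the "IO delay" graph roughly corresponds to this iowait share.

def parse_cpu_line(line):
    """Return the counters from an aggregate 'cpu' line of /proc/stat as ints."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    # Counter order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted in user and nice, so sum only the first 8.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


# Made-up sample lines, standing in for two reads of /proc/stat taken some seconds apart.
SAMPLE_BEFORE = "cpu  1000 0 500 8000 100 0 0 0 0 0"
SAMPLE_AFTER = "cpu  1100 0 550 8400 550 0 0 0 0 0"

print(f"iowait share: {iowait_percent(SAMPLE_BEFORE, SAMPLE_AFTER):.1f}%")
```

Running the same measurement for the same idle period after booting each kernel would give two directly comparable percentages.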

silverstone · Jul 11, 2024

_gabriel said:
Indeed, storing vDisks on ZFS requires real datacenter flash disks, because ZFS has massive write amplification and because ZFS is slow on consumer flash drives, as they can't guarantee fast writes.
It's not a few posts about these facts; there are daily posts as a reminder of them.
Burned/failed/replaced consumer flash drives on ZFS come up in posts every month.
BTW, consumer flash drives like your MX500 are OK as a ZFS mirror for booting the PVE OS, but VM storage will burn through their TBW.
It's better to convert your consumer drives to PBS storage, as PBS never writes the same data twice.

That has nothing to do with write-cache safety.
It just ensures the controller doesn't lose its FTL after power loss. Some old SSDs required a format after power loss.

Again you seem to miss the Point ... It is a new occurrence with Proxmox 8 and Kernel 6.8.x (at least for me), it didn't happen before. Never before was I getting hangups when saving a 100KB file with nano !

As for the Write Amplification I can sort of agree, I looked at the TBW and I was VERY shocked. Got 120TBW on some 5 year old Systems and more worryingly 80TBW on a 1 year old System !
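To put those two figures on the same scale, here is a minimal sketch that turns total terabytes written into an average per day. It assumes the TBW values are totals read from the drives (decimal units, 1 TB = 1000 GB) and that the system ages are only approximate; the function name is made up for illustration.

```python
def average_daily_writes_gb(total_tb_written, years_in_service):
    """Average GB written per day, given total TB written and approximate age in years."""
    days = years_in_service * 365
    return total_tb_written * 1000 / days


# Figures quoted above: 120 TB over about 5 years, 80 TB over about 1 year.
older = average_daily_writes_gb(120, 5)
newer = average_daily_writes_gb(80, 1)

print(f"older systems: {older:.0f} GB/day")
print(f"newer system:  {newer:.0f} GB/day")
print(f"ratio:         {newer / older:.1f}x")
```

On these rough numbers the newer system averages about 219 GB/day against about 66 GB/day on the older ones, a bit more than three times the write rate.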

Few things to do ASAP: Turn off atime (set noatime for guests / mount), possibly install a SLOG/ZIL (I have a SLC 32GB SSD I could use for that).

As for replacing tens of SSDs with enterprise ones like you are suggesting, that's absolutely NOT an option, I don't have the Money for that !

And whether this is reported daily etc, that could very well be. But then explain why it didn't occur with older Kernels. And why, as @dmshimself said, it seems to work just fine with Kernel 6.2 (in his Case at least).

_gabriel · Jul 11, 2024

silverstone said:
I don't have the Money for that !

So leave ZFS for your VM , keep ZFS on 30 GB mirrored partition only to boot PVE OS.

silverstone · Jul 11, 2024

_gabriel said:
So leave ZFS for your VM , keep ZFS on 30 GB mirrored partition only to boot PVE OS.

What exactly do you mean there ? I don't want to change my entire infrastructure just for the fun of it whenever there is a new BUG popping up ....

_gabriel · Jul 12, 2024

You can limit TBW and boost performance by making partitions:
ZFS mirrored partitions for the PVE OS, like 30 GB
free space of the first disk as LVM-Thin datastore
free space of the second disk as ext4 PBS datastore
With a twice-a-day backup it's no longer a mirror, but TBW will be regular and performance restored.
If you keep VMs on ZFS on consumer disks, they will get even slower after many TBW ...

silverstone said:
change my entire infrastructure just for the fun

infrastructure based on ZFS on consumer disks is for fun.

silverstone · Jul 12, 2024

_gabriel said:
You can limit TBW and boost performance by making partitions:
ZFS mirrored partitions for the PVE OS, like 30 GB
free space of the first disk as LVM-Thin datastore
free space of the second disk as ext4 PBS datastore
With a twice-a-day backup it's no longer a mirror, but TBW will be regular and performance restored.
If you keep VMs on ZFS on consumer disks, they will get even slower after many TBW ...

LVM is an absolute PITA to manage. I tried to recover previous systems. Never again!

Why are you so focused on PBS ? I am talking about Proxmox VE, not Proxmox Backup Server.

_gabriel said:
infrastructure based on ZFS on consumer disks is for fun.

So is having to change a Partition Layout when you have Data on it already ...

Let alone setting up backups: instead of a simple zfs send | ssh zfs receive Solution, this would turn into a nightmare with 300 different Variations, partition layouts, filesystems and whatnot !

I'd like to move on with my Lab and Life, not having to go backwards and redo every single thing from Scratch on SEVERAL Systems.

And again, this did NOT happen with previous Versions of Proxmox VE / Kernel ...

dmshimself · Jul 12, 2024

Again just to emphasize that while I'm using ZFS for boot and data pools, I'm not using SSD. The issue (for me) was present the first time I tried 6.5 and also occurs with 6.8. I suspect that if there were to be a solution for my rig, that same solution would work for people with SSD drives too.

_gabriel · Jul 12, 2024

silverstone said:
this did NOT happen with previous Versions of Proxmox VE / Kernel ...

I bet the usage is different on your previous systems.
I don't think it's related to the PVE / Kernel version (sometimes there are some glitches, but that's not the root cause).

silverstone · Jul 12, 2024

_gabriel said:
I bet the usage is different on your previous systems.
I don't think it's related to the PVE / Kernel version (sometimes there are some glitches, but that's not the root cause).

Well you might have a point, at least to some extent, I am not debating that. I just think that if a new Issue shows up on 3-4 Servers of mine after an Update, it's a BUG, not a feature. I am debating that being the only cause.

And while I tend to agree that the Issue seems more predominant in Servers where I also run Podman ("Docker" alternative) in a Virtual Machine, with ZFS on top of a ZVOL (with "coordinated" compression/snapshots [NOT BOTH turned on on the HOST AND GUEST] and autotrim enabled in the guest), I am pretty sure it also occurs in other Hosts. The Podman case is probably worse because Podman/Docker deal with a huge number of very small files (overlay/overlay2 for each Container's "root" "partition" [chroot], which keep changing all the time, plus logging of course). On those VMs I see (just had a quick look right now with iostat) around 800 KB/s - 1500 KB/s writes. Plus of course there will be [huge] spikes. And 1.5 MB/s * 3600 s/h * 24 h/day = 129'600 MB/day ~ 130 GB/day. And that's just INSIDE the VM. Let alone all the write amplification AFTER that on the HOST/NAND side.
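The arithmetic above can be checked for both ends of the observed iostat range with a small sketch. It assumes a perfectly sustained rate (no spikes) and decimal units (1 MB = 1000 KB, 1 GB = 1000 MB); the function name is made up for illustration, and the result covers guest-side writes only, before any write amplification on the host.

```python
def daily_write_volume_gb(rate_mb_per_s):
    """Convert a sustained write rate in MB/s into GB written per day (decimal units)."""
    seconds_per_day = 3600 * 24
    return rate_mb_per_s * seconds_per_day / 1000


# Low and high end of the range seen with iostat inside the VM (800 KB/s and 1500 KB/s).
for rate in (0.8, 1.5):
    print(f"{rate} MB/s -> {daily_write_volume_gb(rate):.1f} GB/day")
```

So even the low end of that range works out to roughly 69 GB/day inside the guest, and the high end to the ~130 GB/day quoted above.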

And of course all Hosts have ZFS on top of LUKS (and some of them previously were NOT encrypted), so that also contributes to write amplification and I/O.

I am not familiar with LVM Thin, so it's not a Personal attack against you, sorry if it sounded rude yesterday. I am just tired (read: exhausted) of dealing with Issue after Issue and not being able to move forward.

I just gave it a try with all of the pvcreate, lvcreate, vgcreate etc back then. I just find it EXTREMELY messy (at least to me, who is not used to working with it). It's like pulling teeth ... mdadm RAID1 Mirror [without LVM] I can use (for /boot and efi Partitions, ext4 and vfat Formatted respectively) and while I still find it more complicated than it needs to be compared to the ZFS command line, it's still manageable. LVM on the other Hand ... I don't know how to explain ... I couldn't Figure it out at all. Whether it's just because there is no equivalent of zfs list or because LVM on top of mdadm (or mdadm on top of LVM) was extremely messy, I cannot remember. It just didn't "fly".

I would be open to do that on a test System. I have a few [old] SSDs I could setup on a Playground Server and then run some Benchmark on them (how ??? I saw @Dunuin Posts e.g. on https://forum.proxmox.com/threads/how-to-best-benchmark-ssds.93543/, is there something more "automated" ?), but I would need a very good Guide (whether it's Tutorial, some Github Repo, a GitHub GIST or something like that).

And ... the other step is how do you back it up ? If you run LVM thin with ZFS inside the Guest, I guess it could be just zfs send | ssh zfs receive. More complicated (need to configure ~ 20 SSH keys and sanoid/etc + backupserver but doable). Otherwise ... Would the backup be done with dd or what ? Pretty sure LVM supports Snapshots as that is one of its advertised Features, but how do you manage them (take a snapshot and replicate to a Remote Server) ?

spirit · Jul 12, 2024

silverstone said:
And ... the other step is how do you back it up ? If you run LVM thin with ZFS inside the Guest, I guess it could be just zfs send | ssh zfs receive. More complicated (need to configure ~ 20 SSH keys and sanoid/etc + backupserver but doable). Otherwise ... Would the backup be done with dd or what ?

Simply use the Proxmox backup feature ? I don't use any snapshot feature of the storage.

silverstone said:
Pretty sure LVM supports Snapshots as that is one of its advertised Features, but how do you manage them (take a snapshot and replicate to a Remote Server)

You can't export|import LVM snapshots the way you can with ZFS.
(AFAIK, only ZFS and Ceph RBD support this feature.)

dmshimself · Jul 12, 2024

_gabriel said:
I bet the usage is different on your previous systems.
I don't think it's related to the PVE / Kernel version (sometimes there are some glitches, but that's not the root cause).

On my particular system the workloads with a 6.2 kernel are the same as with the same system running a 6.8 kernel. But the IO delays are much higher with 6.8.

LnxBil · Jul 13, 2024

dmshimself said:
On my particular system the workloads with a 6.2 kernel are the same as with the same system running a 6.8 kernel. But the IO delays are much higher with 6.8.

Just to be clear: this is reproducible by changing the kernel back and forth or was this measured before and after an upgrade?

dmshimself · Jul 14, 2024

LnxBil said:
Just to be clear: this is reproducible by changing the kernel back and forth or was this measured before and after an upgrade?

I just toggle between the two kernels with proxmox-boot-tool. The upgrades to Proxmox 8.2 were done ages ago, so no upgrade was involved in this switching. My default kernel is 6.2; I then use sudo proxmox-boot-tool kernel pin 6.8.8-2-pve --next-boot and reboot the server. I've been thinking on and off since I saw this problem and I haven't come up with my own answer, so other people's thoughts are appreciated.

I've enclosed 2 screenshots of the same machine, one running a 6.2 kernel, my default, and the other running 6.8. The machine is part of a very small home cluster, so I can do pretty much what I like to it whenever I like, within reason. Each screenshot was taken after rebooting and then leaving the workload to settle right down. The machine doesn't have very much running on it and the workload doesn't change very much. The workload is mostly home automation via Home Assistant, TV management via Emby, an adguard server, simple file server and no doubt other bits and bobs. I can produce a list if that helps.

The boot drive is a 2-disk ZFS mirror and holds all the VMs and LXCs. The data drive is a larger 2-disk ZFS mirror for TV programs, home document storage and so on. No machines are run from this data drive. All drives are spinning rust, no SSDs.

I saw the same issue with 6.5 when I upgraded to that when it first came out. Do let me know of anything else you'd like me to add or grab hold of, as I'd like to be on 6.8.

*update* Let me run through stopping all the Vms and LXCs and stopping other things gradually and see if the load drops off.....
