I am running a small ZFS pool (2x 2TB SATA SSDs) on my server.
There is not much running on it, and it operated fine for a long time (about 1.5 years).
A week ago, while adding another node to my cluster, I upgraded the kernel - from the old one (not sure which version) to 6.1.15-1-pve.
I did some reconfiguration of the machines, and after another half day of operation I rebooted - which loaded the new kernel.
Since then I have 40-70% IO delay blocking the CPU (Ryzen 5 5600G), everything takes ages, all virtual machines started to report systemd problems with writing to the journal, etc.
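For reference, this is roughly how I watch the IO load while it happens (assuming sysstat is installed and the kernel exposes pressure stall information under /proc/pressure):

  # per-device latency and utilisation, refreshed every 5 seconds
  iostat -x 5
  # pool-level latency breakdown (OpenZFS)
  zpool iostat -vly 5
  # overall IO pressure / stall information
  cat /proc/pressure/io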
I added 2 more SSDs and moved the virtual machines one by one from ZFS to an ext4 directory storage.
All of them now run with <1% IO delay on the CPU.
As soon as I try to move a 60GB disk image from the ext4 partition back to the re-created ZFS pool, I am hit with 40+% IO delay until the whole copy process stalls.
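The move itself is nothing special - done via the GUI or on the CLI, roughly like this (VM ID, disk and storage names are just examples from my setup, and the exact options may differ between PVE versions):

  # move the VM disk from the ext4 directory storage back to the ZFS pool
  qm disk move 101 scsi0 local-zfs --delete 1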
There is no error message in the journal or dmesg; the kernel just reports a task timeout and that's it.
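By timeout I mean the usual hung task messages; I look for them with something like:

  # kernel ring buffer with readable timestamps
  dmesg -T | grep -i "blocked for more than"
  # same thing from the journal for the current boot
  journalctl -k -b | grep -i "blocked for more than"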
ZFS becomes completely unusable.
During the investigation I did the same thing on the 2nd node, where I have a ZFS pool with just 1 NVMe SSD (2TB), and the result is exactly the same - super-high IO delays.
So now I am without ZFS, everything runs only from ext4, and I cannot use any replication or automated backups.
Is it just me (e.g. specific HW), or is this a common and known problem?
I tried searching the internet and found similar problems reported 1+ year ago, where the only solution was to switch to an older kernel, but in my case using 6.8.12-17-pve seems to create similar issues and high IO load. I can try to go back to 6.8.12-8-pve, but I am not sure it will help.
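If downgrading is the way to go, my plan is to pin the older kernel with proxmox-boot-tool, roughly like this (assuming the 6.8.12-8-pve package is still installed - otherwise it has to be reinstalled from the PVE repo first - and that the host supports kernel pinning via proxmox-boot-tool):

  # list the kernels the boot tool knows about
  proxmox-boot-tool kernel list
  # pin the older kernel so it is selected on the next boot
  proxmox-boot-tool kernel pin 6.8.12-8-pve
  reboot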