[KNOWN ZFS PROBLEM] Freezing under high IO (only on host, not in guest)

Yes, some time soon (but note that the difference only consists of some minor, unrelated bug fixes).
I upgraded to 0.7.6 on Friday, and after 2 days of testing I haven't had any of the slowdown/freezing problems I saw with previous versions. I was able to do all backups, including the largest VM, and everything ran smoothly with no system load spikes. I will keep testing through the week and report back if the problem comes back.
 
Hello gentlemen.

Found this thread after encountering this problem.

I have 2 proxmox installs. I updated both of them hoping to resolve this issue.

1. I had an /etc/modprobe.d/zfs.conf file configured as documented here: https://pve.proxmox.com/wiki/ZFS_on_Linux. After running apt-get dist-upgrade, the ZFS portion of the upgrade threw a ton of errors saying it couldn't process the file, then completed successfully. However, when I grep for ZFS I'm still getting: Loaded module v0.7.3-1, ZFS pool version 5000

The update seems to have partially failed and not replaced the module. I removed the file and tried again, but it seems to think everything is fine. How can I force the update again?

2. The second install doesn't have that problem and is now loading v0.7.6-1, as this thread suggests it should. However, I'm still seeing significant IO latency when duplicating, backing up, or doing mass IO operations.

Both systems run NVMe drives for the pools. System 1 runs 6 mixed Samsung NVMe drives; system 2 runs 4 Toshiba NVMe drives. Neither runs a SLOG, as I think the drives are plenty fast to keep up, and both systems have plenty of memory. System 1 does, however, run dedup and compression, which I thought was the initial cause of the IO problem, but system 2 runs neither and still has the same issue.

Thanks for your time.
 
Hello gentlemen.

Found this thread after encountering this problem.

I have 2 proxmox installs. I updated both of them hoping to resolve this issue.

1. I had an /etc/modprobe.d/zfs.conf file configured as documented here: https://pve.proxmox.com/wiki/ZFS_on_Linux. After running apt-get dist-upgrade, the ZFS portion of the upgrade threw a ton of errors saying it couldn't process the file, then completed successfully. However, when I grep for ZFS I'm still getting: Loaded module v0.7.3-1, ZFS pool version 5000

The update seems to have partially failed and not replaced the module. I removed the file and tried again, but it seems to think everything is fine. How can I force the update again?

There is no "ZFS portion of the upgrade" in PVE; you likely misconfigured something and have zfs-dkms packages from stock Debian (backports?) installed. Verify that all the spl- and zfs-related packages have "pve" somewhere in their version number.
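To illustrate the check: the installed versions can be pulled with dpkg-query and inspected for "pve". The helper below is hypothetical, just to show what a Proxmox build's version string looks like next to a stock Debian one:

```shell
# Hypothetical helper to illustrate the check: a Proxmox build of the
# spl/zfs packages carries "pve" in its version string; a stock Debian
# or backports build does not.
is_pve_build() {
    case "$1" in
        *pve*) echo "pve build" ;;
        *)     echo "NOT a pve build - reinstall from the Proxmox repo" ;;
    esac
}

# On a real host you would feed it the installed versions, e.g.:
#   dpkg-query -W -f '${Package} ${Version}\n' zfsutils-linux spl
is_pve_build "0.7.6-pve1~bpo9"    # -> pve build
is_pve_build "0.7.6-1~bpo9+1"     # -> NOT a pve build - reinstall from the Proxmox repo
```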

2. The second install doesn't have that problem and is now loading v0.7.6-1, as this thread suggests it should. However, I'm still seeing significant IO latency when duplicating, backing up, or doing mass IO operations.

Both systems run NVMe drives for the pools. System 1 runs 6 mixed Samsung NVMe drives; system 2 runs 4 Toshiba NVMe drives. Neither runs a SLOG, as I think the drives are plenty fast to keep up, and both systems have plenty of memory. System 1 does, however, run dedup and compression, which I thought was the initial cause of the IO problem, but system 2 runs neither and still has the same issue.

Thanks for your time.

ZFS' internal scheduling can sometimes get out of whack - it offers a lot of knobs to fine-tune it (see "man zfs-module-parameters"). Deduplication is a bad choice in almost all situations; (lz4/on) compression is a good choice unless your disks have a big performance advantage over your CPUs.
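For illustration, those knobs live in /etc/modprobe.d/zfs.conf; a sketch with placeholder values (the numbers below are examples, not recommendations - check "man zfs-module-parameters" before changing anything):

```
# /etc/modprobe.d/zfs.conf - example values only
# Cap the ARC at 8 GiB so the host keeps memory free for VMs:
options zfs zfs_arc_max=8589934592
# Lower the per-vdev async write queue depth (default 10):
options zfs zfs_vdev_async_write_max_active=5
# Run "update-initramfs -u" afterwards so the options also apply at boot.
```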
 
When I did an "apt-get dist-upgrade", the following 2 packages were attempted to be updated:
zfs-initramfs & zfsutils-linux. It failed on the first machine and passed on the second; the version numbering changed on the second. I'm not sure how to repeat the procedure, as it seems to think that it passed fine.

I'm getting some pretty silly deduplication rates (11.4x), and performance has been great. I run a lot of VMs for the sake of testing installation software, and the VMs all end up basically the same. The installs I set up take approximately an hour to run, and each install copies over 20GB of data over the network in about a minute. I've had no problems beyond Windows 10 being approximately 50% slower in my tests compared to Windows 7. I haven't figured that out, but it doesn't bother me that much. On hardware this problem doesn't exist, so I don't really care.

Deduplication/compression is only on in the first system. The second system doesn't need it and won't benefit from it. The IO delays in backup/clone operations happen on both systems, even though they are very different.
 
When I did an "apt-get dist-upgrade", the following 2 packages were attempted to be updated:
zfs-initramfs & zfsutils-linux. It failed on the first machine and passed on the second; the version numbering changed on the second. I'm not sure how to repeat the procedure, as it seems to think that it passed fine.

Please check /var/log/apt/history.log and post the relevant parts.

I'm getting some pretty silly deduplication rates (11.4x), and performance has been great. I run a lot of VMs for the sake of testing installation software, and the VMs all end up basically the same. The installs I set up take approximately an hour to run, and each install copies over 20GB of data over the network in about a minute. I've had no problems beyond Windows 10 being approximately 50% slower in my tests compared to Windows 7. I haven't figured that out, but it doesn't bother me that much. On hardware this problem doesn't exist, so I don't really care.

The problem with deduplication is that it needs to keep a table in memory that grows with the number of blocks. This gets more and more expensive the more data you (physically) store, and there is no way around it with the current implementation.
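As a rough illustration of that cost (the ~320 bytes per unique block figure is a commonly quoted rule of thumb, not an exact number - the real entry size depends on pool layout):

```shell
# Rough illustration: the dedup table (DDT) costs on the order of
# ~320 bytes of RAM per unique block (a commonly quoted rule of thumb,
# not an exact figure). For 1 TiB of unique data at 128 KiB recordsize:
blocks=$(( (1024 * 1024 * 1024 * 1024) / (128 * 1024) ))
ddt_bytes=$(( blocks * 320 ))
echo "$blocks blocks -> $(( ddt_bytes / 1024 / 1024 )) MiB of DDT per TiB stored"
# Smaller blocks (e.g. zvols with 8 KiB volblocksize) multiply this cost.
```

With zvols at small volblocksize the per-TiB figure grows by an order of magnitude, which is why dedup RAM usage surprises people.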

Deduplication/compression is only on in the first system. The second system doesn't need it and won't benefit from it. The IO delays in backup/clone operations happen on both systems, even though they are very different.

Is the cloning offline or online? You can easily rate-limit the backup jobs if they are causing too much interference with regular I/O (see "bwlimit" in vzdump.cfg).
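For reference, the rate limit is a one-line setting; a sketch (the value is illustrative, in KiB/s, and on current PVE the file is usually /etc/vzdump.conf):

```
# /etc/vzdump.conf - global backup settings
# Limit vzdump's bandwidth to ~50 MiB/s (value in KiB/s; pick what your pool tolerates):
bwlimit: 51200
```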
 
It appears I did an upgrade rather than a dist-upgrade.

Start-Date: 2018-02-27 18:54:09
Commandline: apt-get upgrade
Upgrade: libdns-export162:amd64 (1:9.10.3.dfsg.P4-12.3+deb9u3, 1:9.10.3.dfsg.P4-12.3+deb9u4), libpve-access-control:amd64 (5.0-7, 5.0-8), libisccfg140:amd64 (1:9.10.3.dfsg.P4-12.3+deb9u3, 1:9.10.3.dfsg.P4-12.3+deb9u4), linux-libc-dev:amd64 (4.9.65-3+deb9u2, 4.9.82-1+deb9u2), cpp-6:amd64 (6.3.0-18, 6.3.0-18+deb9u1), pve-qemu-kvm:amd64 (2.9.1-5, 2.9.1-9), libirs141:amd64 (1:9.10.3.dfsg.P4-12.3+deb9u3, 1:9.10.3.dfsg.P4-12.3+deb9u4), bind9-host:amd64 (1:9.10.3.dfsg.P4-12.3+deb9u3, 1:9.10.3.dfsg.P4-12.3+deb9u4), dnsutils:amd64 (1:9.10.3.dfsg.P4-12.3+deb9u3, 1:9.10.3.dfsg.P4-12.3+deb9u4), pve-docs:amd64 (5.1-15, 5.1-16), pve-ha-manager:amd64 (2.0-4, 2.0-5), zfs-initramfs:amd64 (0.7.3-pve1~bpo9, 0.7.6-pve1~bpo9), libisc160:amd64 (1:9.10.3.dfsg.P4-12.3+deb9u3, 1:9.10.3.dfsg.P4-12.3+deb9u4), pve-container:amd64 (2.0-18, 2.0-19), libquadmath0:amd64 (6.3.0-18, 6.3.0-18+deb9u1), gcc-6-base:amd64 (6.3.0-18, 6.3.0-18+deb9u1), pve-cluster:amd64 (5.0-19, 5.0-20), libgcc1:amd64 (1:6.3.0-18, 1:6.3.0-18+deb9u1), zfsutils-linux:amd64 (0.7.3-pve1~bpo9, 0.7.6-pve1~bpo9), libisc-export160:amd64 (1:9.10.3.dfsg.P4-12.3+deb9u3, 1:9.10.3.dfsg.P4-12.3+deb9u4), libvorbisenc2:amd64 (1.3.5-4, 1.3.5-4+deb9u1), spl:amd64 (0.7.3-pve1~bpo9, 0.7.6-pve1~bpo9), libgfortran3:amd64 (6.3.0-18, 6.3.0-18+deb9u1), libzfs2linux:amd64 (0.7.3-pve1~bpo9, 0.7.6-pve1~bpo9), liblwres141:amd64 (1:9.10.3.dfsg.P4-12.3+deb9u3, 1:9.10.3.dfsg.P4-12.3+deb9u4), libpve-common-perl:amd64 (5.0-25, 5.0-28), qemu-server:amd64 (5.0-18, 5.0-22), iproute2:amd64 (4.10.0-1, 4.13.0-3), libdns162:amd64 (1:9.10.3.dfsg.P4-12.3+deb9u3, 1:9.10.3.dfsg.P4-12.3+deb9u4), libzpool2linux:amd64 (0.7.3-pve1~bpo9, 0.7.6-pve1~bpo9), libxml2:amd64 (2.9.4+dfsg1-2.2+deb9u1, 2.9.4+dfsg1-2.2+deb9u2), libisccc140:amd64 (1:9.10.3.dfsg.P4-12.3+deb9u3, 1:9.10.3.dfsg.P4-12.3+deb9u4), libvorbis0a:amd64 (1.3.5-4, 1.3.5-4+deb9u1), libbind9-140:amd64 (1:9.10.3.dfsg.P4-12.3+deb9u3, 1:9.10.3.dfsg.P4-12.3+deb9u4), libnvpair1linux:amd64 (0.7.3-pve1~bpo9, 
0.7.6-pve1~bpo9), libtasn1-6:amd64 (4.10-1.1, 4.10-1.1+deb9u1), libuutil1linux:amd64 (0.7.3-pve1~bpo9, 0.7.6-pve1~bpo9), libstdc++6:amd64 (6.3.0-18, 6.3.0-18+deb9u1), libcurl3-gnutls:amd64 (7.52.1-5+deb9u3, 7.52.1-5+deb9u4), lxcfs:amd64 (2.0.8-1, 2.0.8-2)
End-Date: 2018-02-27 18:55:08

Memory is not an issue for me in the system that uses dedup. I agree with the conclusion that dedup is not worth it for most people using spinning-disk media. However, flash is expensive, and if you need fast IO in heaps, memory becomes pretty cheap: I can get 32GB of ECC memory for the price of 1TB of flash that's fast enough to handle the writes I require. SATA drives generally give up under the loads that I expect. At that point it becomes a tradeoff, and I am limited much more by PCIe availability than by the amount of RAM I can fit in 16 slots.

Cloning causes the issue both offline and online. I initially did an online clone and figured it was a byproduct of locking the image, but I followed it up with an offline clone and had the same issue. I'll look into bwlimit. Thanks for that.
 
Problem still exists on Proxmox
Kernel Version
Linux 4.15.18-4-pve #1 SMP PVE 4.15.18-23 (Thu, 30 Aug 2018 13:04:08 +0200)
PVE Manager Version pve-manager/5.2-9/4b30e8f9

filename: /lib/modules/4.15.18-4-pve/zfs/zfs.ko
version: 0.7.9-1
license: CDDL
author: OpenZFS on Linux
description: ZFS
srcversion: 942E3E6D877636C437456E0
depends: spl,znvpair,zcommon,zunicode,zavl,icp
retpoline: Y
name: zfs
vermagic: 4.15.18-4-pve SMP mod_unload modversions


Any news or updates on this problem? ZFS is currently not usable with this bug.
Is there any workaround?
 
Dear,

I found some indication that the issue comes from a too-low kernel CONFIG_HZ parameter in Proxmox (it should be 1000 to prevent this from happening, but Proxmox uses 100/250).

Can anyone confirm this?
 
I am experiencing the same issues and can confirm that the currently patched 5.4.7 Proxmox kernel has CONFIG_HZ=100 configured.
IO wait is high under light(ish) loads.
txg_sync shows high CPU usage.

Dear,

I found some indication that the issue comes from a too-low kernel CONFIG_HZ parameter in Proxmox (it should be 1000 to prevent this from happening, but Proxmox uses 100/250).

Can anyone confirm this?
 
I found some indication that the issue comes from a too-low kernel CONFIG_HZ parameter in Proxmox (it should be 1000 to prevent this from happening, but Proxmox uses 100/250).
You mean from that thread?
https://forum.proxmox.com/threads/kernel-4-15-18-8-zfs-freeze.48938/#post-244733

As already stated there:
We're using dynamic ticks now, so you already get anything between 100 and 1500 depending on needs, see:
https://elinux.org/Kernel_Timer_Systems#Dynamic_ticks

I am experiencing the same issues and confirm currently patched 5.4.7 proxmox has CONFIG_HZ of 100 configured in the kernel.
See my quote above... Also, he did not ask for confirmation of whether the kernel has some specific tick rate, but whether changing that fixes the issue...
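To see what a kernel was actually built with, the config can be inspected directly; the sketch below inlines a sample config for illustration - on a real PVE host you would grep /boot/config-$(uname -r) instead:

```shell
# Sketch: inspecting the tick configuration of the running kernel. The
# config is inlined here for illustration; on a real host you would
# grep /boot/config-$(uname -r) instead.
cat <<'EOF' > /tmp/kconfig-sample
CONFIG_HZ=100
CONFIG_NO_HZ_IDLE=y
CONFIG_NO_HZ=y
EOF
# CONFIG_NO_HZ* entries mean dynamic ticks are in use, so the base
# CONFIG_HZ value alone does not determine the effective timer rate.
grep -E '^CONFIG_(HZ=|NO_HZ)' /tmp/kconfig-sample
```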
 
I'm running 6.0.4 and I'm having the same issue.

It was worse (it didn't require much IO to crash the host) on LVM on top of an mdadm mirror, so I moved to a ZFS mirror. It is still happening (but it takes a lot more IO to freeze the host - usually pveproxy and pvedaemon crash).

I have a ticket open with support and am patiently waiting for a solution (I get an average of 3 to 4 replies from them per week, and most of the replies make no progress towards a solution).
 
@jkalousek and @pmxpms, are you guys by chance not using HPET (instead using TSC for the system timer)?

This could affect the CONFIG_HZ issue; see https://www.advenage.com/topics/linux-timer-interrupt-frequency

The linked post is very old: it talks about Ubuntu 10.04 as the most recent distribution and about 2.6-based kernels. The information in that post is not really valid anymore. Again, as said:

We're using dynamic ticks now, so you already get anything between 100 and 1500 depending on needs, see:
https://elinux.org/Kernel_Timer_Systems#Dynamic_ticks

It was worse (it didn't require much IO to crash the host) on LVM on top of an mdadm mirror, so I moved to a ZFS mirror. It is still happening (but it takes a lot more IO to freeze the host - usually pveproxy and pvedaemon crash).

I have a ticket open with support and am patiently waiting for a solution (I get an average of 3 to 4 replies from them per week, and most of the replies make no progress towards a solution).

If you see those crashes with two fundamentally different storage setups, then I'd guess it's really hardware-related (either a hardware issue, or an incompatibility between the hardware and the kernel) or some strange bug in a lower-level kernel IO code path, as otherwise LVM and MD really do not share code with ZFS...
 
JUST FYI

I had such issues before, but they were not related to ZFS or the disk itself.
Instead it's NFS and the kernel.

Here's what happened: NFS was mounted via automount. One of the NFS shares went away, but I forgot to disable the automount.
The host suddenly dropped its I/O and the throughput of the hard disks.
For comparison: a 4-drive RAID10 went from 272 MB/s down to a crawl of 5 MB/s.

Removing the automount solved the whole thing. It's simply NFS locking up the kernel.
So if you mount NFS and have any issues with those mounts (source no longer there, or source under high load, ...), your system may lock up if anything tries to use that NFS share.

In my case automount did that (it would be less of an issue if it were mounted via fstab and not used at all).
 
The linked post is very old: it talks about Ubuntu 10.04 as the most recent distribution and about 2.6-based kernels. The information in that post is not really valid anymore.

I know it is old. I only wanted to bring attention to how using an alternate clocksource (TSC in my case, instead of HPET) could affect the same part of the system that CONFIG_HZ affects. Most machines today use HPET. For some reason, my machine (8th-gen Xeon, very new) has clock drift when on HPET with no kernel cmdline set, so I had to add "tsc=reliable" to the kernel cmdline.

UPDATE: I changed the cmdline to "tsc=unstable" per the recommendation shown in dmesg on Debian 10. I am now running the HPET clocksource without time drift. I will update again if this solves the main issue (system unresponsiveness during heavy I/O).
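For anyone wanting to try the same: kernel cmdline flags like these are set via GRUB (a sketch; keep your existing options on the line and run update-grub afterwards):

```
# /etc/default/grub - sketch: append the flag to the existing options
GRUB_CMDLINE_LINUX_DEFAULT="quiet tsc=reliable"
# then: update-grub && reboot
# verify with: cat /proc/cmdline
# and check the active clocksource with:
#   cat /sys/devices/system/clocksource/clocksource0/current_clocksource
```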

If you see those crashes with two completely fundamentally different storage setups then I'd guess that's really hardware affected (either an issue, or incompatibility from HW with kernel) or some strange bug in a lower level kernel IO code-path, as else LVM or MD really do not share code with ZFS...

After extensive testing, I can say that in my case it is not the hardware (I tried different disks, SATA ports, and SATA cables). It is something below the storage setup but above the hardware - probably the kernel (and maybe affected by the clocksource; I have yet to test that).
 
JUST FYI

I had such issues before, but they were not related to ZFS or the disk itself.
Instead it's NFS and the kernel.

Here's what happened: NFS was mounted via automount. One of the NFS shares went away, but I forgot to disable the automount.
The host suddenly dropped its I/O and the throughput of the hard disks.
For comparison: a 4-drive RAID10 went from 272 MB/s down to a crawl of 5 MB/s.

Removing the automount solved the whole thing. It's simply NFS locking up the kernel.
So if you mount NFS and have any issues with those mounts (source no longer there, or source under high load, ...), your system may lock up if anything tries to use that NFS share.

In my case automount did that (it would be less of an issue if it were mounted via fstab and not used at all).

You made my day!! I spent the last two days searching for the error. For others running into the same situation: mine was a bit different, but I can now say it's similar. I have automounted CIFS shares, and the target wasn't responding correctly, so my read speed dropped - depending on how big the file is?! - from 30 MB/s to 2 MB/s... When the target is fully reachable (restarted ;) ) it works like a charm.

Does anybody know where this comes from? Should we file a bug?
And @bofh: how did you find that out?!!!
 
Does anybody know where this comes from? Should we file a bug?
And @bofh: how did you find that out?!!!
Yeah, file a bug with @linustorwald and tell him the monokernel sucks and he shall eat his bloatware.
It's simple: if the driver is in userspace, no freeze; if it's a kernel module, you're done.
It's not automount's fault, and not Proxmox's fault.

How did I find that out... the usual way: blood, sweat, tears and desperation, but mostly luck.

#iwhishwehadamicrokernel
 
I have automounted CIFS shares, and the target wasn't responding correctly, so my read speed dropped - depending on how big the file is?! - from 30 MB/s to 2 MB/s... When the target is fully reachable (restarted ;) ) it works like a charm.

Yes, that's to be expected. The issue seems related to any device I/O, so for the first few megs the caches work a bit in your favour.
The moment it really needs to access the disk, it's over. This seems to be very low-level kernel related.
Since most CIFS code is also kernel code, that's not a big surprise.

CIFS just seems to be a bit less prone to this than NFS. NFS is brutal in that regard and may even force you to hard-reboot.
 
