Proxmox freezes with high IO, maybe ZFS related

arukashi

Member
Jan 21, 2023
Hello.
I'm experiencing freezes that started three days ago and now happen every day, at random times. The hardware is a Hetzner dedicated server (details at the end of the post).
Symptoms: the Proxmox interface shows high IO delay, around 30-50%, while no significant IO operations are happening. Watching iotop, the highest write rate I saw was about 15 MB/s.
A simple reboot fixes the problem until it happens again.
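For reference, this is roughly what I was using to watch disk activity (a sketch; the exact flags I ran may have differed slightly):
Code:
# extended per-device statistics in MB, refreshed every 5 seconds
iostat -xm 5
# show only the processes/threads that are actually doing IO
iotop -o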

In the system logs I can see many records like this:
zed[1407629]: eid=333 class=deadman pool='rpool' vdev=sdd3 size=73728 offset=139804012544 priority=3 err=0 flags=0x80100480 delay=101266124ms
Meanwhile, zpool events -v shows this:
Code:
Apr 13 2026 15:54:06.617975577 ereport.fs.zfs.deadman
        class = "ereport.fs.zfs.deadman"
        ena = 0x819c1944f5406001
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0xc0300e71306e37ec
                vdev = 0xb685d326aa5b66c8
        (end detector)
        pool = "rpool"
        pool_guid = 0xc0300e71306e37ec
        pool_state = 0x0
        pool_context = 0x0
        pool_failmode = "wait"
        vdev_guid = 0xb685d326aa5b66c8
        vdev_type = "disk"
        vdev_path = "/dev/sdd3"
        vdev_ashift = 0x9
        vdev_complete_ts = 0x4827c2f25336
        vdev_delta_ts = 0xb40f810ebd
        vdev_read_errors = 0x0
        vdev_write_errors = 0x0
        vdev_cksum_errors = 0x0
        vdev_delays = 0x0
        dio_verify_errors = 0x0
        parent_guid = 0x79041166b7a9b905
        parent_type = "mirror"
        vdev_spare_paths =
        vdev_spare_guids =
        zio_err = 0x0
        zio_flags = 0x300080 [CANFAIL DONT_QUEUE DONT_PROPAGATE]
        zio_stage = 0x200000 [VDEV_IO_START]
        zio_pipeline = 0x4e00000 [VDEV_IO_START VDEV_IO_DONE VDEV_IO_ASSESS DONE]
        zio_delay = 0x0
        zio_timestamp = 0x481318fd6e2e
        zio_delta = 0x0
        zio_type = 0x2 [WRITE]
        zio_priority = 0x3 [ASYNC_WRITE]
        zio_offset = 0x14922dd000
        zio_size = 0x2000
        zio_objset = 0x8245
        zio_object = 0x1
        zio_level = 0x0
        zio_blkid = 0x1af7f7c
        time = 0x69dcf57e 0x24d58f19
        eid = 0x64

The freezes come out of the blue: I haven't changed any ZFS configuration recently, and there are no new workloads either.
So I thought it was a faulty /dev/sdd drive and asked support to replace it. That helped, but not for long: 5 hours later the freezes were back.
This time, however, the log messages referred to all of the drives, which makes me think the drives themselves are not the problem.
Apr 12 09:08:24 pve-htznr-6 zed[1466615]: eid=124 class=deadman pool='rpool' vdev=sdb3 size=4096 offset=1687645704192 priority=3 err=0 flags=0x300080>
Apr 12 09:08:24 pve-htznr-6 zed[168287]: Missed 506 events
Apr 12 09:08:24 pve-htznr-6 zed[168287]: Missed 306 events
Apr 12 09:08:24 pve-htznr-6 zed[1466623]: eid=126 class=deadman pool='rpool' vdev=sdc3 size=4096 offset=382222438400 priority=3 err=0 flags=0x300080 >
Apr 12 09:08:24 pve-htznr-6 zed[168287]: Missed 242 events
Apr 12 09:08:24 pve-htznr-6 zed[1466627]: eid=127 class=deadman pool='rpool' vdev=sda3 size=8192 offset=1603562139648 priority=3 err=0 flags=0x300080>
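I also haven't touched any ZFS module parameters. If I understand the docs correctly, a deadman event is reported when a single IO stays outstanding longer than zfs_deadman_ziotime_ms (5 minutes by default), and the current values can be checked like this:
Code:
# print the current deadman tunables (I expect these to still be the OpenZFS defaults here)
grep . /sys/module/zfs/parameters/zfs_deadman_*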

Some diagnostic info
Code:
zpool status -v
  pool: rpool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
    The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(7) for details.
  scan: resilvered 4.47G in 00:00:20 with 0 errors on Mon Apr 13 22:55:05 2026
config:

    NAME                                                  STATE     READ WRITE CKSUM
    rpool                                                 ONLINE       0     0     0
      mirror-0                                            ONLINE       0     0     0
        sdb3                                              ONLINE       0     0     0
        sdc3                                              ONLINE       0     0     0
      mirror-1                                            ONLINE       0     0     0
        ata-Micron_5200_MTFDDAK1T9TDC_18501FD2E0D6-part3  ONLINE       0     0     0
        sda3                                              ONLINE       0     0     0

errors: No known data errors
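(The "features are not enabled" status is just because I haven't run zpool upgrade after updating ZFS; as far as I know, running it without arguments only lists pools with missing features and changes nothing:)
Code:
# list pools whose supported features are not yet enabled
zpool upgrade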
Code:
proxmox-ve: 9.1.0 (running kernel: 6.17.13-2-pve)
pve-manager: 9.1.7 (running version: 9.1.7/16b139a017452f16)
proxmox-kernel-helper: 9.0.4
pve-kernel-5.15: 7.4-15
proxmox-kernel-6.17: 6.17.13-2
proxmox-kernel-6.17.13-2-pve-signed: 6.17.13-2
proxmox-kernel-6.17.4-2-pve-signed: 6.17.4-2
proxmox-kernel-6.8: 6.8.12-17
proxmox-kernel-6.8.12-17-pve-signed: 6.8.12-17
pve-kernel-5.15.158-2-pve: 5.15.158-2
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 19.2.3-pve1
corosync: 3.1.10-pve2
criu: 4.1.1-1
frr-pythontools: 10.4.1-1+pve1
ifupdown2: 3.3.0-1+pmx12
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.1
libproxmox-backup-qemu0: 2.0.2
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.6
libpve-apiclient-perl: 3.4.2
libpve-cluster-api-perl: 9.1.1
libpve-cluster-perl: 9.1.1
libpve-common-perl: 9.1.9
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.5
libpve-network-perl: 1.2.5
libpve-rs-perl: 0.11.4
libpve-storage-perl: 9.1.1
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2+pmx1
lxc-pve: 6.0.5-4
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
proxmox-backup-client: 4.1.5-1
proxmox-backup-file-restore: 4.1.5-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.2.1
proxmox-kernel-helper: 9.0.4
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.3
proxmox-widget-toolkit: 5.1.9
pve-cluster: 9.1.1
pve-container: 6.1.2
pve-docs: 9.1.2
pve-edk2-firmware: 4.2025.05-2
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.4
pve-firmware: 3.18-2
pve-ha-manager: 5.1.3
pve-i18n: 3.7.0
pve-qemu-kvm: 10.1.2-7
pve-xtermjs: 5.5.0-3
qemu-server: 9.1.6
smartmontools: 7.4-pve1
spiceterm: 3.4.1
swtpm: 0.8.0+pve3
vncterm: 1.9.1
zfsutils-linux: 2.4.1-pve1

Hardware: Hetzner dedicated server
CPU: Intel(R) Xeon(R) W-2295 CPU @ 3.00GHz
Memory: 4 * Samsung M393A4G43AB3-CWE ECC Registered + 4 * Samsung M393A4K40EB3-CWE ECC Registered, Speed: 2933 MT/s
Drives:
Code:
Device Model:     SAMSUNG MZ7LH1T9HMLT-00005
Device Model:     Micron_5200_MTFDDAK1T9TDC
Device Model:     Micron_5200_MTFDDAK1T9TDC
Device Model:     Micron_5200_MTFDDAK1T9TDC
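I can post full SMART data for the drives if that helps; this is roughly how I would collect it (device list assumed to be sda through sdd):
Code:
# SMART health summary and attribute table for each drive
for d in /dev/sd{a..d}; do
    echo "=== $d ==="
    smartctl -H -A "$d"
done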


Any help appreciated, thanks!
 
So to summarize:
  • You have not changed any settings.
  • The issue still occurred even after replacing the SSD (`sdd`).
  • The same issue has also occurred on the other disks (`sda` / `sdb` / `sdc`).

Based on the above, I suspect the problem may be related to a component shared by all SSDs, such as the SATA controller, backplane, or similar shared infrastructure.
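If it is something at the controller or link level, it would usually show up as ATA errors or link resets in the kernel log; something along these lines should surface them (the grep pattern is just a rough filter):
Code:
# kernel messages since boot, filtered for ATA errors, resets and failed commands
journalctl -k | grep -iE 'ata[0-9]+.*(error|reset|failed)'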
 
The freezes have happened again. This is what iostat shows:
Code:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
sda              6.19      0.03     0.00   0.00    0.13     4.90    6.39      0.20     0.00   0.00    0.25    31.75    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.06
sdb              0.00      0.00     0.00   0.00    0.00     0.00    2.20      0.13     0.00   0.00 2958.27    58.55    0.00      0.00     0.00   0.00    0.00     0.00    0.40 1934.00    7.27  71.88
sdc             35.93      0.20     0.00   0.00    0.18     5.73    2.40      0.16     0.00   0.00    0.42    68.00    0.00      0.00     0.00   0.00    0.00     0.00    0.40    0.00    0.01   0.22
sdd             67.27      0.37     0.20   0.30    0.19     5.57    6.99      0.20     0.40   5.41    0.20    29.03    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.01   0.42


Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
sda              2.60      0.01     0.00   0.00    0.23     4.00   13.80      0.68     0.60   4.17    0.29    50.09    0.00      0.00     0.00   0.00    0.00     0.00    0.60    0.00    0.00   0.06
sdb              0.00      0.00     0.00   0.00    0.00     0.00    4.80      0.26     0.00   0.00 2800.71    56.17    0.00      0.00     0.00   0.00    0.00     0.00    0.80 1500.50   14.64 117.36
sdc              0.00      0.00     0.00   0.00    0.00     0.00    7.40      0.70     0.00   0.00    0.73    96.43    0.00      0.00     0.00   0.00    0.00     0.00    0.80    0.25    0.01   0.10
sdd              2.00      0.01     0.20   9.09    0.80     4.40   13.00      0.68     0.80   5.80    0.28    53.17    0.00      0.00     0.00   0.00    0.00     0.00    0.60    0.33    0.01   0.12


Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
sda             43.60      0.32     0.00   0.00    0.16     7.60    8.80      1.04     0.20   2.22    1.82   120.55    0.00      0.00     0.00   0.00    0.00     0.00    0.20    1.00    0.02   0.92
sdb              0.00      0.00     0.00   0.00    0.00     0.00    4.40      0.22     0.20   4.35 2933.95    51.27    0.00      0.00     0.00   0.00    0.00     0.00    0.40 1634.00   13.56 114.72
sdc             16.00      0.11     0.20   1.23    0.20     7.30    7.80      1.04     1.40  15.22    0.90   136.82    0.00      0.00     0.00   0.00    0.00     0.00    0.20    0.00    0.01   0.24
sdd             32.40      0.25     0.00   0.00    0.25     7.78    8.20      1.04     0.80   8.89    0.56   129.37    0.00      0.00     0.00   0.00    0.00     0.00    0.20    0.00    0.01   0.50

In the logs I can see this:
zed[1848837]: eid=242 class=deadman pool='rpool' vdev=sdb3 size=4096 offset=1619770630144 priority=3 err=0 flags=0x300080 bookmark=53732:0:5:0
In this case only sdb is causing trouble. Taking sdb3 out of the pool:
zpool offline rpool sdb3
The problem seems to be fixed: the load average drops and IO delay falls to almost zero.

Bringing the sdb drive back into the pool:
zpool online rpool sdb3
IO delay starts rising again.
What is this: another faulty drive? Or faulty ZFS logic that unfairly loads only one drive?
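For completeness, this is how I've been comparing per-vdev latencies while the issue is happening (as I understand the flags: -v per vdev, -l latency columns, -y to skip the since-boot summary):
Code:
# per-vdev latency statistics every 5 seconds
zpool iostat -vly rpool 5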