(FYI, I already filed a bug report for the same issue with OpenZFS here: https://github.com/openzfs/zfs/issues/18094. I am sharing it here as well, since I imagine it is of interest to the Proxmox community too.)
I noticed hung kernel tasks in dmesg just yesterday. This is on a system that uses ZFS RAID1 and ECC memory:
Code:
[ 246.544382] INFO: task z_metaslab:1462 blocked for more than 122 seconds.
[ 246.544385] Tainted: P S O 6.17.4-1-pve #1
[ 246.544388] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 246.544391] task:z_metaslab state:D stack:0 pid:1462 tgid:1462 ppid:2 task_flags:0x288040 flags:0x00004000
[ 246.544396] Call Trace:
[ 246.544398] <TASK>
[ 246.544400] __schedule+0x468/0x1310
[ 246.544403] ? srso_alias_return_thunk+0x5/0xfbef5
[ 246.544406] ? srso_alias_return_thunk+0x5/0xfbef5
[ 246.544409] ? update_entity_lag+0x76/0x80
[ 246.544413] ? srso_alias_return_thunk+0x5/0xfbef5
[ 246.544417] schedule+0x27/0xf0
[ 246.544420] cv_wait_common+0x10a/0x140 [spl]
[ 246.544425] ? __pfx_autoremove_wake_function+0x10/0x10
[ 246.544429] __cv_wait+0x15/0x30 [spl]
[ 246.544434] metaslab_load+0x4a/0x910 [zfs]
[ 246.544482] ? spl_kmem_free+0x31/0x40 [spl]
[ 246.544487] ? srso_alias_return_thunk+0x5/0xfbef5
[ 246.544490] ? kfree+0x2dd/0x360
[ 246.544494] metaslab_preload+0x57/0xc0 [zfs]
[ 246.544542] taskq_thread+0x349/0x720 [spl]
[ 246.544548] ? __pfx_default_wake_function+0x10/0x10
[ 246.544553] ? __pfx_taskq_thread+0x10/0x10 [spl]
[ 246.544558] kthread+0x108/0x220
[ 246.544561] ? __pfx_kthread+0x10/0x10
[ 246.544564] ret_from_fork+0x205/0x240
[ 246.544567] ? __pfx_kthread+0x10/0x10
[ 246.544570] ret_from_fork_asm+0x1a/0x30
[ 246.544575] </TASK>
[ 246.544577] INFO: task z_metaslab:1463 blocked for more than 122 seconds.
[ 246.544580] Tainted: P S O 6.17.4-1-pve #1
[ 246.544583] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 246.544586] task:z_metaslab state:D stack:0 pid:1463 tgid:1463 ppid:2 task_flags:0x288040 flags:0x00004000
[ 246.544591] Call Trace:
[ 246.544593] <TASK>
[ 246.544595] __schedule+0x468/0x1310
[ 246.544601] ? srso_alias_return_thunk+0x5/0xfbef5
[ 246.544605] ? srso_alias_return_thunk+0x5/0xfbef5
[ 246.544608] ? update_entity_lag+0x76/0x80
[ 246.544611] ? srso_alias_return_thunk+0x5/0xfbef5
[ 246.544615] schedule+0x27/0xf0
[ 246.544618] cv_wait_common+0x10a/0x140 [spl]
[ 246.544623] ? __pfx_autoremove_wake_function+0x10/0x10
[ 246.544627] __cv_wait+0x15/0x30 [spl]
[ 246.544632] metaslab_load+0x4a/0x910 [zfs]
[ 246.544680] ? spl_kmem_free+0x31/0x40 [spl]
[ 246.544685] ? srso_alias_return_thunk+0x5/0xfbef5
[ 246.544688] ? kfree+0x2dd/0x360
[ 246.544692] metaslab_preload+0x57/0xc0 [zfs]
[ 246.544740] taskq_thread+0x349/0x720 [spl]
[ 246.544746] ? __pfx_default_wake_function+0x10/0x10
[ 246.544751] ? __pfx_taskq_thread+0x10/0x10 [spl]
[ 246.544756] kthread+0x108/0x220
[ 246.544759] ? __pfx_kthread+0x10/0x10
[ 246.544763] ret_from_fork+0x205/0x240
[ 246.544765] ? __pfx_kthread+0x10/0x10
[ 246.544768] ret_from_fork_asm+0x1a/0x30
[ 246.544773] </TASK>
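In case anyone wants to check their own logs for the same symptom, here is a minimal Python sketch that pulls the hung-task headers out of dmesg text. It only assumes the `INFO: task NAME:PID blocked for more than N seconds` line format seen in the log above; the names `HUNG_TASK` and `hung_tasks` are made up for illustration and are not part of any ZFS or Proxmox tooling.

```python
import re

# Matches the hung-task header line in the format shown in the log above.
# The pattern is an assumption based on that output, not a kernel-defined API.
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def hung_tasks(dmesg_text):
    """Return a (task name, pid, seconds) tuple for each hung-task report."""
    return [
        (m.group("name"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG_TASK.finditer(dmesg_text)
    ]


# Two header lines copied from the log above.
sample = """\
[ 246.544382] INFO: task z_metaslab:1462 blocked for more than 122 seconds.
[ 246.544577] INFO: task z_metaslab:1463 blocked for more than 122 seconds.
"""

print(hung_tasks(sample))
# [('z_metaslab', 1462, 122), ('z_metaslab', 1463, 122)]
```

To run it against a live system, the text passed to `hung_tasks` could be the saved output of `dmesg` instead of the `sample` string.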
After rebooting the system, I saw a kernel metaslab error similar to the one in this screenshot, except that the system would continue to boot:

I then scrubbed the pool, after which `zpool status` reported a permanent error in metadata. After two more reboots, the system no longer boots. This is a bit odd, because it *did* continue to boot previously even though the same (or a very similar) metaslab error appeared at the exact same moment, during pool import.
A fix/workaround for this issue was previously discussed by the OpenZFS developers, without a final resolution:
https://github.com/openzfs/zfs/pull/17094