ZFS: Space map corruption on boot causes kernel hangup

wrobelda

Member
Apr 13, 2022
(FYI, I already filed a bug report for this issue with OpenZFS here: https://github.com/openzfs/zfs/issues/18094. I am sharing it here as well, since I imagine it is of interest to the Proxmox community too.)

I noticed kernel task hangups in dmesg just yesterday. This is on a system I use with ZFS RAID1 and ECC memory:

Code:
[  246.544382] INFO: task z_metaslab:1462 blocked for more than 122 seconds.
[  246.544385]       Tainted: P S         O        6.17.4-1-pve #1
[  246.544388] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  246.544391] task:z_metaslab      state:D stack:0     pid:1462  tgid:1462  ppid:2      task_flags:0x288040 flags:0x00004000
[  246.544396] Call Trace:
[  246.544398]  <TASK>
[  246.544400]  __schedule+0x468/0x1310
[  246.544403]  ? srso_alias_return_thunk+0x5/0xfbef5
[  246.544406]  ? srso_alias_return_thunk+0x5/0xfbef5
[  246.544409]  ? update_entity_lag+0x76/0x80
[  246.544413]  ? srso_alias_return_thunk+0x5/0xfbef5
[  246.544417]  schedule+0x27/0xf0
[  246.544420]  cv_wait_common+0x10a/0x140 [spl]
[  246.544425]  ? __pfx_autoremove_wake_function+0x10/0x10
[  246.544429]  __cv_wait+0x15/0x30 [spl]
[  246.544434]  metaslab_load+0x4a/0x910 [zfs]
[  246.544482]  ? spl_kmem_free+0x31/0x40 [spl]
[  246.544487]  ? srso_alias_return_thunk+0x5/0xfbef5
[  246.544490]  ? kfree+0x2dd/0x360
[  246.544494]  metaslab_preload+0x57/0xc0 [zfs]
[  246.544542]  taskq_thread+0x349/0x720 [spl]
[  246.544548]  ? __pfx_default_wake_function+0x10/0x10
[  246.544553]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[  246.544558]  kthread+0x108/0x220
[  246.544561]  ? __pfx_kthread+0x10/0x10
[  246.544564]  ret_from_fork+0x205/0x240
[  246.544567]  ? __pfx_kthread+0x10/0x10
[  246.544570]  ret_from_fork_asm+0x1a/0x30
[  246.544575]  </TASK>
[  246.544577] INFO: task z_metaslab:1463 blocked for more than 122 seconds.
[  246.544580]       Tainted: P S         O        6.17.4-1-pve #1
[  246.544583] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  246.544586] task:z_metaslab      state:D stack:0     pid:1463  tgid:1463  ppid:2      task_flags:0x288040 flags:0x00004000
[  246.544591] Call Trace:
[  246.544593]  <TASK>
[  246.544595]  __schedule+0x468/0x1310
[  246.544601]  ? srso_alias_return_thunk+0x5/0xfbef5
[  246.544605]  ? srso_alias_return_thunk+0x5/0xfbef5
[  246.544608]  ? update_entity_lag+0x76/0x80
[  246.544611]  ? srso_alias_return_thunk+0x5/0xfbef5
[  246.544615]  schedule+0x27/0xf0
[  246.544618]  cv_wait_common+0x10a/0x140 [spl]
[  246.544623]  ? __pfx_autoremove_wake_function+0x10/0x10
[  246.544627]  __cv_wait+0x15/0x30 [spl]
[  246.544632]  metaslab_load+0x4a/0x910 [zfs]
[  246.544680]  ? spl_kmem_free+0x31/0x40 [spl]
[  246.544685]  ? srso_alias_return_thunk+0x5/0xfbef5
[  246.544688]  ? kfree+0x2dd/0x360
[  246.544692]  metaslab_preload+0x57/0xc0 [zfs]
[  246.544740]  taskq_thread+0x349/0x720 [spl]
[  246.544746]  ? __pfx_default_wake_function+0x10/0x10
[  246.544751]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[  246.544756]  kthread+0x108/0x220
[  246.544759]  ? __pfx_kthread+0x10/0x10
[  246.544763]  ret_from_fork+0x205/0x240
[  246.544765]  ? __pfx_kthread+0x10/0x10
[  246.544768]  ret_from_fork_asm+0x1a/0x30
[  246.544773]  </TASK>
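
For anyone who wants to check their own logs for the same symptom, here is a minimal sketch that pulls the hung-task reports out of `dmesg` output. The script and the names in it (`find_hung_tasks`, `HUNG_TASK_RE`) are my own illustration and not part of any ZFS or Proxmox tool; the regular expression only assumes the message format shown in the trace above.

```python
import re
import sys

# Matches the kernel hung-task detector line as it appears above, e.g.
# "[  246.544382] INFO: task z_metaslab:1462 blocked for more than 122 seconds."
HUNG_TASK_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def find_hung_tasks(dmesg_text):
    """Return (timestamp, task name, pid, blocked seconds) per hung-task report."""
    hits = []
    for line in dmesg_text.splitlines():
        m = HUNG_TASK_RE.search(line)
        if m:
            hits.append(
                (float(m["ts"]), m["name"], int(m["pid"]), int(m["secs"]))
            )
    return hits


if __name__ == "__main__":
    # Reads the kernel log from standard input and prints one line per report.
    for ts, name, pid, secs in find_hung_tasks(sys.stdin.read()):
        print(f"{ts:.6f} {name} pid={pid} blocked>{secs}s")
```

Assuming the script is saved as `hung_tasks.py` (a hypothetical file name), it can be fed with `dmesg | python3 hung_tasks.py`. It only lists the reports; it does not interpret the call traces.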

After rebooting the system, I saw a kernel metaslab error similar to the one in this screenshot, except that the system continued to boot:
IMG_20251230_083325850.jpg

I then tried to scrub the pool, which resulted in `zpool status` reporting a permanent error in metadata. After two more reboots, the system no longer boots. This is a bit odd, because it *did* continue booting previously, even though the same (or a very similar) metaslab error appeared at exactly the same point, during pool import.

A fix/workaround for this issue was previously discussed by the OpenZFS developers, without a final resolution:
https://github.com/openzfs/zfs/pull/17094
 