Hello, I'm using Proxmox for the first time and installed version 6.4 a few weeks ago.
So far it's been completely stable, but in the last few days I've had two instances of a kernel panic in zio.c that hangs the only VM in use and renders the system unstable.
Here's the dmesg info:
[76382.347243] VERIFY3(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT) failed (36028797018963967 < 32768)
[76382.349657] PANIC at zio.c:314:zio_data_buf_alloc()
[76382.350850] Showing stack for process 19185
[76382.352006] CPU: 3 PID: 19185 Comm: kvm Tainted: P O 5.4.106-1-pve #1
[76382.353129] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS MASTER/X570 AORUS MASTER, BIOS F33j 04/23/2021
[76382.354248] Call Trace:
[76382.354254] dump_stack+0x6d/0x8b
[76382.354259] spl_dumpstack+0x29/0x2b [spl]
[76382.354261] spl_panic+0xd3/0xfb [spl]
[76382.354263] ? ___slab_alloc+0x2ae/0x580
[76382.354265] ? _cond_resched+0x19/0x30
[76382.354265] ? kmem_cache_alloc+0x17e/0x240
[76382.354267] ? spl_kmem_cache_alloc+0x7c/0x770 [spl]
[76382.354268] ? spl_kmem_cache_alloc+0x14d/0x770 [spl]
[76382.354268] ? _cond_resched+0x19/0x30
[76382.354269] ? _cond_resched+0x19/0x30
[76382.354269] ? mutex_lock+0x12/0x30
[76382.354297] zio_data_buf_alloc+0x58/0x60 [zfs]
[76382.354307] abd_alloc_linear+0x88/0xc0 [zfs]
[76382.354318] abd_alloc+0x8e/0xd0 [zfs]
[76382.354329] arc_get_data_abd.isra.44+0x45/0x70 [zfs]
[76382.354341] arc_hdr_alloc_abd+0x5d/0xb0 [zfs]
[76382.354352] arc_hdr_alloc+0xec/0x160 [zfs]
[76382.354363] arc_alloc_buf+0x4c/0xd0 [zfs]
[76382.354375] dbuf_alloc_arcbuf_from_arcbuf+0xf6/0x180 [zfs]
[76382.354376] ? _cond_resched+0x19/0x30
[76382.354376] ? _cond_resched+0x19/0x30
[76382.354388] dbuf_hold_copy.isra.24+0x36/0xb0 [zfs]
[76382.354404] dbuf_hold_impl+0x43b/0x600 [zfs]
[76382.354416] dbuf_hold+0x33/0x60 [zfs]
[76382.354428] dmu_buf_hold_noread+0x8a/0x110 [zfs]
[76382.354440] dmu_buf_hold+0x3c/0x90 [zfs]
[76382.354469] zfs_get_data+0x197/0x340 [zfs]
[76382.354488] zil_commit_impl+0x9d6/0xdb0 [zfs]
[76382.354510] zil_commit+0x3d/0x60 [zfs]
[76382.354528] zfs_fsync+0x77/0x100 [zfs]
[76382.354544] zpl_fsync+0x6c/0xa0 [zfs]
[76382.354547] vfs_fsync_range+0x48/0x80
[76382.354548] ? __fget_light+0x59/0x70
[76382.354548] do_fsync+0x3d/0x70
[76382.354549] __x64_sys_fdatasync+0x17/0x20
[76382.354551] do_syscall_64+0x57/0x190
[76382.354552] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[76382.354553] RIP: 0033:0x7ffa667ce2e7
[76382.354554] Code: b8 4b 00 00 00 0f 05 48 3d 00 f0 ff ff 77 3c c3 0f 1f 00 53 89 fb 48 83 ec 10 e8 74 54 01 00 89 df 89 c2 b8 4b 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2b 89 d7 89 44 24 0c e8 b6 54 01 00 8b 44 24
[76382.354554] RSP: 002b:00007ff06dff7cf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[76382.354555] RAX: ffffffffffffffda RBX: 0000000000000018 RCX: 00007ffa667ce2e7
[76382.354555] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
[76382.354556] RBP: 000055a5a2a52b92 R08: 0000000000000000 R09: 00000000ffffffff
[76382.354556] R10: 00007ff06dff7ce0 R11: 0000000000000293 R12: 000055a5a2dcd2e8
[76382.354556] R13: 000055a5a3bb4b58 R14: 000055a5a3bb4ae0 R15: 000055a5a3bdd100
[76608.205037] INFO: task z_wr_int:1620 blocked for more than 120 seconds.
[76608.206932] Tainted: P O 5.4.106-1-pve #1
[76608.208839] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[76608.210729] z_wr_int D 0 1620 2 0x80004000
[76608.212583] Call Trace:
[76608.214411] __schedule+0x2e6/0x700
[76608.216207] ? mutex_lock+0x12/0x30
[76608.217976] schedule+0x33/0xa0
[76608.219720] schedule_preempt_disabled+0xe/0x10
[76608.221408] __mutex_lock.isra.10+0x2c9/0x4c0
[76608.223032] __mutex_lock_slowpath+0x13/0x20
[76608.224639] mutex_lock+0x2c/0x30
[76608.226239] dbuf_write_done+0x43/0x220 [zfs]
[76608.227779] arc_write_done+0x8f/0x410 [zfs]
[76608.229274] zio_done+0x43f/0x1020 [zfs]
[76608.230750] zio_execute+0x99/0xf0 [zfs]
[76608.232188] taskq_thread+0x2f7/0x4e0 [spl]
[76608.233618] ? wake_up_q+0x80/0x80
[76608.235049] ? zio_taskq_member.isra.14.constprop.20+0x70/0x70 [zfs]
[76608.236465] kthread+0x120/0x140
[76608.237875] ? task_done+0xb0/0xb0 [spl]
[76608.239285] ? kthread_park+0x90/0x90
[76608.240694] ret_from_fork+0x22/0x40
[76608.242104] INFO: task txg_sync:1787 blocked for more than 120 seconds.
[76608.243527] Tainted: P O 5.4.106-1-pve #1
[76608.244950] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[76608.246379] txg_sync D 0 1787 2 0x80004000
[76608.247788] Call Trace:
[76608.249188] __schedule+0x2e6/0x700
[76608.250580] schedule+0x33/0xa0
[76608.251955] schedule_timeout+0x152/0x330
[76608.253329] ? __next_timer_interrupt+0xd0/0xd0
[76608.254702] io_schedule_timeout+0x1e/0x50
[76608.256068] __cv_timedwait_common+0x138/0x170 [spl]
[76608.257425] ? wait_woken+0x80/0x80
[76608.258758] __cv_timedwait_io+0x19/0x20 [spl]
[76608.260110] zio_wait+0x139/0x280 [zfs]
[76608.261429] ? _cond_resched+0x19/0x30
[76608.262742] dsl_pool_sync+0xdc/0x510 [zfs]
[76608.264043] spa_sync+0x5a4/0xfe0 [zfs]
[76608.265310] ? mutex_lock+0x12/0x30
[76608.266583] ? spa_txg_history_init_io+0x104/0x110 [zfs]
[76608.267858] txg_sync_thread+0x2e1/0x4a0 [zfs]
[76608.269133] ? txg_thread_exit.isra.13+0x60/0x60 [zfs]
[76608.270390] thread_generic_wrapper+0x74/0x90 [spl]
[76608.271651] kthread+0x120/0x140
[76608.272905] ? __thread_exit+0x20/0x20 [spl]
[76608.274157] ? kthread_park+0x90/0x90
[76608.275400] ret_from_fork+0x22/0x40
[76608.276677] INFO: task kvm:19167 blocked for more than 120 seconds.
[76608.277924] Tainted: P O 5.4.106-1-pve #1
[76608.279163] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[76608.280411] kvm D 0 19167 1 0x00004000
[76608.281657] Call Trace:
[76608.282881] __schedule+0x2e6/0x700
[76608.284106] schedule+0x33/0xa0
[76608.285318] cv_wait_common+0x104/0x130 [spl]
[76608.286529] ? wait_woken+0x80/0x80
[76608.287719] __cv_wait+0x15/0x20 [spl]
[76608.288906] zfs_rangelock_enter_impl+0x16a/0x5c0 [zfs]
[76608.290088] zfs_rangelock_enter+0x11/0x20 [zfs]
[76608.291239] zfs_extend+0x44/0x220 [zfs]
[76608.292375] ? sa_lookup+0x71/0x90 [zfs]
[76608.293486] zfs_freesp+0x21d/0x480 [zfs]
[76608.294548] ? _cond_resched+0x19/0x30
[76608.295596] ? mutex_lock+0x12/0x30
[76608.296642] ? rrw_exit+0x6a/0x160 [zfs]
[76608.297647] ? rrm_exit+0x46/0x80 [zfs]
[76608.298610] ? zfs_statvfs+0x191/0x4e0 [zfs]
[76608.299558] ? rrw_exit+0x6a/0x160 [zfs]
[76608.300491] ? zfs_space+0xd3/0x210 [zfs]
[76608.301425] zpl_fallocate_common+0x255/0x290 [zfs]
[76608.302351] ? common_file_perm+0x5e/0x140
[76608.303292] zpl_fallocate+0x12/0x20 [zfs]
[76608.304220] vfs_fallocate+0x147/0x280
[76608.305145] ksys_fallocate+0x41/0x80
[76608.306063] __x64_sys_fallocate+0x1e/0x30
[76608.306986] do_syscall_64+0x57/0x190
[76608.307906] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[76608.308835] RIP: 0033:0x7ffa667cc46d
[76608.309767] Code: Bad RIP value.
[76608.310691] RSP: 002b:00007ff08d7f6c70 EFLAGS: 00000293 ORIG_RAX: 000000000000011d
[76608.311649] RAX: ffffffffffffffda RBX: 0000000000000018 RCX: 00007ffa667cc46d
[76608.312615] RDX: 00000b1c05fd0000 RSI: 0000000000000000 RDI: 0000000000000018
[76608.313585] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffffff
[76608.314541] R10: 00000000000a0000 R11: 0000000000000293 R12: 00000b1c05fd0000
[76608.315486] R13: 00000000000a0000 R14: 000055a5a3bb4ae0 R15: 000055a5a51f18c0
The message above repeats. The VM is still running and somewhat responsive, but it can't be stopped. PVE shutdown takes half an hour, with a lot of "Unmounting <pool>" and "Failed unmounting <pool>" messages before "[ OK ] Reached target Unmount All Filesystems", and then a bunch of "systemd-shutdown[1]: Sending SIGKILL to PID xxxx" for the following:
sd-sync
sync
umount
Then it tries "Remounting '/mnt/POOLNAME' read-only in with options 'xattr,noacl'" for multiple pools, with the next message being:
"Failed to remount '/mnt/POOLNAME' read-only: Device or resource busy"
Eventually it gives up with 5 filesystems remaining that cannot be unmounted, "Syncing filesystems and block devices" times out, SIGKILL is issued, and shutdown completes.
The second instance just happened. After the first one, the system rebooted OK, a scrub showed no problems on the pools, and the VM eventually restarted.
I'm using Micron ECC RAM from the Gigabyte QVL for the board, and I ran memtest quite a bit before installing Proxmox.
The VM runs Ubuntu 21.04 with ZFS as its filesystem too. (The VM storage is a mix of NVMe, SSD and HDDs: SCSI qcow2 with write-back cache enabled and discard=on for the NVMe/SSD devices.)
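In case it matters, here is roughly what one of the disk lines in the VM config looks like (the storage name, VM ID and size below are placeholders, not my actual values):

scsi1: vmstore:100/vm-100-disk-1.qcow2,cache=writeback,discard=on,size=500G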
The only instances of this specific panic message I can find on Google have to do with importing pools, so they're not too helpful in my case. (Or at least I don't think so.)
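For reference, I looked up the assertion that's tripping. In the OpenZFS source, the function around zio.c:314 looks something like this (copied from the 2.0-era code as best I can tell, so treat it as approximate):

void *
zio_data_buf_alloc(size_t size)
{
        /* with SPA_MINBLOCKSHIFT = 9 and SPA_MAXBLOCKSIZE = 16M, the limit here is 32768, matching the log */
        size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;

        VERIFY3U(c, <, SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);

        return (kmem_cache_alloc(zio_data_buf_cache[c], KM_PUSHPAGE));
}

If I'm reading that right, the failing value 36028797018963967 is exactly what (size - 1) >> 9 gives for size = 0 with a 64-bit size_t, so it looks like something requested a zero-byte data buffer, though I have no idea what would cause that.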
If any other details are needed, let me know. Thank you.