Very similar to this: https://github.com/zfsonlinux/zfs/issues/7553
How should I proceed? Go back to PVE 5.4? And how can I disable the monthly scrub so it doesn't hang my server?
If I comment out the line "24 0 8-14 * * root [ $(date +\%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/scrub ] && /usr/lib/zfs-linux/scrub" in "/etc/cron.d/zfsutils-linux", will that turn off the monthly scrub?
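For clarity, this is what I mean, the same scrub entry with a '#' prepended to it:
Code:
# /etc/cron.d/zfsutils-linux -- scrub entry commented out to disable the monthly scrub
#24 0 8-14 * * root [ $(date +\%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/scrub ] && /usr/lib/zfs-linux/scrub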
Hmm, that report mentions this was already an issue with ZFS 0.7.x...
You could also comment on that issue with your details to make upstream aware of this (we cannot reproduce it; various scrubs on various systems here ran without problems).
Also, are you sure the drives themselves are OK? Just to rule out the basic stuff.
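For example, a quick SMART health summary for each disk already catches the obvious failures (assuming the smartmontools package is installed):
Code:
# print the SMART health verdict for every SATA/SCSI disk (needs smartmontools)
for d in /dev/sd?; do echo "== $d =="; smartctl -H "$d"; done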
Looking at the dmesg kernel log (relevant excerpt of your attachment inline below), this looks like some sort of deadlock:
(all tasks busy-waiting and then hanging in the scheduler)
Code:
[Sun Jul 21 02:08:08 2019] INFO: task txg_sync:670 blocked for more than 120 seconds.
[Sun Jul 21 02:08:08 2019] Tainted: P O 5.0.15-1-pve #1
[Sun Jul 21 02:08:08 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Jul 21 02:08:08 2019] txg_sync D 0 670 2 0x80000000
[Sun Jul 21 02:08:08 2019] Call Trace:
[Sun Jul 21 02:08:08 2019] __schedule+0x2d4/0x870
[Sun Jul 21 02:08:08 2019] schedule+0x2c/0x70
[Sun Jul 21 02:08:08 2019] cv_wait_common+0x104/0x130 [spl]
[Sun Jul 21 02:08:08 2019] ? wait_woken+0x80/0x80
[Sun Jul 21 02:08:08 2019] __cv_wait+0x15/0x20 [spl]
[Sun Jul 21 02:08:08 2019] spa_config_enter+0xfb/0x110 [zfs]
[Sun Jul 21 02:08:08 2019] spa_sync+0x199/0xfc0 [zfs]
[Sun Jul 21 02:08:08 2019] ? _cond_resched+0x19/0x30
[Sun Jul 21 02:08:08 2019] ? mutex_lock+0x12/0x30
[Sun Jul 21 02:08:08 2019] ? spa_txg_history_set.part.7+0xba/0xe0 [zfs]
[Sun Jul 21 02:08:08 2019] ? spa_txg_history_init_io+0x106/0x110 [zfs]
[Sun Jul 21 02:08:08 2019] txg_sync_thread+0x2d9/0x4c0 [zfs]
[Sun Jul 21 02:08:08 2019] ? txg_thread_exit.isra.11+0x60/0x60 [zfs]
[Sun Jul 21 02:08:08 2019] thread_generic_wrapper+0x74/0x90 [spl]
[Sun Jul 21 02:08:08 2019] kthread+0x120/0x140
[Sun Jul 21 02:08:08 2019] ? __thread_exit+0x20/0x20 [spl]
[Sun Jul 21 02:08:08 2019] ? __kthread_parkme+0x70/0x70
[Sun Jul 21 02:08:08 2019] ret_from_fork+0x35/0x40
[Sun Jul 21 02:08:08 2019] INFO: task zpool:22915 blocked for more than 120 seconds.
[Sun Jul 21 02:08:08 2019] Tainted: P O 5.0.15-1-pve #1
[Sun Jul 21 02:08:08 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Jul 21 02:08:08 2019] zpool D 0 22915 21494 0x00000004
[Sun Jul 21 02:08:08 2019] Call Trace:
[Sun Jul 21 02:08:08 2019] __schedule+0x2d4/0x870
[Sun Jul 21 02:08:08 2019] schedule+0x2c/0x70
[Sun Jul 21 02:08:08 2019] taskq_wait+0x80/0xd0 [spl]
[Sun Jul 21 02:08:08 2019] ? wait_woken+0x80/0x80
[Sun Jul 21 02:08:08 2019] taskq_destroy+0x45/0x160 [spl]
[Sun Jul 21 02:08:08 2019] vdev_open_children+0x117/0x170 [zfs]
[Sun Jul 21 02:08:08 2019] vdev_root_open+0x3b/0x130 [zfs]
[Sun Jul 21 02:08:08 2019] vdev_open+0xa4/0x720 [zfs]
[Sun Jul 21 02:08:08 2019] ? mutex_lock+0x12/0x30
[Sun Jul 21 02:08:08 2019] vdev_reopen+0x33/0xc0 [zfs]
[Sun Jul 21 02:08:08 2019] dsl_scan+0x3a/0x120 [zfs]
[Sun Jul 21 02:08:08 2019] spa_scan+0x2d/0xc0 [zfs]
[Sun Jul 21 02:08:08 2019] zfs_ioc_pool_scan+0x5b/0xd0 [zfs]
[Sun Jul 21 02:08:08 2019] zfsdev_ioctl+0x6db/0x8f0 [zfs]
[Sun Jul 21 02:08:08 2019] ? lru_cache_add_active_or_unevictable+0x39/0xb0
[Sun Jul 21 02:08:08 2019] do_vfs_ioctl+0xa9/0x640
[Sun Jul 21 02:08:08 2019] ? handle_mm_fault+0xe1/0x210
[Sun Jul 21 02:08:08 2019] ksys_ioctl+0x67/0x90
[Sun Jul 21 02:08:08 2019] __x64_sys_ioctl+0x1a/0x20
[Sun Jul 21 02:08:08 2019] do_syscall_64+0x5a/0x110
[Sun Jul 21 02:08:08 2019] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Sun Jul 21 02:08:08 2019] RIP: 0033:0x7fdaa46ce427
[Sun Jul 21 02:08:08 2019] Code: Bad RIP value.
[Sun Jul 21 02:08:08 2019] RSP: 002b:00007ffc72433918 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Sun Jul 21 02:08:08 2019] RAX: ffffffffffffffda RBX: 00007ffc72433950 RCX: 00007fdaa46ce427
[Sun Jul 21 02:08:08 2019] RDX: 00007ffc72433950 RSI: 0000000000005a07 RDI: 0000000000000003
[Sun Jul 21 02:08:08 2019] RBP: 00007ffc72437340 R08: 0000000000000008 R09: 00007fdaa4719d90
[Sun Jul 21 02:08:08 2019] R10: 000056369d283010 R11: 0000000000000246 R12: 0000000000000001
[Sun Jul 21 02:08:08 2019] R13: 000056369d286570 R14: 0000000000000000 R15: 000056369d284430
[Sun Jul 21 02:08:08 2019] INFO: task vdev_open:22916 blocked for more than 120 seconds.
[Sun Jul 21 02:08:08 2019] Tainted: P O 5.0.15-1-pve #1
[Sun Jul 21 02:08:08 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Jul 21 02:08:08 2019] vdev_open D 0 22916 2 0x80000000
[Sun Jul 21 02:08:08 2019] Call Trace:
[Sun Jul 21 02:08:08 2019] __schedule+0x2d4/0x870
[Sun Jul 21 02:08:08 2019] schedule+0x2c/0x70
[Sun Jul 21 02:08:08 2019] taskq_wait+0x80/0xd0 [spl]
[Sun Jul 21 02:08:08 2019] ? wait_woken+0x80/0x80
[Sun Jul 21 02:08:08 2019] taskq_destroy+0x45/0x160 [spl]
[Sun Jul 21 02:08:08 2019] vdev_open_children+0x117/0x170 [zfs]
[Sun Jul 21 02:08:08 2019] vdev_mirror_open+0x34/0x140 [zfs]
[Sun Jul 21 02:08:08 2019] vdev_open+0xa4/0x720 [zfs]
[Sun Jul 21 02:08:08 2019] vdev_open_child+0x22/0x40 [zfs]
[Sun Jul 21 02:08:08 2019] taskq_thread+0x2ec/0x4d0 [spl]
[Sun Jul 21 02:08:08 2019] ? __switch_to_asm+0x40/0x70
[Sun Jul 21 02:08:08 2019] ? wake_up_q+0x80/0x80
[Sun Jul 21 02:08:08 2019] kthread+0x120/0x140
[Sun Jul 21 02:08:08 2019] ? task_done+0xb0/0xb0 [spl]
[Sun Jul 21 02:08:08 2019] ? __kthread_parkme+0x70/0x70
[Sun Jul 21 02:08:08 2019] ret_from_fork+0x35/0x40
[Sun Jul 21 02:08:08 2019] INFO: task vdev_open:22918 blocked for more than 120 seconds.
[Sun Jul 21 02:08:08 2019] Tainted: P O 5.0.15-1-pve #1
[Sun Jul 21 02:08:08 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Jul 21 02:08:08 2019] vdev_open D 0 22918 2 0x80000000
[Sun Jul 21 02:08:08 2019] Call Trace:
[Sun Jul 21 02:08:08 2019] __schedule+0x2d4/0x870
[Sun Jul 21 02:08:08 2019] ? set_init_blocksize+0x80/0x80
[Sun Jul 21 02:08:08 2019] ? get_disk_and_module+0x40/0x70
[Sun Jul 21 02:08:08 2019] schedule+0x2c/0x70
[Sun Jul 21 02:08:08 2019] schedule_timeout+0x258/0x360
[Sun Jul 21 02:08:08 2019] wait_for_completion+0xb7/0x140
[Sun Jul 21 02:08:08 2019] ? wake_up_q+0x80/0x80
[Sun Jul 21 02:08:08 2019] call_usermodehelper_exec+0x14a/0x180
[Sun Jul 21 02:08:08 2019] call_usermodehelper+0x98/0xb0
[Sun Jul 21 02:08:08 2019] vdev_elevator_switch+0x112/0x1a0 [zfs]
[Sun Jul 21 02:08:08 2019] vdev_disk_open+0x25f/0x410 [zfs]
[Sun Jul 21 02:08:08 2019] vdev_open+0xa4/0x720 [zfs]
[Sun Jul 21 02:08:08 2019] vdev_open_child+0x22/0x40 [zfs]
[Sun Jul 21 02:08:08 2019] taskq_thread+0x2ec/0x4d0 [spl]
[Sun Jul 21 02:08:08 2019] ? __switch_to_asm+0x40/0x70
[Sun Jul 21 02:08:08 2019] ? wake_up_q+0x80/0x80
[Sun Jul 21 02:08:08 2019] kthread+0x120/0x140
[Sun Jul 21 02:08:08 2019] ? task_done+0xb0/0xb0 [spl]
[Sun Jul 21 02:08:08 2019] ? __kthread_parkme+0x70/0x70
[Sun Jul 21 02:08:08 2019] ret_from_fork+0x35/0x40
I have a slight hunch that this could be related to the block devices' IO scheduler...
You could check the current one for all your SATA/SCSI "sdX" devices with the following shell one-liner:
Code:
for blk in /sys/block/s*; do echo -n "$blk: "; cat "$blk/queue/scheduler"; done
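On a stock installation with the 5.0 kernel I'd expect output along these lines, where the bracketed entry is the currently active scheduler (the exact list of available schedulers may differ on your box):
Code:
/sys/block/sda: [mq-deadline] none
/sys/block/sdb: [mq-deadline] none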
I'd say that they're all using mq-deadline now; maybe it's worth trying to set them to "none" as a test...
(just `echo "none" > /sys/block/<BLKDEV>/queue/scheduler`)
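If you want to flip them all at once for the test, a loop in the same spirit as the check above should do; note this is not persistent and reverts on the next reboot:
Code:
# temporarily switch all SATA/SCSI disks to the "none" scheduler (reverts on reboot)
for blk in /sys/block/sd*; do echo none > "$blk/queue/scheduler"; done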