After the update to PVE 6, every once in a while some IO appears to deadlock while a backup is running, and anything that calls sync afterwards (like every container that shuts down or reboots from then on) also gets stuck in unkillable sleep waiting on the original deadlock. Both times this has happened it was a backup job running against a container hosted on RBD. The Ceph cluster was healthy and running fine both times it locked up. We have far more containers than VMs, and all of them live on Ceph, so I'm not sure whether it's a fluke that it was a container both times or whether it's actually specific to them.
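For anyone who wants to compare notes, something like this lists the tasks stuck in uninterruptible (D) sleep and what they're blocked in; it's plain ps, nothing PVE-specific:

# ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'

The stack traces further down are just cat /proc/<pid>/stack on PIDs that turn up in a listing like that.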
Right now the storage for the containers is not set up to use KRBD (that box is unchecked), but I'm not really sure how that option actually applies to containers, as opposed to VMs where qemu uses librbd directly.
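For context, the entry in /etc/pve/storage.cfg currently looks roughly like this (storage and pool names below are placeholders; the krbd option is what that checkbox toggles, and it's 0 or simply absent when unchecked):

rbd: ceph-ct
        content rootdir,images
        krbd 0
        pool ceph-ct-pool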
# uname -r
5.0.15-1-pve
Stack trace for tar
# cat /proc/739002/stack
[<0>] io_schedule+0x16/0x40
[<0>] __sync_dirty_buffer+0xe0/0xf0
[<0>] ext4_commit_super+0x213/0x2c0
[<0>] __ext4_error_inode+0xca/0x160
[<0>] ext4_lookup+0x20e/0x220
[<0>] __lookup_slow+0x9b/0x150
[<0>] lookup_slow+0x3a/0x60
[<0>] walk_component+0x1bf/0x330
[<0>] path_lookupat.isra.46+0x6d/0x220
[<0>] filename_lookup.part.60+0xa0/0x170
[<0>] user_path_at_empty+0x3e/0x50
[<0>] vfs_statx+0x76/0xe0
[<0>] __do_sys_newfstatat+0x35/0x70
[<0>] __x64_sys_newfstatat+0x1e/0x20
[<0>] do_syscall_64+0x5a/0x110
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff
Stack trace for a kernel thread that *might* have coincided with when the backup deadlocked
# cat /proc/1357376/stack
[<0>] io_schedule+0x16/0x40
[<0>] __lock_page+0x122/0x220
[<0>] write_cache_pages+0x23e/0x4d0
[<0>] generic_writepages+0x56/0x90
[<0>] blkdev_writepages+0xe/0x10
[<0>] do_writepages+0x41/0xd0
[<0>] __writeback_single_inode+0x40/0x350
[<0>] writeback_sb_inodes+0x211/0x500
[<0>] __writeback_inodes_wb+0x67/0xb0
[<0>] wb_writeback+0x25f/0x2f0
[<0>] wb_workfn+0x175/0x3f0
[<0>] process_one_work+0x20f/0x410
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
And then the stack trace for the first sync call (all the rest are just waiting on this one to finish)
# cat /proc/1371307/stack
[<0>] io_schedule+0x16/0x40
[<0>] __block_write_full_page+0x1c7/0x440
[<0>] block_write_full_page+0xb8/0x130
[<0>] blkdev_writepage+0x18/0x20
[<0>] __writepage+0x1d/0x50
[<0>] write_cache_pages+0x1de/0x4d0
[<0>] generic_writepages+0x56/0x90
[<0>] blkdev_writepages+0xe/0x10
[<0>] do_writepages+0x41/0xd0
[<0>] __filemap_fdatawrite_range+0xc5/0x100
[<0>] filemap_fdatawrite+0x1f/0x30
[<0>] fdatawrite_one_bdev+0x16/0x20
[<0>] iterate_bdevs+0xb7/0x153
[<0>] ksys_sync+0x70/0xb0
[<0>] __ia32_sys_sync+0xe/0x20
[<0>] do_syscall_64+0x5a/0x110
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff
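A quick way to confirm the remaining sync callers really are all parked behind that first one is to walk their stacks in one go, e.g.:

# for p in $(pgrep -x sync); do echo "== $p =="; cat /proc/$p/stack; done

If sysrq is enabled, echo w > /proc/sysrq-trigger will also dump every blocked task's stack into the kernel log at once, which is less tedious when there are a lot of them.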
I'm at a loss as to how to track this down much further. Has anyone else started seeing sporadic deadlocks like this? When I shut the node down for maintenance tonight I'm going to try turning on KRBD for the container storage, but I'm not sure that will actually change anything for LXC as opposed to qemu.
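If flipping that option does change how the container volumes get attached, rbd showmapped on the node should list the images mapped through the kernel client, so that's roughly what I plan to check before and after (the CT ID below is a placeholder):

# rbd showmapped
# pct config 101 | grep rootfs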