Hi,
I have a new 3 node install that isn't yet running any VM or containers, yet seems rather unstable.
A short while after booting, I'll start to get messages such as:
The particular services that get stuck, however, aren't always predictable. On other nodes, for example, I'll get messages like the following
I'm currently on kernel 5.4.65-1-pve, which I realize is a little old. I saw identical behavior on the default kernel, 5.4.73-1-pve. Are there any suggestions on how to proceed?
I have a new 3 node install that isn't yet running any VM or containers, yet seems rather unstable.
A short while after booting, I'll start to get messages such as:
Code:
[ 243.066934] INFO: task pve-startall-de:993 blocked for more than 120 seconds.
[ 243.067064] Tainted: P O 5.4.65-1-pve #1
[ 243.067098] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 243.067143] pve-startall-de D 0 993 1 0x00004000
[ 243.067146] Call Trace:
[ 243.067158] __schedule+0x2e6/0x6f0
[ 243.067160] ? _cond_resched+0x19/0x30
[ 243.067163] schedule+0x33/0xa0
[ 243.067164] rwsem_down_read_slowpath+0x370/0x470
[ 243.067169] ? link_path_walk.part.42+0x29e/0x540
[ 243.067170] down_read+0x95/0xa0
[ 243.067172] path_openat+0xa33/0x16f0
[ 243.067176] ? generic_file_read_iter+0xa06/0xd70
[ 243.067179] ? _cond_resched+0x19/0x30
[ 243.067180] ? down_read+0x12/0xa0
[ 243.067183] ? _copy_to_user+0x2b/0x40
[ 243.067186] do_filp_open+0x93/0x100
[ 243.067189] ? __alloc_fd+0x46/0x150
[ 243.067191] do_sys_open+0x177/0x280
[ 243.067193] __x64_sys_openat+0x20/0x30
[ 243.067197] do_syscall_64+0x57/0x190
[ 243.067199] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 243.067202] RIP: 0033:0x7f870d9d1c8b
[ 243.067209] Code: Bad RIP value.
[ 243.067211] RSP: 002b:00007ffe5962f030 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[ 243.067213] RAX: ffffffffffffffda RBX: 00007ffe5962f2b0 RCX: 00007f870d9d1c8b
[ 243.067213] RDX: 0000000000080000 RSI: 000055ebf5338590 RDI: 00000000ffffff9c
[ 243.067214] RBP: 000055ebf530a260 R08: 00007ffe5962f2b0 R09: 00000000ffffffff
[ 243.067215] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 243.067216] R13: 000055ebf5338590 R14: 0000000000000000 R15: 0000000000000000
[ 363.886927] INFO: task pve-startall-de:993 blocked for more than 241 seconds.
[ 363.886977] Tainted: P O 5.4.65-1-pve #1
[ 363.887009] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.887053] pve-startall-de D 0 993 1 0x00004000
[ 363.887055] Call Trace:
[ 363.887066] __schedule+0x2e6/0x6f0
[ 363.887069] ? _cond_resched+0x19/0x30
[ 363.887071] schedule+0x33/0xa0
[ 363.887073] rwsem_down_read_slowpath+0x370/0x470
[ 363.887078] ? link_path_walk.part.42+0x29e/0x540
[ 363.887079] down_read+0x95/0xa0
[ 363.887081] path_openat+0xa33/0x16f0
[ 363.887085] ? generic_file_read_iter+0xa06/0xd70
[ 363.887088] ? _cond_resched+0x19/0x30
[ 363.887089] ? down_read+0x12/0xa0
[ 363.887093] ? _copy_to_user+0x2b/0x40
[ 363.887095] do_filp_open+0x93/0x100
[ 363.887098] ? __alloc_fd+0x46/0x150
[ 363.887100] do_sys_open+0x177/0x280
[ 363.887103] __x64_sys_openat+0x20/0x30
[ 363.887106] do_syscall_64+0x57/0x190
[ 363.887108] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 363.887111] RIP: 0033:0x7f870d9d1c8b
[ 363.887118] Code: Bad RIP value.
[ 363.887119] RSP: 002b:00007ffe5962f030 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[ 363.887121] RAX: ffffffffffffffda RBX: 00007ffe5962f2b0 RCX: 00007f870d9d1c8b
[ 363.887122] RDX: 0000000000080000 RSI: 000055ebf5338590 RDI: 00000000ffffff9c
[ 363.887123] RBP: 000055ebf530a260 R08: 00007ffe5962f2b0 R09: 00000000ffffffff
[ 363.887123] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 363.887124] R13: 000055ebf5338590 R14: 0000000000000000 R15: 0000000000000000
[ 484.712688] INFO: task pve-startall-de:993 blocked for more than 362 seconds.
[ 484.712739] Tainted: P O 5.4.65-1-pve #1
[ 484.712771] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 484.712815] pve-startall-de D 0 993 1 0x00004000
[ 484.712819] Call Trace:
[ 484.712830] __schedule+0x2e6/0x6f0
[ 484.712833] ? _cond_resched+0x19/0x30
[ 484.712835] schedule+0x33/0xa0
[ 484.712837] rwsem_down_read_slowpath+0x370/0x470
[ 484.712841] ? link_path_walk.part.42+0x29e/0x540
[ 484.712843] down_read+0x95/0xa0
[ 484.712844] path_openat+0xa33/0x16f0
[ 484.712849] ? generic_file_read_iter+0xa06/0xd70
[ 484.712851] ? _cond_resched+0x19/0x30
[ 484.712852] ? down_read+0x12/0xa0
[ 484.712856] ? _copy_to_user+0x2b/0x40
[ 484.712858] do_filp_open+0x93/0x100
[ 484.712861] ? __alloc_fd+0x46/0x150
[ 484.712863] do_sys_open+0x177/0x280
[ 484.712866] __x64_sys_openat+0x20/0x30
[ 484.712869] do_syscall_64+0x57/0x190
[ 484.712871] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 484.712874] RIP: 0033:0x7f870d9d1c8b
[ 484.712880] Code: Bad RIP value.
[ 484.712881] RSP: 002b:00007ffe5962f030 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[ 484.712883] RAX: ffffffffffffffda RBX: 00007ffe5962f2b0 RCX: 00007f870d9d1c8b
[ 484.712884] RDX: 0000000000080000 RSI: 000055ebf5338590 RDI: 00000000ffffff9c
[ 484.712885] RBP: 000055ebf530a260 R08: 00007ffe5962f2b0 R09: 00000000ffffffff
[ 484.712885] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 484.712886] R13: 000055ebf5338590 R14: 0000000000000000 R15: 0000000000000000
[ 605.531891] INFO: task pve-startall-de:993 blocked for more than 483 seconds.
[ 605.531941] Tainted: P O 5.4.65-1-pve #1
[ 605.531974] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 605.532017] pve-startall-de D 0 993 1 0x00004000
[ 605.532020] Call Trace:
[ 605.532032] __schedule+0x2e6/0x6f0
[ 605.532035] ? _cond_resched+0x19/0x30
[ 605.532037] schedule+0x33/0xa0
[ 605.532039] rwsem_down_read_slowpath+0x370/0x470
[ 605.532043] ? link_path_walk.part.42+0x29e/0x540
[ 605.532045] down_read+0x95/0xa0
[ 605.532047] path_openat+0xa33/0x16f0
[ 605.532050] ? generic_file_read_iter+0xa06/0xd70
[ 605.532053] ? _cond_resched+0x19/0x30
[ 605.532054] ? down_read+0x12/0xa0
[ 605.532057] ? _copy_to_user+0x2b/0x40
[ 605.532060] do_filp_open+0x93/0x100
[ 605.532063] ? __alloc_fd+0x46/0x150
[ 605.532065] do_sys_open+0x177/0x280
[ 605.532067] __x64_sys_openat+0x20/0x30
[ 605.532071] do_syscall_64+0x57/0x190
[ 605.532073] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 605.532076] RIP: 0033:0x7f870d9d1c8b
[ 605.532082] Code: Bad RIP value.
[ 605.532083] RSP: 002b:00007ffe5962f030 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[ 605.532085] RAX: ffffffffffffffda RBX: 00007ffe5962f2b0 RCX: 00007f870d9d1c8b
[ 605.532086] RDX: 0000000000080000 RSI: 000055ebf5338590 RDI: 00000000ffffff9c
[ 605.532087] RBP: 000055ebf530a260 R08: 00007ffe5962f2b0 R09: 00000000ffffffff
[ 605.532088] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 605.532089] R13: 000055ebf5338590 R14: 0000000000000000 R15: 0000000000000000
The particular services that get stuck, however, aren't always predictable. On other nodes, for example, I'll get messages like the following
Code:
[ 5075.820911] INFO: task pvesr:12479 blocked for more than 362 seconds.
[ 5075.820980] Tainted: P O 5.4.65-1-pve #1
[ 5075.821013] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5075.821057] pvesr D 0 12479 1 0x00000004
[ 5075.821060] Call Trace:
[ 5075.821071] __schedule+0x2e6/0x6f0
[ 5075.821076] ? filename_parentat.isra.57.part.58+0xf7/0x180
[ 5075.821078] schedule+0x33/0xa0
[ 5075.821083] rwsem_down_write_slowpath+0x2ed/0x4a0
[ 5075.821085] down_write+0x3d/0x40
[ 5075.821087] filename_create+0x8e/0x180
[ 5075.821089] do_mkdirat+0x59/0x110
[ 5075.821092] __x64_sys_mkdir+0x1b/0x20
[ 5075.821095] do_syscall_64+0x57/0x190
[ 5075.821097] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 5075.821100] RIP: 0033:0x7fc47e10b0d7
[ 5075.821106] Code: Bad RIP value.
[ 5075.821107] RSP: 002b:00007ffccf3ab0b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[ 5075.821110] RAX: ffffffffffffffda RBX: 0000556c84bdc260 RCX: 00007fc47e10b0d7
[ 5075.821110] RDX: 0000556c841fe3d4 RSI: 00000000000001ff RDI: 0000556c88f95af0
[ 5075.821111] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007fc47e15cb80
[ 5075.821112] R10: 0000000000000000 R11: 0000000000000246 R12: 0000556c860192e8
[ 5075.821113] R13: 0000556c88f95af0 R14: 0000556c88c3d6d0 R15: 00000000000001ff
[ 5075.821116] INFO: task vzdump:12531 blocked for more than 362 seconds.
[ 5075.821153] Tainted: P O 5.4.65-1-pve #1
[ 5075.821185] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5075.821228] vzdump D 0 12531 1 0x00000000
[ 5075.821230] Call Trace:
[ 5075.821233] __schedule+0x2e6/0x6f0
[ 5075.821235] ? filename_parentat.isra.57.part.58+0xf7/0x180
[ 5075.821237] schedule+0x33/0xa0
[ 5075.821239] rwsem_down_write_slowpath+0x2ed/0x4a0
[ 5075.821242] down_write+0x3d/0x40
[ 5075.821244] filename_create+0x8e/0x180
[ 5075.821246] do_mkdirat+0x59/0x110
[ 5075.821248] __x64_sys_mkdir+0x1b/0x20
[ 5075.821250] do_syscall_64+0x57/0x190
[ 5075.821252] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 5075.821253] RIP: 0033:0x7f5be74fd0d7
[ 5075.821255] Code: Bad RIP value.
[ 5075.821256] RSP: 002b:00007ffdd1df4088 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[ 5075.821257] RAX: ffffffffffffffda RBX: 0000557228b58260 RCX: 00007f5be74fd0d7
[ 5075.821258] RDX: 0000557226c223d4 RSI: 00000000000001ff RDI: 000055722d0ddb30
[ 5075.821259] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007f5be754eb80
[ 5075.821260] R10: 0000000000000000 R11: 0000000000000246 R12: 0000557228f3e3c8
[ 5075.821261] R13: 000055722d0ddb30 R14: 000055722cc7e878 R15: 00000000000001ff
I'm currently on kernel 5.4.65-1-pve, which I realize is a little old. I saw identical behavior on the default kernel, 5.4.73-1-pve. Are there any suggestions on how to proceed?