Hi,
We're running Proxmox 6.2. When we add a certain number of nodes to the cluster, Proxmox starts to lose its connection to all of the nodes and the cluster shuts down completely. All nodes are still reachable over SSH. This state persists until we remove 1-2 nodes from the cluster. After that it is stable again.
It does not matter which nodes we remove, and the nodes do not have to be added in any particular order to reproduce this behavior.
dmesg reports that a task is hanging:
Bash:
[ 4714.510601] INFO: task pvesr:2692 blocked for more than 362 seconds.
[ 4714.510632] Tainted: P IO 5.4.60-1-pve #1
[ 4714.510649] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4714.510672] pvesr D 0 2692 1 0x00000000
[ 4714.510674] Call Trace:
[ 4714.510683] __schedule+0x2e6/0x6f0
[ 4714.510687] ? filename_parentat.isra.57.part.58+0xf7/0x180
[ 4714.510689] schedule+0x33/0xa0
[ 4714.510692] rwsem_down_write_slowpath+0x2ed/0x4a0
[ 4714.510694] down_write+0x3d/0x40
[ 4714.510696] filename_create+0x8e/0x180
[ 4714.510697] do_mkdirat+0x59/0x110
[ 4714.510699] __x64_sys_mkdir+0x1b/0x20
[ 4714.510702] do_syscall_64+0x57/0x190
[ 4714.510704] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 4714.510706] RIP: 0033:0x7f67b5d920d7
[ 4714.510710] Code: Bad RIP value.
[ 4714.510711] RSP: 002b:00007ffed8d437b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[ 4714.510712] RAX: ffffffffffffffda RBX: 0000560f367db260 RCX: 00007f67b5d920d7
[ 4714.510713] RDX: 0000560f3621b3d4 RSI: 00000000000001ff RDI: 0000560f3a877be0
[ 4714.510713] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007
[ 4714.510714] R10: 0000000000000000 R11: 0000000000000246 R12: 0000560f37c149e8
[ 4714.510715] R13: 0000560f3a877be0 R14: 0000560f3a4f0f70 R15: 00000000000001ff
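For reference, the first line of the trace names the blocked task and its PID (here `pvesr`, PID 2692), and the call trace shows it waiting inside a `mkdir` system call. Below is a minimal Python sketch for pulling those fields out of a saved dmesg dump when there are many such warnings. The function name `find_hung_tasks` and the regular expression are our own illustration and assume the message format shown above; they are not part of any Proxmox tooling.

```python
import re

# Matches the kernel hung-task warning as it appears in the log above, e.g.
# "INFO: task pvesr:2692 blocked for more than 362 seconds."
# The greedy name group keeps task names that themselves contain a colon
# (such as kworker threads) intact; the PID is the last number before "blocked".
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(dmesg_text):
    """Return (task name, pid, seconds blocked) for each hung-task warning."""
    return [
        (m["name"], int(m["pid"]), int(m["secs"]))
        for m in HUNG_TASK.finditer(dmesg_text)
    ]


sample = "[ 4714.510601] INFO: task pvesr:2692 blocked for more than 362 seconds."
print(find_hung_tasks(sample))  # [('pvesr', 2692, 362)]
```

Feeding it the full output of `dmesg` from each node should show whether the same task is blocked everywhere or only on some nodes.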
At first we thought it was network related, caused by a bug in the Intel firmware, but that should have been fixed by the BIOS update we applied yesterday.
We also tried upgrading the kernel and installing the intel-microcode package.
We're currently out of ideas. Does anyone know what exactly this task is and why it is hanging?