I applied updates to some of the nodes in my Proxmox cluster this morning, including the new kernel 6.5.11-8-pve. I rebooted the third server and it came back up with some issues: it didn't seem to have rejoined the Proxmox cluster, and there was a little red X next to the server in the GUI. However, it is also a Ceph node, and Ceph reported that all of its OSDs came back online and the Ceph cluster was healthy.
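In case anyone hits the same symptom: since Ceph looked fine, the things worth checking first are the Proxmox cluster filesystem and corosync rather than the storage. A minimal sketch of the checks I'd run on the affected node (not something I captured at the time):
Code:
# check quorum and cluster membership on the affected node
pvecm status
# check the services behind the GUI's red X
systemctl status pve-cluster corosync
# look at their logs for the current boot
journalctl -b -u pve-cluster -u corosync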
The following eventually appeared on my serial console:
Code:
xeon1230v2 login: [ 244.159592] INFO: task pvescheduler:8378 blocked for more .
[ 244.166598] Tainted: P O 6.5.11-8-pve #1
[ 244.172468] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this.
[ 244.180330] task:pvescheduler state:D stack:0 pid:8378 ppid:8377 f6
[ 244.188708] Call Trace:
[ 244.191168] <TASK>
[ 244.193358] __schedule+0x3fd/0x1450
[ 244.196961] ? try_to_unlazy+0x60/0xe0
[ 244.200775] ? terminate_walk+0x65/0x100
[ 244.204791] ? path_parentat+0x49/0x90
[ 244.208563] schedule+0x63/0x110
[ 244.211862] schedule_preempt_disabled+0x15/0x30
[ 244.216558] rwsem_down_write_slowpath+0x392/0x6a0
[ 244.221375] down_write+0x5c/0x80
[ 244.224780] filename_create+0xaf/0x1b0
[ 244.228697] do_mkdirat+0x5d/0x170
[ 244.232223] __x64_sys_mkdir+0x4a/0x70
[ 244.236074] do_syscall_64+0x5b/0x90
[ 244.239793] ? __x64_sys_alarm+0x76/0xd0
[ 244.243807] ? exit_to_user_mode_prepare+0x39/0x190
[ 244.248897] ? syscall_exit_to_user_mode+0x37/0x60
[ 244.253833] ? do_syscall_64+0x67/0x90
[ 244.257618] ? exit_to_user_mode_prepare+0x39/0x190
[ 244.262575] ? irqentry_exit_to_user_mode+0x17/0x20
[ 244.267489] ? irqentry_exit+0x43/0x50
[ 244.271316] ? exc_page_fault+0x94/0x1b0
[ 244.275302] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 244.280554] RIP: 0033:0x7f3ba6179e27
[ 244.284181] RSP: 002b:00007ffc6b732c38 EFLAGS: 00000246 ORIG_RAX: 00000000003
[ 244.291803] RAX: ffffffffffffffda RBX: 0000561f6f19f2a0 RCX: 00007f3ba6179e27
[ 244.298949] RDX: 0000000000000026 RSI: 00000000000001ff RDI: 0000561f6f1d3ee0
[ 244.306144] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[ 244.313299] R10: 0000000000000000 R11: 0000000000000246 R12: 0000561f6f1a4c88
[ 244.320478] R13: 0000561f6f1d3ee0 R14: 0000561f6f444768 R15: 00000000000001ff
[ 244.327783] </TASK>
Other servers in the cluster (7 nodes total) started complaining about various issues, and the red X started to propagate to them whether they were running the new kernel or not. Yet Ceph remained healthy.
I finally restored order by rebooting each server, starting with the three that had the new kernel and rolling them back to 6.5.11-7-pve, then rebooting the other four servers that I hadn't yet updated. After that, everything was fine again.
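For anyone wanting to do the same rollback without having to catch the boot menu, a rough sketch (assuming the 6.5.11-7-pve kernel is still installed on the node) is to pin the older kernel with proxmox-boot-tool before rebooting:
Code:
# list the kernels known to the boot loader
proxmox-boot-tool kernel list
# pin the previous kernel so it is selected on subsequent boots
proxmox-boot-tool kernel pin 6.5.11-7-pve
reboot
# later, once a fixed kernel is available, remove the pin
proxmox-boot-tool kernel unpin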
Too much excitement for one morning!