[SOLVED] Multiple hosts in cluster locking up after latest update to kernel 6.8.4

bfwdd

Hi,
we have been running Proxmox + Ceph since 2017: 15 hosts with AMD Opteron(tm) Processor 6380 (2 sockets) and AMD EPYC 7513 CPUs.
After the latest update on 8 May 2024, three Opteron hosts locked up - red X in the GUI, no ping, no SSH, and all VMs showing a grey (?) mark.
After a reboot everything was OK.
Six hours later two more Opteron hosts locked up,
and then one AMD EPYC 7513.

After the reboot the Ceph storage shows grey (?) marks, some VMs no longer start, AND a restore failed:

no lock found trying to remove 'create' lock
error before or during data restore, some or all disks were not completely restored. VM 206 state is NOT cleaned up.

ceph health detail:

HEALTH_OK

ceph osd df is also OK (max 71%)

Found this in the logs:

May 13 15:32:52 benno pvestatd[2186]: got timeout
May 13 15:32:57 benno pvestatd[2186]: got timeout
May 13 15:32:57 benno pvestatd[2186]: status update time (15.338 seconds)
May 13 15:33:02 benno pvestatd[2186]: got timeout
May 13 15:33:08 benno pvestatd[2186]: got timeout
May 13 15:33:13 benno pvestatd[2186]: got timeout
May 13 15:33:13 benno pvestatd[2186]: status update time (15.351 seconds)
May 13 15:33:18 benno pvestatd[2186]: got timeout
May 13 15:33:18 benno pmxcfs[467101]: [status] notice: received log
May 13 15:33:23 benno pvestatd[2186]: got timeout
May 13 15:33:28 benno pvestatd[2186]: got timeout
May 13 15:33:28 benno pvestatd[2186]: status update time (15.335 seconds)
May 13 15:33:33 benno pvestatd[2186]: got timeout
May 13 15:33:38 benno pvestatd[2186]: got timeout
May 13 15:33:41 benno pmxcfs[467101]: [status] notice: received log
May 13 15:33:43 benno pvestatd[2186]: got timeout
May 13 15:33:43 benno pvestatd[2186]: status update time (15.335 seconds)
May 13 15:33:48 benno pvestatd[2186]: got timeout
May 13 15:33:53 benno pvestatd[2186]: got timeout

ansible cluster -m shell -a "grep microcode /proc/cpuinfo | uniq"
udo.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000822
kalle.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000832
bruno.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000822
felix.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000822
daniel.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000832
egon.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa10113e
fritz.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa10113e
andre.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa001119
bernd.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa001119
otto.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000832
moritz.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000822
kulle.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa10113e
ralf.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa001119
benno.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa10113e
paul.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000832
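
If newer microcode exists for these CPUs, Debian's amd64-microcode package (from the non-free-firmware component) would be one way to refresh it - a sketch, not verified against these exact models:

Code:
# hypothetical microcode refresh - amd64-microcode lives in Debian's
# non-free-firmware component; a reboot applies the early-load update
apt update
apt install amd64-microcode
reboot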


What can we do???
With regards
Konrad
 
2024-05-13T17:24:37.057563+02:00 udo pvestatd[2406]: VM 182 qmp command failed - VM 182 qmp command 'query-proxmox-support' failed - unable to connect to VM 182 qmp socket - timeout after 51 retries
2024-05-13T17:24:42.105803+02:00 udo pvestatd[2406]: got timeout
2024-05-13T17:24:47.118265+02:00 udo pvestatd[2406]: got timeout
2024-05-13T17:24:52.365757+02:00 udo pvestatd[2406]: got timeout
2024-05-13T17:24:52.551600+02:00 udo pvestatd[2406]: status update time (23.545 seconds)
2024-05-13T17:24:54.605518+02:00 udo pvedaemon[2813]: VM 182 qmp command failed - VM 182 qmp command 'query-proxmox-support' failed - unable to connect to VM 182 qmp socket - timeout after 51 retries
2024-05-13T17:25:00.594784+02:00 udo pvestatd[2406]: VM 182 qmp command failed - VM 182 qmp command 'query-proxmox-support' failed - unable to connect to VM 182 qmp socket - timeout after 51 retries
2024-05-13T17:25:00.621285+02:00 udo pvedaemon[12352]: got timeout
2024-05-13T17:25:00.621462+02:00 udo pvedaemon[12352]: volume deactivation failed: ceph-ssd:vm-182-disk-1 at /usr/share/perl5/PVE/Storage.pm line 1264.
2024-05-13T17:25:00.622008+02:00 udo pvedaemon[12352]: start failed: command '/usr/bin/kvm -id 182 -name 'max,debug-threads=on' -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/182.qmp,server=on,wait=off' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/182.pid -daemonize -smbios 'type=1,uuid=1acacb84-483f-413d-99a2-62aca2564b66' -smp '4,sockets=1,cores=4,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc 'unix:/var/run/qemu-server/182.vnc,password=on' -cpu 'Opteron_G3,enforce,+kvm_pv_eoi,+kvm_pv_unhalt,-rdtscp,vendor=AuthenticAMD' -m 1024 -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'qxl-vga,id=vga,max_outputs=4,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/182.qga,server=on,wait=off,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-serial,id=spice,bus=pci.0,addr=0x9' -chardev 'spicevmc,id=vdagent,name=vdagent' -device 'virtserialport,chardev=vdagent,name=com.redhat.spice.0' -spice 'tls-port=61000,addr=127.0.0.1,tls-ciphers=HIGH,seamless-migration=on' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3,free-page-reporting=on' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:82eaa4ccd85c' -drive 'if=none,id=drive-ide2,media=cdrom,aio=io_uring' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:ssd/vm-182-disk-1:mon_host=10.41.0.88;10.41.0.29;10.41.0.44:auth_supported=cephx:id=admin:keyring=/etc/pve/priv/ceph/ceph-ssd.keyring,if=none,id=drive-scsi0,discard=on,format=raw,cache=none,aio=io_uring,detect-zeroes=unmap' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap182i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=00:01:01:00:00:86,netdev=net0,bus=pci.0,addr=0x12,id=net0,rx_queue_size=1024,tx_queue_size=256,bootindex=300' -netdev 'type=tap,id=net1,ifname=tap182i1,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=00:01:01:00:00:87,netdev=net1,bus=pci.0,addr=0x13,id=net1,rx_queue_size=1024,tx_queue_size=256,bootindex=301' -machine 'type=pc+pve0'' failed: got timeout
2024-05-13T17:25:00.631678+02:00 udo pvedaemon[2814]: <root@pam> end task UPID:udo:00003040:00013735:66423072:qmstart:182:root@pam: start failed: command '/usr/bin/kvm -id 182 -name 'max,debug-threads=on' -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/182.qmp,server=on,wait=off' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/182.pid -daemonize -smbios 'type=1,uuid=1acacb84-483f-413d-99a2-62aca2564b66' -smp '4,sockets=1,cores=4,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc 'unix:/var/run/qemu-server/182.vnc,password=on' -cpu 'Opteron_G3,enforce,+kvm_pv_eoi,+kvm_pv_unhalt,-rdtscp,vendor=AuthenticAMD' -m 1024 -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'qxl-vga,id=vga,max_outputs=4,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/182.qga,server=on,wait=off,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-serial,id=spice,bus=pci.0,addr=0x9' -chardev 'spicevmc,id=vdagent,name=vdagent' -device 'virtserialport,chardev=vdagent,name=com.redhat.spice.0' -spice 'tls-port=61000,addr=127.0.0.1,tls-ciphers=HIGH,seamless-migration=on' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3,free-page-reporting=on' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:82eaa4ccd85c' -drive 'if=none,id=drive-ide2,media=cdrom,aio=io_uring' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:ssd/vm-182-disk-1:mon_host=10.41.0.88;10.41.0.29;10.41.0.44:auth_supported=cephx:id=admin:keyring=/etc/pve/priv/ceph/ceph-ssd.keyring,if=none,id=drive-scsi0,discard=on,format=raw,cache=none,aio=io_uring,detect-zeroes=unmap' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap182i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=00:01:01:00:00:86,netdev=net0,bus=pci.0,addr=0x12,id=net0,rx_queue_size=1024,tx_queue_size=256,bootindex=300' -netdev 'type=tap,id=net1,ifname=tap182i1,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=00:01:01:00:00:87,netdev=net1,bus=pci.0,addr=0x13,id=net1,rx_queue_size=1024,tx_queue_size=256,bootindex=301' -machine 'type=pc+pve0'' failed: got timeout
2024-05-13T17:25:01.579100+02:00 udo CRON[13086]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
2024-05-13T17:25:05.611901+02:00 udo pvestatd[2406]: got timeout



The last installed packages are in the attached dpkg.log.
 

Welcome to the broken world of PVE kernel 6.8.
Roll back to previous PVE version if accessible.
If not, do an offline reinstall of PVE 8.1 or switch the PVE kernel to 6.5.
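
A possible path back to 6.5 without a full reinstall, assuming the 6.5 series is still in your configured repositories (PVE 8 kernel package naming):

Code:
# install the 6.5 kernel series alongside 6.8 and boot into it
apt update
apt install proxmox-kernel-6.5
reboot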

I see they still refuse to take down toxic updates from their repo. Sad.
 
Thanks, the cluster is now running kernel 6.5, BUT:

pvesm status
got timeout
got timeout
got timeout
Name Type Status Total Used Available %
backup nfs active 19260043264 11389441024 7870602240 59.14%
ceph-hdd rbd inactive 0 0 0 0.00%
ceph-sata rbd inactive 0 0 0 0.00%
ceph-ssd rbd inactive 0 0 0 0.00%

How can I activate these storage pools or debug this further?
Ceph itself is healthy...
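
One way to narrow it down could be to query the pool directly with the same credentials PVE uses (pool name and keyring path taken from the kvm command line above; the monitor addresses are examples):

Code:
# talk to RBD directly, bypassing pvesm - if this also hangs,
# the problem sits between host and monitors, not in PVE
rbd ls ssd --id admin \
    --keyring /etc/pve/priv/ceph/ceph-ssd.keyring \
    -m 10.41.0.88,10.41.0.29,10.41.0.44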
 
You could also pin an older kernel version:
Code:
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin {kernel}
No need to downgrade the full node.
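
For example (the version string below is hypothetical - use one shown by kernel list):

Code:
proxmox-boot-tool kernel pin 6.5.13-5-pve   # pick a 6.5 build from 'kernel list'
proxmox-boot-tool kernel unpin              # revert once a fixed 6.8 kernel lands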


AMD Opteron(tm) Processor 6380 (2 Sockets)
Those were released in 2012 if I am not mistaken? I would be prepared for them to become more and more of a problem in the future.
 
Thanks, the kernel is pinned, and yes, we are replacing them.

BUT pvesm status still shows my pools as inactive:
got timeout
got timeout
got timeout
Name Type Status Total Used Available %
backup nfs active 19260043264 11389441024 7870602240 59.14%
ceph-hdd rbd inactive 0 0 0 0.00%
ceph-sata rbd inactive 0 0 0 0.00%
ceph-ssd rbd inactive 0 0 0 0.00%

How can I activate these storage pools or debug this?
 
How is Ceph doing?
  • ceph -s
  • ceph health detail
  • ceph osd df tree

And please put the output inside [code][/code] tags for better readability. There are also buttons for that at the top of the editor.
 
Ahhhhh, I found the culprit - creating new monitors in Ceph/Monitor DOES NOT update the storage definition in Datacenter/Storage!

So no monitor was found....
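
For anyone hitting the same thing: the RBD storage definitions in /etc/pve/storage.cfg carry their own static monhost list, which goes stale when monitors are replaced. A sketch of fixing one storage (IPs taken from the kvm command above, as examples):

Code:
# /etc/pve/storage.cfg - the monhost line must be maintained by hand
rbd: ceph-ssd
        pool ssd
        monhost 10.41.0.88 10.41.0.29 10.41.0.44
        content images

# or non-interactively via the CLI:
pvesm set ceph-ssd --monhost "10.41.0.88 10.41.0.29 10.41.0.44"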

It would really be nice if pvesm status at least reported the root cause (got timeout from mon xyz).

Best regards
Konrad
 