[SOLVED] multiple Hosts in cluster locking up after latest update to kernel 6.8.4

bfwdd

Hi,
we have been running Proxmox + Ceph since 2017 on 15 hosts with AMD Opteron(tm) Processor 6380 (2 sockets) and AMD EPYC 7513 CPUs.
After the latest update on 8 May 2024, three Opteron hosts locked up: red X in the GUI, no ping, no SSH, and all VMs with a grey (?) mark.
After a reboot everything was OK.
After 6 hours two more Opteron hosts locked up,
and then one AMD EPYC 7513 host as well.

After the reboot the Ceph storage has grey (?) marks, some VMs are no longer starting, AND a restore failed:

no lock found trying to remove 'create' lock
error before or during data restore, some or all disks were not completely restored. VM 206 state is NOT cleaned up.

ceph health detail:

HEALTH_OK

ceph osd df is also OK (max 71% used)

Found this in the logs:

May 13 15:32:52 benno pvestatd[2186]: got timeout
May 13 15:32:57 benno pvestatd[2186]: got timeout
May 13 15:32:57 benno pvestatd[2186]: status update time (15.338 seconds)
May 13 15:33:02 benno pvestatd[2186]: got timeout
May 13 15:33:08 benno pvestatd[2186]: got timeout
May 13 15:33:13 benno pvestatd[2186]: got timeout
May 13 15:33:13 benno pvestatd[2186]: status update time (15.351 seconds)
May 13 15:33:18 benno pvestatd[2186]: got timeout
May 13 15:33:18 benno pmxcfs[467101]: [status] notice: received log
May 13 15:33:23 benno pvestatd[2186]: got timeout
May 13 15:33:28 benno pvestatd[2186]: got timeout
May 13 15:33:28 benno pvestatd[2186]: status update time (15.335 seconds)
May 13 15:33:33 benno pvestatd[2186]: got timeout
May 13 15:33:38 benno pvestatd[2186]: got timeout
May 13 15:33:41 benno pmxcfs[467101]: [status] notice: received log
May 13 15:33:43 benno pvestatd[2186]: got timeout
May 13 15:33:43 benno pvestatd[2186]: status update time (15.335 seconds)
May 13 15:33:48 benno pvestatd[2186]: got timeout
May 13 15:33:53 benno pvestatd[2186]: got timeout

ansible cluster -m shell -a "grep microcode /proc/cpuinfo | uniq"
udo.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000822
kalle.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000832
bruno.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000822
felix.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000822
daniel.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000832
egon.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa10113e
fritz.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa10113e
andre.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa001119
bernd.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa001119
otto.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000832
moritz.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000822
kulle.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa10113e
ralf.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa001119
benno.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa10113e
paul.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000832


What can we do???
With regards
Konrad
 
2024-05-13T17:24:37.057563+02:00 udo pvestatd[2406]: VM 182 qmp command failed - VM 182 qmp command 'query-proxmox-support' failed - unable to connect to VM 182 qmp socket - timeout after 51 retries
2024-05-13T17:24:42.105803+02:00 udo pvestatd[2406]: got timeout
2024-05-13T17:24:47.118265+02:00 udo pvestatd[2406]: got timeout
2024-05-13T17:24:52.365757+02:00 udo pvestatd[2406]: got timeout
2024-05-13T17:24:52.551600+02:00 udo pvestatd[2406]: status update time (23.545 seconds)
2024-05-13T17:24:54.605518+02:00 udo pvedaemon[2813]: VM 182 qmp command failed - VM 182 qmp command 'query-proxmox-support' failed - unable to connect to VM 182 qmp socket - timeout after 51 retries
2024-05-13T17:25:00.594784+02:00 udo pvestatd[2406]: VM 182 qmp command failed - VM 182 qmp command 'query-proxmox-support' failed - unable to connect to VM 182 qmp socket - timeout after 51 retries
2024-05-13T17:25:00.621285+02:00 udo pvedaemon[12352]: got timeout
2024-05-13T17:25:00.621462+02:00 udo pvedaemon[12352]: volume deactivation failed: ceph-ssd:vm-182-disk-1 at /usr/share/perl5/PVE/Storage.pm line 1264.
2024-05-13T17:25:00.622008+02:00 udo pvedaemon[12352]: start failed: command '/usr/bin/kvm -id 182 -name 'max,debug-threads=on' -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/182.qmp,server=on,wait=off' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/182.pid -daemonize -smbios 'type=1,uuid=1acacb84-483f-413d-99a2-62aca2564b66' -smp '4,sockets=1,cores=4,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc 'unix:/var/run/qemu-server/182.vnc,password=on' -cpu 'Opteron_G3,enforce,+kvm_pv_eoi,+kvm_pv_unhalt,-rdtscp,vendor=AuthenticAMD' -m 1024 -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'qxl-vga,id=vga,max_outputs=4,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/182.qga,server=on,wait=off,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-serial,id=spice,bus=pci.0,addr=0x9' -chardev 'spicevmc,id=vdagent,name=vdagent' -device 'virtserialport,chardev=vdagent,name=com.redhat.spice.0' -spice 'tls-port=61000,addr=127.0.0.1,tls-ciphers=HIGH,seamless-migration=on' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3,free-page-reporting=on' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:82eaa4ccd85c' -drive 'if=none,id=drive-ide2,media=cdrom,aio=io_uring' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:ssd/vm-182-disk-1:mon_host=10.41.0.88;10.41.0.29;10.41.0.44:auth_supported=cephx:id=admin:keyring=/etc/pve/priv/ceph/ceph-ssd.keyring,if=none,id=drive-scsi0,discard=on,format=raw,cache=none,aio=io_uring,detect-zeroes=unmap' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap182i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=00:01:01:00:00:86,netdev=net0,bus=pci.0,addr=0x12,id=net0,rx_queue_size=1024,tx_queue_size=256,bootindex=300' -netdev 'type=tap,id=net1,ifname=tap182i1,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=00:01:01:00:00:87,netdev=net1,bus=pci.0,addr=0x13,id=net1,rx_queue_size=1024,tx_queue_size=256,bootindex=301' -machine 'type=pc+pve0'' failed: got timeout
2024-05-13T17:25:00.631678+02:00 udo pvedaemon[2814]: <root@pam> end task UPID:udo:00003040:00013735:66423072:qmstart:182:root@pam: start failed: command '/usr/bin/kvm -id 182 -name 'max,debug-threads=on' -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/182.qmp,server=on,wait=off' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/182.pid -daemonize -smbios 'type=1,uuid=1acacb84-483f-413d-99a2-62aca2564b66' -smp '4,sockets=1,cores=4,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc 'unix:/var/run/qemu-server/182.vnc,password=on' -cpu 'Opteron_G3,enforce,+kvm_pv_eoi,+kvm_pv_unhalt,-rdtscp,vendor=AuthenticAMD' -m 1024 -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'qxl-vga,id=vga,max_outputs=4,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/182.qga,server=on,wait=off,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-serial,id=spice,bus=pci.0,addr=0x9' -chardev 'spicevmc,id=vdagent,name=vdagent' -device 'virtserialport,chardev=vdagent,name=com.redhat.spice.0' -spice 'tls-port=61000,addr=127.0.0.1,tls-ciphers=HIGH,seamless-migration=on' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3,free-page-reporting=on' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:82eaa4ccd85c' -drive 'if=none,id=drive-ide2,media=cdrom,aio=io_uring' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:ssd/vm-182-disk-1:mon_host=10.41.0.88;10.41.0.29;10.41.0.44:auth_supported=cephx:id=admin:keyring=/etc/pve/priv/ceph/ceph-ssd.keyring,if=none,id=drive-scsi0,discard=on,format=raw,cache=none,aio=io_uring,detect-zeroes=unmap' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap182i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=00:01:01:00:00:86,netdev=net0,bus=pci.0,addr=0x12,id=net0,rx_queue_size=1024,tx_queue_size=256,bootindex=300' -netdev 'type=tap,id=net1,ifname=tap182i1,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=00:01:01:00:00:87,netdev=net1,bus=pci.0,addr=0x13,id=net1,rx_queue_size=1024,tx_queue_size=256,bootindex=301' -machine 'type=pc+pve0'' failed: got timeout
2024-05-13T17:25:01.579100+02:00 udo CRON[13086]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
2024-05-13T17:25:05.611901+02:00 udo pvestatd[2406]: got timeout



The last installed packages are listed in the attached dpkg.log.
 

Welcome to the broken world of PVE kernel 6.8.
Roll back to the previous PVE version if accessible.
If not, do an offline reinstall of PVE 8.1 or switch the PVE kernel to 6.5.
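If the node still boots, switching does not require a reinstall; a minimal sketch, assuming the standard PVE 8 repositories are configured and that proxmox-kernel-6.5 is the meta-package name for the 6.5 series on your system:
Code:
# install the 6.5 kernel series alongside the 6.8 one
apt update
apt install proxmox-kernel-6.5
Afterwards select the 6.5 kernel at boot, or pin it with proxmox-boot-tool.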

I see they still refuse to take down toxic updates from their repo. Sad.
 
Thanks, the cluster is now running 6.5, BUT:

pvesm status
got timeout
got timeout
got timeout
Name Type Status Total Used Available %
backup nfs active 19260043264 11389441024 7870602240 59.14%
ceph-hdd rbd inactive 0 0 0 0.00%
ceph-sata rbd inactive 0 0 0 0.00%
ceph-ssd rbd inactive 0 0 0 0.00%

How can I activate these storage pools or debug this?
Ceph is healthy...
 
You could also pin an older kernel version:
Code:
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin {kernel}
No need to downgrade the full node.
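For example, to pin a 6.5 kernel (a sketch; the version string below is only an illustration, take the exact one from the list output on your node):
Code:
# show installed kernels and the currently pinned one
proxmox-boot-tool kernel list
# pin a specific 6.5 kernel so it stays the boot default
proxmox-boot-tool kernel pin 6.5.13-5-pve
# later, to return to the default (newest) kernel
proxmox-boot-tool kernel unpin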


AMD Opteron(tm) Processor 6380 (2 Sockets)
Those were released in 2012 if I am not mistaken? I would be prepared for them to become more and more of a problem in the future.
 
Thanks, the kernel is pinned, and yes, we are replacing them.

BUT pvesm status still shows my pools as inactive:
got timeout
got timeout
got timeout
Name Type Status Total Used Available %
backup nfs active 19260043264 11389441024 7870602240 59.14%
ceph-hdd rbd inactive 0 0 0 0.00%
ceph-sata rbd inactive 0 0 0 0.00%
ceph-ssd rbd inactive 0 0 0 0.00%

How can I activate these storage pools or debug this?
 
How is Ceph doing?
  • ceph -s
  • ceph health detail
  • ceph osd df tree

And please put the output inside [code][/code] tags for better readability. There are also buttons for that at the top of the editor.
 
Ahhhhh, I found the culprit: creating new monitors under Ceph/Monitor DOES NOT update the storage definitions under Datacenter/Storage!

So no reachable monitor was found....

It would really be nice if pvesm status at least reported the root cause (got timeout from mon xyz).
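For anyone hitting the same thing, a minimal sketch for checking and fixing it (storage names are the ones from this thread; the monitor IPs are only placeholders taken from the log above, use your current monitors):
Code:
# monitors the Ceph cluster actually has
ceph mon dump
# monitor list stored in the PVE storage definitions
grep -A 6 'ceph-ssd' /etc/pve/storage.cfg
# point the storage at the current monitors (repeat for ceph-hdd and ceph-sata),
# or edit the monhost line in /etc/pve/storage.cfg directly
pvesm set ceph-ssd --monhost "10.41.0.88 10.41.0.29 10.41.0.44"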

Best regards
Konrad
 