Update from 8.0.4 to 8.1.10 broke all my VMs

ZachARoo

New Member
Apr 11, 2024
I currently have a 5-node cluster. Three nodes are running Ceph, while the other two are on local storage for now, with Ceph planned for the future.

I went to update the nodes to the current version to get them all on the same version, since they were slightly different before: the Ceph nodes were on 8.0.4 and the other two on 8.1.2, I believe. I had some issues with host verification and thought updating might help, since they were on different versions.

The two nodes that are not on Ceph work completely fine, and my verification issues are resolved. The three nodes that run Ceph don't let me access any VMs: I cannot change any VM configuration or view the VMs in the console. I can change node options and configurations, just not anything VM-related. The VMs appear to start, but the task bar below shows a failure. Here is what it says when I do different things with the VMs.

Starting shows this -
TASK ERROR: start failed: command '/usr/bin/kvm -id 205 -name 'PokehaanCraft2,debug-threads=on' -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/205.qmp,server=on,wait=off' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/205.pid -daemonize -smbios 'type=1,uuid=f4be94ab-a2e4-4a64-ae0e-22f608c45932' -smp '4,sockets=1,cores=4,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc 'unix:/var/run/qemu-server/205.vnc,password=on' -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt -m 16384 -object 'iothread,id=iothread-virtioscsi0' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.3,chassis_nr=3,bus=pci.0,addr=0x5' -device 'vmgenid,guid=48b412a3-6ff1-4421-8986-a830fee1050e' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3,free-page-reporting=on' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:cbe9613f797c' -drive 'if=none,id=drive-ide2,media=cdrom,aio=io_uring' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=101' -device 'virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1,iothread=iothread-virtioscsi0' -drive 'file=rbd:ceph-pool/vm-205-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/ceph-pool.keyring,if=none,id=drive-scsi0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' -device 'scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,rotation_rate=1,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap205i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=CA:B7:59:7F:D0:83,netdev=net0,bus=pci.0,addr=0x12,id=net0,rx_queue_size=1024,tx_queue_size=256,bootindex=102' -machine 'type=pc+pve0'' failed: got timeout



Changing any options doesn't technically error; it just never applies the change and keeps processing endlessly.

Trying to open consoles gives this error -
VM 205 qmp command 'set_password' failed - unable to connect to VM 205 qmp socket - timeout after 51 retries
TASK ERROR: Failed to run vncproxy.

I am also seeing that two of my OSDs are down and won't start no matter what I try. Proxmox reports Done!, but they never actually start. Is this an issue with the new version of Ceph? Or is it something where the nodes upgraded but the VMs are still based on the old versions?
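
For reference, here is roughly what I am using to check on the OSDs (standard Ceph/systemd commands, as far as I know; osd.2 is just an example ID, not necessarily one of mine) -

ceph status                          # overall cluster health and degraded counts
ceph osd tree                        # which OSDs are down/out and on which host
systemctl status ceph-osd@2.service  # state of a single OSD daemon
journalctl -u ceph-osd@2.service -b  # daemon log from this boot, to see why it fails to start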
 
Hi,
this sounds like the node in question has no quorum. Please post the output of pvecm status and systemctl status pve-cluster.service corosync.service from that node.
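That is, on the affected node, run -

pvecm status
systemctl status pve-cluster.service corosync.service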
 
Sorry, for some reason I was not getting email updates on the thread, so I didn't notice you had responded already. I removed some of the IP info because I didn't want to post it. I ran these on one of the servers; 3 of the 5 are having this issue, specifically the three that were on 8.0 and not 8.1. I have since solved the OSD issue by waiting and then rebooting. I think it might have been doing some rebuilding afterwards, which was slowing things down.

pvecm status shows -
Cluster information
-------------------
Name: LaffinServers
Config Version: 11
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Thu Apr 11 19:09:21 2024
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000001
Ring ID: 1.26b
Quorate: Yes

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 5
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.x.x (local)
0x00000002 1 192.168.x.x
0x00000003 1 192.168.x.x
0x00000004 1 192.168.x.x
0x00000005 1 192.168.x.x


systemctl status pve-cluster.service corosync.service shows -

● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
Active: active (running) since Wed 2024-04-10 22:41:14 CDT; 20h ago
Main PID: 851 (pmxcfs)
Tasks: 7 (limit: 38314)
Memory: 65.4M
CPU: 1min 24.204s
CGroup: /system.slice/pve-cluster.service
└─851 /usr/bin/pmxcfs

Apr 11 09:41:13 laffinserver1 pmxcfs[851]: [dcdb] notice: data verification successful
Apr 11 10:41:13 laffinserver1 pmxcfs[851]: [dcdb] notice: data verification successful
Apr 11 11:41:13 laffinserver1 pmxcfs[851]: [dcdb] notice: data verification successful
Apr 11 12:41:13 laffinserver1 pmxcfs[851]: [dcdb] notice: data verification successful
Apr 11 13:41:13 laffinserver1 pmxcfs[851]: [dcdb] notice: data verification successful
Apr 11 14:41:13 laffinserver1 pmxcfs[851]: [dcdb] notice: data verification successful
Apr 11 15:41:13 laffinserver1 pmxcfs[851]: [dcdb] notice: data verification successful
Apr 11 16:41:13 laffinserver1 pmxcfs[851]: [dcdb] notice: data verification successful
Apr 11 17:41:13 laffinserver1 pmxcfs[851]: [dcdb] notice: data verification successful
Apr 11 18:41:13 laffinserver1 pmxcfs[851]: [dcdb] notice: data verification successful

● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
Active: active (running) since Wed 2024-04-10 22:41:15 CDT; 20h ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 955 (corosync)
Tasks: 9 (limit: 38314)
Memory: 139.1M
CPU: 15min 44.933s
CGroup: /system.slice/corosync.service
└─955 /usr/sbin/corosync -f

Apr 10 22:45:44 laffinserver1 corosync[955]: [KNET ] rx: host: 5 link: 0 is up
Apr 10 22:45:44 laffinserver1 corosync[955]: [KNET ] link: Resetting MTU for link 0 because host 5 joined
Apr 10 22:45:44 laffinserver1 corosync[955]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Apr 10 22:45:44 laffinserver1 corosync[955]: [KNET ] pmtud: PMTUD link change for host: 5 link: 0 from 469 to 1397
Apr 10 22:45:44 laffinserver1 corosync[955]: [KNET ] pmtud: Global data MTU changed to: 1397
Apr 10 22:45:44 laffinserver1 corosync[955]: [QUORUM] Sync members[5]: 1 2 3 4 5
Apr 10 22:45:44 laffinserver1 corosync[955]: [QUORUM] Sync joined[1]: 5
Apr 10 22:45:44 laffinserver1 corosync[955]: [TOTEM ] A new membership (1.26b) was formed. Members joined: 5
Apr 10 22:45:44 laffinserver1 corosync[955]: [QUORUM] Members[5]: 1 2 3 4 5
Apr 10 22:45:44 laffinserver1 corosync[955]: [MAIN ] Completed service synchronization, ready to provide service.
 
I literally just clicked post reply on the previous post and then went to check whether the issue still persisted today. VMs will start now, but one of the OSDs still will not start. The node in the post above is the one with the OSD down. I'm obviously getting degraded messages and PG warnings because of it, but it keeps kicking the OSDs to down and out. Currently it is just laffinserver1. I marked it back in, but it will not start.
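
For reference, this is roughly how I've been bringing the OSD back in and checking it (osd.0 here is just an example ID, not my exact one) -

ceph osd in osd.0                      # mark the OSD back "in" so it can hold data again
systemctl restart ceph-osd@0.service   # try to start the OSD daemon
journalctl -u ceph-osd@0.service -b    # check the daemon log for why it won't start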

I am trying to get more drives to add more OSDs, but they haven't arrived yet.

It must be something with Ceph, as those 3 servers run Ceph. The other two have their own storage until the new drives arrive so they can be added to Ceph.
 
Well, I rebooted that server again. It has now been something like 8 reboots since yesterday, but now it all seems to be working well.

I did read somewhere that after reboots, and sometimes after updates, the Ceph services do some rebuilding. Not sure how accurate that is, but if it's true then I just didn't wait long enough.
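
If that is the case, the recovery progress should be visible with something like this (standard Ceph commands, I believe) -

ceph -s               # shows recovery/rebalance I/O and remaining degraded objects
ceph health detail    # lists the degraded/undersized PGs while recovery runs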

Regardless, I should just have more patience after larger updates, knowing they are going to make big changes. I apologize for wasting anyone's time here.