PVE 7 to 8: VM crashes after migrating, OSD not found

matt.

I run a 3-node PVE cluster with Ceph.
I migrated all VMs away from node 3, upgraded Ceph to the latest release (Quincy), and then started the PVE 7 to 8 upgrade on node 3.
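
For reference, this is roughly what I ran on node 3, just a condensed sketch of the documented upgrade path and not a complete walkthrough (repository edits abbreviated, previous Ceph release assumed to be Pacific):

Code:
# Ceph -> Quincy (following the Proxmox Ceph upgrade guide)
ceph osd set noout
# switch /etc/apt/sources.list.d/ceph.list to the quincy repo, then:
apt update && apt full-upgrade
systemctl restart ceph-mon.target    # one node at a time, wait for quorum
systemctl restart ceph-osd.target    # one node at a time, wait for HEALTH_OK
ceph osd require-osd-release quincy
ceph osd unset noout

# PVE 7 -> 8 on node 3
pve7to8 --full                       # pre-upgrade checklist script
sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list /etc/apt/sources.list.d/*.list
apt update && apt dist-upgrade
reboot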

After rebooting node 3 (now PVE 8), everything seemed to work well. So I migrated two VMs to node 3, one each from node 1 and node 2 (both still on PVE 7). Once the migration task was done, both VMs showed almost 100% CPU load; one immediately became unresponsive in the GUI console and the other kernel panicked. I fear this will happen with other VMs once I migrate them.
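
The migrations were started from the GUI; as far as I understand that is equivalent to something like this on the CLI (VM IDs 101 and 102 are just placeholders):

Code:
# live-migrate one VM each from node1 and node2 to the freshly upgraded node3
qm migrate 101 node3 --online
qm migrate 102 node3 --online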

In addition, when I click "Details" on one of the OSDs displayed in the PVE GUI under "Ceph" -> "OSD", I get a message like "OSD '15' does not exist on host 'node3' (500)". This happens for all OSDs and from each node's GUI. However, the OSD list looks like it did before and the Ceph cluster is healthy.

I am wondering what went wrong here, whether it is related to the VMs crashing, and how I can fix it.
My plan was to upgrade PVE node by node and migrate the VMs between the nodes as each upgrade completes, so that the VMs have no downtime.

Any help is greatly appreciated.
 
Code:
# pveceph status
  cluster:
    id:     xyz
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum node1,node2,node3 (age 18h)
    mgr: node1(active, since 24h), standbys: node2, node3
    mds: 1/1 daemons up, 2 standby
    osd: 21 osds: 21 up (since 18h), 21 in (since 22M)
 
  data:
    volumes: 1/1 healthy
    pools:   6 pools, 641 pgs
    objects: 1.70M objects, 6.5 TiB
    usage:   19 TiB used, 33 TiB / 52 TiB avail
    pgs:     641 active+clean

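
For completeness, the CLI view of the OSDs still looks normal; it is only the "Details" button in the GUI that returns the 500 error. This is roughly how I cross-checked it (OSD 15 is just the one from the error message):

Code:
ceph osd tree                         # all 21 OSDs still mapped to their hosts
ceph osd metadata 15 | grep hostname  # hostname the OSD itself reports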

Meanwhile I was able to establish that live-migrating VMs between PVE 7 and PVE 8 does not work and leads to the aforementioned problems, most likely because of the different kernel versions. Offline migration, however, works fine, and those VMs also come up as expected. (I would have liked this to be mentioned in the "Known Issues" section of the upgrade instructions in the Proxmox wiki.)
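
In case it helps someone else until all nodes are on PVE 8, an offline migration boils down to something like this (VM ID 101 again just a placeholder):

Code:
qm shutdown 101
qm migrate 101 node3    # no --online, so this is an offline migration
qm start 101            # run on node3 once the migration has finished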

What remains is the issue with the CEPH OSDs as mentioned in my opening post:

[..] when I click "Details" on one of the OSDs displayed in the PVE GUI under "Ceph" -> "OSD", I get a message like "OSD '15' does not exist on host 'node3' (500)". This happens for all OSDs and from each node's GUI. However, the OSD list looks like it did before and the Ceph cluster is healthy.
 
Hi,

Does the issue occur in the web UI on all nodes, or only on the node that has been upgraded? If it is only the upgraded node, I would try restarting the pvedaemon service. Otherwise, did you check the Ceph logs for any interesting messages?
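
For example (just a sketch with the default service names), on the upgraded node:

Code:
systemctl restart pvedaemon pveproxy
journalctl -u pvedaemon -e    # look for errors around the time of the 500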
 
This problem affects all three nodes and shows up when viewed from any of the three nodes' GUIs (PVE 7.4-15 / PVE 8.0.3).

ceph.log, ceph.audit.log, as well as ceph-osd.10.log and ceph-mgr.node2.log picked at random, don't show any warnings or errors. So basically everything looks the way it always does, except for the error in the GUI, which looks pretty worrying.
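
To be precise, I only skimmed them with something like the following (default log locations under /var/log/ceph, run on the respective nodes):

Code:
grep -iE 'warn|err' /var/log/ceph/ceph.log /var/log/ceph/ceph.audit.log
grep -iE 'warn|err' /var/log/ceph/ceph-osd.10.log
grep -iE 'warn|err' /var/log/ceph/ceph-mgr.node2.log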
 

Attachments

  • screenshot.jpg
This problem only existed in the Proxmox GUI (everything was running smoothly and the CLI indicated no problem). After changing /etc/hostname to contain only the host part (without the domain) and rebooting, the problem disappeared.
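
Concretely (example.com is just a placeholder for our domain), /etc/hostname used to contain the FQDN and now only contains the short name, while /etc/hosts keeps the FQDN mapping:

Code:
# before
cat /etc/hostname
node3.example.com

# after
cat /etc/hostname
node3

# /etc/hosts still maps the FQDN and short name to the node's IP, e.g.
192.0.2.13  node3.example.com node3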
 
