PVE 7 to 8: VM crashes after migrating, OSD not found

matt.

I run a 3-node PVE cluster with Ceph.
I migrated all VMs away from node 3, upgraded Ceph to the latest release (Quincy), and then started the PVE 7 to 8 upgrade on node 3.
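
For reference, this is roughly what I ran on node 3, just a condensed sketch of the documented upgrade path and not a complete walkthrough (repository edits abbreviated, previous Ceph release assumed to be Pacific):

Code:
# Ceph -> Quincy (following the Proxmox Ceph upgrade guide)
ceph osd set noout
# switch /etc/apt/sources.list.d/ceph.list to the quincy repo, then:
apt update && apt full-upgrade
systemctl restart ceph-mon.target    # one node at a time, wait for quorum
systemctl restart ceph-osd.target    # one node at a time, wait for HEALTH_OK
ceph osd require-osd-release quincy
ceph osd unset noout

# PVE 7 -> 8 on node 3
pve7to8 --full                       # pre-upgrade checklist script
sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list /etc/apt/sources.list.d/*.list
apt update && apt dist-upgrade
reboot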

After rebooting node 3 (now PVE 8), everything seemed to work well. So I migrated two VMs to node 3, one each from node 1 and node 2 (both still on PVE 7). Once the migration task was done, both VMs showed almost 100% CPU load; one immediately became unresponsive in the GUI console and the other kernel panicked. I fear this will happen with other VMs once I migrate them.
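
The migrations were started from the GUI; as far as I understand that is equivalent to something like this on the CLI (VM IDs 101 and 102 are just placeholders):

Code:
# live-migrate one VM each from node1 and node2 to the freshly upgraded node3
qm migrate 101 node3 --online
qm migrate 102 node3 --online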

In addition, when I click "Details" on one of the OSDs displayed in the PVE GUI under "Ceph" -> "OSD", I get a message like "OSD '15' does not exist on host 'node3' (500)". This happens for all OSDs and from each node's GUI. However, the OSD list looks like it did before and the Ceph cluster is healthy.

I am wondering what went wrong here, whether it is related to the VMs crashing, and how I can fix it.
My plan was to upgrade PVE node by node and migrate the VMs between the nodes as each upgrade completes, so that the VMs have no downtime.

Any help is greatly appreciated.
 
Code:
# pveceph status
  cluster:
    id:     xyz
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum node1,node2,node3 (age 18h)
    mgr: node1(active, since 24h), standbys: node2, node3
    mds: 1/1 daemons up, 2 standby
    osd: 21 osds: 21 up (since 18h), 21 in (since 22M)
 
  data:
    volumes: 1/1 healthy
    pools:   6 pools, 641 pgs
    objects: 1.70M objects, 6.5 TiB
    usage:   19 TiB used, 33 TiB / 52 TiB avail
    pgs:     641 active+clean

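
For completeness, the CLI view of the OSDs still looks normal; it is only the "Details" button in the GUI that returns the 500 error. This is roughly how I cross-checked it (OSD 15 is just the one from the error message):

Code:
ceph osd tree                         # all 21 OSDs still mapped to their hosts
ceph osd metadata 15 | grep hostname  # hostname the OSD itself reports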

Meanwhile I was able to establish that live-migrating VMs between PVE 7 and PVE 8 does not work and leads to the aforementioned problems, most likely because of the different kernel versions. Offline migration, however, works fine, and those VMs also come up as expected. (I would have liked this to be mentioned in the "Known Issues" section of the upgrade instructions in the Proxmox wiki.)
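
In case it helps someone else until all nodes are on PVE 8, an offline migration boils down to something like this (VM ID 101 again just a placeholder):

Code:
qm shutdown 101
qm migrate 101 node3    # no --online, so this is an offline migration
qm start 101            # run on node3 once the migration has finished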

What remains is the issue with the CEPH OSDs as mentioned in my opening post:

[..] when I click "Details" on one of the OSDs displayed in the PVE GUI under "Ceph" -> "OSD", I get a message like "OSD '15' does not exist on host 'node3' (500)". This happens for all OSDs and from each node's GUI. However, the OSD list looks like it did before and the Ceph cluster is healthy.
 
Hi,

Does the issue occur in the web UI on all nodes, or only on the node that has been upgraded? If it is only the upgraded node, I would try restarting the pvedaemon service. Otherwise, did you check the Ceph logs for any interesting messages?
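
For example (just a sketch with the default service names), on the upgraded node:

Code:
systemctl restart pvedaemon pveproxy
journalctl -u pvedaemon -e    # look for errors around the time of the 500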
 
This problem affects all three nodes and shows up when viewed from any of the three nodes' GUIs (PVE 7.4-15 / PVE 8.0.3).

ceph.log, ceph.audit.log, as well as ceph-osd.10.log and ceph-mgr.node2.log picked at random, don't show any warnings or errors. So basically everything looks the way it always does, except for the error in the GUI, which looks pretty worrying.
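
To be precise, I only skimmed them with something like the following (default log locations under /var/log/ceph, run on the respective nodes):

Code:
grep -iE 'warn|err' /var/log/ceph/ceph.log /var/log/ceph/ceph.audit.log
grep -iE 'warn|err' /var/log/ceph/ceph-osd.10.log
grep -iE 'warn|err' /var/log/ceph/ceph-mgr.node2.log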
 

Attachments

  • screenshot.jpg
This problem only existed in the Proxmox GUI (everything was running smoothly and the CLI indicated no problem). After changing /etc/hostname to contain only the host part (without the domain) and rebooting, the problem disappeared.
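
Concretely (example.com is just a placeholder for our domain), /etc/hostname used to contain the FQDN and now only contains the short name, while /etc/hosts keeps the FQDN mapping:

Code:
# before
cat /etc/hostname
node3.example.com

# after
cat /etc/hostname
node3

# /etc/hosts still maps the FQDN and short name to the node's IP, e.g.
192.0.2.13  node3.example.com node3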
 
