Migrating VMs from disconnected Cluster Node (broken corosync)

maho

Member
Sep 21, 2021
5
1
8
59
Hi,
I have a 5-node PVE cluster with a dedicated PBS. One node lost connection and is now out of sync: the corosync.conf on the broken node has a version mismatch, which cannot be fixed there because /etc/pve is read-only.

So I decided to shut down all VMs on that node and back them up, so that I can restore them after I destroy and recreate the currently orphaned node. The problem is that, since this node has no quorum and corosync.service has failed, there is no way to back them up:
root@ukfmephy-c85002:~# vzdump 86041 --storage pbs-ds02 --mode snapshot
INFO: starting new backup job: vzdump 86041 --mode snapshot --storage pbs-ds02
INFO: Starting Backup of VM 86041 (qemu)
INFO: Backup started at 2024-08-27 09:53:04
INFO: status = stopped
ERROR: Backup of VM 86041 failed - unable to open file '/etc/pve/nodes/ukfmephy-c85002/qemu-server/86041.conf.tmp.33573' - Permission denied
INFO: Failed at 2024-08-27 09:53:04
INFO: Backup job finished with errors
INFO: notified via target `mail-to-root`
job errors
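
For reference, a quick sketch to confirm that the failure really comes from /etc/pve being unwriteable while the node has no quorum (the test filename is arbitrary):

# /etc/pve is a FUSE mount provided by pmxcfs
mount | grep /etc/pve

# any write attempt fails with "Permission denied" as long as the node is not quorate
touch /etc/pve/writetest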

pvecm status
Cluster information
-------------------
Name: physiocluster
Config Version: 20
Transport: knet
Secure auth: on

Cannot initialize CMAP service

root@ukfmephy-c85002:~# systemctl status corosync.service
× corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Tue 2024-08-27 11:01:46 CEST; 15min ago
Duration: 9.061s
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Process: 77481 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=0/SUCCESS)
Process: 77597 ExecStop=/usr/sbin/corosync-cfgtool -H --force (code=exited, status=1/FAILURE)
Main PID: 77481 (code=exited, status=0/SUCCESS)
CPU: 223ms

Aug 27 11:01:45 ukfmephy-c85002 corosync[77481]: [SERV ] Service engine unloaded: corosync watchdog service
Aug 27 11:01:46 ukfmephy-c85002 corosync[77481]: [KNET ] pmtud: Global data MTU changed to: 1397
Aug 27 11:01:46 ukfmephy-c85002 corosync[77481]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Aug 27 11:01:46 ukfmephy-c85002 corosync[77481]: [KNET ] link: Resetting MTU for link 0 because host 5 joined
Aug 27 11:01:46 ukfmephy-c85002 corosync[77481]: [KNET ] link: Resetting MTU for link 0 because host 6 joined
Aug 27 11:01:46 ukfmephy-c85002 corosync[77481]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Aug 27 11:01:46 ukfmephy-c85002 corosync[77481]: [KNET ] link: Resetting MTU for link 0 because host 4 joined
Aug 27 11:01:46 ukfmephy-c85002 corosync[77481]: [MAIN ] Corosync Cluster Engine exiting normally
Aug 27 11:01:46 ukfmephy-c85002 systemd[1]: corosync.service: Control process exited, code=exited, status=1/FAILURE
Aug 27 11:01:46 ukfmephy-c85002 systemd[1]: corosync.service: Failed with result 'exit-code'.
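
Since the suspected cause is the corosync.conf version mismatch, a simple check (a sketch; run the first command on a healthy node and the second on the broken node) is to compare the config_version fields of the cluster-wide copy and the node-local copy:

# on a healthy, quorate node
grep config_version /etc/pve/corosync.conf

# on the broken node (this copy is what corosync actually reads at startup)
grep config_version /etc/corosync/corosync.conf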

root@ukfmephy-c85002:~# systemctl status pve-cluster.service
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
Active: active (running) since Tue 2024-08-27 09:05:53 CEST; 2h 12min ago
Main PID: 2404 (pmxcfs)
Tasks: 6 (limit: 618650)
Memory: 48.2M
CPU: 4.322s
CGroup: /system.slice/pve-cluster.service
└─2404 /usr/bin/pmxcfs

Aug 27 11:17:48 ukfmephy-c85002 pmxcfs[2404]: [dcdb] crit: cpg_initialize failed: 2
Aug 27 11:17:48 ukfmephy-c85002 pmxcfs[2404]: [status] crit: cpg_initialize failed: 2
Aug 27 11:17:54 ukfmephy-c85002 pmxcfs[2404]: [quorum] crit: quorum_initialize failed: 2
Aug 27 11:17:54 ukfmephy-c85002 pmxcfs[2404]: [confdb] crit: cmap_initialize failed: 2
Aug 27 11:17:54 ukfmephy-c85002 pmxcfs[2404]: [dcdb] crit: cpg_initialize failed: 2
Aug 27 11:17:54 ukfmephy-c85002 pmxcfs[2404]: [status] crit: cpg_initialize failed: 2
Aug 27 11:18:00 ukfmephy-c85002 pmxcfs[2404]: [quorum] crit: quorum_initialize failed: 2
Aug 27 11:18:00 ukfmephy-c85002 pmxcfs[2404]: [confdb] crit: cmap_initialize failed: 2
Aug 27 11:18:00 ukfmephy-c85002 pmxcfs[2404]: [dcdb] crit: cpg_initialize failed: 2
Aug 27 11:18:00 ukfmephy-c85002 pmxcfs[2404]: [status] crit: cpg_initialize failed: 2
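
The log above shows the underlying issue: pmxcfs cannot reach corosync (quorum/cmap/cpg initialization fails), so the node never becomes quorate and /etc/pve stays read-only, which is exactly why vzdump cannot write its temporary config file. One workaround that is sometimes used on an isolated node, sketched here with the caveat that local changes can conflict once the node rejoins the cluster, is to start pmxcfs in local mode:

# stop the cluster filesystem service, then start pmxcfs in local mode
systemctl stop pve-cluster
pmxcfs -l

# /etc/pve is now writeable, backed only by the node-local copy of the config database;
# run the backup here, then return to normal operation:
killall pmxcfs
systemctl start pve-cluster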



I have some days-old backups of all VMs, but I would really like to capture the most recent state; I do not care about the cluster node itself. Any ideas?
Thanks in advance.
Kind regards from Kiel
 
OK,
I was able to resolve the fundamental problem (the node not being able to connect to the cluster because of the corosync.conf mismatch).

Classic RTFM: https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_edit_corosync_conf -> there are two corosync.conf files... :eek:

I just had to scp the current corosync.conf from a working node to the LOCAL copy on the broken node; that one is writeable.

From a working node, one simply runs:
# scp /etc/pve/corosync.conf root@broken-pve-node:/etc/corosync/corosync.conf

Then reboot the node, and voilà.
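
For completeness, the whole recovery step as a sketch (the hostname is a placeholder, as in the command above; restarting the services may also work instead of a full reboot):

# on a healthy, quorate node: push the current cluster-wide config
# to the writeable node-local copy on the broken node
scp /etc/pve/corosync.conf root@broken-pve-node:/etc/corosync/corosync.conf

# on the broken node: restart the stack (or reboot, as above)
systemctl restart corosync pve-cluster

# verify
pvecm status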

But my original question, how to back up VMs from an orphaned cluster node, is still open. The only solution I have found so far is to copy all virtual drives and config files manually. Since I am using LVM, that is really annoying, because a lot of zero data will be generated.
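
Until there is a cleaner answer, one way to avoid carrying all the zero blocks when copying LVM-backed disks manually is to convert them into sparse qcow2 images; a rough sketch (the volume group name, disk name and target path are placeholders):

# qemu-img detects zeroed regions and does not allocate them in the qcow2 output
qemu-img convert -p -O qcow2 /dev/pve/vm-86041-disk-0 /mnt/backup/vm-86041-disk-0.qcow2

# the VM config can still be read from the (read-only) local pmxcfs mount
cp /etc/pve/nodes/ukfmephy-c85002/qemu-server/86041.conf /mnt/backup/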