Migrating VMs from disconnected Cluster Node (broken corosync)

maho

Member
Sep 21, 2021
5
1
8
59
Hi,
I have a 5-node PVE cluster with a dedicated PBS. One node lost connection and is now out of sync: the corosync.conf on the broken node has a version mismatch, which cannot be fixed there because /etc/pve is read-only.

So I decided to shut down all VMs on that node and back them up, so that I can restore them after I destroy and recreate the currently orphaned node. The problem is that, since this node has no quorum and corosync.service has failed, there is no way to back them up:
root@ukfmephy-c85002:~# vzdump 86041 --storage pbs-ds02 --mode snapshot
INFO: starting new backup job: vzdump 86041 --mode snapshot --storage pbs-ds02
INFO: Starting Backup of VM 86041 (qemu)
INFO: Backup started at 2024-08-27 09:53:04
INFO: status = stopped
ERROR: Backup of VM 86041 failed - unable to open file '/etc/pve/nodes/ukfmephy-c85002/qemu-server/86041.conf.tmp.33573' - Permission denied
INFO: Failed at 2024-08-27 09:53:04
INFO: Backup job finished with errors
INFO: notified via target `mail-to-root`
job errors
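
For reference, a quick sketch to confirm that the failure really comes from /etc/pve being unwriteable while the node has no quorum (the test filename is arbitrary):

# /etc/pve is a FUSE mount provided by pmxcfs
mount | grep /etc/pve

# any write attempt fails with "Permission denied" as long as the node is not quorate
touch /etc/pve/writetest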

pvecm status
Cluster information
-------------------
Name: physiocluster
Config Version: 20
Transport: knet
Secure auth: on

Cannot initialize CMAP service

root@ukfmephy-c85002:~# systemctl status corosync.service
× corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Tue 2024-08-27 11:01:46 CEST; 15min ago
Duration: 9.061s
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Process: 77481 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=0/SUCCESS)
Process: 77597 ExecStop=/usr/sbin/corosync-cfgtool -H --force (code=exited, status=1/FAILURE)
Main PID: 77481 (code=exited, status=0/SUCCESS)
CPU: 223ms

Aug 27 11:01:45 ukfmephy-c85002 corosync[77481]: [SERV ] Service engine unloaded: corosync watchdog service
Aug 27 11:01:46 ukfmephy-c85002 corosync[77481]: [KNET ] pmtud: Global data MTU changed to: 1397
Aug 27 11:01:46 ukfmephy-c85002 corosync[77481]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Aug 27 11:01:46 ukfmephy-c85002 corosync[77481]: [KNET ] link: Resetting MTU for link 0 because host 5 joined
Aug 27 11:01:46 ukfmephy-c85002 corosync[77481]: [KNET ] link: Resetting MTU for link 0 because host 6 joined
Aug 27 11:01:46 ukfmephy-c85002 corosync[77481]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Aug 27 11:01:46 ukfmephy-c85002 corosync[77481]: [KNET ] link: Resetting MTU for link 0 because host 4 joined
Aug 27 11:01:46 ukfmephy-c85002 corosync[77481]: [MAIN ] Corosync Cluster Engine exiting normally
Aug 27 11:01:46 ukfmephy-c85002 systemd[1]: corosync.service: Control process exited, code=exited, status=1/FAILURE
Aug 27 11:01:46 ukfmephy-c85002 systemd[1]: corosync.service: Failed with result 'exit-code'.
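
Since the suspected cause is the corosync.conf version mismatch, a simple check (a sketch; run the first command on a healthy node and the second on the broken node) is to compare the config_version fields of the cluster-wide copy and the node-local copy:

# on a healthy, quorate node
grep config_version /etc/pve/corosync.conf

# on the broken node (this copy is what corosync actually reads at startup)
grep config_version /etc/corosync/corosync.conf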

root@ukfmephy-c85002:~# systemctl status pve-cluster.service
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
Active: active (running) since Tue 2024-08-27 09:05:53 CEST; 2h 12min ago
Main PID: 2404 (pmxcfs)
Tasks: 6 (limit: 618650)
Memory: 48.2M
CPU: 4.322s
CGroup: /system.slice/pve-cluster.service
└─2404 /usr/bin/pmxcfs

Aug 27 11:17:48 ukfmephy-c85002 pmxcfs[2404]: [dcdb] crit: cpg_initialize failed: 2
Aug 27 11:17:48 ukfmephy-c85002 pmxcfs[2404]: [status] crit: cpg_initialize failed: 2
Aug 27 11:17:54 ukfmephy-c85002 pmxcfs[2404]: [quorum] crit: quorum_initialize failed: 2
Aug 27 11:17:54 ukfmephy-c85002 pmxcfs[2404]: [confdb] crit: cmap_initialize failed: 2
Aug 27 11:17:54 ukfmephy-c85002 pmxcfs[2404]: [dcdb] crit: cpg_initialize failed: 2
Aug 27 11:17:54 ukfmephy-c85002 pmxcfs[2404]: [status] crit: cpg_initialize failed: 2
Aug 27 11:18:00 ukfmephy-c85002 pmxcfs[2404]: [quorum] crit: quorum_initialize failed: 2
Aug 27 11:18:00 ukfmephy-c85002 pmxcfs[2404]: [confdb] crit: cmap_initialize failed: 2
Aug 27 11:18:00 ukfmephy-c85002 pmxcfs[2404]: [dcdb] crit: cpg_initialize failed: 2
Aug 27 11:18:00 ukfmephy-c85002 pmxcfs[2404]: [status] crit: cpg_initialize failed: 2
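
The log above shows the underlying issue: pmxcfs cannot reach corosync (quorum/cmap/cpg initialization fails), so the node never becomes quorate and /etc/pve stays read-only, which is exactly why vzdump cannot write its temporary config file. One workaround that is sometimes used on an isolated node, sketched here with the caveat that local changes can conflict once the node rejoins the cluster, is to start pmxcfs in local mode:

# stop the cluster filesystem service, then start pmxcfs in local mode
systemctl stop pve-cluster
pmxcfs -l

# /etc/pve is now writeable, backed only by the node-local copy of the config database;
# run the backup here, then return to normal operation:
killall pmxcfs
systemctl start pve-cluster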



I have some days-old backups of all VMs, but I would really like to capture the most recent state; I do not care about the cluster node itself. Any ideas?
Thanks in advance.
Kind regards from Kiel
 
OK,
I was able to resolve the fundamental problem (the node not being able to connect to the cluster because of the corosync.conf mismatch).

Classic RTFM: https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_edit_corosync_conf -> there are two corosync.conf files... :eek:

I just had to scp the current corosync.conf from a working node to the LOCAL copy on the broken node; that one is writeable.

From a working node, one simply runs:
# scp /etc/pve/corosync.conf root@broken-pve-node:/etc/corosync/corosync.conf

Then reboot the node, and voilà.
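
For completeness, the whole recovery step as a sketch (the hostname is a placeholder, as in the command above; restarting the services may also work instead of a full reboot):

# on a healthy, quorate node: push the current cluster-wide config
# to the writeable node-local copy on the broken node
scp /etc/pve/corosync.conf root@broken-pve-node:/etc/corosync/corosync.conf

# on the broken node: restart the stack (or reboot, as above)
systemctl restart corosync pve-cluster

# verify
pvecm status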

But my original question, how to back up VMs from an orphaned cluster node, is still open. The only solution I have found so far is to copy all virtual drives and config files manually. Since I am using LVM, that is really annoying, because a lot of zero data will be generated.
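
Until there is a cleaner answer, one way to avoid carrying all the zero blocks when copying LVM-backed disks manually is to convert them into sparse qcow2 images; a rough sketch (the volume group name, disk name and target path are placeholders):

# qemu-img detects zeroed regions and does not allocate them in the qcow2 output
qemu-img convert -p -O qcow2 /dev/pve/vm-86041-disk-0 /mnt/backup/vm-86041-disk-0.qcow2

# the VM config can still be read from the (read-only) local pmxcfs mount
cp /etc/pve/nodes/ukfmephy-c85002/qemu-server/86041.conf /mnt/backup/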