Moin,
I have a 5-node PVE cluster with a dedicated PBS. One node lost connection and is now out of sync: the corosync.conf on the broken node has a version mismatch that cannot be resolved, because /etc/pve is mounted read-only there.
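(The mismatch shows up when comparing the config_version lines of the node-local copy and the cluster-wide copy, standard PVE paths:)

# compare the local corosync config against the cluster-wide copy in pmxcfs
grep config_version /etc/corosync/corosync.conf /etc/pve/corosync.conf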
So I decided to shut down all VMs on that node and back them up, so that I can restore them after destroying and recreating the currently orphaned node. The problem is: since this node has no quorum and corosync.service has failed, there is no way to run the backup:
root@ukfmephy-c85002:~# vzdump 86041 --storage pbs-ds02 --mode snapshot
INFO: starting new backup job: vzdump 86041 --mode snapshot --storage pbs-ds02
INFO: Starting Backup of VM 86041 (qemu)
INFO: Backup started at 2024-08-27 09:53:04
INFO: status = stopped
ERROR: Backup of VM 86041 failed - unable to open file '/etc/pve/nodes/ukfmephy-c85002/qemu-server/86041.conf.tmp.33573' - Permission denied
INFO: Failed at 2024-08-27 09:53:04
INFO: Backup job finished with errors
INFO: notified via target `mail-to-root`
job errors
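As far as I understand, vzdump fails here because pmxcfs mounts /etc/pve read-only while the node is inquorate, so the temporary VM config file cannot be created. A quick check that confirms my assumption (the scratch file name is arbitrary):

# expected to fail with "Permission denied" while the node has no quorum
touch /etc/pve/writetest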
root@ukfmephy-c85002:~# pvecm status
Cluster information
-------------------
Name: physiocluster
Config Version: 20
Transport: knet
Secure auth: on
Cannot initialize CMAP service
root@ukfmephy-c85002:~# systemctl status corosync.service
× corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Tue 2024-08-27 11:01:46 CEST; 15min ago
Duration: 9.061s
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Process: 77481 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=0/SUCCESS)
Process: 77597 ExecStop=/usr/sbin/corosync-cfgtool -H --force (code=exited, status=1/FAILURE)
Main PID: 77481 (code=exited, status=0/SUCCESS)
CPU: 223ms
Aug 27 11:01:45 ukfmephy-c85002 corosync[77481]: [SERV ] Service engine unloaded: corosync watchdog service
Aug 27 11:01:46 ukfmephy-c85002 corosync[77481]: [KNET ] pmtud: Global data MTU changed to: 1397
Aug 27 11:01:46 ukfmephy-c85002 corosync[77481]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Aug 27 11:01:46 ukfmephy-c85002 corosync[77481]: [KNET ] link: Resetting MTU for link 0 because host 5 joined
Aug 27 11:01:46 ukfmephy-c85002 corosync[77481]: [KNET ] link: Resetting MTU for link 0 because host 6 joined
Aug 27 11:01:46 ukfmephy-c85002 corosync[77481]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Aug 27 11:01:46 ukfmephy-c85002 corosync[77481]: [KNET ] link: Resetting MTU for link 0 because host 4 joined
Aug 27 11:01:46 ukfmephy-c85002 corosync[77481]: [MAIN ] Corosync Cluster Engine exiting normally
Aug 27 11:01:46 ukfmephy-c85002 systemd[1]: corosync.service: Control process exited, code=exited, status=1/FAILURE
Aug 27 11:01:46 ukfmephy-c85002 systemd[1]: corosync.service: Failed with result 'exit-code'.
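If the full corosync log would help, I can attach it; I would pull it with:

# complete corosync log for the current boot
journalctl -u corosync -b --no-pager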
root@ukfmephy-c85002:~# systemctl status pve-cluster.service
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
Active: active (running) since Tue 2024-08-27 09:05:53 CEST; 2h 12min ago
Main PID: 2404 (pmxcfs)
Tasks: 6 (limit: 618650)
Memory: 48.2M
CPU: 4.322s
CGroup: /system.slice/pve-cluster.service
└─2404 /usr/bin/pmxcfs
Aug 27 11:17:48 ukfmephy-c85002 pmxcfs[2404]: [dcdb] crit: cpg_initialize failed: 2
Aug 27 11:17:48 ukfmephy-c85002 pmxcfs[2404]: [status] crit: cpg_initialize failed: 2
Aug 27 11:17:54 ukfmephy-c85002 pmxcfs[2404]: [quorum] crit: quorum_initialize failed: 2
Aug 27 11:17:54 ukfmephy-c85002 pmxcfs[2404]: [confdb] crit: cmap_initialize failed: 2
Aug 27 11:17:54 ukfmephy-c85002 pmxcfs[2404]: [dcdb] crit: cpg_initialize failed: 2
Aug 27 11:17:54 ukfmephy-c85002 pmxcfs[2404]: [status] crit: cpg_initialize failed: 2
Aug 27 11:18:00 ukfmephy-c85002 pmxcfs[2404]: [quorum] crit: quorum_initialize failed: 2
Aug 27 11:18:00 ukfmephy-c85002 pmxcfs[2404]: [confdb] crit: cmap_initialize failed: 2
Aug 27 11:18:00 ukfmephy-c85002 pmxcfs[2404]: [dcdb] crit: cpg_initialize failed: 2
Aug 27 11:18:00 ukfmephy-c85002 pmxcfs[2404]: [status] crit: cpg_initialize failed: 2
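As far as I can tell, the repeating "cpg_initialize failed: 2" lines just mean pmxcfs keeps retrying to reach the dead corosync. One idea I found while searching: stop the cluster stack and run pmxcfs in local mode, which should make /etc/pve writable again so vzdump can create its temp files. Roughly this (untested on my side; the -l flag is from the pmxcfs man page):

# stop the cluster stack on the broken node
systemctl stop pve-cluster corosync

# restart the cluster filesystem in local mode; /etc/pve becomes writable
pmxcfs -l

# then the backup should be able to create its temp config files again
vzdump 86041 --storage pbs-ds02 --mode snapshot

Is that safe here, given that the node will be destroyed and rejoined afterwards anyway?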
I have some days-old backups of all VMs, but I would really like to capture the most recent state; I do not care about the cluster node itself. Any ideas?
Thanks in advance
Kind regards from Kiel