cluster and corosync finally dead

yjjoe

Member since Dec 27, 2021
Hi!
After messing up while trying to delete a node, I followed too many steps that I had picked up from the forum, and sadly I've lost control. I have 6 nodes. My "quorum" node, the node I created the datacenter on, is not even displaying the web interface anymore. Fortunately, I can still SSH in.

Now my main issue is that the cluster is broken. I can still connect to the individual IPs of my nodes, but quorum is missing and my main node is fully greyed out. The problems started when I tried to increment config_version in /etc/pve/corosync.conf. I also made a mistake at first: I incremented version instead of config_version. Since then, I've lost control.
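If I understand it correctly, the `version` key in the totem section is the config format version and must stay at 2, while `config_version` is the counter that should be bumped on every change. A sketch of what I believe the relevant part of corosync.conf should look like (cluster name and numbers are just examples):

```
totem {
  version: 2          # config format version, must remain 2
  config_version: 7   # cluster config revision; this is the one to bump
  cluster_name: mycluster
}
```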

Here is some information:

pveversion
pve-manager/7.4-16/0f39f621 (running kernel: 5.15.108-1-pve)

pvecm status
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Unable to load access control list: Connection refused

systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Thu 2023-07-13 04:35:06 EDT; 7h ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Process: 3427 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=8)
Main PID: 3427 (code=exited, status=8)
CPU: 8ms

Jul 13 04:35:06 quorum corosync[3427]: [MAIN ] Corosync Cluster Engine 3.1.7 starting up
Jul 13 04:35:06 quorum corosync[3427]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf>
Jul 13 04:35:06 quorum corosync[3427]: [MAIN ] parse error in config: This totem parser can only parse version 2 co>
Jul 13 04:35:06 quorum corosync[3427]: [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1445.
Jul 13 04:35:06 quorum systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
Jul 13 04:35:06 quorum systemd[1]: corosync.service: Failed with result 'exit-code'.
Jul 13 04:35:06 quorum systemd[1]: Failed to start Corosync Cluster Engine.
Jul 13 04:35:06 quorum systemd[1]: corosync.service: Start request repeated too quickly.
Jul 13 04:35:06 quorum systemd[1]: corosync.service: Failed with result 'exit-code'.
Jul 13 04:35:06 quorum systemd[1]: Failed to start Corosync Cluster Engine.

systemctl start corosync
Job for corosync.service failed because the control process exited with error code.
See "systemctl status corosync.service" and "journalctl -xe" for details.

journalctl -xe
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit corosync.service has begun execution.
░░
░░ The job identifier is 6885.
Jul 13 11:43:34 quorum corosync[6150]: [MAIN ] Corosync Cluster Engine 3.1.7 starting up
Jul 13 11:43:34 quorum corosync[6150]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf>
Jul 13 11:43:34 quorum corosync[6150]: [MAIN ] parse error in config: This totem parser can only parse version 2 co>
Jul 13 11:43:34 quorum corosync[6150]: [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1445.
Jul 13 11:43:34 quorum systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ An ExecStart= process belonging to unit corosync.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 8.
Jul 13 11:43:34 quorum systemd[1]: corosync.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit corosync.service has entered the 'failed' state with result 'exit-code'.
Jul 13 11:43:34 quorum systemd[1]: Failed to start Corosync Cluster Engine.
░░ Subject: A start job for unit corosync.service has failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit corosync.service has finished with a failure.
░░
░░ The job identifier is 6885 and the job result is failed.
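Based on the parse error above ("This totem parser can only parse version 2 co…"), my current plan (not applied yet, and the path is an assumption on my part) would be to reset the totem `version` key back to 2 in the node-local config copy, roughly like this:

```shell
#!/bin/sh
# Sketch: reset the totem "version" key to 2 in a given corosync.conf.
# On a real node this would be /etc/corosync/corosync.conf (the
# node-local copy corosync reads); work on a backup first.
fix_totem_version() {
    conf=$1
    cp "$conf" "$conf.bak"
    # "config_version" is left untouched: the pattern requires the line
    # to start (after indentation) with the literal key "version:".
    sed -i 's/^\([[:space:]]*\)version:.*/\1version: 2/' "$conf"
}
```

After that I would bump config_version, restart corosync, then pve-cluster, and check pvecm status again. Does that sound like the right order?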

systemctl restart pve-cluster
Job for pve-cluster.service failed because the control process exited with error code.
See "systemctl status pve-cluster.service" and "journalctl -xe" for details.



systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Thu 2023-07-13 11:49:04 EDT; 11s ago
Process: 6205 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)
CPU: 9ms

Jul 13 11:49:04 quorum systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Jul 13 11:49:04 quorum systemd[1]: Stopped The Proxmox VE cluster filesystem.
Jul 13 11:49:04 quorum systemd[1]: pve-cluster.service: Start request repeated too quickly.
Jul 13 11:49:04 quorum systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Jul 13 11:49:04 quorum systemd[1]: Failed to start The Proxmox VE cluster filesystem.


Is there any route I should take to fix this? Let me know if you need more info.
Cheers
 
I don't see how I can delete the other one. There was an issue when I posted it: the post did not appear in the forum, so I went back and posted it again, and then it worked.

Now my post is buried far down in the forum...
 
