Made mistake in corosync.conf; now cannot edit

celadon-one · Oct 8, 2020

I have (had) a 3 node Proxmox VE 6.2-11 and Ceph cluster. I'm modifying my config after install and some light use. Ceph is now on its own 10Gx2 LAN. I decided to dedicate a 1Gb interface and create a VLAN for corosync and attempted to modify corosync.conf before understanding exactly what corosync does and how it works.

I would like to revert corosync.conf to a backup version or directly edit the file, but I cannot modify /etc/pve and when attempting to edit corosync.conf, I am presented with an error that it is still being edited. Indeed there is a .swp file present in /etc/pve.

I attempted to use pmxcfs -l, however that errors out, reporting it is unable to acquire a lock. I'm stumped.

As it stands I can manage each node from any node's web interface, but each of the other two nodes appears down. So if I'm on node 1, nodes 2 and 3 appear down; if I'm on node 2, nodes 1 and 3 appear down; and so on.

Any guidance on restoring order would be much appreciated.

fabian · Oct 9, 2020

IMPORTANT: don't do anything else on the nodes while doing the following steps, you are disabling safety checks that prevent bad things from happening!

since you haven't written anything about nodes actually going down/being fenced, I assume you don't have HA enabled/active

if you do, you need to stop HA services first!

Code:

# stop corosync and pmxcfs on all nodes
$ systemctl stop corosync pve-cluster

# start pmxcfs in local mode on all nodes
$ pmxcfs -l

# put correct corosync config into local pmxcfs and corosync config dir (make sure to bump the 'config_version' inside the config file)
$ cp correct_corosync.conf /etc/pve/corosync.conf
$ cp correct_corosync.conf /etc/corosync/corosync.conf

# kill local pmxcfs
$ killall pmxcfs

# start corosync and pmxcfs again
$ systemctl start pve-cluster corosync

# check status
$ journalctl --since '-5min' -u pve-cluster -u corosync
$ pvecm status
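
The "bump config_version" step can be sketched as below. This is only an illustration under stated assumptions: `bump_config_version` is a hypothetical helper, not a Proxmox tool, and the demo runs on a throwaway file with a made-up totem section rather than on the real config. It also assumes `config_version` sits on a line of its own.

```shell
#!/usr/bin/env bash
# Sketch: increment config_version in a copy of corosync.conf before deploying it.
# bump_config_version is a hypothetical helper name, not part of any Proxmox tool.
set -euo pipefail

bump_config_version() {
    local file="$1"
    # Rewrite the "config_version: N" line as "config_version: N+1",
    # keeping its indentation; every other line passes through unchanged.
    awk '
        /^[ \t]*config_version:[ \t]*[0-9]+[ \t]*$/ {
            match($0, /[0-9]+/)
            printf "%s%d\n", substr($0, 1, RSTART - 1), substr($0, RSTART, RLENGTH) + 1
            next
        }
        { print }
    ' "$file" > "$file.new"
    mv "$file.new" "$file"
}

# Demo on a temporary file with an illustrative (made-up) totem section.
demo="$(mktemp)"
cat > "$demo" <<'EOF'
totem {
  cluster_name: example
  config_version: 4
  version: 2
}
EOF

bump_config_version "$demo"
grep config_version "$demo"
```

On a real cluster the helper would be run on correct_corosync.conf before the two cp commands, and the result checked by eye before copying it anywhere.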

celadon-one · Oct 9, 2020

Thank you for your guidance. I actually did have HA enabled for a git server.
PROCESS
I put Ceph in maintenance mode...

Code:

for cmd in {norecover,norebalance,nobackfill,noout}; do ceph osd set $cmd; done

...because I've noticed that stopping pve-cluster and corosync seems to restart the node on which they were stopped. I then logged into node 1 via SSH and executed:

Code:

pvecm expected 1
ha-manager set vm:<vm number> --state disabled

I then proceeded as directed on each of the three nodes.
After a couple minutes, everything came back green.
I rebooted each node for good measure, though that was probably unnecessary.

Finally, I brought Ceph back up and let it do its deep scrub:

Code:

for cmd in {norecover,norebalance,nobackfill,noout}; do ceph osd unset $cmd; done
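
The set and unset loops in this post differ only in one word, so they can share a helper. A minimal sketch, assuming the same four flags; `ceph_flags` and `DRY_RUN` are hypothetical names, and with the default `DRY_RUN=1` the script only prints the commands it would run instead of touching the cluster.

```shell
#!/usr/bin/env bash
# Sketch: toggle the four OSD flags used in this thread with one helper.
# ceph_flags and DRY_RUN are illustrative names; DRY_RUN=1 (the default) only prints.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"
FLAGS=(norecover norebalance nobackfill noout)

ceph_flags() {
    local action="$1"   # "set" or "unset"
    case "$action" in
        set|unset) ;;
        *) echo "usage: ceph_flags set|unset" >&2; return 1 ;;
    esac
    local flag
    for flag in "${FLAGS[@]}"; do
        if [ "$DRY_RUN" = "1" ]; then
            # Show the command without running it.
            echo "ceph osd $action $flag"
        else
            ceph osd "$action" "$flag"
        fi
    done
}

ceph_flags set
```

Running it with `DRY_RUN=0` on a node with the ceph CLI would execute the commands for real; calling `ceph_flags unset` afterwards reverses them.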

STATUS
The cluster appears operational; however, one VM that autostarts cannot be stopped right away, and another that is supposed to autostart does not start on the first attempt.

The one that does not start complains that mnt-pve-iso.mount did not start. journalctl does show that the mount fails at first, but it succeeds about 11 seconds later. I emptied the media for that virtual optical drive and can now proceed without issue.
The one that does start but will not shut down complains of a timeout. If I give the cluster a little while to settle, I am able to power it down.

These are unexpected behaviors, as I was not having these problems before. In fact, the cluster as a whole feels a little less responsive. Is this expected behavior?

fabian · Oct 12, 2020

maybe mnt-pve-iso.mount is not properly tracked on bootup? pve-storage.target should depend on it; if it does not, try adding it there. otherwise the mount might happen only after the start-on-boot tasks have run.

the shutdown timeout might just be down to high load caused by bootup. could you post the exact task log and system logs from around that time?
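
One way to make pve-storage.target depend on the mount, as suggested above, is a systemd drop-in. The sketch below only prints the drop-in text; `render_dropin` is a hypothetical helper, and the choice of `Wants=` plus `After=` (rather than `Requires=`) is an assumption, as is the mount unit being known to systemd at boot.

```shell
#!/usr/bin/env bash
# Sketch: print a systemd drop-in that orders a target after a mount unit.
# render_dropin is an illustrative name; nothing is written to disk here.
set -euo pipefail

render_dropin() {
    local mount_unit="$1"
    # Wants= pulls the mount in, After= makes the target wait for it.
    cat <<EOF
[Unit]
Wants=${mount_unit}
After=${mount_unit}
EOF
}

render_dropin mnt-pve-iso.mount
```

To try it, the output could be saved as a .conf file under /etc/systemd/system/pve-storage.target.d/ and followed by `systemctl daemon-reload`.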

celadon-one · Oct 28, 2020

Just want to follow up for completeness. The issue ended up being that the Ceph storage cluster went into a degraded state because writes had been committed without rebalancing during the time the PVE cluster was down. Fortunately, after about twenty minutes, scrubbing and rebalancing completed and the cluster became responsive again.

Stepan Santalov · Dec 18, 2021

Oh, man, you nearly saved my life.

Tony · May 28, 2023

This is an old thread, but I cannot resist the urge to express my thanks to @fabian; your post literally saved my a.. ahem, my day.

Regards,
Tony

wchesley · Oct 27, 2023

Thank you for this! Really saved me after a node crashed and lost quorum with the other two nodes.

sipelaut · Jul 24, 2024

You really saved my life, sir. Many thanks.
I'd like to add one thing: after editing the corosync.conf file, you can run these commands to check the link status and have corosync reload the file:
# corosync-cfgtool -s
# corosync-cfgtool -R
Thanks again.

bonsi · Oct 9, 2024

Hi.

Recently I ran into a similar problem (because I was an idiot and one of my nodes was offline while I was changing corosync.conf).

This post was a bit hard to find (it took me ~20-30 minutes of googling), so thank you very much for the thread and the solution.

It worked quite well and my homelab is happy again

serverguy9375 · Oct 31, 2024

This should be in the docs somewhere... very helpful!

Sorry to bump an old thread but this has saved my last hair from being pulled out!!!
