Made mistake in corosync.conf; now cannot edit

celadon-one

New Member
Oct 8, 2020
10
2
3
I have (had) a 3-node Proxmox VE 6.2-11 and Ceph cluster. I'm modifying my config after install and some light use. Ceph is now on its own 10Gx2 LAN. I decided to dedicate a 1Gb interface and create a VLAN for corosync, and attempted to modify corosync.conf before understanding exactly what corosync does and how it works.

I would like to revert corosync.conf to a backup version or directly edit the file, but I cannot modify /etc/pve and when attempting to edit corosync.conf, I am presented with an error that it is still being edited. Indeed there is a .swp file present in /etc/pve.

I attempted to use pmxcfs -l, however that errors out, reporting it is unable to acquire a lock. I'm stumped.

As it stands, I can manage each node from its own web interface, but the other two nodes always appear down. So if I'm on node 1, nodes 2 and 3 appear down. If I'm on node 2, nodes 1 and 3 appear down. Etc.

Any guidance on restoring order would be much appreciated.
 

fabian

Proxmox Staff Member
Staff member
Jan 7, 2016
7,463
1,392
164
IMPORTANT: don't do anything else on the nodes while doing the following steps, you are disabling safety checks that prevent bad things from happening!

since you haven't written anything about nodes actually going down/being fenced, I assume you don't have HA enabled/active ;) if you do, you need to stop HA services first!

Code:
# stop corosync and pmxcfs on all nodes
$ systemctl stop corosync pve-cluster

# start pmxcfs in local mode on all nodes
$ pmxcfs -l

# put correct corosync config into local pmxcfs and corosync config dir (make sure to bump the 'config_version' inside the config file)
$ cp correct_corosync.conf /etc/pve/corosync.conf
$ cp correct_corosync.conf /etc/corosync/corosync.conf

# kill local pmxcfs
$ killall pmxcfs

# start corosync and pmxcfs again
$ systemctl start pve-cluster corosync

# check status
$ journalctl --since '-5min' -u pve-cluster -u corosync
$ pvecm status
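
The 'config_version' bump mentioned above can be done by hand in an editor. As a minimal sketch, assuming the file contains a single `config_version: N` line in its totem section, a small helper could do it instead. The function name `bump_config_version` and the file names in the usage note are hypothetical, not part of any Proxmox tooling:

```shell
#!/bin/sh
# Hypothetical helper: print FILE to stdout with config_version incremented by one.
# It does not modify FILE; redirect the output to create the corrected copy.
bump_config_version() {
    awk '
        /^[ \t]*config_version:[ \t]*[0-9]+[ \t]*$/ {
            num = $0
            sub(/^[^0-9]*/, "", num)          # keep only the number (and trailing blanks)
            prefix = $0
            sub(/[0-9]+[ \t]*$/, "", prefix)  # keep everything before the number
            print prefix (num + 1)
            next
        }
        { print }
    ' "$1"
}
```

For example, `bump_config_version corosync.conf.bak > correct_corosync.conf` would produce the file that is copied in the steps above. Note that the new value has to be higher than the version the cluster last saw, so if the backup is older than the broken config, a single increment may not be enough and the number should be checked by hand.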
 

celadon-one

Thank you for your guidance. I actually did have HA enabled for a git server.
PROCESS
I put Ceph in maintenance mode...
Code:
for cmd in {norecover,norebalance,nobackfill,noout}; do ceph osd set $cmd; done
...because I've noticed that stopping pve-cluster and corosync seems to restart the node on which they were stopped. I then logged into node 1 via SSH and executed:
Code:
pvecm expected 1
ha-manager set vm:<vm number> --state disabled
I then proceeded as directed on each of the three nodes.
After a couple minutes, everything came back green.
I rebooted each node for good measure, though that was probably unnecessary.

Finally, I brought Ceph back up and let it do its deep scrub:
Code:
for cmd in {norecover,norebalance,nobackfill,noout}; do ceph osd unset $cmd; done
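
The two loops above differ only in `set` versus `unset`, so they can be wrapped in one function to avoid mistyping a flag in one of them. This is only a sketch: the function name `ceph_maintenance` and the `CEPH_CMD` override (for dry runs) are made up for illustration, while the four flags and the `ceph osd set` / `ceph osd unset` commands are the ones used in this post.

```shell
#!/bin/sh
# The same four OSD flags used in the loops above.
CEPH_FLAGS="norecover norebalance nobackfill noout"

# Hypothetical wrapper: ceph_maintenance set|unset
# CEPH_CMD is an assumed override, e.g. CEPH_CMD="echo ceph" for a dry run.
ceph_maintenance() {
    action=$1
    case "$action" in
        set|unset) ;;
        *) echo "usage: ceph_maintenance set|unset" >&2; return 2 ;;
    esac
    for flag in $CEPH_FLAGS; do
        # Left unquoted on purpose so "echo ceph" splits into two words.
        ${CEPH_CMD:-ceph} osd "$action" "$flag" || return 1
    done
}
```

Running `ceph_maintenance set` before the corosync work and `ceph_maintenance unset` afterwards would be equivalent to the two one-liners.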

STATUS
The cluster appears operational, however, I have one VM that autostarts which I cannot stop right away, and one which is supposed to autostart but does not on the first attempt.
  • The one that does not start complains of mnt-pve-iso.mount not starting. Journalctl does show that it fails to start at first, but it does successfully start about 11 seconds later. I emptied the media for that virtual optical drive, and can now proceed without issue.
  • The one that does start, but will not shut down complains of a timeout. If I give the cluster a little bit to settle, I am able to power it down.
These are unexpected behaviors, as I was not having these problems before. In fact, the cluster as a whole feels a little less responsive. Is this expected behavior?
 

fabian


maybe mnt-pve-iso.mount is not properly tracked on bootup? pve-storage.target should depend on it; if it does not, try adding it there. otherwise the mount might only happen after the start-on-boot guests have already been started..
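
One possible way to add that dependency is a systemd drop-in for pve-storage.target. The following is a rough sketch under assumptions, not a confirmed fix: the helper name `write_storage_dropin`, the drop-in file name `iso-mount.conf`, and the optional directory argument are all hypothetical, and whether `Wants=` plus `After=` is the intended form of "depend on it" is my reading of the suggestion above.

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in that makes pve-storage.target pull in
# and order itself after the ISO mount unit.
# $1 is an assumed optional base directory, so the file can be generated
# somewhere harmless for review first; it defaults to /etc/systemd/system.
write_storage_dropin() {
    unit_dir=${1:-/etc/systemd/system}
    dropin_dir="$unit_dir/pve-storage.target.d"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/iso-mount.conf" <<'EOF'
[Unit]
Wants=mnt-pve-iso.mount
After=mnt-pve-iso.mount
EOF
}
```

After writing the file into the real unit directory, `systemctl daemon-reload` makes systemd pick it up, and `systemctl list-dependencies pve-storage.target` should then list the mount unit.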

the shutdown might just be high load caused by boot up, could you post the exact task log and system logs from around that time?
 

celadon-one

Just want to follow up for completeness. The issue ended up being that the Ceph storage cluster went into a degraded state because writes had been committed without rebalancing during the time the PVE cluster was down. Fortunately, after about twenty minutes, scrubbing and rebalancing completed and the cluster became responsive again.
 
Aug 31, 2017
9
1
23
44
Moscow
postgrespro.ru
Oh, man, you nearly saved my life.
 