apt dist-upgrade (minor) has broken PVE. Corosync won't start

Mark Dutton

New Member
Apr 16, 2018
I have a cluster of 3 nodes. They were all running happily as below.

root@pve2:/etc/pve# pveversion -v
proxmox-ve: 5.1-30 (running kernel: 4.13.8-3-pve)
pve-manager: 5.1-38 (running version: 5.1-38/1e9bc777)
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.13.8-3-pve: 4.13.8-30
libpve-http-server-perl: 2.0-7
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-22
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-3
pve-container: 2.0-17
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9

I upgraded one host (pve1) with apt dist-upgrade and it now looks like this.

root@pve1:/etc/apt# pveversion -v
proxmox-ve: 5.1-42 (running kernel: 4.13.16-2-pve)
pve-manager: 5.1-51 (running version: 5.1-51/96be5354)
pve-kernel-4.13: 5.1-44
pve-kernel-4.13.16-2-pve: 4.13.16-47
pve-kernel-4.13.8-3-pve: 4.13.8-30
pve-kernel-4.10.17-2-pve: 4.10.17-20
corosync: 2.4.2-pve4
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-30
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-18
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-2
lxcfs: 3.0.0-1
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-15
pve-cluster: 5.0-25
pve-container: 2.0-21
pve-docs: 5.1-17
pve-firewall: 3.0-8
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-2
pve-zsync: 1.6-15
qemu-server: 5.0-25
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.7-pve1~bpo9

The process was error free, but after a reboot, it will no longer join the cluster because corosync won't start.

Here is the output from journalctl -xe

-- Subject: Unit corosync.service has begun start-up
-- Defined-By: systemd
-- Support: yadayada
--
-- Unit corosync.service has begun starting up.
Apr 16 20:05:23 pve1 corosync[60658]: [MAIN ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Apr 16 20:05:23 pve1 corosync[60658]: notice [MAIN ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Apr 16 20:05:23 pve1 corosync[60658]: info [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd s
Apr 16 20:05:23 pve1 corosync[60658]: [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie
Apr 16 20:05:23 pve1 corosync[60658]: error [MAIN ] parse error in config: This totem parser can only parse version 2 configurations.
Apr 16 20:05:23 pve1 corosync[60658]: error [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1308.
Apr 16 20:05:23 pve1 corosync[60658]: [MAIN ] parse error in config: This totem parser can only parse version 2 configurations.
Apr 16 20:05:23 pve1 corosync[60658]: [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1308.
Apr 16 20:05:23 pve1 systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
Apr 16 20:05:23 pve1 systemd[1]: Failed to start Corosync Cluster Engine.
-- Subject: Unit corosync.service has failed
-- Defined-By: systemd
-- Support: yadayada
--
-- Unit corosync.service has failed.
--
-- The result is failed.
Apr 16 20:05:23 pve1 systemd[1]: corosync.service: Unit entered failed state.
Apr 16 20:05:23 pve1 systemd[1]: corosync.service: Failed with result 'exit-code'.
Apr 16 20:05:29 pve1 pmxcfs[2393]: [quorum] crit: quorum_initialize failed: 2
Apr 16 20:05:29 pve1 pmxcfs[2393]: [confdb] crit: cmap_initialize failed: 2
Apr 16 20:05:29 pve1 pmxcfs[2393]: [dcdb] crit: cpg_initialize failed: 2
Apr 16 20:05:29 pve1 pmxcfs[2393]: [status] crit: cpg_initialize failed: 2

I am lost. I don't dare upgrade the other hosts in the hope that it is just a version issue. I have enough capacity to run with 2 hosts, but not 1.

Can anyone point me in the right direction please?
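The decisive lines in the journal excerpt above are the two error-level corosync messages. As a minimal sketch (the function name `corosync_errors` and the regular expression are illustrative assumptions, not part of any Proxmox or corosync tool), the excerpt can be filtered down to those lines like this:

```python
import re

# Matches journal lines of the form
# "<timestamp> <host> corosync[<pid>]: error [<subsystem> ] <message>"
ERROR_LINE = re.compile(r"corosync\[\d+\]:\s+error\s+\[(\w+)\s*\]\s+(.*)$")

def corosync_errors(journal_text):
    """Return (subsystem, message) pairs for error-level corosync lines."""
    found = []
    for line in journal_text.splitlines():
        match = ERROR_LINE.search(line)
        if match:
            found.append((match.group(1), match.group(2).strip()))
    return found

# Four lines copied from the journal output in this post.
sample = """\
Apr 16 20:05:23 pve1 corosync[60658]: notice [MAIN ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Apr 16 20:05:23 pve1 corosync[60658]: error [MAIN ] parse error in config: This totem parser can only parse version 2 configurations.
Apr 16 20:05:23 pve1 corosync[60658]: error [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1308.
Apr 16 20:05:23 pve1 systemd[1]: Failed to start Corosync Cluster Engine.
"""

for subsystem, message in corosync_errors(sample):
    print(subsystem, "->", message)
```

The first hit is the parse error about "version 2 configurations", which points at the config file rather than at the upgraded packages.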
 

dcsapak

Proxmox Staff Member
Feb 1, 2016
Can you post the content of /etc/pve/corosync.conf?
 

Mark Dutton

Sure. 192.168.44.1 is pve2

root@pve1:~# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}
nodelist {
  node {
    name: pve1
    nodeid: 3
    quorum_votes: 1
    ring0_addr: pve1
  }
  node {
    name: pve2
    nodeid: 1
    quorum_votes: 1
    ring0_addr: pve2
  }
  node {
    name: pve3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: pve3
  }
}
quorum {
  provider: corosync_votequorum
}
totem {
  cluster_name: datamerge
  config_version: 7
  interface {
    bindnetaddr: 192.168.44.1
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 20
}
 

Mark Dutton

And from pve2 for context.

root@pve2:~# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}
nodelist {
  node {
    name: pve1
    nodeid: 3
    quorum_votes: 1
    ring0_addr: pve1
  }
  node {
    name: pve2
    nodeid: 1
    quorum_votes: 1
    ring0_addr: pve2
  }
  node {
    name: pve3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: pve3
  }
}
quorum {
  provider: corosync_votequorum
}
totem {
  cluster_name: datamerge
  config_version: 7
  interface {
    bindnetaddr: 192.168.44.1
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 20
}
 

Mark Dutton

Yes but it was always there on all three HVs I think, unless it changed after the upgrade to pve1.

The timestamp on the file is from the last reboot.
-rw-r--r-- 1 root root 520 Dec 16 14:06 corosync.conf

If I change it, could it bring the other nodes down? As pve1 is not the master, will the change propagate?

Thank you for looking.
 

Mark Dutton

Dietmar, I just changed corosync.conf on pve1 to fix the version and corosync started immediately. However, in the web interface the other HVs don't see it as active. Should I edit the corosync files on all HVs and restart corosync?
 

GadgetPig

Member
Apr 26, 2016
Hi Mark,

I'm also watching this thread. I'm not an expert by any means, but through some detective work I checked two different man pages and found this:

https://www.unix.com/man-page/centos/5/corosync.conf/
https://linux.die.net/man/5/corosync.conf

"version
This specifies the version of the configuration file. Currently the only valid
version for this directive is 2."

Perhaps you could try changing 20 to 2 on all nodes and restarting corosync (during off-peak hours). Dietmar, please confirm?
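For anyone who wants to check a config before restarting corosync, here is a rough sketch of reading the version directive out of the totem section. It assumes the simple "section { key: value }" layout seen in the files posted above. The helper names `totem_settings` and `check_totem_version` are made up for this example and are not part of corosync or Proxmox.

```python
def totem_settings(conf_text):
    """Collect the key/value pairs that sit directly inside the totem section."""
    stack = []      # names of the sections currently open
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("{"):
            stack.append(line[:-1].strip())
        elif line == "}":
            stack.pop()
        elif stack == ["totem"] and ":" in line:
            key, value = line.split(":", 1)
            settings[key.strip()] = value.strip()
    return settings

def check_totem_version(conf_text):
    """Return None if totem.version is 2, otherwise a short note on what was found."""
    version = totem_settings(conf_text).get("version")
    if version is None:
        return "totem.version is not set"
    if version != "2":
        return f"totem.version is {version}, expected 2"
    return None

# The totem section as posted earlier in this thread.
broken = """\
totem {
  cluster_name: datamerge
  config_version: 7
  interface {
    bindnetaddr: 192.168.44.1
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 20
}
"""

print(check_totem_version(broken))
print(check_totem_version(broken.replace("version: 20", "version: 2")))
```

The check only looks at keys directly inside totem, so the ringnumber and bindnetaddr entries of the nested interface block are ignored.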
 

Mark Dutton

Thanks for that confirmation.

I'm obviously doing it all wrong, because when I change the version to 2 and restart corosync, it changes back to 20. I tried taking pve3 offline with pmxcfs -l and it rebooted, taking all the VMs with it. When it came back up, the version was back to 20. I then rebooted pve1 again, which had the version set to 2 and was running corosync OK but wouldn't start pve-cluster. After the reboot it came up fully online and guess what? The version was 20! Argh! Now I have three hosts up, one of them updated, all with version 20 in their corosync.conf, and I am too scared to touch anything.
 

Mark Dutton

UPDATE: I rebooted pve1 again and it is back to not starting. I suspect that what happens is this: I change the version in the corosync file to 2. On the next reboot corosync can start, but the node then replicates corosync.conf from the other HVs (pve2 is the master) and the version changes back to 20. On the reboot after that it is broken again. I think I have proved this by running pmxcfs -l on pve1, editing /etc/corosync/corosync.conf with the correct version, and then starting the services corosync, pve-cluster, pvedaemon and pveproxy. It all starts up, the version in corosync.conf changes to 20, and corosync won't start after the next reboot.

Seems that 5.1-38 can handle this wrong version (I have no idea how it got to be 20 in the first place), but 5.1-51 can't deal with it. I just don't know how to get it to stay at 2. Do I need to do the changes on pve2 which is the master?
 

Mark Dutton

Good news (for me at least). I have it all stable with the correct version. Servers boot properly. Thank you to Dietmar and GadgetPig. Your feedback led me to the right answer. When it was up with all three nodes I was able to edit /etc/pve/corosync.conf and the new version propagated to all hosts.

Cheers
Mark
 

dietmar

Proxmox Staff Member
Apr 28, 2005
You need to edit /etc/pve/corosync.conf (set version to 2), and make sure you also increase config_version (otherwise your changes get overwritten).
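The two changes described here (version set to 2, config_version increased) can be sketched as a pure text transformation. This is only an illustration under the assumption that the file uses the simple layout posted earlier in the thread. The function name `fix_totem` is hypothetical, and the sketch works on a string and never touches /etc/pve/corosync.conf itself.

```python
import re

def fix_totem(conf_text):
    """Return conf_text with totem.version set to 2 and totem.config_version raised by 1."""
    out = []
    stack = []  # names of the sections currently open
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.endswith("{"):
            stack.append(line[:-1].strip())
        elif line == "}":
            stack.pop()
        elif stack == ["totem"]:
            # Only rewrite keys that sit directly inside the totem section.
            match = re.match(r"^(\s*)(version|config_version)\s*:\s*(\d+)\s*$", raw)
            if match:
                indent, key, value = match.groups()
                new_value = 2 if key == "version" else int(value) + 1
                raw = f"{indent}{key}: {new_value}"
        out.append(raw)
    return "\n".join(out) + "\n"

# The totem section as posted earlier in this thread.
broken = """\
totem {
  cluster_name: datamerge
  config_version: 7
  interface {
    bindnetaddr: 192.168.44.1
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 20
}
"""

print(fix_totem(broken))
```

Applied to the config from this thread, config_version goes from 7 to 8 and version from 20 to 2. Everything else, including indentation, is left as it was.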
 
Mar 17, 2018
Wonder if OP uses vi? If you are on the version: 2 line with the cursor at the end of the line and use the "0" command to move to the beginning of the line, not realizing you are in INSERT mode, it appends a 0 after the 2.

Also, though his config_version: 7 doesn't seem like a likely candidate, mine is 19, and I can imagine a bad day where I was mistakenly on the version: 2 line, close to the config_version: 19 line, accidentally changed the 2 to 20 while still seeing the 19, then moved up to the 19 and bumped it to 20 without realizing I had made the change on the version: line as well. Just a thought.
 

Mark Dutton

All plausible, Jonathon. I am not a vi master, so I would not have used the 0 like you said. However, I just learnt a new vi command. I may have set version to 20 instead of config_version, as you said. I am just happy it is all fixed. The problem was compounded because it only showed up after the update.
 

fabian

Proxmox Staff Member
Jan 7, 2016
It's possible that the config was broken quite a while before the update and it just went unnoticed (corosync only gets restarted on node reboot and on upgrades of the corosync packages, which are rather rare).
 

Mark Dutton

I think you are right, Fabian. It was up for 128 days prior, a testament to the reliability of Proxmox and Linux in general.
 
