apt dist-upgrade (minor) has broken PVE. Corosync won't start

Mark Dutton

New Member
Apr 16, 2018
I have a cluster of 3 nodes. They were all running happily as below.

root@pve2:/etc/pve# pveversion -v
proxmox-ve: 5.1-30 (running kernel: 4.13.8-3-pve)
pve-manager: 5.1-38 (running version: 5.1-38/1e9bc777)
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.13.8-3-pve: 4.13.8-30
libpve-http-server-perl: 2.0-7
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-22
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-3
pve-container: 2.0-17
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9

I upgraded one host (pve1) with apt dist-upgrade and it now looks like this.

root@pve1:/etc/apt# pveversion -v
proxmox-ve: 5.1-42 (running kernel: 4.13.16-2-pve)
pve-manager: 5.1-51 (running version: 5.1-51/96be5354)
pve-kernel-4.13: 5.1-44
pve-kernel-4.13.16-2-pve: 4.13.16-47
pve-kernel-4.13.8-3-pve: 4.13.8-30
pve-kernel-4.10.17-2-pve: 4.10.17-20
corosync: 2.4.2-pve4
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-30
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-18
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-2
lxcfs: 3.0.0-1
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-15
pve-cluster: 5.0-25
pve-container: 2.0-21
pve-docs: 5.1-17
pve-firewall: 3.0-8
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-2
pve-zsync: 1.6-15
qemu-server: 5.0-25
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.7-pve1~bpo9

The upgrade itself was error-free, but after a reboot the node will no longer join the cluster because corosync won't start.

Here is the output from journalctl -xe

-- Subject: Unit corosync.service has begun start-up
-- Defined-By: systemd
-- Support: yadayada
--
-- Unit corosync.service has begun starting up.
Apr 16 20:05:23 pve1 corosync[60658]: [MAIN ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Apr 16 20:05:23 pve1 corosync[60658]: notice [MAIN ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Apr 16 20:05:23 pve1 corosync[60658]: info [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd s
Apr 16 20:05:23 pve1 corosync[60658]: [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie
Apr 16 20:05:23 pve1 corosync[60658]: error [MAIN ] parse error in config: This totem parser can only parse version 2 configurations.
Apr 16 20:05:23 pve1 corosync[60658]: error [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1308.
Apr 16 20:05:23 pve1 corosync[60658]: [MAIN ] parse error in config: This totem parser can only parse version 2 configurations.
Apr 16 20:05:23 pve1 corosync[60658]: [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1308.
Apr 16 20:05:23 pve1 systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
Apr 16 20:05:23 pve1 systemd[1]: Failed to start Corosync Cluster Engine.
-- Subject: Unit corosync.service has failed
-- Defined-By: systemd
-- Support: yadayada
--
-- Unit corosync.service has failed.
--
-- The result is failed.
Apr 16 20:05:23 pve1 systemd[1]: corosync.service: Unit entered failed state.
Apr 16 20:05:23 pve1 systemd[1]: corosync.service: Failed with result 'exit-code'.
Apr 16 20:05:29 pve1 pmxcfs[2393]: [quorum] crit: quorum_initialize failed: 2
Apr 16 20:05:29 pve1 pmxcfs[2393]: [confdb] crit: cmap_initialize failed: 2
Apr 16 20:05:29 pve1 pmxcfs[2393]: [dcdb] crit: cpg_initialize failed: 2
Apr 16 20:05:29 pve1 pmxcfs[2393]: [status] crit: cpg_initialize failed: 2

I am lost. I don't dare upgrade the other hosts on the mere hope that it is just a version issue. I have enough capacity to run with two hosts, but not with one.

Can anyone point me in the right direction please?
 
Can you post the content of /etc/pve/corosync.conf?
 
Sure. 192.168.44.1 is pve2.

root@pve1:~# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1
    nodeid: 3
    quorum_votes: 1
    ring0_addr: pve1
  }
  node {
    name: pve2
    nodeid: 1
    quorum_votes: 1
    ring0_addr: pve2
  }
  node {
    name: pve3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: pve3
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: datamerge
  config_version: 7
  interface {
    bindnetaddr: 192.168.44.1
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 20
}
 
And from pve2 for context.

root@pve2:~# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1
    nodeid: 3
    quorum_votes: 1
    ring0_addr: pve1
  }
  node {
    name: pve2
    nodeid: 1
    quorum_votes: 1
    ring0_addr: pve2
  }
  node {
    name: pve3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: pve3
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: datamerge
  config_version: 7
  interface {
    bindnetaddr: 192.168.44.1
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 20
}
 
Yes, but it was always there on all three HVs, I think, unless it changed after the upgrade to pve1.

The timestamp on the file is from the last reboot.
-rw-r--r-- 1 root root 520 Dec 16 14:06 corosync.conf

If I change it, could it bring the other nodes down? As it is not the master, will the change propagate?

Thank you for looking.
 
Dietmar, I just changed pve1's corosync.conf to fix the version and it started immediately. However, in the web interface the other HVs don't see it as active. Should I edit the corosync files on all HVs and restart corosync?
 
Hi Mark,

I'm also watching this thread. I'm not an expert by any means, but through a bit of detective work I checked two different man pages and found this:

https://www.unix.com/man-page/centos/5/corosync.conf/
https://linux.die.net/man/5/corosync.conf

"version
This specifies the version of the configuration file. Currently the only valid
version for this directive is 2."

Perhaps you could try changing 20 to 2 on all nodes and restarting corosync (during off-peak hours). Dietmar, please confirm?
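
Roughly what I have in mind, using nothing beyond the stock systemd and corosync tools (please double-check against your own setup before running anything on a production node):

# on each node, change "version: 20" to "version: 2" in the totem section
nano /etc/corosync/corosync.conf

# restart corosync and confirm it starts and sees its ring
systemctl restart corosync
systemctl status corosync
corosync-cfgtool -s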
 
Thanks for that confirmation.

I'm obviously doing it all wrong, because I changed the version to 2 and when I restarted corosync it changed back to 20. I tried taking pve3 offline with pmxcfs -l and it rebooted, taking all the VMs with it. When it came back up it was back to version 20. I then rebooted pve1 again, which had the version set to 2 and was running corosync OK, but it wouldn't start pve-cluster. After a reboot it came up fully online and guess what? The version was 20! Argh! Now I have three hosts up (one of them updated), all with version 20 in their corosync.conf, and I am too scared to touch anything.
 
UPDATE: I rebooted pve1 again and it is back to not starting. I suspect what happens is this: I change the corosync file's version to 2; when I reboot, corosync can start, but the node then replicates corosync.conf from the other HVs (pve2 is the master) and the version changes back to 20, so the reboot after that breaks it again. I think I have proved this by running pmxcfs -l on pve1, editing /etc/corosync/corosync.conf with the correct version and then starting the services corosync, pve-cluster, pvedaemon and pveproxy. Everything starts up, the corosync.conf version changes back to 20, and corosync won't start on the next reboot.

It seems that 5.1-38 can handle this wrong version (I have no idea how it got to be 20 in the first place), but 5.1-51 can't deal with it. I just don't know how to get it to stay at 2. Do I need to make the changes on pve2, which is the master?
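
For the record, the sequence I used to test this on pve1 was roughly the following (standard PVE 5.x services plus the pmxcfs local mode; pieced together from the docs, so treat it as a sketch rather than gospel):

# stop the cluster filesystem and run it in local mode
systemctl stop pve-cluster
pmxcfs -l

# change "version: 20" to "version: 2" in the local copy
nano /etc/corosync/corosync.conf

# leave local mode and bring the normal services back up
killall pmxcfs
systemctl start pve-cluster corosync pvedaemon pveproxy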
 
Good news (for me at least): I have it all stable with the correct version, and the servers boot properly. Thank you to Dietmar and GadgetPig; your feedback led me to the right answer. Once it was up with all three nodes, I was able to edit /etc/pve/corosync.conf and the new version propagated to all hosts.
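
For anyone else hitting this, a quick way to confirm the cluster is healthy again is to check membership and quorum on each node with the standard PVE and corosync tools:

pvecm status
corosync-cfgtool -s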

Cheers
Mark
 
You need to edit /etc/pve/corosync.conf (set version to 2), and make sure you also increase config_version (else your changes get overwritten).
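
For this cluster the totem section in /etc/pve/corosync.conf would then end up looking something like this (config_version bumped from 7 to 8, everything else left as it is):

totem {
  cluster_name: datamerge
  config_version: 8
  interface {
    bindnetaddr: 192.168.44.1
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}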
 
Wonder if the OP uses vi? If you are on the "version: 2" line with the cursor at the end of the line and use the "0" command to move to the beginning of the line, not realizing you are in INSERT mode, it will append a 0 after the 2. Just a thought. Also, though his config_version: 7 doesn't seem like a likely candidate, mine is 19, and I can imagine a bad day where I am mistakenly on the version: 2 line, close to the config_version: 19 line, accidentally change the 2 to 20 while still seeing the 19, move up to the 19, bump it to 20, and not realize I have also changed the version: line.
 
All plausible, Jonathon. I am not a vi master, so I would not have used the 0 like you said; however, I just learnt a new vi command. I may well have set it to 20 instead of the config_version, as you suggested. I am just happy it is all fixed. The problem was compounded because it only showed up after the update.
 
It's possible that the config was broken quite a while before the update and it just went unnoticed (corosync only gets restarted on node reboots and on upgrades of the corosync packages, which are rather rare).
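
If you want to catch that kind of thing before a reboot does it for you, restarting corosync manually on one node at a time (while the rest of the cluster is quorate) will surface parse errors in the journal straight away, e.g.:

systemctl restart corosync
journalctl -u corosync -n 20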
 
I think you are right, Fabian. It was up for 128 days prior; a testament to the reliability of Proxmox and Linux in general.
 
