Last update broke cluster synchronization

rahman

Renowned Member
Nov 1, 2010
Hi,

The latest update seems to have broken our cluster setup. On all nodes, cman stops working. I tried to start it manually on each node:

root@kvm45:~# /etc/init.d/cman start
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Unfencing self... [ OK ]

But it stops again after a few seconds, as you can see.
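(In case it helps with debugging: the reason cman/corosync dies again right after starting usually shows up in the logs. Something like the following, run while restarting the service, should surface it; the paths are the usual Debian/PVE 2.x defaults and may differ on other setups.)

# watch cluster-related messages in syslog while restarting cman
grep -Ei 'cman|corosync' /var/log/syslog | tail -n 50
# or, if a dedicated cluster log is configured, follow it live
tail -f /var/log/cluster/corosync.log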

root@kvm45:~# pveversion -v
pve-manager: 2.1-12 (pve-manager/2.1/be112d89)
running kernel: 2.6.32-13-pve
proxmox-ve-2.6.32: 2.1-72
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-13-pve: 2.6.32-72
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.92-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.8-1
pve-cluster: 1.0-27
qemu-server: 2.0-45
pve-firmware: 1.0-17
libpve-common-perl: 1.0-28
libpve-access-control: 1.0-24
libpve-storage-perl: 2.0-27
vncterm: 1.0-2
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.1-6
ksm-control-daemon: 1.1-1


Edit: I can open each node's web admin one by one and start the VMs from there. But in each node's web admin, the other nodes appear offline.

Edit 2: I get these errors on the nodes:
Jul 26 13:17:19 corosync [CMAN ] Activity suspended on this node
Jul 26 13:17:19 corosync [CMAN ] Error reloading the configuration, will retry every second
Jul 26 13:17:20 corosync [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Jul 26 13:17:20 corosync [CMAN ] Can't get updated config version 6: New configuration version has to be newer than current running configuration
Jul 26 13:17:20 corosync [CMAN ] Activity suspended on this node
Jul 26 13:17:20 corosync [CMAN ] Error reloading the configuration, will retry every second
Jul 26 13:17:21 corosync [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Jul 26 13:17:21 corosync [CMAN ] Can't get updated config version 6: New configuration version has to be newer than current running configuration
Jul 26 13:17:21 corosync [CMAN ] Activity suspended on this node
Jul 26 13:17:21 corosync [CMAN ] Error reloading the configuration, will retry every second



How can I fix this?
 
Yes, I read it. But I don't have any HA/fencing setup, and aptitude did not ask about replacing any config file (maybe because I don't use fencing?).

So how can I solve this issue? Should I clear the whole cluster setup and rebuild it? If so, how?

Edit: Also, it seems cman has started to work, but with the errors I posted before.

root@kvm44:~# pvecm status
Version: 6.2.0
Config Version: 4
Cluster Name: SYT-PVE-CLUSTER
Cluster Id: 62420
Cluster Member: Yes
Cluster Generation: 648
Membership state: Cluster-Member
Nodes: 3
Expected votes: 4
Total votes: 3
Node votes: 1
Quorum: 3
Active subsystems: 5
Flags: Error
Ports Bound: 0
Node name: kvm44
Node ID: 4
Multicast addresses: 239.192.243.200
Node addresses: xxx.xxx.xxx.xxx
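(Side note on the numbers above: with Expected votes: 4 the quorum threshold is floor(4/2) + 1 = 3, and the three reachable nodes contribute 3 votes in total, so the cluster is still quorate; the "Flags: Error" line presumably reflects the failed configuration reload rather than lost quorum.)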
 
I think I found the culprit: on the two nodes giving errors, pvecm status shows "Config Version: 4", but the third one shows "Config Version: 6". It seems I can't change /etc/pve/cluster.conf and fix the <cluster name="SYT-PVE-CLUSTER" config_version="4"> line with nano, as it can't write the changes. Any hints on this?
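(For anyone comparing versions the same way: a quick loop from one node can show both the running version and the file version everywhere. kvm44 and kvm45 are the names from this thread; the third node's name is a placeholder.)

for n in kvm44 kvm45 <third-node>; do   # <third-node> is a placeholder; the real name isn't given in the thread
    # running version as cman sees it, plus the version recorded in the shared config file
    ssh root@$n 'hostname; pvecm status | grep "Config Version"; grep config_version /etc/pve/cluster.conf'
done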
 
Yes, the config version mismatch is the problem, but this cannot be due to the upgrade.

You need to gain quorum so the files are writable again and you can fix it.

Try setting the expected votes to 1:

> pvecm -e 1

But find out why you got different versions; this cannot happen under normal operation.
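(Roughly, on a node that has lost write access to /etc/pve, that sequence could look like this; treat it as a sketch rather than an exact recipe.)

pvecm -e 1                     # temporarily lower expected votes so /etc/pve becomes writable again
nano /etc/pve/cluster.conf     # bump config_version to something higher than any running version
service cman restart           # pick up the corrected configuration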
 
Please copy the current version (6) to /etc/cluster/cluster.conf on all nodes with the wrong version, then restart those nodes.
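(On each node still running the old version, that step could look roughly like this; <node-with-version-6> is a placeholder for whichever node already reports Config Version: 6.)

scp root@<node-with-version-6>:/etc/cluster/cluster.conf /etc/cluster/cluster.conf
# the reply above suggests rebooting the node; restarting the cluster services
# also worked in this case, as described in the next post
service cman restart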
 
I fixed it. I was able to change the cluster.conf file on the third node. Then I ran "service cman restart" on all nodes. This fixed the "Jul 26 13:17:21 corosync [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration" errors on the two problematic nodes. Then I needed to run "service pve-cluster restart" on all nodes so the cluster was up again in the web admin.

I don't know why I got this issue. What I did was run "aptitude update && aptitude full-upgrade" and then reboot all nodes simultaneously, without waiting for each other.
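(For what it's worth, a rolling upgrade — one node at a time, waiting for each to rejoin the cluster before touching the next — avoids exactly this kind of split. The hostnames below are only illustrative.)

for n in kvm44 kvm45 <third-node>; do   # <third-node> is a placeholder
    ssh root@$n 'aptitude update && aptitude -y full-upgrade && reboot'
    # wait here until "pvecm status" on that node shows Cluster Member: Yes
    # before moving on to the next one
done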
 
