Last update broke cluster synchronization

rahman

Renowned Member
Nov 1, 2010
Hi,

The latest update seems to have broken our cluster setup. On all nodes, cman stops working. I tried to start it manually on each node:

root@kvm45:~# /etc/init.d/cman start
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Unfencing self... [ OK ]

But it stops again after a few seconds, as you can see.
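(In case it helps with debugging: the reason cman/corosync dies again right after starting usually shows up in the logs. Something like the following, run while restarting the service, should surface it; the paths are the usual Debian/PVE 2.x defaults and may differ on other setups.)

# watch cluster-related messages in syslog while restarting cman
grep -Ei 'cman|corosync' /var/log/syslog | tail -n 50
# or, if a dedicated cluster log is configured, follow it live
tail -f /var/log/cluster/corosync.log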

root@kvm45:~# pveversion -v
pve-manager: 2.1-12 (pve-manager/2.1/be112d89)
running kernel: 2.6.32-13-pve
proxmox-ve-2.6.32: 2.1-72
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-13-pve: 2.6.32-72
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.92-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.8-1
pve-cluster: 1.0-27
qemu-server: 2.0-45
pve-firmware: 1.0-17
libpve-common-perl: 1.0-28
libpve-access-control: 1.0-24
libpve-storage-perl: 2.0-27
vncterm: 1.0-2
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.1-6
ksm-control-daemon: 1.1-1


Edit: I can open each node's web admin one by one and start the VMs from there. But in each node's web admin, the other nodes appear offline.

Edit 2: I get these errors on the nodes:
Jul 26 13:17:19 corosync [CMAN ] Activity suspended on this node
Jul 26 13:17:19 corosync [CMAN ] Error reloading the configuration, will retry every second
Jul 26 13:17:20 corosync [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Jul 26 13:17:20 corosync [CMAN ] Can't get updated config version 6: New configuration version has to be newer than current running configuration
Jul 26 13:17:20 corosync [CMAN ] Activity suspended on this node
Jul 26 13:17:20 corosync [CMAN ] Error reloading the configuration, will retry every second
Jul 26 13:17:21 corosync [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Jul 26 13:17:21 corosync [CMAN ] Can't get updated config version 6: New configuration version has to be newer than current running configuration
Jul 26 13:17:21 corosync [CMAN ] Activity suspended on this node
Jul 26 13:17:21 corosync [CMAN ] Error reloading the configuration, will retry every second



How can I fix this?
 
Yes, I read it. But I don't have any HA/fencing setup, and aptitude did not ask about replacing any config file (maybe because I don't use fencing?).

So how can I solve this issue? Should I clear the whole cluster setup and rebuild it? If so, how?

Edit: Also, it seems cman has started to work, but with the errors I posted before.

root@kvm44:~# pvecm status
Version: 6.2.0
Config Version: 4
Cluster Name: SYT-PVE-CLUSTER
Cluster Id: 62420
Cluster Member: Yes
Cluster Generation: 648
Membership state: Cluster-Member
Nodes: 3
Expected votes: 4
Total votes: 3
Node votes: 1
Quorum: 3
Active subsystems: 5
Flags: Error
Ports Bound: 0
Node name: kvm44
Node ID: 4
Multicast addresses: 239.192.243.200
Node addresses: xxx.xxx.xxx.xxx
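(Side note on the numbers above: with Expected votes: 4 the quorum threshold is floor(4/2) + 1 = 3, and the three reachable nodes contribute 3 votes in total, so the cluster is still quorate; the "Flags: Error" line presumably reflects the failed configuration reload rather than lost quorum.)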
 
I think I found the culprit: on the two nodes giving errors, pvecm status shows "Config Version: 4", but the third one shows "Config Version: 6". It seems I can't change /etc/pve/cluster.conf and fix the <cluster name="SYT-PVE-CLUSTER" config_version="4"> line with nano, as it can't write the changes. Any hints on this?
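(For anyone comparing versions the same way: a quick loop from one node can show both the running version and the file version everywhere. kvm44 and kvm45 are the names from this thread; the third node's name is a placeholder.)

for n in kvm44 kvm45 <third-node>; do   # <third-node> is a placeholder; the real name isn't given in the thread
    # running version as cman sees it, plus the version recorded in the shared config file
    ssh root@$n 'hostname; pvecm status | grep "Config Version"; grep config_version /etc/pve/cluster.conf'
done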
 
Yes, the config version mismatch is the problem, but this cannot be due to the upgrade.

You need to gain quorum so the files are writable again and you can fix it.

Try setting the expected votes to 1:

> pvecm -e 1

But find out why you got different versions; this cannot happen under normal operation.
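(Roughly, on a node that has lost write access to /etc/pve, that sequence could look like this; treat it as a sketch rather than an exact recipe.)

pvecm -e 1                     # temporarily lower expected votes so /etc/pve becomes writable again
nano /etc/pve/cluster.conf     # bump config_version to something higher than any running version
service cman restart           # pick up the corrected configuration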
 
Please copy the current version (6) to /etc/cluster/cluster.conf on all nodes with the wrong version, then restart those nodes.
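(On each node still running the old version, that step could look roughly like this; <node-with-version-6> is a placeholder for whichever node already reports Config Version: 6.)

scp root@<node-with-version-6>:/etc/cluster/cluster.conf /etc/cluster/cluster.conf
# the reply above suggests rebooting the node; restarting the cluster services
# also worked in this case, as described in the next post
service cman restart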
 
I fixed it. I was able to change the cluster.conf file on the third node. Then I ran "service cman restart" on all nodes. This fixed the "Jul 26 13:17:21 corosync [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration" errors on the two problematic nodes. Then I needed to run "service pve-cluster restart" on all nodes so the cluster was up again in the web admin.

I don't know why I got this issue. What I did was run "aptitude update && aptitude full-upgrade" and then reboot all nodes simultaneously, without waiting for each other.
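(For what it's worth, a rolling upgrade — one node at a time, waiting for each to rejoin the cluster before touching the next — avoids exactly this kind of split. The hostnames below are only illustrative.)

for n in kvm44 kvm45 <third-node>; do   # <third-node> is a placeholder
    ssh root@$n 'aptitude update && aptitude -y full-upgrade && reboot'
    # wait here until "pvecm status" on that node shows Cluster Member: Yes
    # before moving on to the next one
done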
 
