Cluster stopped working after upgrade from 2.0 to 2.1 and 2.2

piccardi

New Member
Oct 20, 2012
Hi, I had a working cluster configuration with Proxmox 2.0: all four nodes were running, online, and with rgmanager active.
After upgrading one node to 2.1 the cluster stopped working. So I upgraded all the nodes, but the best I could get was all nodes showing Online while rgmanager was never active; the HA configuration no longer worked and the VMs could not be started.

I tried to solve the problem by upgrading everything to 2.2, but now I have an even worse situation where 3 nodes are online and the 4th shows as offline:

Code:
root@lama2:~# clustat 
Cluster Status for SiwebCluster @ Tue Nov 20 16:02:43 2012
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 lama1                                       1 Online
 lama2                                       2 Online, Local
 lama9                                       3 Online
 lama10                                      4 Offline

Code:
root@lama10:~# clustat 
Could not connect to CMAN: Connection refused

On that node CMAN is offline and the log files are full of messages like:

Code:
Nov 20 15:47:50 lama10 pmxcfs[2529]: [status] crit: cpg_send_message failed: 9
Nov 20 15:47:50 lama10 pmxcfs[2529]: [status] crit: cpg_send_message failed: 9
Nov 20 15:47:50 lama10 pmxcfs[2529]: [status] crit: cpg_send_message failed: 9
Nov 20 15:47:50 lama10 pmxcfs[2529]: [status] crit: cpg_send_message failed: 9
Nov 20 15:47:50 lama10 pmxcfs[2529]: [status] crit: cpg_send_message failed: 9

I tried to restart the pve* services with:

Code:
/etc/init.d/pvestatd restart
/etc/init.d/pve-cluster restart

but nothing changed apart from some more messages:

Code:
Nov 20 15:48:08 lama10 pvestatd[3271]: server closing
Nov 20 15:48:08 lama10 pvestatd[26301]: starting server
Nov 20 15:48:18 lama10 pmxcfs[2529]: [status] crit: cpg_send_message failed: 9
Nov 20 15:48:18 lama10 pmxcfs[2529]: [status] crit: cpg_send_message failed: 9
...
Nov 20 15:48:18 lama10 pmxcfs[2529]: [status] crit: cpg_send_message failed: 9
Nov 20 15:48:18 lama10 pmxcfs[2529]: [status] crit: cpg_send_message failed: 9
Nov 20 15:48:19 lama10 pmxcfs[2529]: [main] notice: teardown filesystem
Nov 20 15:48:28 lama10 pvestatd[26301]: WARNING: ipcc_send_rec failed: Transport endpoint is not connected
Nov 20 15:48:28 lama10 pvestatd[26301]: WARNING: ipcc_send_rec failed: Connection refused
Nov 20 15:48:28 lama10 pvestatd[26301]: WARNING: ipcc_send_rec failed: Connection refused
Nov 20 15:48:28 lama10 pvestatd[26301]: WARNING: ipcc_send_rec failed: Connection refused
Nov 20 15:48:28 lama10 pvestatd[26301]: WARNING: ipcc_send_rec failed: Connection refused
Nov 20 15:48:28 lama10 pvestatd[26301]: WARNING: ipcc_send_rec failed: Connection refused
Nov 20 15:48:31 lama10 pmxcfs[26319]: [quorum] crit: quorum_initialize failed: 6
Nov 20 15:48:31 lama10 pmxcfs[26319]: [quorum] crit: can't initialize service
Nov 20 15:48:31 lama10 pmxcfs[26319]: [confdb] crit: confdb_initialize failed: 6
Nov 20 15:48:31 lama10 pmxcfs[26319]: [quorum] crit: can't initialize service
Nov 20 15:48:31 lama10 pmxcfs[26319]: [dcdb] crit: cpg_initialize failed: 6
Nov 20 15:48:31 lama10 pmxcfs[26319]: [quorum] crit: can't initialize service
Nov 20 15:48:31 lama10 pmxcfs[26319]: [dcdb] crit: cpg_initialize failed: 6
Nov 20 15:48:31 lama10 pmxcfs[26319]: [quorum] crit: can't initialize service
Nov 20 15:48:37 lama10 pmxcfs[26319]: [quorum] crit: quorum_initialize failed: 6
Nov 20 15:48:37 lama10 pmxcfs[26319]: [confdb] crit: confdb_initialize failed: 6
Nov 20 15:48:37 lama10 pmxcfs[26319]: [dcdb] crit: cpg_initialize failed: 6
Nov 20 15:48:37 lama10 pmxcfs[26319]: [dcdb] crit: cpg_initialize failed: 6
Nov 20 15:48:38 lama10 pmxcfs[26319]: [status] crit: cpg_send_message failed: 9
Nov 20 15:48:38 lama10 pmxcfs[26319]: [status] crit: cpg_send_message failed: 9
Nov 20 15:48:38 lama10 pmxcfs[26319]: [status] crit: cpg_send_message failed: 9
Nov 20 15:48:38 lama10 pmxcfs[26319]: [status] crit: cpg_send_message failed: 9

I'm going to reboot that node to see what happens, but I'd like to know if there are any guidelines on the steps to take to recover a failed node.

Regards
Simone
 
You should start cman if it is not running

# service cman start

If that is successful do

# service pve-cluster restart
# service pvestatd restart

and finally

# service rgmanager start
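
Once those are up you can check membership and quorum (a quick sanity check with the standard cman/pve tools):

# cman_tool nodes
# cman_tool status
# pvecm status
# clustat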
 
I checked everything suggested above. I also had a multicast problem. Both issues are now solved: fencing is working (tested with fence_ipmilan) and multicast is working (tested with ssmpingd and asmping), but the cluster is still not working.

If I restart a node (by rebooting it or restarting cman), most of the time I can't get quorum (I get a timeout instead). Restarting cman also starts a storm of totem retransmit messages and a lot of corosync and pmxcfs errors. Then, most of the time, another node is kicked out of the cluster and cman is stopped.

The situation was even worse the only time I did get all nodes in the cluster: restarting cman on one of them (which did not have rgmanager running) caused the same errors and led to the fencing of another node.
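
When the retransmit storm starts, one thing worth checking on the affected node is the totem ring state as corosync sees it (a diagnostic sketch, assuming the corosync 1.x tools shipped with Proxmox 2.x):

Code:
corosync-cfgtool -s           # status of the totem ring(s)
corosync-objctl | grep totem  # dump the runtime totem counters/settings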

I'm going to delete the nodes and reinstall the cluster from scratch.

Simone
 
It took me some time because I could no longer use all 4 blades and had to start putting the new hardware to use. So now a single blade is running standalone with some VMs, and I'm using the other 3 blades to test the cluster. I carefully removed and purged all packages, and cleaned all directories with remnants of the previous installation:

Code:
rm -fR /etc/cluster/ /var/log/cluster /var/lib/cluster /etc/pve/ \
/usr/share/fence /var/lib/pve-manager /var/lib/pve-cluster/
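
For reference, the purge step was roughly of this form (the package list is indicative, not a verbatim record of what I ran):

Code:
apt-get purge --auto-remove pve-manager pve-cluster corosync-pve \
    redhat-cluster-pve resource-agents-pve fence-agents-pve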

My network config is the following:

Code:
root@lama9:~# ifconfig 
bond0     Link encap:Ethernet  HWaddr 04:7d:7b:f1:39:28  
          inet6 addr: fe80::67d:7bff:fef1:3928/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:44813502 errors:164 dropped:0 overruns:0 frame:0
          TX packets:165133563 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:20885997862 (19.4 GiB)  TX bytes:223670282817 (208.3 GiB)

bond1     Link encap:Ethernet  HWaddr 04:7d:7b:f1:39:2a  
          inet addr:172.16.25.109  Bcast:172.16.25.255  Mask:255.255.255.0
          inet6 addr: fe80::67d:7bff:fef1:392a/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:5196211 errors:0 dropped:0 overruns:0 frame:0
          TX packets:241691 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:582008767 (555.0 MiB)  TX bytes:147799002 (140.9 MiB)

eth0      Link encap:Ethernet  HWaddr 04:7d:7b:f1:39:28  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:42441219 errors:79 dropped:0 overruns:0 frame:0
          TX packets:165133563 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:20720269118 (19.2 GiB)  TX bytes:223670282817 (208.3 GiB)

eth1      Link encap:Ethernet  HWaddr 04:7d:7b:f1:39:28  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:2372283 errors:85 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:165728744 (158.0 MiB)  TX bytes:0 (0.0 B)

eth2      Link encap:Ethernet  HWaddr 04:7d:7b:f1:39:2a  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:2823929 errors:0 dropped:0 overruns:0 frame:0
          TX packets:241691 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:416280083 (396.9 MiB)  TX bytes:147799002 (140.9 MiB)

eth3      Link encap:Ethernet  HWaddr 04:7d:7b:f1:39:2a  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:2372282 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:165728684 (158.0 MiB)  TX bytes:0 (0.0 B)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:368743 errors:0 dropped:0 overruns:0 frame:0
          TX packets:368743 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:61854490 (58.9 MiB)  TX bytes:61854490 (58.9 MiB)

venet0    Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  
          inet6 addr: fe80::1/128 Scope:Link
          UP BROADCAST POINTOPOINT RUNNING NOARP  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:3 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

vmbr0     Link encap:Ethernet  HWaddr 04:7d:7b:f1:39:28  
          inet addr:192.168.250.109  Bcast:192.168.251.255  Mask:255.255.254.0
          inet6 addr: fe80::67d:7bff:fef1:3928/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2984626 errors:0 dropped:0 overruns:0 frame:0
          TX packets:395532 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:236789773 (225.8 MiB)  TX bytes:109248994 (104.1 MiB)

I'm using bond1 and 172.16.25.0/24 for multicast, and it seems to work:
Code:
root@lama9:~# asmping 239.192.236.33 172.16.25.102
asmping joined (S,G) = (*,239.192.236.234)
pinging 172.16.25.102 from 172.16.25.109
multicast from 172.16.25.102, seq=1 dist=0 time=0.237 ms
  unicast from 172.16.25.102, seq=1 dist=0 time=0.857 ms
  unicast from 172.16.25.102, seq=2 dist=0 time=0.193 ms
multicast from 172.16.25.102, seq=2 dist=0 time=0.220 ms
  unicast from 172.16.25.102, seq=3 dist=0 time=0.203 ms
multicast from 172.16.25.102, seq=3 dist=0 time=0.231 ms
and:
Code:
root@lama2:~# ssmpingd
received request from 172.16.25.109
received request from 172.16.25.109
received request from 172.16.25.109
received request from 172.16.25.109
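
As an extra cross-check, omping (not installed by default, so this is just a sketch) can verify multicast between all the test nodes in one go when run simultaneously on each of them:

Code:
# run at the same time on every node, listing all cluster members
omping -c 20 -i 1 172.16.25.102 172.16.25.109   # plus the third test node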

I started creating the cluster:
Code:
root@lama2:~# pvecm create Cluster
Restarting pve cluster filesystem: pve-cluster[dcdb] notice: wrote new cluster config '/etc/cluster/cluster.conf'
.
Starting cluster: 
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... [  OK  ]
   Starting fenced... [  OK  ]
   Starting dlm_controld... [  OK  ]
   Tuning DLM kernel config... [  OK  ]
   Unfencing self... [  OK  ]
root@lama2:~#
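
Before joining the other nodes it is worth confirming that this single-node cluster is quorate on its own (standard checks, using the same tools already shown above):

Code:
pvecm status
cman_tool nodes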

When I added the second node I got:

Code:
root@lama9:~# pvecm add 172.16.25.102
root@172.16.25.102's password: 
copy corosync auth key
stopping pve-cluster service
Stopping pve cluster filesystem: pve-cluster.
backup old database
Starting pve cluster filesystem : pve-clustercan't create shared ssh key database '/etc/pve/priv/authorized_keys'
.
Starting cluster: 
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... [  OK  ]
   Starting fenced... [  OK  ]
   Starting dlm_controld... [  OK  ]
   Tuning DLM kernel config... [  OK  ]
   Unfencing self... fence_node: cannot connect to cman
[FAILED]
waiting for quorum...
and it kept waiting for minutes.
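
While it sits there, the state on the joining node can be inspected with the standard cman/fence tools (a diagnostic sketch):

Code:
service cman status   # is corosync/cman actually running?
cman_tool status      # local view of quorum and votes
cman_tool nodes       # which members this node currently sees
fence_tool ls         # state of the fence domain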

Logs are filled with:
Code:
Dec  6 15:37:20 lama9 pmxcfs[245031]: [quorum] crit: quorum_initialize failed: 6
Dec  6 15:37:20 lama9 pmxcfs[245031]: [confdb] crit: confdb_initialize failed: 6
Dec  6 15:37:20 lama9 pmxcfs[245031]: [dcdb] crit: cpg_initialize failed: 6
Dec  6 15:37:20 lama9 pmxcfs[245031]: [dcdb] crit: cpg_initialize failed: 6
Dec  6 15:37:24 lama9 pmxcfs[245031]: [status] crit: cpg_send_message failed: 9
Dec  6 15:37:24 lama9 pmxcfs[245031]: [status] crit: cpg_send_message failed: 9
Dec  6 15:37:24 lama9 pmxcfs[245031]: [status] crit: cpg_send_message failed: 9
Dec  6 15:37:24 lama9 pmxcfs[245031]: [status] crit: cpg_send_message failed: 9
Dec  6 15:37:24 lama9 pmxcfs[245031]: [status] crit: cpg_send_message failed: 9
Dec  6 15:37:24 lama9 pmxcfs[245031]: [status] crit: cpg_send_message failed: 9
Dec  6 15:37:26 lama9 pmxcfs[245031]: [quorum] crit: quorum_initialize failed: 6
Dec  6 15:37:26 lama9 pmxcfs[245031]: [confdb] crit: confdb_initialize failed: 6
Dec  6 15:37:26 lama9 pmxcfs[245031]: [dcdb] crit: cpg_initialize failed: 6
Dec  6 15:37:26 lama9 pmxcfs[245031]: [dcdb] crit: cpg_initialize failed: 6
Dec  6 15:37:32 lama9 pmxcfs[245031]: [quorum] crit: quorum_initialize failed: 6
Dec  6 15:37:32 lama9 pmxcfs[245031]: [confdb] crit: confdb_initialize failed: 6
Dec  6 15:37:32 lama9 pmxcfs[245031]: [dcdb] crit: cpg_initialize failed: 6
Dec  6 15:37:32 lama9 pmxcfs[245031]: [dcdb] crit: cpg_initialize failed: 6
and:
Code:
Dec  6 15:21:54 lama2 corosync[35840]:   [QUORUM] Members[1]: 1
Dec  6 15:21:54 lama2 pmxcfs[35761]: [status] notice: update cluster info (cluster name  SiwebCluster, version = 2)
Dec  6 15:21:57 lama2 corosync[35840]:   [CLM   ] CLM CONFIGURATION CHANGE
Dec  6 15:21:57 lama2 corosync[35840]:   [CLM   ] New Configuration:
Dec  6 15:21:57 lama2 corosync[35840]:   [CLM   ] #011r(0) ip(172.16.25.102) 
Dec  6 15:21:57 lama2 corosync[35840]:   [CLM   ] Members Left:
Dec  6 15:21:57 lama2 corosync[35840]:   [CLM   ] Members Joined:
Dec  6 15:21:57 lama2 corosync[35840]:   [CLM   ] CLM CONFIGURATION CHANGE
Dec  6 15:21:57 lama2 corosync[35840]:   [CLM   ] New Configuration:
Dec  6 15:21:57 lama2 corosync[35840]:   [CLM   ] #011r(0) ip(172.16.25.102) 
Dec  6 15:21:57 lama2 corosync[35840]:   [CLM   ] #011r(0) ip(172.16.25.109) 
Dec  6 15:21:57 lama2 corosync[35840]:   [CLM   ] Members Left:
Dec  6 15:21:57 lama2 corosync[35840]:   [CLM   ] Members Joined:
Dec  6 15:21:57 lama2 corosync[35840]:   [CLM   ] #011r(0) ip(172.16.25.109) 
Dec  6 15:21:57 lama2 corosync[35840]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec  6 15:21:57 lama2 corosync[35840]:   [QUORUM] Members[2]: 1 2
Dec  6 15:21:57 lama2 corosync[35840]:   [QUORUM] Members[2]: 1 2
Dec  6 15:21:57 lama2 corosync[35840]:   [QUORUM] Members[2]: 1 2
Dec  6 15:21:57 lama2 corosync[35840]:   [CPG   ] chosen downlist: sender r(0) ip(172.16.25.102) ; members(old:1 left:0)
Dec  6 15:21:57 lama2 corosync[35840]:   [MAIN  ] Completed service synchronization, ready to provide service.
Dec  6 15:22:11 lama2 corosync[35840]:   [TOTEM ] A processor failed, forming new configuration.
Dec  6 15:22:13 lama2 corosync[35840]:   [CLM   ] CLM CONFIGURATION CHANGE
Dec  6 15:22:13 lama2 corosync[35840]:   [CLM   ] New Configuration:
Dec  6 15:22:13 lama2 corosync[35840]:   [CLM   ] #011r(0) ip(172.16.25.102) 
Dec  6 15:22:13 lama2 corosync[35840]:   [CLM   ] Members Left:
Dec  6 15:22:13 lama2 corosync[35840]:   [CLM   ] #011r(0) ip(172.16.25.109) 
Dec  6 15:22:13 lama2 corosync[35840]:   [CLM   ] Members Joined:
Dec  6 15:22:13 lama2 corosync[35840]:   [CMAN  ] quorum lost, blocking activity
Dec  6 15:22:13 lama2 corosync[35840]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Dec  6 15:22:13 lama2 pmxcfs[35761]: [status] notice: node lost quorum
Dec  6 15:22:13 lama2 corosync[35840]:   [QUORUM] Members[1]: 1
Dec  6 15:22:13 lama2 corosync[35840]:   [CLM   ] CLM CONFIGURATION CHANGE
Dec  6 15:22:13 lama2 corosync[35840]:   [CLM   ] New Configuration:
Dec  6 15:22:13 lama2 corosync[35840]:   [CLM   ] #011r(0) ip(172.16.25.102) 
Dec  6 15:22:13 lama2 corosync[35840]:   [CLM   ] Members Left:
Dec  6 15:22:13 lama2 corosync[35840]:   [CLM   ] Members Joined:
Dec  6 15:22:13 lama2 corosync[35840]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec  6 15:22:13 lama2 kernel: dlm: closing connection to node 2
Dec  6 15:22:13 lama2 corosync[35840]:   [CPG   ] chosen downlist: sender r(0) ip(172.16.25.102) ; members(old:2 left:1)
Dec  6 15:22:13 lama2 corosync[35840]:   [MAIN  ] Completed service synchronization, ready to provide service.
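
Given that in the log above the two-node membership only survives for about 15 seconds before a processor is declared failed, one more thing that can be checked is whether the hosts keep their multicast group membership on bond1 (a sketch; both commands just list the joined multicast groups):

Code:
netstat -g                # multicast group memberships per interface
ip maddr show dev bond1   # same information via iproute2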

So up to now I have found no way to get a working cluster.

Simone