Problem upgrading 3 -> 4: corosync!!

Francois Legrand

Hi,
I had an 8-node Proxmox v3.4 cluster (named LPNHE-CLUSTER).
To migrate to v4, I did the following (see the command sketch below):
- migrate all VMs off nodes 1 and 2
- shut down nodes 1 and 2 and reinstall them with v4, under new names and IPs
- create a new cluster (with a name different from the old v3 cluster, i.e. the new name is LPNHE)
- move VMs from the old cluster to the new one
and so on for the other nodes.
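(For reference, a rough sketch of the commands such a per-node move boils down to; node names, IPs, VM IDs and storage paths below are placeholders, not our real ones.)
Code:
# on the first reinstalled v4 node: create the new cluster
pvecm create LPNHE
# on every further reinstalled v4 node: join it to the new cluster
pvecm add IP_OF_FIRST_NEW_NODE
# moving a KVM guest between the clusters: dump it on the old side,
# copy the archive over, then restore it on the new side
vzdump 100 --dumpdir /mnt/backups --mode stop
qmrestore /mnt/backups/vzdump-qemu-100-XXXX.vma.lzo 100 --storage local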
So far so good... but when I reached the last node in the old cluster, I had the surprise that immediately after I stopped it, the new cluster went down (more precisely, the nodes started to leave the new cluster and I lost quorum). I can see all the nodes in red in the interface (except the one whose web page I am connected to).
If I turn the last server of the old cluster back on, the new one recovers (corosync says the new nodes joined again, I regain quorum and everything is ok).
I cannot figure out what is going on!
Any clue?
Thanks
F.

PS: Here are my confs

### New cluster ###
>more /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: newnode1
    nodeid: 3
    quorum_votes: 1
    ring0_addr: newnode1
  }

  node {
    name: newnode2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: newnode2
  }
  .
  .
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: LPNHE
  config_version: 6
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: ip-from-newnode1
    ringnumber: 0
  }
}

### Old cluster ###
>more /etc/pve/cluster.conf
<?xml version="1.0"?>
<cluster config_version="80" name="LPNHE-CLUSTER">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <quorumd votes="1" allow_kill="0" interval="1" label="proxmox_quorum_disk" tko="10"/>
  <totem token="154000"/>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" ipaddr="xxx.xxx.xxx.xxx" lanplus="1" login="XXXX" name="fencenode1" passwd="XXXX" power_wait="5"/>
    <fencedevice agent="fence_ipmilan" ipaddr="xxx.xxx.xxx.xxx" lanplus="1" login="XXXX" name="fencenode2" passwd="XXXX" power_wait="5"/>
    .
    .
  </fencedevices>
  <clusternodes>
    <clusternode name="oldnode1" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="fencenode1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="oldnode2" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device name="fencenode2"/>
        </method>
      </fence>
    </clusternode>
    .
    .
  </clusternodes>
  <rm/>
</cluster>
 

Normally (!) the corosync from PVE 3.x and earlier and the corosync from PVE 4.x and newer are not even able to talk to each other, so what you describe is really strange, to say the least...

Does a
Code:
service cman stop
trigger this already?

Can you please give me the outputs from:

On the remaining old node:
Code:
cman_tool nodes -a
pvecm status

On one of the new nodes:
Code:
pvecm status

And the logs regarding corosync would be nice, maybe from the node where the new cluster was created:
Code:
journalctl -u corosync

And did you really do a full wipe? (Just to be sure.)
 
Hi,
Thanks for your answer.
Last night I already tried to stop all the services on the old cluster (service cman stop, service pve-manager stop, etc...) and it didn't trigger anything on the new cluster!
For now, I still keep 2 nodes up in the old cluster (to have some redundancy until I figure out what is going on).


************ OLD CLUSTER **************
# cman_tool nodes -a
Node Sts Inc Joined Name
0 M 0 2017-02-27 20:56:58 /dev/block/8:17
2 X 0 node2
3 X 0 node3
4 M 22600 2017-02-27 20:56:45 node12
Addresses: xxx.xxx.xxx.xxx
5 M 22604 2017-02-27 20:56:47 node5
Addresses: yyy.yyy.yyy.yyy
6 X 0 node6
7 X 0 node11
9 X 0 node9
10 X 0 node10

# pvecm status
Version: 6.2.0
Config Version: 80
Cluster Name: LPNHE-CLUSTER
Cluster Id: 34772
Cluster Member: Yes
Cluster Generation: 22604
Membership state: Cluster-Member
Nodes: 2
Expected votes: 8
Quorum device votes: 1
Total votes: 3
Node votes: 1
Quorum: 5 Activity blocked
Active subsystems: 4
Flags:
Ports Bound: 0 178
Node name: node12
Node ID: 4
Multicast addresses: 239.192.135.92
Node addresses: xxx.xxx.xxx.xxx


************ NEW CLUSTER ******************
# pvecm status
Quorum information
------------------
Date: Tue Feb 28 10:58:03 2017
Quorum provider: corosync_votequorum
Nodes: 6
Node ID: 0x00000003
Ring ID: 2/3360
Quorate: Yes

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 6
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 aaa.aaa.aaa.aaa
0x00000001 1 bbb.bbb.bbb.bbb
0x00000003 1 ccc.ccc.ccc.ccc (local)
0x00000005 1 ddd.ddd.ddd.ddd
0x00000006 1 eee.eee.eee.eee
0x00000004 1 fff.fff.fff.fff

And here is what happened (in the new cluster) yesterday after I shut down the last machine in the old cluster and then turned it back on:
# journalctl -u corosync
.....
Feb 27 14:17:34 node115 corosync[2026]: [QUORUM] Members[3]: 1 5 4
Feb 27 14:17:34 node115 corosync[2026]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 27 14:17:39 node115 corosync[2026]: [TOTEM ] A new membership (bbb.bbb.bbb.bbb:3348) was formed. Members
Feb 27 14:17:39 node115 corosync[2026]: [QUORUM] Members[3]: 1 5 4
Feb 27 14:17:39 node115 corosync[2026]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 27 14:17:48 node115 corosync[2026]: [TOTEM ] A new membership (bbb.bbb.bbb.bbb:3352) was formed. Members left: 5 4
Feb 27 14:17:48 node115 corosync[2026]: [TOTEM ] Failed to receive the leave message. failed: 5 4
Feb 27 14:17:48 node115 corosync[2026]: [QUORUM] Members[1]: 1
Feb 27 14:17:48 node115 corosync[2026]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 27 14:26:29 node115 corosync[2026]: [TOTEM ] A new membership (aaa.aaa.aaa.aaa:3356) was formed. Members joined: 2 3 5 6
Feb 27 14:26:29 node115 corosync[2026]: [TOTEM ] A new membership (aaa.aaa.aaa.aaa:3360) was formed. Members joined: 4
Feb 27 14:26:29 node115 corosync[2026]: [QUORUM] This node is within the primary component and will provide service.
Feb 27 14:26:29 node115 corosync[2026]: [QUORUM] Members[6]: 2 1 3 5 6 4
Feb 27 14:26:29 node115 corosync[2026]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 27 14:26:30 node115 corosync[2026]: [TOTEM ] Retransmit List: ec ed ee ef f0 f1 f2 f3 f5 f6


I will try to unplug the network cable of the old machines to see if that triggers the problem (instead of shutting them down).
 

Looking at your cman_tool output, I would remove the already upgraded nodes from the old cluster here; if you (hypothetically) went back to the old cluster you would have to reinstall and re-add them anyway:
Code:
pvecm expected 1
pvecm delnode NODENAME
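Afterwards, a quick check on the remaining old node (just the commands already used above, nothing new) should show only the nodes that really still exist:
Code:
cman_tool nodes -a
pvecm status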


I suspect the network, or the network in combination with the fact that the old cluster still has those nodes configured.
Do you have IGMP snooping active on the switches? If so, there may be no IGMP querier active, or the old cluster gives wrong information to the querier and so the multicast group membership (as seen by the switch) is not correct.
You could try to disable IGMP snooping temporarily; if that fixes the problem I'd say you're safe to upgrade the other nodes too and re-add them. It's a bit of a shot in the dark, as my experience with corosync 1.x (the one from PVE 3.4 and earlier) is a bit limited. A few checks along those lines are sketched below.
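(A sketch only, assuming the cluster traffic runs over the bridge vmbr0 and that omping is installed on all new nodes; the node names are placeholders.)
Code:
# on each PVE node: is the Linux bridge snooping IGMP,
# and is it acting as an IGMP querier itself? (1 = yes, 0 = no)
cat /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping
cat /sys/devices/virtual/net/vmbr0/bridge/multicast_querier

# quick multicast test between the new nodes (run it on all of them in parallel)
omping -c 600 -i 1 -q newnode1 newnode2 newnode3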
 
I shut down one of the two remaining machines in the old cluster and then unplugged the network cable of the last one.
The new cluster went wrong:

# pvecm status
Quorum information
------------------
Date: Tue Feb 28 12:26:49 2017
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 1/3476
Quorate: No

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 2
Quorum: 4 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 bbb.bbb.bbb.bbb (local)
0x00000005 1 fff.fff.fff.fff

I plugged the network cable of the old machine back in and everything came back:

# pvecm status
Quorum information
------------------
Date: Tue Feb 28 12:27:43 2017
Quorum provider: corosync_votequorum
Nodes: 6
Node ID: 0x00000001
Ring ID: 2/3512
Quorate: Yes

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 6
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 aaa.aaa.aaa.aaa
0x00000001 1 bbb.bbb.bbb.bbb (local)
0x00000003 1 ccc.ccc.ccc.ccc
0x00000005 1 ddd.ddd.ddd.ddd
0x00000006 1 eee.eee.eee.eee
0x00000004 1 fff.fff.fff.fff

# journalctl -u corosync
Feb 27 14:26:29 node115 corosync[2026]: [QUORUM] This node is within the primary component and will provide service.
Feb 27 14:26:29 node115 corosync[2026]: [QUORUM] Members[6]: 2 1 3 5 6 4
Feb 27 14:26:29 node115 corosync[2026]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 27 14:26:30 node115 corosync[2026]: [TOTEM ] Retransmit List: ec ed ee ef f0 f1 f2 f3 f5 f6
Feb 28 12:20:51 node115 corosync[2026]: [TOTEM ] Retransmit List: d237a d237b
Feb 28 12:20:51 node115 corosync[2026]: [TOTEM ] Retransmit List: d237a d237b d237d
Feb 28 12:20:51 node115 corosync[2026]: [TOTEM ] Retransmit List: d237a d237b d237d d237e
...
...
Feb 28 12:24:23 node115 corosync[2026]: [TOTEM ] Retransmit List: cfa cec cee cef cf0 cf1 cf2 cf3 cf5 d0b d0c d0d d0e d0f d10 d11 d12
Feb 28 12:24:23 node115 corosync[2026]: [TOTEM ] Retransmit List: cf1 cf2 cf3 cf5 cfa ce9 cea ceb cec d0b d0c d0d d0e d0f d10 d11 d12
Feb 28 12:24:27 node115 corosync[2026]: [TOTEM ] A processor failed, forming new configuration.
Feb 28 12:24:28 node115 corosync[2026]: [TOTEM ] A new membership (bbb.bbb.bbb.bbb:3376) was formed. Members left: 2 3 6 4
Feb 28 12:24:28 node115 corosync[2026]: [TOTEM ] Failed to receive the leave message. failed: 2 3 6 4
Feb 28 12:24:32 node115 corosync[2026]: [TOTEM ] A new membership (bbb.bbb.bbb.bbb:3380) was formed. Members
Feb 28 12:24:32 node115 corosync[2026]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Feb 28 12:24:32 node115 corosync[2026]: [QUORUM] Members[2]: 1 5
Feb 28 12:24:32 node115 corosync[2026]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 28 12:24:38 node115 corosync[2026]: [TOTEM ] A new membership (bbb.bbb.bbb.bbb:3384) was formed. Members
Feb 28 12:24:38 node115 corosync[2026]: [QUORUM] Members[2]: 1 5
Feb 28 12:24:38 node115 corosync[2026]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 28 12:24:45 node115 corosync[2026]: [TOTEM ] A new membership (bbb.bbb.bbb.bbb:3388) was formed. Members
Feb 28 12:24:45 node115 corosync[2026]: [QUORUM] Members[2]: 1 5
Feb 28 12:24:45 node115 corosync[2026]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 28 12:24:49 node115 corosync[2026]: [TOTEM ] A new membership (bbb.bbb.bbb.bbb:3392) was formed. Members
Feb 28 12:24:49 node115 corosync[2026]: [QUORUM] Members[2]: 1 5
Feb 28 12:24:49 node115 corosync[2026]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 28 12:24:54 node115 corosync[2026]: [TOTEM ] A new membership (bbb.bbb.bbb.bbb:3396) was formed. Members
.....
Feb 28 12:27:09 node115 corosync[2026]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 28 12:27:14 node115 corosync[2026]: [TOTEM ] A new membership (bbb.bbb.bbb.bbb:3500) was formed. Members
Feb 28 12:27:14 node115 corosync[2026]: [QUORUM] Members[2]: 1 5
Feb 28 12:27:14 node115 corosync[2026]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 28 12:27:25 node115 corosync[2026]: [TOTEM ] A new membership (bbb.bbb.bbb.bbb:3508) was formed. Members joined: 5 4 left: 5
Feb 28 12:27:25 node115 corosync[2026]: [TOTEM ] Failed to receive the leave message. failed: 5
Feb 28 12:27:25 node115 corosync[2026]: [TOTEM ] A new membership (aaa.aaa.aaa.aaa:3512) was formed. Members joined: 2 3 6
Feb 28 12:27:25 node115 corosync[2026]: [CPG ] downlist left_list: 0 received in state 0
Feb 28 12:27:25 node115 corosync[2026]: [CPG ] downlist left_list: 0 received in state 0
Feb 28 12:27:25 node115 corosync[2026]: [QUORUM] This node is within the primary component and will provide service.
Feb 28 12:27:25 node115 corosync[2026]: [QUORUM] Members[6]: 2 1 3 5 6 4
Feb 28 12:27:25 node115 corosync[2026]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 28 12:27:25 node115 corosync[2026]: [TOTEM ] Retransmit List: c3 c4 c5 c6 c7 c8 c9 ca
 
Hi,
I tried what you suggested (pvecm delnode nodename for all the machines already reinstalled in the new cluster)... but it didn't solve the problem! After a while the new cluster went down!
 
I got it!!!!
On our old cluster, all nodes were configured as IGMP querier (/sys/devices/virtual/net/vmbr0/bridge/multicast_querier was set to 1).
But in the new cluster it was set to 0 (so no querier). I still have to check, but I am pretty convinced that no IGMP querier is activated on our switches. Thus, as soon as the last node of the old cluster was down, there was no IGMP querier left on the network, and the communication between the nodes of the new cluster stopped after a while.
I still have one last question:
Is it normal that the multicast querier is not enabled by default on a new cluster install?
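(In case it helps someone else, one way to make that setting persistent on a node is a post-up line on the cluster bridge in /etc/network/interfaces, roughly as sketched below; the addresses and bridge port are placeholders, not our real ones. The cleaner alternative is of course to enable an IGMP querier on the switches themselves.)
Code:
auto vmbr0
iface vmbr0 inet static
        address xxx.xxx.xxx.xxx
        netmask 255.255.255.0
        gateway xxx.xxx.xxx.xxx
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0
        # re-enable the bridge's IGMP querier at every ifup/boot
        post-up echo 1 > /sys/devices/virtual/net/vmbr0/bridge/multicast_querier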
 
