Problems with cluster...

bradmc

Member
Jan 16, 2013
Hello, I've been running PVE for a while now. Yesterday I updated to the latest pve-no-subscription. It's a three-node cluster, and the first two nodes upgraded with no issues. The third node seemed to upgrade fine as well, but after the reboot it initially joins the cluster, then after about five minutes the node goes red on the console, and 'pvecm' shows the node status as "Activity blocked". This node can't write to /etc/pve either, but that doesn't surprise me.

root@spuproxmox02:~# pvecm nodes
Node Sts Inc Joined Name
1 X 18292 spuproxmox01
2 M 18292 2014-09-16 10:56:38 spuproxmox02
3 X 18292 spuproxmox03

root@spuproxmox02:~# pvecm status
Version: 6.2.0
Config Version: 3
Cluster Name: SPU
Cluster Id: 577
Cluster Member: Yes
Cluster Generation: 18296
Membership state: Cluster-Member
Nodes: 1
Expected votes: 3
Total votes: 1
Node votes: 1
Quorum: 2 Activity blocked
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: spuproxmox02
Node ID: 2
Multicast addresses: 239.192.2.67
Node addresses: 156.74.237.69
root@spuproxmox02:~#
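For what it's worth, the status output shows only 1 vote out of 3 expected, and quorum needs 2, which matches /etc/pve going read-only. My understanding is that the expected vote count can be lowered temporarily to get write access back (risky, so it should be reverted as soon as the cluster is healthy again):

Code:
# temporary workaround only - tells cman a single vote is enough for quorum
pvecm expected 1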

There are active VMs running on this node and they seem fine, although the web console no longer shows them. All the VMs are on local storage, so they can't easily be moved to another host.

Any ideas what the issue is? Any fixes? Thanks.

Brad
 
Hi,
try
Code:
/etc/init.d/cman restart
/etc/init.d/pve-cluster restart
Also check "pvecm status" on the other nodes.

Udo
 
Thanks for the reply. I did try what you suggested, but it didn't work. I have rebooted the node, and when it comes up it does join the cluster and shows green on the PVE console. However, it eventually drops out, goes red, and I find myself in the same situation. I'm tempted to move the VMs off the server, rebuild it, and re-add it to the cluster.
 
I had the exact same problem: 3 nodes clustered on version 3.2, upgraded to 3.3, and the cluster failed after 5 minutes or so.
Since the servers are (still) not in production, I reinstalled all three of them with version 3.3. Result: exactly the same behavior (5 minutes with the cluster OK, then failure).
(The syslog fills up with lines like "pmblade1 corosync[4147]: [TOTEM ] Retransmit List: 3b3 3b4 3b5 3c1 3c3 3c4 3aa 3ab 3ac 3ad 3ae 3af 3b0 3bb 3bc 3bd 3be 3bf 399 39a 39b 39c 39d 39e 39f 3a5 3a6 3a7 3a8 3a9" and so on, until it fails.)

The hardware is an IBM Bladecenter H with 3 blades (HS23). I've configured 3 ports of the SCM module with LACP enabled. The three links go to a Cisco 3750 with etherchannel configured. IGMP snooping is disabled.
I did a test and disabled LACP in the SCM modules (and deleted the channel groups on the Cisco, so it's only using one link per module, no link aggregation), and it also fails.
The question is: what has changed? Why did it work on 3.2? Any clue?
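In case it helps, these are the IOS commands I plan to use to check the switch side (from memory, so the exact syntax may differ between IOS releases):

Code:
show etherchannel summary
show ip igmp snooping
show ip igmp snooping groups
show spanning-tree summary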

Thanks very much,
 
I did a fresh install from the proxmox 3.3 ISO...the network card is detected automatically

Update: connecting the IBM chassis SCM to an unmanaged switch (an old 10/100 D-Link) solves the problem, so the problem lies in the Cisco 3750. I'll keep looking on Monday.

Thanks,
Javier
 
I rebuilt the problem server with a fresh install of 3.1 and then an upgrade to pve-no-subscription, which is what the other two nodes are running. I join the node to the cluster and, still, after about five minutes it drops out of the cluster with the same "Retransmit List" issue described above. All three nodes are connected to the same Cisco switch. I'll attach the corosync.log to this post. The other two nodes are fine; still in the cluster and working normally.

Everything was fine on 3.2. The servers are HP DL360s.


Sep 19 12:47:42 corosync [TOTEM ] Retransmit List: 5f0
Sep 19 12:47:42 corosync [TOTEM ] Retransmit List: 5f1
Sep 19 12:47:42 corosync [TOTEM ] Retransmit List: 5f0
Sep 19 12:47:42 corosync [TOTEM ] Retransmit List: 5f1
Sep 19 12:47:42 corosync [TOTEM ] Retransmit List: 5f0
Sep 19 12:47:42 corosync [TOTEM ] Retransmit List: 5f1
Sep 19 12:47:42 corosync [TOTEM ] Retransmit List: 5f0
Sep 19 12:47:42 corosync [TOTEM ] Retransmit List: 5f1
Sep 19 12:47:42 corosync [TOTEM ] Retransmit List: 5f0
Sep 19 12:47:42 corosync [TOTEM ] Retransmit List: 5f1
Sep 19 12:47:42 corosync [TOTEM ] Retransmit List: 5f0
Sep 19 12:47:42 corosync [TOTEM ] Retransmit List: 5f1
Sep 19 12:47:42 corosync [TOTEM ] Retransmit List: 5f0
Sep 19 12:47:42 corosync [TOTEM ] Retransmit List: 5f1
Sep 19 12:47:42 corosync [TOTEM ] Retransmit List: 5f0
Sep 19 12:47:43 corosync [TOTEM ] Retransmit List: 5f1
Sep 19 12:47:43 corosync [TOTEM ] Retransmit List: 5f0
Sep 19 12:47:43 corosync [TOTEM ] Retransmit List: 5f1
Sep 19 12:47:43 corosync [TOTEM ] Retransmit List: 5f0
Sep 19 12:47:44 corosync [TOTEM ] Retransmit List: 5f1
Sep 19 12:47:44 corosync [TOTEM ] Retransmit List: 5f0
Sep 19 12:47:44 corosync [TOTEM ] Retransmit List: 5e8 5f1
Sep 19 12:47:44 corosync [TOTEM ] FAILED TO RECEIVE
Sep 19 12:47:56 corosync [CLM ] CLM CONFIGURATION CHANGE
Sep 19 12:47:56 corosync [CLM ] New Configuration:
Sep 19 12:47:56 corosync [CLM ] r(0) ip(156.74.237.69)
Sep 19 12:47:56 corosync [CLM ] Members Left:
Sep 19 12:47:56 corosync [CLM ] r(0) ip(156.74.237.68)
Sep 19 12:47:56 corosync [CLM ] r(0) ip(156.74.237.70)
Sep 19 12:47:56 corosync [CLM ] Members Joined:
Sep 19 12:47:56 corosync [QUORUM] Members[2]: 2 3
Sep 19 12:47:56 corosync [CMAN ] quorum lost, blocking activity
Sep 19 12:47:56 corosync [QUORUM] This node is within the non-primary component and will NOT provide any services.
Sep 19 12:47:56 corosync [QUORUM] Members[1]: 2
Sep 19 12:47:56 corosync [CLM ] CLM CONFIGURATION CHANGE
Sep 19 12:47:56 corosync [CLM ] New Configuration:
Sep 19 12:47:56 corosync [CLM ] r(0) ip(156.74.237.69)
Sep 19 12:47:56 corosync [CLM ] Members Left:
Sep 19 12:47:56 corosync [CLM ] Members Joined:
Sep 19 12:47:56 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 19 12:47:56 corosync [CPG ] chosen downlist: sender r(0) ip(156.74.237.69) ; members(old:3 left:2)
Sep 19 12:47:56 corosync [MAIN ] Completed service synchronization, ready to provide service.
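If it helps, I can also run a multicast test between the three nodes with omping (the tool the Proxmox wiki points at for this); omping needs to be installed first (apt-get install omping), and these are our actual node names:

Code:
# run simultaneously on all three nodes; heavy multicast loss here would point
# at the switch / IGMP snooping rather than at corosync itself
omping -c 600 -i 1 -q spuproxmox01 spuproxmox02 spuproxmox03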
 
.... All three nodes are connected to the same Cisco switch....

bradmc,
I bet the Cisco is the issue; it's my exact problem. If you can, do the test I did: connect the servers to an unmanaged switch, or try another switch. I think it could be something related to broadcast storm control or some other feature of the Cisco. I'll continue studying the problem on Monday and I'll keep you informed.

Regards,
Javier
 
Hello! Did you try to disable IGMP snooping on the Cisco switch?

conf t
no ip igmp snooping
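If you'd rather keep IGMP snooping enabled, enabling an IGMP snooping querier may also help; on most Catalyst IOS releases (from memory, so please double-check for your version) that would be:

Code:
conf t
ip igmp snooping querier
end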
 
Hello, I'm having a problem with an unmanaged switch as well, and I'd like to find more information about unmanaged switches in this forum. There is a lot about unmanaged switches on the internet, but please help me with my problem too and reply in this thread.
 
I'll chime in to say that one of our Proxmox sites is a 3-node DL360G6 that we recently (3 days ago) updated to v3.3. These servers were upgraded from v3.1. There were no issues with the upgrade or after rebooting; the cluster reconnected correctly and has been stable since upgrading. We're using a Cisco 3560-X for that cluster with default IGMP / MTU / STP settings.
 

Hello, sirtech! Does your heartbeat between nodes go through vmbr, or does it use a physical interface? Did you try using ethX/bondX for the heartbeat? This may be important because vmbr is a bridge: it acts as a software switch for the tap devices connected to it, and it has its own IGMP snooping settings just like any other network switch. (I suppose so; correct me if I'm wrong.)
 
On that cluster, the hosts are communicating through the software bridges. Yes, bridges have their own L2 settings.

This is the default IGMP snooping setting for bridges in Proxmox:

root@X# cat /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping
1
root@X#
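If anyone wants to rule the bridge out, snooping on vmbr0 can be switched off for a test. This is just a sketch (I haven't needed it on that cluster), so adjust the bridge name to your setup:

Code:
# one-off test, does not survive a reboot:
echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping

# to make it persistent, add this line under the vmbr0 stanza in
# /etc/network/interfaces:
#   post-up echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping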
 
My cluster was running just fine on the latest 3.2 pve-no-subscription. I upgraded node 3 first, which went just fine; it joined the cluster after the upgrade and stayed in the cluster. Next I upgraded node 1, which also went fine. I waited several days, then upgraded node 2. I rebooted it and it initially joined the cluster, then after five minutes it dropped out. I will say that nodes 2 and 3 are on the same physical switch, while node 1 is on a different switch connected to a different router. However, this network config has been in place for a long time and the cluster had no issues with it prior to the upgrade. I am running the 3.10-4 kernel on all three nodes.
 
So, guys! I also decided to update my 3 nodes from 3.2 to 3.3, and the update went fine. I've been monitoring these 3 nodes for about an hour and everything seems OK. BUT I don't use vmbr for the heartbeat; I use physical interfaces.
 
How do you change the heartbeat from VMBR to physical?

auto lo
iface lo inet loopback

iface eth0 inet manual

iface eth1 inet manual

iface eth2 inet manual

iface eth3 inet manual

auto bond0
iface bond0 inet static
        address 192.168.8.2
        netmask 255.255.255.0
        gateway 192.168.8.254
        slaves eth0 eth1
        bond_miimon 100
        bond_mode 802.3ad

auto bond1
iface bond1 inet manual
        slaves eth2 eth3
        bond_miimon 100
        bond_mode 802.3ad

auto vmbr0
iface vmbr0 inet manual
        bridge_ports bond1
        bridge_stp off
        bridge_fd 0



# netstat -gn
IPv6/IPv4 Group Memberships
Interface RefCnt Group
--------------- ------ ---------------------
lo 1 224.0.0.1
eth0 1 224.0.0.1
eth1 1 224.0.0.1
eth2 1 224.0.0.1
eth3 1 224.0.0.1
bond0 1 239.192.64.244
bond0 1 224.0.0.1
bond1 1 224.0.0.1
vmbr0 1 224.0.0.1
venet0 1 224.0.0.1

As you can see, the physical interface bond0 is subscribed to send/receive the corosync multicast group (239.192.64.244). bond1 is mapped to vmbr0 and is used for VMs with any VLAN ID tag from the GUI. So bond0 is for corosync (and it doesn't show up in the VM creation wizard), while bond1, through vmbr0, is for the VMs and is connected to a trunk port-channel. I use a Cisco 4948 L3 switch for the link aggregation.
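One more detail, as I understand it (so please double-check): on PVE 3.x cman/corosync binds to whatever address the node name in cluster.conf resolves to, so the hostname entry in /etc/hosts has to point at the bond0 address. A hypothetical example, using my bond0 address and a placeholder hostname:

Code:
# /etc/hosts - "node1" is a placeholder for the real node name
192.168.8.2    node1.example.local node1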
 
More than a day has passed since the update :) Everything is still fine. So I guess your problem lies in IGMP snooping on vmbr.
 
