Permanently losing quorum

Dmitry Panoff

New Member
Nov 7, 2012
Donetsk, DPR
Hi, all

I have 3 servers joined in a cluster. Each server has 2 NICs, joined into bond0 on each server. The servers are connected to a Nortel BayStack 5510 switch, and the two ports from each server are joined into a Multilink Trunk. The switch allows multicast.
Usually, once or twice a week, one or two nodes lose quorum. It happens during daytime work, during night backups, and on idle weekends alike - I can find no pattern in the quorum loss at all.
The hardware is OK (2 Intel servers, 1 Dell PowerEdge server), the network is OK (no errors in the switch log), and there are no power failures. The date is set from a local NTP server and is the same on all nodes.
When quorum is lost, the first thing I do is restart the services in this order:

Code:
service pvestatd stop
service pvedaemon stop
service cman stop
service pve-cluster stop

sleep 2

service pve-cluster start
service cman start
service pvestatd start
service pvedaemon start

Sometimes this helps and the nodes regain quorum, and the cluster works until the next quorum failure.
But usually the only thing that works is rebooting the problem nodes. During boot they get quorum and the cluster becomes ready.

:~# pvecm status
Code:
Version: 6.2.0
Config Version: 6
Cluster Name: sdpi
Cluster Id: 1649
Cluster Member: Yes
Cluster Generation: 17280
Membership state: Cluster-Member
Nodes: 1
Expected votes: 2
Total votes: 1
Node votes: 1
Quorum: 2 Activity blocked
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: virt3
Node ID: 3
Multicast addresses: 239.192.6.119
Node addresses: 192.168.0.213

There are no errors in the logs - quorum is simply lost suddenly, and the cluster is "re-created" with just one node:

corosync.log:
Code:
...
Nov 08 08:41:58 corosync [TOTEM ] Retransmit List: 2f713 2f715 2f716 2f717 2f718 2f719 2f6f8 2f70a 2f70b 2f6
Nov 08 08:42:08 corosync [TOTEM ] A processor failed, forming new configuration.
Nov 08 08:42:20 corosync [CLM   ] CLM CONFIGURATION CHANGE
Nov 08 08:42:20 corosync [CLM   ] New Configuration:
Nov 08 08:42:20 corosync [CLM   ] <---->r(0) ip(192.168.0.213).
Nov 08 08:42:20 corosync [CLM   ] Members Left:
Nov 08 08:42:20 corosync [CLM   ] <---->r(0) ip(192.168.0.211).
Nov 08 08:42:20 corosync [CLM   ] <---->r(0) ip(192.168.0.212).
Nov 08 08:42:20 corosync [CLM   ] Members Joined:
Nov 08 08:42:20 corosync [QUORUM] Members[2]: 2 3
Nov 08 08:42:20 corosync [CMAN  ] quorum lost, blocking activity
Nov 08 08:42:20 corosync [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 08 08:42:20 corosync [QUORUM] Members[1]: 3
Nov 08 08:42:20 corosync [CLM   ] CLM CONFIGURATION CHANGE
Nov 08 08:42:20 corosync [CLM   ] New Configuration:
Nov 08 08:42:20 corosync [CLM   ] <---->r(0) ip(192.168.0.213).
Nov 08 08:42:20 corosync [CLM   ] Members Left:
Nov 08 08:42:20 corosync [CLM   ] Members Joined:
Nov 08 08:42:20 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 08 08:42:20 corosync [CPG   ] chosen downlist: sender r(0) ip(192.168.0.213) ; members(old:3 left:2)
Nov 08 08:42:20 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Nov 08 09:35:24 corosync [SERV  ] Unloading all Corosync service engines.

This thread doesn't help: http://forum.proxmox.com/threads/10376-Interesting-Observations-and-solution-Cluster-issues-(quorum)
So, the questions are: if a reboot helps but restarting the services does not, what additional service do I need to restart to regain quorum? Or what else can I do to regain quorum without rebooting the node? Because rebooting nodes is a very bad idea...

P.S. Proxmox-2.2-26/c1614c8c.
 
Code:
root@virt2:~# service cman start
Starting cluster:
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... Timed-out waiting for cluster
[FAILED]

corosync.log
Code:
Nov 10 15:37:44 corosync [MAIN  ] Corosync Cluster Engine ('1.4.4'): started and ready to provide service.
Nov 10 15:37:44 corosync [MAIN  ] Corosync built-in features: nss
Nov 10 15:37:44 corosync [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
Nov 10 15:37:44 corosync [MAIN  ] Successfully parsed cman config
Nov 10 15:37:44 corosync [MAIN  ] Successfully configured openais services to load
Nov 10 15:37:44 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
Nov 10 15:37:44 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Nov 10 15:37:44 corosync [TOTEM ] The network interface [192.168.0.212] is now up.
Nov 10 15:37:44 corosync [QUORUM] Using quorum provider quorum_cman
Nov 10 15:37:44 corosync [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Nov 10 15:37:44 corosync [CMAN  ] CMAN 1349169030 (built Oct  2 2012 11:10:34) started
Nov 10 15:37:44 corosync [SERV  ] Service engine loaded: corosync CMAN membership service 2.90
Nov 10 15:37:44 corosync [SERV  ] Service engine loaded: openais cluster membership service B.01.01
Nov 10 15:37:44 corosync [SERV  ] Service engine loaded: openais event service B.01.01
Nov 10 15:37:44 corosync [SERV  ] Service engine loaded: openais checkpoint service B.01.01
Nov 10 15:37:44 corosync [SERV  ] Service engine loaded: openais message service B.03.01
Nov 10 15:37:44 corosync [SERV  ] Service engine loaded: openais distributed locking service B.03.01
Nov 10 15:37:44 corosync [SERV  ] Service engine loaded: openais timer service A.01.01
Nov 10 15:37:44 corosync [SERV  ] Service engine loaded: corosync extended virtual synchrony service
Nov 10 15:37:44 corosync [SERV  ] Service engine loaded: corosync configuration service
Nov 10 15:37:44 corosync [SERV  ] Service engine loaded: corosync cluster closed process group service v1.
Nov 10 15:37:44 corosync [SERV  ] Service engine loaded: corosync cluster config database access v1.01
Nov 10 15:37:44 corosync [SERV  ] Service engine loaded: corosync profile loading service
Nov 10 15:37:44 corosync [QUORUM] Using quorum provider quorum_cman
Nov 10 15:37:44 corosync [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Nov 10 15:37:44 corosync [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
Nov 10 15:37:44 corosync [CLM   ] CLM CONFIGURATION CHANGE
Nov 10 15:37:44 corosync [CLM   ] New Configuration:
Nov 10 15:37:44 corosync [CLM   ] Members Left:
Nov 10 15:37:44 corosync [CLM   ] Members Joined:
Nov 10 15:37:44 corosync [CLM   ] CLM CONFIGURATION CHANGE
Nov 10 15:37:44 corosync [CLM   ] New Configuration:
Nov 10 15:37:44 corosync [CLM   ] <---->r(0) ip(192.168.0.212).
Nov 10 15:37:44 corosync [CLM   ] Members Left:
Nov 10 15:37:44 corosync [CLM   ] Members Joined:
Nov 10 15:37:44 corosync [CLM   ] <---->r(0) ip(192.168.0.212).
Nov 10 15:37:44 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 10 15:37:44 corosync [QUORUM] Members[1]: 2
Nov 10 15:37:44 corosync [QUORUM] Members[1]: 2
Nov 10 15:37:44 corosync [CPG   ] chosen downlist: sender r(0) ip(192.168.0.212) ; members(old:0 left:0)
As of the time this post was made, the node is out of the cluster, which keeps working on the two other nodes, which are OK.

P.S. Cluster nodes: virt1 (192.168.0.211), virt2 (192.168.0.212), virt3 (192.168.0.213). virt1 and virt2 are identical Intel servers (S5520HC); virt3 is a Dell PowerEdge 2900.
bond0 on all servers is in round-robin (balance-rr) mode.
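A minimal sketch of how the network is set up on the nodes in /etc/network/interfaces (interface names and netmask here are examples, the address shown is virt3's; the real config may differ slightly):
Code:
# both NICs slaved into bond0 in round-robin mode
auto bond0
iface bond0 inet manual
        slaves eth0 eth1
        bond_mode balance-rr
        bond_miimon 100

# the Proxmox bridge runs on top of the bond
auto vmbr0
iface vmbr0 inet static
        address 192.168.0.213
        netmask 255.255.255.0
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0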
 
Multicast is OK - that was the first thing checked and re-checked.
Code:
root@virt2:~# asmping 224.0.0.1 192.168.0.211
asmping joined (S,G) = (*,224.0.0.234)
pinging 192.168.0.211 from 192.168.0.212
  unicast from 192.168.0.211, seq=1 dist=0 time=0.214 ms
multicast from 192.168.0.211, seq=1 dist=0 time=0.232 ms
  unicast from 192.168.0.211, seq=2 dist=0 time=0.206 ms
multicast from 192.168.0.211, seq=2 dist=0 time=0.219 ms
  unicast from 192.168.0.211, seq=3 dist=0 time=0.186 ms
multicast from 192.168.0.211, seq=3 dist=0 time=0.197 ms
  unicast from 192.168.0.211, seq=4 dist=0 time=0.191 ms
multicast from 192.168.0.211, seq=4 dist=0 time=0.202 ms
  unicast from 192.168.0.211, seq=5 dist=0 time=0.175 ms
multicast from 192.168.0.211, seq=5 dist=0 time=0.188 ms
^C
The software on the switch that connects the nodes was updated to the latest version, to rule out firmware problems, network troubles, etc.
Should I try the unicast communication mechanism? Or are there any other ideas on what to check?
And, asking my question from post #1 again:
if a reboot helps but restarting the services does not, what additional service do I need to restart to regain quorum?
 
Service 'cman' handles quorum.
That is clear from the documentation.
Simply restarting cman (and the other services - see my post #1) does not work in almost all cases of quorum loss. But if I reboot the problem node, it stops and starts those same services, and the reboot helps. So restarting cman without a reboot does nothing, while restarting cman via a reboot helps. There must be some additional step needed to regain quorum without a reboot. What is it? Timeouts between stopping and starting cman? Additional services? Network services? Network drivers? Bonding?
 
I had the same issue. In our lab we had to shut down everything due to a power outage. I could only solve the problem by first turning on the node that had been turned off last. After that I turned on the other nodes and everything worked fine.
 
...
2 Intel servers, 1 Dell PowerEdge server), the network is OK (no errors in the switch log), and there are no power failures.


Hi Dmitry

I had problems with Broadcom NICs (which come with DELL servers) in a bond with balance-alb and balance-tlb, with no network error messages at all. After I changed to active-backup I had no more problems. I suggest you start testing with only a bridge.

Cesar
 
macday

Thanks for the reply.
I tried different combinations of turning nodes on and off and rebooting them in various orders or at random - I found no pattern in it.

spirit

Thanks a lot.
I'm also thinking that the problem may be in the bonding. No, I haven't tried other bonding modes yet - the old switch firmware can't do LACP, for example. I will try other modes soon.
One thing is strange: balance-rr mode worked quite stably on the cluster for a couple of weeks. Random quorum loss on a node did happen, but it was solved by restarting the cluster services. After upgrading to 2.2 the problems became frequent.

And what kind of problems did you have with balance-rr?
 
... After upgrading to 2.2 the problems became frequent.

And what kind of problems did you have with balance-rr?

Hi Dmitry


Edited for a better explanation:
Just to test whether the bond is your problem, don't use a bond at all, only the bridge. You can change this in the PVE GUI, then reboot the host to apply the changes, and then tell me the results. Currently I use active-backup and it works well with Broadcom NICs (Broadcom NICs come with DELL servers).

I have experience with PVE 1.x and 2.x with Broadcom NICs. A bond in alb/tlb mode with the same switch had no problems on PVE 1.x, but on PVE 2.x it had problems, and I had to change the bond to active-backup; since then I have had no problems (I think there is a bug in the PVE 2.2 code).

If a bond in active-backup mode works well for you: its failover is transparent to the applications, so you will not lose anything as long as your applications work over TCP (not UDP), and the network connection is assured for your VMs (and this includes the PVE cluster). A bond in active-backup mode is the best option for the PVE cluster and for LAN/TCP connections when the other bond modes don't work well.

And balance-rr is generally used with crossover cables between hosts, not with switches. With Broadcom NICs and crossover cables between hosts on PVE 2.2 I had no problems, but I use that for other purposes.
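
For example, an active-backup bond stanza in /etc/network/interfaces might look roughly like this (eth0/eth1 are only example NIC names, adapt them to yours):
Code:
# bond0 in active-backup: only one NIC carries traffic,
# the other takes over transparently if the active one fails
auto bond0
iface bond0 inet manual
        slaves eth0 eth1
        bond_mode active-backup
        bond_miimon 100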

Best regards
Cesar
 
cesarpk

Thanks a lot for the answer.
OK, so it seems the problem is with the balance-rr bonding type. I'll change it, first, to balance-xor, for example...

Thanks all for the help. I'll change the bond type and post the results here.
 
cesarpk

Thanks a lot for the answer.
OK, so it seems the problem is with the balance-rr bonding type. I'll change it, first, to balance-xor, for example...

Thanks all for the help. I'll change the bond type and post the results here.

Hi Dmitry

First you must be sure that the bond isn't the problem; your first test must be without any bond.
I've re-edited my previous post, please see it.

Best regards
Cesar
 
cesarpk

Thanks for the answers.

First I tried balance-xor, with no success - the cluster did not come together, neither with service restarts nor with reboots. I tried for one day. Then active-backup - no success; I waited for one day.
Then I switched to a plain bridge, with no switch trunking and no bonding on the nodes - still no success. This morning my colleagues said the cluster had become ready (and the log confirms that), but now the cluster has no quorum again...

corosync.log
Code:
Nov 21 09:15:46 corosync [TOTEM ] A processor failed, forming new configuration.
Nov 21 09:15:58 corosync [CLM   ] CLM CONFIGURATION CHANGE
Nov 21 09:15:58 corosync [CLM   ] New Configuration:
Nov 21 09:15:58 corosync [CLM   ] <---->r(0) ip(192.168.0.212).
Nov 21 09:15:58 corosync [CLM   ] Members Left:
Nov 21 09:15:58 corosync [CLM   ] <---->r(0) ip(192.168.0.211).
Nov 21 09:15:58 corosync [CLM   ] <---->r(0) ip(192.168.0.213).
Nov 21 09:15:58 corosync [CLM   ] Members Joined:
Nov 21 09:15:58 corosync [QUORUM] Members[2]: 2 3
Nov 21 09:15:58 corosync [CMAN  ] quorum lost, blocking activity
Nov 21 09:15:58 corosync [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 21 09:15:58 corosync [QUORUM] Members[1]: 2
Nov 21 09:15:58 corosync [CLM   ] CLM CONFIGURATION CHANGE
Nov 21 09:15:58 corosync [CLM   ] New Configuration:
Nov 21 09:15:58 corosync [CLM   ] <---->r(0) ip(192.168.0.212).
Nov 21 09:15:58 corosync [CLM   ] Members Left:
Nov 21 09:15:58 corosync [CLM   ] Members Joined:
Nov 21 09:15:58 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed
Nov 21 09:15:58 corosync [CPG   ] chosen downlist: sender r(0) ip(192.168.0.212) ; members(old:3 left:2)
Nov 21 09:15:58 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Nov 21 09:28:01 corosync [SERV  ] Unloading all Corosync service engines.

and then..
Code:
Nov 21 09:28:16 corosync [CLM   ] CLM CONFIGURATION CHANGE
Nov 21 09:28:16 corosync [CLM   ] New Configuration:
Nov 21 09:28:16 corosync [CLM   ] <---->r(0) ip(192.168.0.212).
Nov 21 09:28:16 corosync [CLM   ] Members Left:
Nov 21 09:28:16 corosync [CLM   ] Members Joined:
Nov 21 09:28:16 corosync [CLM   ] <---->r(0) ip(192.168.0.212).
Nov 21 09:28:16 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed
Nov 21 09:28:16 corosync [QUORUM] Members[1]: 2
Nov 21 09:28:16 corosync [QUORUM] Members[1]: 2
Nov 21 09:28:16 corosync [CPG   ] chosen downlist: sender r(0) ip(192.168.0.212) ; members(old:0 left:0)
Nov 21 09:28:16 corosync [MAIN  ] Completed service synchronization, ready to provide service.

at the end:

Code:
... (totem retransmiting data)
Nov 21 09:41:02 corosync [TOTEM ] FAILED TO RECEIVE
 
So the problem only occurs on this specific node? If so, can you please test using other hardware (replace the node)?
 
So the problem only occurs on this specific node? If so, can you please test using other hardware (replace the node)?
Not only on this specific node (virt2). Here are the logs from the two other nodes:

virt1 - cluster main node:

Code:
Nov 21 09:15:46 corosync [TOTEM ] A processor failed, forming new configuration.
Nov 21 09:15:58 corosync [CLM   ] CLM CONFIGURATION CHANGE
Nov 21 09:15:58 corosync [CLM   ] New Configuration:
Nov 21 09:15:58 corosync [CLM   ] <---->r(0) ip(192.168.0.211).
Nov 21 09:15:58 corosync [CLM   ] Members Left:
Nov 21 09:15:58 corosync [CLM   ] <---->r(0) ip(192.168.0.212).
Nov 21 09:15:58 corosync [CLM   ] <---->r(0) ip(192.168.0.213).
Nov 21 09:15:58 corosync [CLM   ] Members Joined:
Nov 21 09:15:58 corosync [QUORUM] Members[2]: 1 3
Nov 21 09:15:58 corosync [CMAN  ] quorum lost, blocking activity
Nov 21 09:15:58 corosync [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 21 09:15:58 corosync [QUORUM] Members[1]: 1
Nov 21 09:15:58 corosync [CLM   ] CLM CONFIGURATION CHANGE
Nov 21 09:15:58 corosync [CLM   ] New Configuration:
Nov 21 09:15:58 corosync [CLM   ] <---->r(0) ip(192.168.0.211).
Nov 21 09:15:58 corosync [CLM   ] Members Left:
Nov 21 09:15:58 corosync [CLM   ] Members Joined:
Nov 21 09:15:58 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 21 09:15:58 corosync [CPG   ] chosen downlist: sender r(0) ip(192.168.0.211) ; members(old:3 left:2)
Nov 21 09:15:58 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Nov 21 09:32:00 corosync [CLM   ] CLM CONFIGURATION CHANGE
Nov 21 09:32:00 corosync [CLM   ] New Configuration:
Nov 21 09:32:00 corosync [CLM   ] <---->r(0) ip(192.168.0.211).
Nov 21 09:32:00 corosync [CLM   ] Members Left:
Nov 21 09:32:00 corosync [CLM   ] Members Joined:
Nov 21 09:32:00 corosync [CLM   ] CLM CONFIGURATION CHANGE
Nov 21 09:32:00 corosync [CLM   ] New Configuration:
Nov 21 09:32:00 corosync [CLM   ] <---->r(0) ip(192.168.0.211).
Nov 21 09:32:00 corosync [CLM   ] <---->r(0) ip(192.168.0.212).
Nov 21 09:32:00 corosync [CLM   ] Members Left:
Nov 21 09:32:00 corosync [CLM   ] Members Joined:

...

Nov 21 09:41:14 corosync [CLM   ] CLM CONFIGURATION CHANGE
Nov 21 09:41:14 corosync [CLM   ] New Configuration:
Nov 21 09:41:14 corosync [CLM   ] <---->r(0) ip(192.168.0.211).
Nov 21 09:41:14 corosync [CLM   ] Members Left:
Nov 21 09:41:14 corosync [CLM   ] Members Joined:
Nov 21 09:41:14 corosync [CLM   ] CLM CONFIGURATION CHANGE
Nov 21 09:41:14 corosync [CLM   ] New Configuration:
Nov 21 09:41:14 corosync [CLM   ] <---->r(0) ip(192.168.0.211).
Nov 21 09:41:14 corosync [CLM   ] Members Left:
Nov 21 09:41:14 corosync [CLM   ] Members Joined:
Nov 21 09:41:14 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 21 09:41:14 corosync [CPG   ] chosen downlist: sender r(0) ip(192.168.0.211) ; members(old:1 left:0)
Nov 21 09:41:14 corosync [MAIN  ] Completed service synchronization, ready to provide service.

and the virt3 node has the following:

Code:
Nov 21 09:06:13 corosync [CLM   ] CLM CONFIGURATION CHANGE
Nov 21 09:06:13 corosync [CLM   ] New Configuration:
Nov 21 09:06:13 corosync [CLM   ] <---->r(0) ip(192.168.0.211).
Nov 21 09:06:13 corosync [CLM   ] <---->r(0) ip(192.168.0.212).
Nov 21 09:06:13 corosync [CLM   ] <---->r(0) ip(192.168.0.213).
Nov 21 09:06:13 corosync [CLM   ] Members Left:
Nov 21 09:06:13 corosync [CLM   ] Members Joined:
Nov 21 09:06:13 corosync [CLM   ] <---->r(0) ip(192.168.0.211).
Nov 21 09:06:13 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 21 09:06:13 corosync [QUORUM] Members[3]: 1 2 3
Nov 21 09:06:13 corosync [QUORUM] Members[3]: 1 2 3
Nov 21 09:06:13 corosync [CPG   ] chosen downlist: sender r(0) ip(192.168.0.212) ; members(old:2 left:0)
Nov 21 09:06:13 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Nov 21 09:10:46 corosync [TOTEM ] Retransmit List: 698 699 69a 69b 69c 69d 69e 69f 6a0 6a1 6a2 6a3.
...
Nov 21 09:14:54 corosync [TOTEM ] FAILED TO RECEIVE

As we can see, the virt1 and virt2 nodes show the same errors at almost the same time. These nodes are identical Intel S5520HC servers with 2 Gigabit Intel 82575EB NICs (lspci output follows):
Code:
00:00.0 Host bridge: Intel Corporation 5520 I/O Hub to ESI Port (rev 22)
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22)
00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 22)
00:05.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 5 (rev 22)
00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 22)
00:09.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 9 (rev 22)
00:10.0 PIC: Intel Corporation 5520/5500/X58 Physical and Link Layer Registers Port 0 (rev 22)
00:10.1 PIC: Intel Corporation 5520/5500/X58 Routing and Protocol Layer Registers Port 0 (rev 22)
00:11.0 PIC: Intel Corporation 5520/5500 Physical and Link Layer Registers Port 1 (rev 22)
00:11.1 PIC: Intel Corporation 5520/5500 Routing & Protocol Layer Register Port 1 (rev 22)
00:13.0 PIC: Intel Corporation 5520/5500/X58 I/O Hub I/OxAPIC Interrupt Controller (rev 22)
00:14.0 PIC: Intel Corporation 5520/5500/X58 I/O Hub System Management Registers (rev 22)
00:14.1 PIC: Intel Corporation 5520/5500/X58 I/O Hub GPIO and Scratch Pad Registers (rev 22)
00:14.2 PIC: Intel Corporation 5520/5500/X58 I/O Hub Control Status and RAS Registers (rev 22)
00:14.3 PIC: Intel Corporation 5520/5500/X58 I/O Hub Throttle Registers (rev 22)
00:15.0 PIC: Intel Corporation 5520/5500/X58 Trusted Execution Technology Registers (rev 22)
00:16.0 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.1 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.2 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.3 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.4 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.5 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.6 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.7 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:1a.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4
00:1a.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #5
00:1a.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6
00:1a.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2
00:1c.0 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express Root Port 1
00:1c.4 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express Root Port 5
00:1d.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1
00:1d.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2
00:1d.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3
00:1d.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
00:1f.0 ISA bridge: Intel Corporation 82801JIR (ICH10R) LPC Interface Controller
00:1f.3 SMBus: Intel Corporation 82801JI (ICH10 Family) SMBus Controller
01:00.0 Ethernet controller: Intel Corporation 82575EB Gigabit Network Connection (rev 02)
01:00.1 Ethernet controller: Intel Corporation 82575EB Gigabit Network Connection (rev 02)
04:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)
07:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200e [Pilot] ServerEngines (SEP1) (rev 02)

virt3 is a Dell PowerEdge 2900 server.

Code:
00:00.0 Host bridge: Intel Corporation 5000X Chipset Memory Controller Hub (rev 12)
00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 2 (rev 12)
00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 3 (rev 12)
00:04.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 4 (rev 12)
00:05.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 5 (rev 12)
00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 6-7 (rev 12)
00:07.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 7 (rev 12)
00:10.0 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev 12)
00:10.1 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev 12)
00:10.2 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev 12)
00:11.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev 12)
00:13.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev 12)
00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev 12)
00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev 12)
00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 Chipset PCI Express Root Port 1 (rev 09)
00:1d.0 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #1 (rev 09)
00:1d.1 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #2 (rev 09)
00:1d.2 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #3 (rev 09)
00:1d.3 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #4 (rev 09)
00:1d.7 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset EHCI USB2 Controller (rev 09)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 Chipset LPC Interface Controller (rev 09)
00:1f.2 IDE interface: Intel Corporation 631xESB/632xESB/3100 Chipset SATA IDE Controller (rev 09)
01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04)
02:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev c3)
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
04:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Upstream Port (rev 01)
04:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express to PCI-X Bridge (rev 01)
05:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E1 (rev 01)
05:01.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E2 (rev 01)
06:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev c3)
07:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
0e:0d.0 VGA compatible controller: Advanced Micro Devices [AMD] nee ATI ES1000 (rev 02)

dmesg and the other logs on the nodes show nothing unusual or strange - no errors, no network errors, etc.
 
This is a moderated forum, so it takes a while until you see your post online. So please stop posting things twice.

Besides, I do not see what's wrong (I see the bug, but have no idea about the cause).
 
This is a moderated forum, so it takes a while until you see your post online. So please stop posting things twice.
I'm sorry about that. Sometimes the forum shows nothing in response to my post, so I post it again, thinking an error occurred. I will not do this any more.

Besides, I do not see what's wrong (I see the bug, but have no idea about the cause).
This is a forum - maybe someone else knows :)
 
Well, I switched the nodes to use transport="udpu" in cluster.conf - the cluster gets quorum. I will run it like this for a few days on the plain bridge, then try bonding again.
It seems to be some problem with multicast: hardware, the Linux kernel, NIC drivers, etc. The cluster's multicast address does not fall into the 224.x.x.x range used for switch intercommunication, and the switch allows multicast.
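For reference, the relevant part of cluster.conf now looks roughly like this (a sketch, not my exact file - the keyfile attribute is the Proxmox default, and config_version has to be increased on every edit):
Code:
<?xml version="1.0"?>
<cluster name="sdpi" config_version="7">
  <!-- transport="udpu" switches corosync from multicast to unicast UDP -->
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey" transport="udpu"/>
  <clusternodes>
    <clusternode name="virt1" votes="1" nodeid="1"/>
    <clusternode name="virt2" votes="1" nodeid="2"/>
    <clusternode name="virt3" votes="1" nodeid="3"/>
  </clusternodes>
</cluster>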
 
