Proxmox HA 4.4 Quorum lost

Hi.

We have a 3-node HA cluster.
Two nodes (pve2 and pve3) have DRBD9 storage, and one node (pve1) is there for quorum.
pve2/3 - HP 380G6/7
pve1 - Dell 2950

Network:

PVE3 -- (LACP 4*1G) -- Cisco 7604 -- (LACP 2*10G) -- Extreme 650-24x -- (10G) -- PVE2
Extreme 650-24x -- (2*10G) -- Dlink 3620 -- (LACP 2*1G) -- PVE1


A few weeks ago we ran into trouble: two nodes (pve2 and pve3) suddenly rebooted.
The logs showed that quorum had been lost and that the nodes were rebooted by the watchdog.
The third node had some trouble accessing files, but it kept running.
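
For reference, a minimal sketch of the majority rule behind "quorum lost", assuming every node carries one vote as in the config further down. The function names are made up for illustration and are not part of corosync.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for a strict majority: floor(expected_votes / 2) + 1."""
    return expected_votes // 2 + 1


def is_quorate(votes_present: int, expected_votes: int) -> bool:
    """True when the visible votes reach the majority threshold."""
    return votes_present >= quorum_threshold(expected_votes)


# Three nodes (pve1, pve2, pve3), one vote each.
EXPECTED_VOTES = 3
for present in range(EXPECTED_VOTES + 1):
    state = "quorate" if is_quorate(present, EXPECTED_VOTES) else "NOT quorate"
    print(f"{present} of {EXPECTED_VOTES} votes visible: {state}")
```

With three votes the threshold is two, so a node that can only see itself is no longer quorate.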

The second reboot happened a week later, with the same log pattern: lost quorum, watchdog, reboot.
The third came 5 days after that.
The multicast network test passed without problems, but we decided to switch corosync to unicast mode anyway
and to enable more logging.

Today we had one more reboot.
All nodes have the latest updates installed.
pve2:~# pveversion -v
proxmox-ve: 4.4-84 (running kernel: 4.4.44-1-pve)
pve-manager: 4.4-15 (running version: 4.4-15/7599e35a)
pve-kernel-4.4.44-1-pve: 4.4.44-84
pve-kernel-4.2.2-1-pve: 4.2.2-16
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0.1-1
pve-cluster: 4.0-52
qemu-server: 4.0-110
pve-firmware: 1.1-11
libpve-common-perl: 4.0-95
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-101
pve-firewall: 2.0-33
pve-ha-manager: 1.0-41
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
drbdmanage: 0.97.3-1


Before the reboot, the log showed the following:
Aug 01 21:56:33 [4021] pve2 corosync notice [TOTEM ] Retransmit List: 5cac6b
Aug 01 23:14:43 [4021] pve2 corosync notice [TOTEM ] Retransmit List: 5d5f9c
Aug 02 09:09:27 [4021] pve2 corosync notice [TOTEM ] Retransmit List: 62b1c3
Aug 02 15:29:22 [4021] pve2 corosync notice [TOTEM ] A processor failed, forming new configuration.
Aug 02 15:29:22 [4021] pve2 corosync notice [TOTEM ] A new membership (10.10.10.91:1224) was formed. Members
Aug 02 15:29:22 [4021] pve2 corosync notice [QUORUM] Members[3]: 3 4 1
Aug 02 15:29:22 [4021] pve2 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 02 15:29:31 [4021] pve2 corosync notice [TOTEM ] A processor failed, forming new configuration.
Aug 02 15:29:33 [4021] pve2 corosync notice [TOTEM ] A new membership (10.10.10.92:1228) was formed. Members left: 3
Aug 02 15:29:33 [4021] pve2 corosync notice [TOTEM ] Failed to receive the leave message. failed: 3
Aug 02 15:29:33 [4021] pve2 corosync notice [QUORUM] Members[2]: 4 1
Aug 02 15:29:33 [4021] pve2 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 02 15:29:34 [4021] pve2 corosync notice [TOTEM ] A new membership (10.10.10.91:1236) was formed. Members joined: 3
Aug 02 15:29:34 [4021] pve2 corosync notice [QUORUM] Members[3]: 3 4 1
Aug 02 15:29:34 [4021] pve2 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 02 15:29:43 [4021] pve2 corosync notice [TOTEM ] A processor failed, forming new configuration.
Aug 02 15:29:44 [4021] pve2 corosync notice [TOTEM ] A new membership (10.10.10.91:1240) was formed. Members
Aug 02 15:29:44 [4021] pve2 corosync notice [QUORUM] Members[3]: 3 4 1
Aug 02 15:29:44 [4021] pve2 corosync notice [MAIN ] Completed service synchronization, ready to provide service.

*reboot*
002644: Aug 2 15:30:38.271 ua: %LINEPROTO-5-UPDOWN: Line protocol on Interface Port-channel6, changed state to down (pve3)
08/02/2017 15:30:35.13 <Info:vlan.msgs.portLinkStateDown> Port 8 link down - Local fault (pve2)

Aug 02 15:35:49 [4030] pve2 corosync notice [MAIN ] Corosync Cluster Engine ('2.4.2'): started and ready to provide service.
Aug 02 15:35:49 [4030] pve2 corosync info [MAIN ] Corosync built-in features: augeas systemd pie relro bindnow
Aug 02 15:35:49 [4030] pve2 corosync warning [MAIN ] member section is used together with nodelist. Members ignored.
Aug 02 15:35:49 [4030] pve2 corosync warning [MAIN ] Please migrate config file to nodelist.
Aug 02 15:35:49 [4030] pve2 corosync notice [TOTEM ] Initializing transport (UDP/IP Unicast).
Aug 02 15:35:49 [4030] pve2 corosync notice [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
Aug 02 15:35:50 [4030] pve2 corosync notice [TOTEM ] The network interface [10.10.10.92] is now up.
Aug 02 15:35:50 [4030] pve2 corosync notice [SERV ] Service engine loaded: corosync configuration map access [0]
Aug 02 15:35:50 [4030] pve2 corosync info [QB ] server name: cmap
Aug 02 15:35:50 [4030] pve2 corosync notice [SERV ] Service engine loaded: corosync configuration service [1]
Aug 02 15:35:50 [4030] pve2 corosync info [QB ] server name: cfg
Aug 02 15:35:50 [4030] pve2 corosync notice [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Aug 02 15:35:50 [4030] pve2 corosync info [QB ] server name: cpg
Aug 02 15:35:50 [4030] pve2 corosync notice [SERV ] Service engine loaded: corosync profile loading service [4]
Aug 02 15:35:50 [4030] pve2 corosync notice [QUORUM] Using quorum provider corosync_votequorum
Aug 02 15:35:50 [4030] pve2 corosync notice [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Aug 02 15:35:50 [4030] pve2 corosync info [QB ] server name: votequorum
Aug 02 15:35:50 [4030] pve2 corosync notice [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Aug 02 15:35:50 [4030] pve2 corosync info [QB ] server name: quorum
Aug 02 15:35:50 [4030] pve2 corosync notice [TOTEM ] adding new UDPU member {10.10.10.91}
Aug 02 15:35:50 [4030] pve2 corosync notice [TOTEM ] adding new UDPU member {10.10.10.93}
Aug 02 15:35:50 [4030] pve2 corosync notice [TOTEM ] adding new UDPU member {10.10.10.92}
Aug 02 15:35:50 [4030] pve2 corosync notice [TOTEM ] A new membership (10.10.10.92:1244) was formed. Members joined: 4
Aug 02 15:35:50 [4030] pve2 corosync notice [TOTEM ] A new membership (10.10.10.91:1252) was formed. Members joined: 3 1
Aug 02 15:35:50 [4030] pve2 corosync notice [QUORUM] This node is within the primary component and will provide service.
Aug 02 15:35:50 [4030] pve2 corosync notice [QUORUM] Members[0]:
Aug 02 15:35:50 [4030] pve2 corosync notice [QUORUM] Members[3]: 3 4 1
Aug 02 15:35:50 [4030] pve2 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 02 15:37:25 [4030] pve2 corosync notice [TOTEM ] Retransmit List: 595 596 597
Aug 02 15:37:30 [4030] pve2 corosync notice [TOTEM ] Retransmit List: 5c7 5c8 5c9

Logs from the node that stayed up (pve1):
Aug 02 15:29:19 [3421] pve1 corosync notice [TOTEM ] Retransmit List: 661900
Aug 02 15:29:22 [3421] pve1 corosync notice [TOTEM ] A processor failed, forming new configuration.
Aug 02 15:29:22 [3421] pve1 corosync notice [TOTEM ] A new membership (10.10.10.91:1224) was formed. Members
Aug 02 15:29:22 [3421] pve1 corosync notice [QUORUM] Members[3]: 3 4 1
Aug 02 15:29:22 [3421] pve1 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 02 15:29:31 [3421] pve1 corosync notice [TOTEM ] A processor failed, forming new configuration.
Aug 02 15:29:34 [3421] pve1 corosync notice [TOTEM ] A new membership (10.10.10.91:1236) was formed. Members joined: 4 1 left: 4 1
Aug 02 15:29:34 [3421] pve1 corosync notice [TOTEM ] Failed to receive the leave message. failed: 4 1
Aug 02 15:29:34 [3421] pve1 corosync notice [QUORUM] Members[3]: 3 4 1
Aug 02 15:29:34 [3421] pve1 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 02 15:29:43 [3421] pve1 corosync notice [TOTEM ] A processor failed, forming new configuration.
Aug 02 15:29:44 [3421] pve1 corosync notice [TOTEM ] A new membership (10.10.10.91:1240) was formed. Members
Aug 02 15:29:44 [3421] pve1 corosync notice [QUORUM] Members[3]: 3 4 1
Aug 02 15:29:44 [3421] pve1 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 02 15:30:36 [3421] pve1 corosync notice [TOTEM ] A processor failed, forming new configuration.
Aug 02 15:30:38 [3421] pve1 corosync notice [TOTEM ] A new membership (10.10.10.91:1244) was formed. Members left: 4 1
Aug 02 15:30:38 [3421] pve1 corosync notice [TOTEM ] Failed to receive the leave message. failed: 4 1
Aug 02 15:30:38 [3421] pve1 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
Aug 02 15:30:38 [3421] pve1 corosync notice [QUORUM] Members[1]: 3
Aug 02 15:30:38 [3421] pve1 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 02 15:33:57 [3421] pve1 corosync notice [TOTEM ] A new membership (10.10.10.91:1248) was formed. Members joined: 1
Aug 02 15:33:57 [3421] pve1 corosync notice [QUORUM] This node is within the primary component and will provide service.
Aug 02 15:33:57 [3421] pve1 corosync notice [QUORUM] Members[2]: 3 1
Aug 02 15:33:57 [3421] pve1 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 02 15:35:50 [3421] pve1 corosync notice [TOTEM ] A new membership (10.10.10.91:1252) was formed. Members joined: 4
Aug 02 15:35:50 [3421] pve1 corosync notice [QUORUM] Members[3]: 3 4 1
Aug 02 15:35:50 [3421] pve1 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
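
To make the membership changes easier to follow, here is a rough sketch that pulls the `[QUORUM] Members[...]` lines out of a log like the ones above and marks each one as quorate or not. It assumes the line layout shown in this post and one vote per node. The regular expression and the `quorum_timeline` helper are written for this example only.

```python
import re

# Matches lines such as:
# Aug 02 15:30:38 [3421] pve1 corosync notice [QUORUM] Members[1]: 3
MEMBERS_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \[\d+\] (?P<host>\S+) corosync "
    r"notice\s+\[QUORUM\s*\] Members\[(?P<count>\d+)\]:(?P<ids>[\d ]*)$"
)


def quorum_timeline(lines, expected_votes=3):
    """Return (timestamp, host, member_ids, quorate) for each Members[] line."""
    needed = expected_votes // 2 + 1  # strict majority, one vote per node assumed
    events = []
    for line in lines:
        match = MEMBERS_RE.match(line.strip())
        if not match:
            continue  # skip TOTEM, MAIN and other lines
        ids = [int(x) for x in match.group("ids").split()]
        events.append((match.group("ts"), match.group("host"), ids, len(ids) >= needed))
    return events


# A few lines copied from the pve1 log above.
SAMPLE = [
    "Aug 02 15:29:44 [3421] pve1 corosync notice [QUORUM] Members[3]: 3 4 1",
    "Aug 02 15:30:36 [3421] pve1 corosync notice [TOTEM ] A processor failed, forming new configuration.",
    "Aug 02 15:30:38 [3421] pve1 corosync notice [QUORUM] Members[1]: 3",
    "Aug 02 15:33:57 [3421] pve1 corosync notice [QUORUM] Members[2]: 3 1",
]

for ts, host, ids, quorate in quorum_timeline(SAMPLE):
    print(ts, host, ids, "quorate" if quorate else "NOT quorate")
```

On the sample lines this flags 15:30:38 as the moment pve1 was alone (member 3 only) and 15:33:57 as the point where it was back in a two-node majority.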


cat /etc/corosync/corosync.conf
logging {
  fileline: off
  to_logfile: yes
  to_syslog: yes
  debug: on
  logfile: /var/log/corosync/corosync.log
  debug: off
  timestamp: on
  logger_subsys {
    subsys: AMF
    debug: off
  }
}

nodelist {
  node {
    name: pve1
    nodeid: 3
    quorum_votes: 1
    ring0_addr: pve1
  }

  node {
    name: pve3
    nodeid: 1
    quorum_votes: 1
    ring0_addr: pve3
  }

  node {
    name: pve2
    nodeid: 4
    quorum_votes: 1
    ring0_addr: pve2
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: PVE-Intraffic
  config_version: 16
  ip_version: ipv4
  secauth: off
  version: 2
  interface {
    member {
      memberaddr: 10.10.10.91
    }
    member {
      memberaddr: 10.10.10.92
    }
    member {
      memberaddr: 10.10.10.93
    }
    bindnetaddr: 10.10.10.0
    ringnumber: 0
    mcastport: 5405
    ttl: 1
  }
  transport: udpu
}
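
Two things in this config stand out: `debug` is set twice in the `logging` section, and `member` sections are used together with a `nodelist` (corosync warns about the latter at startup, see the pve2 log above). A small sketch that flags both is shown below. It only handles the simple `name {`, `key: value` and `}` line forms used here, and `scan_corosync_conf` is a hypothetical helper, not a corosync tool. It does not try to say which of the duplicate values takes effect.

```python
def scan_corosync_conf(text):
    """Return (duplicate_keys, has_nodelist, has_member_sections)."""
    stack = []      # names of the currently open sections
    seen = [set()]  # keys seen so far at each nesting level
    duplicates = []
    has_nodelist = False
    has_member = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("{"):
            name = line[:-1].strip()
            stack.append(name)
            seen.append(set())
            if name == "nodelist":
                has_nodelist = True
            if name == "member" and "interface" in stack:
                has_member = True
        elif line == "}":
            stack.pop()
            seen.pop()
        elif ":" in line:
            key = line.split(":", 1)[0].strip()
            if key in seen[-1]:
                duplicates.append(("/".join(stack), key))
            seen[-1].add(key)
    return duplicates, has_nodelist, has_member


# Shortened excerpt of the config posted above.
SAMPLE_CONF = """
logging {
  debug: on
  logfile: /var/log/corosync/corosync.log
  debug: off
  logger_subsys {
    subsys: AMF
    debug: off
  }
}
nodelist {
  node {
    name: pve1
    nodeid: 3
  }
}
totem {
  interface {
    member {
      memberaddr: 10.10.10.91
    }
    bindnetaddr: 10.10.10.0
  }
  transport: udpu
}
"""

dups, nodelist_found, member_found = scan_corosync_conf(SAMPLE_CONF)
print("duplicate keys:", dups)
print("nodelist and member sections both present:", nodelist_found and member_found)
```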