Cluster going down randomly

bladux

Hi,

I'm having a strange issue on my Proxmox 4.3 cluster: from time to time all nodes appear red in the web GUI. The GUI is also sometimes not reachable at all, and I have to restart nodes.

I figured out that the logs always show these lines, even when everything is marked green:

Code:
Nov 21 12:29:39 24 corosync[1108]: [TOTEM ] A new membership ( 10.0.0.11:1193828) was formed. Members
Nov 21 12:29:39 24 corosync[1108]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 12:29:39 24 corosync[1108]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 21 12:29:42 24 corosync[1108]: [TOTEM ] A new membership ( 10.0.0.11:1193832) was formed. Members
Nov 21 12:29:42 24 corosync[1108]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 12:29:42 24 corosync[1108]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 21 12:29:43 24 corosync[1108]: [TOTEM ] A new membership ( 10.0.0.11:1193836) was formed. Members
Nov 21 12:29:43 24 corosync[1108]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 12:29:43 24 corosync[1108]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 21 12:29:46 24 corosync[1108]: [TOTEM ] A new membership ( 10.0.0.11:1193840) was formed. Members
Nov 21 12:29:46 24 corosync[1108]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 12:29:46 24 corosync[1108]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 21 12:29:48 24 corosync[1108]: [TOTEM ] A new membership ( 10.0.0.11:1193844) was formed. Members
Nov 21 12:29:48 24 corosync[1108]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 12:29:48 24 corosync[1108]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 21 12:29:49 24 corosync[1108]: [TOTEM ] A new membership ( 10.0.0.11:1193848) was formed. Members
Nov 21 12:29:49 24 corosync[1108]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 12:29:49 24 corosync[1108]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 21 12:29:53 24 corosync[1108]: [TOTEM ] A new membership ( 10.0.0.11:1193852) was formed. Members
Nov 21 12:29:53 24 corosync[1108]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 12:29:53 24 corosync[1108]: [MAIN ] Completed service synchronization, ready to provide service.

Reading the forum led me to verify multicast, which seems to work nicely (omping works on all nodes using the multicast IP, no loss).
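For reference, the check I ran looked roughly like this (run at the same time on every node; the IPs are this cluster's nodes, and the count/interval are just my choice):

Code:
# each node reports unicast and multicast loss for every peer
omping -c 600 -i 1 -q 10.0.0.11 10.0.0.12 10.0.0.13 10.0.0.14 10.0.0.15 \
    10.0.0.16 10.0.0.17 10.0.0.18 10.0.0.19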

All hosts are in /etc/hosts.

I do see multicast traffic with tcpdump -n "multicast" | grep IP:

Code:
12:40:40.281906 IP 10.0.0.19.5404 > 239.192.109.205.5405: UDP, length 1448
12:40:40.281919 IP 10.0.0.19.5404 > 239.192.109.205.5405: UDP, length 824
12:40:40.282617 IP 10.0.0.11.5404 > 239.192.109.205.5405: UDP, length 1448
12:40:40.282621 IP 10.0.0.11.5404 > 239.192.109.205.5405: UDP, length 824
12:40:40.282977 IP 10.0.0.12.5404 > 239.192.109.205.5405: UDP, length 1448
12:40:40.282982 IP 10.0.0.12.5404 > 239.192.109.205.5405: UDP, length 824
12:40:40.283358 IP 10.0.0.13.5404 > 239.192.109.205.5405: UDP, length 1448
12:40:40.283370 IP 10.0.0.13.5404 > 239.192.109.205.5405: UDP, length 824
12:40:40.283816 IP 10.0.0.14.5404 > 239.192.109.205.5405: UDP, length 1448
12:40:40.283829 IP 10.0.0.14.5404 > 239.192.109.205.5405: UDP, length 824
12:40:40.284111 IP 10.0.0.15.5404 > 239.192.109.205.5405: UDP, length 1448
12:40:40.284124 IP 10.0.0.15.5404 > 239.192.109.205.5405: UDP, length 824
12:40:40.284406 IP 10.0.0.16.5404 > 239.192.109.205.5405: UDP, length 1448
12:40:40.284419 IP 10.0.0.16.5404 > 239.192.109.205.5405: UDP, length 824
12:40:40.284799 IP 10.0.0.17.5404 > 239.192.109.205.5405: UDP, length 1448
12:40:40.284812 IP 10.0.0.17.5404 > 239.192.109.205.5405: UDP, length 824
12:40:40.285107 IP 10.0.0.18.5404 > 239.192.109.205.5405: UDP, length 1448
12:40:40.285112 IP 10.0.0.18.5404 > 239.192.109.205.5405: UDP, length 824
12:40:40.287696 IP 10.0.0.18.5404 > 239.192.109.205.5405: UDP, length 296
12:40:40.287784 IP 10.0.0.19.5404 > 239.192.109.205.5405: UDP, length 296
12:40:40.288338 IP 10.0.0.11.5404 > 239.192.109.205.5405: UDP, length 296
12:40:40.288666 IP 10.0.0.12.5404 > 239.192.109.205.5405: UDP, length 296
12:40:40.289039 IP 10.0.0.13.5404 > 239.192.109.205.5405: UDP, length 296
12:40:40.289419 IP 10.0.0.14.5404 > 239.192.109.205.5405: UDP, length 296
12:40:40.289695 IP 10.0.0.15.5404 > 239.192.109.205.5405: UDP, length 296
12:40:40.289990 IP 10.0.0.16.5404 > 239.192.109.205.5405: UDP, length 296
12:40:40.290350 IP 10.0.0.17.5404 > 239.192.109.205.5405: UDP, length 296
12:40:40.381531 IP 10.0.0.17.5404 > 239.192.109.205.5405: UDP, length 88
12:40:40.383675 IP 10.0.0.17.5404 > 239.192.109.205.5405: UDP, length 1176

pvecm status does not show any errors:

Code:
root@proxmox9:~# pvecm status
Quorum information
------------------
Date: Mon Nov 21 12:43:16 2016
Quorum provider: corosync_votequorum
Nodes: 9
Node ID: 0x00000009
Ring ID: 1/1195216
Quorate: Yes

Votequorum information
----------------------
Expected votes: 9
Highest expected: 9
Total votes: 9
Quorum: 5
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.0.0.11
0x00000004 1 10.0.0.12
0x00000003 1 10.0.0.13
0x00000002 1 10.0.0.14
0x00000005 1 10.0.0.15
0x00000006 1 10.0.0.16
0x00000007 1 10.0.0.17
0x00000008 1 10.0.0.18
0x00000009 1 10.0.0.19 (local)

Any help would be greatly appreciated.
 
Have the same problem.

The only thing that helps is restarting corosync + pmxcfs manually, but that only helps for a few hours.
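By that I mean roughly the following (on PVE 4.x, pmxcfs is run by the pve-cluster unit):

Code:
# manual restart that buys a few hours
systemctl restart corosync
systemctl restart pve-cluster   # pve-cluster is the service that runs pmxcfs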

The pmxcfs logs in debug mode look like this:

Code:
Nov 21 17:53:14 proxmox1 pmxcfs[23774]: [status] debug: dfsm mode is 1 (dfsm.c:658:dfsm_cpg_deliver_callback)
Nov 21 17:53:14 proxmox1 pmxcfs[23774]: [status] debug: queue message 1425285 (subtype = 1, length = 437) (dfsm.c:700:dfsm_cpg_deliver_callback)
Nov 21 17:53:14 proxmox1 pmxcfs[23774]: [status] debug: dfsm mode is 1 (dfsm.c:658:dfsm_cpg_deliver_callback)

Multicast is also working fine between nodes. pvecm status is good.


Any ideas/workarounds on this?
 
When the cluster is "down", do the Proxmox services restart well?

(service pve-cluster restart, service pvedaemon restart, service pvestatd restart and service pveproxy restart?)
 
Cluster "down" goes from only primary node appears green, all others are red to a no GUI available at all.

In both cases, virtual hosts are up.

The Proxmox services do not restart well in the vast majority of cases, so I'm not even trying to restart them anymore; I simply reboot.

I'm now suspecting the NFS servers I use for backups, so I changed the NFS mounts from hard to soft in case an NFS mount hangs... Will keep you posted.
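For anyone wanting to do the same, a sketch of that change as an /etc/pve/storage.cfg entry (the storage name, server and export here are made up):

Code:
# 'soft' with bounded retries: a hung server eventually returns an I/O error
# instead of blocking processes forever
nfs: backup-nfs
        server 10.0.0.100
        export /export/backups
        path /mnt/pve/backup-nfs
        content backup
        options vers=3,soft,timeo=100,retrans=3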
 
When the cluster is "down", do the Proxmox services restart well?

(service pve-cluster restart, service pvedaemon restart, service pvestatd restart and service pveproxy restart?)

service pve-cluster restart does not work, because:

Code:
Nov 21 18:32:53 proxmox20 pmxcfs[9312]: [status] notice: queue not emtpy - resening 2 messages
Nov 21 18:32:53 proxmox20 pmxcfs[9312]: [status] notice: members: 1/23774, 2/10972, 3/17015, 4/2722, 8/2824, 9/31520, 10/15083, 11/4319, 13/2059, 14/2028, 15/1685, 16/2237, 17/22022, 18/6...16971, 20/9312
Nov 21 18:32:53 proxmox20 pmxcfs[9312]: [status] notice: queue not emtpy - resening 6 messages
Nov 21 18:32:53 proxmox20 pmxcfs[9312]: [status] notice: received sync request (epoch 1/23774/00000019)
Nov 21 18:32:53 proxmox20 pmxcfs[9312]: [status] notice: received sync request (epoch 1/23774/0000001A)
Nov 21 18:33:12 proxmox20 systemd[1]: pve-cluster.service start-post operation timed out. Stopping.
Nov 21 18:33:22 proxmox20 systemd[1]: pve-cluster.service stop-sigterm timed out. Killing.
Nov 21 18:33:22 proxmox20 systemd[1]: pve-cluster.service: main process exited, code=killed, status=9/KILL
 
No, for the cluster I use a separate 1 Gbit network. I found a way to use udpu, but it's not a good workaround.
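For anyone curious, switching corosync to unicast looks roughly like this in corosync.conf (udpu requires an explicit nodelist; the address below is an example):

Code:
totem {
  ...
  transport: udpu   # unicast UDP instead of multicast
}
nodelist {
  node {
    ring0_addr: 10.0.0.11
    nodeid: 1
  }
  # one node { } entry per cluster member
}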
 
Along with turning the NFS shares from hard to soft, I also changed my corosync.conf file, adding a longer token timeout, and increased the config_version:

Code:
totem {
  cluster_name: clustername
  config_version: 10
  ip_version: ipv4
  secauth: on
  version: 2
  token: 3000
  interface {
    bindnetaddr: 10.0.0.11
    ringnumber: 0
  }

}
(The default token value is 1000 ms.)
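To double-check that the running value actually changed, the runtime config can be queried (key path as I understand corosync's cmap layout):

Code:
# print the token timeout corosync is actually using
corosync-cmapctl -g runtime.config.totem.token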

All nodes detected the config change, and apparently I'm no longer spammed with these logs:

Code:
Nov 21 19:04:33 proxmox8 corosync[1140]: [TOTEM ] A new membership (10.0.0.11:1229456) was formed. Members
Nov 21 19:04:33 proxmox8 corosync[1140]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 19:04:33 proxmox8 corosync[1140]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 21 19:04:35 proxmox8 corosync[1140]: [TOTEM ] A new membership (10.0.0.11:1229460) was formed. Members
Nov 21 19:04:35 proxmox8 corosync[1140]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 19:04:35 proxmox8 corosync[1140]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 21 19:04:40 proxmox8 corosync[1140]: [TOTEM ] A new membership (10.0.0.11:1229464) was formed. Members
Nov 21 19:04:40 proxmox8 corosync[1140]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 19:04:40 proxmox8 corosync[1140]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 21 19:04:43 proxmox8 corosync[1140]: [TOTEM ] A new membership (10.0.0.11:1229468) was formed. Members
Nov 21 19:04:43 proxmox8 corosync[1140]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 19:04:43 proxmox8 corosync[1140]: [MAIN ] Completed service synchronization, ready to provide service.
 
Along with turning the NFS shares from hard to soft, I also changed my corosync.conf file, adding a longer token timeout, and increased the config_version: [...]

Where did you edit the corosync.conf?
 
tail /var/log/daemon.log should show you the new-config detection message.
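And in case it helps with the "where" question: on PVE 4.x the usual procedure, as I understand it, is to edit the cluster-synced copy rather than the local one:

Code:
# edit the cluster-wide copy (pmxcfs propagates it to all nodes);
# remember to bump config_version in the totem section
nano /etc/pve/corosync.conf
# then watch for the detection message
tail -f /var/log/daemon.log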

Best regards
 
Does that mean that if corosync loses the connection, a node will show as red and desynced after 3000 ms instead of 1000 ms?

btw:

Host16:
Code:
Nov 21 18:27:03 host16 corosync[1582]:  [TOTEM ] A processor failed, forming new configuration.
Nov 21 18:27:07 host16 corosync[1582]:  [TOTEM ] A new membership (172.16.0.16:948) was formed. Members left: 1 4
Nov 21 18:27:07 host16 corosync[1582]:  [TOTEM ] Failed to receive the leave message. failed: 1 4
Nov 21 18:27:07 host16 pmxcfs[1432]: [dcdb] notice: members: 2/1432
Nov 21 18:27:07 host16 pmxcfs[1432]: [status] notice: members: 2/1432
Nov 21 18:27:07 host16 corosync[1582]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 21 18:27:07 host16 corosync[1582]:  [QUORUM] Members[1]: 2
Nov 21 18:27:07 host16 corosync[1582]:  [MAIN  ] Completed service synchronization, ready to provide service.
Nov 21 18:27:07 host16 pmxcfs[1432]: [status] notice: node lost quorum
Nov 21 18:27:07 host16 pmxcfs[1432]: [dcdb] crit: received write while not quorate - trigger resync
Nov 21 18:27:07 host16 pmxcfs[1432]: [dcdb] crit: leaving CPG group
Nov 21 18:27:07 host16 pve-ha-lrm[1990]: unable to write lrm status file - unable to open file '/etc/pve/nodes/host16/lrm_status.tmp.1990' - Permission denied
Nov 21 18:27:08 host16 pmxcfs[1432]: [dcdb] notice: start cluster connection
Nov 21 18:27:08 host16 pmxcfs[1432]: [dcdb] notice: members: 2/1432
Nov 21 18:27:08 host16 pmxcfs[1432]: [dcdb] notice: all data is up to date

Host17:
Code:
Nov 21 18:27:03 host17 corosync[1699]:  [TOTEM ] A processor failed, forming new configuration.
Nov 21 18:27:05 host17 corosync[1699]:  [TOTEM ] A new membership (172.16.0.17:944) was formed. Members left: 2
Nov 21 18:27:05 host17 corosync[1699]:  [TOTEM ] Failed to receive the leave message. failed: 2
Nov 21 18:27:05 host17 pmxcfs[1660]: [dcdb] notice: members: 1/1660, 4/1994
Nov 21 18:27:05 host17 pmxcfs[1660]: [dcdb] notice: starting data syncronisation
Nov 21 18:27:05 host17 corosync[1699]:  [QUORUM] Members[2]: 1 4
Nov 21 18:27:05 host17 corosync[1699]:  [MAIN  ] Completed service synchronization, ready to provide service.

Can I see why one node left the cluster?
 
I have a similar problem: nodes sometimes go down or restart, with many corosync retransmits.

On one cluster all nodes went down at once on Saturday, and today another cluster went down when I restarted one of its nodes.

I have confirmed the corosync retransmit problem on four different installations running PVE 4.3.

Output of pveversion -v:

Code:
proxmox-ve: 4.3-66 (running kernel: 4.4.19-1-pve)
pve-manager: 4.3-3 (running version: 4.3-3/557191d3)
pve-kernel-4.4.16-1-pve: 4.4.16-64
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-46
qemu-server: 4.0-91
pve-firmware: 1.1-9
libpve-common-perl: 4.0-75
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-66
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-qemu-kvm: 2.6.2-2
pve-container: 1.0-78
pve-firewall: 2.0-31
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.5-1
lxcfs: 2.0.4-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
zfsutils: 0.6.5.7-pve10~bpo80
ceph: 0.94.9-1~bpo80+1
 
Quick follow-up since I started the thread:
- I changed the NFS mounts from hard to soft
- I changed the corosync token timeout from the default (1000 ms) to 3000 ms

No red nodes and no unavailable cluster GUI for the last 24 hours.

My best guess is that:
1. For some reason the NFS server may have been overloaded during backups, and since the NFS mounts were hard mounts, a hung mount would never time out.
2. The token increase has nothing to do with the initial issue, but since I added new hosts recently, maybe it's a common change to increase it as there is more and more to sync between more and more nodes (see the note below).
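On point 2: if I read the corosync(5) man page correctly, corosync 2.x already scales the effective token timeout with cluster size when a nodelist is configured, so bigger clusters get a longer timeout on their own:

Code:
# corosync(5): with a nodelist of 3+ nodes, the real token timeout becomes
#   token + (number_of_nodes - 2) * token_coefficient   (default 650 ms)
# e.g. token: 3000 with 9 nodes -> 3000 + 7 * 650 = 7550 ms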

Maybe a Proxmox guru can confirm my thoughts?

Regards
 
