Cluster going down randomly

bladux · Nov 21, 2016

Hi,

I'm having a strange issue on my Proxmox 4.3 cluster: from time to time all nodes appear red in the web GUI. The GUI is also sometimes not reachable at all and I have to restart nodes.

I found that the logs always show these lines, even when everything is marked green:

Nov 21 12:29:39 24 corosync[1108]: [TOTEM ] A new membership ( 10.0.0.11:1193828) was formed. Members
Nov 21 12:29:39 24 corosync[1108]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 12:29:39 24 corosync[1108]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 21 12:29:42 24 corosync[1108]: [TOTEM ] A new membership ( 10.0.0.11:1193832) was formed. Members
Nov 21 12:29:42 24 corosync[1108]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 12:29:42 24 corosync[1108]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 21 12:29:43 24 corosync[1108]: [TOTEM ] A new membership ( 10.0.0.11:1193836) was formed. Members
Nov 21 12:29:43 24 corosync[1108]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 12:29:43 24 corosync[1108]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 21 12:29:46 24 corosync[1108]: [TOTEM ] A new membership ( 10.0.0.11:1193840) was formed. Members
Nov 21 12:29:46 24 corosync[1108]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 12:29:46 24 corosync[1108]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 21 12:29:48 24 corosync[1108]: [TOTEM ] A new membership ( 10.0.0.11:1193844) was formed. Members
Nov 21 12:29:48 24 corosync[1108]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 12:29:48 24 corosync[1108]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 21 12:29:49 24 corosync[1108]: [TOTEM ] A new membership ( 10.0.0.11:1193848) was formed. Members
Nov 21 12:29:49 24 corosync[1108]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 12:29:49 24 corosync[1108]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 21 12:29:53 24 corosync[1108]: [TOTEM ] A new membership ( 10.0.0.11:1193852) was formed. Members
Nov 21 12:29:53 24 corosync[1108]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 12:29:53 24 corosync[1108]: [MAIN ] Completed service synchronization, ready to provide service.
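The excerpt above shows a new membership being formed every few seconds while the member list stays the same. A minimal sketch for quantifying that from a log file, assuming syslog-style lines like the ones in this post; the function name and the regular expression are illustrative and not part of any Proxmox or corosync tool:

```python
import re
from collections import Counter

# Matches corosync lines such as:
#   Nov 21 12:29:39 24 corosync[1108]: [TOTEM ] A new membership ( 10.0.0.11:1193828) was formed. Members
# Group 1 captures the timestamp truncated to the minute.
MEMBERSHIP_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s.*\[TOTEM\s*\] A new membership"
)


def memberships_per_minute(lines):
    """Count 'A new membership' events, grouped by minute."""
    counts = Counter()
    for line in lines:
        match = MEMBERSHIP_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines copied from the log excerpt in this post.
sample = [
    "Nov 21 12:29:39 24 corosync[1108]: [TOTEM ] A new membership ( 10.0.0.11:1193828) was formed. Members",
    "Nov 21 12:29:39 24 corosync[1108]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9",
    "Nov 21 12:29:42 24 corosync[1108]: [TOTEM ] A new membership ( 10.0.0.11:1193832) was formed. Members",
    "Nov 21 12:29:43 24 corosync[1108]: [TOTEM ] A new membership ( 10.0.0.11:1193836) was formed. Members",
]

for minute, count in sorted(memberships_per_minute(sample).items()):
    print(minute, count)
```

To run it against a real log, pass an open file instead of the sample list, for example `memberships_per_minute(open("/var/log/daemon.log"))`. A healthy cluster would be expected to show few or no such events per minute.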

Reading the forum led me to verify multicast, which seems to work nicely (omping works on all nodes using the multicast IP, with no loss).

All hosts are in /etc/hosts

I do see multicast traffic from tcpdump -n "multicast" | grep IP:

12:40:40.281906 IP 10.0.0.19.5404 > 239.192.109.205.5405: UDP, length 1448
12:40:40.281919 IP 10.0.0.19.5404 > 239.192.109.205.5405: UDP, length 824
12:40:40.282617 IP 10.0.0.11.5404 > 239.192.109.205.5405: UDP, length 1448
12:40:40.282621 IP 10.0.0.11.5404 > 239.192.109.205.5405: UDP, length 824
12:40:40.282977 IP 10.0.0.12.5404 > 239.192.109.205.5405: UDP, length 1448
12:40:40.282982 IP 10.0.0.12.5404 > 239.192.109.205.5405: UDP, length 824
12:40:40.283358 IP 10.0.0.13.5404 > 239.192.109.205.5405: UDP, length 1448
12:40:40.283370 IP 10.0.0.13.5404 > 239.192.109.205.5405: UDP, length 824
12:40:40.283816 IP 10.0.0.14.5404 > 239.192.109.205.5405: UDP, length 1448
12:40:40.283829 IP 10.0.0.14.5404 > 239.192.109.205.5405: UDP, length 824
12:40:40.284111 IP 10.0.0.15.5404 > 239.192.109.205.5405: UDP, length 1448
12:40:40.284124 IP 10.0.0.15.5404 > 239.192.109.205.5405: UDP, length 824
12:40:40.284406 IP 10.0.0.16.5404 > 239.192.109.205.5405: UDP, length 1448
12:40:40.284419 IP 10.0.0.16.5404 > 239.192.109.205.5405: UDP, length 824
12:40:40.284799 IP 10.0.0.17.5404 > 239.192.109.205.5405: UDP, length 1448
12:40:40.284812 IP 10.0.0.17.5404 > 239.192.109.205.5405: UDP, length 824
12:40:40.285107 IP 10.0.0.18.5404 > 239.192.109.205.5405: UDP, length 1448
12:40:40.285112 IP 10.0.0.18.5404 > 239.192.109.205.5405: UDP, length 824
12:40:40.287696 IP 10.0.0.18.5404 > 239.192.109.205.5405: UDP, length 296
12:40:40.287784 IP 10.0.0.19.5404 > 239.192.109.205.5405: UDP, length 296
12:40:40.288338 IP 10.0.0.11.5404 > 239.192.109.205.5405: UDP, length 296
12:40:40.288666 IP 10.0.0.12.5404 > 239.192.109.205.5405: UDP, length 296
12:40:40.289039 IP 10.0.0.13.5404 > 239.192.109.205.5405: UDP, length 296
12:40:40.289419 IP 10.0.0.14.5404 > 239.192.109.205.5405: UDP, length 296
12:40:40.289695 IP 10.0.0.15.5404 > 239.192.109.205.5405: UDP, length 296
12:40:40.289990 IP 10.0.0.16.5404 > 239.192.109.205.5405: UDP, length 296
12:40:40.290350 IP 10.0.0.17.5404 > 239.192.109.205.5405: UDP, length 296
12:40:40.381531 IP 10.0.0.17.5404 > 239.192.109.205.5405: UDP, length 88
12:40:40.383675 IP 10.0.0.17.5404 > 239.192.109.205.5405: UDP, length 1176

pvecm status does not show any error:

root@proxmox9:~# pvecm status
Quorum information
------------------
Date: Mon Nov 21 12:43:16 2016
Quorum provider: corosync_votequorum
Nodes: 9
Node ID: 0x00000009
Ring ID: 1/1195216
Quorate: Yes

Votequorum information
----------------------
Expected votes: 9
Highest expected: 9
Total votes: 9
Quorum: 5
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.0.0.11
0x00000004 1 10.0.0.12
0x00000003 1 10.0.0.13
0x00000002 1 10.0.0.14
0x00000005 1 10.0.0.15
0x00000006 1 10.0.0.16
0x00000007 1 10.0.0.17
0x00000008 1 10.0.0.18
0x00000009 1 10.0.0.19 (local)

Any help would be greatly appreciated.

aale · Nov 21, 2016

I have the same problem.

The only thing that helps is restarting corosync and pmxcfs manually, and that only helps for a few hours.

The pmxcfs logs in debug mode look like this:

Nov 21 17:53:14 proxmox1 pmxcfs[23774]: [status] debug: dfsm mode is 1 (dfsm.c:658:dfsm_cpg_deliver_callback)
Nov 21 17:53:14 proxmox1 pmxcfs[23774]: [status] debug: queue message 1425285 (subtype = 1, length = 437) (dfsm.c:700:dfsm_cpg_deliver_callback)
Nov 21 17:53:14 proxmox1 pmxcfs[23774]: [status] debug: dfsm mode is 1 (dfsm.c:658:dfsm_cpg_deliver_callback)

Multicast is also working fine between nodes. pvecm status is good.

Any ideas/workarounds on this?

yakakliker · Nov 21, 2016

When the cluster is "down", do the Proxmox services restart properly?

( service pve-cluster restart, service pvedaemon restart, service pvestatd restart and service pveproxy restart ? )

bladux · Nov 21, 2016

Cluster "down" ranges from only the primary node appearing green with all others red, to no GUI being available at all.

In both cases, virtual hosts are up.

Proxmox services do not restart properly in the vast majority of cases, so I'm not even trying to restart them now; I simply reboot.

I'm now suspecting the NFS servers I use for backups, so I changed NFS mounts from hard to soft mounts in case the NFS mount hangs... Will keep you posted.

aale · Nov 21, 2016

yakakliker said:
When the cluster is "down", do the Proxmox services restart properly?

( service pve-cluster restart, service pvedaemon restart, service pvestatd restart and service pveproxy restart ? )

service pve-cluster restart does not work because:

Nov 21 18:32:53 proxmox20 pmxcfs[9312]: [status] notice: queue not emtpy - resening 2 messages
Nov 21 18:32:53 proxmox20 pmxcfs[9312]: [status] notice: members: 1/23774, 2/10972, 3/17015, 4/2722, 8/2824, 9/31520, 10/15083, 11/4319, 13/2059, 14/2028, 15/1685, 16/2237, 17/22022, 18/6...16971, 20/9312
Nov 21 18:32:53 proxmox20 pmxcfs[9312]: [status] notice: queue not emtpy - resening 6 messages
Nov 21 18:32:53 proxmox20 pmxcfs[9312]: [status] notice: received sync request (epoch 1/23774/00000019)
Nov 21 18:32:53 proxmox20 pmxcfs[9312]: [status] notice: received sync request (epoch 1/23774/0000001A)
Nov 21 18:33:12 proxmox20 systemd[1]: pve-cluster.service start-post operation timed out. Stopping.
Nov 21 18:33:22 proxmox20 systemd[1]: pve-cluster.service stop-sigterm timed out. Killing.
Nov 21 18:33:22 proxmox20 systemd[1]: pve-cluster.service: main process exited, code=killed, status=9/KILL

bladux · Nov 21, 2016

aale, do you also have NFS shares mounted ?

aale · Nov 21, 2016

No. For the cluster I use a separate 1 Gbit network. I found a way to use udpu, but it's not a good workaround.

TechLineX · Nov 21, 2016

I have the same issues:
https://forum.proxmox.com/threads/cluster-running-but-node-shows-offline.29956/

I use zfs-shares between the hosts. Currently I have to reboot the whole red host. Is there a smarter option to resync the cluster?

Regards

bladux · Nov 21, 2016

Along with switching the NFS shares from hard to soft mounts, I also edited my corosync.conf file: I added a longer token timeout and incremented config_version:

Code:

totem {
  cluster_name: clustername
  config_version: 10
  ip_version: ipv4
  secauth: on
  version: 2
  token: 3000
  interface {
    bindnetaddr: 10.0.0.11
    ringnumber: 0
  }

}

(Default token value is 1000ms)

All nodes detected the config changes and apparently I'm not spammed with these logs

Nov 21 19:04:33 proxmox8 corosync[1140]: [TOTEM ] A new membership (10.0.0.11:1229456) was formed. Members
Nov 21 19:04:33 proxmox8 corosync[1140]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 19:04:33 proxmox8 corosync[1140]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 21 19:04:35 proxmox8 corosync[1140]: [TOTEM ] A new membership (10.0.0.11:1229460) was formed. Members
Nov 21 19:04:35 proxmox8 corosync[1140]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 19:04:35 proxmox8 corosync[1140]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 21 19:04:40 proxmox8 corosync[1140]: [TOTEM ] A new membership (10.0.0.11:1229464) was formed. Members
Nov 21 19:04:40 proxmox8 corosync[1140]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 19:04:40 proxmox8 corosync[1140]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 21 19:04:43 proxmox8 corosync[1140]: [TOTEM ] A new membership (10.0.0.11:1229468) was formed. Members
Nov 21 19:04:43 proxmox8 corosync[1140]: [QUORUM] Members[9]: 1 4 3 2 5 6 7 8 9
Nov 21 19:04:43 proxmox8 corosync[1140]: [MAIN ] Completed service synchronization, ready to provide service.

aale · Nov 21, 2016

TechLineX said:
I have the same issues:
https://forum.proxmox.com/threads/cluster-running-but-node-shows-offline.29956/

I use zfs-shares between the hosts. Currently I have to reboot the whole red host. Is there a smarter option to resync the cluster?

Regards

Stop pve-cluster and corosync on all nodes, then start corosync and pve-cluster on one node, and then on the others.

TechLineX · Nov 21, 2016

bladux said:
Along with switching the NFS shares from hard to soft mounts, I also edited my corosync.conf file: I added a longer token timeout and incremented config_version:

Code:

totem {
  cluster_name: clustername
  config_version: 10
  ip_version: ipv4
  secauth: on
  version: 2
  token: 3000
  interface {
    bindnetaddr: 10.0.0.11
    ringnumber: 0
  }
}

(Default token value is 1000ms)

All nodes detected the config changes and apparently I'm not spammed with these logs

Where did you edit the corosync.conf?

bladux · Nov 21, 2016

In /etc/pve/

TechLineX · Nov 21, 2016

Where can I check whether the new config has been applied on all nodes?

bladux · Nov 21, 2016

tail /var/log/daemon.log should give you the new config detection warning.
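Besides reading the log, the config_version value itself can be compared on each node. A minimal sketch, assuming the cluster-wide file lives at /etc/pve/corosync.conf (as stated earlier in the thread) and that each node keeps a local copy at /etc/corosync/corosync.conf; the second path and the helper name are assumptions, so adjust them to your setup:

```python
import re
from pathlib import Path

# Matches a line such as "  config_version: 10" inside the totem section.
CONFIG_VERSION_RE = re.compile(r"^\s*config_version:\s*(\d+)\s*$", re.MULTILINE)


def config_version(text):
    """Return the config_version found in corosync.conf text, or None."""
    match = CONFIG_VERSION_RE.search(text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Both paths are assumptions about a typical node; run this on every node.
    for path in ("/etc/pve/corosync.conf", "/etc/corosync/corosync.conf"):
        candidate = Path(path)
        if candidate.exists():
            print(path, config_version(candidate.read_text()))
        else:
            print(path, "not found")
```

If every node prints the same number for both files, the change has most likely propagated; a node showing an older number would be the one to look at.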

Best regards

TechLineX · Nov 21, 2016

Got it. What does the new token: 3000 mean?

bladux · Nov 21, 2016

3000ms instead of 1000ms.
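For context, token is the time in milliseconds corosync waits for the totem token before declaring it lost. As far as I recall from the corosync.conf(5) man page, when a nodelist with at least three nodes is configured, the effective timeout is token + (number_of_nodes - 2) * token_coefficient, with token_coefficient defaulting to 650 ms. Treat that formula and the default as assumptions to verify against the man page of your installed corosync version. A small worked sketch for the 9-node cluster in this thread:

```python
def effective_token_timeout_ms(token_ms, nodes, token_coefficient_ms=650):
    """Estimate the effective totem token timeout in milliseconds.

    Assumes the formula recalled from corosync.conf(5):
    token + (nodes - 2) * token_coefficient, applied only when a
    nodelist with at least three nodes is configured.
    """
    if nodes >= 3:
        return token_ms + (nodes - 2) * token_coefficient_ms
    return token_ms


# 9 nodes, default token of 1000 ms: 1000 + 7 * 650
print(effective_token_timeout_ms(1000, 9))  # 5550
# 9 nodes, token raised to 3000 ms: 3000 + 7 * 650
print(effective_token_timeout_ms(3000, 9))  # 7550
```

If the formula applies to your version, raising token from 1000 to 3000 moves the estimated timeout on nine nodes from about 5.5 s to about 7.5 s, not from 1 s to 3 s.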

TechLineX · Nov 21, 2016

Does that mean that if corosync loses its connection, the node will be shown red and desynced after 3000 ms instead of 1000 ms?

btw:

Host16:

Code:

Nov 21 18:27:03 host16 corosync[1582]:  [TOTEM ] A processor failed, forming new configuration.
Nov 21 18:27:07 host16 corosync[1582]:  [TOTEM ] A new membership (172.16.0.16:948) was formed. Members left: 1 4
Nov 21 18:27:07 host16 corosync[1582]:  [TOTEM ] Failed to receive the leave message. failed: 1 4
Nov 21 18:27:07 host16 pmxcfs[1432]: [dcdb] notice: members: 2/1432
Nov 21 18:27:07 host16 pmxcfs[1432]: [status] notice: members: 2/1432
Nov 21 18:27:07 host16 corosync[1582]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 21 18:27:07 host16 corosync[1582]:  [QUORUM] Members[1]: 2
Nov 21 18:27:07 host16 corosync[1582]:  [MAIN  ] Completed service synchronization, ready to provide service.
Nov 21 18:27:07 host16 pmxcfs[1432]: [status] notice: node lost quorum
Nov 21 18:27:07 host16 pmxcfs[1432]: [dcdb] crit: received write while not quorate - trigger resync
Nov 21 18:27:07 host16 pmxcfs[1432]: [dcdb] crit: leaving CPG group
Nov 21 18:27:07 host16 pve-ha-lrm[1990]: unable to write lrm status file - unable to open file '/etc/pve/nodes/host16/lrm_status.tmp.1990' - Permission denied
Nov 21 18:27:08 host16 pmxcfs[1432]: [dcdb] notice: start cluster connection
Nov 21 18:27:08 host16 pmxcfs[1432]: [dcdb] notice: members: 2/1432
Nov 21 18:27:08 host16 pmxcfs[1432]: [dcdb] notice: all data is up to date

Host17:

Code:

Nov 21 18:27:03 host17 corosync[1699]:  [TOTEM ] A processor failed, forming new configuration.
Nov 21 18:27:05 host17 corosync[1699]:  [TOTEM ] A new membership (172.16.0.17:944) was formed. Members left: 2
Nov 21 18:27:05 host17 corosync[1699]:  [TOTEM ] Failed to receive the leave message. failed: 2
Nov 21 18:27:05 host17 pmxcfs[1660]: [dcdb] notice: members: 1/1660, 4/1994
Nov 21 18:27:05 host17 pmxcfs[1660]: [dcdb] notice: starting data syncronisation
Nov 21 18:27:05 host17 corosync[1699]:  [QUORUM] Members[2]: 1 4
Nov 21 18:27:05 host17 corosync[1699]:  [MAIN  ] Completed service synchronization, ready to provide service.

Can I see why one node left the cluster?
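The logs above record which node IDs left and when, but not the underlying cause; the "A processor failed, forming new configuration" line generally indicates that the token was not received in time. A minimal sketch for pulling the "Members left" events out of such logs so they can be lined up across hosts, assuming the line format shown in this post (the function name and regex are illustrative):

```python
import re

# Matches lines such as:
#   Nov 21 18:27:07 host16 corosync[1582]:  [TOTEM ] A new membership (172.16.0.16:948) was formed. Members left: 1 4
LEFT_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+corosync\[\d+\]:\s+"
    r"\[TOTEM\s*\] A new membership \(.*?\) was formed\. Members left:\s*([\d ]+)"
)


def members_left(lines):
    """Return (timestamp, reporting host, [node IDs that left]) per event."""
    events = []
    for line in lines:
        match = LEFT_RE.match(line)
        if match:
            timestamp, host, ids = match.groups()
            events.append((timestamp, host, [int(i) for i in ids.split()]))
    return events


# Sample lines copied from the Host16 and Host17 excerpts in this post.
sample = [
    "Nov 21 18:27:03 host16 corosync[1582]:  [TOTEM ] A processor failed, forming new configuration.",
    "Nov 21 18:27:07 host16 corosync[1582]:  [TOTEM ] A new membership (172.16.0.16:948) was formed. Members left: 1 4",
    "Nov 21 18:27:05 host17 corosync[1699]:  [TOTEM ] A new membership (172.16.0.17:944) was formed. Members left: 2",
]

for event in members_left(sample):
    print(event)
```

In the sample, host16 sees nodes 1 and 4 leave while host17 sees node 2 leave, which is consistent with the cluster splitting into two partitions rather than a single node crashing.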

xcdr · Nov 21, 2016

I have a similar problem: nodes sometimes go down or restart, with many corosync retransmits.

On the first cluster, all nodes went down at once on Saturday, and today another cluster went down when I restarted one of the nodes.

I have confirmed the corosync retransmit problem on four different installations running PVE 4.3.

pveversion:

proxmox-ve: 4.3-66 (running kernel: 4.4.19-1-pve)
pve-manager: 4.3-3 (running version: 4.3-3/557191d3)
pve-kernel-4.4.16-1-pve: 4.4.16-64
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-46
qemu-server: 4.0-91
pve-firmware: 1.1-9
libpve-common-perl: 4.0-75
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-66
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-qemu-kvm: 2.6.2-2
pve-container: 1.0-78
pve-firewall: 2.0-31
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.5-1
lxcfs: 2.0.4-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
zfsutils: 0.6.5.7-pve10~bpo80
ceph: 0.94.9-1~bpo80+1

TechLineX · Nov 22, 2016

You could test

# killall -9 corosync
# systemctl restart pve-cluster

to resync the cluster.

bladux · Nov 22, 2016

Quick follow-up since I started the thread:
- I changed the NFS mounts from hard to soft.
- I changed the corosync token timeout from the default (1000) to 3000.

No red nodes and no unavailable cluster GUI for the last 24 hours.

My best guess would be that:
1 - for some reason the NFS server may have been overloaded during backups, and since the NFS mounts were mounted as hard, they would never time out if anything went wrong.
2 - the token increase has nothing to do with the initial issue, but as I added new hosts recently, maybe increasing it is a common change, since there are more and more things to sync between more and more nodes.

Maybe a Proxmox guru can confirm my thoughts?

Regards
