Can't keep cluster together after removing node

zBrain

I have a cluster of 4 machines that is temporarily one short because I need to test the RAM on one of them.

After shutting that node down and removing it from the cluster following the wiki instructions, the cluster splits.

Node 1 is by itself (no quorum; expected votes 3, quorum 2, total votes 1)
Node 2 was removed (presumably successfully)
Nodes 3 & 4 are in a cluster together and have quorum (expected votes 3, quorum 2, total votes 2)

If I reboot any node, I get a complete cluster for a few minutes, then lose it again.

Ironically, I need quorum long enough to move data off node 1 so I can upgrade it, and I'm concerned the version difference is why the cluster is splitting. Yet everything worked fine before the node was removed.

Any suggestions on where to look?

pveversion:
Node 1: pve-manager/3.3-1/a06c9f73 (running kernel: 2.6.32-32-pve)
Node 2: pve-manager/3.4-11/6502936f (running kernel: 2.6.32-41-pve)
Node 3: pve-manager/3.4-11/6502936f (running kernel: 2.6.32-41-pve)
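
For reference, the removal procedure I followed from the wiki was roughly this, run from one of the remaining nodes with the node to remove already powered off (the node name below is just an example):

Code:
# check current membership first
pvecm nodes
# remove the powered-off node from the cluster (example node name)
pvecm delnode nodename
# verify membership and quorum afterwards
pvecm status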
 
Node 2 must not be started again - did you take care of that?

What does syslog report about that period?

Does multicast work fine?
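
You can verify it e.g. with omping, started on all nodes at roughly the same time - the addresses below are only an example and must list all of your cluster nodes:

Code:
# run this simultaneously on every node, with the same node list on each
omping 10.0.0.102 10.0.0.103 10.0.0.105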
 

To your first question: yes, I switched it off at the PDU at the appropriate time to make sure it stays off.
Multicast is working, verified by omping just now.

Syslog from node 1 after a reboot (the first line is repeated many, many times):

Code:
Oct 10 15:54:27 t1000 corosync[3293]:   [TOTEM ] Retransmit List: 67b 
Oct 10 15:54:27 t1000 corosync[3293]:   [TOTEM ] FAILED TO RECEIVE
Oct 10 15:54:39 t1000 corosync[3293]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct 10 15:54:39 t1000 corosync[3293]:   [CLM   ] New Configuration:
Oct 10 15:54:39 t1000 corosync[3293]:   [CLM   ] #011r(0) ip(10.0.0.103) 
Oct 10 15:54:39 t1000 corosync[3293]:   [CLM   ] Members Left:
Oct 10 15:54:39 t1000 corosync[3293]:   [CLM   ] #011r(0) ip(10.0.0.102) 
Oct 10 15:54:39 t1000 corosync[3293]:   [CLM   ] #011r(0) ip(10.0.0.105) 
Oct 10 15:54:39 t1000 corosync[3293]:   [CLM   ] Members Joined:
Oct 10 15:54:39 t1000 corosync[3293]:   [QUORUM] Members[2]: 1 4
Oct 10 15:54:39 t1000 corosync[3293]:   [CMAN  ] quorum lost, blocking activity
Oct 10 15:54:39 t1000 corosync[3293]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 10 15:54:39 t1000 pmxcfs[3136]: [status] notice: node lost quorum
Oct 10 15:54:39 t1000 corosync[3293]:   [QUORUM] Members[1]: 1
Oct 10 15:54:39 t1000 corosync[3293]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct 10 15:54:39 t1000 corosync[3293]:   [CLM   ] New Configuration:
Oct 10 15:54:39 t1000 corosync[3293]:   [CLM   ] #011r(0) ip(10.0.0.103) 
Oct 10 15:54:39 t1000 corosync[3293]:   [CLM   ] Members Left:
Oct 10 15:54:39 t1000 corosync[3293]:   [CLM   ] Members Joined:
Oct 10 15:54:39 t1000 corosync[3293]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 10 15:54:39 t1000 kernel: dlm: closing connection to node 2
Oct 10 15:54:39 t1000 kernel: dlm: closing connection to node 4
Oct 10 15:54:39 t1000 corosync[3293]:   [CPG   ] chosen downlist: sender r(0) ip(10.0.0.103) ; members(old:3 left:2)
Oct 10 15:54:39 t1000 pmxcfs[3136]: [dcdb] notice: members: 1/3136, 4/3066
Oct 10 15:54:39 t1000 pmxcfs[3136]: [dcdb] notice: starting data syncronisation
Oct 10 15:54:39 t1000 corosync[3293]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 10 15:54:39 t1000 pmxcfs[3136]: [dcdb] notice: cpg_send_message retried 1 times
Oct 10 15:54:39 t1000 pmxcfs[3136]: [dcdb] notice: members: 1/3136
Oct 10 15:54:39 t1000 pmxcfs[3136]: [status] notice: all data is up to date
Oct 10 15:54:39 t1000 pmxcfs[3136]: [dcdb] notice: members: 1/3136, 4/3066
Oct 10 15:54:39 t1000 pmxcfs[3136]: [dcdb] notice: starting data syncronisation
Oct 10 15:54:39 t1000 pmxcfs[3136]: [dcdb] notice: members: 1/3136
Oct 10 15:54:39 t1000 pmxcfs[3136]: [status] notice: all data is up to date

On another node, I see:

Code:
Oct 10 13:56:44 t70a corosync[127992]:   [CLM   ] New Configuration:
Oct 10 13:56:44 t70a corosync[127992]:   [CLM   ] #011r(0) ip(10.0.0.102) 
Oct 10 13:56:44 t70a corosync[127992]:   [CLM   ] #011r(0) ip(10.0.0.103) 
Oct 10 13:56:44 t70a corosync[127992]:   [CLM   ] #011r(0) ip(10.0.0.105) 
Oct 10 13:56:44 t70a corosync[127992]:   [CLM   ] Members Left:
Oct 10 13:56:44 t70a corosync[127992]:   [CLM   ] Members Joined:
Oct 10 13:56:44 t70a corosync[127992]:   [CLM   ] #011r(0) ip(10.0.0.103) 
Oct 10 13:56:44 t70a corosync[127992]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 10 13:56:44 t70a corosync[127992]:   [CMAN  ] quorum regained, resuming activity
Oct 10 13:56:44 t70a corosync[127992]:   [QUORUM] This node is within the primary component and will provide service.
Oct 10 13:56:44 t70a corosync[127992]:   [QUORUM] Members[3]: 1 2 4
Oct 10 13:56:44 t70a corosync[127992]:   [QUORUM] Members[3]: 1 2 4
Oct 10 13:56:44 t70a pmxcfs[128196]: [status] notice: node has quorum
Oct 10 13:56:44 t70a corosync[127992]:   [QUORUM] Members[3]: 1 2 4
Oct 10 13:56:44 t70a corosync[127992]:   [CPG   ] chosen downlist: sender r(0) ip(10.0.0.102) ; members(old:2 left:0)
Oct 10 13:56:44 t70a corosync[127992]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 10 13:56:45 t70a iscsid: conn 0 login rejected: initiator error - target not found (02/03)
Oct 10 13:56:45 t70a fenced[128313]: fenced 1364188437 started
Oct 10 13:56:45 t70a dlm_controld[128326]: dlm_controld 1364188437 started
Oct 10 13:56:47 t70a kernel: connection1:0: detected conn error (1020)
Oct 10 13:56:48 t70a pmxcfs[128196]: [dcdb] notice: members: 1/79504, 2/128196, 4/85061
Oct 10 13:56:48 t70a pmxcfs[128196]: [dcdb] notice: starting data syncronisation
Oct 10 13:56:48 t70a pmxcfs[128196]: [dcdb] notice: members: 1/79504, 2/128196, 4/85061
Oct 10 13:56:48 t70a pmxcfs[128196]: [dcdb] notice: starting data syncronisation
Oct 10 13:56:48 t70a pmxcfs[128196]: [dcdb] notice: received sync request (epoch 1/79504/00000001)
Oct 10 13:56:48 t70a pmxcfs[128196]: [dcdb] notice: received sync request (epoch 1/79504/00000001)
Oct 10 13:56:48 t70a pmxcfs[128196]: [dcdb] notice: received all states
Oct 10 13:56:48 t70a pmxcfs[128196]: [dcdb] notice: leader is 1/79504
Oct 10 13:56:48 t70a pmxcfs[128196]: [dcdb] notice: synced members: 1/79504, 2/128196, 4/85061
Oct 10 13:56:48 t70a pmxcfs[128196]: [dcdb] notice: all data is up to date
Oct 10 13:56:48 t70a pmxcfs[128196]: [dcdb] notice: received all states
Oct 10 13:56:48 t70a pmxcfs[128196]: [dcdb] notice: all data is up to date

Then further down I see totem retransmitting and the node is lost.

It looks from the logs as though the clocks were out of sync, but I just checked and they aren't.
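
For reference, this is roughly how I compared the clocks, assuming root SSH works between the nodes (the IPs are just ours):

Code:
# quick clock comparison across the cluster nodes
for h in 10.0.0.102 10.0.0.103 10.0.0.105; do
    echo -n "$h: "; ssh root@$h 'date +%s'
done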

Also, on the side with 2 nodes I see this repeated every ~15s:

Code:
Oct 12 06:25:15 t70a corosync[127992]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct 12 06:25:15 t70a corosync[127992]:   [CLM   ] New Configuration:
Oct 12 06:25:15 t70a corosync[127992]:   [CLM   ] #011r(0) ip(10.0.0.102) 
Oct 12 06:25:15 t70a corosync[127992]:   [CLM   ] #011r(0) ip(10.0.0.105) 
Oct 12 06:25:15 t70a corosync[127992]:   [CLM   ] Members Left:
Oct 12 06:25:15 t70a corosync[127992]:   [CLM   ] Members Joined:
Oct 12 06:25:15 t70a corosync[127992]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct 12 06:25:15 t70a corosync[127992]:   [CLM   ] New Configuration:
Oct 12 06:25:15 t70a corosync[127992]:   [CLM   ] #011r(0) ip(10.0.0.102) 
Oct 12 06:25:15 t70a corosync[127992]:   [CLM   ] #011r(0) ip(10.0.0.105) 
Oct 12 06:25:15 t70a corosync[127992]:   [CLM   ] Members Left:
Oct 12 06:25:15 t70a corosync[127992]:   [CLM   ] Members Joined:
Oct 12 06:25:15 t70a corosync[127992]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 12 06:25:15 t70a corosync[127992]:   [CPG   ] chosen downlist: sender r(0) ip(10.0.0.102) ; members(old:2 left:0)
Oct 12 06:25:15 t70a corosync[127992]:   [MAIN  ] Completed service synchronization, ready to provide service.
 

Node 1's IP multicast communication with the other nodes is not working properly.

Check your network (switches) and the NICs on node 1.

Is the content of /etc/pve/cluster.conf correct?

What does

Code:
pvecm status

report on each node?


Since you wrote "multicast is working": check directly on the NICs with tcpdump whether the messages to the multicast address (which you can obtain from the command above) are transported correctly.
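
For example, something along these lines - the interface name is only an assumption (the default PVE bridge), and the address must be the multicast address pvecm status reports on your cluster:

Code:
# watch corosync multicast traffic arriving on node 1 (adjust interface and address)
tcpdump -n -i vmbr0 host 239.192.160.149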
 

This was run from Node 1:

Code:
# omping 10.0.0.105 10.0.0.102 10.0.0.103
10.0.0.105 : waiting for response msg
10.0.0.102 : waiting for response msg
10.0.0.105 : joined (S,G) = (*, 232.43.211.234), pinging
10.0.0.102 : joined (S,G) = (*, 232.43.211.234), pinging
10.0.0.105 :   unicast, seq=1, size=69 bytes, dist=0, time=0.130ms
10.0.0.105 : multicast, seq=1, size=69 bytes, dist=0, time=0.144ms
10.0.0.102 :   unicast, seq=1, size=69 bytes, dist=0, time=0.172ms
10.0.0.102 : multicast, seq=1, size=69 bytes, dist=0, time=0.182ms
10.0.0.102 :   unicast, seq=2, size=69 bytes, dist=0, time=0.141ms
10.0.0.102 : multicast, seq=2, size=69 bytes, dist=0, time=0.162ms
10.0.0.105 :   unicast, seq=2, size=69 bytes, dist=0, time=0.264ms
10.0.0.105 : multicast, seq=2, size=69 bytes, dist=0, time=0.269ms
10.0.0.105 :   unicast, seq=3, size=69 bytes, dist=0, time=0.277ms
10.0.0.105 : multicast, seq=3, size=69 bytes, dist=0, time=0.290ms
10.0.0.102 :   unicast, seq=3, size=69 bytes, dist=0, time=0.274ms
10.0.0.102 : multicast, seq=3, size=69 bytes, dist=0, time=0.278ms
10.0.0.105 :   unicast, seq=4, size=69 bytes, dist=0, time=0.243ms
10.0.0.105 : multicast, seq=4, size=69 bytes, dist=0, time=0.290ms
10.0.0.102 :   unicast, seq=4, size=69 bytes, dist=0, time=0.229ms
10.0.0.102 : multicast, seq=4, size=69 bytes, dist=0, time=0.256ms
10.0.0.102 :   unicast, seq=5, size=69 bytes, dist=0, time=0.177ms
10.0.0.105 : multicast, seq=5, size=69 bytes, dist=0, time=0.257ms
10.0.0.105 :   unicast, seq=5, size=69 bytes, dist=0, time=0.251ms
10.0.0.102 : multicast, seq=5, size=69 bytes, dist=0, time=0.224ms
10.0.0.105 :   unicast, seq=6, size=69 bytes, dist=0, time=0.244ms
10.0.0.105 : multicast, seq=6, size=69 bytes, dist=0, time=0.287ms
10.0.0.102 :   unicast, seq=6, size=69 bytes, dist=0, time=0.234ms
10.0.0.102 : multicast, seq=6, size=69 bytes, dist=0, time=0.248ms
10.0.0.102 :   unicast, seq=7, size=69 bytes, dist=0, time=0.172ms
10.0.0.105 :   unicast, seq=7, size=69 bytes, dist=0, time=0.234ms
10.0.0.102 : multicast, seq=7, size=69 bytes, dist=0, time=0.187ms
10.0.0.105 : multicast, seq=7, size=69 bytes, dist=0, time=0.238ms
10.0.0.102 :   unicast, seq=8, size=69 bytes, dist=0, time=0.127ms
10.0.0.102 : multicast, seq=8, size=69 bytes, dist=0, time=0.144ms
10.0.0.105 :   unicast, seq=8, size=69 bytes, dist=0, time=0.278ms
10.0.0.105 : multicast, seq=8, size=69 bytes, dist=0, time=0.286ms
10.0.0.105 :   unicast, seq=9, size=69 bytes, dist=0, time=0.229ms
10.0.0.105 : multicast, seq=9, size=69 bytes, dist=0, time=0.249ms
10.0.0.102 :   unicast, seq=9, size=69 bytes, dist=0, time=0.283ms
10.0.0.102 : multicast, seq=9, size=69 bytes, dist=0, time=0.340ms
10.0.0.105 :   unicast, seq=10, size=69 bytes, dist=0, time=0.231ms
10.0.0.102 :   unicast, seq=10, size=69 bytes, dist=0, time=0.220ms
10.0.0.105 : multicast, seq=10, size=69 bytes, dist=0, time=0.249ms
10.0.0.102 : multicast, seq=10, size=69 bytes, dist=0, time=0.224ms
10.0.0.105 :   unicast, seq=11, size=69 bytes, dist=0, time=0.310ms
10.0.0.105 : multicast, seq=11, size=69 bytes, dist=0, time=0.343ms
10.0.0.102 :   unicast, seq=11, size=69 bytes, dist=0, time=0.283ms
10.0.0.102 : multicast, seq=11, size=69 bytes, dist=0, time=0.291ms
10.0.0.105 :   unicast, seq=12, size=69 bytes, dist=0, time=0.250ms
10.0.0.105 : multicast, seq=12, size=69 bytes, dist=0, time=0.295ms
10.0.0.102 :   unicast, seq=12, size=69 bytes, dist=0, time=0.252ms
10.0.0.102 : multicast, seq=12, size=69 bytes, dist=0, time=0.318ms
10.0.0.105 :   unicast, seq=13, size=69 bytes, dist=0, time=0.232ms
10.0.0.105 : multicast, seq=13, size=69 bytes, dist=0, time=0.264ms
10.0.0.102 :   unicast, seq=13, size=69 bytes, dist=0, time=0.227ms
10.0.0.102 : multicast, seq=13, size=69 bytes, dist=0, time=0.236ms
10.0.0.105 :   unicast, seq=14, size=69 bytes, dist=0, time=0.228ms
10.0.0.105 : multicast, seq=14, size=69 bytes, dist=0, time=0.247ms
10.0.0.102 :   unicast, seq=14, size=69 bytes, dist=0, time=0.238ms
10.0.0.102 : multicast, seq=14, size=69 bytes, dist=0, time=0.241ms
10.0.0.105 :   unicast, seq=15, size=69 bytes, dist=0, time=0.242ms
10.0.0.105 : multicast, seq=15, size=69 bytes, dist=0, time=0.279ms
10.0.0.102 :   unicast, seq=15, size=69 bytes, dist=0, time=0.230ms
10.0.0.102 : multicast, seq=15, size=69 bytes, dist=0, time=0.239ms
^C
10.0.0.105 :   unicast, xmt/rcv/%loss = 15/15/0%, min/avg/max/std-dev = 0.130/0.243/0.310/0.039
10.0.0.105 : multicast, xmt/rcv/%loss = 15/15/0%, min/avg/max/std-dev = 0.144/0.266/0.343/0.043
10.0.0.102 :   unicast, xmt/rcv/%loss = 15/15/0%, min/avg/max/std-dev = 0.127/0.217/0.283/0.049
10.0.0.102 : multicast, xmt/rcv/%loss = 15/15/0%, min/avg/max/std-dev = 0.144/0.238/0.340/0.055

pvecm status...
Node 1:
Code:
 pvecm status
Version: 6.2.0
Config Version: 3
Cluster Name: DH
Cluster Id: 41204
Cluster Member: Yes
Cluster Generation: 187788
Membership state: Cluster-Member
Nodes: 1
Expected votes: 3
Total votes: 1
Node votes: 1
Quorum: 2 Activity blocked
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: dh1
Node ID: 3
Multicast addresses: 239.192.160.149 
Node addresses: 10.0.0.103

Node 2:

Code:
pvecm status
Version: 6.2.0
Config Version: 3
Cluster Name: DH
Cluster Id: 41204
Cluster Member: Yes
Cluster Generation: 186216
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2  
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: dh2
Node ID: 1
Multicast addresses: 239.192.160.149 
Node addresses: 10.0.0.102

Node 3:

Code:
pvecm status
Version: 6.2.0
Config Version: 3
Cluster Name: DH
Cluster Id: 41204
Cluster Member: Yes
Cluster Generation: 186856
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2  
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: dh3
Node ID: 2
Multicast addresses: 239.192.160.149 
Node addresses: 10.0.0.105

From node 2 after rebooting node 1:

Code:
Nodes: 3
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 2

pvecm nodes is as expected: all 3 show online for a while (2-3 minutes), then node 1 splits off.

And finally, from Node 1: tcpdump, watching for packets from Node 2.

Code:
16:51:22.375034 IP 10.0.0.102 > 239.192.160.149: ip-proto-17
16:51:22.375038 IP 10.0.0.102.5404 > 239.192.160.149.5405: UDP, length 1473
16:51:22.375042 IP 10.0.0.102 > 239.192.160.149: ip-proto-17
16:51:22.375044 IP 10.0.0.102.5404 > 239.192.160.149.5405: UDP, length 1473
16:51:22.375049 IP 10.0.0.102 > 239.192.160.149: ip-proto-17
16:51:22.375069 IP 10.0.0.102.5404 > 239.192.160.149.5405: UDP, length 1473
16:51:22.375075 IP 10.0.0.102 > 239.192.160.149: ip-proto-17
16:51:22.375120 IP 10.0.0.102.5404 > 239.192.160.149.5405: UDP, length 1473
16:51:22.375131 IP 10.0.0.102 > 239.192.160.149: ip-proto-17
16:51:22.375136 IP 10.0.0.102.5404 > 239.192.160.149.5405: UDP, length 1473
16:51:22.375141 IP 10.0.0.102 > 239.192.160.149: ip-proto-17
16:51:22.375143 IP 10.0.0.102.5404 > 239.192.160.149.5405: UDP, length 1473
16:51:22.375147 IP 10.0.0.102 > 239.192.160.149: ip-proto-17
16:51:22.375168 IP 10.0.0.102.5404 > 239.192.160.149.5405: UDP, length 1473
16:51:22.375244 IP 10.0.0.102 > 239.192.160.149: ip-proto-17
16:51:22.375292 IP 10.0.0.102.5404 > 239.192.160.149.5405: UDP, length 1473
16:51:22.375300 IP 10.0.0.102 > 239.192.160.149: ip-proto-17
16:51:22.375303 IP 10.0.0.102.5404 > 239.192.160.149.5405: UDP, length 1473
16:51:22.375307 IP 10.0.0.102 > 239.192.160.149: ip-proto-17
16:51:22.375309 IP 10.0.0.102.5404 > 239.192.160.149.5405: UDP, length 1473
16:51:22.375313 IP 10.0.0.102 > 239.192.160.149: ip-proto-17
16:51:22.375340 IP 10.0.0.102.5404 > 239.192.160.149.5405: UDP, length 1473
16:51:22.375346 IP 10.0.0.102 > 239.192.160.149: ip-proto-17
16:51:22.375389 IP 10.0.0.102.5404 > 239.192.160.149.5405: UDP, length 1473
16:51:22.375395 IP 10.0.0.102.5404 > 10.0.0.103.5405: UDP, length 107
16:51:22.378174 IP 10.0.0.102 > 239.192.160.149: ip-proto-17
16:51:22.378222 IP 10.0.0.102 > 239.192.160.149: ip-proto-17
16:51:22.378227 IP 10.0.0.102.5404 > 239.192.160.149.5405: UDP, length 1473
16:51:22.378238 IP 10.0.0.102 > 239.192.160.149: ip-proto-17
16:51:22.378246 IP 10.0.0.102.5404 > 239.192.160.149.5405: UDP, length 1473
16:51:22.378254 IP 10.0.0.102.5404 > 239.192.160.149.5405: UDP, length 1473
16:51:22.378260 IP 10.0.0.102 > 239.192.160.149: ip-proto-17
16:51:22.378263 IP 10.0.0.102 > 239.192.160.149: ip-proto-17
16:51:22.378266 IP 10.0.0.102.5404 > 239.192.160.149.5405: UDP, length 1473
16:51:22.378273 IP 10.0.0.102 > 239.192.160.149: ip-proto-17
16:51:22.378276 IP 10.0.0.102.5404 > 239.192.160.149.5405: UDP, length 1473
 
This just keeps getting better.

Code:
kernel: Out of memory: OOM killed process 597850 (corosync) score 0 vm:53331892kB, rss:53129772kB, swap:0kB

Can anyone explain why corosync needs 53 GB of RAM?
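
In the meantime I'm keeping an eye on its memory growth with a rough check like this:

Code:
# log corosync memory usage (RSS/VSZ in kB) once a minute
while true; do
    date
    ps -o pid,rss,vsz,cmd -C corosync
    sleep 60
done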
 
Going back and looking at the graphs suggests corosync is buggy in this Proxmox version. We removed that node from the cluster on Saturday and it has been a nightmare ever since.

[Attachments: screen26.png, screen25.png - corosync memory usage graphs]

Those are from the two nodes that were still clustered together. Corosync clearly has a memory leak that is triggered when a node is removed (which we did on Saturday). The Monday drop-off was a reboot. When corosync runs out of memory it crashes, and the cluster splits in some nasty way. I now have a cluster that is split three ways. Can someone help? This is getting ridiculous.
 
