Proxmox 4. Cluster nodes being red :(

# cat /etc/pve/.members
{
  "nodename": "node-01",
  "version": 4,
  "cluster": { "name": "cluster", "version": 9, "nodes": 6, "quorate": 1 },
  "nodelist": {
    "node-01": { "id": 1, "online": 1, "ip": "X.X.X.X"},
    "node-02": { "id": 2, "online": 0},
    "node-03": { "id": 3, "online": 0},
    "node-04": { "id": 4, "online": 0},
    "node-05": { "id": 5, "online": 0},
  }
}
pve-manager/4.1-13/cfb599fb (running kernel: 4.2.6-1-pve)

I tried "pvecm updatecerts --force" and "pvecm add master --force" but nothing happens; the nodes are still red. :(
 
pvecm add master --force
Please do not simply fall back to this command instantly when such a thing happens; it can be dangerous.

What does
Code:
journalctl -u corosync -u pve-cluster -b

output?

Please attach also the content of
Code:
cat /etc/pve/corosync.conf
 
Also, can you ping each host from the others (on the interface where corosync listens)?

I only see this problem:

ipcc_send_rec failed: Connection refused

Nothing from corosync?

systemctl restart pve-cluster
 
- journalctl -u corosync -u pve-cluster -b
[MAIN ] Completed service synchronization, ready to provide service.
Feb 17 16:34:50 cetamox-01 corosync[28546]: [TOTEM ] A new membership (192.168.3.34:33940) was formed. Members left: 2 4 5 6
Feb 17 16:34:50 cetamox-01 corosync[28546]: [TOTEM ] Failed to receive the leave message. failed: 2 4 5 6
Feb 17 16:34:50 cetamox-01 pmxcfs[26483]: [dcdb] notice: members: 1/26483
Feb 17 16:34:50 cetamox-01 pmxcfs[26483]: [status] notice: members: 1/26483
Feb 17 16:34:50 cetamox-01 corosync[28546]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Feb 17 16:34:50 cetamox-01 corosync[28546]: [QUORUM] Members[1]: 1
Feb 17 16:34:50 cetamox-01 corosync[28546]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 17 16:34:50 cetamox-01 pmxcfs[26483]: [status] notice: node lost quorum
Feb 17 16:34:50 cetamox-01 pmxcfs[26483]: [dcdb] crit: received write while not quorate - trigger resync

- corosync.conf:

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: cetamox-05
    nodeid: 5
    quorum_votes: 1
    ring0_addr: cetamox-05
  }

  node {
    name: cetamox-04
    nodeid: 4
    quorum_votes: 1
    ring0_addr: cetamox-04
  }
  # Same for all nodes.
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cluster
  config_version: 9
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 192.168.3.34
    mcastaddr: 239.192.22.140
    mcastport: 5405
    ringnumber: 0
  }
}

I have version 5 in .members and on another node I have version 7. :(

Sometimes the cluster is complete and then it falls apart again. :(
 
I managed to build a cluster with 5 nodes, but node 1 disconnected, and systemctl status pve-cluster -l shows this:

[dcdb] notice: members: 1/21448
Feb 17 18:44:08 cetamox-01 pmxcfs[21448]: [status] notice: members: 1/21448
Feb 17 18:44:08 cetamox-01 pmxcfs[21448]: [status] notice: node lost quorum
Feb 17 18:44:08 cetamox-01 pmxcfs[21448]: [dcdb] crit: received write while not quorate - trigger resync
Feb 17 18:44:08 cetamox-01 pmxcfs[21448]: [dcdb] crit: leaving CPG group
Feb 17 18:44:09 cetamox-01 pmxcfs[21448]: [dcdb] notice: start cluster connection
Feb 17 18:44:09 cetamox-01 pmxcfs[21448]: [dcdb] notice: members: 1/21448
Feb 17 18:44:09 cetamox-01 pmxcfs[21448]: [dcdb] notice: all data is up to date

and the complete journal shows this:

Feb 17 18:52:12 cetamox-01 corosync[22954]: [TOTEM ] A new membership (192.168.3.34:41416) was formed. Members left: 2 4 5 6
Feb 17 18:52:12 cetamox-01 corosync[22954]: [TOTEM ] Failed to receive the leave message. failed: 2 4 5 6
Feb 17 18:52:12 cetamox-01 pmxcfs[23712]: [dcdb] notice: members: 1/23712
Feb 17 18:52:12 cetamox-01 pmxcfs[23712]: [status] notice: members: 1/23712
Feb 17 18:52:12 cetamox-01 corosync[22954]: [QUORUM] This node is within the non-primary component and will NOT provide any service
Feb 17 18:52:12 cetamox-01 corosync[22954]: [QUORUM] Members[1]: 1
Feb 17 18:52:12 cetamox-01 corosync[22954]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 17 18:52:12 cetamox-01 pmxcfs[23712]: [status] notice: node lost quorum
Feb 17 18:52:12 cetamox-01 pmxcfs[23712]: [dcdb] crit: received write while not quorate - trigger resync
Feb 17 18:52:12 cetamox-01 pmxcfs[23712]: [dcdb] crit: leaving CPG group
Feb 17 18:52:12 cetamox-01 pmxcfs[23712]: [dcdb] notice: start cluster connection
Feb 17 18:52:12 cetamox-01 pmxcfs[23712]: [dcdb] notice: members: 1/23712
Feb 17 18:52:12 cetamox-01 pmxcfs[23712]: [dcdb] notice: all data is up to date


:(
 
I was watching the network interface and it shows an Interrupt. How can I solve this?

# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 90:b1:1c:af:43:c2
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:8889403 errors:0 dropped:2745 overruns:0 frame:0
TX packets:5438367 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3247826669 (3.0 GiB) TX bytes:2734407640 (2.5 GiB)
Interrupt:37 Memory:cd000000-cd7fffff
 
I have version 5 in .members and on another node I have version 7. :(

Probably a result of the forced add.
To solve this do the following on the node with the lowest config_version number:

Code:
systemctl stop pve-cluster
systemctl stop corosync

# copy the corosync.conf from the node with the highest config_version
# to the local config (while pve-cluster is stopped, /etc/pve is not mounted)
cp corosync.conf /etc/corosync/corosync.conf

systemctl start corosync
systemctl start pve-cluster
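
After the services come back, it's worth checking that the node actually picked up the new config; a minimal verification sketch (just standard commands, nothing cluster-specific assumed):
Code:
# the config_version should now match the other nodes
grep config_version /etc/pve/corosync.conf
# membership and quorum state
pvecm status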

Also check your switch config and make sure IGMP snooping is configured correctly so that multicast works properly. What hardware do you have?

Otherwise it looks like a bug with the interface (firmware/hardware?).
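
For reference, multicast between the nodes can be checked with omping (it comes up again later in this thread); a rough sketch, run on all nodes at roughly the same time — the cetamox-02/03 names are assumed from the naming pattern above, adjust to the real node list:
Code:
apt-get install omping
# ~10 minute test; every node reports unicast and multicast loss towards the others
omping -c 600 -i 1 -q cetamox-01 cetamox-02 cetamox-03 cetamox-04 cetamox-05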

RX bytes:3247826669 (3.0 GiB) TX bytes:2734407640 (2.5 GiB)
Interrupt:37 Memory:cd000000-cd7fffff

No worries here, this is normal! It only indicates which interrupt the card's IO is processed on; interrupts are generally nothing bad. See a breakdown of ifconfig here and search for Interrupt. :)

So before giving you advice, can you summarize the following (so that I know what the best approach should be):
Is only one node failing to join with corosync?
Is this a new installation or an upgrade from 3.4?
Is this setup in production or testing?
 
Hi Iampretch,

It is an upgrade. Thank you for the info about the network. In the end I identified the problem with the multicast connection, because when I force unicast, the cluster works perfectly.

Thanks for your help.
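
For anyone landing here later: forcing unicast on a PVE 4.x / corosync 2.x cluster is typically done by setting the transport in the totem section of corosync.conf; this is only a sketch (the config_version value is an example and has to be bumped above the currently active one when editing):
Code:
totem {
  cluster_name: cluster
  config_version: 10    # example value: must be higher than the current one
  ip_version: ipv4
  secauth: on
  version: 2
  transport: udpu       # UDP unicast instead of multicast
  interface {
    bindnetaddr: 192.168.3.34
    ringnumber: 0
  }
}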
 
Hello,
I have a similar issue. We have 2 PVE clusters, dev and production. The clusters are connected to the same switches, in different IP networks, but without VLANs.
Yesterday I upgraded the dev cluster from PVE 3.4 to 4.1. The procedure from the PVE wiki finished without problems, but after a few minutes the dev cluster lost quorum and I can't get it working again. Before the upgrade everything was working well. Worst of all, a few hours after the upgrade I realized that the production cluster had also lost quorum.
/var/log/cluster/corosync.log from the production cluster (v3.4) is in the attachment.


Could you explain why the dev cluster had an impact on production? How can I restore the production cluster's quorum?
Now the dev cluster is powered off and production is running without quorum (/etc/pve/ is read-only), but all VMs are running.

Best Regards
Mateusz
 

Attachments

  • corosync.log.txt
    682.1 KB
I'd guess you really messed up the corosync config? At least it looks a bit like that. Collisions shouldn't happen here; I run multiple 3.4 and 4.X clusters on the same network (virtual ones through bridges and real ones).

You checked:
Also check your switch config and make sure IGMP snooping is configured correctly so that multicast works properly
Re-check the switch config to be sure that it isn't the problem.

Did you recreate the cluster with the pvecm tool? With a unique cluster name?
Can you post the corosync config, ideally from both clusters if possible?

And on production you can restart the corosync and pve-cluster services on all nodes; this should restore quorum.
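
On the PVE 3.4 production cluster corosync is started through cman, so the restart would look roughly like this (a sketch assuming the stock 3.4 init scripts, done node by node):
Code:
service cman restart
service pve-cluster restart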

Could You explain me why dev cluster have impact on production?

Not really, yet. If the network is OK it could be a missed step somewhere, no offense.
But please post the corosync configs, maybe we'll find something strange there:

Code:
# for the 4.X dev cluster
cat /etc/pve/corosync.conf
# for the 3.4
cat /etc/pve/cluster.conf
cman_tool status
 
I'd guess you really messed up the corosync config? At least it looks a bit like that. Collisions shouldn't happen here; I run multiple 3.4 and 4.X clusters on the same network (virtual ones through bridges and real ones).

You checked:

Re-check the switch config to be sure that it isn't the problem.
Probably IGMP snooping is disabled. I'm looking at this now.

Did you recreate the cluster with the pvecm tool? With a unique cluster name?
Can you post the corosync config, ideally from both clusters if possible?
The dev cluster was recreated via pvecm. The clusters have unique names ('backup' and 'c01').

And on production you can restart the corosync and pve-cluster services on all nodes; this should restore quorum.
Should I restart them one by one, starting from the first node?

Not really, yet. If the network is OK it could be a missed step somewhere, no offense.
But please post the corosync configs, maybe we'll find something strange there:

Code:
# for the 4.X dev cluster
cat /etc/pve/corosync.conf
# for the 3.4
cat /etc/pve/cluster.conf
cman_tool status

/etc/pve/cluster.conf from 3.4 cluster

Code:
<?xml version="1.0"?>
<cluster name="c01" config_version="8">

  <cman keyfile="/var/lib/pve-cluster/corosync.authkey">
  </cman>

  <clusternodes>
  <clusternode name="kvm12" votes="1" nodeid="1"/>
  <clusternode name="kvm27" votes="1" nodeid="2"/>
  <clusternode name="kvm17" votes="1" nodeid="3"/>
  <clusternode name="kvm37" votes="1" nodeid="4"/>
  <clusternode name="kvm32" votes="1" nodeid="5"/>
  <clusternode name="kvm22" votes="1" nodeid="6"/>
  </clusternodes>

</cluster>

I will post /etc/pve/corosync.conf later because the servers are powered off.
 
Should I restart them one by one, starting from the first node?

As you currently have no quorum this should not really matter that much (it can only get better (hopefully)), but yes that would be good.

The config looks good (I don't like corosync 1/cman that much, you don't really see what's going on or how it's configured).

Code:
cman_tool status

The output would be nice too, also after restarting the services.
 
Hello,
Today I shut down all Proxmox servers, then started them one by one. Each server joins the cluster and works, but only for about 10 minutes, and then quorum is lost again. IGMP snooping is now enabled globally, and the dev cluster is switched off.
corosync.log is in the attachment.

Code:
pvecm status
Version: 6.2.0
Config Version: 8
Cluster Name: c01
Cluster Id: 541
Cluster Member: Yes
Cluster Generation: 5764
Membership state: Cluster-Member
Nodes: 1
Expected votes: 6
Total votes: 1
Node votes: 1
Quorum: 4 Activity blocked
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: kvm12
Node ID: 1
Multicast addresses: 239.192.2.31
Node addresses: 10.20.8.12

pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M   5548   2016-03-06 10:19:25  kvm12
   2   X   5668                        kvm27
   3   X   5668                        kvm17
   4   X   5668                        kvm37
   5   X   5668                        kvm32
   6   X   5668                        kvm22
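
As a side note (not suggested above): with only 1 of 6 votes the node stays below "Quorum: 4" and /etc/pve remains read-only. If write access is needed on a single node while the other nodes are really down, the expected votes can be lowered temporarily:
Code:
# tell the cluster that 1 vote is enough for quorum - use with care
pvecm expected 1
# on PVE 3.4 the underlying equivalent would be: cman_tool expected -e 1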
 

Attachments

  • after_restart_corosync.log.txt
    963.8 KB
Thank you for this info.
On Sunday I moved the dev cluster to VLAN 2000, but this did not help. After reading these links, I enabled the IGMP L2-general-querier and now there is quorum on each cluster.
Unfortunately omping is not working (before and after the L2 querier change); I suppose it should work.
Now, in the switch configuration I have these settings:

Code:
Switch on-off IGMP snooping: Open
IGMP Snooping VLAN Config: VID2000 - Open
IGMP Snooping Configuration:
  vlan 2000
    Immediate leave configuration - Disable
    L2-general-querier configuration - Enable
    Group number - 50
    Source table number - 40
IGMP snooping mrouter port configuration:
  vlan 2000
    VLAN ID 2000
    Mrouter port - nothing
    MRouter port alive time - 255
IGMP snooping query configuration:
  vlan 2000
    VLAN ID - 2000
    Query-Interval - 125
    Query-mrsp configuration - 10
    Query-robustness configuration - 2
    Suppression-query-time configuration - 255
Some of these values are the default settings on my switch (DCN DCRS-5750). Is there any mistake, or should I configure something else?
And another question: why was it working for 2 years and then suddenly stopped after upgrading one cluster to PVE 4.1?

Once more, thank you very, very much for the help.
Best Regards
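
Regarding the querier question above: if the switch querier keeps causing trouble, an IGMP querier can also be enabled on the Proxmox hosts themselves; a sketch assuming the cluster traffic runs over a Linux bridge named vmbr0 (the sysfs knob is a standard Linux bridge option):
Code:
# enable the IGMP querier on the bridge (runtime setting)
echo 1 > /sys/class/net/vmbr0/bridge/multicast_querier
# to make it persistent, add a post-up line for vmbr0 in /etc/network/interfaces:
#   post-up echo 1 > /sys/class/net/vmbr0/bridge/multicast_querier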
 
