Proxmox 4. Cluster nodes being red :(

# cat /etc/pve/.members
{
  "nodename": "node-01",
  "version": 4,
  "cluster": { "name": "cluster", "version": 9, "nodes": 6, "quorate": 1 },
  "nodelist": {
    "node-01": { "id": 1, "online": 1, "ip": "X.X.X.X"},
    "node-02": { "id": 2, "online": 0},
    "node-03": { "id": 3, "online": 0},
    "node-04": { "id": 4, "online": 0},
    "node-05": { "id": 5, "online": 0},
  }
}
pve-manager/4.1-13/cfb599fb (running kernel: 4.2.6-1-pve)

I tried "pvecm updatecerts --force" and "pvecm add master --force" but nothing happens; the nodes are still red. :(
 
pvecm add master --force
Please do not simply fall back to this command instantly when such a thing happens; it can be dangerous.

What does
Code:
journalctl -u corosync -u pve-cluster -b

output?

Please attach also the content of
Code:
cat /etc/pve/corosync.conf
 
Also, can you ping each host from the others (on the interface where corosync listens)?

I only see this problem:

ipcc_send_rec failed: Connection refused

Nothing from corosync?

systemctl restart pve-cluster
 
- journalctl -u corosync -u pve-cluster -b
[MAIN ] Completed service synchronization, ready to provide service.
Feb 17 16:34:50 cetamox-01 corosync[28546]: [TOTEM ] A new membership (192.168.3.34:33940) was formed. Members left: 2 4 5 6
Feb 17 16:34:50 cetamox-01 corosync[28546]: [TOTEM ] Failed to receive the leave message. failed: 2 4 5 6
Feb 17 16:34:50 cetamox-01 pmxcfs[26483]: [dcdb] notice: members: 1/26483
Feb 17 16:34:50 cetamox-01 pmxcfs[26483]: [status] notice: members: 1/26483
Feb 17 16:34:50 cetamox-01 corosync[28546]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Feb 17 16:34:50 cetamox-01 corosync[28546]: [QUORUM] Members[1]: 1
Feb 17 16:34:50 cetamox-01 corosync[28546]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 17 16:34:50 cetamox-01 pmxcfs[26483]: [status] notice: node lost quorum
Feb 17 16:34:50 cetamox-01 pmxcfs[26483]: [dcdb] crit: received write while not quorate - trigger resync

- corosync.conf:

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: cetamox-05
    nodeid: 5
    quorum_votes: 1
    ring0_addr: cetamox-05
  }

  node {
    name: cetamox-04
    nodeid: 4
    quorum_votes: 1
    ring0_addr: cetamox-04
  }
  # Same for all nodes.
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cluster
  config_version: 9
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 192.168.3.34
    mcastaddr: 239.192.22.140
    mcastport: 5405
    ringnumber: 0
  }
}

I have version 5 in .members and on another node I have version 7. :(

Sometimes the cluster is complete and then it falls apart again. :(
 
I managed to build a cluster with 5 nodes, but node 1 disconnected, and systemctl status pve-cluster -l shows this:

[dcdb] notice: members: 1/21448
Feb 17 18:44:08 cetamox-01 pmxcfs[21448]: [status] notice: members: 1/21448
Feb 17 18:44:08 cetamox-01 pmxcfs[21448]: [status] notice: node lost quorum
Feb 17 18:44:08 cetamox-01 pmxcfs[21448]: [dcdb] crit: received write while not quorate - trigger resync
Feb 17 18:44:08 cetamox-01 pmxcfs[21448]: [dcdb] crit: leaving CPG group
Feb 17 18:44:09 cetamox-01 pmxcfs[21448]: [dcdb] notice: start cluster connection
Feb 17 18:44:09 cetamox-01 pmxcfs[21448]: [dcdb] notice: members: 1/21448
Feb 17 18:44:09 cetamox-01 pmxcfs[21448]: [dcdb] notice: all data is up to date

and the complete journal shows this:

Feb 17 18:52:12 cetamox-01 corosync[22954]: [TOTEM ] A new membership (192.168.3.34:41416) was formed. Members left: 2 4 5 6
Feb 17 18:52:12 cetamox-01 corosync[22954]: [TOTEM ] Failed to receive the leave message. failed: 2 4 5 6
Feb 17 18:52:12 cetamox-01 pmxcfs[23712]: [dcdb] notice: members: 1/23712
Feb 17 18:52:12 cetamox-01 pmxcfs[23712]: [status] notice: members: 1/23712
Feb 17 18:52:12 cetamox-01 corosync[22954]: [QUORUM] This node is within the non-primary component and will NOT provide any service
Feb 17 18:52:12 cetamox-01 corosync[22954]: [QUORUM] Members[1]: 1
Feb 17 18:52:12 cetamox-01 corosync[22954]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 17 18:52:12 cetamox-01 pmxcfs[23712]: [status] notice: node lost quorum
Feb 17 18:52:12 cetamox-01 pmxcfs[23712]: [dcdb] crit: received write while not quorate - trigger resync
Feb 17 18:52:12 cetamox-01 pmxcfs[23712]: [dcdb] crit: leaving CPG group
Feb 17 18:52:12 cetamox-01 pmxcfs[23712]: [dcdb] notice: start cluster connection
Feb 17 18:52:12 cetamox-01 pmxcfs[23712]: [dcdb] notice: members: 1/23712
Feb 17 18:52:12 cetamox-01 pmxcfs[23712]: [dcdb] notice: all data is up to date


:(
 
I was watching the network interface and it shows an Interrupt. How can I solve this?

# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 90:b1:1c:af:43:c2
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:8889403 errors:0 dropped:2745 overruns:0 frame:0
TX packets:5438367 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3247826669 (3.0 GiB) TX bytes:2734407640 (2.5 GiB)
Interrupt:37 Memory:cd000000-cd7fffff
 
I have version 5 in .members and on another node I have version 7. :(

Probably a result of the forced add.
To solve this do the following on the node with the lowest config_version number:

Code:
systemctl stop pve-cluster
systemctl stop corosync

# copy the corosync.conf from the node with the highest config_version
# to the local config (while pve-cluster is stopped, /etc/pve is not mounted)
cp corosync.conf /etc/corosync/corosync.conf

systemctl start corosync
systemctl start pve-cluster
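
After the services come back, it's worth checking that the node actually picked up the new config; a minimal verification sketch (just standard commands, nothing cluster-specific assumed):
Code:
# the config_version should now match the other nodes
grep config_version /etc/pve/corosync.conf
# membership and quorum state
pvecm status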

Also check your switch config and make sure IGMP snooping is configured correctly so that multicast works properly. What hardware do you have?

Otherwise it looks like a bug with the interface (firmware/hardware?).
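
For reference, multicast between the nodes can be checked with omping (it comes up again later in this thread); a rough sketch, run on all nodes at roughly the same time — the cetamox-02/03 names are assumed from the naming pattern above, adjust to the real node list:
Code:
apt-get install omping
# ~10 minute test; every node reports unicast and multicast loss towards the others
omping -c 600 -i 1 -q cetamox-01 cetamox-02 cetamox-03 cetamox-04 cetamox-05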

RX bytes:3247826669 (3.0 GiB) TX bytes:2734407640 (2.5 GiB)
Interrupt:37 Memory:cd000000-cd7fffff

No worries here, this is normal! It only indicates which interrupt the card's IO is processed on; interrupts are generally nothing bad. See a breakdown of ifconfig here and search for Interrupt. :)

So before giving you advice, can you summarize the following (so that I know what the best approach should be):
Is only one node failing to join with corosync?
Is this a new installation or an upgrade from 3.4?
Is this setup in production or testing?
 
Hi Iampretch,

It is an upgrade. Thank you for the info about the network. In the end I identified the problem with the multicast connection, because when I force unicast, the cluster works perfectly.

Thanks for your help.
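
For anyone landing here later: forcing unicast on a PVE 4.x / corosync 2.x cluster is typically done by setting the transport in the totem section of corosync.conf; this is only a sketch (the config_version value is an example and has to be bumped above the currently active one when editing):
Code:
totem {
  cluster_name: cluster
  config_version: 10    # example value: must be higher than the current one
  ip_version: ipv4
  secauth: on
  version: 2
  transport: udpu       # UDP unicast instead of multicast
  interface {
    bindnetaddr: 192.168.3.34
    ringnumber: 0
  }
}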
 
Hello,
I have a similar issue. We have 2 PVE clusters, dev and production. The clusters are connected to the same switches, in different IP networks, but without VLANs.
Yesterday I upgraded the dev cluster from PVE 3.4 to 4.1. The procedure from the PVE wiki finished without problems, but after a few minutes the dev cluster lost quorum and I can't get it working again. Before the upgrade everything was working well. Worst of all, a few hours after the upgrade I realized that the production cluster had also lost quorum.
/var/log/cluster/corosync.log from the production cluster (v3.4) is in the attachment.


Could you explain why the dev cluster had an impact on production? How can I restore the production cluster's quorum?
Now the dev cluster is powered off and production is running without quorum (/etc/pve/ is read-only), but all VMs are running.

Best Regards
Mateusz
 

Attachments

  • corosync.log.txt
    682.1 KB
I'd guess you really messed up the corosync config? At least it looks a bit like that. Collisions shouldn't happen here; I run multiple 3.4 and 4.X clusters on the same network (virtual ones through bridges and real ones).

You checked:
Also check your switch config and make sure IGMP snooping is configured correctly so that multicast works properly
Re-check the switch config to be sure that it isn't the problem.

Did you recreate the cluster with the pvecm tool? With a unique cluster name?
Can you post the corosync config, ideally from both clusters if possible?

And on production you can restart the corosync and pve-cluster services on all nodes; this should restore quorum.
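
On the PVE 3.4 production cluster corosync is started through cman, so the restart would look roughly like this (a sketch assuming the stock 3.4 init scripts, done node by node):
Code:
service cman restart
service pve-cluster restart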

Could You explain me why dev cluster have impact on production?

Not really, yet. If the network is OK it could be a missed step somewhere, no offense.
But please post the corosync configs, maybe we'll find something strange there:

Code:
# for the 4.X dev cluster
cat /etc/pve/corosync.conf
# for the 3.4
cat /etc/pve/cluster.conf
cman_tool status
 
I'd guess you really messed up the corosync config? At least it looks a bit like that. Collisions shouldn't happen here; I run multiple 3.4 and 4.X clusters on the same network (virtual ones through bridges and real ones).

You checked:

Re-check the switch config to be sure that it isn't the problem.
Probably IGMP snooping is disabled. I'm looking at this now.

Did you recreate the cluster with the pvecm tool? With a unique cluster name?
Can you post the corosync config, ideally from both clusters if possible?
The dev cluster was recreated via pvecm. The clusters have unique names ('backup' and 'c01').

And on production you can restart the corosync and pve-cluster services on all nodes; this should restore quorum.
Should I restart them one by one, starting from the first node?

Not really, yet. If the network is OK it could be a missed step somewhere, no offense.
But please post the corosync configs, maybe we'll find something strange there:

Code:
# for the 4.X dev cluster
cat /etc/pve/corosync.conf
# for the 3.4
cat /etc/pve/cluster.conf
cman_tool status

/etc/pve/cluster.conf from 3.4 cluster

Code:
<?xml version="1.0"?>
<cluster name="c01" config_version="8">

  <cman keyfile="/var/lib/pve-cluster/corosync.authkey">
  </cman>

  <clusternodes>
  <clusternode name="kvm12" votes="1" nodeid="1"/>
  <clusternode name="kvm27" votes="1" nodeid="2"/>
  <clusternode name="kvm17" votes="1" nodeid="3"/>
  <clusternode name="kvm37" votes="1" nodeid="4"/>
  <clusternode name="kvm32" votes="1" nodeid="5"/>
  <clusternode name="kvm22" votes="1" nodeid="6"/>
  </clusternodes>

</cluster>

I will post /etc/pve/corosync.conf later because the servers are powered off.
 
Should I restart them one by one, starting from the first node?

As you currently have no quorum this should not really matter that much (it can only get better (hopefully)), but yes that would be good.

The config looks good (I don't like corosync 1/cman that much, you don't really see what's going on or how it's configured).

Code:
cman_tool status

The output would be nice too, also after restarting the services.
 
Hello,
Today I shut down all Proxmox servers, then started them one by one. Each server joins the cluster and works, but only for about 10 minutes, and then quorum is lost again. IGMP snooping is now enabled globally, and the dev cluster is switched off.
corosync.log is in the attachment.

Code:
pvecm status
Version: 6.2.0
Config Version: 8
Cluster Name: c01
Cluster Id: 541
Cluster Member: Yes
Cluster Generation: 5764
Membership state: Cluster-Member
Nodes: 1
Expected votes: 6
Total votes: 1
Node votes: 1
Quorum: 4 Activity blocked
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: kvm12
Node ID: 1
Multicast addresses: 239.192.2.31
Node addresses: 10.20.8.12

pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M   5548   2016-03-06 10:19:25  kvm12
   2   X   5668                        kvm27
   3   X   5668                        kvm17
   4   X   5668                        kvm37
   5   X   5668                        kvm32
   6   X   5668                        kvm22
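
As a side note (not suggested above): with only 1 of 6 votes the node stays below "Quorum: 4" and /etc/pve remains read-only. If write access is needed on a single node while the other nodes are really down, the expected votes can be lowered temporarily:
Code:
# tell the cluster that 1 vote is enough for quorum - use with care
pvecm expected 1
# on PVE 3.4 the underlying equivalent would be: cman_tool expected -e 1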
 

Attachments

  • after_restart_corosync.log.txt
    963.8 KB
Thank you for this info.
On Sunday I moved the dev cluster to VLAN 2000, but this did not help. After reading these links, I enabled the IGMP L2-general-querier and now there is quorum on each cluster.
Unfortunately omping is not working (before and after the L2 querier change); I suppose it should work.
Now, in the switch configuration I have these settings:

Code:
Switch on-off IGMP snooping: Open
IGMP Snooping VLAN Config: VID2000 - Open
IGMP Snooping Configuration:
  vlan 2000
    Immediate leave configuration - Disable
    L2-general-querier configuration - Enable
    Group number - 50
    Source table number - 40
IGMP snooping mrouter port configuration:
  vlan 2000
    VLAN ID 2000
    Mrouter port - nothing
    MRouter port alive time - 255
IGMP snooping query configuration:
  vlan 2000
    VLAN ID - 2000
    Query-Interval - 125
    Query-mrsp configuration - 10
    Query-robustness configuration - 2
    Suppression-query-time configuration - 255
Some of these values are the default settings on my switch (DCN DCRS-5750). Is there any mistake, or should I configure something else?
And another question: why was it working for 2 years and then suddenly stopped after upgrading one cluster to PVE 4.1?

Once more, thank you very, very much for the help.
Best Regards
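
Regarding the querier question above: if the switch querier keeps causing trouble, an IGMP querier can also be enabled on the Proxmox hosts themselves; a sketch assuming the cluster traffic runs over a Linux bridge named vmbr0 (the sysfs knob is a standard Linux bridge option):
Code:
# enable the IGMP querier on the bridge (runtime setting)
echo 1 > /sys/class/net/vmbr0/bridge/multicast_querier
# to make it persistent, add a post-up line for vmbr0 in /etc/network/interfaces:
#   post-up echo 1 > /sys/class/net/vmbr0/bridge/multicast_querier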
 
