Nodes going red

RobFantini

Famous Member
May 24, 2012
Boston,Mass
We have an issue with nodes going red on the PVE web page for at least a week.
We have a 3-node cluster, all software up to date. Corosync uses a separate network.

From the PVE web pages: every morning at least two of the nodes show the other nodes as red; usually one of the nodes shows all green.

From the CLI, pvecm status shows all OK on the 3 nodes.

The red issue can be fixed by running: /etc/init.d/pve-cluster restart

The network can get busy overnight with pve backups and other rsync cronjobs.
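
Before restarting I now do a quick sanity check on the daemon behind /etc/pve (only a minimal sketch; it assumes the standard systemd unit names on PVE 4.x):
Code:
# pmxcfs is the daemon that provides /etc/pve
systemctl status pve-cluster
# anything it or corosync logged since the overnight backup window
journalctl -u pve-cluster -u corosync --since yesterday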

We have the red node issue now.

here is more information:
Code:
dell1  /var/log # cat /etc/pve/.members
{
"nodename": "dell1",
"version": 94,
"cluster": { "name": "cluster-v4", "version": 13, "nodes": 3, "quorate": 1 },
"nodelist": {
  "sys3": { "id": 1, "online": 1, "ip": "10.1.10.42"},
  "dell1": { "id": 3, "online": 1, "ip": "10.1.10.181"},
  "sys5": { "id": 4, "online": 1, "ip": "10.1.10.19"}
  }
}

Code:
dell1  ~ # pvecm status
Quorum information
------------------
Date:             Sun Nov 15 07:51:39 2015
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000003
Ring ID:          11448
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 10.2.8.19
0x00000001          1 10.2.8.42
0x00000003          1 10.2.8.181 (local)
Code:
dell1  ~ # pveversion -v
proxmox-ve: 4.0-21 (running kernel: 4.2.3-2-pve)
pve-manager: 4.0-57 (running version: 4.0-57/cc7c2b53)
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.2.3-1-pve: 4.2.3-18
pve-kernel-4.2.3-2-pve: 4.2.3-21
lvm2: 2.02.116-pve1
corosync-pve: 2.3.5-1
libqb0: 0.17.2-1
pve-cluster: 4.0-24
qemu-server: 4.0-35
pve-firmware: 1.1-7
libpve-common-perl: 4.0-36
libpve-access-control: 4.0-9
libpve-storage-perl: 4.0-29
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-12
pve-container: 1.0-21
pve-firewall: 2.0-13
pve-ha-manager: 1.0-13
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.4-3
lxcfs: 0.10-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve6~jessie


Multicast tests have been done and seem to be OK
Code:
dell1  /etc # omping -c 10000 -i 0.001 -F -q  sys3-corosync sys5-corosync dell1-corosync
sys3-corosync : waiting for response msg
sys5-corosync : waiting for response msg
sys5-corosync : joined (S,G) = (*, 232.43.211.234), pinging
sys3-corosync : joined (S,G) = (*, 232.43.211.234), pinging
sys3-corosync : given amount of query messages was sent
sys5-corosync : given amount of query messages was sent

sys3-corosync :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.073/0.101/0.282/0.020
sys3-corosync : multicast, xmt/rcv/%loss = 10000/9993/0% (seq>=8 0%), min/avg/max/std-dev = 0.069/0.107/0.291/0.021
sys5-corosync :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.060/0.099/3.637/0.073
sys5-corosync : multicast, xmt/rcv/%loss = 10000/9993/0% (seq>=8 0%), min/avg/max/std-dev = 0.059/0.107/3.645/0.073

dell1  /etc # omping -c 600 -i 1 -q  sys3-corosync sys5-corosync dell1-corosync
sys3-corosync : waiting for response msg
sys5-corosync : waiting for response msg
sys3-corosync : waiting for response msg
sys5-corosync : waiting for response msg
sys5-corosync : joined (S,G) = (*, 232.43.211.234), pinging
sys3-corosync : joined (S,G) = (*, 232.43.211.234), pinging
sys3-corosync : given amount of query messages was sent
sys5-corosync : given amount of query messages was sent

sys3-corosync :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.108/0.251/0.382/0.035
sys3-corosync : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.112/0.253/0.779/0.041
sys5-corosync :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.125/0.216/1.754/0.071
sys5-corosync : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.116/0.210/1.762/0.067

As long as our VMs keep working, I'll leave the nodes red in order to supply more information.

I've been checking syslog on each node and cannot figure out what is causing the issue.
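
For reference, this is roughly what I grep for (just a sketch; the message patterns are my guesses, not known-good strings):
Code:
# look for corosync membership changes, retransmits and pmxcfs complaints from overnight
grep -Ei 'corosync|pmxcfs|quorum|retransmit|totem' /var/log/syslog | tail -n 200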

Any suggestions on how to get this fixed?

best regards, Rob Fantini
 
never had that before...

but to me that sounds like a quorum issue.
One of the nodes fell out of the cluster for a short while (probably the one showing all green) and then is not rejoining for some reason.


What do the following commands say (on each Node):

pvecm nodes
pvecm status

That should let you figure out which node is the one with issues.

What type of network do you have? As in, what is the interconnect speed between the nodes and the backups?
 

Code:
sys3  ~ # pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         4          1 sys5-corosync
         1          1 10.2.8.42 (local)
         3          1 dell1-corosync
sys3  ~ # 
sys3  ~ # pvecm status
Quorum information
------------------
Date:             Sun Nov 15 08:31:19 2015
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          11448
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 10.2.8.19
0x00000001          1 10.2.8.42 (local)
0x00000003          1 10.2.8.181

#-----------------------------------------------------------
dell1  /var/log # pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         4          1 sys5-corosync
         1          1 10.2.8.42
         3          1 dell1-corosync (local)
dell1  /var/log # 
dell1  /var/log # pvecm status
Quorum information
------------------
Date:             Sun Nov 15 08:31:02 2015
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000003
Ring ID:          11448
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 10.2.8.19
0x00000001          1 10.2.8.42
0x00000003          1 10.2.8.181 (local)

#--------------------------------------------------------------
sys5  ~ # pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         4          1 sys5-corosync (local)
         1          1 10.2.8.42
         3          1 dell1-corosync
sys5  ~ # 
sys5  ~ # pvecm status

Quorum information
------------------
Date:             Sun Nov 15 08:31:25 2015
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000004
Ring ID:          11448
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 10.2.8.19 (local)
0x00000001          1 10.2.8.42
0x00000003          1 10.2.8.181
 
How can all 3 nodes "think" they have quorum, but still mark the nodes red?

That's some super weirdness.
My guess is that pve-cluster hangs; not sure why yet.

What type of "network" do you have? As in, the interconnect speed between nodes and backup space - same switch? Different MTUs?
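
If pve-cluster (pmxcfs) is hanging, something like this on the red nodes might show it (a sketch; it assumes the standard PVE 4.x service names):
Code:
# a hung pmxcfs often makes any access to /etc/pve stall
time ls /etc/pve
# the two daemons involved in the GUI node status
systemctl status pve-cluster pvestatd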
 

Network: we have a Cisco SG300 switch.
The corosync network uses its own VLAN; nothing else is on that VLAN, and each node has a dedicated NIC for corosync.
The web pages are accessed using a different VLAN. That VLAN does get more traffic, as our NFS server is on the same one.

My plan is to move NFS to a different VLAN, then use another NIC on each node to connect to NFS.
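
Roughly what I have in mind for the extra NIC (just a sketch of an /etc/network/interfaces stanza; the interface name and addresses are made up):
Code:
# hypothetical dedicated NFS NIC on its own VLAN / subnet
auto eth3
iface eth3 inet static
    address 10.3.9.181
    netmask 255.255.255.0
# the NFS exports would then be mounted over this subnet instead of the web-facing VLAN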
 
So currently you have:

1 NIC - clients - assume 1G link
1 NIC - corosync (tagged) + NFS (tagged) - 1G link


Can your switch prioritise traffic based on VLANs?
I am not familiar with Cisco switches.
 
I'd then bond the NICs on each node, tag all VLANs and prioritise said VLANs on the switch(es).


We use Open vSwitch for that on the nodes.
https://pve.proxmox.com/wiki/Open_vSwitch

Depending on the node, we have anything from 5x 1G bonds up to 18x 10G bonds (2x 10G + 2x 2x 40G with 40G-to-10G breakout cables) set up that way. IMHO it performs better than native Linux bridging and is cleaner to set up and review in the GUI.
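
As a rough idea only (a sketch along the lines of the wiki page above, not our exact config; the bond members, bridge name, VLAN tag and address are placeholders):
Code:
# /etc/network/interfaces, Open vSwitch style
allow-vmbr0 bond0
iface bond0 inet manual
    ovs_type OVSBond
    ovs_bridge vmbr0
    ovs_bonds eth0 eth1
    ovs_options bond_mode=balance-tcp lacp=active

auto vmbr0
iface vmbr0 inet manual
    ovs_type OVSBridge
    ovs_ports bond0 vlan8

# internal port for the corosync VLAN (tag 8 is a placeholder)
allow-vmbr0 vlan8
iface vlan8 inet static
    ovs_type OVSIntPort
    ovs_bridge vmbr0
    ovs_options tag=8
    address 10.2.8.181
    netmask 255.255.255.0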
 

I'll study Open vSwitch; that will take some practice with a test cluster before I use it in production.

Quick questions - what kind of network switch do you use for OVS, and what settings are used on the ports OVS uses (same VLAN, etc.)? I'll start another thread for OVS questions, if any, later.
 
We've had the red issue for a while, and it is related to network traffic.

Setting up a separate corosync network should have fixed this issue.

My guess is that some of the cluster software uses the corosync network and keeps working, but the part that uses /etc/pve does not.


I've got to restart the cluster now in order to prevent further issues.

The good thing about a separate corosync network is that this:
Code:
/etc/init.d/pve-cluster restart
quickly fixes the issue. Before we used a separate corosync network, it sometimes took a reboot to fix the red issue.
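
On PVE 4.x the systemd way should be equivalent (my assumption; I restart pvestatd as well so the GUI status refreshes):
Code:
systemctl restart pve-cluster
systemctl restart pvestatd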
 
corosync.conf
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: sys3
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.2.8.42
  }

  node {
    name: sys5
    nodeid: 4
    quorum_votes: 1
    ring0_addr: sys5-corosync
  }

  node {
    name: dell1
    nodeid: 3
    quorum_votes: 1
    ring0_addr: dell1-corosync
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cluster-v4
  config_version: 13
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 10.2.8.181
    ringnumber: 0
  }

}
/etc/hosts part:
Code:
# corosync network hosts
10.2.8.42  sys3-corosync.fantinibakery.com  sys3-corosync
10.2.8.19  sys5-corosync.fantinibakery.com  sys5-corosync
10.2.8.181 dell1-corosync.fantinibakery.com dell1-corosync
 
I've another issue.

I tested restarting two nodes. On both, none of the VMs started.


When one node restarted, /etc/pve was not writable; on the other node it was writable.
I can manually get around these issues, but I want to get the core issue fixed.
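
The manual workaround for the read-only /etc/pve is along these lines (only a temporary measure, and only if pvecm status shows the node has lost quorum):
Code:
pvecm status        # confirm the node really is not quorate
pvecm expected 1    # temporarily lower expected votes so /etc/pve becomes writable again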

Looking at the logs I can see problems, but no clues as to how to solve them.

If there are any suggestions on how to fix this, please reply.

Maybe one or more nodes need a reinstall, or I should start from scratch and make a new cluster.
 
