Cluster problem: nodes are red, but online

GospodinAbdula

New Member
Jul 25, 2014
Hello! I have a cluster with 10 nodes, and all of them are shown red in the web interface.

[Screenshot: all ten nodes shown red in the web interface]

Code:
root@node0:~# pvecm status
Version: 6.2.0
Config Version: 10
Cluster Name: Cluster0
Cluster Id: 57240
Cluster Member: Yes
Cluster Generation: 5140
Membership state: Cluster-Member
Nodes: 10
Expected votes: 10
Total votes: 10
Node votes: 1
Quorum: 6  
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: node0
Node ID: 3
Multicast addresses: 239.192.223.120 
Node addresses: 172.16.187.10

Code:
root@node0:~# pveversion -v
proxmox-ve-2.6.32: 3.3-147 (running kernel: 2.6.32-37-pve)
pve-manager: 3.4-1 (running version: 3.4-1/3f2d890e)
pve-kernel-2.6.32-37-pve: 2.6.32-147
pve-kernel-2.6.32-26-pve: 2.6.32-114
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.3-20
pve-firmware: 1.1-3
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-31
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-12
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

Code:
root@node0:~# cat /etc/pve/.members 
{
"nodename": "node0",
"version": 19,
"cluster": { "name": "Cluster0", "version": 10, "nodes": 10, "quorate": 1 },
"nodelist": {
  "node1": { "id": 1, "online": 1, "ip": "172.16.187.11"},
  "node3": { "id": 2, "online": 1, "ip": "172.16.187.13"},
  "node0": { "id": 3, "online": 1, "ip": "172.16.187.10"},
  "pmox4": { "id": 4, "online": 1, "ip": "172.16.187.24"},
  "pmox2": { "id": 5, "online": 1, "ip": "172.16.187.22"},
  "pmox3": { "id": 7, "online": 1, "ip": "172.16.187.23"},
  "pmox1": { "id": 8, "online": 1, "ip": "172.16.187.20"},
  "pmox5": { "id": 6, "online": 1, "ip": "172.16.187.21"},
  "node2": { "id": 9, "online": 1, "ip": "172.16.187.12"},
  "pmox0": { "id": 10, "online": 1, "ip": "172.16.187.30"}
  }
}

Code:
node0 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.596/0.596/0.596/0.000
node0 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.620/0.620/0.620/0.000
node1 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.560/0.560/0.560/0.000
node1 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.570/0.570/0.570/0.000
node3 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.573/0.573/0.573/0.000
node3 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.585/0.585/0.585/0.000
pmox0 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.322/0.322/0.322/0.000
pmox0 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.333/0.333/0.333/0.000
pmox1 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.181/0.181/0.181/0.000
pmox1 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.230/0.230/0.230/0.000
pmox3 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.403/0.403/0.403/0.000
pmox3 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.454/0.454/0.454/0.000
pmox4 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.191/0.191/0.191/0.000
pmox4 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.249/0.249/0.249/0.000
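For reference, the block above is the summary of a quick omping multicast check; it was started on each node at roughly the same time with something like this (host names as in the member list above):

Code:
omping -c 1 -q node0 node1 node2 node3 pmox0 pmox1 pmox2 pmox3 pmox4 pmox5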
 
Hi!

I have the same "problem": all VMs are running, but in the GUI all four of my nodes are red. I restarted all the services, but they are still red. At the weekend I will reboot all nodes and see whether that helps (at the weekend because we use Proxmox in our production environment).
I will report what happens after the reboot.
I have read a lot about "red nodes", but I haven't found the right solution yet.
The nodes first turned red right after I added an NFS storage (Synology); that is when the red-node episodes started.
I have since removed the NFS storage, but the nodes are still red.
Everything is up to date. I will write again next weekend.

Best regards,
Roman
 
Hi Dietmar!

Yes, of course. I restarted all services on all nodes, one after the other, but it is no better.
What I noticed during this "phase" is that the load average on all nodes roughly doubled. In the GUI I can currently only see the "Summary" ("Übersicht") view; the live graphs don't work.

Is there a way to restart the "GUI" or the relevant services, if such a service exists? Like on NAS4Free: sometimes its GUI is not available, and after "/etc/rc.d/lighttpd restart" the NAS4Free GUI works again.
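Edit: for reference, this is what I restarted so far, one by one on every node (PVE 3.x service names, assuming a standard install; as far as I understand, pveproxy serves the web interface and pvestatd delivers the status data shown in the GUI):

Code:
service pve-cluster restart
service pvedaemon restart
service pveproxy restart
service pvestatd restart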

Best regards,

Roman
 
Hi Dietmar!

No, this command does not hang, but why is my NFS share offline?

And I get mails from every host:

/etc/cron.daily/mlocate:
Warning: /var/lib/mlocate/daily.lock present, not running updatedb.
run-parts: /etc/cron.daily/mlocate exited with return code 1
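As far as I can tell, this warning only means that an earlier updatedb run never finished (probably hung on the offline NFS mount) and left its lock file behind. Once nothing is stuck any more, removing the stale lock should stop these mails; this is what I intend to try:

Code:
# check that no updatedb process is still hanging around, then remove the stale lock
pgrep -fl updatedb
rm /var/lib/mlocate/daily.lock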

Output from pvesm status:

root@pve1:~# pvesm status
storage 'DiskStationNFS' is not online
CephStorage01 rbd 1 16971960980 7056923792 9915037188 42.08%
DiskStationNFS nfs 0 0 0 0 100.00%
local dir 1 796039448 8207464 787831984 1.53%
root@pve1:~#

I will try to get my NAS online and then test the pvesm command again.

Thanks,

Roman


[UPDATE]

Now the NFS share works, but in the Proxmox GUI all nodes are still red.


root@pve1:~# pvesm status
CephStorage01 rbd 1 16971960980 7056959636 9915001344 42.08%
DiskStationNFS nfs 1 1913418624 1761644032 151655808 92.57%
local dir 1 796039448 8207464 787831984 1.53%
root@pve1:~#


Thanks,
Roman
 
Hi,

we have had the same problem regularly for a few weeks (setup: 3-node cluster with NFS backup to a NAS).
We can turn the nodes back to "green" if we restart the "PVECluster" service via the web GUI (Node -> Services -> PVECluster) or via the CLI ("service pve-cluster restart"), WITHOUT rebooting the nodes. It took us a long time to discover this.
When this happens, we have to restart the service on at least two of our three nodes.
It happens during a NAS backup; the backup jobs to the share hang for many hours.
In my opinion, the cluster communication is disturbed by the load the backups put on the NAS.
We are hunting the problem and wrote a script that runs a few tests regularly and writes the output to a logfile.
But so far all of the tests are OK when it happens:
pve-cluster, cman, /etc/pve/.members, ...
I can provide the bash script if somebody wants it.

It would be interesting to know how the web GUI decides that a node turns red (heartbeat?) and how one could check this via a script.
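For what it is worth, here is a minimal sketch of the kind of check we run (the logfile path is just an example). If I understand it correctly, the colour in the GUI is driven by the status broadcasts from pvestatd via pmxcfs, so a node can turn red even while corosync/cman membership looks fine, which would explain why all of these tests stay OK:

Code:
#!/bin/bash
# log quorum state, node online flags and the two relevant daemons; run it from cron
LOG=/var/log/cluster-check.log        # example path
{
    date
    grep -o '"quorate": [01]' /etc/pve/.members
    grep '"online": 0' /etc/pve/.members || echo "all nodes report online=1"
    service pve-cluster status
    service pvestatd status
} >> "$LOG" 2>&1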
 
Hello, I have a problem here. When I check with

#cat /etc/pve/.members

it shows this:

{
"nodename": "proxmox2",
"version": 4,
"cluster": { "name": "cluster", "version": 2, "nodes": 2, "quorate": 0 },
"nodelist": {
"proxmox": { "id": 2, "online": 0},
"proxmox2": { "id": 1, "online": 1, "ip": "192.168.36.6"}
}
}

The IP for node proxmox is missing. How do I fix it?
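Edit: from what I have read so far, the missing IP just reflects that node proxmox is currently offline from the cluster's point of view, and with "quorate": 0 the remaining node has lost quorum (so /etc/pve is read-only). If the second node stays down for a while, expected votes can be lowered temporarily so a single node becomes quorate again; a sketch, assuming a standard two-node PVE 3.x cluster:

Code:
# check membership and quorum first
pvecm status
pvecm nodes

# temporarily let a single node be quorate (revert once the second node is back)
pvecm expected 1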
 
We have the exact same issues and use the same solution.

To add: we eliminated NFS and any kind of remote backup. All 3 nodes back up to a local disk; later on I rsync the backups to each of the other nodes.
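Roughly like this, if anyone is interested (paths and host names are just an example of our layout):

Code:
# copy last night's local dumps to the two other nodes (example host names)
rsync -a /var/lib/vz/dump/ root@node2:/var/lib/vz/dump-from-node1/
rsync -a /var/lib/vz/dump/ root@node3:/var/lib/vz/dump-from-node1/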

During the backups we still get nodes going red, and other issues as well, like some VMs not starting after the backup.

So the NAS/NFS is not the cause of the issue.