Cluster problem. Node is red, but online

GospodinAbdula

New Member
Jul 25, 2014
22
0
1
Hello! I have cluster with 10 nodes and all nodes red in web interface.

5d23ce74ecb34f0587e62b2a67b84ed4.png

Code:
root@node0:~# pvecm status
Version: 6.2.0
Config Version: 10
Cluster Name: Cluster0
Cluster Id: 57240
Cluster Member: Yes
Cluster Generation: 5140
Membership state: Cluster-Member
Nodes: 10
Expected votes: 10
Total votes: 10
Node votes: 1
Quorum: 6  
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: node0
Node ID: 3
Multicast addresses: 239.192.223.120 
Node addresses: 172.16.187.10

Code:
root@node0:~# pveversion -v
proxmox-ve-2.6.32: 3.3-147 (running kernel: 2.6.32-37-pve)
pve-manager: 3.4-1 (running version: 3.4-1/3f2d890e)
pve-kernel-2.6.32-37-pve: 2.6.32-147
pve-kernel-2.6.32-26-pve: 2.6.32-114
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.3-20
pve-firmware: 1.1-3
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-31
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-12
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

Code:
root@node0:~# cat /etc/pve/.members 
{
"nodename": "node0",
"version": 19,
"cluster": { "name": "Cluster0", "version": 10, "nodes": 10, "quorate": 1 },
"nodelist": {
  "node1": { "id": 1, "online": 1, "ip": "172.16.187.11"},
  "node3": { "id": 2, "online": 1, "ip": "172.16.187.13"},
  "node0": { "id": 3, "online": 1, "ip": "172.16.187.10"},
  "pmox4": { "id": 4, "online": 1, "ip": "172.16.187.24"},
  "pmox2": { "id": 5, "online": 1, "ip": "172.16.187.22"},
  "pmox3": { "id": 7, "online": 1, "ip": "172.16.187.23"},
  "pmox1": { "id": 8, "online": 1, "ip": "172.16.187.20"},
  "pmox5": { "id": 6, "online": 1, "ip": "172.16.187.21"},
  "node2": { "id": 9, "online": 1, "ip": "172.16.187.12"},
  "pmox0": { "id": 10, "online": 1, "ip": "172.16.187.30"}
  }
}

Code:
node0 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.596/0.596/0.596/0.000
node0 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.620/0.620/0.620/0.000
node1 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.560/0.560/0.560/0.000
node1 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.570/0.570/0.570/0.000
node3 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.573/0.573/0.573/0.000
node3 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.585/0.585/0.585/0.000
pmox0 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.322/0.322/0.322/0.000
pmox0 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.333/0.333/0.333/0.000
pmox1 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.181/0.181/0.181/0.000
pmox1 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.230/0.230/0.230/0.000
pmox3 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.403/0.403/0.403/0.000
pmox3 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.454/0.454/0.454/0.000
pmox4 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.191/0.191/0.191/0.000
pmox4 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.249/0.249/0.249/0.000
 
Hi!

I have the same "Problem", all VMs are running, but the GUI... i restarted everything (services), but it is still red, all my four nodes. At the weekend i will reboot all nodes, then i will see if that will help (on weekend because we use Proxmox in our working environment).
I will told what happens after reboot.
I have read a lot about "red nodes", but i dont found the right solution.
The first moment when the nodes become red, was the moment after i added a NFS Storage (Synology). Then the red node moments startet...
After that i removed the NFS Storage, but it is still red.
I have all up to date, i will write again on coming weekend..

Best regards,
Roman
 
Hi Dietmar!

Yes, of course. I restart all services on all nodes, one after one. But its not better.
What i notice during this "phase" is, that the load overage on all nodes growed up to the double. On the GUI i can only see actually the "Übersicht", the graphical live processes don´t work.

Is there a chance to restart the "gui" or other services, if there is anyway availible one? like the "NAS4free", sometimes the gui is not available, after "/etc/rc.d/lighttpd restart" the nas4free gui works.

Best regards,

Roman
 
Hi Dietmar!

No, this command does not hang, but why is my NFS Share offline..

And i get mails from every host:

/etc/cron.daily/mlocate:
Warning: /var/lib/mlocate/daily.lock present, not running updatedb.
run-parts: /etc/cron.daily/mlocate exited with return code 1

Output from pvesm status:

root@pve1:~# pvesm status
storage 'DiskStationNFS' is not online
CephStorage01 rbd 1 16971960980 7056923792 9915037188 42.08%
DiskStationNFS nfs 0 0 0 0 100.00%
local dir 1 796039448 8207464 787831984 1.53%
root@pve1:~#

I will try to get my NAS online, then i test the pvesm command again.

Thanks,

Roman


[UPDATE]

Now the NFS Share works, but on the Proxmox GUI the nodes are all still red.


root@pve1:~# pvesm status
CephStorage01 rbd 1 16971960980 7056959636 9915001344 42.08%
DiskStationNFS nfs 1 1913418624 1761644032 151655808 92.57%
local dir 1 796039448 8207464 787831984 1.53%
root@pve1:~#


Thanks,
Roman
 
Last edited:
Hi,

we have the same problem regulary since a few weeks. (Setup 3 Node cluster with NAS NFS-Backup)
We can turn the nodes back to "green", if we restart the "PVECluster" service via the webgui (Node->Services->PVECluster) or the CLI (service pve-cluster status) WITHOUT rebooting the nodes. It took us a long time to discover ....
When this happens we have to restart this service on two of our three nodes - minimum on two nodes.
This happens during a NAS backup, the backup jobs to the share hanging for many hours.
In my opinion, the cluster-communication is disturbed due overload of the backups to the nas.
We are hunting the problem and wrote a Skript, which does few tests regulary and write the output to a logfile.
But at the moment all of the tests are OK when this happens...
pve-cluster, cman, /etc/pve/.members, ...
I can provide the bash-script, if somebody wants it.

It would be interesting how the web-gui, checks if a node turns to red (heartbeat?) and how one can check this via a script?
 
Last edited:
Hello? i've problem here. At i check in
#cat /etc/pve/.members

It show this.

{
"nodename": "proxmox2",
"version": 4,
"cluster": { "name": "cluster", "version": 2, "nodes": 2, "quorate": 0 },
"nodelist": {
"proxmox": { "id": 2, "online": 0},
"proxmox2": { "id": 1, "online": 1, "ip": "192.168.36.6"}
}
}

Ip for node proxmox is mssing. How to fix it?
 
We have exact issues and use the same solution.

To add: We eliminated NFS or any kind of remote backup. All 3 nodes backup to a local disk . Later on I rsync the backups to each of the other nodes .

During the backups, we still get nodes going red and other issues - like some vm's not starting after backup.

So NAS or NFS are not the cause of the issue.