Cluster problem. Node is red, but online

GospodinAbdula

New Member
Jul 25, 2014
Hello! I have a cluster with 10 nodes, and all nodes are red in the web interface.

[Screenshot: all nodes shown red in the web interface]

Code:
root@node0:~# pvecm status
Version: 6.2.0
Config Version: 10
Cluster Name: Cluster0
Cluster Id: 57240
Cluster Member: Yes
Cluster Generation: 5140
Membership state: Cluster-Member
Nodes: 10
Expected votes: 10
Total votes: 10
Node votes: 1
Quorum: 6  
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: node0
Node ID: 3
Multicast addresses: 239.192.223.120 
Node addresses: 172.16.187.10

Code:
root@node0:~# pveversion -v
proxmox-ve-2.6.32: 3.3-147 (running kernel: 2.6.32-37-pve)
pve-manager: 3.4-1 (running version: 3.4-1/3f2d890e)
pve-kernel-2.6.32-37-pve: 2.6.32-147
pve-kernel-2.6.32-26-pve: 2.6.32-114
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.3-20
pve-firmware: 1.1-3
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-31
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-12
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

Code:
root@node0:~# cat /etc/pve/.members 
{
"nodename": "node0",
"version": 19,
"cluster": { "name": "Cluster0", "version": 10, "nodes": 10, "quorate": 1 },
"nodelist": {
  "node1": { "id": 1, "online": 1, "ip": "172.16.187.11"},
  "node3": { "id": 2, "online": 1, "ip": "172.16.187.13"},
  "node0": { "id": 3, "online": 1, "ip": "172.16.187.10"},
  "pmox4": { "id": 4, "online": 1, "ip": "172.16.187.24"},
  "pmox2": { "id": 5, "online": 1, "ip": "172.16.187.22"},
  "pmox3": { "id": 7, "online": 1, "ip": "172.16.187.23"},
  "pmox1": { "id": 8, "online": 1, "ip": "172.16.187.20"},
  "pmox5": { "id": 6, "online": 1, "ip": "172.16.187.21"},
  "node2": { "id": 9, "online": 1, "ip": "172.16.187.12"},
  "pmox0": { "id": 10, "online": 1, "ip": "172.16.187.30"}
  }
}

Code:
node0 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.596/0.596/0.596/0.000
node0 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.620/0.620/0.620/0.000
node1 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.560/0.560/0.560/0.000
node1 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.570/0.570/0.570/0.000
node3 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.573/0.573/0.573/0.000
node3 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.585/0.585/0.585/0.000
pmox0 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.322/0.322/0.322/0.000
pmox0 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.333/0.333/0.333/0.000
pmox1 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.181/0.181/0.181/0.000
pmox1 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.230/0.230/0.230/0.000
pmox3 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.403/0.403/0.403/0.000
pmox3 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.454/0.454/0.454/0.000
pmox4 :   unicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.191/0.191/0.191/0.000
pmox4 : multicast, xmt/rcv/%loss = 1/1/0%, min/avg/max/std-dev = 0.249/0.249/0.249/0.000
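For reference, a unicast/multicast test like the one above can be produced with omping, run on every node at roughly the same time, for example (exact options may differ):

Code:
# one probe per target; hostnames as listed in /etc/pve/.members
omping -c 1 node0 node1 node3 pmox0 pmox1 pmox3 pmox4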
 
Hi!

I have the same problem: all VMs are running, but in the GUI all four of my nodes are red. I restarted everything (all the services), but they are still red. At the weekend I will reboot all nodes and then see whether that helps (at the weekend because we use Proxmox in our production environment).
I will report what happens after the reboot.
I have read a lot about "red nodes", but I have not found the right solution.
The nodes first turned red right after I added an NFS storage (Synology); that is when the red-node episodes started...
After that I removed the NFS storage, but the nodes are still red.
Everything is up to date. I will write again after the coming weekend.

Best regards,
Roman
 
Hi Dietmar!

Yes, of course. I restarted all services on all nodes, one after the other, but it is no better.
What I noticed during this phase is that the load average on all nodes roughly doubled. In the GUI I can currently only see the "Overview" ("Übersicht") page; the live graphs do not work.

Is there a way to restart the GUI or other related services, if something like that is available? Like on NAS4Free: sometimes its GUI is not reachable, and after "/etc/rc.d/lighttpd restart" the NAS4Free GUI works again.
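If there are specific services for this, I would try restarting them one by one, roughly like this (service names as I understand them for PVE 3.x, so please correct me if they are wrong):

Code:
# restart the daemons behind the status display and the web GUI, one node at a time
service pvestatd restart     # gathers the node/VM status shown in the GUI
service pvedaemon restart    # local API daemon
service pveproxy restart     # web server that serves the GUI
service pve-cluster restart  # cluster filesystem behind /etc/pve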

Best regards,

Roman
 
Hi Dietmar!

No, this command does not hang, but why is my NFS share offline?

And I get mails from every host:

/etc/cron.daily/mlocate:
Warning: /var/lib/mlocate/daily.lock present, not running updatedb.
run-parts: /etc/cron.daily/mlocate exited with return code 1

Output from pvesm status:

root@pve1:~# pvesm status
storage 'DiskStationNFS' is not online
CephStorage01 rbd 1 16971960980 7056923792 9915037188 42.08%
DiskStationNFS nfs 0 0 0 0 100.00%
local dir 1 796039448 8207464 787831984 1.53%
root@pve1:~#
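To see whether the DiskStation is reachable at all, I will run something like this from the node (the NAS hostname is just a placeholder; please correct me if there is a better check):

Code:
# <diskstation> is a placeholder for the Synology hostname/IP
showmount -e <diskstation>   # list the NFS exports the NAS offers
rpcinfo -p <diskstation>     # check that the NFS/mountd RPC services answer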

I will try to get my NAS online, then I will test the pvesm command again.

Thanks,

Roman


[UPDATE]

Now the NFS share works, but in the Proxmox GUI all the nodes are still red.


root@pve1:~# pvesm status
CephStorage01 rbd 1 16971960980 7056959636 9915001344 42.08%
DiskStationNFS nfs 1 1913418624 1761644032 151655808 92.57%
local dir 1 796039448 8207464 787831984 1.53%
root@pve1:~#


Thanks,
Roman
 
Hi,

we have had the same problem regularly for a few weeks (setup: 3-node cluster with backups to a NAS via NFS).
We can turn the nodes back to "green" if we restart the "PVECluster" service via the web GUI (Node -> Services -> PVECluster) or the CLI (service pve-cluster restart), WITHOUT rebooting the nodes. It took us a long time to discover this...
When this happens, we have to restart this service on at least two of our three nodes.
It happens during a NAS backup; the backup jobs to the share hang for many hours.
In my opinion, the cluster communication is disturbed by the load the backups put on the NAS.
We are hunting the problem and wrote a script which runs a few tests regularly and writes the output to a logfile.
But at the moment all of the tests are OK when this happens...
pve-cluster, cman, /etc/pve/.members, ...
I can provide the bash script if somebody wants it; a stripped-down sketch of the checks is below.

It would be interesting to know how the web GUI decides that a node has turned red (heartbeat?) and how one could check this from a script.
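For reference, a minimal sketch of the kind of checks described above (not the full script; the real one logs more detail):

Code:
#!/bin/bash
# append a snapshot of the cluster state to a logfile (e.g. run from cron)
LOG=/var/log/red-node-check.log
{
    date
    service pve-cluster status
    service cman status
    pvecm status | grep -E 'Membership|Quorum|Total votes'
    cat /etc/pve/.members
    echo "----"
} >> "$LOG" 2>&1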
 
Hello, I have a problem here. When I check with
# cat /etc/pve/.members

it shows this:

{
"nodename": "proxmox2",
"version": 4,
"cluster": { "name": "cluster", "version": 2, "nodes": 2, "quorate": 0 },
"nodelist": {
"proxmox": { "id": 2, "online": 0},
"proxmox2": { "id": 1, "online": 1, "ip": "192.168.36.6"}
}
}

The IP for node proxmox is missing. How do I fix it?
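For reference, a missing IP there usually just means the cluster has not seen that node online (it shows "online": 0 and the cluster is not quorate). Assuming the name proxmox resolves, a first round of checks could be:

Code:
# from the node that is still up:
ping -c 3 proxmox        # does the name resolve and answer?
pvecm nodes              # does the cluster stack still list the node?
# on the "proxmox" node itself, if it is reachable:
service cman status
service pve-cluster restart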
 
We have the exact same issues and use the same solution.

To add: we eliminated NFS and any other kind of remote backup. All 3 nodes back up to a local disk. Later on I rsync the backups to each of the other nodes (sketch below).
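A minimal sketch of that sync, assuming the default dump directory and SSH keys between the nodes (the hostname is a placeholder):

Code:
# push this node's dumps to one of the other nodes; "node2" is a placeholder
rsync -av /var/lib/vz/dump/ root@node2:/var/lib/vz/dump/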

During the backups we still get nodes going red, plus other issues, like some VMs not starting after backup.

So NAS/NFS is not the cause of the issue.
 
