Cluster nodes become red for no apparent reason about once a week

Hi, I have three nodes in a cluster with Proxmox VE 2.1 installed. An external SAS storage is used as the main shared storage in LVM mode. Everything seems to work almost fine, and migration works well. The problem is that sometimes (once or twice a week) one of the nodes (always the same one) turns red. I only notice it when I come to work in the morning. I restart the pve-cluster and cman services on the problem node and it turns green after a minute. But the main problem is that the planned backup of the VMs on that node can't run while it is red. The task pane shows a message that the machine couldn't be backed up because the node wasn't able to lock the VM for backup.
What can be the problem with that cluster node?
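For reference, the restart that brings it back is roughly the following, run on the problem node only (these are the init scripts as they exist on my PVE 2.1 install):

Code:
/etc/init.d/pve-cluster restart
/etc/init.d/cman restart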
Here is the output of the "pvecm status" command on the problem node (it is green now):
Version: 6.2.0
Config Version: 3
Cluster Name: ****************
Cluster Id: 38082
Cluster Member: Yes
Cluster Generation: 372
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: **************
Node ID: 3
Multicast addresses: 239.192.148.87
Node addresses: 172.16.10.237


and from another node:
Version: 6.2.0
Config Version: 3
Cluster Name: ****************
Cluster Id: 38082
Cluster Member: Yes
Cluster Generation: 372
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: ************
Node ID: 2
Multicast addresses: 239.192.148.87
Node addresses: 172.16.10.235
 
# cat /etc/pve/.members
{
"nodename": "srv-bld-a07",
"version": 11,
"cluster": { "name": "****************", "version": 3, "nodes": 3, "quorate": 1 },
"nodelist": {
"srv-bld-a06": { "id": 1, "online": 1, "ip": "172.16.10.236"},
"srv-bld-a05": { "id": 2, "online": 1, "ip": "172.16.10.235"},
"srv-bld-a07": { "id": 3, "online": 1, "ip": "172.16.10.237"}
}
}

I can add that I can't open the console window of any machine on that node when logged in to one of the other nodes of the cluster. The last message in the console window is "Authentication failed". Migration works well, as I already mentioned. Only when I log in to that node directly can I open the consoles of its VMs.
The other two nodes work well, and I can open the console window of any machine on those two nodes when logged in to either of them.
 
I need the output when the node is 'red'
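For example, capture something like this on the problem node while it is showing red (the same commands already used in this thread):

Code:
pvecm status
cat /etc/pve/.members
/etc/init.d/cman status
/etc/init.d/pve-cluster status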


Hello,


I am having the same problem. From the nhprox01 node:


Code:
nhprox01# cat /etc/pve/.members
{
"nodename": "nhprox01",
"version": 24,
"cluster": { "name": "nhprox-cluster", "version": 6, "nodes": 6, "quorate": 0 },
"nodelist": {
  "nhprox06": { "id": 1, "online": 0, "ip": "172.17.16.43"},
  "nhprox01": { "id": 2, "online": 1, "ip": "172.17.16.8"},
  "nhprox02": { "id": 3, "online": 0, "ip": "172.17.16.9"},
  "nhprox03": { "id": 4, "online": 0, "ip": "172.17.16.4"},
  "nhprox04": { "id": 5, "online": 0, "ip": "172.17.16.5"},
  "nhprox05": { "id": 6, "online": 0, "ip": "172.17.16.6"}
  }
}




Each node only sees itself as online. The cluster had been working OK for weeks.


Code:
# pveversion -v
pve-manager: 2.1-1 (pve-manager/2.1/f9b0f63a)
running kernel: 2.6.32-12-pve
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-12-pve: 2.6.32-68
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-39
pve-firmware: 1.0-16
libpve-common-perl: 1.0-27
libpve-access-control: 1.0-21
libpve-storage-perl: 2.0-18
vncterm: 1.0-2
vzctl: 3.0.30-2pve5
vzprocps: not correctly installed
vzquota: 3.0.12-3



Any idea how to solve it?
Thanks for your help.
 
Check the cluster communication (is cman running?)

cman is not running

Code:
 # /etc/init.d/cman status
fenced is stopped

Code:
# /etc/init.d/cman start
Starting cluster: 
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... Timed-out waiting for cluster
[FAILED]


but a multicast test seems to run OK:

Code:
# asmping 224.0.0.1 nhprox02
asmping joined (S,G) = (*,224.0.0.234)
pinging 172.17.16.9 from 172.17.16.8
  unicast from 172.17.16.9, seq=1 dist=0 time=0.980 ms
multicast from 172.17.16.9, seq=1 dist=0 time=0.997 ms
  unicast from 172.17.16.9, seq=2 dist=0 time=0.193 ms
multicast from 172.17.16.9, seq=2 dist=0 time=0.206 ms

If I run the same test with the cluster's multicast address, I only get unicast replies:

Code:
# asmping 239.192.7.187 nhprox02
asmping joined (S,G) = (*,239.192.7.234)
pinging 172.17.16.9 from 172.17.16.8
  unicast from 172.17.16.9, seq=1 dist=0 time=1.113 ms
  unicast from 172.17.16.9, seq=2 dist=0 time=0.197 ms
  unicast from 172.17.16.9, seq=3 dist=0 time=0.200 ms
  unicast from 172.17.16.9, seq=4 dist=0 time=0.208 ms
  unicast from 172.17.16.9, seq=5 dist=0 time=0.139 ms


I've verified again that each /etc/hosts has the correct information.
Any ideas on how to restore the cluster, given that it had been working for weeks and no changes were made to the switch config?

Thanks a lot
 
I'm seeing the same thing. I had a backup fail because a disk of an unused VM was moved. Since then, the node has been red and I haven't been able to recover it. I made sure to unlock all VMs on that node, and I stopped and started pve-cluster and cman (on the red node only). The web interface of this node and of the others in the cluster shows this node as red, with no details on any of its VMs. Here is the output from the red node (node 3):
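(For the record, the unlock and service bounce amounted to roughly the following; the VMID below is just an example, not one of my real IDs:)

Code:
qm unlock 100        # repeated for each VM that was still locked
service pve-cluster stop
service cman stop
service cman start
service pve-cluster start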

root@proxmox3:/var/lib/vz/images# service cman status
cluster is running.
root@proxmox3:/var/lib/vz/images# cat /etc/pve/.members
{
"nodename": "proxmox3",
"version": 7,
"cluster": { "name": "connectify", "version": 5, "nodes": 5, "quorate": 1 },
"nodelist": {
"proxmox5": { "id": 1, "online": 1, "ip": "192.168.202.231"},
"proxmox4": { "id": 2, "online": 1, "ip": "192.168.202.244"},
"proxmox3": { "id": 3, "online": 1, "ip": "192.168.202.243"},
"proxmox2": { "id": 4, "online": 1, "ip": "192.168.202.242"},
"proxmox1": { "id": 5, "online": 1, "ip": "192.168.202.241"}
}
}
root@proxmox3:/var/lib/vz/images# pvecm status
Version: 6.2.0
Config Version: 5
Cluster Name: ******
Cluster Id: *******
Cluster Member: Yes
Cluster Generation: 57296
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Node votes: 1
Quorum: 3
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: proxmox3
Node ID: 3
Multicast addresses: 239.192.160.122
Node addresses: 192.168.202.243




And here is the output from another node in the cluster:

root@proxmox5:~# service pve-cluster status
Checking status of pve cluster filesystem: pve-cluster running.
root@proxmox5:~# service cman status
cluster is running.
root@proxmox5:~# cat /etc/pve/.members
{
"nodename": "proxmox5",
"version": 16,
"cluster": { "name": "connectify", "version": 5, "nodes": 5, "quorate": 1 },
"nodelist": {
"proxmox5": { "id": 1, "online": 1, "ip": "192.168.202.231"},
"proxmox4": { "id": 2, "online": 1, "ip": "192.168.202.244"},
"proxmox3": { "id": 3, "online": 1, "ip": "192.168.202.243"},
"proxmox2": { "id": 4, "online": 1, "ip": "192.168.202.242"},
"proxmox1": { "id": 5, "online": 1, "ip": "192.168.202.241"}
}
}


All systems are running VE 2.1.
 
I'm seeing the same thing. I had a backup fail because a disk of an unused VM was moved. Since then, the node has been red and I haven't been able to recover it.

Does it help if you restart pvestatd?

# service pvestatd restart

Besides, all your nodes seem to be running, so maybe it is just a GUI bug - or do you have some real problems?
 
Does it help if you restart pvestatd?

# service pvestatd restart

Besides, all your nodes seem to be running, so maybe it is just a GUI bug - or do you have some real problems?

Holy cow, you guys are awesome. Yes, that fixed it immediately! It was just a GUI bug, but it was worse than just the node showing as red, because all of the VMs on that node were only displayed by number, and showed as powered off, although they were still running. So I couldn't use the GUI to administer anything on that node until I restarted pvestatd. Thanks!
 
I am having the same difficulty: the GUI shows my node as red and all VMs as down while everything is actually running. HOWEVER, I am NOT running HA or a cluster, just a single node. It seems to occur most often after my weekend backup, but this morning (mid-week, no backup run) when I logged on they all showed as down. Running "service pvestatd restart" restores the GUI display to normal. It is interesting that restarting pvestatd always says "cannot kill process xxx, process not found".

Any suggestions?

pve-manager: 2.2-24 (pve-manager/2.2/7f9cfa4c)
running kernel: 2.6.32-16-pve
proxmox-ve-2.6.32: 2.2-80
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-16-pve: 2.6.32-80
pve-kernel-2.6.32-14-pve: 2.6.32-74
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-1
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-28
qemu-server: 2.0-62
pve-firmware: 1.0-21
libpve-common-perl: 1.0-36
libpve-access-control: 1.0-25
libpve-storage-perl: 2.0-34
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.2-7
ksm-control-daemon: 1.1-1
 
I am having the same difficulty: the GUI shows my node as red and all VMs as down while everything is actually running. HOWEVER, I am NOT running HA or a cluster, just a single node. It seems to occur most often after my weekend backup, but this morning (mid-week, no backup run) when I logged on they all showed as down. Running "service pvestatd restart" restores the GUI display to normal. It is interesting that restarting pvestatd always says "cannot kill process xxx, process not found".

Any suggestions?

Upgrade to the latest stable release.
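(On a standard PVE 2.x install, assuming the pve repository is already configured, that is just the usual apt upgrade:)

Code:
apt-get update
apt-get dist-upgrade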
 
We have the most recent stable release, and the pvestatd service does indeed routinely fail. I didn't notice the issue until I implemented backups of my VM and CT images, however; piggy-backing on a previous post in this thread, the strong correlation suggests the service drop could well be triggered by the backup process.

Thanks for the help, and happy to see this was a "non-essential" service issue.

Love the platform, keep up the great work!

Upgrade to the latest stable release.
 
Hello, I have the same issue on a single host with the latest 2.2 version.
The node just shows a red light and there are no names for the running VMs.
A restart fixes the trouble:
root@proxmox:~# service pvestatd restart
Restarting PVE Status Daemon: pvestatdstart-stop-daemon: warning: failed to kill 1741: No such process
.
root@proxmox:~# /etc/init.d/pve
pvebanner pve-cluster pvedaemon pve-manager pvenetcommit pvestatd
root@proxmox:~# /etc/init.d/pvestatd restart
Restarting PVE Status Daemon: pvestatd.
root@proxmox:~#
I don't have any scheduled backups or anything else, just a host with several KVM and OpenVZ guests.
 
As a temporary workaround, I've added a cron job, running every thirty minutes, to start the pvestatd service. If the service is still running, the start will fail and probably add a line to the syslog -- no harm done. If the service is dead, the job will kick it off.
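(In case it helps anyone, the crontab entry is essentially the following; the file name is just what I picked, and the interval is arbitrary:)

Code:
# /etc/cron.d/restart-pvestatd: try to (re)start pvestatd every 30 minutes
*/30 * * * * root /etc/init.d/pvestatd start >/dev/null 2>&1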

On my systems, at least, the correlation between backup tasks and failures is very high. Hopefully the dev team can provide a patch for this, so I can unplug the cron job...!
 
post the output of 'pveversion -v'
 
root@proxmox:~# pveversion -v
pve-manager: 2.2-24 (pve-manager/2.2/7f9cfa4c)
running kernel: 2.6.32-16-pve
proxmox-ve-2.6.32: 2.2-80
pve-kernel-2.6.32-16-pve: 2.6.32-80
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-1
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-28
qemu-server: 2.0-62
pve-firmware: 1.0-21
libpve-common-perl: 1.0-36
libpve-access-control: 1.0-25
libpve-storage-perl: 2.0-34
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.2-7
ksm-control-daemon: 1.1-1
 
