Cluster nodes become red for no apparent reason about once a week

Hi, I have three nodes in a cluster with Proxmox VE 2.1 installed. An external SAS storage is used as the main shared storage in LVM mode. Everything seems to work almost fine, and migration works well. The problem is that sometimes (once or twice a week) one of the nodes (always the same one) turns red. I only notice it when I come to work in the morning. I restart the pve-cluster and cman services on the problem node and it turns green after a minute. But the main problem is that the planned backup of the VMs on that node can't run while it is red. The task pane shows a message that the machine couldn't be backed up because the node wasn't able to lock the VM for backup.
What can be the problem with that cluster node?
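For reference, the restart that brings it back is roughly the following, run on the problem node only (these are the init scripts as they exist on my PVE 2.1 install):

Code:
/etc/init.d/pve-cluster restart
/etc/init.d/cman restart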
Here is the output of the "pvecm status" command on the problem node (it is green now):
Version: 6.2.0
Config Version: 3
Cluster Name: ****************
Cluster Id: 38082
Cluster Member: Yes
Cluster Generation: 372
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: **************
Node ID: 3
Multicast addresses: 239.192.148.87
Node addresses: 172.16.10.237


and from another node:
Version: 6.2.0
Config Version: 3
Cluster Name: ****************
Cluster Id: 38082
Cluster Member: Yes
Cluster Generation: 372
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: ************
Node ID: 2
Multicast addresses: 239.192.148.87
Node addresses: 172.16.10.235
 
# cat /etc/pve/.members
{
"nodename": "srv-bld-a07",
"version": 11,
"cluster": { "name": "****************", "version": 3, "nodes": 3, "quorate": 1 },
"nodelist": {
"srv-bld-a06": { "id": 1, "online": 1, "ip": "172.16.10.236"},
"srv-bld-a05": { "id": 2, "online": 1, "ip": "172.16.10.235"},
"srv-bld-a07": { "id": 3, "online": 1, "ip": "172.16.10.237"}
}
}

I can add that I can't open the console window of any machine on that node when logged in to one of the other nodes of the cluster. The last message in the console window is "Authentication failed". Migration works well, as I already mentioned. Only when I log in to that node directly can I open the consoles of its VMs.
The other two nodes work well, and I can open the console window of any machine on those two nodes when logged in to either of them.
 
I need the output when the node is 'red'
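For example, capture something like this on the problem node while it is showing red (the same commands already used in this thread):

Code:
pvecm status
cat /etc/pve/.members
/etc/init.d/cman status
/etc/init.d/pve-cluster status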


Hello,


I am having the same problem. From the nhprox01 node:


Code:
nhprox01# cat /etc/pve/.members
{
"nodename": "nhprox01",
"version": 24,
"cluster": { "name": "nhprox-cluster", "version": 6, "nodes": 6, "quorate": 0 },
"nodelist": {
  "nhprox06": { "id": 1, "online": 0, "ip": "172.17.16.43"},
  "nhprox01": { "id": 2, "online": 1, "ip": "172.17.16.8"},
  "nhprox02": { "id": 3, "online": 0, "ip": "172.17.16.9"},
  "nhprox03": { "id": 4, "online": 0, "ip": "172.17.16.4"},
  "nhprox04": { "id": 5, "online": 0, "ip": "172.17.16.5"},
  "nhprox05": { "id": 6, "online": 0, "ip": "172.17.16.6"}
  }
}




Each node only sees itself as online. The cluster had been working OK for weeks.


Code:
# pveversion -v
pve-manager: 2.1-1 (pve-manager/2.1/f9b0f63a)
running kernel: 2.6.32-12-pve
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-12-pve: 2.6.32-68
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-39
pve-firmware: 1.0-16
libpve-common-perl: 1.0-27
libpve-access-control: 1.0-21
libpve-storage-perl: 2.0-18
vncterm: 1.0-2
vzctl: 3.0.30-2pve5
vzprocps: not correctly installed
vzquota: 3.0.12-3



Any idea how to solve it?
Thanks for your help.
 
Check the cluster communication (is cman running?)

cman is not running

Code:
 # /etc/init.d/cman status
fenced is stopped

Code:
# /etc/init.d/cman start
Starting cluster: 
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... Timed-out waiting for cluster
[FAILED]


but a multicast test seems to run OK:

Code:
# asmping 224.0.0.1 nhprox02
asmping joined (S,G) = (*,224.0.0.234)
pinging 172.17.16.9 from 172.17.16.8
  unicast from 172.17.16.9, seq=1 dist=0 time=0.980 ms
multicast from 172.17.16.9, seq=1 dist=0 time=0.997 ms
  unicast from 172.17.16.9, seq=2 dist=0 time=0.193 ms
multicast from 172.17.16.9, seq=2 dist=0 time=0.206 ms

If I run the same test with the cluster's multicast address, I only get unicast replies:

Code:
# asmping 239.192.7.187 nhprox02
asmping joined (S,G) = (*,239.192.7.234)
pinging 172.17.16.9 from 172.17.16.8
  unicast from 172.17.16.9, seq=1 dist=0 time=1.113 ms
  unicast from 172.17.16.9, seq=2 dist=0 time=0.197 ms
  unicast from 172.17.16.9, seq=3 dist=0 time=0.200 ms
  unicast from 172.17.16.9, seq=4 dist=0 time=0.208 ms
  unicast from 172.17.16.9, seq=5 dist=0 time=0.139 ms


I've verified again that each /etc/hosts has the correct information.
Any ideas on how to restore the cluster, given that it had been working for weeks and no changes were made to the switch config?

Thanks a lot
 
I'm seeing the same thing. I had a backup fail because a disk of an unused VM was moved. Since then, the node has been red and I haven't been able to recover it. I made sure to unlock all VMs on that node, and I stopped and started pve-cluster and cman (on the red node only). The web interface of this node and of the others in the cluster shows this node as red, with no details on any of its VMs. Here is the output from the red node (node 3):
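(For the record, the unlock and service bounce amounted to roughly the following; the VMID below is just an example, not one of my real IDs:)

Code:
qm unlock 100        # repeated for each VM that was still locked
service pve-cluster stop
service cman stop
service cman start
service pve-cluster start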

root@proxmox3:/var/lib/vz/images# service cman status
cluster is running.
root@proxmox3:/var/lib/vz/images# cat /etc/pve/.members
{
"nodename": "proxmox3",
"version": 7,
"cluster": { "name": "connectify", "version": 5, "nodes": 5, "quorate": 1 },
"nodelist": {
"proxmox5": { "id": 1, "online": 1, "ip": "192.168.202.231"},
"proxmox4": { "id": 2, "online": 1, "ip": "192.168.202.244"},
"proxmox3": { "id": 3, "online": 1, "ip": "192.168.202.243"},
"proxmox2": { "id": 4, "online": 1, "ip": "192.168.202.242"},
"proxmox1": { "id": 5, "online": 1, "ip": "192.168.202.241"}
}
}
root@proxmox3:/var/lib/vz/images# pvecm status
Version: 6.2.0
Config Version: 5
Cluster Name: ******
Cluster Id: *******
Cluster Member: Yes
Cluster Generation: 57296
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Node votes: 1
Quorum: 3
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: proxmox3
Node ID: 3
Multicast addresses: 239.192.160.122
Node addresses: 192.168.202.243




And here is the output from another node in the cluster:

root@proxmox5:~# service pve-cluster status
Checking status of pve cluster filesystem: pve-cluster running.
root@proxmox5:~# service cman status
cluster is running.
root@proxmox5:~# cat /etc/pve/.members
{
"nodename": "proxmox5",
"version": 16,
"cluster": { "name": "connectify", "version": 5, "nodes": 5, "quorate": 1 },
"nodelist": {
"proxmox5": { "id": 1, "online": 1, "ip": "192.168.202.231"},
"proxmox4": { "id": 2, "online": 1, "ip": "192.168.202.244"},
"proxmox3": { "id": 3, "online": 1, "ip": "192.168.202.243"},
"proxmox2": { "id": 4, "online": 1, "ip": "192.168.202.242"},
"proxmox1": { "id": 5, "online": 1, "ip": "192.168.202.241"}
}
}


All systems are running VE 2.1.
 
I'm seeing the same thing. I had a backup fail because a disk of an unused VM was moved. Since then, the node has been red and I haven't been able to recover it.

Does it help if you restart pvestatd?

# service pvestatd restart

Besides, all your nodes seem to be running, so maybe it is just a GUI bug - or do you have some real problems?
 
Does it help if you restart pvestatd?

# service pvestatd restart

Besides, all your nodes seem to be running, so maybe it is just a GUI bug - or do you have some real problems?

Holy cow, you guys are awesome. Yes, that fixed it immediately! It was just a GUI bug, but it was worse than just the node showing as red, because all of the VMs on that node were only displayed by number, and showed as powered off, although they were still running. So I couldn't use the GUI to administer anything on that node until I restarted pvestatd. Thanks!
 
I am having the same difficulty: the GUI shows my node as red and all VMs as down while everything is actually running. HOWEVER, I am NOT running HA or a cluster, just a single node. It seems to occur most often after my weekend backup, but this morning (mid-week, no backup run) when I logged on they all showed as down. Running "service pvestatd restart" restores the GUI display to normal. It is interesting that restarting pvestatd always says "cannot kill process xxx, process not found".

Any suggestions?

pve-manager: 2.2-24 (pve-manager/2.2/7f9cfa4c)
running kernel: 2.6.32-16-pve
proxmox-ve-2.6.32: 2.2-80
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-16-pve: 2.6.32-80
pve-kernel-2.6.32-14-pve: 2.6.32-74
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-1
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-28
qemu-server: 2.0-62
pve-firmware: 1.0-21
libpve-common-perl: 1.0-36
libpve-access-control: 1.0-25
libpve-storage-perl: 2.0-34
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.2-7
ksm-control-daemon: 1.1-1
 
I am having the same difficulty: the GUI shows my node as red and all VMs as down while everything is actually running. HOWEVER, I am NOT running HA or a cluster, just a single node. It seems to occur most often after my weekend backup, but this morning (mid-week, no backup run) when I logged on they all showed as down. Running "service pvestatd restart" restores the GUI display to normal. It is interesting that restarting pvestatd always says "cannot kill process xxx, process not found".

Any suggestions?

Upgrade to the latest stable release.
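(On a standard PVE 2.x install, assuming the pve repository is already configured, that is just the usual apt upgrade:)

Code:
apt-get update
apt-get dist-upgrade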
 
We have the most recent stable release, and the pvestatd service does indeed routinely fail. I didn't notice the issue until I implemented backups of my VM and CT images, however; piggy-backing on a previous post in this thread, the strong correlation suggests the service drop could well be triggered by the backup process.

Thanks for the help, and happy to see this was a "non-essential" service issue.

Love the platform, keep up the great work!

Upgrade to the latest stable release.
 
Hello, I have the same issue on a single host with the latest 2.2 version.
The node just shows a red light and there are no names for the running VMs.
A restart fixes the trouble:
root@proxmox:~# service pvestatd restart
Restarting PVE Status Daemon: pvestatdstart-stop-daemon: warning: failed to kill 1741: No such process
.
root@proxmox:~# /etc/init.d/pve
pvebanner pve-cluster pvedaemon pve-manager pvenetcommit pvestatd
root@proxmox:~# /etc/init.d/pvestatd restart
Restarting PVE Status Daemon: pvestatd.
root@proxmox:~#
I don't have any scheduled backups or anything else, just a host with several KVM and OpenVZ guests.
 
As a temporary workaround, I've added a cron job, running every thirty minutes, to start the pvestatd service. If the service is still running, the start will fail and probably add a line to the syslog -- no harm done. If the service is dead, the job will kick it off.
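(In case it helps anyone, the crontab entry is essentially the following; the file name is just what I picked, and the interval is arbitrary:)

Code:
# /etc/cron.d/restart-pvestatd: try to (re)start pvestatd every 30 minutes
*/30 * * * * root /etc/init.d/pvestatd start >/dev/null 2>&1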

On my systems, at least, the correlation between backup tasks and failures is very high. Hopefully the dev team can provide a patch for this, so I can unplug the cron job...!
 
post the output of 'pveversion -v'
 
root@proxmox:~# pveversion -v
pve-manager: 2.2-24 (pve-manager/2.2/7f9cfa4c)
running kernel: 2.6.32-16-pve
proxmox-ve-2.6.32: 2.2-80
pve-kernel-2.6.32-16-pve: 2.6.32-80
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-1
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-28
qemu-server: 2.0-62
pve-firmware: 1.0-21
libpve-common-perl: 1.0-36
libpve-access-control: 1.0-25
libpve-storage-perl: 2.0-34
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.2-7
ksm-control-daemon: 1.1-1
 
