cluster hangs on /etc/pve - how to figure out which node causes it

Hi,

from time to time our whole cluster goes grey and "df -h" gets stuck right at /etc/pve.
It takes us hours to figure out which node is causing it, mostly by connecting to every node and stopping corosync / pve-cluster there.
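
To avoid logging into every node by hand, a rough sketch of a read-only probe (assuming passwordless SSH between the nodes and a hypothetical nodes.txt listing all node addresses) could look like this:
Code:
# probe /etc/pve on every node; a node whose pmxcfs mount is stuck will time out here
for node in $(cat nodes.txt); do
    printf '%s: ' "$node"
    ssh -o ConnectTimeout=5 "$node" 'timeout 5 stat -t /etc/pve >/dev/null 2>&1 && echo ok || echo HANGING'
done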

Is there a way to figure out the hanging node through a log file?
 
from time to time our whole cluster goes grey and "df -h" gets stuck right at /etc/pve.
How often does this happen? Sounds like an issue with the /etc/pve/ FUSE mount. Does the cluster lose quorum? Are there any suspicious entries in the syslog?
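
The next time it happens, both questions can be answered with the standard tools on any node (the 2-hour window below is just an example):
Code:
pvecm status                                        # quorum state and current member list
journalctl -u corosync -u pve-cluster --since "-2h" # recent corosync / pve-cluster log entries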
 
It happens about once every 1-3 months. We have a big cluster with 40 nodes now, and there are different reasons why this happens. Broken disk, failing network card, etc. So isn't there a way to find the failing node without going through all nodes?
 
Broken disk, failing network card, etc.
This most definitely should not lead to the behavior you describe; one would expect the cluster to see the single failed node go down without losing quorum, unless it floods your network with bad packets. Can you also post the output of `cat /etc/pve/corosync.conf`?
Also, did you test the network with omping for some time? https://pve.proxmox.com/wiki/Multicast_notes#Using_omping_to_test_multicast
Maybe check the syslogs the next time this happens, along with the output of `pvecm status`.
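
For reference, the omping test from that wiki page has to be started on all nodes at roughly the same time, with every node listed; the invocations look roughly like this (node1/node2/node3 stand in for your hostnames or ring addresses):
Code:
# short, fast test, run simultaneously on every node, listing all of them
omping -c 10000 -i 0.001 -F -q node1 node2 node3
# longer test (about 10 minutes) to catch IGMP snooping / querier problems
omping -c 600 -i 1 -q node1 node2 node3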
 
Today a node hung again and the whole cluster was red. A technician fixed the node too quickly for us to check the syslog or pvecm.
Here is the output of the corosync.conf:
root@pve-ffm:~# cat /etc/pve/corosync.conf
nodelist {
  node {
    name: bondbabe001-74050-bl06
    nodeid: 16
    quorum_votes: 1
    ring0_addr: 10.10.10.52
  }
  node {
    name: bondbabe002-72011-bl12
    nodeid: 23
    quorum_votes: 1
    ring0_addr: 10.10.10.76
  }
  node {
    name: bondsir001-72011-bl14
    nodeid: 17
    quorum_votes: 1
    ring0_addr: 10.10.10.41
  }
  node {
    name: bondsir002-72001-bl08
    nodeid: 21
    quorum_votes: 1
    ring0_addr: 10.10.10.35
  }
  node {
    name: bondsir003-74050-bl10
    nodeid: 25
    quorum_votes: 1
    ring0_addr: 10.10.10.166
  }
  node {
    name: bondsir004-74050-bl11
    nodeid: 26
    quorum_votes: 1
    ring0_addr: 10.10.10.93
  }
  node {
    name: bondsir005-74050-bl16
    nodeid: 30
    quorum_votes: 1
    ring0_addr: 10.10.10.19
  }
  node {
    name: captive001-72001-bl12
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.10.10.38
  }
  node {
    name: captive002-77015
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }
  node {
    name: captive003-77030
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.10.10.33
  }
  node {
    name: captive004-77028
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.10.10.89
  }
  node {
    name: captive005-74001
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.10.10.67
  }
  node {
    name: captive006-72011-bl09
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.10.10.16
  }
  node {
    name: captive007-72001-bl11
    nodeid: 12
    quorum_votes: 1
    ring0_addr: 10.10.10.18
  }
  node {
    name: captive008-74005
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 10.10.10.143
  }
  node {
    name: captive009-77014
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 10.10.10.62
  }
  node {
    name: captive010-74050-bl14
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.25
  }
  node {
    name: captive011-74007
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 10.10.10.7
  }
  node {
    name: captive012-72011-bl06
    nodeid: 13
    quorum_votes: 1
    ring0_addr: 10.10.10.34
  }
  node {
    name: captive013-74050-bl08
    nodeid: 14
    quorum_votes: 1
    ring0_addr: 10.10.10.193
  }
  node {
    name: captive014-72001-bl15
    nodeid: 15
    quorum_votes: 1
    ring0_addr: 10.10.10.92
  }
  node {
    name: captive015-74050-bl05
    nodeid: 18
    quorum_votes: 1
    ring0_addr: 10.10.10.232
  }
  node {
    name: captive016-72001-bl01
    nodeid: 19
    quorum_votes: 1
    ring0_addr: 10.10.10.11
  }
  node {
    name: captive017-74050-bl09
    nodeid: 20
    quorum_votes: 1
    ring0_addr: 10.10.10.151
  }
  node {
    name: captive018-72001-bl04
    nodeid: 22
    quorum_votes: 1
    ring0_addr: 10.10.10.15
  }
  node {
    name: captive019-74050-bl12
    nodeid: 24
    quorum_votes: 1
    ring0_addr: 10.10.10.84
  }
  node {
    name: captive020-74050-bl13
    nodeid: 28
    quorum_votes: 1
    ring0_addr: 10.10.10.21
  }
  node {
    name: captive021-74050-bl15-rev2
    nodeid: 27
    quorum_votes: 1
    ring0_addr: 10.10.10.252
  }
  node {
    name: captive022-79001-bl01
    nodeid: 29
    quorum_votes: 1
    ring0_addr: 10.10.10.219
  }
  node {
    name: captive023-79001-bl03-lvmthin
    nodeid: 31
    quorum_votes: 1
    ring0_addr: captive023-79001-bl03-lvmthin
  }
  node {
    name: pve-ffm
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }
}

logging {
  debug: off
  to_syslog: yes
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: fra01
  config_version: 84
  interface {
    bindnetaddr: 10.10.10.1
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

When I try omping, I get:
root@pve-ffm:~# omping -c 10000 -i 0.001 -F -q 10.10.10.1 10.10.10.2
10.10.10.2 : waiting for response msg
10.10.10.2 : waiting for response msg
10.10.10.2 : waiting for response msg
10.10.10.2 : waiting for response msg
10.10.10.2 : waiting for response msg
Any ideas?
 

Yes, but you only tested two nodes, so we cannot really extrapolate those results to your full cluster. Also note that omping has to run on all listed nodes at the same time; as long as the other side is not running it, you will only see "waiting for response msg".

It's a bit hard to tell from the outside which exact node may cause this; what would help is checking the syslog/journal around the time this happened and posting it here.
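For example, something along these lines, with the timestamps replaced by the actual incident window:
Code:
journalctl -u corosync -u pve-cluster --since "2019-03-01 08:00" --until "2019-03-01 10:00"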

Is there a way to figure out the hanging node through a log file?

So it's always a single node, and once you restart that one it works again? We fixed a similar-sounding bug in the past; can you post your versions?
Code:
pveversion -v
 
