cluster hangs on /etc/pve - how to figure out which node causes it

Hi,

from time to time our whole cluster goes grey and "df -h" gets stuck right at /etc/pve.
It takes us hours to figure out which node is causing it, mostly by connecting to every node and stopping corosync / pve-cluster there.
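
To avoid logging into every node by hand, a rough sketch of a read-only probe (assuming passwordless SSH between the nodes and a hypothetical nodes.txt listing all node addresses) could look like this:
Code:
# probe /etc/pve on every node; a node whose pmxcfs mount is stuck will time out here
for node in $(cat nodes.txt); do
    printf '%s: ' "$node"
    ssh -o ConnectTimeout=5 "$node" 'timeout 5 stat -t /etc/pve >/dev/null 2>&1 && echo ok || echo HANGING'
done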

Is there a way to figure out the hanging node through a log file?
 
from time to time our whole cluster goes grey and "df -h" gets stuck right at /etc/pve.
How often does this happen? Sounds like an issue with the /etc/pve/ FUSE mount. Does the cluster lose quorum? Are there any suspicious entries in the syslog?
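
The next time it happens, both questions can be answered with the standard tools on any node (the 2-hour window below is just an example):
Code:
pvecm status                                        # quorum state and current member list
journalctl -u corosync -u pve-cluster --since "-2h" # recent corosync / pve-cluster log entries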
 
It happens about once every 1-3 months. We have a big cluster with 40 nodes now, and there are different reasons why this happens. Broken disk, failing network card, etc. So isn't there a way to find the failing node without going through all nodes?
 
Broken disk, failing network card, etc.
This most definitely should not lead to the behavior you describe; one would expect the cluster to see the single failed node go down without losing quorum, unless it floods your network with bad packets. Can you also post the output of `cat /etc/pve/corosync.conf`?
Also, did you test the network with omping for some time? https://pve.proxmox.com/wiki/Multicast_notes#Using_omping_to_test_multicast
Maybe check the syslogs the next time this happens, along with the output of `pvecm status`.
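
For reference, the omping test from that wiki page has to be started on all nodes at roughly the same time, with every node listed; the invocations look roughly like this (node1/node2/node3 stand in for your hostnames or ring addresses):
Code:
# short, fast test, run simultaneously on every node, listing all of them
omping -c 10000 -i 0.001 -F -q node1 node2 node3
# longer test (about 10 minutes) to catch IGMP snooping / querier problems
omping -c 600 -i 1 -q node1 node2 node3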
 
Today a node hung again and the whole cluster was red. A technician fixed the node too quickly for us to check the syslog or pvecm.
Here is the output of the corosync.conf:
root@pve-ffm:~# cat /etc/pve/corosync.conf
nodelist {
  node {
    name: bondbabe001-74050-bl06
    nodeid: 16
    quorum_votes: 1
    ring0_addr: 10.10.10.52
  }
  node {
    name: bondbabe002-72011-bl12
    nodeid: 23
    quorum_votes: 1
    ring0_addr: 10.10.10.76
  }
  node {
    name: bondsir001-72011-bl14
    nodeid: 17
    quorum_votes: 1
    ring0_addr: 10.10.10.41
  }
  node {
    name: bondsir002-72001-bl08
    nodeid: 21
    quorum_votes: 1
    ring0_addr: 10.10.10.35
  }
  node {
    name: bondsir003-74050-bl10
    nodeid: 25
    quorum_votes: 1
    ring0_addr: 10.10.10.166
  }
  node {
    name: bondsir004-74050-bl11
    nodeid: 26
    quorum_votes: 1
    ring0_addr: 10.10.10.93
  }
  node {
    name: bondsir005-74050-bl16
    nodeid: 30
    quorum_votes: 1
    ring0_addr: 10.10.10.19
  }
  node {
    name: captive001-72001-bl12
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.10.10.38
  }
  node {
    name: captive002-77015
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }
  node {
    name: captive003-77030
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.10.10.33
  }
  node {
    name: captive004-77028
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.10.10.89
  }
  node {
    name: captive005-74001
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.10.10.67
  }
  node {
    name: captive006-72011-bl09
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.10.10.16
  }
  node {
    name: captive007-72001-bl11
    nodeid: 12
    quorum_votes: 1
    ring0_addr: 10.10.10.18
  }
  node {
    name: captive008-74005
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 10.10.10.143
  }
  node {
    name: captive009-77014
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 10.10.10.62
  }
  node {
    name: captive010-74050-bl14
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.25
  }
  node {
    name: captive011-74007
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 10.10.10.7
  }
  node {
    name: captive012-72011-bl06
    nodeid: 13
    quorum_votes: 1
    ring0_addr: 10.10.10.34
  }
  node {
    name: captive013-74050-bl08
    nodeid: 14
    quorum_votes: 1
    ring0_addr: 10.10.10.193
  }
  node {
    name: captive014-72001-bl15
    nodeid: 15
    quorum_votes: 1
    ring0_addr: 10.10.10.92
  }
  node {
    name: captive015-74050-bl05
    nodeid: 18
    quorum_votes: 1
    ring0_addr: 10.10.10.232
  }
  node {
    name: captive016-72001-bl01
    nodeid: 19
    quorum_votes: 1
    ring0_addr: 10.10.10.11
  }
  node {
    name: captive017-74050-bl09
    nodeid: 20
    quorum_votes: 1
    ring0_addr: 10.10.10.151
  }
  node {
    name: captive018-72001-bl04
    nodeid: 22
    quorum_votes: 1
    ring0_addr: 10.10.10.15
  }
  node {
    name: captive019-74050-bl12
    nodeid: 24
    quorum_votes: 1
    ring0_addr: 10.10.10.84
  }
  node {
    name: captive020-74050-bl13
    nodeid: 28
    quorum_votes: 1
    ring0_addr: 10.10.10.21
  }
  node {
    name: captive021-74050-bl15-rev2
    nodeid: 27
    quorum_votes: 1
    ring0_addr: 10.10.10.252
  }
  node {
    name: captive022-79001-bl01
    nodeid: 29
    quorum_votes: 1
    ring0_addr: 10.10.10.219
  }
  node {
    name: captive023-79001-bl03-lvmthin
    nodeid: 31
    quorum_votes: 1
    ring0_addr: captive023-79001-bl03-lvmthin
  }
  node {
    name: pve-ffm
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }
}

logging {
  debug: off
  to_syslog: yes
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: fra01
  config_version: 84
  interface {
    bindnetaddr: 10.10.10.1
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

When I try omping, I get:
root@pve-ffm:~# omping -c 10000 -i 0.001 -F -q 10.10.10.1 10.10.10.2
10.10.10.2 : waiting for response msg
10.10.10.2 : waiting for response msg
10.10.10.2 : waiting for response msg
10.10.10.2 : waiting for response msg
10.10.10.2 : waiting for response msg
Any ideas?
 

Yes, but you only tested two nodes, so we cannot really extrapolate those results to your full cluster. Also note that omping has to run on all listed nodes at the same time; as long as the other side is not running it, you will only see "waiting for response msg".

It's a bit hard to tell from the outside which exact node may cause this; what would help is checking the syslog/journal around the time this happened and posting it here.
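For example, something along these lines, with the timestamps replaced by the actual incident window:
Code:
journalctl -u corosync -u pve-cluster --since "2019-03-01 08:00" --until "2019-03-01 10:00"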

Is there a way to figure out the hanging node through a log file?

So it's always a single node, and once you restart that one it works again? We fixed a similar-sounding bug in the past; can you post your versions?
Code:
pveversion -v
 
