cluster hangs on /etc/pve: figure out which node causes it

Discussion in 'Proxmox VE: Installation and configuration' started by encore, Apr 23, 2019.

  1. encore

    encore Member

    Joined:
    May 4, 2018
    Messages:
    69
    Likes Received:
    0
    Hi,

    from time to time our whole cluster goes grey and "df -h" hangs right before /etc/pve.
    It takes us hours to figure out which node is causing it, mostly by connecting to every node and stopping corosync / pve-cluster there.

    Is there a way to identify the hanging node through a log file?
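    For reference, a minimal sketch of automating that round trip, assuming passwordless SSH from one admin host and using placeholder addresses (list every node's ring0 address); a node whose /etc/pve FUSE mount is hung should time out instead of answering:
    Code:
    for n in 10.10.10.1 10.10.10.2 10.10.10.25; do    # ...rest of the node addresses
        echo -n "$n: "
        ssh root@$n 'timeout 5 ls /etc/pve >/dev/null && echo ok || echo HUNG'
    done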
     
  2. Chris

    Chris Proxmox Staff Member
    Staff Member

    Joined:
    Jan 2, 2019
    Messages:
    205
    Likes Received:
    22
    How often does this happen? It sounds like an issue with the /etc/pve/ FUSE mount. Does the cluster lose quorum? Are there any suspicious entries in the syslog?
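    The next time it happens, something along these lines could be checked on a node (pve-cluster is the service behind pmxcfs, which provides the /etc/pve FUSE mount):
    Code:
    pvecm status                            # quorum state, votes, member list
    systemctl status pve-cluster corosync   # is pmxcfs / corosync still healthy?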
     
  3. encore

    encore Member

    Joined:
    May 4, 2018
    Messages:
    69
    Likes Received:
    0
    It happens about once every 1-3 months. We have a big cluster with 40 nodes now, and there are different reasons why it happens: a broken disk, a failing network card, etc. So isn't there a way to find the failing node without going through all nodes?
     
  4. Chris

    Chris Proxmox Staff Member
    Staff Member

    Joined:
    Jan 2, 2019
    Messages:
    205
    Likes Received:
    22
    This most definitely should not lead to the behavior you describe; one would expect the cluster to see the single failed node go down without losing quorum, unless it floods your network with bad packets. Can you also post the output of `cat /etc/pve/corosync.conf`?
    Also, did you test the network with omping for some time? https://pve.proxmox.com/wiki/Multicast_notes#Using_omping_to_test_multicast
    The next time this happens, check the syslogs and the output of `pvecm status`.
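    Following that wiki page, the test would be started on all nodes at roughly the same time; a hedged example with placeholder hostnames (the list has to contain every cluster member, and the run takes about 10 minutes at one packet per second):
    Code:
    omping -c 600 -i 1 -q node1 node2 node3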
     
  5. encore

    encore Member

    Joined:
    May 4, 2018
    Messages:
    69
    Likes Received:
    0
    Today a node hung again and the whole cluster went red. A tech fixed the node too quickly for us to check the syslog or pvecm.
    Here is the output of the corosync.conf:
    root@pve-ffm:~# cat /etc/pve/corosync.conf
    nodelist {
      node {
        name: bondbabe001-74050-bl06
        nodeid: 16
        quorum_votes: 1
        ring0_addr: 10.10.10.52
      }
      node {
        name: bondbabe002-72011-bl12
        nodeid: 23
        quorum_votes: 1
        ring0_addr: 10.10.10.76
      }
      node {
        name: bondsir001-72011-bl14
        nodeid: 17
        quorum_votes: 1
        ring0_addr: 10.10.10.41
      }
      node {
        name: bondsir002-72001-bl08
        nodeid: 21
        quorum_votes: 1
        ring0_addr: 10.10.10.35
      }
      node {
        name: bondsir003-74050-bl10
        nodeid: 25
        quorum_votes: 1
        ring0_addr: 10.10.10.166
      }
      node {
        name: bondsir004-74050-bl11
        nodeid: 26
        quorum_votes: 1
        ring0_addr: 10.10.10.93
      }
      node {
        name: bondsir005-74050-bl16
        nodeid: 30
        quorum_votes: 1
        ring0_addr: 10.10.10.19
      }
      node {
        name: captive001-72001-bl12
        nodeid: 7
        quorum_votes: 1
        ring0_addr: 10.10.10.38
      }
      node {
        name: captive002-77015
        nodeid: 3
        quorum_votes: 1
        ring0_addr: 10.10.10.2
      }
      node {
        name: captive003-77030
        nodeid: 4
        quorum_votes: 1
        ring0_addr: 10.10.10.33
      }
      node {
        name: captive004-77028
        nodeid: 5
        quorum_votes: 1
        ring0_addr: 10.10.10.89
      }
      node {
        name: captive005-74001
        nodeid: 6
        quorum_votes: 1
        ring0_addr: 10.10.10.67
      }
      node {
        name: captive006-72011-bl09
        nodeid: 8
        quorum_votes: 1
        ring0_addr: 10.10.10.16
      }
      node {
        name: captive007-72001-bl11
        nodeid: 12
        quorum_votes: 1
        ring0_addr: 10.10.10.18
      }
      node {
        name: captive008-74005
        nodeid: 9
        quorum_votes: 1
        ring0_addr: 10.10.10.143
      }
      node {
        name: captive009-77014
        nodeid: 10
        quorum_votes: 1
        ring0_addr: 10.10.10.62
      }
      node {
        name: captive010-74050-bl14
        nodeid: 2
        quorum_votes: 1
        ring0_addr: 10.10.10.25
      }
      node {
        name: captive011-74007
        nodeid: 11
        quorum_votes: 1
        ring0_addr: 10.10.10.7
      }
      node {
        name: captive012-72011-bl06
        nodeid: 13
        quorum_votes: 1
        ring0_addr: 10.10.10.34
      }
      node {
        name: captive013-74050-bl08
        nodeid: 14
        quorum_votes: 1
        ring0_addr: 10.10.10.193
      }
      node {
        name: captive014-72001-bl15
        nodeid: 15
        quorum_votes: 1
        ring0_addr: 10.10.10.92
      }
      node {
        name: captive015-74050-bl05
        nodeid: 18
        quorum_votes: 1
        ring0_addr: 10.10.10.232
      }
      node {
        name: captive016-72001-bl01
        nodeid: 19
        quorum_votes: 1
        ring0_addr: 10.10.10.11
      }
      node {
        name: captive017-74050-bl09
        nodeid: 20
        quorum_votes: 1
        ring0_addr: 10.10.10.151
      }
      node {
        name: captive018-72001-bl04
        nodeid: 22
        quorum_votes: 1
        ring0_addr: 10.10.10.15
      }
      node {
        name: captive019-74050-bl12
        nodeid: 24
        quorum_votes: 1
        ring0_addr: 10.10.10.84
      }
      node {
        name: captive020-74050-bl13
        nodeid: 28
        quorum_votes: 1
        ring0_addr: 10.10.10.21
      }
      node {
        name: captive021-74050-bl15-rev2
        nodeid: 27
        quorum_votes: 1
        ring0_addr: 10.10.10.252
      }
      node {
        name: captive022-79001-bl01
        nodeid: 29
        quorum_votes: 1
        ring0_addr: 10.10.10.219
      }
      node {
        name: captive023-79001-bl03-lvmthin
        nodeid: 31
        quorum_votes: 1
        ring0_addr: captive023-79001-bl03-lvmthin
      }
      node {
        name: pve-ffm
        nodeid: 1
        quorum_votes: 1
        ring0_addr: 10.10.10.1
      }
    }

    logging {
      debug: off
      to_syslog: yes
    }

    quorum {
      provider: corosync_votequorum
    }

    totem {
      cluster_name: fra01
      config_version: 84
      interface {
        bindnetaddr: 10.10.10.1
        ringnumber: 0
      }
      ip_version: ipv4
      secauth: on
      version: 2
    }

    When I try omping, I get:
    Any ideas?
     
  6. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    1,111
    Likes Received:
    88
    You need to run omping on all nodes in parallel.
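    A rough sketch of launching it everywhere at once, assuming passwordless SSH and a placeholder $NODES list containing every cluster member:
    Code:
    NODES="node1 node2 node3"                 # placeholder: all cluster members
    for n in $NODES; do
        ssh root@$n "omping -c 600 -i 1 -q $NODES" > omping-$n.log 2>&1 &
    done
    wait                                      # results end up in omping-<node>.log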
     
  7. encore

    encore Member

    Joined:
    May 4, 2018
    Messages:
    69
    Likes Received:
    0
  8. t.lamprecht

    t.lamprecht Proxmox Staff Member
    Staff Member

    Joined:
    Jul 28, 2015
    Messages:
    1,135
    Likes Received:
    147
    Yes, but you only tested two nodes, so we cannot really extrapolate those results to your full cluster.

    It's a bit hard to tell from the outside which exact node may be causing this; what would help is checking the syslog/journal around a time this happened and posting it here.
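    For example, something like this could pull the relevant window out of the journal (the timestamps are placeholders for the time of the hang):
    Code:
    journalctl -u corosync -u pve-cluster --since "2019-04-23 09:00" --until "2019-04-23 11:00"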

    So it's always a single node, and once you restart that one it works again? We fixed a similar-sounding bug in the past. Can you post your versions:
    Code:
    pveversion -v
     