Why did PVE reboot all nodes in my cluster when only 2 needed to be fenced?

budy

Well-Known Member
Jan 31, 2020
A couple of days ago, we experienced an issue with a switch which carried the corosync traffic for two of the 6 PVE hosts in our cluster. I can understand that PVE fenced those two hosts, but why did the other 4 reboot as well? How can I find out what caused all my nodes to reboot?

Thanks,
budy
 
If that's the case, then I'd like to find out how that happened, of course, but there seems to be no information about this incident in the logs – at least not in /var/log/... Is there anything else I can check to see what happened?
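For reference, this is roughly what I would check next – just a sketch; I'm assuming persistent journaling is enabled (otherwise journalctl -b -1 has nothing to show) and that watchdog-mux is the relevant unit name on PVE:

Code:
 # list the boots the journal knows about
 journalctl --list-boots
 
 # end of the previous boot (-b -1), filtered to the HA/watchdog units
 journalctl -b -1 -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux | tail -n 100
 
 # corosync/HA messages also land in the classic log files
 grep -iE 'fence|quorum|watchdog' /var/log/daemon.log /var/log/syslog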

When I look at the messages sent by PVE, I would reckon that only these two nodes would have had to be restarted:

Code:
 The node 'hades' failed and needs manual intervention.
 
  The PVE HA manager tries  to fence it and recover the
  configured HA resources to a healthy node if possible.
 
  Current fence status:  FENCE
  Try to fence node 'hades'
 
 
  Overall Cluster status:
  -----------------------
 
  {
     "manager_status" : {
        "master_node" : "hera",
        "node_status" : {
           "hades" : "unknown",
           "hera" : "online",
           "hydra" : "unknown",
           "pan" : "online",
           "pandora" : "online",
           "platon" : "online"
        },
        "service_status" : {
           "ct:102" : {
              "node" : "pan",
              "running" : 1,
              "state" : "started",
              "uid" : "CAH9buAT2dzAA4i0W5ckxQ"
           },
           "vm:100" : {
              "node" : "hydra",
              "running" : 1,
              "state" : "started",
              "uid" : "+vndmLsHOq8DfDQhw42Njg"
           },
           "vm:107" : {
              "node" : "pandora",
              "running" : 1,
              "state" : "started",
              "uid" : "E8FOMEbLi2h1W1lVVCx5fw"
           },
           "vm:108" : {
              "node" : "hades",
              "running" : 1,
              "state" : "started",
              "uid" : "DxuY18OMcx/aGN1HoYgkdQ"
           },
           "vm:109" : {
              "node" : "hera",
              "running" : 1,
              "state" : "started",
              "uid" : "lp1437zKVmvBbTFz296seA"
           },
           "vm:110" : {
              "node" : "hera",
              "running" : 1,
              "state" : "started",
              "uid" : "g1rjbQjdvpu4Jr0g9SnxQw"
           },
           "vm:114" : {
              "node" : "hydra",
              "running" : 1,
              "state" : "started",
              "uid" : "UVaQlpzMi95BIpXt80cXTg"
           },
           "vm:116" : {
              "node" : "hydra",
              "running" : 1,
              "state" : "started",
              "uid" : "6fAbxVyOqNE+pQoPHoVkMw"
           },
           "vm:132" : {
              "node" : "hades",
              "running" : 1,
              "state" : "started",
              "uid" : "oIXoPN9XJC2Xlc8lO1PXHQ"
           },
           "vm:178" : {
              "node" : "platon",
              "running" : 1,
              "state" : "started",
              "uid" : "lUUlchSuSlawCWK5yk27Qw"
           },
           "vm:182" : {
              "node" : "platon",
              "running" : 1,
              "state" : "started",
              "uid" : "PiWYOcNYJD5P5suDExv64g"
           },
           "vm:184" : {
              "node" : "platon",
              "running" : 1,
              "state" : "started",
              "uid" : "4QsvhZHYhzonuGsE6oP+Uw"
           },
           "vm:185" : {
              "node" : "pan",
              "running" : 1,
              "state" : "started",
              "uid" : "RD2xbJT1DS2RWRAXBRm+GQ"
           }
        },
        "timestamp" : 1592916764
     },
     "node_status" : {
        "hades" : "fence",
        "hera" : "online",
        "hydra" : "fence",
        "pan" : "online",
        "pandora" : "online",
        "platon" : "online"
     }
  }
 
Ahh… I see… the daemon.log seems to hold some information on that… I will investigate that.
 
So, I think the issue came up because two of my nodes are on that network which became unresponsive, and I have already tried to figure out how to change the ring0 address so that those nodes are in line with the others.

I think I have two options… I could either:

  • remove these two nodes from the cluster and re-add them using the dedicated LAN interface
  • tweak the corosync.conf and try to "win them over" this way (see the sketch below).
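For the second option, this is roughly the kind of change I have in mind in /etc/pve/corosync.conf – the addresses and node IDs below are only placeholders, and AFAIK the config_version in the totem section has to be increased on every edit:

Code:
 totem {
   config_version: 12   # placeholder - must be bumped whenever the file is edited
   ...
 }
 
 nodelist {
   node {
     name: hades
     nodeid: 4                 # placeholder
     quorum_votes: 1
     ring0_addr: 10.10.10.14   # new address on the dedicated LAN (placeholder)
   }
   # ... same change for 'hydra', the other nodes stay untouched
 }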
AFAIK, I should disable HA when trying the 2nd option, but I haven't come across a good description of how to disable HA temporarily. Would it be sufficient to disable HA by stopping these three services (see the sketch after the list):

  • pve-ha-crm
  • pve-ha-lrm
  • corosync
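In other words, something like this, run on each node – stopping the LRM before the CRM is my guess at the right order, and that is exactly the part I'd like to have confirmed:

Code:
 # on every node: stop the HA stack first, then corosync
 systemctl stop pve-ha-lrm
 systemctl stop pve-ha-crm
 systemctl stop corosync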
Thanks,
budy