Why did PVE reboot all nodes in my cluster when only 2 needed to be fenced?

budy

Well-Known Member
Jan 31, 2020
A couple of days ago, we experienced an issue with a switch which carried the corosync traffic for two of the 6 PVE hosts in our cluster. I can understand that PVE fenced those two hosts, but why did the other 4 reboot as well? How can I find out what caused all my nodes to reboot?

Thanks,
budy
 
If that's the case, then I'd like to find out how that happened, of course, but there seems to be no information about this incident in the logs – at least not in /var/log/... Is there anything else I can check to see what happened?
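For reference, this is roughly what I would check next – just a sketch; I'm assuming persistent journaling is enabled (otherwise journalctl -b -1 has nothing to show) and that watchdog-mux is the relevant unit name on PVE:

Code:
 # list the boots the journal knows about
 journalctl --list-boots
 
 # end of the previous boot (-b -1), filtered to the HA/watchdog units
 journalctl -b -1 -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux | tail -n 100
 
 # corosync/HA messages also land in the classic log files
 grep -iE 'fence|quorum|watchdog' /var/log/daemon.log /var/log/syslog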

When I look at the messages sent by PVE, I would reckon that only these two nodes would have had to be restarted:

Code:
 The node 'hades' failed and needs manual intervention.
 
  The PVE HA manager tries  to fence it and recover the
  configured HA resources to a healthy node if possible.
 
  Current fence status:  FENCE
  Try to fence node 'hades'
 
 
  Overall Cluster status:
  -----------------------
 
  {
     "manager_status" : {
        "master_node" : "hera",
        "node_status" : {
           "hades" : "unknown",
           "hera" : "online",
           "hydra" : "unknown",
           "pan" : "online",
           "pandora" : "online",
           "platon" : "online"
        },
        "service_status" : {
           "ct:102" : {
              "node" : "pan",
              "running" : 1,
              "state" : "started",
              "uid" : "CAH9buAT2dzAA4i0W5ckxQ"
           },
           "vm:100" : {
              "node" : "hydra",
              "running" : 1,
              "state" : "started",
              "uid" : "+vndmLsHOq8DfDQhw42Njg"
           },
           "vm:107" : {
              "node" : "pandora",
              "running" : 1,
              "state" : "started",
              "uid" : "E8FOMEbLi2h1W1lVVCx5fw"
           },
           "vm:108" : {
              "node" : "hades",
              "running" : 1,
              "state" : "started",
              "uid" : "DxuY18OMcx/aGN1HoYgkdQ"
           },
           "vm:109" : {
              "node" : "hera",
              "running" : 1,
              "state" : "started",
              "uid" : "lp1437zKVmvBbTFz296seA"
           },
           "vm:110" : {
              "node" : "hera",
              "running" : 1,
              "state" : "started",
              "uid" : "g1rjbQjdvpu4Jr0g9SnxQw"
           },
           "vm:114" : {
              "node" : "hydra",
              "running" : 1,
              "state" : "started",
              "uid" : "UVaQlpzMi95BIpXt80cXTg"
           },
           "vm:116" : {
              "node" : "hydra",
              "running" : 1,
              "state" : "started",
              "uid" : "6fAbxVyOqNE+pQoPHoVkMw"
           },
           "vm:132" : {
              "node" : "hades",
              "running" : 1,
              "state" : "started",
              "uid" : "oIXoPN9XJC2Xlc8lO1PXHQ"
           },
           "vm:178" : {
              "node" : "platon",
              "running" : 1,
              "state" : "started",
              "uid" : "lUUlchSuSlawCWK5yk27Qw"
           },
           "vm:182" : {
              "node" : "platon",
              "running" : 1,
              "state" : "started",
              "uid" : "PiWYOcNYJD5P5suDExv64g"
           },
           "vm:184" : {
              "node" : "platon",
              "running" : 1,
              "state" : "started",
              "uid" : "4QsvhZHYhzonuGsE6oP+Uw"
           },
           "vm:185" : {
              "node" : "pan",
              "running" : 1,
              "state" : "started",
              "uid" : "RD2xbJT1DS2RWRAXBRm+GQ"
           }
        },
        "timestamp" : 1592916764
     },
     "node_status" : {
        "hades" : "fence",
        "hera" : "online",
        "hydra" : "fence",
        "pan" : "online",
        "pandora" : "online",
        "platon" : "online"
     }
  }
 
Ahh… I see… the daemon.log seems to hold some information on that… I will investigate that.
 
So, I think the issue came up because two of my nodes are on that network which became unresponsive, and I have already tried to figure out how to change the ring0 address so that those nodes are in line with the others.

I think I have two options… I could either:

  • remove these two nodes from the cluster and re-add them using the dedicated LAN interface
  • tweak the corosync.conf and try to "win them over" this way (see the sketch below).
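For the second option, this is roughly the kind of change I have in mind in /etc/pve/corosync.conf – the addresses and node IDs below are only placeholders, and AFAIK the config_version in the totem section has to be increased on every edit:

Code:
 totem {
   config_version: 12   # placeholder - must be bumped whenever the file is edited
   ...
 }
 
 nodelist {
   node {
     name: hades
     nodeid: 4                 # placeholder
     quorum_votes: 1
     ring0_addr: 10.10.10.14   # new address on the dedicated LAN (placeholder)
   }
   # ... same change for 'hydra', the other nodes stay untouched
 }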
AFAIK, I should disable HA when trying the 2nd option, but I haven't come across a good description of how to disable HA temporarily. Would it be sufficient to disable HA by stopping these three services (see the sketch after the list):

  • pve-ha-crm
  • pve-ha-lrm
  • corosync
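In other words, something like this, run on each node – stopping the LRM before the CRM is my guess at the right order, and that is exactly the part I'd like to have confirmed:

Code:
 # on every node: stop the HA stack first, then corosync
 systemctl stop pve-ha-lrm
 systemctl stop pve-ha-crm
 systemctl stop corosync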
Thanks,
budy