Why did PVE reboot all nodes in my cluster when only 2 needed to be fenced?

budy

Active Member
Jan 31, 2020
A couple of days ago, we experienced an issue with a switch that carried the corosync traffic for two of the 6 PVE hosts in our cluster. I can understand that PVE fenced those two hosts, but why did the other 4 reboot as well? How can I find out what caused all my nodes to reboot?
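For reference, here is a minimal sketch of the majority arithmetic behind that question. It assumes every node carries exactly one vote and that the default majority rule applies (no QDevice or other vote tweaks); the function names are made up for illustration.

```python
def quorum_threshold(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def has_quorum(votes_present: int, total_votes: int) -> bool:
    """True if the visible votes reach the majority threshold."""
    return votes_present >= quorum_threshold(total_votes)


total_nodes = 6   # hosts in the cluster
lost_nodes = 2    # hosts behind the failed switch
remaining = total_nodes - lost_nodes

print(quorum_threshold(total_nodes))         # 4
print(has_quorum(remaining, total_nodes))    # True
print(has_quorum(lost_nodes, total_nodes))   # False
```

Under these assumptions the four remaining nodes would still reach the threshold as long as they can all see each other, so a reboot of all six suggests that more than two nodes lost contact at some point.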

Thanks,
budy
 
If that's the case, then I'd like to find out how that happened, of course, but there seems to be no information about this incident in the logs, at least not in /var/log/... Is there anything else I can check to find out what happened?

When I look at the messages sent by PVE, I would reckon that only these two nodes should have had to be restarted:

Code:
 The node 'hades' failed and needs manual intervention.
 
  The PVE HA manager tries  to fence it and recover the
  configured HA resources to a healthy node if possible.
 
  Current fence status:  FENCE
  Try to fence node 'hades'
 
 
  Overall Cluster status:
  -----------------------
 
  {
     "manager_status" : {
        "master_node" : "hera",
        "node_status" : {
           "hades" : "unknown",
           "hera" : "online",
           "hydra" : "unknown",
           "pan" : "online",
           "pandora" : "online",
           "platon" : "online"
        },
        "service_status" : {
           "ct:102" : {
              "node" : "pan",
              "running" : 1,
              "state" : "started",
              "uid" : "CAH9buAT2dzAA4i0W5ckxQ"
           },
           "vm:100" : {
              "node" : "hydra",
              "running" : 1,
              "state" : "started",
              "uid" : "+vndmLsHOq8DfDQhw42Njg"
           },
           "vm:107" : {
              "node" : "pandora",
              "running" : 1,
              "state" : "started",
              "uid" : "E8FOMEbLi2h1W1lVVCx5fw"
           },
           "vm:108" : {
              "node" : "hades",
              "running" : 1,
              "state" : "started",
              "uid" : "DxuY18OMcx/aGN1HoYgkdQ"
           },
           "vm:109" : {
              "node" : "hera",
              "running" : 1,
              "state" : "started",
              "uid" : "lp1437zKVmvBbTFz296seA"
           },
           "vm:110" : {
              "node" : "hera",
              "running" : 1,
              "state" : "started",
              "uid" : "g1rjbQjdvpu4Jr0g9SnxQw"
           },
           "vm:114" : {
              "node" : "hydra",
              "running" : 1,
              "state" : "started",
              "uid" : "UVaQlpzMi95BIpXt80cXTg"
           },
           "vm:116" : {
              "node" : "hydra",
              "running" : 1,
              "state" : "started",
              "uid" : "6fAbxVyOqNE+pQoPHoVkMw"
           },
           "vm:132" : {
              "node" : "hades",
              "running" : 1,
              "state" : "started",
              "uid" : "oIXoPN9XJC2Xlc8lO1PXHQ"
           },
           "vm:178" : {
              "node" : "platon",
              "running" : 1,
              "state" : "started",
              "uid" : "lUUlchSuSlawCWK5yk27Qw"
           },
           "vm:182" : {
              "node" : "platon",
              "running" : 1,
              "state" : "started",
              "uid" : "PiWYOcNYJD5P5suDExv64g"
           },
           "vm:184" : {
              "node" : "platon",
              "running" : 1,
              "state" : "started",
              "uid" : "4QsvhZHYhzonuGsE6oP+Uw"
           },
           "vm:185" : {
              "node" : "pan",
              "running" : 1,
              "state" : "started",
              "uid" : "RD2xbJT1DS2RWRAXBRm+GQ"
           }
        },
        "timestamp" : 1592916764
     },
     "node_status" : {
        "hades" : "fence",
        "hera" : "online",
        "hydra" : "fence",
        "pan" : "online",
        "pandora" : "online",
        "platon" : "online"
     }
  }
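
To read that status without scrolling through the whole blob, a small sketch like the following can pull out the nodes marked for fencing and the HA services that were on them. It embeds a shortened excerpt of the JSON above (only some services, and only the `node` and `state` fields); the helper names are hypothetical.

```python
import json

# Shortened excerpt of the status mail above; values copied from it.
STATUS_JSON = """
{
  "manager_status": {
    "master_node": "hera",
    "service_status": {
      "ct:102": {"node": "pan",    "state": "started"},
      "vm:100": {"node": "hydra",  "state": "started"},
      "vm:108": {"node": "hades",  "state": "started"},
      "vm:109": {"node": "hera",   "state": "started"},
      "vm:114": {"node": "hydra",  "state": "started"},
      "vm:116": {"node": "hydra",  "state": "started"},
      "vm:132": {"node": "hades",  "state": "started"}
    }
  },
  "node_status": {
    "hades": "fence", "hera": "online", "hydra": "fence",
    "pan": "online", "pandora": "online", "platon": "online"
  }
}
"""


def nodes_to_fence(status: dict) -> list[str]:
    """Nodes whose top-level status is 'fence'."""
    return sorted(n for n, s in status["node_status"].items() if s == "fence")


def services_on(status: dict, nodes: list[str]) -> list[str]:
    """HA service IDs whose current node is in the given list."""
    services = status["manager_status"]["service_status"]
    return sorted(sid for sid, info in services.items() if info["node"] in nodes)


status = json.loads(STATUS_JSON)
fenced = nodes_to_fence(status)
print(fenced)                       # ['hades', 'hydra']
print(services_on(status, fenced))  # ['vm:100', 'vm:108', 'vm:114', 'vm:116', 'vm:132']
```

At least according to this mail, only hades and hydra were marked for fencing, and the other four nodes were still reported as online at that timestamp.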
 
Ahh… I see… the daemon.log seems to hold some information on that… I will investigate that.
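
A rough sketch of how that log could be narrowed down to the interesting lines. The path `/var/log/daemon.log` and the keyword list are assumptions on my part, not something taken from the PVE documentation, so adjust both to whatever the log actually contains.

```python
from pathlib import Path

# Assumed keywords; extend or trim after looking at the real log.
KEYWORDS = ("corosync", "pve-ha-crm", "pve-ha-lrm", "watchdog", "quorum")


def matching_lines(lines, keywords=KEYWORDS):
    """Return the lines that mention any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [
        line.rstrip("\n")
        for line in lines
        if any(k in line.lower() for k in lowered)
    ]


def scan_log(path):
    """Filter a log file; returns an empty list if it cannot be read."""
    try:
        with Path(path).open(errors="replace") as handle:
            return matching_lines(handle)
    except OSError:
        return []


if __name__ == "__main__":
    # Assumed location of the log on the node being inspected.
    for line in scan_log("/var/log/daemon.log"):
        print(line)
```

Running this on each node and comparing the timestamps around the switch outage should show which node lost its corosync link or quorum first.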
 
So, I think the issue came up because two of my nodes have their corosync link on the network that became unresponsive. I have already tried to figure out how to change the ring0 address so that those nodes are in line with the others.

I think I have two options. I could either:

  • remove these two nodes from the cluster and re-add them using the dedicated LAN interface
  • tweak the corosync.conf and try to "win them over" this way.
AFAIK, I should disable HA before trying the second option, but I haven't come across a good description of how to disable HA temporarily. Would it be sufficient to disable HA by stopping these three services:

  • pve-ha-crm
  • pve-ha-lrm
  • corosync
Thanks,
budy
 
