PMX7.0 - HA - preventing entire cluster reboot

dlasher

Renowned Member
Mar 23, 2011
pve-manager/7.0-11/63d82f4e (running kernel: 5.11.22-5-pve) - (5) node cluster, full HA setup, Ceph storage

How do I prevent HA from rebooting the entire cluster?

Code:
 20:05:39 up 22 min,  2 users,  load average: 6.58, 6.91, 5.18
 20:05:39 up 22 min,  1 user,  load average: 4.34, 6.79, 6.23
 20:05:39 up 11 min,  1 user,  load average: 7.18, 6.22, 3.44
 20:05:39 up 22 min,  1 user,  load average: 3.16, 3.54, 3.20
 20:05:39 up 22 min,  2 users,  load average: 1.18, 1.77, 1.93

The 3rd node went offline (I'll be replacing its motherboard later this week). When it came back, corosync was grouchy: it kept reporting "blocked" and saw only a single node. About 2 minutes later, every other node in the cluster rebooted. :(

Code:
1: reboot   system boot  5.11.22-5-pve    Sat Dec  4 19:44   still running
2: reboot   system boot  5.11.22-5-pve    Sat Dec  4 19:44   still running
3: reboot   system boot  5.11.22-5-pve    Sat Dec  4 19:54   still running
4: reboot   system boot  5.11.22-5-pve    Sat Dec  4 19:43   still running
5: reboot   system boot  5.11.22-5-pve    Sat Dec  4 19:44   still running

I ended up having to reboot the 3rd node again to make it happy, but this is NOT the first time PMX has decided the entire cluster needed rebooting. Under no circumstance I can think of should it ever reboot the entire cluster.

How do I stop this in the future?
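For anyone comparing notes, here's roughly what I've been poking at. The usual advice (hedged - verify on your own cluster before relying on it) is that the watchdog only fences a node while the HA LRM is in active mode; once no HA resources are assigned to a node, the LRM goes idle and the watchdog is disarmed. These are standard PVE/systemd commands, not anything exotic:

Code:
```shell
# Show the HA manager's current view of the cluster and its resources
ha-manager status

# Check the services that arm the (soft)watchdog used for fencing
systemctl status pve-ha-lrm pve-ha-crm watchdog-mux

# To take a node out of HA fencing temporarily (e.g. during maintenance),
# stop the LRM once it is idle (no active HA resources on this node),
# which disarms the watchdog, then the CRM:
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm
```

Stopping the LRM while it still owns active HA resources is exactly the case where the watchdog fires, so order matters here.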
 
Having read all the other threads (including https://forum.proxmox.com/threads/pve-5-4-11-corosync-3-x-major-issues.56124/page-11#post-269235), I wanted to add: I'm running (4) different corosync "rings", spread across (4) different physical interfaces and (2) different switches. The switches have uptimes measured in months, so it wasn't the network layer:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pmx1
    nodeid: 6
    quorum_votes: 1
    ring3_addr: 198.18.53.101
    ring2_addr: 198.18.51.101
    ring1_addr: 198.18.50.101
    ring0_addr: 10.4.5.101
  }
  node {
    name: pmx2
    nodeid: 5
    quorum_votes: 1
    ring3_addr: 198.18.53.102
    ring2_addr: 198.18.52.102
    ring1_addr: 198.18.50.102
    ring0_addr: 10.4.5.102
  }
  node {
    name: pmx3
    nodeid: 4
    quorum_votes: 1
    ring3_addr: 198.18.53.103
    ring2_addr: 198.18.51.103
    ring1_addr: 198.18.50.103
    ring0_addr: 10.4.5.103
  }
  node {
    name: pmx4
    nodeid: 3
    quorum_votes: 1
    ring3_addr: 198.18.53.104
    ring2_addr: 198.18.51.104
    ring1_addr: 198.18.50.104
    ring0_addr: 10.4.5.104
  }
  node {
    name: pmx5
    nodeid: 2
    quorum_votes: 1
    ring3_addr: 198.18.53.105
    ring2_addr: 198.18.51.105
    ring1_addr: 198.18.50.105
    ring0_addr: 10.4.5.105
  }
  node {
    name: pmx8
    nodeid: 1
    quorum_votes: 1
    ring3_addr: 198.18.53.20
    ring2_addr: 198.18.51.20
    ring1_addr: 198.18.50.20
    ring0_addr: 10.4.5.20
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pmx7home
  config_version: 16
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  interface {
    linknumber: 2
  }
  interface {
    linknumber: 3
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
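To confirm all four knet links were actually up (and not just configured), I checked with the stock corosync 3.x tools - listing them here in case it helps anyone reproduce:

Code:
```shell
# Per-link connectivity as seen by knet, for every configured link/node
corosync-cfgtool -s

# Current quorum and membership view
corosync-quorumtool -s
pvecm status
```

If a link shows "disconnected" for a node here even though the interface is up, corosync won't use it, regardless of how many rings are defined in corosync.conf.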
 