[SOLVED] Expanding cluster reboots all VMs?

Hello,

I have a licensed 5-node cluster running the latest 7.2. I added a 6th node to the cluster last night, and during the addition it restarted and moved all the VMs in the cluster. The VMs are marked for HA, and they were started on a bunch of different nodes while node 6 was being added.

During the process I received email alerts about the nodes being fenced, but I'm not sure why. The new node was eventually added, but 60+ VMs all rebooted along the way.

Do I need to remove all the VMs from HA before expanding the cluster? What logs should I be looking at? This just seems odd. I have 3 more nodes to add to the cluster, but for now I have paused the expansion until we can figure this out.

Thank you!
 
no, a cluster expansion should not trigger fencing of all nodes unless something goes (very) wrong. could you please post your corosync.conf, details about your network setup, and the journal from ALL nodes covering the timespan of the expansion and fencing?
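
for collecting the journal, something like this on each node works (the timestamps below are placeholders - use the actual window of the expansion and fencing):

Code:
journalctl --since "2022-06-01 22:00" --until "2022-06-02 02:00" > "$(hostname)-journal.txt"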
 
Here is the corosync.conf dump.

Ring0 is 2x10G LACP and Ring1 is also 2x10G LACP (used for NFS traffic, little to no traffic on this network).

Node 6 is the one that was added last night.

Code:
root@node1:/etc/pve# cat corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.18.1.4
    ring1_addr: 10.201.3.4
  }
  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.18.1.5
    ring1_addr: 10.201.3.5
  }
  node {
    name: node3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 172.18.1.6
    ring1_addr: 10.201.3.6
  }
  node {
    name: node4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 172.18.1.7
    ring1_addr: 10.201.3.7
  }
  node {
    name: node5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 172.18.1.8
    ring1_addr: 10.201.3.8
  }
  node {
    name: node6
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 172.18.1.9
    ring1_addr: 10.201.3.9
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: va1-prox2-xxxx
  config_version: 6
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

root@node1:/etc/pve#
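
Per-link status as corosync sees it can also be checked on each node with corosync-cfgtool, e.g.:

Code:
corosync-cfgtool -s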
 
pveversion info:

Code:
root@node1:/etc/pve# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.35-2-pve)
pve-manager: 7.2-4 (running version: 7.2-4/ca9d43cc)
pve-kernel-5.15: 7.2-4
pve-kernel-helper: 7.2-4
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.35-2-pve: 5.15.35-5
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-storage-perl: 7.2-4
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.1-1
proxmox-backup-file-restore: 2.2.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
root@node1:/etc/pve#

Network info. FYI, the cluster has been running for 6+ months with no problems, until trying to add this node:
Code:
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto enp132s0
iface enp132s0 inet manual

auto eno1np0
iface eno1np0 inet manual

auto eno2np1
iface eno2np1 inet manual

auto enp132s0d1
iface enp132s0d1 inet manual

auto bond0
iface bond0 inet manual
    bond-slaves eno1np0 enp132s0
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer2+3
#Frontend 6650s

auto bond1
iface bond1 inet static
    address 10.201.3.4/16
    bond-slaves eno2np1 enp132s0d1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer2+3
#Backend Arista's

auto vmbr300
iface vmbr300 inet manual
    bridge-ports bond0.300
    bridge-stp off
    bridge-fd 0
#VL300 Public

auto vmbr600
iface vmbr600 inet static
    address 172.18.1.4/24
    gateway 172.18.1.1
    bridge-ports bond0.600
    bridge-stp off
    bridge-fd 0
#VL600 Node Management

auto vmbr601
iface vmbr601 inet manual
    bridge-ports bond0.601
    bridge-stp off
    bridge-fd 0
#VL601 Internal Use

auto vmbr675
iface vmbr675 inet manual
    bridge-ports bond0.675
    bridge-stp off
    bridge-fd 0
#VL675

auto vmbr799
iface vmbr799 inet manual
    bridge-ports bond0.799
    bridge-stp off
    bridge-fd 0
#VL799

auto vmbr800
iface vmbr800 inet manual
    bridge-ports bond0.800
    bridge-stp off
    bridge-fd 0
#VL800

auto vmbr801
iface vmbr801 inet manual
    bridge-ports bond0.801
    bridge-stp off
    bridge-fd 0
#VL801

auto vmbr802
iface vmbr802 inet manual
    bridge-ports bond0.802
    bridge-stp off
    bridge-fd 0
#VL802

auto vmbr803
iface vmbr803 inet manual
    bridge-ports bond0.803
    bridge-stp off
    bridge-fd 0
#VL803

auto vmbr804
iface vmbr804 inet manual
    bridge-ports bond0.804
    bridge-stp off
    bridge-fd 0
#VL804
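
Since both rings run over LACP bonds, the bond/LACP state on each node can also be double-checked via the kernel's bonding status files, e.g.:

Code:
cat /proc/net/bonding/bond0
cat /proc/net/bonding/bond1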
 
thanks! it might be a red herring, but the fact that node 5 logged the config reload before the 'writing new config' entry, and then logged nothing about node 6, seems quite suspicious. I'll take a closer look at whether anything is amiss there!
 
I can in fact reproduce this issue by introducing an artificial delay between receiving the corosync.conf update and actually writing it out to /etc/corosync/corosync.conf on one of the "passive" nodes being informed of the update (so that the config reload gets triggered before the new config has actually become visible to corosync). I'll have to think a bit about whether there is an easy way out. I'll keep you posted!
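
to make the ordering concrete, here is a simplified sketch of the update path on such a node (an illustration only - the actual steps are performed internally by pmxcfs, not by a shell script):

Code:
# the new cluster-wide config arrives via the cluster filesystem (/etc/pve/corosync.conf);
# each node then writes it out to the file corosync actually reads:
cp /etc/pve/corosync.conf /etc/corosync/corosync.conf
# afterwards corosync is told to reload its configuration:
corosync-cfgtool -R
# if the reload reaches a node before the local copy above has happened,
# that node re-reads its *old* config, disagrees with the rest of the
# cluster, and can end up being fenced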
 
Thank you very much @fabian. Let me know when you think you've found something or want to try something. I have 3 more nodes waiting to be added and I'm happy to adjust whatever is needed and test again.
 
you should be able to (safely) add them if you manually "disarm" HA first:

first, stop the local resource manager "pve-ha-lrm" on each node. only after it has been stopped on all nodes, also stop the cluster resource manager "pve-ha-crm" on each node; use the GUI (Node -> Services) or the CLI by running the following command on each node:

Code:
systemctl stop pve-ha-lrm

Only after the above has been done on all nodes, run the following on each node:

Code:
systemctl stop pve-ha-crm

check between each addition that the services are still stopped on all nodes.
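
for example, a quick check from one node (a sketch assuming root SSH access between the nodes - adjust the node names to yours):

Code:
# both services should report 'inactive' on every node
for n in node1 node2 node3 node4 node5 node6; do
    echo "== $n =="
    ssh root@$n "systemctl is-active pve-ha-lrm pve-ha-crm"
done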

after all new nodes have been added, start the LRM service again, followed by the CRM service.

but of course - waiting for further analysis and a potential fix is also a valid approach if the additions are not urgent :)
 
yes. unfortunately the issue is a bit hard to fix properly. we'll roll out a workaround (basically waiting unconditionally before reloading the config) that should make it impossible to trigger in most setups while we work on the proper fix. I'll ping here once the workaround is available in packaged form.
 
Glad to hear you found the issue. Was this something that made it into 7.2, or is it something unique to our setup? Just curious.
 
as far as I can tell the issue has existed for quite a long time - it's a race, so you might just have encountered the right (un)lucky circumstances that triggered it.
 
@fabian just wanted to check in and see if you've made any progress on this yet?
 
Had a few days of vacation that got in the way, but I'll send a patch for the workaround today :)
 
no, not yet. the package with the workaround is on pvetest at the moment.
 
@fabian can you let me know what package(s) and version I should be looking for in the Enterprise repo? I'm ready to go and test this out once they hit Enterprise.
 
yes - pve-cluster (et al) in version >= 7.2-2, which should be on pve-enterprise already :)
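
e.g. a quick check on each node:

Code:
pveversion -v | grep pve-cluster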
 
