[SOLVED] Migrating multiple VMs crashes the entire cluster

tplz · Jul 26, 2023

Hello, and thank you for your time.

TL;DR - My entire 5-node Proxmox VE 7.4 cluster crashed when I tried to migrate 2 VMs simultaneously.

I have a cluster of 5 Proxmox VE 7.4-15 nodes. I am using Ceph Quincy 17.2.6, colocated on every node, to store most VM disks. I have Proxmox HA set up to restart any failed VMs.

Up until yesterday, I had migrated VMs between nodes without any issues. The only recent change I've made is enabling Proxmox HA so that VMs hosted on a downed node are restarted on another node. I believe this change is relevant because it allows multiple migrations to happen simultaneously (as I understand it, the HA agent requests the migration on the user's behalf rather than the user requesting it directly), whereas before each VM migration had to wait for its turn.

The first issue occurred yesterday when I tried to bulk-migrate 5 VMs from one node to another. I had never bulk-migrated any VMs before, but I didn't think it would become an issue as I had used the regular migrate on that cluster many times before. Unfortunately, the nodes that the VMs were migrating to and from both rebooted soon after the migration had started.

I had a similar but much more severe issue today. I had two VMs on Node6 that I wanted to migrate to Node1 because I needed to take Node6 down for maintenance (just adding another 10G NIC, no known issue with the node). I used the GUI to start migrating both VMs via the regular migrate button, not the bulk migrate button. Soon thereafter, the web GUI on all nodes stopped responding. I checked the servers, and sure enough every node was rebooting, taking the entire production environment down with them.

Each node has an Intel dual-port 10G RJ45 PCIe card. Both ports are bonded in Primary/Failover mode and carry Ceph's Public and Cluster networks. The Proxmox cluster (Corosync) network is on a separate 1G network.

The only potential issue I see is that the 1G switch that every node is connected to for its cluster network is pretty loaded. But even if the nodes lost communication over the cluster network as a result of simultaneous VM migrations, would that really force the nodes to reboot? Looking at the nodes' syslog, it seems that Corosync starts logging retransmits continuously after the migration starts, and then the node restarts.

I have included relevant output from the cluster below and full syslog for Node1 and Node6 attached to this post. Please let me know if any additional information would be helpful.

Thank you!!

Bash:

root@node6:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 <temp node 1>
         2          1 <node 6>
         3          1 <node 5>
         4          1 <temp node 4>
         5          1 <node 1>

Bash:

root@node6:~# pveversion
pve-manager/7.4-15/a5d2a31e (running kernel: 5.15.107-2-pve)

Bash:

root@node6:~# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME        STATUS  REWEIGHT  PRI-AFF
 -1         31.43875  root default                           
-11          3.49319      host node1                         
 16    ssd   1.74660          osd.16       up   1.00000  1.00000
 17    ssd   1.74660          osd.17       up   1.00000  1.00000
 -7          5.23979      host node4                         
  6    ssd   1.74660          osd.6        up   1.00000  1.00000
  7    ssd   1.74660          osd.7        up   1.00000  1.00000
  8    ssd   1.74660          osd.8        up   1.00000  1.00000
 -3          6.98639      host node6                        
  0    ssd   1.74660          osd.0        up   1.00000  1.00000
  1    ssd   1.74660          osd.1        up   1.00000  1.00000
  2    ssd   1.74660          osd.2        up   1.00000  1.00000
 15    ssd   1.74660          osd.15       up   1.00000  1.00000
 -5          5.23979      host t-temp node 1                       
  3    ssd   1.74660          osd.3        up   1.00000  1.00000
  4    ssd   1.74660          osd.4        up   1.00000  1.00000
  5    ssd   1.74660          osd.5        up   1.00000  1.00000
 -9         10.47958      host temp node 4                      
  9    ssd   1.74660          osd.9        up   1.00000  1.00000
 10    ssd   1.74660          osd.10       up   1.00000  1.00000
 11    ssd   1.74660          osd.11       up   1.00000  1.00000
 12    ssd   1.74660          osd.12       up   1.00000  1.00000
 13    ssd   1.74660          osd.13       up   1.00000  1.00000
 14    ssd   1.74660          osd.14       up   1.00000  1.00000

Code:

Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
-- Reboot --
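To put a number on how dense the retransmit burst is before the reboot, the log can be summarized per second. This is only a rough sketch: the function name `retransmits_per_second` is made up for illustration, and it assumes the classic "Mon DD HH:MM:SS host ..." syslog timestamp layout seen in the excerpt above.

```shell
#!/usr/bin/env bash
# Count Corosync "Retransmit List" messages per second in a syslog-style file.
# Assumption: each line starts with "Mon DD HH:MM:SS", as in the excerpt above.
# retransmits_per_second is a hypothetical helper name, not a Proxmox tool.
retransmits_per_second() {
    # -F treats the pattern as a fixed string, so the brackets are literal.
    grep -F '[TOTEM ] Retransmit List' "$1" \
        | awk '{ count[$1 " " $2 " " $3]++ }
               END { for (t in count) print t, count[t] }' \
        | sort
}

# Example call (the path is an assumption and depends on the logging setup):
# retransmits_per_second /var/log/syslog
```

A sustained count like this on the Corosync network right after a migration starts points towards the link being saturated rather than a one-off packet loss.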

aaron · Jul 27, 2023

So if I understand the network correctly, the 1G link is used for everything that is not Ceph, and Corosync has only one link configured?

Best practice is to give Corosync at least one physical network (1Gbit is usually enough) just for itself, to avoid side effects if other services use up the bandwidth. Additional Corosync links on other networks are a good idea, as they give it the option to switch networks if problems arise. Corosync can use up to 8 different networks.

So, what you could do is one or a combination of the following:

Ideally: add or use another 1Gbit NIC, configure a new network on it, and configure a Corosync link on that network: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy
Add a second Corosync link on the Ceph network. As long as that network still works, Corosync can switch to it.
Move the migration network to a different network to reduce the load on the 1Gbit network (Datacenter -> Options).
Set a bandwidth limit on migrations to limit their impact (Datacenter -> Options).
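
The last two options end up as entries in /etc/pve/datacenter.cfg. As a rough sketch of what that could look like: the CIDR 10.10.10.0/24 is a placeholder for the 10G network and the limit is an arbitrary example value, neither is taken from this cluster. The bandwidth limit is given in KiB/s, so 51200 corresponds to about 50 MiB/s.

```text
# Sketch only: placeholder network and limit, adjust to the actual setup.
migration: type=secure,network=10.10.10.0/24
bwlimit: migration=51200
```

The leading comment line is an annotation for this post and can be left out. Setting the values through Datacenter -> Options in the GUI avoids editing the file by hand.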

tplz said:
would that really force the nodes to reboot?

If HA is enabled, yes. The Corosync connection is used to determine if the node is still part of the cluster. Should a node with HA guests lose the connection to the cluster for more than a minute, it will fence itself (hard reset) to make sure that the HA guests are definitely powered down before the (hopefully) remaining rest of the cluster recovers those VMs on other nodes.

If this happens on multiple nodes at once, possibly all or a majority of them, then the whole cluster will be affected.
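
The "majority" part can be made concrete with a little arithmetic. A minimal sketch, assuming one vote per node as in the pvecm output above; `quorum_needed` and `is_quorate` are hypothetical helper names for illustration, not Proxmox or Corosync commands.

```shell
#!/usr/bin/env bash
# Votes needed for quorum with N equally weighted votes: floor(N / 2) + 1.
# Both function names are made up for this example.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# is_quorate TOTAL_VOTES REACHABLE_VOTES
# Succeeds if the reachable votes (including the node's own) form a majority.
is_quorate() {
    [ "$2" -ge "$(quorum_needed "$1")" ]
}

# Example for a 5-node cluster:
# quorum_needed 5                          # prints 3
# is_quorate 5 3 && echo "still quorate"   # 3 of 5 nodes can see each other
# is_quorate 5 1 || echo "not quorate"     # an isolated node
```

With 5 nodes, the cluster keeps quorum while 3 nodes can still talk to each other, so two nodes fencing themselves is survivable. If congestion cuts every node off from the others, no partition has 3 votes and every node running HA guests will fence itself.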

tplz · May 3, 2024

Thank you for your reply, Aaron. Sorry for the late (10 months.... oops) reply.

The issue ended up being network congestion on the 1G links, like you suggested. The only network I had configured as a Corosync network was the 1G network, and I also used that 1G link for VM migrations. Migrating multiple VMs at once congested traffic on the sending and receiving nodes so badly that the softdog watchdog fenced both nodes and rebooted them.

The initial solution I implemented was moving migrations to the 10G Ceph network. After that, I created a dedicated Corosync network to serve as Corosync link 0. I have additionally configured a few other networks as Corosync links, as you suggested.
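
For anyone finding this thread later, a layout with a dedicated link 0 plus a fallback link looks roughly like the excerpt below in /etc/pve/corosync.conf. This is a sketch and not my actual file: the addresses and the config_version value are placeholders, and only one node entry is shown.

```text
nodelist {
  node {
    name: node1
    nodeid: 5
    quorum_votes: 1
    # link 0: dedicated Corosync network (placeholder address)
    ring0_addr: 10.10.10.1
    # link 1: fallback on another network (placeholder address)
    ring1_addr: 10.20.20.1
  }
  # one node block per cluster member, each with its own addresses
}

totem {
  # keep the existing totem settings and increment config_version on every edit
  config_version: 8
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}
```

The redundancy section of the admin guide linked above describes the procedure for editing this file safely.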

VM migrations have been rock solid ever since implementing your guidance. Thank you for your time and expertise!
