Migrating multiple VMs crashes the entire cluster

Jul 26, 2023
Hello, and thank you for your time.

TL;DR: My entire 5-node PVE 7.4 cluster crashed when I tried to migrate two VMs simultaneously.

I have a cluster of 5 PVE 7.4-15 nodes. I am using Ceph Quincy 17.2.6 colocated on every node to store most VM disks. I have Proxmox HA set up to start any failed VMs.

Up until yesterday, I had migrated VMs between nodes without any issues. The only recent change I've made is enabling Proxmox HA to restart VMs from a downed node on another node. I believe this change is relevant because it allows multiple migrations to happen simultaneously (something about the HA agent requesting the migration on behalf of the user instead of the user requesting it directly), whereas before each VM migration had to wait for its turn.

The first issue occurred yesterday when I tried to bulk-migrate 5 VMs from one node to another. I had never bulk-migrated any VMs before, but I didn't think it would become an issue as I had used the regular migrate on that cluster many times before. Unfortunately, the nodes that the VMs were migrating to and from both rebooted soon after the migration had started.

I had a similar but much more severe issue today. I had two VMs on Node6 that I wanted to migrate to Node1 because I needed to take Node6 down for maintenance (just adding another 10G NIC, no known issue with the node). I used the GUI to start migrating both VMs via the regular migrate button, not the bulk migrate button. Soon thereafter, the web GUI on all nodes stopped responding. I checked the servers, and sure enough every node was rebooting, taking the entire production environment down with them.

Each node has an Intel dual-port 10G RJ45 PCIe card. Both ports are bonded in Primary/Failover mode and carry Ceph's public and cluster networks. The Proxmox cluster (Corosync) network is on a separate 1G network.

The only potential issue I see is that the 1G switch that every node is connected to for its cluster network is pretty loaded. But even if the nodes lost communication over the cluster network as a result of simultaneous VM migrations, would that really force the nodes to reboot? Looking at the nodes' syslog, it seems like corosync starts freaking out after the migration starts and forces the node to restart.

I have included relevant output from the cluster below and full syslog for Node1 and Node6 attached to this post. Please let me know if any additional information would be helpful.

Thank you!!

Bash:
root@node6:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 <temp node 1>
         2          1 <node 6>
         3          1 <node 5>
         4          1 <temp node 4>
         5          1 <node 1>

Bash:
root@node6:~# pveversion
pve-manager/7.4-15/a5d2a31e (running kernel: 5.15.107-2-pve)

Bash:
root@node6:~# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME        STATUS  REWEIGHT  PRI-AFF
 -1         31.43875  root default                           
-11          3.49319      host node1                         
 16    ssd   1.74660          osd.16       up   1.00000  1.00000
 17    ssd   1.74660          osd.17       up   1.00000  1.00000
 -7          5.23979      host node4                         
  6    ssd   1.74660          osd.6        up   1.00000  1.00000
  7    ssd   1.74660          osd.7        up   1.00000  1.00000
  8    ssd   1.74660          osd.8        up   1.00000  1.00000
 -3          6.98639      host node6                        
  0    ssd   1.74660          osd.0        up   1.00000  1.00000
  1    ssd   1.74660          osd.1        up   1.00000  1.00000
  2    ssd   1.74660          osd.2        up   1.00000  1.00000
 15    ssd   1.74660          osd.15       up   1.00000  1.00000
 -5          5.23979      host t-temp node 1                       
  3    ssd   1.74660          osd.3        up   1.00000  1.00000
  4    ssd   1.74660          osd.4        up   1.00000  1.00000
  5    ssd   1.74660          osd.5        up   1.00000  1.00000
 -9         10.47958      host temp node 4                      
  9    ssd   1.74660          osd.9        up   1.00000  1.00000
 10    ssd   1.74660          osd.10       up   1.00000  1.00000
 11    ssd   1.74660          osd.11       up   1.00000  1.00000
 12    ssd   1.74660          osd.12       up   1.00000  1.00000
 13    ssd   1.74660          osd.13       up   1.00000  1.00000
 14    ssd   1.74660          osd.14       up   1.00000  1.00000

Code:
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
-- Reboot --
 

Attachments

  • 7-26-23 0000-0833 Node1.txt (101.2 KB)
  • 7-26-23 0000-0832 Node6.txt (81.7 KB)

So if I understand the network correctly, the 1G link is used for everything except Ceph, and Corosync has only one link configured?

Best practice is to give Corosync at least one physical network (1Gbit is usually enough) just for itself, to avoid side effects if other services use up the bandwidth. Additional Corosync links on other networks are a good idea, since they give it the option to switch networks if problems arise. Corosync can use up to 8 different networks.
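As a rough illustration, a second link in /etc/pve/corosync.conf might look like the fragment below. This is only a sketch: the addresses are placeholders from documentation ranges, only one node entry is shown, and the config_version number is made up (it has to be increased from whatever the current value is each time the file is edited). The Proxmox admin guide's section on Corosync redundancy describes the full procedure.

```
nodelist {
  node {
    name: node1
    nodeid: 5
    quorum_votes: 1
    # existing link 0 (placeholder address)
    ring0_addr: 192.0.2.11
    # new link 1 on a second network (placeholder address)
    ring1_addr: 198.51.100.11
  }
  # the remaining nodes get a ring1_addr in the same way
}

totem {
  # all other existing totem settings stay unchanged
  # placeholder value; must be higher than the current one
  config_version: 5
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}
```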

So, what you could do is one or a combination of the following:
  • Ideally: add/use another 1Gbit NIC, configure a new network and configure a Corosync link on it: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy
  • Add a second Corosync link on the Ceph network. If the first link has problems and the Ceph network still works, Corosync can switch to it
  • Configure the migration network to another one to reduce the load on the 1Gbit network (Datacenter -> Options)
  • Set a bandwidth limit on the migrations to limit the impact (Datacenter -> Options)
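
The last two options end up as lines in /etc/pve/datacenter.cfg. The small helper below is a sketch that prints what those lines could look like. The function name, the network, and the limit are all placeholders chosen for illustration, not values from this thread. Note that bandwidth limits in datacenter.cfg are given in KiB/s.

```shell
#!/bin/sh
# Sketch: print the datacenter.cfg lines for a dedicated migration network
# and a migration bandwidth limit (both also settable under Datacenter -> Options).
# The helper name and all values are placeholders, not taken from the thread.
migration_cfg_lines() {
    network_cidr="$1"    # network that should carry migration traffic (assumed CIDR)
    limit_mib="$2"       # desired migration cap in MiB/s
    # datacenter.cfg expects bandwidth limits in KiB/s
    limit_kib=$((limit_mib * 1024))
    printf 'migration: secure,network=%s\n' "$network_cidr"
    printf 'bwlimit: migration=%s\n' "$limit_kib"
}

# Example: a placeholder network and a cap of 80 MiB/s, which is below what a
# 1 Gbit/s link can carry at best (roughly 119 MiB/s).
migration_cfg_lines "198.51.100.0/24" 80
```

Either setting can be used on its own. The values would need to be adapted to the actual networks in the cluster.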

"would that really force the nodes to reboot?"
If HA is enabled, yes. The Corosync connection is used to determine whether the node is still part of the cluster. Should a node with HA guests lose the connection to the cluster for more than a minute, it will fence itself (hard reset) to make sure that the HA guests are definitely powered down before the (hopefully) remaining part of the cluster recovers those VMs on other nodes.

If the connection loss affects multiple nodes at once, maybe all or the majority, then the whole cluster will be affected.
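
To check how badly Corosync was struggling before such a reset, one could count the TOTEM retransmit messages per second in the syslog, like the ones in the excerpt from the first post. The helper below is a minimal sketch. The function name is made up, and it assumes the usual syslog line layout with month, day, and time as the first three fields.

```shell
#!/bin/sh
# Sketch: count corosync "Retransmit List" messages per one-second timestamp.
# Reads syslog-style lines on stdin and prints "<month> <day> <time> <count>".
# The function name is hypothetical.
count_retransmits() {
    awk '/corosync.*Retransmit List:/ {
             key = $1 " " $2 " " $3
             # remember first-seen order so output follows the log
             if (!(key in count)) order[++n] = key
             count[key]++
         }
         END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }'
}

# Demo with three lines in the format of the syslog excerpt from the first post.
count_retransmits <<'EOF'
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]:   [TOTEM ] Retransmit List: 3
EOF
```

A saved syslog file can be fed to the function on stdin in the same way. A steady stream of retransmits during a migration would be consistent with the Corosync link being saturated.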
 
