Hello, and thank you for your time.
TL;DR - My entire 5-node Proxmox VE 7.4 cluster crashed (every node rebooted) when I tried to migrate 2 VMs simultaneously.
I have a cluster of 5 Proxmox VE 7.4-15 nodes. I am using Ceph Quincy 17.2.6, colocated on every node, to store most VM disks. I have Proxmox HA set up to start any failed VMs.
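For reference, the VMs in question are managed as HA resources; the CLI equivalent of what I configured would be roughly the following (the VM IDs here are just placeholders, not my actual IDs):
Bash:
# Rough CLI equivalent of the HA setup (VM IDs are placeholders)
root@node6:~# ha-manager add vm:100 --state started
root@node6:~# ha-manager add vm:101 --state started
# Current HA resource config and manager state:
root@node6:~# ha-manager config
root@node6:~# ha-manager status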
Up until yesterday, I had migrated VMs between nodes without any issues. The only recent change I've made is enabling Proxmox HA to restart VMs hosted on a downed node on another node. I believe this change is relevant because it allows multiple migrations to run simultaneously (as I understand it, the HA manager requests the migration on behalf of the user instead of the user running it directly), whereas before each VM migration had to wait for its turn.
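If I understand the mechanism correctly, once a VM is an HA resource the GUI migrate button hands the request to the HA stack rather than running the migration task directly, i.e. roughly the equivalent of the following (VM ID and node name are placeholders):
Bash:
# HA-managed VM: the request is queued for the HA manager to execute
root@node6:~# ha-manager migrate vm:100 node1
# Non-HA VM: the migration runs directly as a single task
root@node6:~# qm migrate 100 node1 --online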
The first issue occurred yesterday when I tried to bulk-migrate 5 VMs from one node to another. I had never bulk-migrated any VMs before, but I didn't expect a problem since I had used the regular migrate on this cluster many times. Unfortunately, both the source and destination nodes rebooted soon after the migration started.
I had a similar, but much more severe, issue today. I had two VMs on Node6 that I wanted to migrate to Node1 because I needed to take Node6 down for maintenance (just adding another 10G NIC; there is no known issue with the node). I used the GUI to start migrating both VMs via the regular migrate button, not the bulk migrate button. Soon afterwards, the web GUI on all nodes stopped responding. I checked the servers, and sure enough every node was rebooting, taking the entire production environment down with them.
Each node has an Intel dual-port 10G RJ45 PCIe card. Both ports are bonded together in primary/failover (active-backup) mode and carry Ceph's public and cluster networks. The Proxmox cluster (corosync) network is on a separate 1G network.
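For completeness, the bond is defined in /etc/network/interfaces along these lines (the interface names and the address below are placeholders, not my real values):
Code:
# Sketch of the 10G active-backup bond carrying Ceph public + cluster traffic
# (interface names and address are placeholders)
auto bond0
iface bond0 inet static
        address 10.10.10.16/24
        bond-slaves enp65s0f0 enp65s0f1
        bond-miimon 100
        bond-mode active-backup
        bond-primary enp65s0f0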
The only potential issue I see is that the 1G switch every node uses for its cluster network is pretty heavily loaded. But even if the nodes lost communication over the cluster network as a result of the simultaneous VM migrations, would that really force the nodes to reboot? Looking at the nodes' syslogs, corosync starts logging retransmit errors right after the migration starts, and then the node restarts.
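If more diagnostics would help, these are the commands I can run to capture the corosync link state and the corosync/HA/watchdog logs from the boot before the reboot (output not included here):
Bash:
# Corosync link and quorum state
root@node6:~# corosync-cfgtool -s
root@node6:~# pvecm status
# Corosync / HA / watchdog messages from the previous boot
root@node6:~# journalctl -b -1 -u corosync -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux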
I have included relevant output from the cluster below, and the full syslogs for Node1 and Node6 are attached to this post. Please let me know if any additional information would be helpful.
Thank you!!
Bash:
root@node6:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 <temp node 1>
         2          1 <node 6>
         3          1 <node 5>
         4          1 <temp node 4>
         5          1 <node 1>
Bash:
root@node6:~# pveversion
pve-manager/7.4-15/a5d2a31e (running kernel: 5.15.107-2-pve)
Bash:
root@node6:~# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME               STATUS  REWEIGHT  PRI-AFF
 -1         31.43875  root default
-11          3.49319      host node1
 16    ssd   1.74660          osd.16              up   1.00000  1.00000
 17    ssd   1.74660          osd.17              up   1.00000  1.00000
 -7          5.23979      host node4
  6    ssd   1.74660          osd.6               up   1.00000  1.00000
  7    ssd   1.74660          osd.7               up   1.00000  1.00000
  8    ssd   1.74660          osd.8               up   1.00000  1.00000
 -3          6.98639      host node6
  0    ssd   1.74660          osd.0               up   1.00000  1.00000
  1    ssd   1.74660          osd.1               up   1.00000  1.00000
  2    ssd   1.74660          osd.2               up   1.00000  1.00000
 15    ssd   1.74660          osd.15              up   1.00000  1.00000
 -5          5.23979      host <temp node 1>
  3    ssd   1.74660          osd.3               up   1.00000  1.00000
  4    ssd   1.74660          osd.4               up   1.00000  1.00000
  5    ssd   1.74660          osd.5               up   1.00000  1.00000
 -9         10.47958      host <temp node 4>
  9    ssd   1.74660          osd.9               up   1.00000  1.00000
 10    ssd   1.74660          osd.10              up   1.00000  1.00000
 11    ssd   1.74660          osd.11              up   1.00000  1.00000
 12    ssd   1.74660          osd.12              up   1.00000  1.00000
 13    ssd   1.74660          osd.13              up   1.00000  1.00000
 14    ssd   1.74660          osd.14              up   1.00000  1.00000
Code:
Jul 26 08:33:12 node1 corosync[2487]: [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]: [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]: [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]: [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]: [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]: [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]: [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]: [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]: [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]: [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]: [TOTEM ] Retransmit List: 3
Jul 26 08:33:12 node1 corosync[2487]: [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]: [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]: [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]: [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]: [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]: [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]: [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]: [TOTEM ] Retransmit List: 3
Jul 26 08:33:13 node1 corosync[2487]: [TOTEM ] Retransmit List: 3
-- Reboot --