Odd problem with VMs auto-migrated to the wrong hosts, plus how best to disable HA for now?

Pyromancer

Member
Jan 25, 2021
We have a Proxmox cluster comprising two identical large machines, each with 120 cores, 370+ GB of memory, and two ZFS pools: Tank1 (50 TB of HDD) and Tank2 (2 TB of SSD). A third, less powerful machine has a single ZFS pool, a mirrored pair of 10 TB disks, also designated Tank1. There's also a Qdevice for quorum running on a separate host external to the Proxmox cluster (it's actually a Linux VM on a VMware hypervisor).

All the machines are connected by a 10 Gbit switch which has lately been suffering random crashes. Until recently the only consequence of the crashes was that the Proxmox machines would reboot on losing contact with each other (the log entries are visible via journalctl -b -1 -e); the VMs stayed where they were, and no failover or migration was triggered.

The VMs on the large machines were configured to replicate between those two machines (only) and had HA configured to provide failover in the event of a host failing.

However, following the most recent switch incident, several VMs appear to have attempted to migrate to the third machine, which has neither the disk capacity nor the processor power for them, and had no replicated disk files. So we ended up with the config files in /etc/pve/qemu-server on the third machine (and removed from their original hosts), while the disk files were left behind on the original hosts. To add to the issues, the third machine is running Proxmox 6.2 whereas the large machines are both on version 7; we were planning to upgrade the third but hadn't got to it yet.

We run full PVE backups to an external disk system, so we were able to recover the affected VMs from backup by using different VM IDs for them, though we still have some dev machines down.
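
For reference, the restores themselves boil down to one command per VM from the CLI; the storage name, backup file name, and new VM ID here are just placeholders:

# list the backups available on the backup storage
pvesm list backupstore --content backup

# restore a dump to a new, unused VM ID and put the restored disks on the SSD pool
qmrestore backupstore:backup/vzdump-qemu-101-2024_01_10-02_00_01.vma.zst 9101 --storage tank2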

Questions:

1. Can we just delete the errant conf files from /etc/pve/qemu-server on the third machine via the CLI? Trying to delete via the GUI gives errors because the disks they reference (and in some cases the ZFS pool they were on) don't exist on the third machine. If we remove them manually, will that clear them from the GUI?

2. If we were to just copy the conf files back to the original hosts (and delete the errant ones) would that restore the VMs where the files still exist? Or is there anything we'd need to do to get the original host to re-scan for them?

3. Until we get the switch issue resolved (probably by replacing it), if we set HA to "ignored" on all the VMs, will that prevent any failover/migration attempts? I tried setting it to "disabled", but as soon as we restarted the VMs it went back to "started". For now we just want the VMs to remain on their existing hosts if there are any reboots, not attempt to fail over or migrate.

4. What could have caused the system to try to migrate VMs to entirely the wrong host, with no replicated disk files present?
 
  1. Yes, you can delete the errant configuration files directly from /etc/pve/qemu-server on the third machine via the command line, and that will also remove them from the Proxmox GUI. Editing files under /etc/pve by hand is generally discouraged, as mistakes there affect the whole cluster; where the referenced storage still exists, it's better to remove VMs through the GUI or with the qm command-line tool (pct is the equivalent for containers).
  2. If you copy the configuration files back to their original hosts (and remove the errant copies), the VMs should reappear there. /etc/pve is the shared cluster filesystem, so changes to it are picked up automatically; you shouldn't need any separate rescan or service restart.
  3. To prevent VMs from attempting failover or migration, set the HA state for each VM to "ignored", either in the Proxmox GUI or via ha-manager (rough commands are sketched after this list). In that state the cluster resource manager leaves the VM alone and will not try to move it to another host in the event of a host failure.
  4. VMs being recovered onto entirely the wrong host is most likely down to the cluster's HA configuration or to communication problems between the hosts. The switch crashes you mention are the obvious suspect, but it could also be an issue with the quorum device or with the Proxmox configuration itself; it's hard to say for sure without more information. I'd start by checking the cluster and HA logs.
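
As a rough CLI sketch of points 1 and 3 (the VM IDs are just placeholders, and only delete a config by hand once you're sure none of its disks exist on that node):

# on the third node: drop an orphaned config (VM 100 here); it disappears from the GUI immediately
rm /etc/pve/qemu-server/100.conf

# also remove its HA resource definition so the HA stack forgets about it
ha-manager remove vm:100

# for the VMs you are keeping (VM 101 here), tell the HA manager to leave them alone for now
ha-manager set vm:101 --state ignored

# check the cluster-wide HA state afterwards
ha-manager status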
 
you need to set your HA groups to restricted to ensure only the members of the group are used for failover/recovery.

also, a three node cluster with a qdevice doesn't make sense (a qdevice is only there as a tie-breaker in a cluster with even node count), so I'd remove it and upgrade the third node to 7.x to rule out any weird interactions from the version mismatch.
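
roughly like this (group name, node names and VMID are just placeholders):

# create a restricted HA group containing only the two big nodes
ha-manager groupadd big-nodes --nodes "pve1,pve2" --restricted 1

# assign a VM's HA resource to that group
ha-manager set vm:101 --group big-nodes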
 
Thanks for the answers above, clear and concise.

Having checked in tank1 and tank2 on the main machines that the disk files were still present, I moved the remaining VMs back to the proper hosts by the following process:

1. In a terminal window on the "wrong" host:
cd /etc/pve/qemu-server; cat xxx.conf

2. In a terminal window on the "right" host:
In /etc/pve/qemu-server, open xxx.conf in vi and paste in the contents from the other terminal window, but don't save the file yet.

3. Back on the wrong host, rm xxx.conf

4. On the right host, now :wq in vi to write the file. That way the same xxx.conf never existed in two places at the same time, as I'm guessing that would be bad?

The VM would disappear from the GUI the instant I deleted xxx.conf on the wrong host, then appear on the correct one when I wrote the file. I didn't need to run anything to scan for them; they just popped up automatically. I first disabled HA on them to get them out of the error state, then set it to "ignored" to prevent any further uncontrolled auto-migrations if the switch continues to play up before it's replaced.
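
In hindsight there's probably a simpler way to do the same move: /etc/pve is the shared cluster filesystem and /etc/pve/qemu-server is just a link to the local node's directory under /etc/pve/nodes, so with the VM stopped a single mv from any node should achieve the same thing. Node names and VM ID below are placeholders:

# hand a stopped VM's config from the wrong node to the right one in one step
mv /etc/pve/nodes/wrongnode/qemu-server/100.conf /etc/pve/nodes/rightnode/qemu-server/100.conf
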
Have also now set up proper restricted HA groups: failover for the VMs on the two main machines is limited to those two hosts, while the ones on the less powerful machine can fail over to any host. Once the switch is replaced we'll set HA back to "started" and the system should run as planned once more.

And I'm really glad we had good backups; Proxmox really does make them a doddle.

Thanks again for the help!
 
you need to set your HA groups to restricted to ensure only the members of the group are used for failover/recovery.

also, a three node cluster with a qdevice doesn't make sense (a qdevice is only there as a tie-breaker in a cluster with even node count), so I'd remove it and upgrade the third node to 7.x to rule out any weird interactions from the version mismatch.

Aha, wasn't aware of restricted groups. Googled and read the docs, and have now set up a restricted group containing just the two main machines, and all the VMs on those machines will have their HA resources assigned to it. For the time being HA is set to "ignored" on all VMs while we're dealing with the errant switch, but once that's been fixed or replaced, we'll restart HA on everything using the new groups.

Noted re the Qdevice: originally we had just the two nodes, so it was needed, and the third host was experimental and not actually part of the cluster. Will have to investigate the removal process.
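
If I'm reading the docs right, the removal itself should just be a single command run from any cluster node:

# detach the external Qdevice from the corosync configuration
pvecm qdevice remove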

The reason the third machine hadn't been upgraded yet was that when we upgraded the first two, we migrated all the VMs to one and upgraded the other; but when we did the same on the second one, it caused both of them to reboot, which we hadn't expected, and it happened in the middle of business hours, causing a customer-affecting outage. So we've been kind of putting it off in case the same thing happens and all three reboot when the third is upgraded. We'll need to do it in the small hours at a weekend.

Thanks for the help, much appreciated and have learned a lot. Cheers!
 