HA failover reconfigures VMs after upgrade to 6.3

alyarb

Well-Known Member
A host was fenced today and all of the VMs (all Windows 2016 or 2019) were successfully powered up on other hosts.

The only trouble was that none of them were reachable on the network, and they all had services that failed to start on boot. After investigating, we found that all of the network adapters had been replaced and set to DHCP, and all of the disks attached to each VM were "offline" in the guest OS and received random drive letters when brought back online. The VM conf files were not modified and look exactly the same as before, and the MAC addresses on all the NICs remained the same, but the NIC configuration in the guest OS was blown away regardless.

After reconfiguring all of the network adapters and getting the drives mounted properly, we rebooted all of the guests and they booted normally with everything working, as expected.

Then, all hosts were fully updated and rebooted, and I fenced another host to test the VM failover, and again, all of the NICs were reset and non-system drives were dismounted in the guest OS!

All hosts are running the same software:

Kernel Version
Linux 5.4.103-1-pve #1 SMP PVE 5.4.103-1 (Sun, 07 Mar 2021 15:55:09 +0100)

PVE Manager Version
pve-manager/6.3-6/2184247e


We did not experience this behavior during failover when we were running 6.1 and 6.2. In the past, VMs would fail over, power on, and boot up with exactly the same guest OS configuration as before.

Where do I even begin to look to find the update that broke this?

I am starting to think it is related to the finer-grained virtual machine versions. For instance, all of the machines affected by the failover are shown as "pc-q35-5.1".

In PVE 6.1 and 6.2, these versioned machine types were not shown; the only options were i440fx and q35. Additionally, the new warning message "Machine version change may affect hardware layout and settings in the guest OS." seems to describe exactly what occurred during this failover.
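For reference, the machine type ends up as a single line in the VM's config file (e.g. /etc/pve/qemu-server/100.conf, where 100 is just a placeholder VM ID); as far as I can tell, when that line is absent, the newest available machine version is used:

machine: pc-q35-5.1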

So clearly some recent work has been done in this area. Can others test HA for me with multiple NICs and disks on a q35 Windows machine?
 
OK, this is due to the machine (chipset) version defaulting to "latest" even if that is not the currently active version. The issue has nothing to do with failover; you can also trigger it just by shutting down and restarting the VM on the same host. Proxmox will automatically and unexpectedly update the machine version to "latest", and all of your removable hardware will be screwed up after the next boot.

As of right now, the only way to prevent this is to get off the "latest" value and manually pin a specific machine version so that it cannot unexpectedly change in the future. This would be especially dangerous if not every host in your cluster were running the exact same version: if a VM running pc-q35-5.1 with the configuration set to "latest" failed over to a newer host where the latest was pc-q35-5.2, you would also experience a reset of all your removable hardware.
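For anyone else hitting this, a minimal sketch of the pinning itself (the VM ID 100 and the version are placeholders, use whatever version the VM is actually running):

qm set 100 --machine pc-q35-5.1

The same option should also be available in the GUI under the VM's Hardware -> Machine, and the change takes effect on the next full stop/start of the VM.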
 
Hi,
for Windows VMs it actually shouldn't default to latest. The ability to pin the machine type was introduced recently, precisely because of the issue you ran into. There were never such noticeable issues in the past, which is why PVE always used the newest available machine type.

If the issue is still present, could you share the configuration for an affected VM? What is your pveversion -v?
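Something along these lines would help, where 100 is a placeholder for an affected VM's ID:

pveversion -v
qm config 100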
 
At the time of the fence, some were on 6.3-3 (or maybe even 6.2) and some were on 6.3-6. So it's understandable. I used to make a habit of slowly staggering host updates, moving VMs around to reboot only a subset of hosts, etc... not any more lol. I will do all of them the same day.

I noticed some machines on a couple hosts DID have the machine type pinned, so I left those at pc-q35-5.1 and powered off everything else so I could manually pin them at pc-q35-5.2.

Making a point to pin the machine type is not a big deal, all my templates are pinned. Thanks
 
alyarb said:
At the time of the fence, some were on 6.3-3 (or maybe even 6.2) and some were on 6.3-6. So it's understandable. I used to make a habit of slowly staggering host updates, moving VMs around to reboot only a subset of hosts, etc... not any more lol. I will do all of them the same day.
I mean, that wouldn't have helped much here, because as soon as you migrated to a host with a newer version (or rebooted a VM after the upgrade) the issue would've been triggered. If you need guaranteed stability, that's what the enterprise repository is for. Of course we try not to push packages with issues to no-subscription either, but in this case the issue didn't show up during internal testing (even with our Windows VMs), and people using the pve-test repository didn't complain either (at least not enough for us to notice).

alyarb said:
I noticed some machines on a couple hosts DID have the machine type pinned, so I left those at pc-q35-5.1 and powered off everything else so I could manually pin them at pc-q35-5.2.
You should only pin the ones you already re-configured to pc-q35-5.2. The ones that are still expecting to run on the old machine model shouldn't be changed:
Machine version change may affect hardware layout and settings in the guest OS.
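A quick way to check what a given VM currently has set, with 100 again being a placeholder VM ID:

qm config 100 | grep machine

If that prints nothing, no machine version is pinned in that VM's config.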

alyarb said:
Making a point to pin the machine type is not a big deal, all my templates are pinned. Thanks
 
