A host was fenced today, and all of its VMs (all Windows Server 2016 or 2019) were successfully powered up on other hosts.
The only trouble was that none of them were reachable on the network, and they all had services that failed to start on boot. After investigating, we found that all of the network adapters had been replaced and set back to DHCP, and all of the disks attached to the VMs were "offline" in the guest OS, with random letters assigned to them when brought back online. The VM conf files were not modified and look exactly the same as before, and the MAC addresses on all the NICs remained the same as well, but the NIC configuration in the guest OS was blown away regardless.
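For anyone who wants to compare the host-side view before and after a failover, something like this should show it (100 is just a placeholder VMID):

qm config 100 | grep -E '^(net|machine|scsi|virtio|sata)'
cat /etc/pve/qemu-server/100.conf

In our case that output was identical before and after; only the guest OS saw "new" hardware.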
After reconfiguring all of the network adapters and getting the drives mounted properly, we rebooted all of the guests and they booted normally with everything working, as expected.
Then all hosts were fully updated and rebooted, and I fenced another host to test the VM failover. Again, all of the NICs were reset and the non-system drives were taken offline in the guest OS!
All hosts are running the same software:
Kernel Version
Linux 5.4.103-1-pve #1 SMP PVE 5.4.103-1 (Sun, 07 Mar 2021 15:55:09 +0100)
PVE Manager Version
pve-manager/6.3-6/2184247e
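For anyone comparing against their own cluster, the full package versions can be dumped on each node with:

pveversion -v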
We did not experience this behavior during failover when we were running 6.1 and 6.2. In the past, VMs would fail over, power on, and boot up perfectly with exactly the same guest OS configuration as before.
Where do I even begin to look to find the update that broke this?
I am starting to think it is related to the finer-grained virtual machine versions. For instance, all of the machines affected by the failover are shown as "pc-q35-5.1".
In 6.1 and 6.2 of PVE, these specific machine versions were not shown; the only options were i440fx and Q35. Additionally, the new warning message "Machine version change may affect hardware layout and settings in the guest OS." seems to describe exactly what occurred during this failover.
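If the machine version really is the trigger, a possible workaround might be pinning the version explicitly on each VM so it cannot change between hosts, e.g. (100 is a placeholder VMID, and I have not yet verified that this actually prevents the reset):

qm set 100 --machine pc-q35-5.1
qm config 100 | grep machine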
So clearly some recent work has been done in this area. Can others test HA for me with multiple NICs and disks on a Q35 Windows machine?