We have a Proxmox cluster comprising two identical large machines, each with 120 cores, 370+ GB of memory, and two ZFS pools: Tank1 (50TB of HDD) and Tank2 (2TB of SSD). A third, less powerful machine has a single ZFS pool, a mirrored pair of 10TB disks, also named Tank1. There is also a QDevice for quorum running on a host external to the Proxmox cluster (it's actually a Linux VM on a VMware hypervisor).
All the machines are connected via a 10Gbit switch, which has lately been suffering random crashes. Until recently the only consequence of the crashes was that the Proxmox machines would reboot after losing contact with each other; we can see the corresponding log entries via journalctl -b -1 -e. The VMs stayed where they were, and no failover or migration was triggered.
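For completeness, these are the commands we're running on each node to look at the previous boot; the unit filters in the second command are just the cluster/HA services we assume are involved in the reboots, not anything confirmed:

    # full journal from the previous boot
    journalctl -b -1 -e
    # narrowed to the services we assume are relevant (corosync, pmxcfs, HA daemons, watchdog)
    journalctl -b -1 -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux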
The VMs on the two large machines were configured to replicate between those two machines only, and had HA configured to provide failover in the event of a host failing.
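For context, the per-VM setup was done roughly along these lines (VM ID 100, target node name big2 and the schedule are placeholders, not our real values); as far as we can tell we did not define a node-restricted HA group, which may be relevant to question 4 below:

    # replicate the VM's ZFS volumes to the other large node every 15 minutes
    pvesr create-local-job 100-0 big2 --schedule "*/15"
    # add the VM as an HA resource (no HA group / node restriction specified)
    ha-manager add vm:100 --state started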
However, following the most recent switch incident, several VMs appear to have attempted to migrate to the third machine, which has neither the disk capacity nor the processor power for them, and held none of their replicated disks. We ended up with the config files in /etc/pve/qemu-server on the third machine (and removed from their original hosts), while the disk volumes remained on the original hosts. To add to the problems, the third machine is still running Proxmox 6.2 whereas the two large machines are both on version 7; we were planning to upgrade the third but hadn't got to it yet.
We run full PVE backups to an external disk system, so we were able to restore the affected VMs from backup under new VM IDs, though we still have some dev machines down.
Questions:
1. Can we simply delete the errant conf files from /etc/pve/qemu-server on the third machine via the CLI (the commands we have in mind are sketched after the questions)? Deleting via the GUI fails because the disks they reference (and in some cases the ZFS pool they were on) don't exist on the third machine. If we remove the files manually, will that clear the VMs from the GUI?
2. Alternatively, if we copied the conf files back to the original hosts (and deleted the errant copies), would that restore the VMs where the disk files still exist? Is there anything we'd need to do to make the original host re-scan for them? (Also covered in the sketch after the questions.)
3. Until we get the switch issue resolved (probably by replacing it), will setting all the HA resources to "ignored" prevent any failover/migration attempts? I tried setting them to "disabled", but as soon as we restarted the VMs the state went back to "started". For now we just want the VMs to remain on their existing hosts through any reboots, not attempt to fail over or migrate. (The commands we're considering are in the second sketch below.)
4. What could have caused the system to try to migrate VMs to entirely the wrong host, with none of their replicated disks present?
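For questions 1 and 2, this is the sort of manual clean-up we have in mind, on the understanding that /etc/pve is the cluster-wide pmxcfs and that each VM's config lives under /etc/pve/nodes/<node>/qemu-server/ (VM ID 100 and node names small3/big1 are placeholders). We haven't run these yet:

    # Q1: drop an orphaned config from the third node; as we understand it this
    # also removes the VM from the GUI and does not touch any disks
    rm /etc/pve/nodes/small3/qemu-server/100.conf
    # Q2: or, instead of deleting, move the config back to the node that still
    # holds the ZFS volumes, which should make that node the VM's owner again
    mv /etc/pve/nodes/small3/qemu-server/100.conf /etc/pve/nodes/big1/qemu-server/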
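And for question 3, the HA changes we're considering, per resource (vm:100 is again a placeholder):

    # stop HA managing the resource but keep its configuration around
    ha-manager set vm:100 --state ignored
    # or take it out of HA entirely until the switch is replaced
    ha-manager remove vm:100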