HA Migration issues when VM has local storage

NomadCF

Active Member
Dec 20, 2017
We continue to run into an issue where we unexpectedly lose a host for some reason (network outage, power, hardware, etc.). When this happens to a VM that is configured for HA and is using local storage, the VM becomes unusable as HA tries and fails to migrate it to another host. Proxmox incorrectly moves the VM's configuration to another host per HA without first verifying that the storage has been moved successfully and exists there. When the host does come back online, the VM can't be moved back either by HA (after clearing the error) or via migration through the GUI or CLI. We have to manually go in and move the config to the correct host, then clear the HA error. Then everything starts up correctly.

We understand that HA is trying to migrate the VM in question and bring it back online, but it is assuming the storage in use is shared, without any kind of verification first.
 
That doesn't always apply, i.e. you may be using ZFS as local storage + storage replication and configure HA to use a group with those "storage replicated" servers.

You should simply not configure a VM in HA if you know that the VM will not be able to be moved to another server. Some kind of shared or replicated storage is listed as a requirement in the manual [1].

Maybe PVE should not allow configuring HA for a VM with local storage or, as you say, should not try to migrate the VM, even if it is in HA, if at the time HA kicks in the VM uses local storage (maybe the VM had shared storage and got its drive(s) moved to local for some reason).

[1] https://pve.proxmox.com/wiki/High_Availability#_requirements
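For anyone wanting that replicated-local-storage variant, a rough sketch of the CLI steps follows. The node names, VM ID, job ID and schedule are made up for illustration; check them against your own cluster before running anything:

```shell
# Hypothetical example: VM 100 on local ZFS, replicated between pve1 and pve2,
# with an HA group restricted to exactly those two nodes.

# Replicate VM 100's disks to pve2 every 15 minutes.
pvesr create-local-job 100-0 pve2 --schedule "*/15"

# HA group that only contains the nodes participating in the replication.
ha-manager groupadd zfs-replicated --nodes "pve1,pve2"

# Put the VM under HA, restricted to that group.
ha-manager add vm:100 --group zfs-replicated
```

The point of the group is that HA will then only ever try to recover the VM on a node that actually holds a replica of its disks.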
 


I never used or talked about storage replication; I specifically outlined an HA setup using local storage only, and this error is repeatable. HA plus local storage is a valid setup and is extremely useful as is. But the fact remains that HA doesn't do any kind of validation checks on the VM's storage before moving and starting a VM.

Even with storage replication this can be the case in the event the replication has become corrupt (whether due to its first sync not finishing before HA is activated, the dataset not existing for whatever reason, etc.).

The fact is Proxmox needs to do more checks and validations instead of just assuming and hoping for the best.
 
I specifically outlined an HA setup using local storage only
How would this work?

Either you establish "networked storage" or utilize ZFS replication. In any case the HA-partner needs access to the virtual disks of the just-died VM.
 

What needs to happen is that HA checks that the remote host can access the needed storage; if not, it should not move the config there, and it should put HA into an error state if no other host can access that storage. Problem solved.
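The check described above could be sketched roughly like this. This is illustrative pseudologic, not Proxmox internals: the function names and data structures are made up, and a real implementation would read the VM config and storage definitions from /etc/pve:

```python
# Hypothetical pre-migration check: before moving a VM's config to a target
# node, verify that every storage referenced by the VM's disks is available
# on that node.

def disk_storages(vm_config: dict) -> set:
    """Extract storage IDs from disk entries like 'local-zfs:vm-100-disk-0,size=32G'."""
    storages = set()
    for key, value in vm_config.items():
        # Only disk-like keys reference a storage before the first colon.
        if key.startswith(("scsi", "virtio", "ide", "sata")) and ":" in value:
            storages.add(value.split(":", 1)[0])
    return storages

def can_relocate(vm_config: dict, target_node: str, node_storages: dict) -> bool:
    """Return True only if the target node can access all of the VM's storages."""
    needed = disk_storages(vm_config)
    available = node_storages.get(target_node, set())
    return needed <= available
```

With a check like this, a VM on a purely local storage would simply fail the test for every other node, and HA could error out instead of stranding the config on a host that can't start it.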

HA will live migrate without replication when you vary the priority of hosts in that HA group.
 
I would like to know your use case for HA with local storage, as I can't really find a situation where it would work. Again, shared storage is a requirement for HA.

Simple: all clustered systems (firewall, DHCP, DNS and AD) have a failover VM. In each of these clusters a VM is always set up to not use any shared storage, as a (last resort) fallback. This way only a single node and disk set is required to keep everything limping along.
 
So don't configure HA for those VMs: problem solved.
 

No, the code is buggy. HA itself will have this issue with any VM and any storage setup when that storage is unavailable to that VM & host. The code base needs to verify that the host has access to the needed storage & the virtual disks for that VM before moving it anywhere, at any time. It's that simple.

And HA for this setup is, again, useful from a management perspective.
 
You don't agree HA should check to make sure that the storage and disks are available before migrating a VM config to a new host?
Couldn't stop myself from registering just to reply to this thread. :D

Despite being a very stubborn, hardliner kind of Linux user for 15+ years and a hardcore fan of Proxmox, I notice the stubbornness and silly arguments of these respectable members not accepting a simple bug (yeah, it is a bug, no matter how 'expert-looking' the arguments you bring in are) as a bug.

I understand that the devs may never look at these forums (or rarely, if ever), but this kind of attitude stops us hardliners from seeing a product from an end-user's perspective and making it actually likable by those who just want to 'USE' it, not become expert 'Nerds' in it.

There can be multiple situations where HA can be useful with local storage, and there can be infinite scenarios where a user can simply make a mistake while creating a VM - remember, Proxmox boasts of supporting large enterprises, not just a 4-VM home lab. It's easy to make a dozen such mistakes every day in a busy workplace, and if you disagree, it's obvious you've never worked in a busy place.

Within Proxmox itself there are such checks in place; for example, the minimum RAM amount can't be more than the maximum assigned - the form simply doesn't allow you to make that mistake. This kind of validation is a very BASIC need of all sorts of automation and programming.

Not having such validation is unacceptable, and leaving a user stuck in a situation like NomadCF's, without easy revert options, forcing them to resort to nerdy command-line stuff, is simply a deal-breaker for ANY ENTERPRISE looking for a production-ready solution.

And this is coming from someone who just hates UIs and loves the CLI for almost everything (I even integrated DRBD manually and ran it successfully for 1+ year in Proxmox, just 2 years ago). But I also happen to have accepted the fact that in a production environment, where lots of people, team members or clients depend on you, you just need to get the job done ASAP, not dive into fancy research and troubleshooting over a simple mistake like checking the wrong box.

And in my case, it was not even a mistake. I was just 'in progress' with my infra setup, and one node's connectivity was accidentally lost for a couple of minutes because the network guys were working in parallel on some stuff.

What NomadCF said is actually a very fundamental thing called "common sense": don't step without looking whether there is ground beneath. :D
 
a simple bug (yeah, it is a bug, no matter how 'expert-looking' the arguments you bring in are) as a bug.
I don't agree with that statement: it's not a bug, it's how HA has always worked, and it requires shared storage (i.e. Ceph, NFS, CIFS, SAN, local ZFS + replication) [1]. As HA is right now, it's fully the admin's responsibility to check the system's configuration (i.e. the storage type, or to limit the HA group/affinity to the nodes that do have the VM's disk(s)).

Of course it would be better if PVE did some checks to prevent human errors, but they are not implemented, and that's why I linked to HA's Bugzilla, so the OP or anyone can explain their use case and file a request for enhancement.

[1] https://pve.proxmox.com/wiki/High_Availability#_requirements
 
Of course it would be better if PVE did some checks to prevent human errors

I would be satisfied if there were a "pve-cluster-sanity-check" script which I could run from time to time, be it manually or via cron. It would work like the well-done pve8to9 and check for correct configuration settings.

For example, "relevant networks present on all nodes", "all required storages for HA migration available", and so on. The list of checks would probably grow over time, and the result would be just a list of textual hints and warnings, with no automatic runtime configuration.
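The storage part of such a sanity check could look something like the sketch below. Everything here is made up for illustration (a real script would read the cluster state from /etc/pve or the API); it only shows the shape of the idea - walk the HA-managed VMs and emit pve8to9-style textual warnings:

```python
# Illustrative sketch of one check in a hypothetical "pve-cluster-sanity-check":
# warn when a node in a VM's HA group lacks one of the storages the VM needs.

def sanity_check(ha_vms: dict, node_storages: dict) -> list:
    """ha_vms: {vmid: {"group_nodes": [...], "storages": set(...)}}.
    node_storages: {node_name: set of storage IDs available there}.
    Returns a list of human-readable warnings; empty means all clear."""
    warnings = []
    for vmid, info in sorted(ha_vms.items()):
        for node in info["group_nodes"]:
            missing = info["storages"] - node_storages.get(node, set())
            for storage in sorted(missing):
                warnings.append(
                    f"WARN: VM {vmid}: storage '{storage}' "
                    f"not available on node '{node}'"
                )
    return warnings
```

Run manually or from cron, a non-empty output would flag exactly the situation this thread is about, before HA ever has to recover anything.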