HA Migration issues when VM has local storage

NomadCF

Active Member
Dec 20, 2017
We continue to run into an issue where we unexpectedly lose a host for some reason (network outage, power, hardware, etc.). When this happens to a VM that is configured for HA and is using local storage, the VM becomes unusable as HA tries and fails to migrate it to another host. Proxmox incorrectly moves the VM's configuration to another host per HA without first verifying that the storage has been moved successfully and exists there. When the host does come back online, the VM can't be moved back either by HA (after clearing the error) or via migration through the GUI or CLI. We have to manually go in and move the config to the correct host, then clear the HA error. Then everything starts up correctly.

We understand that HA is trying to migrate the VM in question and bring it back online, but it is assuming the storage in use is shared, without any kind of verification first.
 
That doesn't always apply, i.e. you may be using ZFS as local storage + storage replication and configure HA to use a group with those "storage replicated" servers.

You should simply not configure a VM in HA if you know that the VM will not be able to be moved to another server. Some kind of shared or replicated storage is listed as a requirement in the manual [1].

Maybe PVE should not allow configuring HA for a VM with local storage or, as you say, should not try to migrate the VM, even if it is in HA, if at the time HA kicks in the VM uses local storage (maybe the VM had shared storage and got its drive(s) moved to local for some reason).

[1] https://pve.proxmox.com/wiki/High_Availability#_requirements
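For anyone wanting that replicated-local-storage variant, a rough sketch of the CLI steps follows. The node names, VM ID, job ID and schedule are made up for illustration; check them against your own cluster before running anything:

```shell
# Hypothetical example: VM 100 on local ZFS, replicated between pve1 and pve2,
# with an HA group restricted to exactly those two nodes.

# Replicate VM 100's disks to pve2 every 15 minutes.
pvesr create-local-job 100-0 pve2 --schedule "*/15"

# HA group that only contains the nodes participating in the replication.
ha-manager groupadd zfs-replicated --nodes "pve1,pve2"

# Put the VM under HA, restricted to that group.
ha-manager add vm:100 --group zfs-replicated
```

The point of the group is that HA will then only ever try to recover the VM on a node that actually holds a replica of its disks.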
 


I never used or talked about storage replication; I specifically outlined an HA setup using local storage only, and this error is repeatable. HA plus local storage is a valid setup and is extremely useful as is. But the fact remains that HA doesn't do any kind of validation checks on the VM's storage before moving and starting a VM.

Even with storage replication this can be the case in the event the replication has become corrupt (whether due to its first sync not finishing before HA is activated, the dataset not existing for whatever reason, etc.).

The fact is Proxmox needs to do more checks and validations instead of just assuming and hoping for the best.
 
I specifically outlined an HA setup using local storage only
How would this work?

Either you establish "networked storage" or utilize ZFS replication. In any case the HA-partner needs access to the virtual disks of the just-died VM.
 

What needs to happen is that HA checks that the remote host can access the needed storage; if not, it should not move the config there, and it should put HA into an error state if no other host can access that storage. Problem solved.
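The check described above could be sketched roughly like this. This is illustrative pseudologic, not Proxmox internals: the function names and data structures are made up, and a real implementation would read the VM config and storage definitions from /etc/pve:

```python
# Hypothetical pre-migration check: before moving a VM's config to a target
# node, verify that every storage referenced by the VM's disks is available
# on that node.

def disk_storages(vm_config: dict) -> set:
    """Extract storage IDs from disk entries like 'local-zfs:vm-100-disk-0,size=32G'."""
    storages = set()
    for key, value in vm_config.items():
        # Only disk-like keys reference a storage before the first colon.
        if key.startswith(("scsi", "virtio", "ide", "sata")) and ":" in value:
            storages.add(value.split(":", 1)[0])
    return storages

def can_relocate(vm_config: dict, target_node: str, node_storages: dict) -> bool:
    """Return True only if the target node can access all of the VM's storages."""
    needed = disk_storages(vm_config)
    available = node_storages.get(target_node, set())
    return needed <= available
```

With a check like this, a VM on a purely local storage would simply fail the test for every other node, and HA could error out instead of stranding the config on a host that can't start it.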

HA will live migrate without replication when you vary the priority of hosts in that HA group.
 
I would like to know your use case for HA with local storage, as I can't really find a situation where it would work. Again, shared storage is a requirement for HA.

Simple: all clustered systems (firewall, DHCP, DNS and AD) have a failover VM. In each of these clusters a VM is always set up to not use any shared storage, as a (last resort) fallback. This way only a single node and disk set is required to keep everything limping along.
 
So don't configure HA for those VMs: problem solved.
 

No, the code is buggy. HA itself will have this issue with any VM and any storage setup when that storage is unavailable to that VM & host. The code base needs to verify that the host has access to the needed storage & the virtual disks for that VM before moving it anywhere, at any time. It's that simple.

And HA for this setup is, again, useful from a management perspective.
 
You don't agree HA should check to make sure that the storage and disks are available before migrating a VM config to a new host?
Couldn't stop myself from registering just to reply to this thread. :D

Despite being a very stubborn, hardliner kind of Linux user for 15+ years and a hardcore fan of Proxmox, I notice the stubbornness and silly arguments of these respectable members not accepting a simple bug (yeah, it is a bug, no matter how 'expert-looking' the arguments you bring in are) as a bug.

I understand that the devs may never look at these forums (or rarely, if ever), but this kind of attitude stops us hardliners from seeing a product from an end-user's perspective and making it actually likable by those who just want to 'USE' it, not become expert 'Nerds' in it.

There can be multiple situations where HA can be useful with local storage, and there can be infinite scenarios where a user can simply make a mistake while creating a VM - remember, Proxmox boasts of supporting large enterprises, not just a 4-VM home lab. It's easy to make a dozen such mistakes every day in a busy workplace, and if you disagree, it's obvious you've never worked in a busy place.

Within Proxmox itself there are such checks in place; for example, the minimum RAM amount can't be more than the maximum assigned - the form simply doesn't allow you to make that mistake. This kind of validation is a very BASIC need of all sorts of automation and programming.

Not having such validation is unacceptable, and leaving a user stuck in a situation like NomadCF's, without easy revert options, forcing them to resort to nerdy command-line stuff, is simply a deal-breaker for ANY ENTERPRISE looking for a production-ready solution.

And this is coming from someone who just hates UIs and loves the CLI for almost everything (I even integrated DRBD manually and ran it successfully for 1+ year in Proxmox, just 2 years ago). But I also happen to have accepted the fact that in a production environment, where lots of people, team members or clients depend on you, you just need to get the job done ASAP, not dive into fancy research and troubleshooting over a simple mistake like checking the wrong box.

And in my case, it was not even a mistake. I was just 'in progress' with my infra setup, and one node's connectivity was accidentally lost for a couple of minutes because the network guys were working in parallel on some stuff.

What NomadCF said is actually a very fundamental thing called "common sense": don't step without looking whether there is ground beneath. :D
 
a simple bug (yeah, it is a bug, no matter how 'expert-looking' the arguments you bring in are) as a bug.
I don't agree with that statement: it's not a bug, it's how HA has always worked, and it requires shared storage (i.e. Ceph, NFS, CIFS, SAN, local ZFS + replication) [1]. As HA is right now, it's fully the admin's responsibility to check the system's configuration (i.e. the storage type, or to limit the HA group/affinity to the nodes that do have the VM's disk(s)).

Of course it would be better if PVE did some checks to prevent human errors, but they are not implemented, and that's why I linked to HA's Bugzilla, so the OP or anyone can explain their use case and file a request for enhancement.

[1] https://pve.proxmox.com/wiki/High_Availability#_requirements
 
Of course it would be better if PVE did some checks to prevent human errors

I would be satisfied if there were a "pve-cluster-sanity-check" script which I could run from time to time, be it manually or via cron. It would work like the well-done pve8to9 and check for correct configuration settings.

For example, "relevant networks present on all nodes", "all required storages for HA migration available", and so on. The list of checks would probably grow over time, and the result would be just a list of textual hints and warnings, with no automatic runtime configuration.
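The storage part of such a sanity check could look something like the sketch below. Everything here is made up for illustration (a real script would read the cluster state from /etc/pve or the API); it only shows the shape of the idea - walk the HA-managed VMs and emit pve8to9-style textual warnings:

```python
# Illustrative sketch of one check in a hypothetical "pve-cluster-sanity-check":
# warn when a node in a VM's HA group lacks one of the storages the VM needs.

def sanity_check(ha_vms: dict, node_storages: dict) -> list:
    """ha_vms: {vmid: {"group_nodes": [...], "storages": set(...)}}.
    node_storages: {node_name: set of storage IDs available there}.
    Returns a list of human-readable warnings; empty means all clear."""
    warnings = []
    for vmid, info in sorted(ha_vms.items()):
        for node in info["group_nodes"]:
            missing = info["storages"] - node_storages.get(node, set())
            for storage in sorted(missing):
                warnings.append(
                    f"WARN: VM {vmid}: storage '{storage}' "
                    f"not available on node '{node}'"
                )
    return warnings
```

Run manually or from cron, a non-empty output would flag exactly the situation this thread is about, before HA ever has to recover anything.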