HA and Database Corruption

Jul 20, 2025
Hi Forum,

I am new to HA and have had a running setup for two months now. From the experience I have gathered, it looks like live migration via an HA group fails on almost all VMs and CTs that run a database (sadly, almost everything I have).

These were the results of changing the HA group, followed by an automatic migration:
  • 2 WordPress hosts both showed a "white screen of death" when editing any page
  • several Docker containers no longer started (e.g. PhotoPrism, Portainer)
  • a Bitcoin Lightning node issued a detrimental force closure (that one is certainly on me, as one should simply not run a Lightning node in an HA setup; it is too delicate)
Now, obviously, I went into HA too naively. Is there a way to tell HA to shut down the VM/CT before migration?
 
Make sure that a "write data now!" command on the database machine actually writes the data. No insecure/lying write caches, please...
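For ZFS-backed storage, one way to verify this is the dataset's `sync` property; a sketch, using the dataset name from the WordPress container mentioned later in the thread (yours may differ):

```shell
# Show how the dataset handles synchronous writes.
# sync=standard  honors fsync()/O_SYNC requests (the safe default)
# sync=always    forces every write to stable storage first
# sync=disabled  acknowledges writes before they reach disk -- the
#                "lying write cache" case that can corrupt databases
zfs get sync dolphin/subvol-107-disk-0

# For database workloads, make sure it is not disabled:
zfs set sync=standard dolphin/subvol-107-disk-0
```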
 
Thank you all for your responses. Upon further reflection, I may have set this up incorrectly. The nodes do not actually use "shared storage." Instead, Node 1 stores its data on one set of disks and Node 2 on a different set. To enable replication, I named the ZFS pools identically on both nodes (for example, in one of the WordPress containers, "dolphin" — rootfs: dolphin:subvol-107-disk-0,size=8G).
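With identically named pools on both nodes, Proxmox storage replication jobs can be managed with `pvesr`; a minimal sketch, assuming the target node is called `node2` (the job ID and schedule are illustrative):

```shell
# Create a replication job for guest 107 to node2, every 15 minutes.
# The job ID is <vmid>-<number>; 107-0 is the first job for guest 107.
pvesr create-local-job 107-0 node2 --schedule "*/15"

# List configured jobs and their last sync status:
pvesr list
pvesr status
```

Note that this replication is asynchronous: after a node failure, the replica can lag by up to one schedule interval, which is exactly the "came back at T-30s" scenario discussed later in this thread.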


The CPUs on the two nodes differ: Node 1 uses an AMD EPYC 7502P (2.5 GHz), while Node 2 uses an AMD EPYC 8324PN. Additionally, I have not implemented any mechanism to ensure that a "write now" command was actually honored.
 
The reported Docker and WordPress incidents involved containers, not VMs, so I could not specify a CPU type. For other VMs on the system I use "x86-64-v2-AES" and "EPYC" (please don't ask why). Is that a problem?
 
“Host” or similar more specific choices can be problematic if the physical CPUs aren’t compatible. Though IIRC the VM won’t migrate.

Containers can’t live migrate AFAIK.
 
“Host” or similar more specific choices can be problematic if the physical CPUs aren’t compatible. Though IIRC the VM won’t migrate.

Exactly, they will fail with an error message. The same applies if one has configured PCI passthrough.

Containers can’t live migrate AFAIK.
They can ;) But you will have a short downtime, since (unlike a VM) they can't be migrated with the contents of their memory and simply continue running. Given their short startup times, this doesn't need to be a problem, depending on the use case.
Now one might argue that this isn't "live migration", but in my book no live migration would mean that they could only be migrated while shut down. And that's not the case: you can also migrate running containers (but with a downtime). Nonetheless, this issue is one of the reasons why I prefer VMs when feasible. YMMV
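Such a restart migration of a running container can be triggered explicitly from the CLI; a sketch, assuming container 107 and a target node named `node2`:

```shell
# Stop the container, transfer it, and start it on the target node.
# --restart allows migrating a *running* container (with downtime);
# --timeout limits how long the shutdown may take (in seconds).
pct migrate 107 node2 --restart --timeout 120
```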
 
Thank you for the detailed info. I am still trying to determine the cause of my migration failures. Reading your statements, I tend to believe that, in principle, a container running WordPress should migrate without problems whatever the reason (node failure or operator choice). Is this correct?

In the case of a Bitcoin Lightning node there is money involved. Sudden node failure is not a problem. What is catastrophic is a node that comes back online with a past state, i.e. it went offline at T and came back online at T-30s (i.e. 30 seconds are missing). If I understand you all correctly, a VM should never get into a past state. Am I correct?
 
Thank you for the detailed info. I am still trying to determine the cause of my migration failures. Reading your statements, I tend to believe that, in principle, a container running WordPress should migrate without problems whatever the reason (node failure or operator choice). Is this correct?

Yes, but it will have a downtime until the startup of the container system and service has finished on the target node.

Concerning Docker inside LXC: please reconsider this approach, since it might break after updates due to changes in the underlying kernel and system services (which are used by both LXC and Docker/Podman and can thus lead to conflicts).

Here is one recent example from the PVE 9 beta:

And two older ones:

For VM migration I would suggest changing the CPU type to something compatible with both CPUs:
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#qm_virtual_machines_settings

In your case the default x86-64-v2-AES should already be enough for live migration. x86-64-v3, however, is supposed to work too and allows using EPYC features not present in older CPUs, so I would try that first before falling back to x86-64-v2-AES.
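From the CLI this looks as follows; a sketch, assuming VMID 100 and a target node `node2` (the new CPU type only takes effect after a full stop/start of the VM):

```shell
# Set a generic CPU model that both EPYC generations support:
qm set 100 --cpu x86-64-v3

# Stop and start the VM so the new CPU type is actually used:
qm stop 100 && qm start 100

# Subsequent migrations can then be done live:
qm migrate 100 node2 --online
```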


In the case of a Bitcoin Lightning node there is money involved. Sudden node failure is not a problem. What is catastrophic is a node that comes back online with a past state, i.e. it went offline at T and came back online at T-30s (i.e. 30 seconds are missing). If I understand you all correctly, a VM should never get into a past state. Am I correct?
Not sure about that, to be honest. What if the system time on the VM or host is wrong and NTP time sync doesn't work (for whatever reason)? Isn't there a possibility that your node might then be considered untrustworthy? But I'm by no means an expert on the details of cryptocurrency ponzi schemes, so please take this with a grain of salt ;)
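Whether time sync is actually working inside the guest can be checked quickly; a sketch for a systemd-based VM (chrony shown as one common NTP client, your setup may use another):

```shell
# systemd view: reports "System clock synchronized: yes/no"
# and whether an NTP service is active.
timedatectl status

# If chrony is the NTP client, show current offset and stratum:
chronyc tracking
```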
 
Yes, but it will have a downtime until the startup of the container system and service has finished on the target node.

Concerning Docker inside LXC: please reconsider this approach, since it might break after updates due to changes in the underlying kernel and system services (which are used by both LXC and Docker/Podman and can thus lead to conflicts).

Here is one recent example from the PVE 9 beta:

And two older ones:

For VM migration I would suggest changing the CPU type to something compatible with both CPUs:
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#qm_virtual_machines_settings

In your case the default x86-64-v2-AES should already be enough for live migration. x86-64-v3, however, is supposed to work too and allows using EPYC features not present in older CPUs, so I would try that first before falling back to x86-64-v2-AES.



Not sure about that, to be honest. What if the system time on the VM or host is wrong and NTP time sync doesn't work (for whatever reason)? Isn't there a possibility that your node might then be considered untrustworthy? But I'm by no means an expert on the details of cryptocurrency ponzi schemes, so please take this with a grain of salt ;)
Thank you for the valuable input and the heads-up on the looming incompatibility of LXC and Docker. I will migrate this to a VM, then. Hahaha, ponzi, I hear you. I am working with/on Bitcoin only, no crypto-something, no altcoins ;-)
 
Thank you for the valuable input and the heads-up on the looming incompatibility of LXC and Docker. I will migrate this to a VM, then.

To be fair: if you need to use something like an iGPU, it's easier with LXCs, and if your software is only distributed as an OCI image (the format used by Podman and Docker), you can't do much about it. There are also a lot of people (more on Reddit's r/homelab and r/proxmox than here, though) who run their Docker instances inside an LXC to save on resources. So you can definitely do this if you can live with manual troubleshooting from time to time.

For me this is not worth it (except for things like the mentioned iGPU passthrough (Plex, Jellyfin and co.) or self-contained applications without need for Docker, such as Pi-hole), since a lightweight Linux VM (like Alpine or Debian) doesn't use many resources either, and you don't need to spin up a new VM for every Docker instance you want to host. Instead, you can run all of them from one or two VMs.