HA and Database Corruption

Jul 20, 2025
Hi Forum,

I am new to HA and have had a running setup for two months now. From the experience I have gathered, it looks like live migration via an HA group fails on almost all VMs and CTs that run a database (sadly, almost everything I have).

These were the results of changing the HA group, followed by an automatic migration:
  • 2 WordPress hosts both showed a "white screen of death" when editing any page
  • several Docker containers no longer started (e.g. PhotoPrism, Portainer)
  • a Bitcoin Lightning node issued a detrimental force closure (that one is certainly on me, as one should simply not run a Lightning node in an HA setup; it is too delicate)
Now, obviously, I went into HA too naively. Is there a way to tell HA to shut down the VM/CT before migration?
 
Make sure that a "write data now!" command on the database machine actually writes the data. No insecure/lying write caches, please...
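For ZFS-backed storage, one way to verify this is the dataset's `sync` property; a sketch, using the dataset name from the WordPress container mentioned later in the thread (yours may differ):

```shell
# Show how the dataset handles synchronous writes.
# sync=standard  honors fsync()/O_SYNC requests (the safe default)
# sync=always    forces every write to stable storage first
# sync=disabled  acknowledges writes before they reach disk -- the
#                "lying write cache" case that can corrupt databases
zfs get sync dolphin/subvol-107-disk-0

# For database workloads, make sure it is not disabled:
zfs set sync=standard dolphin/subvol-107-disk-0
```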
 
Thank you all for your responses. Upon further reflection, I may have set this up incorrectly. The nodes do not actually use "shared storage." Instead, Node 1 stores its data on one set of disks and Node 2 on a different set. To enable replication, I named the ZFS pools identically on both nodes (for example, in one of the WordPress containers, "dolphin" — rootfs: dolphin:subvol-107-disk-0,size=8G).
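With identically named pools on both nodes, Proxmox storage replication jobs can be managed with `pvesr`; a minimal sketch, assuming the target node is called `node2` (the job ID and schedule are illustrative):

```shell
# Create a replication job for guest 107 to node2, every 15 minutes.
# The job ID is <vmid>-<number>; 107-0 is the first job for guest 107.
pvesr create-local-job 107-0 node2 --schedule "*/15"

# List configured jobs and their last sync status:
pvesr list
pvesr status
```

Note that this replication is asynchronous: after a node failure, the replica can lag by up to one schedule interval, which is exactly the "came back at T-30s" scenario discussed later in this thread.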


The CPUs on the two nodes differ: Node 1 uses an AMD EPYC 7502P (2.5 GHz), while Node 2 uses an AMD EPYC 8324PN. Additionally, I have not implemented any mechanism to ensure that a "write now" command was actually honored.
 
The reported Docker and WordPress incidents involved containers, not VMs, so I could not specify a CPU type. For other VMs on the system I use "x86-64-v2-AES" and "EPYC" (please don't ask why). Is that a problem?
 
“Host” or similar more specific choices can be problematic if the physical CPUs aren’t compatible. Though IIRC the VM won’t migrate.

Containers can’t live migrate AFAIK.
 
“Host” or similar more specific choices can be problematic if the physical CPUs aren’t compatible. Though IIRC the VM won’t migrate.

Exactly, they will fail with an error message. The same applies if one has configured PCI passthrough.

Containers can’t live migrate AFAIK.
They can ;) But you will have a short downtime, since (unlike a VM) they can't be migrated with the contents of their memory and simply continue running. Given their short startup times, this doesn't need to be a problem, depending on the use case.
Now one might argue that this isn't "live migration", but in my book no live migration would mean that they could only be migrated while shut down. And that's not the case: you can also migrate running containers (but with a downtime). Nonetheless, this issue is one of the reasons why I prefer VMs when feasible. YMMV
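Such a restart migration of a running container can be triggered explicitly from the CLI; a sketch, assuming container 107 and a target node named `node2`:

```shell
# Stop the container, transfer it, and start it on the target node.
# --restart allows migrating a *running* container (with downtime);
# --timeout limits how long the shutdown may take (in seconds).
pct migrate 107 node2 --restart --timeout 120
```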
 
Thank you for the detailed info. I am still trying to determine the cause of my migration failures. Reading your statements, I tend to believe that, in principle, a container running WordPress should migrate without problems whatever the reason (node failure or operator choice). Is this correct?

In the case of a Bitcoin Lightning node there is money involved. Sudden node failure is not a problem. What is catastrophic is a node that comes back online with a past state, i.e. it went offline at T and came back online at T-30s (i.e. 30 seconds are missing). If I understand you all correctly, a VM should never get into a past state. Am I correct?
 
Thank you for the detailed info. I am still trying to determine the cause of my migration failures. Reading your statements, I tend to believe that, in principle, a container running WordPress should migrate without problems whatever the reason (node failure or operator choice). Is this correct?

Yes, but it will have a downtime until the startup of the container system and service has finished on the target node.

Concerning Docker inside LXC: please reconsider this approach, since it might break after updates due to changes in the underlying kernel and system services (which are used by both LXC and Docker/Podman and can thus lead to conflicts).

Here is one recent example from the PVE 9 beta:

And two older ones:

For VM migration I would suggest changing the CPU type to something compatible with both CPUs:
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#qm_virtual_machines_settings

In your case the default x86-64-v2-AES should already be enough for live migration. x86-64-v3, however, is supposed to work too and allows using EPYC features not present in older CPUs, so I would try that first before falling back to x86-64-v2-AES.
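From the CLI this looks as follows; a sketch, assuming VMID 100 and a target node `node2` (the new CPU type only takes effect after a full stop/start of the VM):

```shell
# Set a generic CPU model that both EPYC generations support:
qm set 100 --cpu x86-64-v3

# Stop and start the VM so the new CPU type is actually used:
qm stop 100 && qm start 100

# Subsequent migrations can then be done live:
qm migrate 100 node2 --online
```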


In the case of a Bitcoin Lightning node there is money involved. Sudden node failure is not a problem. What is catastrophic is a node that comes back online with a past state, i.e. it went offline at T and came back online at T-30s (i.e. 30 seconds are missing). If I understand you all correctly, a VM should never get into a past state. Am I correct?
Not sure about that, to be honest. What if the system time on the VM or host is wrong and NTP time sync doesn't work (for whatever reason)? Isn't there a possibility that your node might then be considered untrustworthy? But I'm by no means an expert on the details of cryptocurrency ponzi schemes, so please take this with a grain of salt ;)
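Whether time sync is actually working inside the guest can be checked quickly; a sketch for a systemd-based VM (chrony shown as one common NTP client, your setup may use another):

```shell
# systemd view: reports "System clock synchronized: yes/no"
# and whether an NTP service is active.
timedatectl status

# If chrony is the NTP client, show current offset and stratum:
chronyc tracking
```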
 
Yes, but it will have a downtime until the startup of the container system and service has finished on the target node.

Concerning Docker inside LXC: please reconsider this approach, since it might break after updates due to changes in the underlying kernel and system services (which are used by both LXC and Docker/Podman and can thus lead to conflicts).

Here is one recent example from the PVE 9 beta:

And two older ones:

For VM migration I would suggest changing the CPU type to something compatible with both CPUs:
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#qm_virtual_machines_settings

In your case the default x86-64-v2-AES should already be enough for live migration. x86-64-v3, however, is supposed to work too and allows using EPYC features not present in older CPUs, so I would try that first before falling back to x86-64-v2-AES.



Not sure about that, to be honest. What if the system time on the VM or host is wrong and NTP time sync doesn't work (for whatever reason)? Isn't there a possibility that your node might then be considered untrustworthy? But I'm by no means an expert on the details of cryptocurrency ponzi schemes, so please take this with a grain of salt ;)
Thank you for the valuable input and the heads-up on the looming incompatibility of LXC and Docker. I will migrate this to a VM, then. Hahaha, ponzi, I hear you. I am working with/on Bitcoin only, no crypto-something, no altcoins ;-)
 
Thank you for the valuable input and the heads-up on the looming incompatibility of LXC and Docker. I will migrate this to a VM, then.

To be fair: if you need to use something like an iGPU, it's easier with LXCs, and if your software is only distributed as an OCI image (the format used by Podman and Docker), you can't do much about it. There are also a lot of people (more on Reddit's r/homelab and r/proxmox than here, though) who run their Docker instances inside an LXC to save on resources. So you can definitely do this if you can live with manual troubleshooting from time to time.

For me this is not worth it (except for things like the mentioned iGPU passthrough (Plex, Jellyfin and co.) or self-contained applications without need for Docker, such as Pi-hole), since a lightweight Linux VM (like Alpine or Debian) doesn't use many resources either, and you don't need to spin up a new VM for every Docker instance you want to host. Instead, you can run all of them from one or two VMs.