Avoid unnecessary process freeze during Storage Replication for LXC with single volume on ZFS

Jun 25, 2024
I'm currently running a three-node PVE 8.2.2 cluster (kernel: 6.8.4-3-pve) and have several LXCs participating in ZFS Storage Replication. I've noticed that the replication process seems to call lxc-freeze or some equivalent before creating the ZFS snapshots. For the most part this works out fine, but on one of my LXCs running the Frigate NVR, I've observed that the freeze process takes several seconds, which has a noticeable effect inside the LXC. It can crash the processes that are ingesting video from cameras.

I haven't observed the freeze interfering with normal operations in my other LXCs, and the logs indicate that the freeze + snapshot + thaw is occurring in under a second.

I understand the motivation for the freeze/thaw cycle when there is more than one mountpoint attached, in order to ensure the entire container is in a crash-consistent state across multiple filesystems. However, for a single ZFS dataset LXC, I would think this should be unnecessary since ZFS already ensures that a dataset snapshot is atomic.

Is it possible or planned in the future for PVE Storage Replication and Backups to avoid freezing all processes when there is only a single dataset attached? Alternatively, any thoughts on why a particular LXC would take several seconds to freeze and ideas on how to shorten it?

Thanks!
 
I've observed that the freeze process takes several seconds
Interesting. I'd guess that means the data it wants on disk is still in memory, so the pause is probably the result of some `sync`-type call.

Is it possible or planned in the future for PVE Storage Replication and Backups to avoid freezing all processes when there is only a single dataset attached?
Total guess here, but probably not. Freezing and syncing data to disk prior to snapshotting is generally regarded as a foundational feature that other things get built on top of. Probably no one has (yet) really thought of situations where it might not be the best thing to do.

any thoughts on why a particular LXC would take several seconds to freeze and ideas on how to shorten it?
Again, complete guessing here but (as above) I'd suspect it's because that container has a bunch of unwritten disk data in RAM, and the "several seconds" thing is probably how long it's taking to write that to disk.

For shortening it... two concepts spring to mind, though I have no idea on what you'd need to do for making them happen:
  • Increase the frequency of that container flushing its data to disk. I'm pretty sure ZFS defaults to something like 5 seconds between automatic flushes, so maybe see if you can reduce that to (e.g.) 1 second for that container?
  • Alternatively, see if you can limit the size of the disk write buffer it's using so that it never gets big enough to cause bad pauses when flushing.
It's not something I've personally explored at all, but doing some searching online will probably turn up useful info to investigate. :)
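For anyone wanting a starting point, here's a rough sketch of how to look at the two knobs I have in mind. On Linux, OpenZFS exposes `zfs_txg_timeout` (seconds between forced transaction group syncs, default 5) and `zfs_dirty_data_max` (cap on unwritten data, in bytes) as module parameters. As far as I know both are host-wide, so there's no per-container setting, and whether changing them actually shortens the freeze is an assumption I haven't tested. The `show_zfs_param` helper and the `ZFS_PARAM_DIR` override are just names I made up for the example.

```shell
#!/bin/sh
# Directory where OpenZFS exposes its module parameters on Linux.
# Overridable (hypothetical variable name) so the helper can be tried
# against a scratch directory.
ZFS_PARAM_DIR="${ZFS_PARAM_DIR:-/sys/module/zfs/parameters}"

# Print "name=value" for one module parameter, or "name=unavailable"
# if the file cannot be read (e.g. the zfs module is not loaded).
show_zfs_param() {
    if [ -r "$ZFS_PARAM_DIR/$1" ]; then
        printf '%s=%s\n' "$1" "$(cat "$ZFS_PARAM_DIR/$1")"
    else
        printf '%s=unavailable\n' "$1"
    fi
}

show_zfs_param zfs_txg_timeout      # seconds between forced txg syncs
show_zfs_param zfs_dirty_data_max   # cap on dirty (unwritten) data, bytes

# To try a shorter flush interval until the next reboot (as root, and
# affecting every pool on the host, not just one container):
#   echo 1 > /sys/module/zfs/parameters/zfs_txg_timeout
```

If a change turns out to help, it would need to be made persistent separately (e.g. via module options), and I'd measure the freeze time before and after rather than trust my guess.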
 
I've noticed that the replication process seems to call lxc-freeze or some equivalent before creating the ZFS snapshots

Hello,

Have you ever found a solution for this?
I am running into similar issues during snapshots of containers under workloads.
The short freeze causes errors and has even led to crashing LXC containers.
 
A bit of a late reply, but I think this should work in your:
Code:
/etc/pve/lxc/{lxc}.conf

Code:
backup: skip-lock


Remember!!! The container gets frozen for a reason! It is done to ensure the data is in a consistent state for a proper snapshot or backup.