Avoid unnecessary process freeze during Storage Replication for LXC with single volume on ZFS

alarsson · Jun 25, 2024

I'm currently running a three node PVE 8.2.2 cluster (kernel: 6.8.4-3-pve) and have several LXCs participating in ZFS Storage Replication. I've noticed that the replication process seems to call lxc-freeze or some equivalent before creating the ZFS snapshots. For the most part this works out fine, but on one of my LXCs running the Frigate NVR, I've observed that the freeze process takes several seconds and that has a noticeable effect inside the LXC. This can cause processes to crash that are ingesting video from cameras.

I haven't observed the freeze interfering with normal operations in my other LXCs, and the logs indicate that the freeze + snapshot + thaw is occurring in under a second.

I understand the motivation for the freeze/thaw cycle when there is more than one mountpoint attached, in order to ensure the entire container is in a crash-consistent state across multiple filesystems. However, for an LXC with a single ZFS dataset, I would think this should be unnecessary since ZFS already ensures that a dataset snapshot is atomic.

Is it possible or planned in the future for PVE Storage Replication and Backups to avoid freezing all processes when there is only a single dataset attached? Alternatively, any thoughts on why a particular LXC would take several seconds to freeze and ideas on how to shorten it?

Thanks!

justinclift · Jun 25, 2024

alarsson said:
I've observed that the freeze process takes several seconds

Interesting. I'd guess that means the data it wants on disk is still in memory, so the pause is probably the result of a `sync` type of call of some sort.

alarsson said:
Is it possible or planned in the future for PVE Storage Replication and Backups to avoid freezing all processes when there is only a single dataset attached?

Total guess here, but probably not. Freezing and sync-ing data to disk prior to snapshotting is generally regarded as a foundational feature that other things get built on top of. Probably no-one's (yet) really thought of situations where it might not be the best thing to do.

alarsson said:
any thoughts on why a particular LXC would take several seconds to freeze and ideas on how to shorten it?

Again, complete guessing here but (as above) I'd suspect it's because that container has a bunch of unwritten disk data in RAM and the "several seconds" thing is probably how long it's taking to write that to disk.
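One rough way to test that guess is to time a flush from inside the container while Frigate is recording. The sketch below is only an illustration: `time_sync` is a made-up helper name, and it assumes that a slow `os.sync()` in the container points at the same backlog that slows the freeze, which is not confirmed anywhere in this thread.

```python
import os
import time


def time_sync(sync_fn=os.sync):
    """Return how many seconds a single flush-to-disk call takes.

    sync_fn defaults to os.sync(), which asks the kernel to write out
    buffered data. It is a parameter only so the timing logic can be
    exercised without touching the disk.
    """
    start = time.monotonic()
    sync_fn()
    return time.monotonic() - start


if __name__ == "__main__":
    # Hypothetical usage: run this a few times while the workload is busy.
    print(f"sync took {time_sync():.3f} s")
```

If this regularly reports several seconds, unwritten data is a plausible cause. If it returns almost instantly, the delay is more likely coming from somewhere else.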

For shortening it... two concepts spring to mind, though I have no idea on what you'd need to do for making them happen:

Increase the frequency of that container flushing its data to disk. I'm pretty sure ZFS defaults to something like 5 seconds between automatic flushes, so maybe see if you can reduce that to (e.g.) 1 second for that container?
Alternatively, see if you can limit the size of the disk write buffer it's using so that it never gets big enough to cause bad pauses when flushing.
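As a starting point for both ideas, here is a minimal sketch that reads the current values on the PVE host. It assumes the relevant knobs are the OpenZFS module parameters `zfs_txg_timeout` (seconds between transaction group flushes) and `zfs_dirty_data_max` (upper bound on buffered dirty data, in bytes), exposed under `/sys/module/zfs/parameters`. `read_zfs_params` is a hypothetical helper name. As far as I know these settings apply to the whole host rather than to a single container, so changing them would affect every pool.

```python
from pathlib import Path

# Assumed location of the OpenZFS module parameters on a Linux host.
ZFS_PARAMS = Path("/sys/module/zfs/parameters")


def read_zfs_params(names=("zfs_txg_timeout", "zfs_dirty_data_max"),
                    base=ZFS_PARAMS):
    """Return {parameter name: integer value, or None if unreadable}."""
    values = {}
    for name in names:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            # Missing file (ZFS module not loaded) or non-numeric content.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_zfs_params().items():
        print(f"{name} = {value}")
```

This only reads the values. Whether lowering either one actually shortens the freeze for this workload is something to test, not something this thread establishes.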

It's not something I've personally explored at all, but doing some searching online will probably turn up useful info to investigate.
