Avoid unnecessary process freeze during Storage Replication for LXC with single volume on ZFS

Jun 25, 2024
I'm currently running a three node PVE 8.2.2 cluster (kernel: 6.8.4-3-pve) and have several LXCs participating in ZFS Storage Replication. I've noticed that the replication process seems to call lxc-freeze or some equivalent before creating the ZFS snapshots. For the most part this works out fine, but on one of my LXCs running the Frigate NVR, I've observed that the freeze process takes several seconds and that has a noticeable effect inside the LXC. This can crash the processes that are ingesting video from the cameras.

I haven't observed the freeze interfering with normal operations in my other LXCs, and the logs indicate that the freeze + snapshot + thaw is occurring in under a second.

I understand the motivation for the freeze/thaw cycle when there is more than one mountpoint attached, in order to ensure the entire container is in a crash-consistent state across multiple filesystems. However, for an LXC with a single ZFS dataset, I would think this should be unnecessary, since ZFS already ensures that a dataset snapshot is atomic.

Is it possible or planned in the future for PVE Storage Replication and Backups to avoid freezing all processes when there is only a single dataset attached? Alternatively, any thoughts on why a particular LXC would take several seconds to freeze and ideas on how to shorten it?

Thanks!
 
I've observed that the freeze process takes several seconds
Interesting. I'd guess that means the data it wants on disk is still in memory, so the pause is probably the result of a `sync`-type call of some sort.

Is it possible or planned in the future for PVE Storage Replication and Backups to avoid freezing all processes when there is only a single dataset attached?
Total guess here, but probably not. Freezing and syncing data to disk prior to snapshotting is generally regarded as a foundational feature that other things get built on top of. Probably no one has (yet) really thought about situations where it might not be the best thing to do.

any thoughts on why a particular LXC would take several seconds to freeze and ideas on how to shorten it?
Again, complete guessing here but (as above) I'd suspect it's because that container has a bunch of unwritten disk data in RAM, and the "several seconds" thing is probably how long it's taking to write that to disk.
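One way to test that guess would be to time the freeze step on its own, separate from any sync or snapshot. The following is only a rough sketch, assuming the host uses cgroup v2 (where a group is frozen by writing `1` to `cgroup.freeze` and reports its state as `frozen 1` in `cgroup.events`). The function names are made up for illustration, and the container's cgroup path (something like `/sys/fs/cgroup/lxc/<vmid>`) is an assumption to check on your own host. Note that it really does freeze the container briefly, so it will reproduce the hiccup.

```python
import time
from pathlib import Path


def is_frozen(events_text: str) -> bool:
    """Return True if the text of a cgroup.events file reports 'frozen 1'."""
    for line in events_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "frozen":
            return value.strip() == "1"
    return False


def time_freeze(cgroup_dir, timeout: float = 30.0, poll: float = 0.01) -> float:
    """Freeze a cgroup v2 group, wait until it reports frozen, then thaw it.

    Returns the seconds between requesting the freeze and the kernel
    reporting the group as frozen. Needs root when used on a real cgroup.
    """
    cgroup = Path(cgroup_dir)
    start = time.monotonic()
    (cgroup / "cgroup.freeze").write_text("1")
    try:
        while not is_frozen((cgroup / "cgroup.events").read_text()):
            if time.monotonic() - start > timeout:
                raise TimeoutError(f"{cgroup} did not freeze within {timeout}s")
            time.sleep(poll)
        return time.monotonic() - start
    finally:
        # Always thaw again, even if the wait timed out or failed.
        (cgroup / "cgroup.freeze").write_text("0")


# Hypothetical usage (path and container ID are assumptions):
# print(time_freeze("/sys/fs/cgroup/lxc/101"))
```

If the number this returns is small, the freeze itself is quick and the multi-second pause is more likely spent in whatever happens while the container is frozen (such as a sync); if it is large, the freezer is waiting on the container's own tasks.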

For shortening it... two concepts spring to mind, though I have no idea what you'd need to do to make them happen:
  • Increase the frequency of that container flushing its data to disk. I'm pretty sure ZFS defaults to something like 5 seconds between automatic flushes, so maybe see if you can reduce that to (eg) 1 second for that container?
  • Alternatively, see if you can limit the size of the disk write buffer it's using so that it never gets big enough to cause bad pauses when flushing
It's not something I've personally explored at all, but doing some searching online will probably turn up useful info to investigate. :)
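As a starting point for that investigation, here's a small sketch that just reads the current values of the two OpenZFS module parameters that seem closest to those ideas: `zfs_txg_timeout` (seconds between transaction group flushes) and `zfs_dirty_data_max` (upper bound on unwritten data, in bytes). The helper name is made up, and it assumes a Linux host where the parameters are exposed under `/sys/module/zfs/parameters`. Bear in mind these are settings of the ZFS module on the host as a whole, not per container, so changing them affects every pool on the node.

```python
from pathlib import Path

# Standard location of ZFS module parameters on Linux (assumed to be present).
ZFS_PARAM_DIR = Path("/sys/module/zfs/parameters")


def read_zfs_params(names, base=ZFS_PARAM_DIR):
    """Read integer ZFS module parameters; missing or non-numeric ones map to None."""
    values = {}
    for name in names:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (FileNotFoundError, ValueError):
            values[name] = None
    return values


# Hypothetical usage on the PVE host:
# print(read_zfs_params(["zfs_txg_timeout", "zfs_dirty_data_max"]))
```

It only reads the values; whether lowering either of them actually shortens the freeze is something that would need testing.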
 
