LVM ext4 to ZFS?

toxic

Active Member
Aug 1, 2020
Hi,
I am running Frigate in an LXC container and would like to make it highly available using the lukewarm replication of the filesystem, as my attempts with Ceph are producing slow results (cheap SSDs, 1 Gb Ethernet...).

I just realized this relies on ZFS snapshots, meaning my existing ext4 installs (3 nodes) either need to be fully reinstalled, or, with help from this thread, I can hopefully shrink LVM and replace it with ZFS bit by bit.

Each node has a 500 GB Samsung Evo SATA SSD, and only about 250 GB are used, so I was thinking 2 or 3 iterations of shrink-ext4 + grow-ZFS + move-VMs should do the trick.

The thing is, I'm not versed in either LVM or ZFS.

If anyone has pointers to the commands needed, that would be very helpful; I can read the manuals, but I'm not sure where to start.

Note that I have a single SSD in each machine, so I'm not looking for redundancy with ZFS, only the snapshot feature for the lukewarm replication.
Each machine runs small loads on an i5-8365U that mostly sits around 20-30% usage, so I believe I have headroom if ZFS needs it.

Thanks in advance for any kind help!
 
I just realized this relies on ZFS snapshots, meaning my existing ext4 installs (3 nodes) either need to be fully reinstalled, or, with help from this thread, I can hopefully shrink LVM and replace it with ZFS bit by bit.
Keep in mind that ZFS, similar to Ceph, wants enterprise SSDs. Probably not as bad as with Ceph, but don't expect it to be as fast as ext4, and the SSDs will probably also wear out several times faster.

Each node has a 500 GB Samsung Evo SATA SSD, and only about 250 GB are used, so I was thinking 2 or 3 iterations of shrink-ext4 + grow-ZFS + move-VMs should do the trick.
In case your virtual disks are on that "local-lvm" LVM-thin pool, keep in mind that you can't shrink it. It has to be destroyed (along with all data on it) and recreated at a smaller size.
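Roughly, on a default install (VG "pve", thin pool "data") that looks like the following; sizes are only examples, and note that actually handing the freed space over to ZFS afterwards still means shrinking the partition/PV or putting the pool on a logical volume, which is its own adventure:

Code:
# move or back up every guest disk that lives on local-lvm first - this destroys them all
lvremove pve/data
lvcreate -L 200G -n data pve
lvconvert --type thin-pool pve/data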
 
Would it not be much simpler to remove one node from the cluster, reinstall Proxmox with ZFS, then restore your VMs/CTs and test? You could then build a new cluster from there by removing nodes from the old cluster one by one, reinstalling them and adding them to the new cluster.
 
Thanks @Dunuin. If local-lvm can't be shrunk to make room for a new ZFS pool, I fear the full reinstall route is the only way to go for me; I don't have extra disks in these machines, only a single SATA port...
I did indeed plan to do it node by node, as @leesteken also mentioned; I don't have any unused node lying around to spin up another cluster.

Given that I'm not very write-intensive and I'm using these SSDs only for improved latency, I'm going to give ZFS a shot. I'm already offloading all I can to NFS storage backed by my Synology over a 1 Gb LAN, but for the Frigate database, or even the frequently accessed recent recordings, I find a local SSD much more responsive than NFS over gigabit, even though the Synology has NVMe SSDs as both read and write cache...
These 500 GB Samsung Evos have turned out incredibly reliable. They aren't all the same age, so should one fail I'd probably take the opportunity to replace it with a bigger drive, which is cheaper nowadays... And I do keep backups.

A bit sad, though, that the local-lvm thin pool can't be shrunk, as it means that going back the other way, should I decide to give up on ZFS, will require yet another reinstall...

HA storage is a pain; I'm only realizing it now because PVE made HA VMs such a breeze.

If someone has hardware recommendations for a PVE cluster with fast Ceph storage within a 60-watt budget I'd be amazed; otherwise I'm going to give the lukewarm replication a shot, and if failover doesn't work nicely, I'll probably give up and go back to 1 node: no Ceph, no HA, but only 15-20 watts. Yes, European electricity prices are that high, and it's my home lab; who else runs HA at home at these energy prices?
 
Given that I'm not very write-intensive and I'm using these SSDs only for improved latency, I'm going to give ZFS a shot.
The problem is the massive overhead of ZFS and the resulting write amplification. Depending on what I do here, I see a write amplification factor of 3 (big async sequential writes) to 60 (small random sync writes), with an average factor of 20. So even if I only write 50 GB per day, that gets amplified to 1 TB per day.
 
Do you maybe have a tool to follow that up? I do have Prometheus and node_exporter, so I could graph the before and after.
Isn't most of this overhead due to actual redundancy? I'm not sure what the ZFS terminology is, but I'm planning on a single disk with no redundancy; if you were doing the equivalent of RAID 5 or 6, ZFS would certainly need to write more, and probably even more on top for metadata and such...
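I suppose one crude way, untested on my side, would be to compare what the pool writes logically against what the SSD itself reports physically; the ratio between the two is the amplification. node_exporter should already expose node_disk_written_bytes_total per device for the physical side. Pool and device names below are just examples:

Code:
# logical side: what the pool is writing, sampled every 60 s
zpool iostat -v rpool 60
# physical side: what the SSD has actually written (attribute 241 on Samsung SATA drives, in 512-byte units)
smartctl -A /dev/sda | grep -i Total_LBAs_Written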
 
The problem is the massive overhead of ZFS and the resulting write amplification. Depending on what I do here, I see a write amplification factor of 3 (big async sequential writes) to 60 (small random sync writes), with an average factor of 20. So even if I only write 50 GB per day, that gets amplified to 1 TB per day.

Does anyone have actual statistics (as in: tested in such-and-such use case, with such-and-such recordsize for such-and-such dataset, resulting in factor X vs. a regular FS) for any of these claims? Depending on what one does, a tweaked setup will behave quite normally (for a copy-on-write filesystem) without excessive write amplification. See e.g.: https://www.usenix.org/system/files/login/articles/login_winter16_09_jude.pdf

I am not saying there is none, but it gets tossed around here like it's a showstopper. Also, even newer gaming SSDs (especially NVMe) are getting to a TBW of 2 PB at 2 TB sizes, and they might be cheaper than DC-grade (even SATA) SSDs.
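Just to illustrate what "tweaked" means here, the usual knobs look like this; the values and dataset names are made-up examples, not recommendations, and whether they help depends entirely on the workload:

Code:
zfs set atime=off rpool/data
zfs set compression=lz4 rpool/data
# match the record size to the application's I/O size, e.g. a small DB
zfs set recordsize=16K rpool/data/frigate-db
# for zvol-backed VM disks in PVE the equivalent knob is the storage's block size
pvesm set local-zfs --blocksize 16k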
 
Keep in mind that ZFS, similar to Ceph, wants enterprise SSDs. Probably not as bad as with Ceph, but don't expect it to be as fast as ext4, and the SSDs will probably also wear out several times faster.
I agree it will not be fast.

In case your virtual disks are on that "local-lvm" LVM-thin pool, keep in mind that you can't shrink it. It has to be destroyed (along with all data on it) and recreated at a smaller size.

@toxic If you feel like experimenting, you may want to have a look at:
https://github.com/jthornber/thin-provisioning-tools/blob/main/src/commands/thin_shrink.rs

EDIT:

Based on:
https://github.com/nkshirsagar/thinpool_shrink

Sample use (not my site):
https://pascalroeleven.nl/2022/08/0...2-thin-pool-manually-if-you-are-adventurious/
 
Wow, thanks @tempacc346235, but this does seem more involved than I envisioned; the node-by-node reinstall is probably more feasible for me in the end. Thanks again though, that is indeed the exact answer to my original question!

A bit frightened by your warnings that ZFS would be slow for me, though... Seeing that my underpowered Synology runs Btrfs, which seems quite equivalent in terms of features, I was expecting ZFS to be trivial in this one-drive-per-pool, no-redundancy configuration... This gives me pause... I'll think it through a bit more before I attempt it...
 
Wow, thanks @tempacc346235, but this does seem more involved than I envisioned; the node-by-node reinstall is probably more feasible for me in the end. Thanks again though, that is indeed the exact answer to my original question!

It can be done. Now, I am not staff, so I can afford to dispense advice as I see fit. I cannot guarantee others' tools, but let's assume you have backups (everyone should) and just do not want to redo the whole node (not everyone has a cluster). It would be reasonable to do it with the tool; the backups are then for when something goes wrong. Of course, if it's more practical to redo the nodes from scratch, that is the "supported" way to do it. I do not think thinpool_shrink is complex, but like everything, it carries a certain risk. It is certainly not part of the LVM suite.

A bit frightened by your warnings that ZFS would be slow for me, though... Seeing that my underpowered Synology runs Btrfs, which seems quite equivalent in terms of features, I was expecting ZFS to be trivial in this one-drive-per-pool, no-redundancy configuration... This gives me pause... I'll think it through a bit more before I attempt it...

Just to be clear, I think Ceph would be slow. The ZFS argument is about write amplification on SSDs with a low TBW rating. Then PVE has this habit of doing lots of writes per day, especially in a cluster. There is an unofficial way to mitigate that, e.g.:
https://github.com/isasmendiagus/pmxcfs-ram

I do like BTRFS, I use it and do not consider it a problem, but again, PVE staff would consider this experimental use. There are certain things to avoid with BTRFS, like RAID levels other than mirrors. Some people complain about quotas there not working entirely as expected; often it's the same people who do not even use them on ZFS (it's not a multi-user PVE install for them, so why bother). Again, for production it is not officially endorsed. Does it work? Yes. Will I use it? Yes. Should you use it? I do not know.

BTRFS has some advantages over ZFS, e.g. snapshots do not need to reference whole datasets. ZFS would be considered more mature (i.e. less risky). I do like a ZFS pool for VMs; I do not like it for the host install, i.e. the default way the ISO installer does it. So if I want to use it with PVE on a single drive, I do a manual Debian install, then set up LVM for the root (so that I can snapshot it for backups, though it's a different sort of snapshot on thick LVM, good only for the duration of the backup process) and a ZFS pool for the VMs.
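For completeness, the VM-pool half of that setup is just something along these lines; the disk path is obviously a placeholder and the storage/pool names are made up:

Code:
# single-disk pool, 4K sectors assumed
zpool create -o ashift=12 vmpool /dev/disk/by-id/ata-Samsung_SSD_860_EVO_500GB_XXXXXXXXXX
# tell PVE to use it for VM and container disks
pvesm add zfspool vmdata --pool vmpool --content images,rootdir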

Any copy-on-write filesystem will have write amplification; I have yet to browse through @Dunuin's elaborate benchmarks. He did them, I did not. I have also yet to figure out which part mostly causes it; it might be PVE's own pmxcfs (see above). This is where BTRFS should (speculation) actually do better. You will find some documentation on how to use or not use ZFS in PVE's docs. I am mostly familiar with the differences from LXD, where I found the storage docs more comprehensive (not all of it applies to PVE, but a filesystem is a filesystem):
https://documentation.ubuntu.com/lxd/en/stable-4.0/storage/#btrfs

EDIT: The reason I like ZFS / BTRFS for VMs is the backup options, i.e. zfs send / receive (typically piped through mbuffer) with snapshots. Sadly I cannot take advantage of it in PVE other than manually; for more details:
https://blog.guillaumematheron.fr/2023/261/taking-advantage-of-zfs-for-smarter-proxmox-backups/
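The manual version is essentially the following; the host and pool names are made up:

Code:
zfs snapshot vmpool/vm-100-disk-0@manual-1
zfs send vmpool/vm-100-disk-0@manual-1 | mbuffer -s 128k -m 1G | ssh backupbox zfs receive -F backuppool/vm-100-disk-0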
 
Thanks a lot, it's going to be a lot of reading (young dad here, spending more time with the baby than the homelab).
I've read other users who are more versed in ZFS than I am, all on commodity hardware like mine, and they all seem to hit speed issues with ZFS, probably due to the write amplification. So Btrfs seems appealing to me; I'm not so afraid of "technology preview" stuff and the like, this is running my home automation, not rocket science, and as you can see I spent 0€ on RAID but quite a lot on backups.

One thing though: I read here https://pve.proxmox.com/wiki/Storage_Replication
that only ZFS allows storage replication. Could it be that the doc isn't up to date? Or is ZFS really my only option if I want to try out the lukewarm storage replication?

Also, nothing against anyone here, but I'm taking advice from everyone; sometimes regular users have experience that better matches my needs, while experts are often thinking of enterprise hardware, whereas the only enterprise-grade thing I own is my 20-year-old switch... So using my head and trying to understand things is a big factor in choosing what's right for me, especially when there are heated debates or conflicting evidence :')

Today (and for a few months now) I've been thinking about trying out the storage replication. My SSDs are already not new, so unless someone tells me it should work with a Btrfs install, I will keep waiting until I can attempt the ZFS route with replacement SSDs, or with another solution on hand that is ready to take the load :)
(There is a 4th node already ordered, an Intel N100 with a cheap Chinese 512 GB SSD, which is actually capable of running my OPNsense + Home Assistant + Frigate while I play around with the cluster.)
 
One thing though: I read here https://pve.proxmox.com/wiki/Storage_Replication
that only ZFS allows storage replication. Could it be that the doc isn't up to date? Or is ZFS really my only option if I want to try out the lukewarm storage replication?

You got it right there. I suggest everyone go and +1 it where it matters, to get these things moving:
https://bugzilla.proxmox.com/show_bug.cgi?id=4193

Of course you could potentially hack it yourself, or give up high availability, or use shared storage just for those guests. It's your call (in a homelab) what matters most.
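To be fair, once you are on ZFS the replication job itself is a one-liner; the guest ID, node name, schedule and rate below are just examples:

Code:
# replicate guest 100 to node2 every 15 minutes, capped at 10 MB/s
pvesr create-local-job 100-0 node2 --schedule '*/15' --rate 10
pvesr status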

I've read other users who are more versed in ZFS than I am, all on commodity hardware like mine, and they all seem to hit speed issues with ZFS, probably due to the write amplification. So Btrfs seems appealing to me

I used ZFS with PVE and HA before and did not have issues with it being slow. It did write a lot, but I did not measure it against a non-ZFS scenario, and it felt "normal".

as you can see I spent 0€ on RAID but quite a lot on backups

Which is the more sensible of the two to do in my opinion too.

(young dad here, spending more time with the baby than the homelab)

Everyone should spend it where it actually matters. ;)

Just to sum it up (for me personally): Ceph for a homelab is overkill; ZFS worked okay, but you may want to check on your SSDs with smartmontools/nvme-cli every now and then and see how they fare. You do not want to discover you have chewed through your TBW in a few months. :) I would use the pmxcfs-ram tool to save a lot of writes in a homelab, and I would tweak ZFS as per the other link above and @Dunuin's observations too.
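The check itself takes seconds; device names are examples:

Code:
smartctl -A /dev/sda        # SATA: look at Wear_Leveling_Count and Total_LBAs_Written
nvme smart-log /dev/nvme0   # NVMe: look at "Percentage Used" and "Data Units Written"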
 
Got my +1 in for Btrfs storage replication.
I'll need to read up on ZFS and understand your pmxcfs thing before I attempt it; let's see which comes first, Btrfs replication or me learning ZFS.
 
Basically, all optimizations that lower writes will decrease data integrity and/or reliability. I would not recommend such unsupported/hacky approaches where you keep important system files in volatile RAM. It's always better to spend some money on proper disks that have the performance and durability for your workload.
 
Basically, all optimizations that lower writes will decrease data integrity and/or reliability. I would not recommend such unsupported/hacky approaches where you keep important system files in volatile RAM. It's always better to spend some money on proper disks that have the performance and durability for your workload.

If you are getting at the RAMdisk tool [1], I just want to point out that:
1) "To ensure data persistence, the service saves RAM data to the disk in the directory /var/lib/pve-cluster-persistent at certain intervals. You can configure or disable this feature as needed."
2) The integrity part of pmxcfs is handled by the whole filesystem being backed by SQLite [2], which is nothing other than a config.db file under /var; that is the very reason the PVE folks chose it (the DB will be left in some consistent state no matter what).

Not to mention that pmxcfs in and of itself is already a RAM-based filesystem.

[1] https://github.com/isasmendiagus/pmxcfs-ram
[2] https://pve.proxmox.com/wiki/Proxmox_Cluster_File_System_(pmxcfs)
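You can see this for yourself; the DB sits at /var/lib/pve-cluster/config.db, and the table name below is from memory, so treat it as illustrative only:

Code:
sqlite3 /var/lib/pve-cluster/config.db .tables
sqlite3 /var/lib/pve-cluster/config.db 'SELECT name FROM tree LIMIT 10;'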
 
Yes, but I've seen threads where people broke their PVE installation after a power outage because they used such scripts to move the PVE configs to RAM.
Even if it stores the config DB on disk from time to time, you will lose more data than without such modifications, where it is only cached in RAM for a few seconds and then written to persistent disk. So you are still sacrificing data for less SSD wear.
 
Yes, but I've seen threads where people broke their PVE installation after a power outage because they used such scripts to move the PVE configs to RAM.
Even if it stores the config DB on disk from time to time, you will lose more data than without such modifications, where it is only cached in RAM for a few seconds and then written to persistent disk. So you are still sacrificing data for less SSD wear.
Well, please be specific. To begin with, if one has a cluster, there are exactly as many copies of /etc/pve as there are nodes, and the most recent one will prevail. Say your node fails due to its power supply and you later replace it and relaunch it: even if it missed flushing the last hour of pmxcfs changes to disk before the cold shutdown, its copy will simply get overwritten with what's in the cluster now (presumably hours later).

And because it's a DB file (this is also true when it's being synced to the hard drive constantly), it will never be left in an unreadable or otherwise inconsistent state.

Let's take the worst case: a cluster, no UPS, no backups, no nothing (how likely is that, though?), a cluster-wide power failure, and say the most recent pmxcfs state on disk is 15 minutes old. So yeah, if you created some VM within the 15 minutes prior to the failure, you might not find its metadata there. I would humbly submit that's the least of your problems at that moment.
 
I can't find that thread, but the answer from staff was not to store this stuff in RAM. They probably know some edge cases where it might be problematic.
 
