Question about file system

dbspl123

New Member
Nov 8, 2024
Hi All,

Relatively new Proxmox user here, so I apologize if this comes off as an uninformed question.

Back when I spun up Proxmox (single NVMe SSD, micro PC), I selected ZFS (RAID 0, as I only have one drive) for the file system. At the time, I thought that was a good idea. After doing some more research over time, I'm reading a lot about the wear and tear on consumer-grade SSDs, which is exactly what I have.

I am concerned about the risk of failure of my current SSD and am wondering if I should consider moving back to ext4 (or something else), if that's even possible or worth doing. I do have another micro PC running PBS, FWIW.

Appreciate any thoughts and apologies again for the potentially silly question.
 
That’s my thought too. Any idea on the best way to get from ZFS to ext4? Am I wiping my current PVE instance and starting from scratch with a restore?
 
if I should consider moving back to ext4
You will lose a lot of functionality, some of which may be relevant for you:

Integrity

ZFS assures **integrity**. It will deliver the *same* data when you read it as was written at some point in the past. To assure this, a checksum is calculated (and written to disk) when the actual data is written. When you _read_ the data, the same checksum is re-calculated, and only if both match is the data handed to the reading process.

Most other “classic” filesystems do not do this kind of check and simply deliver whatever data comes from the disk.

For most on-disk problems a “read error” will occur, avoiding handing over damaged data. These days this happens roughly once per 10^15 bits read - it is called a “URE” (“Unrecoverable Read Error”). A _much higher_ amount of data needs to be read before actually different/wrong/damaged data is delivered *without an error message*. On the other hand, this is not the whole story: errors may be introduced not only on the platters or in an SSD cell but also on the physical wire, the physical connectors, the motherboard’s data bus, in RAM, or inside the CPU. So yes, receiving damaged data in your application is _not_ impossible! ZFS will detect this with high probability.
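
A quick way to exercise this in practice is a scrub, which re-reads every block in the pool and verifies all checksums (a hedged example; “rpool” is the default pool name on a PVE ZFS install, adjust to yours):

```
# Re-read all data in the pool and verify every checksum
zpool scrub rpool

# Check progress and any read/write/checksum errors found
zpool status -v rpool
```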

Snapshots

The concept of “copy-on-write” (“CoW”) makes technically **cheap snapshots** possible, while most other filesystems lack this capability. “LVM-thick” does not offer snapshots at all, and LVM-thin has other drawbacks. Directory storages allow for .qcow2 files, but they introduce a whole new layer of complexity compared to the raw block devices (“ZVOLs”) used for virtual disks.
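
To give a feel for how cheap this is at the ZFS level (a sketch; the dataset name is just an example - on PVE you would normally take snapshots via the GUI or “qm snapshot” instead):

```
# Snapshot a VM disk ZVOL before a risky change (names are examples)
zfs snapshot rpool/data/vm-100-disk-0@before-upgrade

# List the snapshots and roll back if needed
zfs list -t snapshot rpool/data/vm-100-disk-0
zfs rollback rpool/data/vm-100-disk-0@before-upgrade
```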

Compression

ZFS allows for transparent and cheap **compression** - you can simply store more data on the same disk. A long time ago compression cost noticeable CPU time; on modern CPUs the default algorithms (lz4, and optionally zstd) are usually so fast that you won’t notice any delay.
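
Checking and enabling it is a one-liner each (again assuming the default pool name “rpool”):

```
# See which algorithm is active and how much space it actually saves
zfs get compression,compressratio rpool

# lz4 is the usual low-overhead choice; zstd trades a bit of CPU for a better ratio
zfs set compression=lz4 rpool
```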

ZFS **scales**! You can add an additional vdev at any time to grow capacity. Some people have been missing in-vdev “raidz expansion”, which is (probably) coming to PVE in 2025.
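
Growing a pool that way is a single command (the device names below are placeholders; keep in mind a data vdev cannot simply be removed again, so double-check before running it):

```
# Add another mirrored vdev to an existing pool to grow its capacity
zpool add rpool mirror /dev/disk/by-id/nvme-NEW-DISK-1 /dev/disk/by-id/nvme-NEW-DISK-2
```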

Using ZFS you can combine old spinning rust and speedy SSD/NVMe by utilizing a “**Special Device**”. This makes ZFS viable in some(!) use cases that would otherwise be impractical. That special device stores metadata (and _possibly_ some more “small blocks”). This speeds up some operations, depending on the application. Usually the resulting pool can be twice as fast because we have a higher number of vdevs now --> the need for physical head movements is _drastically_ reduced. And because the special device may be really small (below 1% of raw capacity), this is a cheap and recommended optimization option. Use fast devices in a mirror for this - if it dies, the pool is completely gone. (If your data is RaidZ2, use a _triple_ mirror!)
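
As a hedged sketch (the pool/dataset names “tank”/“tank/data” and the device paths are placeholders):

```
# Add a mirrored special vdev to hold the pool's metadata
zpool add tank special mirror /dev/disk/by-id/nvme-SSD-1 /dev/disk/by-id/nvme-SSD-2

# Optionally also send small blocks (here: up to 16K) to the special vdev, per dataset
zfs set special_small_blocks=16K tank/data
```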


All of the above comes with a price tag, leading to some counter arguments:

For every single write command, additional metadata and ZIL operations are required. ZFS writes more data, more frequently, than other filesystems. More writes means “slower” = not positively received by users. This slows down “sync writes” in particular. _Usual_ “async” data is quickly buffered in RAM for up to 5 seconds before it is written to disk.

To compensate for that slowdown it is _highly_ recommended to use “Enterprise”-class devices with “Power-Loss Protection” (“PLP”) instead of cheap “Consumer”-class ones. Unfortunately these devices are much more expensive...
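
If you want to see what your own pool actually writes, “zpool iostat” gives a quick live picture (pool name is the PVE default; adjust as needed):

```
# Show per-vdev operations and bandwidth, refreshed every 5 seconds
zpool iostat -v rpool 5
```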
 
This is a great point and snapshots/integrity are two solid reasons to stick with ZFS. My instance is working perfectly and I’ve spent a lot of time configuring PVE to get it to its current state.
I guess my biggest concern is that I spun it up on a mediocre-at-best nvme drive and I’m worried that ZFS is going to destroy my drive and I’ll end up in a bad place in the future. Maybe that concern is unfounded.
 
Maybe that concern is unfounded.
It is not unfounded. But assuming you make regular, full, restorable (tested) backups of all your VMs & LXCs to an external device (that PBS?), and you also make as few changes as possible to your PVE host OS (& clearly document all those you do make), then even if disaster strikes and your OS disk dies, you can probably rebuild your environment on a new disk quickly.
 
We have a saying in IT: "two is one and one is none", or simply "one is none". You're running your OS from a single disk, so you should expect a failure. ZFS is only going to help detect and heal bit rot or corruption. While this is a very nice feature, the chances of it happening and also impacting something irreplaceable or critical are... not high. I run ~10 FreeNAS/TrueNAS servers and I see a checksum error on an OS disk maybe once a year at most, and it usually indicates hardware failure. Not long after you see one, you may see a spike in checksum errors or SMART errors, and then it's simply time to replace the disk. ZFS RAID 0 doesn't help with replacement; we just rebuild and import the configuration.

PVE might not be as easy to restore as TrueNAS, but PVE is easy enough to stand up. Either back up the host OS using PBS or borg (and target the most important files), or document your changes from the defaults and the reason for each change; that way, even with no backups, you can just rebuild from scratch.
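
For the PBS route, a hedged sketch of backing up the host's configuration with the standalone client (the repository address and datastore name are placeholders):

```
# Archive /etc from the PVE host to a PBS datastore
# Repository format is user@realm@host:datastore - the values here are examples
proxmox-backup-client backup etc.pxar:/etc \
    --repository root@pam@192.168.1.50:backups
```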

So, while ZFS is nice for its features, and while those features come at a reasonably low cost (in performance, resource utilization, and additional wear), the real-world payoff is even lower, and they can lull you into feeling safe about a situation you shouldn't feel safe about.
 
My instance is working perfectly and I’ve spent a lot of time configuring PVE to get it to its current state.
I guess my biggest concern is that I spun it up on a mediocre-at-best nvme drive and I’m worried that ZFS is going to destroy my drive and I’ll end up in a bad place in the future. Maybe that concern is unfounded.
https://github.com/kneutron/ansitest/tree/master/proxmox

You need to be backing up your critical files on a daily basis, to separate media/NAS. ("NAS" can be as simple as a Samba share on your Win10 PC, as long as it's on during the backup window.) Look into the bkpcrit script, point the destination somewhere other than your boot/OS disk, and run it nightly from cron.
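
The cron part could look something like this (the script path, arguments and destination are placeholders - check the repo above for how the script is actually invoked):

```
# /etc/cron.d/bkpcrit - run the critical-files backup every night at 02:30
30 2 * * * root /root/scripts/bkpcrit.sh /mnt/pve/nas-backup >> /var/log/bkpcrit.log 2>&1
```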

And proactively replace your NVMe with something that has a high TBW rating. The Lexar NM790 has given me good results, especially with cluster services turned off and log2ram enabled. My main Proxmox server has been running mostly 24/7 since Feb and it shows ~1% wearout.
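
You can watch the wear yourself with smartctl (assuming the drive is /dev/nvme0; attribute names vary a little by vendor):

```
# "Percentage Used" is the drive's own wear estimate,
# "Data Units Written" lets you compare against the rated TBW
smartctl -a /dev/nvme0
```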

https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_zfs

Search page for " Changing a failed device "

https://github.com/kneutron/ansites...-replace-zfs-mirror-boot-disks-with-bigger.sh

Highly recommended to install Proxmox in a VM and try out this procedure in-VM first if you are not familiar with it. You can always snapshot and go back.

I just did this yesterday, replacing a 64GB zfs boot/root thumbdrive with a 1TB SK Hynix Beetle. Worked perfectly.

The thumbdrive is actually still bootable/usable (I did a detach instead of replace), it's just a portable PVE with LVM GUI software installed for utility / troubleshooting.
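
For reference, the admin-guide procedure linked above boils down to something like this (a sketch only; the device names and the UEFI/partition layout are assumptions, so follow the guide for your exact setup):

```
# Copy the partition table from the healthy disk to the new one, then randomize its GUIDs
sgdisk /dev/disk/by-id/OLD-DISK -R /dev/disk/by-id/NEW-DISK
sgdisk -G /dev/disk/by-id/NEW-DISK

# Replace the old ZFS partition and let the pool resilver
zpool replace -f rpool OLD-DISK-part3 /dev/disk/by-id/NEW-DISK-part3

# Make the new disk bootable (part2 is the ESP on a default PVE ZFS install)
proxmox-boot-tool format /dev/disk/by-id/NEW-DISK-part2
proxmox-boot-tool init /dev/disk/by-id/NEW-DISK-part2
```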
 
This is helpful info, thanks!

I’m totally aware that running on a single disk is risky. I have a separate PC running PBS and am doing twice-daily backups to a NAS drive. Also doing backups locally on PVE too.

I think you’re right about replacing the drive proactively with something enterprise grade.
 
As has been mentioned in previous replies by others, you lose some functionality by going from ZFS to ext4. Other than that, ZFS is not going to cause any more wear and tear than ext4. The faster wear of consumer-grade SSDs is certainly a valid concern, but it can be easily mitigated.
1. It is writes that cause the most wear. Proxmox, or the OS in general, writes logs on a regular basis. By using zram to create a device in RAM, you can move those writes off the SSD. The only caveat is that you will lose the logs when you reboot the node, since zram is volatile storage. Both swap and the log location can be configured to use zram, thus eliminating 90% of all writes (see the sketch after this list).

2. Given how cheap consumer SSDs are, just use two drives in a ZFS mirror. When a drive dies, replace it and the mirror will re-duplicate the data.

Combining a mirror with zram to offload writes makes consumer SSDs a real possibility. We have numerous production deployments with such a configuration, using consumer SSDs for the OS.
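
A minimal sketch of the swap-on-zram part, using the kernel's own sysfs interface (size and compression algorithm are just example values; for a persistent setup you would normally use a package such as zram-tools or a systemd unit, and moving /var/log into RAM is what tools like log2ram automate):

```
# Create a 2 GiB compressed swap device in RAM (example values)
modprobe zram num_devices=1
echo zstd > /sys/block/zram0/comp_algorithm   # must be set before disksize
echo 2G   > /sys/block/zram0/disksize
mkswap /dev/zram0
swapon -p 100 /dev/zram0                      # prefer it over any disk-backed swap
```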
 
I have a separate PC running PBS and am doing twice-daily backups to a NAS drive.
This is excellent practice.

Also doing backups locally on PVE too.
So that is to the same SSD as the OS & the VMs? I don't think I would do that. It's just extra wear on that drive, and in the event of a drive failure you will lose those backups anyway. Stick to backups on separate drives.
 
Thanks - I’ll look into Zram.

I actually have two drives in the PC. I wanted to set it up as ZFS RAID1, but since they're different sizes, PVE wouldn't let me set it up as RAID1. I know other systems (NAS appliances, for example) will mirror but limit the usable size to the smaller of the two drives. Unfortunately PVE didn't let me do this at setup.
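
For what it's worth, a hedged sketch of adding that second disk as a mirror after installation (device names are placeholders; the second disk's ZFS partition has to be at least as large as the existing one, and making it bootable needs the proxmox-boot-tool steps from the admin guide):

```
# Attach a second device to the existing single-disk vdev, turning it into a mirror
zpool attach rpool /dev/disk/by-id/EXISTING-DISK-part3 /dev/disk/by-id/SECOND-DISK-part3

# Watch the resilver complete
zpool status rpool
```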
 
Didn’t consider the local backup write wear. Will cancel the job!
 
