Write amplification for OS-only drives, ZFS vs Btrfs

bishoptf

New Member
Jun 7, 2025
I am curious about write amplification for OS-only volumes, not the ones where you are storing your containers or VMs. I have a situation where I only have physical space for 4 SSDs and 2 NVMe drives. I am going with 4 enterprise SSDs, and for lack of cheap enterprise NVMe I am going with consumer-grade drives for the other two. Currently I have the OS on an mdraid RAID1 and a ZFS RAID10 pool for VMs and containers. I have used mdraid for years, and while it doesn't have all the bells and whistles it has served me well, at least until the 8.x Proxmox kernel. It appears that at some point Debian picked up what I would consider a nasty mdraid bug where shutdown or reboot triggers a kernel panic. The issue has been corrected in the upstream kernel and I am running the 6.14 kernel, but that fix really should have been backported; alas it hasn't, and I'm pretty sure the official answer is that mdraid is not officially supported.
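In case anyone else hits the same bug, this is roughly what I did to move to the opt-in 6.14 kernel. The package name is what I remember it being, so treat it as an assumption and double-check the Proxmox opt-in kernel announcement; the pinned version string is just an example - use whatever "kernel list" actually shows on your box.

Code:
apt update
apt install proxmox-kernel-6.14            # opt-in kernel metapackage (name assumed, check the announcement)
proxmox-boot-tool kernel list              # confirm the new kernel is installed and registered
proxmox-boot-tool kernel pin 6.14.0-2-pve  # example version string - pin whatever "kernel list" shows
reboot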

This has me thinking that while I can roll this way with a newer kernel, mdraid will never really be tested enough for me to consider it reliable, and I should look into using Btrfs or ZFS for my OS RAID1. I was concerned about write amplification, which is why I am using enterprise SSDs for the VM, container, and local backup storage, but maybe it's not a concern for just OS workloads?

My question: is there any write amplification difference between Btrfs and ZFS, and would either be fine on consumer NVMe when used only for the operating system? Some Google searching suggests that for non-database workloads Btrfs should be fine, but I'm not sure if that is the case for ZFS, so I thought I would post and ask.
 
A lot of the write amplification also comes from the consumer SSDs themselves, so regardless of the filesystem used, you will always have more wear and tear on consumer devices. I never noticed any write amplification on my enterprise SSDs, but I did HEAVILY on consumer SSDs. The TB written on a Samsung EVO SSD in two months on a desktop was more than on my Samsung SM enterprise SSDs in 5 years on a PVE box with a lot of VMs - both with ZFS.

What counts as expensive for you in the context of enterprise NVMe? I would look, e.g., for the cards used in a Dell BOSS; the early ones have 120 and 240 GB, the newer ones only come with 480 GB. All of that is total overkill for PVE.
 
A lot of the write amplification also comes from the consumer SSDs themselves, so regardless of the filesystem used, you will always have more wear and tear on consumer devices. I never noticed any write amplification on my enterprise SSDs, but I did HEAVILY on consumer SSDs. The TB written on a Samsung EVO SSD in two months on a desktop was more than on my Samsung SM enterprise SSDs in 5 years on a PVE box with a lot of VMs - both with ZFS.

What counts as expensive for you in the context of enterprise NVMe? I would look, e.g., for the cards used in a Dell BOSS; the early ones have 120 and 240 GB, the newer ones only come with 480 GB. All of that is total overkill for PVE.
Yeah, I really just didn't know if write amplification for OS drives would be something to be concerned about. I did some searching for Btrfs and nothing really jumped out at me. Regarding the cost issue, I didn't know what to look for in enterprise NVMe drives; most if not all of the listings I came across were for 2.5" SSDs. I have plenty of enterprise SSDs, but in this case I just don't have any additional room for 2.5" drives - I only have space for 2 NVMe, which I would prefer to mirror.

If anyone knows of an enterprise NVMe listing I can go look, but I didn't know if it was really an issue when using them just for the OS.
 
Maybe install Proxmox itself on old/CMR HDD(s). That would remove the write amplification issue completely and still be fast enough.
Again, I only have 2 NVMe slots available for the OS. I could go with just one drive and ext4, but I prefer to have a mirror for the OS. @LnxBil has a good suggestion with the Dell BOSS card; that appears to be a good alternative and I'm looking right now, thanks for the suggestion.
 
Anyone know of an alternative to the Dell BOSS cards - are there any other vendors that make something similar?
 
The 240 GB Kingston costs about €90 per SSD, and the Dell BOSS card does not have any PLP / cache of its own? So I assume you would have to add M.2 NVMe SSDs with PLP there too?

I don't think the suggested Intel or Micron SSDs are cheaper than the Kingston SSDs, and the card itself costs something too.
At least from what I have seen, the BOSS-S1 card is M.2 SATA, not NVMe.

If you decide to buy one, make sure you read the tech specs carefully, otherwise you could buy the wrong SSDs ...
 
The 240 GB Kingston costs about €90 per SSD, and the Dell BOSS card does not have any PLP / cache of its own? So I assume you would have to add M.2 NVMe SSDs with PLP there too?

I don't think the suggested Intel or Micron SSDs are cheaper than the Kingston SSDs, and the card itself costs something too.
At least from what I have seen, the BOSS-S1 card is M.2 SATA, not NVMe.

If you decide to buy one, make sure you read the tech specs carefully, otherwise you could buy the wrong SSDs ...
Eh, that's all good - the M.2 SSDs used on the BOSS card do have PLP; they are enterprise-grade SSDs, just in an M.2 form factor. There are lots of Dell BOSS cards out there and the M.2 SSDs are cheap as chips. I can pick up 2 SSDs plus the PCIe card for around $60, maybe less if I look at other brands of M.2 SSD, but I like Micron.

Thanks for the suggestions though...
 
I think I finally found which drives are officially supported, so now I can look and see how cheaply I can throw something together. For the BOSS-S1 cards, it lists these as compatible drives:

Intel M.2 S4510
Micron M.2 5100
Micron M.2 5300

These should all be good drives. I know the Microns have PLP and I assume the others do as well, but I need to read the specs.
 
I am curious about write amplification for OS-only volumes, not the ones where you are storing your containers or VMs.
Maybe not an obvious question, but why? Your OS needs are pretty meagre, and OS-disk performance will have little (if any) impact on your VMs. The only real consumer of IOPS is the logs, and if you are really concerned about write endurance, either ship the logs to an outside logging facility or log to zram.
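Rough sketch of both options if you go that route (standard journald/rsyslog settings; the remote address below is just a placeholder, and rsyslog may not be installed by default on newer PVE):

Code:
# keep the systemd journal in RAM only (local history is lost on reboot)
# /etc/systemd/journald.conf.d/volatile.conf
[Journal]
Storage=volatile

# and/or forward everything to an external syslog box over TCP
# /etc/rsyslog.d/remote.conf   (192.0.2.10 is a placeholder address)
*.* @@192.0.2.10:514

# apply the changes
systemctl restart systemd-journald
systemctl restart rsyslog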

This has me thinking that while I can roll this way with a newer kernel, mdraid will never really be tested enough for me to consider it reliable, and I should look into using Btrfs or ZFS for my OS RAID1
Just use a ZFS mirror and call it a day :) It's supported by the installer and works just fine. Just be Dr. Strangelove and stop worrying (and love the bomb ;)
 
Are we talking about write amplification or drive wear/endurance? How is the type of drive relevant to the amount of write amplification?

Endurance, sure - enterprise drives will normally last far longer.
 
Are we talking about write amplification or drive wear/endurance? How is the type of drive relevant to the amount of write amplification?

Endurance, sure - enterprise drives will normally last far longer.
Yeah, we are talking about the write amplification of ZFS, or really any CoW filesystem, which reduces drive endurance. Currently I have consumer NVMe drives and the endurance rating is not that great, so I would like to avoid an early demise. mdraid works and works well, but it is not officially supported, and in testing I found a nasty shutdown/reboot bug in at least the Proxmox 6.8 kernel, which made me think I should not go with mdraid for the OS partitions. I think the Dell BOSS card is a good option and I am going down that path; the endurance of those M.2 SATA drives is way better, as expected, so that is the current plan. It looks like I can pick up the card and drives locally really cheap. They are not officially supported in Dell 13G servers, but you can use them and boot from them once you configure the virtual disk offline with the mvcli command, so that is my current plan.
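For anyone else going the offline route on a 13G box, the rough shape of it with mvcli is below. I'm writing the flags from memory, so treat them as assumptions and verify everything against the Dell BOSS-S1 CLI guide before running anything:

Code:
mvcli info -o pd          # list the two M.2 drives the card sees
mvcli info -o vd          # show any existing virtual disk
# create the RAID1 virtual disk across both M.2 slots
# (flag syntax from memory - confirm "-r1" / "-d 0,1" in the BOSS-S1 CLI guide)
mvcli create -o vd -r1 -d 0,1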
 
Maybe not an obvious question, but why? Your OS needs are pretty meagre, and OS-disk performance will have little (if any) impact on your VMs. The only real consumer of IOPS is the logs, and if you are really concerned about write endurance, either ship the logs to an outside logging facility or log to zram.


Just use a ZFS mirror and call it a day :) It's supported by the installer and works just fine. Just be Dr. Strangelove and stop worrying (and love the bomb ;)
Nothing to do with IOPS etc.; it only has to do with the longevity of the drives - they are consumer drives, so the write endurance is not so great and I just want to avoid wearing them out. Again, normally I would just use mdraid, which I have running and have used for years and which works well, but since it's not officially supported it also means it's not tested, and in the current 6.8 Proxmox kernel there is a nasty kernel panic on shutdown/reboot. That's resolved by going with the 6.14 kernel, but it made me rethink how I wanted to do this. Since I wanted a RAID1 OS, that left me with either Btrfs or ZFS, and my only concern was the write endurance of the consumer drives.

Like you said, it may not be an issue for just OS disks; the VMs and other stuff are on a separate RAID10 of enterprise SSDs. I just wanted to avoid the early death of an OS drive, and it looks like the BOSS cards and drives are cheap, so that's the plan now.
 
You guys are IMHO way overthinking it.
Consumer NVMe drives in a ZFS mirror are fine even for boot + 10 VMs, let alone for boot alone.
I would say maybe, depending on the drives you are using... I am only using 256 GB drives and the endurance is only 150 TBW... that may be enough for a good while, maybe not. I dunno, and I really don't want to have to keep checking it. I found an alternative that is cheap and the endurance is a lot better, so I'll just roll with that.
 
...I am only using 256 GB drives and the endurance is only 150 TBW
Only? I don't know what your workload is, but most Proxmox installations don't see anywhere near one full drive write per day. And even if yours did, 150 TBW at 256 GB a day would still last well over a year and a half.
SMART gives you wearout information (and most drives work even after reaching 100% wearout).
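E.g. with smartmontools installed (device names will differ on your box):

Code:
smartctl -a /dev/nvme0    # NVMe: check "Percentage Used" and "Data Units Written"
smartctl -a /dev/sda      # SATA SSD: check the wear-leveling / total-LBAs-written attributes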

For reference: my 990 Pro is at 1% wear and my WD Red SN700 is at 7% after 3 years of OS + VM usage.
That is why I say you guys are overthinking it (for a homelab).

Just get two different brands/controllers, get something with decent performance, avoid QLC, and you are good to go. Otherwise you are probably looking at U.2.
 
but I just wanted to avoid the early death of an OS drive
Just to put things in perspective, I have nodes that have been running on consumer-level OS SSDs for OVER 10 YEARS, and that's without doing anything to keep logs off the local disk. As long as you're not commingling payload and OS, even crappy old drives don't get enough writes for it to matter.

Current drives (even the cheapest consumer ones) should outlast your server by an order of magnitude - you'll end up throwing them away for being old, small, and slow before they fail.
 
Just to put things in perspective, I have nodes that have been running on consumer-level OS SSDs for OVER 10 YEARS, and that's without doing anything to keep logs off the local disk. As long as you're not commingling payload and OS, even crappy old drives don't get enough writes for it to matter.

Current drives (even the cheapest consumer ones) should outlast your server by an order of magnitude - you'll end up throwing them away for being old, small, and slow before they fail.
Yeah, I agree, but I've not run much with ZFS, and you read about all the issues with consumer drives - though again, that's probably with VM/LXC workloads. In my case I just want a pair for the OS and will have separate drives for data. I needed to purchase something anyway, and if I can get enterprise M.2 SSDs instead of consumer for about the same price, then I will just use the enterprise stuff. Some of this is for a production unit, so I'd rather have more breathing room even though I probably don't need it; the PLP etc. is just nice to have.

Eh, and I tend not to throw away SSDs - I have some pretty old Intel 320s, and I think a generation before that, and they still work fine as boot disks when needed for Linux stuff.