ZFS with SSDs: Am I asking for a headache in the near future?

wizpig64
Feb 9, 2016
Hey there.

I've built a new Proxmox 4 machine for the small business I work at. It will mostly run a Postgres database and a few web hosts in LXC, plus some Windows KVM test environments, upgrading us from an old Core 2 Quad system running PVE 3.4.

I'm using a pair of Samsung 850 Evos (as I figured their performance and longevity was "good enough" for what we need, rather than going for the Pro models) mirrored via ZFS through the installer.

(Sidenote: it'd be rad if proxmox automatically created the zfs storage for me, or directed me to do it rather than spewing errors when trying to create vms/cts on the local storage. Either way it's working now.)

Anyways, I've just realized that ZoL doesn't support TRIM. Oops. Technically I could get past this by putting the drives in my FreeNAS box, using a spare drive for the root storage, and connecting via iSCSI, but I don't think that would be an improvement unless I upgrade to 10g lan.

Have I fucked up by using SSDs, consumer-grade ones at that, with ZFS on Linux? Should I expect a headache 3-6 months from now when everything starts crawling? All my googling leads me to believe that TRIM is a "big deal," but I don't know how big when it comes to this. Does anyone have any experience with SSDs in ZFS on Proxmox?

Thanks a lot.

EDIT - Feb 21, 2017

One year after this post was made, I decided to check in with my results for anyone else doing research. It's been almost a year since this system was deployed using a pair of 250GB Samsung 850 Evos (TLC flash) mirrored with ZFS.

[Screenshot: SMART report for the mirrored 850 Evos, including Total_LBAs_Written]



The system hosts a modest postgres database, a small handful of django web servers, and a bunch of random test machines. The database is dumped to /tmp/ hourly and moved to our NAS as a backup, which is probably a major contributor to the writes.

Total_LBAs_Written translates to about 20TB written in the last 12 months, or about 1.66TB/month. Given the workload of a hypervisor and the copy-on-write nature of ZFS, I'd expect this to be high, but I've probably been a little hard on the drives with the backups.

For comparison, I looked at a nearby workstation using a 120GB 840 Evo that was purchased 45 months ago. It has about 10TB written, or 0.22TB/month. All other factors aside, we could then say that my particular PVE ZFS write workload is about 7.5x the workload of a simple windows office workstation.

So yeah, it's a lot. But even so, if we compare our total TB written to Tech Report's SSD endurance test (and assume that 850 Evos behave anything like the 840 Evos they tested), we're still only a tenth of the way to the point where their drives started reallocating sectors, and a hundredth of the way to the drives dying outright.

Hope this helps anyone looking for data.
 
You should avoid using consumer-grade SSDs with ZFS, because the ZFS journal (ZIL) is synchronous, and these Samsung Evo drives are pretty slow at synchronous writes.

@spirit
You are mixing up entirely different things. It's not consumer-grade SSDs that need to be avoided, it's TLC drives, which, due to their more complex write operations, are much slower to write than MLC drives. This has nothing to do with the fact that the ZIL buffers sync writes only.

Still, the EVOs use a portion of their flash as "SLC mode" write cache (1 GB IIRC), so if your workload is not write-heavy, they will perform just fine as ZIL/L2ARC. BTW we are using 850 PROs, which are also consumer grade drives, yet perform admirably.

Anyways, I've just realized that ZoL doesn't support TRIM. Oops. Technically I could get past this by putting the drives in my FreeNAS box, using a spare drive for the root storage, and connecting via iSCSI, but I don't think that would be an improvement unless I upgrade to 10g lan.

Have I fucked up by using SSDs, consumer-grade ones at that, with ZFS on Linux? Should I expect a headache 3-6 months from now when everything starts crawling? All my googling leads me to believe that TRIM is a "big deal," but I don't know how big when it comes to this. Does anyone have any experience with SSDs in ZFS on Proxmox?

I have no idea how the lack of TRIM in ZFS affects cache performance in the long term, but all SSDs have automatic garbage collection / maintenance routines, so I think the most important thing is to set aside a decent OP (over-provisioning, i.e. unpartitioned space) area, like 15-20% of the capacity, and I reckon everything should be fine.

On 250 GB TLC drives, I would leave about 40-50 GB unpartitioned, so the partition table would look like this (a rough command sketch follows the list):
- 10 GB SWAP (mirrored on other drive)
- 10 GB ZIL (mirrored on other drive)
- 180 GB L2ARC (striped with other drive)
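
For reference, here is a minimal sketch of how such a layout could be created with sgdisk and attached to a pool. The device names (/dev/sda, /dev/sdb) and the pool name "tank" are placeholders, not taken from this thread, so adjust them to your system:
Code:
# first SSD: three partitions, the rest stays unpartitioned as OP area
sgdisk -n1:0:+10G  -t1:8200 -c1:swap  /dev/sda
sgdisk -n2:0:+10G  -t2:bf01 -c2:zil   /dev/sda
sgdisk -n3:0:+180G -t3:bf01 -c3:l2arc /dev/sda
# repeat the same layout on /dev/sdb, then attach the partitions to the pool:
zpool add tank log mirror /dev/sda2 /dev/sdb2   # mirrored ZIL (SLOG)
zpool add tank cache /dev/sda3 /dev/sdb3        # cache devices are effectively striped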
 
@spirit
You are mixing up entirely different things. It's not consumer-grade SSDs that need to be avoided, it's TLC drives, which, due to their more complex write operations, are much slower to write than MLC drives. This has nothing to do with the fact that the ZIL buffers sync writes only.

I don't know if it makes sense (I'm not a guru), but you should also consider the "power loss data protection" that most DC drives have and consumer-grade drives don't. With that feature the drive essentially has an embedded BBU and can return a very fast "write done" to the OS. I've seen it with ext4 and barriers on: an Intel DC S3710 is very fast, while a Samsung 840 is incredibly slow (fsync: 215 with barriers vs. 6300 without).
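
If you want to reproduce that kind of comparison on your own drives, a quick synchronous-write test can be run with fio (fio isn't mentioned above; it is just one common way to measure this, and the test file path is a placeholder):
Code:
# 4k writes with an fsync after every write; compare the resulting IOPS between drives
fio --name=syncwrite --filename=/rpool/fio.test --rw=write --bs=4k --size=256M \
    --ioengine=psync --fsync=1 --iodepth=1 --numjobs=1
rm /rpool/fio.test   # remove the test file afterwards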
 
You should avoid using consumer-grade SSDs with ZFS, because the ZFS journal (ZIL) is synchronous, and these Samsung Evo drives are pretty slow at synchronous writes.

I'm happy with the performance I've achieved, especially given the Evo's SLC cache. I'm more worried about performance degradation. Degradation is common in all SSDs, but there are horror stories on the internet about how not running TRIM will kill your drives fast.


@spirit
- 10 GB SWAP (mirrored on other drive)
- 10 GB ZIL (mirrored on other drive)
- 180 GB L2ARC (striped with other drive)

I am actually talking about using 2x 250gb drives as the only pool on the system, no HDDs, so L2ARC doesn't make sense for this setup, and ZIL is already there of course. Actually, maybe a small L2ARC would make sense when striped like you suggest, but for now I'll keep it simple.

That said, thanks for your reassurances; over-provisioning (leaving space unpartitioned) seems the way to go. That and waiting patiently for ZoL to add TRIM.


power loss data protection

Do you think that's necessary if it's behind a UPS? I'm not running a datacenter by any stretch but I try to learn and keep up with best practices when affordable.
 
I am actually talking about using 2x 250gb drives as the only pool on the system, no HDDs, so L2ARC doesn't make sense for this setup, and ZIL is already there of course. Actually, maybe a small L2ARC would make sense when striped like you suggest, but for now I'll keep it simple.

That said, thanks for your reassurances; over-provisioning (leaving space unpartitioned) seems the way to go. That and waiting patiently for ZoL to add TRIM.

Do you think that's necessary if it's behind a UPS? I'm not running a datacenter by any stretch but I try to learn and keep up with best practices when affordable.

You might use a mirrored ZIL for safety, but it will most likely not affect performance at all (or even hinder it a bit), and it takes up unnecessary space. You really only need a separate ZIL device when you have a pool of slow drives; there is no advantage when using SSDs only. Same for L2ARC. Forget both, stripe the drives for performance (and make backups often), or mirror them if you need absolute safety.

Also, there is no need for power loss protection when running on a UPS. It's the same question as with a BBU: do you run mission-critical systems? Do you need subsecond data consistency and integrity for your application? No? Then you will be just fine without it.
 
We are currently using Proxmox 4.1 on two ZFS-mirrored 1TB SSDs (Samsung SSD 850 Pro), because uptime is critical for us.
We run two VMs with MSSQL databases and build servers on them, using the virtio driver and cache=writethrough, and so far there were no problems except the high RAM usage :-(
 
no problems except the high RAM usage :-(

By default Proxmox gives half of the overall system memory to the ARC, but the wiki shows how to limit its size.
Code:
# limit ZFS ARC to 4 GB
echo "options zfs zfs_arc_max=4294967296" > /etc/modprobe.d/zfs.conf
update-initramfs -u
You'll need to reboot in order to get your memory back, and your storage performance will suffer a bit since you're trading ARC for more VM memory, but for my use case it's not going to affect anything.
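
To confirm the new limit took effect after the reboot (assuming ZFS on Linux, which exposes its ARC stats under /proc/spl/kstat/zfs):
Code:
# c_max is the configured ARC ceiling in bytes; it should read 4294967296
grep -w c_max /proc/spl/kstat/zfs/arcstats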
 
I want to chime in here and ask: why are you doing a mirrored ZIL and a striped L2ARC? Is it necessary?

What happens when my non-mirrored ZIL/L2ARC SSD breaks?
 
Why run a write cache when you have SSDs as your main storage in the first place? For that matter, running an L2ARC when you already have SSDs sounds like a waste of time.
 
Well, I think he doesn't have SSDs as his main storage, and neither do I.

Letting an SSD handle the caching speeds things up a lot. You can get 10x the performance out of your box while still having the storage capacity of SATA HDDs.
 
I want to chime in here and ask: why are you doing a mirrored ZIL and a striped L2ARC? Is it necessary?

What happens when my non-mirrored ZIL/L2ARC SSD breaks?

I wouldn't call myself an expert, but this is how I understand it:

The ARC (and its on-disk extension, the L2ARC) is a read cache that wraps around the filesystem but isn't part of the filesystem itself, so an L2ARC device benefits from the performance of striping and doesn't need the safety of mirroring. If the L2ARC fails, the data is still in the pool.

The ZIL is a write cache (for synchronous writes) that is part of the filesystem. You might set up a pool with an SSD ZIL + HDD vdevs, which lets you write at SSD speeds, and then, when the disks are ready, ZFS flushes those writes to the HDDs. You want that write cache to be mirrored for data integrity, just like you'd want your HDDs to be mirrored or to have parity via raidz. The same goes for swap. Also see this FreeBSD forum thread explaining it a little.
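
As an illustration of that kind of layout (the device names and the pool name "tank" are made up for the example, not taken from anyone's setup in this thread):
Code:
# two mirrored HDDs as the data vdev, plus SSD partitions for the log and cache
zpool create tank mirror /dev/sdc /dev/sdd \
    log mirror /dev/sda2 /dev/sdb2 \
    cache /dev/sda3 /dev/sdb3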

For the setup I'm talking about in the OP with 2 mirrored SSDs as the only vdev, a ZIL cache partition won't do anything but waste space, because you'd still want it mirrored for data integrity. A striped L2ARC partition, however, might speed up read performance a little. Personally though, the write performance of mirrored SSDs is good enough for my use case, and I'd rather have the extra space in the main pool.

edit: here's another source: http://www.45drives.com/wiki/index.php/FreeNAS_-_What_is_ZIL_&_L2ARC
 
Thanks for the explanation.
I set up a system with ZFS RAID10 and added ZIL/L2ARC partitions on an SSD. The performance of this system is really great.

But what I am worried about is: what happens when the SSD fails? Will the system boot up again, just without the ZIL/L2ARC?

Kind regards
 
If your SSD fails, you can in the worst case lose up to the last 5 seconds of writes. The file system will not be corrupt; remember, we are dealing with a COW file system. Your system will not reboot automatically, but your SSD will be marked as bad and your pool will work in degraded mode until said SSD is either replaced or removed (zpool remove <pool> <diskid>, then zpool add <pool> (cache|log) <diskid>).
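
As a concrete sketch of that replacement, using placeholder names in the same style as above (check your own zpool status output for the real ones):
Code:
zpool status <pool>                        # identify the failed log/cache device
zpool remove <pool> <failed-log-device>    # detach the dead log partition
zpool remove <pool> <failed-cache-device>  # detach the dead cache partition
# after installing and partitioning the replacement SSD:
zpool add <pool> log <new-log-partition>
zpool add <pool> cache <new-cache-partition>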
 
What happens when the SSD fails? Will the system boot up again, just without the ZIL/L2ARC?
As far as I can tell, especially with modern ZFS versions, yes. Putting the L2ARC on a single point of failure will at worst cause ZFS to read directly from your slower hard drives when that point fails. A non-mirrored ZIL, upon failure, will lose any data stored on the ZIL and not yet on the hard drives, but the pool will continue to function without it.
 
OK, 5 seconds or 10 seconds.
I think it's like writing to an ext3 filesystem with barrier=0.

What I worry about most is that the SSD breaks and the system will not come up again. And this does happen; I just tried it out.

I have another test setup here, with an SSD drive used for log and cache. I pulled it out and restarted the system.

Code:
GRUB: Unknown-Filesystem

I plugged it in again, and the system boots just fine.


Is it normal that ZFS does not mark the log partition as removed when the drive is pulled out?
Code:
  NAME                                  STATE     READ WRITE CKSUM
  rpool                                 ONLINE       0     0     0
    sdb2                                ONLINE       0     0     0
  logs
    sda2                                ONLINE       0     0     0
  cache
    ata-ADATA_SP900_2F1720046125-part1  REMOVED      0     0     0
 
Yeah, but not the log partition; it still shows ONLINE. (sda is the SSD. I don't know why it shows sda2, since I added the log and cache by-id.)

After removing the SSD I tried to remove sda2 from the pool, and then the system stopped accepting any input.
After a reboot the same GRUB error message came up.

How can I remove the log partition without booting into a debug mode when the SSD fails?
I expect I will have to do so when the SSD is failing.


edit:
Did some more testing. After doing some writes on the pool, the log device goes into the UNAVAIL state; in that state I was able to remove it from the pool and reboot the server gracefully.
 
