Storage recommendations for home workload?

Zeash

Member
Jan 23, 2022
Hi,

I'm planning on building a new Proxmox server for my home environment (hosting a NAS, K8s, websites, Minecraft, DBs, home networking, etc.). Regarding VM/CT/NAS storage, I'm not too sure how I should go about it.

I was thinking either:
  • 2x 1TB NVMe SSDs (ZFS mirror) used as the boot volume and for storing VM/CT boot disks
  • 2x 4TB HDDs (ZFS mirror) used for storing important files, DBs, etc.
Or:
  • Some random SSD as the boot drive (the idea being that availability doesn't matter that much)
  • 2x 4TB HDDs (ZFS mirror) used for storing everything else, with an NVMe SSD for SLOG (the loss of total capacity doesn't really matter)

The idea is to (potentially) save some money, while retaining similar or identical performance. In both cases I'd of course aim for as much RAM as possible for ARC, and I have backups basically figured out. I haven't used a SLOG up until now, so I'm not sure how much it would help slow spinning rust keep up with a dozen VMs/CTs and a NAS workload. I'm open to suggestions for a different kind of setup and would appreciate it if anyone could share their experiences with this kind of thing.
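For reference, this is roughly what either layout would look like on the command line (the /dev/disk/by-id names below are placeholders; Proxmox can also create the pools from the GUI):

    # Mirrored NVMe pool for VM/CT disks (placeholder device names)
    zpool create -o ashift=12 fastpool mirror \
        /dev/disk/by-id/nvme-SSD1 /dev/disk/by-id/nvme-SSD2

    # Mirrored HDD pool for bulk/NAS storage
    zpool create -o ashift=12 tank mirror \
        /dev/disk/by-id/ata-HDD1 /dev/disk/by-id/ata-HDD2

    # Second option: add an SSD as SLOG to the HDD pool (only helps sync writes)
    zpool add tank log /dev/disk/by-id/nvme-SLOG1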
 
2x 4TB HDDs (ZFS mirror) used for storing important files, DBs, etc.
I wouldn't store DBs on HDDs. They are easily overwhelmed by the small random IO that DBs generate, as HDDs are terrible at everything that isn't purely async sequential IO.

The idea is to (potentially) save some money, while retaining similar or identical performance.
With storage you get what you pay for. Cheap out on the disks and they will be slow or fail sooner. You really shouldn't save on storage, just like you wouldn't build your system around the cheapest CPU available.
With ZFS you should at least get some proper enterprise grade SATA SSDs with power-loss protection and use that for everything that is not NAS cold storage. I would use HDDs only for the NAS.
And a SLOG/L2ARC is in most cases not that useful. It won't turn your slow HDDs into fast SSDs; they will still be terribly slow for any sync or random IO.
 
I would mirror your boot drive. I know you said that availability doesn't matter that much, but honestly my Proxmox 8 boot drive easily fits on a 128GB drive. 128GB SSDs are so cheap, why not?

My storage layout/strategy is that I have 2 SATA SSDs in a ZFS mirror for my boot drives and for storing ISOs and container templates. I have two NVMe SSDs mounted in an Asus Hyper M.2 PCIe adapter, also in a ZFS mirror (my motherboard supports PCIe bifurcation, making this possible), and I do all my backups to my Synology NAS, which itself backs up to another Raspberry Pi NAS I built (two copies, no RAID on this) and to Amazon Glacier.

If you separate out your boot drives you can use very small SSDs for that (save money), and if you put your backups on a separate device, you can go with smaller SSDs for your running VM/container storage. I am running about 15 apps in a variety of VMs, Docker images and LXC containers (stuff like WordPress, Nextcloud, PhotoPrism, Grocy, Home Assistant, etc.; no Plex, Emby or other media servers). My current VM and LXC storage is just under 500GB. Not that big if you separate out the data as I have. I do run another NAS on my Proxmox machine, but I have two older spinning disks that I pass through directly to the NAS, for use in my Nextcloud instance via NFS.

If I were you I would go with 2 small SSDs for the boot drive (ZFS mirror), 2 medium-size (1TB-2TB?) SSDs for the VMs (again a ZFS mirror), and whatever spinning disks you can afford for NAS storage. I might not be concerned about mirroring or setting up the NAS in a RAID if you have a decent backup strategy. But that's just me. You can still do snapshots and protect against bit rot with other tools besides ZFS on those spinning drives. That might allow you to run with a bit less system memory. But it all depends on what kind of storage performance you really need. For my purposes, I don't need super fast storage for my data.
 
With ZFS you should at least get some proper enterprise grade SATA SSDs with power-loss protection and use that for everything that is not NAS cold storage. I would use HDDs only for the NAS.

Aside from endurance, I don't see how it's worth it to invest in an enterprise SSD for a very light workload like mine (I'm still learning all these technologies after all). Also, the enterprise SSDs I can buy in my region (D3-S4510, D3-S4520, PM893, Micron 5400 Pro, D3-S4610) perform about the same or worse in random I/O as the SN770 I was planning on buying. I can't find any info about the PM897 though. The PM9A3 is worth it, but I'm not sure how well it would work with some random PCIe card that I'd need just to plug in its M.2 version (since it's 110mm long and there aren't any U.2 ports on a consumer motherboard).

I have a UPS so I'm not that worried about PLP.

The only thing I'm unsure of is write amplification in ZFS. Can a light homelab workflow really burn through hundreds of TB in a short amount of time?
 
Aside from endurance, I don't see how it's worth it to invest in an enterprise SSD for a very light workload like mine (I'm still learning all these technologies after all). Also, the enterprise SSDs I can buy in my region (D3-S4510, D3-S4520, PM893, Micron 5400 Pro, D3-S4610) perform about the same or worse in random I/O as the SN770 I was planning on buying.
DBs do sync writes. Only enterprise SSDs have power-loss protection, so only they can safely cache sync writes in DRAM. Without that, sync write performance will be magnitudes slower. So no, depending on the workload, they don't perform about the same. See for example the official Proxmox ZFS benchmark paper:
[Attached table from the Proxmox ZFS benchmark paper: 4K sync write IOPS of consumer vs. enterprise SSDs]
So 339 IOPS for a Samsung 850 EVO vs. 12,518 IOPS for an Intel S3510 is a really big difference...
And yes, on paper, for async writes, the 850 EVO is advertised with 98,000 IOPS while the S3510 is only advertised for 20,000 IOPS. But the EVO can only sustain that for a few seconds before performance drops heavily once its caches are full, as it is only made for short bursts of writes, while the S3510 is made to deliver usable performance continuously and predictably...
Think of the consumer SSD as a drag race car. Good for a short drag race but really sucks once you try to drive offroad or try to attach a plow for plowing a field with it. The enterprise SSD is more like a reliable Jeep, which you won't win a race with, but at least it can handle all terrains and won't let you down.
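If you want to see how your own drives behave, a fio run along these lines (a sketch, not the exact command from the paper; the test file path is a placeholder) measures the 4K sync write IOPS the table above compares:

    # 4K random sync writes at queue depth 1 - the worst case for consumer SSDs.
    # Let it run for a while so the consumer SSD's cache actually fills up.
    fio --name=sync4k --filename=/tank/fio-testfile --size=4G \
        --rw=randwrite --bs=4k --ioengine=psync --sync=1 --direct=1 \
        --runtime=300 --time_based --numjobs=1 --iodepth=1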

I have a UPS so I'm not that worried about PLP.
But that will only help you with data integrity. It won't help you with performance. An SSD without power-loss protection won't cache sync writes, whether you have a UPS or not, as its firmware can't know about it. You would then need to lie to your SSDs and handle sync writes as async writes via "sync=disabled" or "cache=unsafe", but that would be really bad in case of a CPU/mainboard/PSU failure or a kernel crash, where your UPS won't help.
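For completeness, this is roughly what that "lying" would look like; shown only to illustrate the point, since it trades away data safety (dataset name, VM ID and storage name are placeholders):

    # ZFS: treat all sync writes on this dataset as async (data loss risk on crash)
    zfs set sync=disabled rpool/data

    # Proxmox: the per-VM-disk equivalent, e.g. for VM 100, disk scsi0
    qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=unsafe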
 
I see, that makes sense. I did some deeper research and found conflicting experiences regarding write amplification. Guess it really depends on the particular workload.

I might be able to test my workflow's I/O activity on a server I'm planning on setting up soon. It'll have one MX500 (single-disk zpool) for the host and all VMs/CTs. Can I expect the I/O to be similar to or the same as on the 2-SSD mirror? The alternative would be to install another drive through a USB adapter (since it's a laptop) and try it that way.
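In case it helps, a few commands I'd use to get a rough picture of the write volume during such a test (pool name and device path are placeholders; the SMART attribute names vary by SSD model):

    # Writes hitting the pool, sampled every 10 seconds
    zpool iostat -v rpool 10

    # Accumulated writes per process on the host
    iotop -ao

    # What the SSD itself reports as written / worn
    smartctl -a /dev/sda | grep -iE 'written|wear'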
 
I have read that if you are operating a single node you can safely disable corosync, pve-ha-crm, and pve-ha-lrm. It should reduce the wear on your drives. In my experience, having done this, I have zero wear on my consumer-grade SSDs. Not sure if it was this or just a factor of the workloads I run.
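For anyone wondering, on a standalone node this is roughly what that looks like (don't do this on a node that is part of a cluster):

    # Single, non-clustered node only: stop and disable the cluster/HA services
    systemctl disable --now pve-ha-crm pve-ha-lrm
    systemctl disable --now corosync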
 
The biggest problem is the sync writes, so the DBs. My homelab is running something like 30 VMs/LXCs, and the logs and metrics of those guests are collected and written to DBs. Those logs and metrics alone cause nearly 1TB per day of writes to the SSDs while idling.
 
The biggest problem is the sync writes, so the DBs. My homelab is running something like 30 VMs/LXCs, and the logs and metrics of those guests are collected and written to DBs. Those logs and metrics alone cause nearly 1TB per day of writes to the SSDs while idling.
What kinds of logs are you writing to DBs? 1TB of logs per day is a lot.

I am deciding between enterprise SSDs and consumer SSDs (such as Samsung/WD9) for my VMs and LXCs. I am not sure if the cost of the enterprise SSDs will be worth it for my use case. I plan to use Home Assistant, Scrypted/Shinobi, and perhaps InfluxDB/Prometheus with Grafana.

I plan to buy an MS-01 for PVE with the following specs and setup:
  • 96GB RAM with an Intel 12900H CPU
  • 256 GB SATA SSD for the OS via the U.2 slot on the mobo
  • 2x 1TB or 2x 2TB NVMe SSDs (not sure if I should go with 1 TB or 2 TB, I think 2 TB might be enough to run all my VMs and LXCs) in a ZFS pool for VMs and LXCs
What do you think?

Also, do you know if you can create a pool with an NVMe SSD (I can use the 3rd NVMe slot on the MS-01) and a SATA SSD?
 
What kinds of logs are you writing to DBs? 1TB of logs per day is a lot.
The problem isn't the amount of logs but the write amplification. So 50GB of writes (mostly logs and metrics) inside the VMs gets amplified by a factor of 20 to 1TB of writes to the physical NAND of the SSDs. It also depends on how you store your logs. All my metrics and logs get written to DBs (Zabbix with MySQL for metrics, Graylog with MongoDB/OpenSearch for logs). You might also want to set up journald to log only to volatile RAM in case you don't care about losing logs on a crash/power outage.
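The journald part is roughly this (a sketch; see man journald.conf for the details):

    # /etc/systemd/journald.conf - keep the journal in RAM only
    [Journal]
    Storage=volatile
    RuntimeMaxUse=64M

    # apply it afterwards:
    # systemctl restart systemd-journald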

2x 1TB or 2x 2TB NVMe SSDs (not sure if I should go with 1 TB or 2 TB, I think 2 TB might be enough to run all my VMs and LXCs) in a ZFS pool for VMs and LXCs
Keep in mind that you usually don't want to use more than 80% of that raw capacity because of copy-on-write. And in case you want to make use of snapshots, you might also want to set aside some space for those.
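An easy way to keep an eye on that (pool and dataset names are placeholders) is to watch the capacity column, and optionally set a quota a bit below the usable size, e.g. ~1.6T on a 2TB mirror:

    # Current usage per pool; keep CAP below ~80%
    zpool list -o name,size,alloc,free,capacity

    # Optional hard limit on the VM dataset (~80% of a 2TB mirror)
    zfs set quota=1.6T rpool/data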

Also, do you know if you can create a pool with an NVMe SSD (I can use the 3rd NVMe slot on the MS-01) and a SATA SSD?
You could mix SATA and NVMe. But usually this would slow those NVMe SSDs down to SATA performance in case you are thinking about something like a raidz1 aka raid5.
 
Thank you for all the info and advice.

Keep in mind that you usually don't want to use more than 80% of that raw capacity because of copy-on-write. And in case you want to make use of snapshots, you might also want to set aside some space for those.

I might as well get 2x 2TB NVMe SSDs and set up a mirror. So it seems like I will just need to keep track of my node's disk usage and make sure it does not go over 80%. What happens if it goes over 80%?

You could mix SATA and NVMe. But usually this would slow those NVMe SSDs down to SATA performance in case you are thinking about something like a raidz1 aka raid5.
I will keep this in mind.
 
I am deciding between enterprise SSDs and consumer SSDs (such as Samsung/WD9) for my VMs and LXCs.
Right, I forgot to post an update on this thread. Before deciding on which SSDs I would buy, I first spent a short while with ZFS on a single brand new MX500. Within that period it got to about 15-20% (IIRC) wearout, an absolutely insane amount, increasing by about 1% every week or two. The workload back then consisted of about 5-6 VMs (2 Docker VMs, one virtualized NAS (nothing fancy, just SMB and NFS), K3s on 2 VMs (one etcd node, one worker), and whatever other testing I was doing at the time). I really don't think the load was at all comparable to 1TB/day, more like an average of 50KB/s according to htop (which I assume is before write amplification), or about 4GB/day, so bump that up to roughly 80GB/day with an assumed write amplification factor of 20.
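For what it's worth, the back-of-the-envelope math behind those figures (the 20x factor is only an assumption):

    # ~50 KB/s of guest writes, assumed 20x write amplification
    echo 'scale=1; 50*86400/10^6' | bc      # ~4.3 GB/day written inside the guests
    echo 'scale=1; 50*86400*20/10^6' | bc   # ~86 GB/day at the NAND (the ~80GB/day above, give or take rounding)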

Considering I now run a larger workload, which could grow even more with time, and since the enterprise SSDs I went with (Intel SSDSC2KB96 and Samsung MZ7LH960 - different models because the Intel one rose in price quite a bit after I bought the first one) will probably last this server's lifespan and the next, I think it was a good buying decision. Current wearout is 3% on the Samsung SSD and still 0% on the Intel one, after roughly 9 months of constant uptime.
 
This might sound stupid, but why would anyone get 20x write amplification in a mirror setup?

As a reference, with my normal VM workload, my mirrors of 1200TBW SSDs (Kingston 1000 and WD Red) are at 2% and 5% wearout after over a year.
 
What happens if it goes over 80%?
It will become slower. Both ZFS and SSDs in general need free space for full performance.

This might sound stupid, but why would anyone get 20x write amplification in a mirror setup?
Mixed block sizes (like ext4 in a VM writing 4K blocks to a 512B virtio disk backed by a 16K zvol), nested filesystems, sync writes that can't be cached without PLP so consumer SSDs can't optimize writes for less wear, RAID adding parity data or multiple copies of each block, encryption (don't know why, but it doubled write amplification here), fragmented storage, cheap NAND, workloads with lots of small random writes instead of big sequential writes, too big a volblocksize, ...
 
That seems a little bit excessive to me.
Mirror, 16K volblocksize, ext4 VM with a RAW 512B disk would be 8x?
16x with CoW?
And that is only for 512B sync writes, which would be the absolute worst-case scenario in a "sane" setup?
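Spelled out, that estimate is roughly this (treating every write as worst-case, with no aggregation in the ZFS transaction groups, which in practice reduces it a lot):

    # ext4 in the guest writes 4K blocks; one 4K write into a 16K zvol record:
    echo '16/4' | bc        # 4x  read-modify-write of the whole 16K record
    echo '16/4*2' | bc      # 8x  with the 2-way mirror
    echo '16/4*2*2' | bc    # 16x if CoW/metadata roughly doubles it again
    # a true 512B sync write would be 16K/512B = 32x before mirroring - the worst case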