Storage planning for new installation

shkval

Apr 18, 2021
Newbie here, about to install Proxmox soon, with the idea of integrating a NAS into it (virtualized FreeNAS or OMV).

My question is about planning the disks. The idea I have is:
  • Proxmox installed on a consumer-grade SSD
  • A spare 1 TB drive for ISOs, containers, VMs...
    • In the future I'm thinking of an NVMe SSD, but I've heard they can wear out quite fast
    • I'm not worried about HA or fault tolerance so far; I can make backups for the time being, until I add a new drive and put RAID in place
  • For NAS pass-through, I was thinking of using the SAS controller or the spare SATA connector to pass through 2 x 12 TB SATA drives (Seagate IronWolf) as a ZFS storage pool, probably expanding with 2 x additional 12 TB drives in the future, in RAID10
 
  • Proxmox installed on a consumer-grade SSD
That's fine if you just want to run Proxmox from it. But do yourself a favor and buy a second one so you can mirror them. If Proxmox can't boot, you can't access any of your files. A 120 GB SSD costs only about 20 bucks, so that should be well worth it.
  • A spare 1 TB drive for ISOs, containers, VMs...
    • In the future I'm thinking of an NVMe SSD, but I've heard they can wear out quite fast
    • I'm not worried about HA or fault tolerance so far; I can make backups for the time being, until I add a new drive and put RAID in place
Is your "spare 1TB" an HDD or an SSD? HDDs aren't that great for VM storage because they can't handle all the IOPS you get when running several guests in parallel. And yes, depending on your workload a consumer SSD might die within months. You really should spend a little bit more and get an enterprise-grade SSD for that. You can get useful enterprise SSDs for a little over 200€ per 1 TB (Intel S4610, for example).
  • For NAS pass-through, I was thinking of using the SAS controller or the spare SATA connector to pass through 2 x 12 TB SATA drives (Seagate IronWolf) as a ZFS storage pool, probably expanding with 2 x additional 12 TB drives in the future, in RAID10
You can't pass through single ports. If you want your NAS VM to access the drives directly, without an additional virtualization layer and all its problems and overhead, you need to pass through a complete HBA (NOT a RAID controller if you want to use ZFS/TrueNAS). So the best option is to buy a PCIe HBA card and use PCI passthrough so your NAS VM can directly access and manage the drives. Maybe you can pass through your onboard SATA controller too, but then none of that controller's ports can be used outside of that VM anymore. So most of the time that's not really an option, because you need some ports that aren't passed through for Proxmox itself.
 
Is your "spare 1TB" an HDD or an SSD? HDDs aren't that great for VM storage because they can't handle all the IOPS you get when running several guests in parallel. And yes, depending on your workload a consumer SSD might die within months. You really should spend a little bit more and get an enterprise-grade SSD for that. You can get useful enterprise SSDs for a little over 200€ per 1 TB (Intel S4610, for example).
Is there really much difference in lifespan between a consumer-grade SSD and an enterprise SSD like the Intel you mention? Roughly, what's the difference in lifespan or wear-out (just out of curiosity)? While reading about other users' storage designs, I saw that the main root cause of wear-out mostly seems to be using ZFS RAID, for example.

You can't pass through single ports. If you want your NAS VM to access the drives directly, without an additional virtualization layer and all its problems and overhead, you need to pass through a complete HBA (NOT a RAID controller if you want to use ZFS/TrueNAS). So the best option is to buy a PCIe HBA card and use PCI passthrough so your NAS VM can directly access and manage the drives. Maybe you can pass through your onboard SATA controller too, but then none of that controller's ports can be used outside of that VM anymore. So most of the time that's not really an option, because you need some ports that aren't passed through for Proxmox itself.
Yup, that I know; I described it badly. Basically I will use a Supermicro motherboard, and I was planning to use the SAS controller for the pass-through while using the SATA controller for Proxmox and the VM operations themselves. Is that scenario possible? Otherwise, yes, I will go for a PCIe HBA card.


many thanks for your feedback!
 
Is there really much difference in lifespan between a consumer-grade SSD and an enterprise SSD like the Intel you mention? Roughly, what's the difference in lifespan or wear-out (just out of curiosity)? While reading about other users' storage designs, I saw that the main root cause of wear-out mostly seems to be using ZFS RAID, for example.
Look at the TBW (terabytes written) value of the SSDs. All manufacturers give you this. As soon as you have written more terabytes to the SSD than its TBW rating, you lose your warranty and your SSD may fail (it might survive longer, but it isn't built for that, and without warranty you won't get a replacement).
Some examples:
Consumer QLC SSD 1TB (Samsung 870 QVO; 87€): 360 TBW
Consumer TLC SSD 1TB (Samsung 860 EVO; 105€): 600 TBW
Prosumer TLC SSD 1TB (Samsung 860 PRO; 193€): 1200 TBW
Enterprise TLC SSD 1TB (Intel S4610; 210€): 6000 TBW
Enterprise MLC SSD 1.2TB (Intel S3710; 1004€): 24300 TBW (or 20250 TBW per TB if you want to compare it with 1TB drives)

So, if you compare a consumer-grade Samsung 860 EVO with an enterprise-grade Intel S4610, you pay double the price but get 10 times the write endurance. You also get power-loss protection, which means better data integrity and lower write amplification on sync writes, and you get better performance, because an enterprise-grade SSD's performance won't drop as hard when you do continuous writes rather than only short bursts of writes.
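To make the price comparison concrete, here is a small sketch that computes rated endurance per euro from the prices and TBW values quoted above. The function name `tbw_per_euro` is only illustrative, and the 1.2 TB S3710 is left out because it is not a 1 TB drive.

```python
# Prices (EUR) and TBW ratings as quoted in the list above (1 TB class drives).
drives = {
    "Samsung 870 QVO": (87, 360),
    "Samsung 860 EVO": (105, 600),
    "Samsung 860 PRO": (193, 1200),
    "Intel S4610": (210, 6000),
}


def tbw_per_euro(price_eur, tbw):
    """Rated terabytes written per euro spent."""
    return tbw / price_eur


for name, (price, tbw) in drives.items():
    print(f"{name}: {tbw_per_euro(price, tbw):.1f} TBW per euro")
```

With these numbers the S4610 comes out at roughly five times the rated endurance per euro of the 860 EVO.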

And yes, ZFS is a big factor, like any other copy-on-write filesystem. But server workloads in general are bad for consumer SSDs. You get a lot of parallel small random writes, you get small sync writes from DBs, and virtualization causes a lot of write amplification.
I, for example, got a write amplification of around factor 30 on my home server. So for every 1 TB of data a guest writes, 30 TB are written to the SSD. If you take the write amplification into account, the TBW of the drives isn't that much anymore. If a guest writes 20 TB, that's enough to exceed the 600 TBW of a Samsung 860 EVO. The warranty of the 860 EVO is 5 years, so if I don't want to lose the warranty before the 5 years are over, I can only write 3.8 MB/s continuously to the SSD. And because of my write amplification of 30, that is lowered even further, to 126 kB/s. That's really not much, and you can easily write 126 kB/s just for the logs and metrics.
Right now my home server is writing around 24 MB/s to the NAND cells of the SSDs while idling. So after 289 days I would have exceeded the TBW, lost the warranty, and the drive might fail. I don't want to replace my drives every 289 days, so I replaced all my consumer SSDs with enterprise SSDs so they will last for years.
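The arithmetic behind those figures can be reproduced with a short sketch. It assumes decimal units (1 TB = 10^12 bytes) and 365-day years; the function names are made up for this example.

```python
SECONDS_PER_DAY = 86_400


def sustained_write_budget(tbw_tb, warranty_years, write_amplification=1.0):
    """Bytes per second a guest may write continuously so that the rated TBW
    is not exceeded before the warranty period ends."""
    seconds = warranty_years * 365 * SECONDS_PER_DAY
    return tbw_tb * 1e12 / seconds / write_amplification


def days_until_tbw(tbw_tb, nand_write_rate_bytes_per_s):
    """Days until the rated TBW is reached at a constant NAND write rate."""
    return tbw_tb * 1e12 / nand_write_rate_bytes_per_s / SECONDS_PER_DAY


# Samsung 860 EVO 1 TB: 600 TBW, 5 years of warranty
print(sustained_write_budget(600, 5) / 1e6, "MB/s without amplification")
print(sustained_write_budget(600, 5, 30) / 1e3, "kB/s with 30x amplification")
print(days_until_tbw(600, 24e6), "days at 24 MB/s to the NAND")
```

This gives about 3.8 MB/s, about 127 kB/s (quoted above as 126 kB/s) and about 289 days.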

Yup, that I know; I described it badly. Basically I will use a Supermicro motherboard, and I was planning to use the SAS controller for the pass-through while using the SATA controller for Proxmox and the VM operations themselves. Is that scenario possible? Otherwise, yes, I will go for a PCIe HBA card.
You need to try that. It really depends on your mainboard and the controller used. Like I said, you can't use RAID controllers if you want to use ZFS/TrueNAS. Often there is no firmware available to flash the controller into IT mode so that it acts as a normal dumb HBA.
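One thing that can be checked before trying is how the host groups its PCI devices, since PCI passthrough generally needs the controller to sit in an IOMMU group that can be handed to the VM as a whole. The sketch below is not from this thread; it assumes a Linux host with the IOMMU enabled, where the kernel exposes the groups under /sys/kernel/iommu_groups. The function name is hypothetical.

```python
from pathlib import Path


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group_number: [pci_address, ...]} read from sysfs.

    Returns an empty dict if the directory does not exist, which usually
    means the IOMMU is disabled in the firmware or on the kernel command line.
    """
    groups = {}
    base = Path(root)
    if not base.is_dir():
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(p.name for p in devices_dir.iterdir())
    return groups


for number, devices in sorted(list_iommu_groups().items()):
    print(f"group {number}: {', '.join(devices)}")
```

If the SAS controller's PCI address shares a group with devices the host still needs, passing it through on its own is unlikely to work without further measures.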
 
Many thanks for your detailed feedback about the SSDs. Explained like that, the why and the pros are crystal clear.

You need to try that. It really depends on your mainboard and the controller used. Like I said, you can't use RAID controllers if you want to use ZFS/TrueNAS. Often there is no firmware available to flash the controller into IT mode so that it acts as a normal dumb HBA.

I will try it out. Checking the Supermicro model, it seems the onboard SAS controller is an LSI 3108, but as you said, who knows if it can be managed... Do you have a recommendation for a compatible LSI SAS controller I could use?
I was trying to avoid that since the motherboard only has 3 x PCI, and for the future I was thinking of adding a PCIe SSD module... but maybe it will be the only way

thanks again!
 
I was trying to avoid that since the motherboard only has 3 x PCI, and for the future I was thinking of adding a PCIe SSD module... but maybe it will be the only way
If you buy a new Supermicro motherboard, get an ATX one, not a microATX. I got a microATX one and have already maxed out my RAM and PCIe slots, but would like to add more RAM and PCIe cards. Keep in mind that every VM needs its own GPU if it needs to encode or play back video. And sometimes a VM needs a GPU for AI or OpenCL/CUDA offloading. And if you want a good router and create an OPNsense/pfSense VM, you might want to add some NICs. Being able to swap in a faster NIC (10 Gbit/40 Gbit) is also a good idea. And M.2 expansion cards, like you already said.
 
One thing leaps out here - 30x “write amplification”. Do you have any more info on that, or why exactly this is please?

Regarding Proxmox itself: for home use I guess limiting logging might be a good idea (why write so many logs you generally never need to look at?), or writing logs to tmpfs (sure, not great if your system is unstable, but then you can turn full logging back on).
 
ZFS, virtualization, sync writes and the way SSDs work all cause a lot of write amplification.

I measured a write amplification from guest to host of around factor 7, and another hidden write amplification inside the SSD (from the host to the NAND cells) of factor 3. So in total that's a write amplification of factor 21.

There are several things that cause this:
1.) mixed block sizes:
If you mix block sizes you easily get write amplification, especially if you write with a smaller block size to a storage with a bigger block size. An example:
If you want to write an 8K block to a storage with 4K block size, that is no problem. You just need to write 2x 4K blocks, so you read nothing and only write 8K in total. That's all fine and performant.
If you want to write a 4K block to a storage with 16K block size, that's a problem. First you need to read the full 16K block into RAM. Then you need to change 4K of that 16K in RAM. Then you need to write the full 16K block from RAM back to storage. So you only want to write 4K, but that will cause 16K of reads and 16K of writes. It will be way slower because of all the extra reads and RAM operations, and you will get a write amplification of 4.
And you will get a lot of mixed block sizes... here is my server, for example:
SSD (unknown block size; reports 4K but it should be way higher, because all SSDs are lying about this) <- ZFS pool (4K) <- ZFS dataset (128K) or ZFS zvol (32K) <- virtio SCSI (512B) <- virtual HDD inside guest (4K) <- ext4 filesystem inside guest (4K).
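The two block-size cases above can be written down as a tiny calculation. This is a simplified sketch that assumes aligned writes and ignores everything except the read-modify-write effect; the function name is illustrative.

```python
import math


def rmw_amplification(write_size, block_size):
    """Write amplification for an aligned write of write_size bytes onto
    storage that can only write whole blocks of block_size bytes."""
    blocks_touched = math.ceil(write_size / block_size)
    bytes_written = blocks_touched * block_size
    return bytes_written / write_size


K = 1024
print(rmw_amplification(8 * K, 4 * K))   # 8K write onto 4K blocks
print(rmw_amplification(4 * K, 16 * K))  # 4K write onto 16K blocks
```

The first case yields a factor of 1 and the second a factor of 4, matching the examples.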

2.) sync writes:
If something is doing sync writes, everything will be written twice to the ZFS pool. So you get another write amplification of factor 2.
And if you don't have an enterprise SSD with power-loss protection, your SSD won't be able to use its volatile internal RAM cache to optimize writes. SSDs can only erase big "chunks" of data, and if you want to write a small sector you need to erase and rewrite the complete big chunk. Let's say that chunk is 128K and the block size is 4K.
If you sync write 320x 4KB to an SSD without power-loss protection, it will write 320x 128K, because for each 4K block a complete chunk needs to be rewritten. So those 320x 4KB (1280KB) from the host will write 40960KB to the NAND cells of the SSD, which is another write amplification of factor 32.
If that were an enterprise-grade SSD with power-loss protection, it would cache the 320x 4KB writes (1280KB) and merge them into 10x 128KB writes (1280KB). That way you wouldn't get any additional write amplification.
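The same example as a sketch, under the simplifying assumptions used above (128K erase chunk, every uncached sync write rewrites one full chunk, cached writes merge perfectly). Real SSD controllers are more complicated, and the function name is made up.

```python
def nand_bytes_written(num_writes, write_size, chunk_size, power_loss_protection):
    """Bytes that end up on the NAND for num_writes sync writes of write_size bytes."""
    if power_loss_protection:
        # Writes are collected in the protected cache and flushed as whole chunks.
        total = num_writes * write_size
        chunks = -(-total // chunk_size)  # ceiling division
    else:
        # Every single write forces a rewrite of at least one complete chunk.
        chunks = num_writes * -(-write_size // chunk_size)
    return chunks * chunk_size


K = 1024
host_bytes = 320 * 4 * K
for plp in (False, True):
    nand = nand_bytes_written(320, 4 * K, 128 * K, plp)
    print(f"PLP={plp}: {nand // K} KB on NAND, factor {nand / host_bytes:g}")
```

Without power-loss protection this gives 40960 KB and a factor of 32; with it, 1280 KB and a factor of 1.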

3.) ZFS:
ZFS is a copy-on-write filesystem. If you edit a file, it won't overwrite the existing blocks in place; it keeps the old blocks as they were, writes the changed blocks to a new location, and then updates the metadata that points to them. And there is a lot of complex stuff running in the background. If you are running some kind of RAID you will probably write everything twice. And there is a lot of overhead like parity data, metadata, checksums, ...

And the problem is that write amplification factors don't add up, they multiply... for example:
4x write amplification due to mixed block sizes
2x write amplification due to sync writes
32x write amplification due to missing powerloss protection inside the SSD
...so you get 4 * 2 * 32 = 256x write amplification.
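As a sketch of why the factors multiply: each layer turns every byte it receives into several bytes for the layer below, so the totals chain. The helper below just multiplies whatever per-layer factors you plug in; the numbers are the ones used in this post.

```python
import math


def total_write_amplification(factors):
    """Combined write amplification of stacked layers (product, not sum)."""
    return math.prod(factors)


# Example from above: block sizes, sync writes, missing power-loss protection
print(total_write_amplification([4, 2, 32]))
# Measured on my server: guest -> host, then host -> NAND
print(total_write_amplification([7, 3]))
```

Adding the example factors would give only 38, while multiplying them gives 256, which is why small factors at every layer matter so much.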

These were just simple examples, and it is way more complex in reality. But the point is that you get a little bit of write amplification everywhere, and that multiplies and causes massive writes. My home server, for example, easily writes 500GB of logs/metrics to the SSDs every day. And that's just some text written to DBs that the host and guests produce by themselves...
That's why consumer SSDs are bad for server workloads: a normal work or gaming PC has workloads with way less write amplification. So a consumer SSD may live forever in a gaming PC but may die within months if used in a server.
 