How should I configure my drives

Urbz

New Member
Nov 24, 2020
I recently got a great deal on a used Dell OptiPlex 7040 SFF that I plan to run PVE 6.3 on. The system has a single M.2 NVMe slot and only a single SATA power cable coming from the motherboard, which splits into a standard-sized SATA power connector and a slimline SATA power connector for the included slimline optical drive. This forces me to get a bit creative about how to configure my drives.

I want to run two SSDs in a ZFS mirror. I only plan on running pfSense and a UniFi Controller at the moment, so partitioning these drives for the root filesystem and my VMs gives me enough storage space. I also plan to back things up regularly to an external machine over an NFS share. I might want to add a 2.5-inch HDD for junk storage in the future; this is less important, but if you had this system, where would you put that drive after setting up the SSD mirror with the options below?

Option 1
  • M.2 NVMe SSD
  • 2.5-inch SATA SSD
Option 2
Option 3
My current thoughts and added info
  • All of these options cost the same for me
  • I understand that I will lose the speed advantage of NVMe by mirroring the drives like this, but I honestly don't mind as long as the setup is compatible
  • I have no idea about the quality or reliability of optical-drive-to-SSD caddy adapters
  • I have heard horror stories of fires caused by molded SATA power splitters, but I cannot find a single non-molded splitter that will work with my system. The only option seems to be the black StarTech 4x splitter, but that one will not fit in this small case because I need the wires to come straight out the back of the connector, not vertically. The best option seems to be the Dell OEM part I linked above. It's molded, but it's the exact part Dell uses in very similar systems, so I would hope it's a quality part.
Sorry for the long read, but I wanted to include as much info as possible. Thank you to everyone in advance for the input.
 

Urbz

New Member
Nov 24, 2020
Like you already said, the Molex adapter you linked is known to catch fire, and pictures of it look really awful.

Don't get the molded ones; get the crimped ones, for example (Molex): https://www.ebay.ca/itm/253907755081

I got a bunch of these SATA splitters and they are great: https://www.ebay.ca/itm//123881177343
I can't use Molex either, so I'm stuck with the SATA splitters. I was only seeing ones like this: https://www.amazon.com/StarTech-com-Power-Splitter-Adapter-PYO4SATA/dp/B0086OGN9E/

I'll have to measure to be sure, but that second link looks like it might actually fit, so thank you! I would have preferred a smaller 1-male-to-2-female splitter, but this beats starting a fire.
 

Dunuin

Famous Member
Jun 30, 2020
Germany
Keep in mind that it isn't recommended to run ZFS on consumer-grade SSDs, and there is only one enterprise-grade M.2 SSD with power-loss protection and high write endurance, and it is expensive (Intel P4801X: 100GB for 273€).
Mirroring an M.2 drive with a SATA drive also shouldn't give you a real speed benefit, because the M.2 has to wait for the SATA drive to finish so both stay in sync.
If you only want to use cheap consumer-grade SSDs, it is also possible to use RAID1 with mdraid software RAID if you install Debian first and then the Proxmox packages on top of it.
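A rough sketch of what that mdraid route could look like after a plain Debian install — the device names and partition layout here are examples, not a recipe for this exact machine:

```shell
# Assumed layout: a fresh Debian install, with one spare partition on each
# disk set aside for VM storage (/dev/sda2 and /dev/nvme0n1p2 are examples).
apt install mdadm

# Build a RAID1 (mirror) array from the two partitions.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/nvme0n1p2

# Persist the array configuration and put a filesystem on the mirror.
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
mkfs.ext4 /dev/md0

# Afterwards, add the Proxmox VE repository and install its packages on top
# of Debian (the "Install Proxmox VE on Debian" wiki article covers this).
```

Note that mixing an NVMe and a SATA drive in one mirror works, but the array will run at the speed of the slower SATA device.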
 

Urbz

New Member
Nov 24, 2020
I guess I'm going to have to look into this more. I don't have any experience with ZFS, so I didn't know how much more intensive it is. I'm only going to be running extremely simple VMs that won't be writing dozens and dozens of gigabytes a day. Will ZFS really use that much RAM and beat up consumer SSDs on a machine with 3 VMs and less than 1TB of storage? I just wanted a simple RAID1 setup to use as my only storage for Proxmox, and it seemed like ZFS was the way to go based on the threads and videos I've seen. I'll have to look into the mdraid option you mentioned, because it looks closer to the btrfs RAID1 I'm used to, but the wiki says it's only recommended for experienced users, which I am not.
 

Dunuin

Famous Member
Jun 30, 2020
Germany
The problem is write amplification. Unlike HDDs, SSDs don't really have a block size of 512B or 4K, whatever the LBA tells you. The smallest block the flash can rewrite is more likely on the order of some MB. So if you just want to change a single byte, some MB must be written; write 1000x 1 byte and some GB may end up written. And the TBW your SSD is rated for (and the warranty covers) is the total data written to the flash, not the amount of data you send to the SSD.

For async writes the write amplification isn't that bad, because most consumer SSDs (except the really cheap ones) have a RAM cache, so write operations are cached until enough data is collected and the SSD can write some MB at once. But on a power outage all data in the RAM cache is lost, so RAM caching can't be used for sync writes, where it is important that the data is really safely stored. Enterprise-grade SSDs offer power-loss protection, which saves the data in RAM even during an outage, so they get a much better write amplification on sync writes. The internal block size of enterprise-grade SSDs is also smaller in general, and they often use the more durable SLC/MLC flash instead of the cheap TLC/QLC flash.
Also keep in mind that the virtualization layer, the mismatched padding it causes, and the fact that ZFS is a copy-on-write filesystem all add to the write amplification.

My home server, for example, has a write amplification factor of 20. My VMs write 30GB of data (mostly logs and metrics) per day when idling, and my SSDs write 600GB per day to store those 30GB. A Samsung 870 QVO 1TB is rated for 360 TBW, so in my case it wouldn't last 2 years going by the numbers alone (most Samsung SSDs last longer than they declare). Most likely even less, because I'm already using enterprise-grade SSDs with a better write amplification than the consumer-grade Evos/QVOs.
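The arithmetic behind those numbers is worth making explicit; a quick shell check using the figures from this post:

```shell
# Write amplification factor: NAND writes vs. what the VMs actually write.
guest_gb_per_day=30      # data the VMs write per day (logs, metrics)
nand_gb_per_day=600      # data actually written to the NAND flash per day
waf=$((nand_gb_per_day / guest_gb_per_day))

# Rough lifetime of a 360 TBW (= 360,000 GB) drive at that write rate.
tbw_gb=360000
days=$((tbw_gb / nand_gb_per_day))

echo "write amplification: ${waf}x"
echo "rated lifetime: ${days} days"
```

600 days is a little over a year and a half, which is why the 870 QVO "won't last 2 years" at this write rate.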

And ZFS needs a lot of RAM, and ECC RAM is recommended: 4GB plus 1GB of RAM per 1TB of raw disk capacity, and 5GB of RAM per 1TB of raw capacity if you want to use deduplication.

You can try ZFS and see how it works in your case. Just keep the TBW in mind and regularly check your SMART values.
Mdraid would be an alternative for low-budget hardware, but it lacks all the nice data-safety features ZFS offers.
 

Urbz

New Member
Nov 24, 2020
Ok you've done an amazing job helping me understand things better but I have a few nooby questions.

First, when you say your VMs are writing 30GB but your SSDs are writing 600GB per day, I understand this is the write amplification you are talking about. My question is: when I check iotop, are the disk writes I see the TOTAL disk writes, including the write amplification, with or without ZFS?

Second, does ZFS use that RAM at all times? On a system with 8GB of RAM and only 1TB of storage, does that mean I can only allocate a total of 3GB of RAM across my VMs, or can ZFS use unused RAM that has been allocated to a VM in order to meet the 5GB needed in this scenario?
 

Dunuin

Famous Member
Jun 30, 2020
Germany
First, when you say your VMs are writing 30GB but your SSDs are writing 600GB per day, I understand this is the write amplification you are talking about. My question is: when I check iotop, are the disk writes I see the TOTAL disk writes, including the write amplification, with or without ZFS?
No, iotop won't tell you the real numbers.

If you run iostat 3600 2 (apt install sysstat) inside the VMs and wait for an hour, you get the real amount of data your VMs want to store on their virtual hard disks in that hour (30 GB/day in my case). If you run iostat on your hypervisor and look at sda, nvme0n1 and so on, you see the data that should be stored on the physical SSDs (around 200 GB/day in my case). That data is already write-amplified by the virtualization, ZIL writes, journaling, encryption, checksums and other metadata, parity and padding overhead; in my case that is a factor of 7 over what the VMs try to write. But that value still excludes the write amplification caused by the SSD itself.

If you want the real amount of data written to the flash of your SSD, you need to run smartctl (for example smartctl -a /dev/sda) and read the SMART values of your drive. If you are lucky, your SSD model offers something like "NAND_32MiB_Written" (raw value * 32MiB = real total amount written to the NAND flash) or a similar attribute that monitors the actual writes to the NAND flash of that drive (600 GB/day in my case, so my SSD adds another write amplification of factor 3). But not every SSD offers this; often you just get something like "Host_Writes_32MiB" (raw value * 32MiB = total writes to the SSD, but not to the NAND flash inside it; it increases by 200 GB/day in my case, matching the value from iostat), which only tells you the amount of data before the drive's own final write amplification. It's hard to find out whether your SSD model shows the real amount of data written or the number without the internal write amplification, because SMART attributes aren't standardized. Since that last write amplification happens inside the SSD, no Linux software can monitor the real amount of NAND writes unless the drive's firmware exposes the value as a SMART attribute.
And you won't find any information about internal block size, write amplification and so on in any datasheet. At least I didn't find a single one when checking several manufacturers.
You can guess whether your SMART attribute shows the final amount of data written to the NAND flash by monitoring that SMART value alongside the writes iostat reports. If both are nearly the same rather than a multiple, your SSD isn't giving you the real numbers.
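To make the "raw value * 32MiB" conversion concrete, a small shell sketch — the raw value 6250 is an invented example, not a reading from a real drive:

```shell
# Hypothetical raw value of a "Host_Writes_32MiB"-style SMART attribute,
# as it might appear in `smartctl -a /dev/sda` output. 6250 is made up.
raw=6250

# Each count represents 32 MiB; convert counts to MiB, then to GiB.
mib=$((raw * 32))
gib=$((mib / 1024))

echo "${gib} GiB written"
```

Comparing that figure against what iostat reports over the same window tells you which side of the drive's internal write amplification the attribute is measuring.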

I just want to warn you so you don't repeat my mistakes. I first started with a ZFS mirror (RAID1) of two Samsung 970 Evo NVMe SSDs. Because of the high write amplification, the new drives would only have survived for about a year. So I removed them and tried a ZFS mirror of two WD Red HDDs, because HDDs don't suffer from wear under high write loads. But the high write amplification also multiplied the IOPS by a factor of 7, and the HDDs just couldn't handle all the small random writes. So I replaced them with some second-hand enterprise-grade SATA SSDs with power-loss protection, MLC flash, and 30x the TBW of the Samsung Evos. Those SSDs are working fine and will survive for years even writing 600GB per day. And because I bought them second-hand, they were cheaper than my Samsung Evos but still had 98-100% of their TBW left. I chose SATA enterprise SSDs because they were much cheaper than the faster U.2 SSDs (there are M.2-to-U.2 adapters). For M.2 I only found one durable enterprise SSD with power-loss protection, and that is the expensive (3000€ per TB) Intel P4801X, which you won't find on the second-hand market.

You can try cheap consumer SSDs, but remember to monitor them so you don't miss the return window if they can't handle the amplified writes. I realized that too late and wasted 250€ on Samsung SSDs I can't use and have no use for.
Second, does ZFS use that RAM at all times? On a system with 8GB of RAM and only 1TB of storage, does that mean I can only allocate a total of 3GB of RAM across my VMs, or can ZFS use unused RAM that has been allocated to a VM in order to meet the 5GB needed in this scenario?
You can manually set the ARC size. By default Proxmox will use up to 50% of your RAM for ZFS caching, and ZFS will keep the specified amount of RAM filled all the time. So yes, if you only have 8GB of RAM and limit the ARC to 5GB, ZFS will always use the full 5GB, leaving you only around 1 to 2 GB for VMs (Proxmox itself also needs RAM, and Linux does additional file caching).
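Setting the ARC limit is done via the zfs_arc_max module parameter; a sketch of capping it at 2 GiB on an 8GB machine (the 2 GiB figure is just an example, pick a value that fits your workload):

```shell
# Cap the ZFS ARC at 2 GiB. zfs_arc_max takes a value in bytes.
echo "options zfs zfs_arc_max=$((2 * 1024 * 1024 * 1024))" > /etc/modprobe.d/zfs.conf

# Rebuild the initramfs so the limit applies at the next boot
# (required when the root filesystem itself is on ZFS).
update-initramfs -u
```

The current limit can be inspected at runtime via /sys/module/zfs/parameters/zfs_arc_max.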
 

Urbz

New Member
Nov 24, 2020
Thank you for taking the time to explain everything with such detail. I've learned a lot!
 
