High SSD wear after a few days

I don't mind doing a new setup. I created the zpool as you suggested, and I think I get it.

So you are suggesting 2 small SSDs (mirror) for root, and then 2 large SSDs (mirror) for all my VMs. Root will be LVM (ext4), and the large SSDs will also be LVM (ext4) with a zpool on top of that? Did I get that right?
Since I have the 3 SSDs (and hopefully 4 soon), can I do the same, but with 4 SSDs in RAID10?

I am only able to set up Proxmox with the installer. What you are suggesting is a manual setup. I am not comfortable doing that unless I have a guide that I can follow.

I appreciate your help. Thank you!!!

Hi,
Just wondering if you ever figured this out?
I am thinking of following a similar setup.
My current setup uses a 1 TB NVMe and a 1 TB SATA SSD in ZFS RAID1. I put it into service on October 7th, 2019, and after a couple of months it's showing about 13.9 TBW so far. That's something close to 7 TB a month being written.
I have 1 VM running 24/7, Windows Server 2019 Standard, which is backed up nightly with a vzdump snapshot to an NFS share; the backup is about 66 GB.
The wear seems to be going fast, though. The SATA SSD has an endurance of about 300 TBW and shows 90% remaining life. The NVMe has an endurance of about 600 TBW and shows 2% used so far.
I am going to switch to 2 smaller 120 GB mirrored SATA SSDs for the OS and 2x 960 GB SATA SSDs for the VMs and storage.
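
As a rough sanity check on the figures above, here is a minimal Python sketch that projects how long each drive would take to reach its rated endurance. The function name is made up for illustration, and it assumes the roughly 7 TB/month rate stays constant and treats the TBW rating as a hard limit, which in practice it is not.

```python
def months_to_rated_endurance(rated_tbw: float, written_tb: float, tb_per_month: float) -> float:
    """Months until `rated_tbw` is reached at a constant write rate."""
    if tb_per_month <= 0:
        raise ValueError("write rate must be positive")
    remaining_tb = max(rated_tbw - written_tb, 0.0)
    return remaining_tb / tb_per_month


# Figures quoted in the post: 13.9 TB written so far, about 7 TB/month.
for name, rated in (("SATA SSD", 300.0), ("NVMe", 600.0)):
    months = months_to_rated_endurance(rated, 13.9, 7.0)
    print(f"{name}: ~{months:.0f} months (~{months / 12:.1f} years) to reach {rated:.0f} TBW")
```

At that rate the SATA drive would reach its rating in roughly 3.4 years and the NVMe in roughly 7.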

I really wish I could find out what is causing the heavy writes.
Thanks so much.
 
Hi,

Yes, I have figured this out. Since I posted here, the system has been running in production non-stop. I never had any issues or any drives fail. I just ordered new SSDs to replace the Samsung 840 Pros with Intel 3700s. This is more of a precaution, since the drives have been in full-time operation since approx. Dec 2015. I also have a database running on it.

I highly suggest setting up 2x (small) SSDs in a mirror for the root partition, and then using whatever number of drives you need, in mirror or RAID10. The advantage is that if you have to upgrade Proxmox, you can just back up all the config files, detach the zpool that has all the VMs, upgrade or reinstall Proxmox, copy the config files back, import the zpool, and you are ready to go. I have done that a couple of times already and it is fast. You can basically rebuild a system in about 2 hours.

I set up the root zpool with the Proxmox installer, but then created the "storage" zpool manually.
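
The reinstall workflow described above could look roughly like the following dry-run sketch, which only prints the commands in order and runs nothing. Several things here are assumptions rather than the poster's exact procedure: that "detach" means `zpool export`, that the config files of interest live under `/etc/pve`, and that the backup archive sits on storage outside the root mirror (the `/mnt/backup/...` path is hypothetical).

```python
import shlex


def reinstall_plan(pool: str = "storage",
                   backup: str = "/mnt/backup/pve-config.tar.gz") -> list:
    """Return the shell commands for the reinstall workflow, in order (dry run only)."""
    q = shlex.quote
    return [
        # 1. Back up the config files (assumed location: /etc/pve) to a place
        #    that survives the reinstall, i.e. not on the root mirror.
        f"tar -czf {q(backup)} -C / etc/pve",
        # 2. Cleanly release the VM pool ("detach" in the post, assumed to mean export).
        f"zpool export {q(pool)}",
        # 3. Manual step: upgrade or reinstall Proxmox on the root mirror only.
        "# reinstall Proxmox VE on the root mirror, leaving the VM pool disks untouched",
        # 4. Unpack the backup to /tmp and copy the needed config files back by hand.
        f"tar -xzf {q(backup)} -C /tmp",
        # 5. Bring the VM pool back.
        f"zpool import {q(pool)}",
    ]


for command in reinstall_plan():
    print(command)
```

Printing the plan first makes it easy to review each step before typing anything on a production host.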
 
Thank you so much, Eddie. Well, I have only been running in production since October 2019, so you are definitely ahead of me. I am running ZFS on an 8th-generation Intel NUC i7 with the specs I showed above.
I have since ordered a Supermicro AS-5019D-FTN4, which is something like an embedded server in a half-depth, 1U rack chassis. It can accommodate 2 NVMe drives: 1 on board and the other on a PCIe add-on card. However, because I have more SATA drives, I cannot fit the optional PCIe card.
So my setup is 2 pairs of SSDs. One pair, like you suggested, is 2x 240 GB Intel D3-S4610 drives, which are rated for 3 DWPD or 1.4 PBW. The VMs will reside on a pair of 960 GB Intel D3-S4610 SSDs, also rated for 3 DWPD, or something close to 5-6 PBW, which is a lot better than the previous models. I don't have room for RAID 10; the motherboard itself only has 4x SATA III connectors.
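
For reference, DWPD and TBW ratings can be converted into each other. A minimal sketch follows; the helper name is made up, and the 5-year warranty period is an assumption, so check the datasheet for the exact figure.

```python
def dwpd_to_tbw(dwpd: float, capacity_tb: float, warranty_years: float = 5.0) -> float:
    """Convert drive-writes-per-day into total terabytes written over the warranty."""
    return dwpd * capacity_tb * 365 * warranty_years


for capacity_gb in (240, 960):
    tbw = dwpd_to_tbw(3, capacity_gb / 1000)
    print(f"{capacity_gb} GB at 3 DWPD: ~{tbw / 1000:.2f} PBW")
```

Under that assumption the result is about 1.3 PBW for the 240 GB drives and about 5.3 PBW for the 960 GB drives, which is in the same range as the ratings quoted above.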

I have everything assembled; I just thought I would run some memory tests before installing Proxmox 6.1.
Just a question:
When I installed the root partition, it created a local and a local-zfs storage by default. I then added a storage location on the second set of 960 GB SSDs in a ZFS1 configuration.

Is this normal, or should I be setting it up differently?

I have this machine doing nightly backups using vzdump to NFS storage on a NAS on the same network.
How do you copy the config files if you are setting up another system?

I'm quite new to this. Thanks.
 
Hi,

I'm jumping on the wagon, as I'm also planning a new Proxmox server and there is some good info in this thread.

I got a bit worried about going SSD for a Proxmox build, so I started to think about the safer alternative of going with good old mechanical disks.

The pieces are under way, the hardware will be:
- motherboard: Supermicro X9DRH-7TF
- cpu: 2x Intel Xeon E5-2620 v2, 2.1 GHz, 6 cores (12 cores total)
- ram: 128 GB - Samsung M393B2G70BH0-CK0 (8x 16 GB RDIMM 1600 MHz 1.5 V)
- chassis: Supermicro CSE-826BE16-R920LPB (12x 3.5" SAS caddies in front)

Now, I've read some horror stories about Proxmox on SSDs, or maybe the problem is just ZFS on SSDs, or maybe it is even related to the type of ZFS pool ... there is still some inconsistent info around.

Initially I was thinking of going RAIDZ2, but after reading around I found that mirrors should be the way to go: safer and more performant.

So after I receive the remaining components and finish assembling the server, it's time to install some disks.

To be on the safe side, I was thinking of going with:

Proxmox OS:
2x 2.5" WD Blue 500 GB (5400 rpm) as ZFS mirror (with grub), created by the Proxmox install wizard

For virtualization data itself:
1x ZFS pool, composed of:
- vdev-1 (mirror): 2x WD Ultrastar 1 TB (7200 rpm)
- vdev-2 (mirror): 2x WD Ultrastar 1 TB (7200 rpm)

I would love to go with SSDs, but to be on the safe side I was thinking of going with the above mechanical disks (although I'm still able to change that, as I haven't bought them yet).

Space is not an issue, as I believe 300-400 GB is more than enough for my needs in the coming years, so the above will give me 2 TB of space, which is well above my needs.


So the first big question is:

Is moving to plain ZFS mirrors (RAID10 equivalent) instead of raidzX what prevents the excessive SSD wear?


If this is the solution, would an SSD setup like the following one be preferable, or should I not trust those consumer SSDs for the job?

Proxmox OS:
2x 2.5" Samsung 860 Evo 250 GB SSD as ZFS mirror (with grub), created by the Proxmox install wizard

For virtualization data itself:
1x ZFS pool, composed of:
- vdev-1 (mirror): 2x ADATA SX8200 PRO 512 GB NVMe
- vdev-2 (mirror): 2x ADATA SX8200 PRO 512 GB NVMe

(The NVMe disks will be mounted internally with a PCIe to NVMe adapter; the OS SSDs will probably be mounted inside the case on the SATA connectors.)

I also thought about replacing those "ADATA SX8200 PRO 512 GB" drives with "Samsung SM883 240 GB" ones to get better write endurance. The total space would drop to 480 GB, but I believe that would still be in line with my needs for some years to come.


So what are your opinions? Are SSDs a safe approach if using ZFS mirrors and separating the OS from the VM storage data?


Thanks.
 
Another thing. Maybe I got sidetracked. In another post, I was investigating whether the high reads and writes could be caused by write amplification. Something to do with the drive block size being the default 8k versus something like 64k.
Did you find anything like that?
I am in the midst of setting up another two Proxmox servers, so I want to try to get things right the first time as much as possible.

If I change the block size to 64k, then I'm supposed to format the file system inside the Windows Server VM with 64k as well.

Anyway, I would appreciate any feedback you have on this.
Thanks.
 
Something to do with the drive Block size being the default 8k versus something like 64k.
Did you find anything like that?

Most SSDs use 16k internally, so 64k is OK (4 x 16k). Yes, you can get RMW (read-modify-write), but this depends a lot on the services/load in this Windows VM. Compression can also make things more complicated (for example, if your 64k block compresses to 3 x 16k, that is fine, because you will write only 3 x 16k instead of 4 x 16k). Most SSDs also do compression themselves.

In the end, you need to run some tests with your real VM and see what your optimal settings are.
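
The "4 x 16k" reasoning can be sketched as simple page arithmetic. This is only an illustration: the 16 KiB internal page size is the assumption made in the post, the function names are made up, and real controllers combine and cache writes, so actual behaviour will differ.

```python
import math

PAGE_KIB = 16  # assumed internal SSD page size, taken from the post above


def pages_written(write_kib: float, page_kib: int = PAGE_KIB) -> int:
    """Whole flash pages needed to store a write of `write_kib` KiB."""
    return math.ceil(write_kib / page_kib)


def padding_factor(write_kib: float, page_kib: int = PAGE_KIB) -> float:
    """Ratio of flash space consumed to data actually written (1.0 = no waste)."""
    return pages_written(write_kib, page_kib) * page_kib / write_kib


for size in (8, 40, 48, 64):
    print(f"{size:>2}k write -> {pages_written(size)} page(s), factor {padding_factor(size):.2f}")
```

In this simplified model an isolated 8k write still occupies a full 16k page, while a 64k block that compresses to anything between 33k and 48k costs three pages instead of four.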


Good luck / Bafta.
 
Well, I got an error email last night and again today.

The following warning/error was logged by the smartd daemon:

Device: /dev/sdc [SAT], 1 Currently unreadable (pending) sectors.

This was the output of smartctl -a /dev/sdc:

[Attached screenshot: Capture29.PNG]
 
Code:
root@VMNode01:/# zdb | grep ashift
            ashift: 12
            ashift: 9
            ashift: 9


root@VMNode01:/# lsblk -dt /dev/sd?
NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED    RQ-SIZE  RA WSAME
sda          0    512      0     512     512    0 deadline     128 128    0B
sdb          0    512      0     512     512    0 deadline     128 128    0B
sdc          0    512      0     512     512    0 noop         128 128    0B
sdd          0    512      0     512     512    0 noop         128 128    0B
sde          0    512      0     512     512    0 noop         128 128    0B
sdf          0    512      0     512     512    0 noop         128 128    0B
sdg          0    512      0     512     512    1 deadline     128 128    0B


Then the zpool layout:
Code:
root@VMNode01:/# zpool status
  pool: rpool
state: ONLINE
  scan: none requested
config:


        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0


errors: No known data errors


  pool: storage
state: ONLINE
  scan: none requested
config:


        NAME                                               STATE     READ WRITE CKSUM
        storage                                            ONLINE       0     0     0
          mirror-0                                         ONLINE       0     0     0
            ata-Samsung_SSD_850_PRO_512GB_S250NWAG904013M  ONLINE       0     0     0
            ata-Samsung_SSD_850_PRO_512GB_S250NXAG936968X  ONLINE       0     0     0
          mirror-1                                         ONLINE       0     0     0
            ata-Samsung_SSD_850_PRO_512GB_S250NWAG905249J  ONLINE       0     0     0
            ata-Samsung_SSD_850_PRO_512GB_S250NWAG905234V  ONLINE       0     0     0


errors: No known data errors

And finally the SMART data for all 6 drives:
Code:
root@VMNode01:/# smartctl -a /dev/sda
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.2.3-2-pve] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===
Device Model:     ADATA SX930
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device

...


root@VMNode01:/# smartctl -a /dev/sdb
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.2.3-2-pve] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===
Device Model:     ADATA SX930
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical

...


root@VMNode01:/# smartctl -a /dev/sdc
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.2.3-2-pve] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 850 PRO 512GB
Firmware Version: EXM02B6Q
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical

...


root@VMNode01:/# smartctl -a /dev/sdd
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.2.3-2-pve] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 850 PRO 512GB
Firmware Version: EXM02B6Q
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical

...


root@VMNode01:/# smartctl -a /dev/sde
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.2.3-2-pve] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 850 PRO 512GB
Firmware Version: EXM02B6Q
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical

...


root@VMNode01:/# smartctl -a /dev/sdf
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.2.3-2-pve] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 850 PRO 512GB
Firmware Version: EXM02B6Q
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical

...

Hi,

From what I see in this quoted post, the system was configured as below:

Proxmox OS:
2x ADATA SX930 120 GB as ZFS mirror (with grub), created by the Proxmox install wizard

For virtualization data itself:
1x ZFS pool, composed of:
- vdev-1 (mirror): 2x Samsung SSD 850 PRO 512 GB
- vdev-2 (mirror): 2x Samsung SSD 850 PRO 512 GB

This was around 2015, so after these 4-5 years, how much has the system worn those SSDs? Any trouble with it?

What made you pick Samsung 850 PRO SSDs instead of going with more enterprise-grade hardware for the job? Price?


If going for a similar SSD solution, I could think of:

Proxmox OS:
2x ADATA SX8200 PRO 256 GB as ZFS mirror (with grub), created by the Proxmox install wizard
60€ each SSD

For virtualization data itself:
1x ZFS pool, composed of:
- vdev-1 (mirror): 2x Samsung SSD 970 PRO 512 GB
- vdev-2 (mirror): 2x Samsung SSD 970 PRO 512 GB
170€ each SSD

That would put me at around 800€ :eek: which is a huge price if I end up with SSD wear problems.

Going for plain hard drives would be around 350€ cheaper. I believe I'd get the performance I need, but obviously SSDs are expected to pulverize hard disk performance.


Any opinions are very much appreciated.

Thank you.
 
Hi, I would like to re-open this thread. It has been a few months since I migrated my ESXi setup to Proxmox VE, and I am seeing very strange SSD disk usage. My old ESXi setup used an LSI 9211 and two 750 GB MX300 SSDs in RAID 1. Here are the stats from one of these drives:

Code:
Device Model:     Crucial_CT750MX300SSD1
Sector Size:      512 bytes logical/physical
9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       45223
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       193920876835
202 Percent_Lifetime_Remain 0x0030   054   054   001    Old_age   Offline      -       46

45223 hours is 5.158909 years, and since the disk uses 512-byte sectors, Total LBAs Written gives 90.30 TiB (about 99.3 TB) in those five years, which is about 1.46 TiB/month.
So this disk is somewhere in the middle of its lifetime (rated TBW is 220). All fine and clear.
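
The arithmetic behind those MX300 figures can be reproduced with a short Python sketch. It assumes attribute 246 counts 512-byte logical sectors, as the post does, and it makes the unit explicit, since TBW ratings are given in decimal terabytes while the 90.30 figure comes out in binary tebibytes.

```python
def lbas_to_bytes(lbas: int, sector_bytes: int = 512) -> int:
    """Total bytes written, given a SMART LBA counter and the sector size."""
    return lbas * sector_bytes


written = lbas_to_bytes(193_920_876_835)      # attribute 246 of the MX300
tb = written / 10**12                         # decimal terabytes, as used in TBW ratings
tib = written / 2**40                         # binary tebibytes
months = 45_223 / (365.25 * 24 / 12)          # power-on hours -> average months

print(f"{tb:.2f} TB = {tib:.2f} TiB, {tib / months:.2f} TiB/month")
print(f"share of the 220 TBW rating: {tb / 220:.0%}")
```

The resulting share of about 45% is close to the 46% used that the drive itself reports in attribute 202.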

All of those virtual machines have been migrated to PVE 7.0, using two 2 TB MX500 SSDs in a ZFS mirror:

Code:
pool: storage
 state: ONLINE
  scan: scrub repaired 0B in 00:49:47 with 0 errors on Sun Jan  9 01:13:49 2022
config:

        NAME                                  STATE     READ WRITE CKSUM
        storage                               ONLINE       0     0     0
          mirror-0                            ONLINE       0     0     0
            ata-CT2000MX500SSD1_XX  ONLINE       0     0     0
            ata-CT2000MX500SSD1_XX  ONLINE       0     0     0

Now let's look at some stats from one of these disks:

Code:
Device Model:     CT2000MX500SSD1
Firmware Version: M3CR033
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1093
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       67185503643
202 Percent_Lifetime_Remain 0x0030   090   090   001    Old_age   Offline      -       10

1093 hours is 45.54167 days. Using the logical sector size, total bytes written is 31.29 TiB (!!!!) in those 45 days (20.86 TiB/month), and computing with the physical size of 4096 gives an insane 250.29 TiB (166.82 TiB/month).

According to the spec, the 2 TB MX500 has an endurance of 700 TBW. Neither of these numbers comes out to 10% of 700 TB. Anyway, 10% wearout in just 45 days seems totally out of proportion to the load in the ESXi setup.

In the comments on the calculator at https://www.virten.net/2016/12/ssd-total-bytes-written-calculator/ somebody also mentions this way of showing disk stats:
Code:
root@pve:~# smartctl /dev/sdh -l devstat | less
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.11.22-5-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4               1  ---  Lifetime Power-On Resets
0x01  0x010  4            1093  ---  Power-on Hours
0x01  0x018  6      2771135243  ---  Logical Sectors Written
0x01  0x020  6       839078384  ---  Number of Write Commands

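It gives a different number of written sectors (why?!). With a 512-byte sector size it is 1.29 TiB (0.86 TiB/month); with 4096 it is 10.32 TiB (6.88 TiB/month). Finally one number comes out near 10, but note that 10.32 TiB is only about 1.6% of the 700 TBW rating, so it still does not explain a 10% wearout.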
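
To compare the two counters against the endurance rating directly, here is a small Python sketch. It only redoes the arithmetic for the four counter and sector-size combinations discussed above; which counter and which sector size the drive really uses is not established in this thread.

```python
RATED_TBW = 700  # MX500 2 TB endurance rating quoted in the post, in decimal TB

counters = {
    "attribute 246 (Total_LBAs_Written)": 67_185_503_643,
    "devstat (Logical Sectors Written)": 2_771_135_243,
}


def percent_of_rating(sectors: int, sector_bytes: int, rated_tbw: float = RATED_TBW) -> float:
    """Host writes as a percentage of the rated endurance."""
    return sectors * sector_bytes / 10**12 / rated_tbw * 100


for name, sectors in counters.items():
    for sector_bytes in (512, 4096):
        pct = percent_of_rating(sectors, sector_bytes)
        print(f"{name}, {sector_bytes} B sectors: {pct:.1f}% of {RATED_TBW} TBW")
```

By this arithmetic the four combinations come out at roughly 4.9%, 39.3%, 0.2% and 1.6% of the rating. None of them lands at 10%, so the reported wear may reflect more than host-side writes alone.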

I like the Proxmox ZFS setup, but it is totally insane how ZFS is destroying SSDs. Should I migrate to LVM to solve this?

One more note: PVE itself is installed on a different (NVMe) drive, so no writes from the Proxmox services land on this pool, which holds only the VMs.
 
To quote the PVE ZFS Benchmark paper FAQ page 8 again:
Can I use consumer or pro-sumer SSDs, as these are much cheaper than enterprise-class SSDs?
No. Never. These SSDs won't provide the required performance, reliability or endurance. See the fio results from before and/or run your own fio tests.

ZFS has a high overhead, especially if you have DBs that do a lot of small sync writes. Virtualization, nested filesystems, etc. add overhead too, and the resulting write amplification factors multiply rather than add, so the total grows very quickly. In my ZFS benchmarks with enterprise SSDs I've seen write amplification between factor 3 (big async sequential writes) and factor 81 (4k random sync writes), with an average of factor 20 on my homeserver measured over months. So you can get really heavy wear even when not writing that much. My homeserver, for example, only writes 45 GB per day, but because of the factor 20 write amplification this hits the SSDs with 900 GB per day.
LVM will cause far fewer writes, but then you have no redundancy or bit rot protection, so you can never be sure that the data on your disks, or the data in all of your backups, hasn't been silently corrupted over time.
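
To illustrate why the layers multiply instead of adding up, here is a minimal Python sketch. The per-layer factors are hypothetical values picked only so that their product is 20; the 45 GB/day and factor 20 figures are the ones from this post.

```python
from math import prod


def total_write_amplification(layer_factors) -> float:
    """Layers multiply: each layer amplifies what the layer above hands it."""
    return prod(layer_factors)


def ssd_gb_per_day(guest_gb_per_day: float, amplification: float) -> float:
    """Data that actually reaches the SSDs per day."""
    return guest_gb_per_day * amplification


# Hypothetical per-layer factors, chosen only to illustrate the multiplication.
layers = {"guest filesystem": 2.0, "zvol/volblocksize": 2.5, "ZFS sync + mirror": 4.0}
print(f"combined factor: {total_write_amplification(layers.values()):.0f}")

# Figures from the post: 45 GB/day at an average factor of 20.
print(f"{ssd_gb_per_day(45, 20):.0f} GB/day hit the SSDs")
```

Three modest factors of 2, 2.5 and 4 already combine to 20, whereas adding them would only give 8.5.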
 
Should I migrate to LVM to solve this?
Hi.

Good question.
Yes, you can. This could be an option.
The second option, the hard way, would be to optimize your PMX setup (including ZFS) so that in the end you have a lower wearout per month.
And the third option could be to add your old SSDs (the MX300s) to your current PMX setup (so you can spread your total writes across more SSDs).

Write operations could also be decreased if you adjust some settings in the VM/VMs that you run in PMX.
Also take into account that, as a general rule, IF you want to be on the safe side, you must run ZFS on enterprise-grade SSDs.

But sometimes you cannot have such SSDs in every setup (me included). In some cases it may be better to use a mixed setup (HDDs + SSDs, i.e. ZFS special devices).

In the end you must decide whether the advantages ZFS provides will pay off in your own case or not.

I cannot answer for you, but in my own case I do not regret the decision to use ZFS on ANY of my servers (with PMX). ZFS and PMX have saved me on many occasions. I also teach others to do the same, and all of them are happy with ZFS/PMX too.

Good luck / Bafta !
 
I personally moved unimportant but write-heavy guests (for example Graylog for logging and Zabbix for monitoring) to a single-SSD LVM-Thin, and all important guests run on a ZFS pool with redundancy. That way I was able to greatly reduce my writes, and if that LVM-Thin SSD dies I only lose some metrics and logs and can restore a 24h-old backup from my PBS. So if you have some guests where uptime and data integrity aren't that important, it's not a bad idea to move them to LVM-Thin, especially if the guests run some kind of DB doing sync writes.
 
This is totally insane. In my setup the write amplification is about 13(!!), which makes ZFS on SSDs unusable altogether. I am very disappointed that the Proxmox wiki page about ZFS, https://pve.proxmox.com/wiki/ZFS_on_Linux (which is actually very good!), DOES NOT MENTION THIS ANYWHERE. It only recommends: "If you use a dedicated cache and/or log disk, you should use an enterprise class SSD (e.g. Intel SSD DC S3700 Series)." In reality, according to the PVE ZFS Benchmark paper you quoted, you cannot use consumer SSDs with ZFS. There is no warning about this on that wiki page. o_O
 
No warning about this on that wiki page
Hi,

Nothing is perfect in this world! But any admin must take some time to READ and, most importantly, to test and understand any new setup.
I have been in many situations like yours in the past with new setups. But after some bad experiences I started to blame myself and not "some wiki or documentation"! And because of that I think I have become a better admin.



Good luck / Bafta!
 
It also depends heavily on your workload. There are a lot of people here who have been running their homeservers with consumer SSDs and ZFS for years without that much disk wear. What really kills SSDs is small sync writes. If you only have async writes and guests that don't write that much, you might be fine.

And like guletz already said, you can optimize some things:
I would, for example, disable atime on the pool, or at least enable relatime. That way not every read will cause an additional small write to update the metadata.
Then you should check whether you are using a volblocksize that fits your pool.
LZ4 compression is also always a good idea, as it lowers the amount of data that needs to be written.
And you don't want to write with a smaller block size to a storage with a bigger block size, or you get massive write amplification. Using a raidz1/2/3, for example, isn't that great, because these need a very big volblocksize or the padding overhead will be large, adding to your write amplification and lowering the usable pool capacity. On the other hand, a raidz1/2/3 has a better data-to-parity ratio, so there is less parity overhead and therefore less write amplification (in total, a 5-disk raidz1 will write less than a 6-disk striped mirror while offering 4 disks of usable capacity instead of 3: the striped mirror writes +100% redundant data and the raidz1 just +25% parity data).
If you have a lot of small writes in your guests, don't use ext4; use xfs instead (I measured a write amplification of factor 5 inside the guest just from sync writing 4k blocks to an ext4 partition).
Optimize your databases and let them use caching, so they write a few big blocks instead of many small blocks.
Let your guests write temporary files to a RAM disk, so stuff like logs and cached files doesn't need to be written to the SSD.
Don't run a CoW filesystem on top of another CoW filesystem. So ZFS on top of ZFS or qcow2 on top of ZFS is a bad idea.
Verify that your pool's ashift is at least 12, even if the SSD reports 512B sectors (SSDs are always lying about the sector size!).
Don't fill your ZFS pool more than 80%, or the pool will become slow and start fragmenting. The fuller the SSD is, the more your pool will wear, as the SLC cache will be smaller.
Strip away as many filesystem layers, storage layers and virtualization layers as possible.
Try to avoid ZFS native encryption, as this will double your write amplification (I still don't understand why, but my tests showed that it does... an explanation would be nice if someone knows it).
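
The parity comparison in the list above (+25% for a 5-disk raidz1 versus +100% for a striped mirror) can be reproduced with a few lines of Python. This is an idealized sketch with made-up helper names: it ignores raidz padding, metadata and volblocksize effects, which the same tip warns can be significant.

```python
def raidz_overhead(disks: int, parity: int = 1) -> float:
    """Extra data written per unit of payload for an ideal raidz stripe."""
    return parity / (disks - parity)


def mirror_overhead(copies: int = 2) -> float:
    """Extra data written per unit of payload for an n-way mirror."""
    return float(copies - 1)


print(f"5-disk raidz1:         +{raidz_overhead(5):.0%}, usable disks: {5 - 1}")
print(f"6-disk striped mirror: +{mirror_overhead(2):.0%}, usable disks: {6 // 2}")
```

Note that in this idealized model the 5-disk raidz1 also ends up with more usable capacity (4 disks' worth) than the 6-disk striped mirror (3 disks' worth).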
 
Thanks, Dunuin, for your post(s). I have also found your test results (https://forum.proxmox.com/threads/improve-write-amplification.75482/post-410191). I ran some benchmarks over the weekend and I see a write amplification of about 4. Most of the writes come from Zabbix (DB) and Nextcloud. I have also tried some tuning (volblocksize, mysqld parameters...), but it only changes things by about 10-20%. No significant improvement is possible. I still have my old spare SSDs, so I will use them to rework the storage from ZFS to LVM; running Zabbix on LVM shows no additional write amplification.
I just need to decide:
- Use LVM RAID1, or the classic way, managed by mdadm?
- Use thin-provisioned LVM? Proxmox does not support snapshots without it, but my experience with lvm-thin has only been bad.
 
Hi there!

I've just discovered the issue myself after two weeks of running my pilot PVE and TrueNAS (Debian-based) installs.

PVE: 64 GB RAM, 2x 128 GB SSD in a ZFS mirror (system + CT templates + ISOs only). VMs and CTs are stored on the NAS via a 10GbE network.
NAS: 32 GB RAM, 2x 128 GB SSD in a ZFS mirror (system only) + 4x HDD for data storage

Both systems use the same cheap, brand-new SSDs.
Both systems were installed using their respective installers with default ZFS settings.

I have 1 Ubuntu CT and 1 Windows VM to get familiar with PVE, so PVE basically sat idle all those weeks.
And I've uploaded around 1 TB of data to the NAS from a third PC.

And now some numbers:

PVE:

Code:
root@pve:~# smartctl -A /dev/nvme0
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.74-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        37 Celsius
Available Spare:                    100%
Available Spare Threshold:          32%
Percentage Used:                    0%
Data Units Read:                    30,774 [15.7 GB]
Data Units Written:                 239,730 [122 GB]
Host Read Commands:                 206,192
Host Write Commands:                5,155,750
Controller Busy Time:               0
Power Cycles:                       10
Power On Hours:                     15
Unsafe Shutdowns:                   3
Media and Data Integrity Errors:    0
Error Information Log Entries:      14
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

NAS:

Code:
root@truenas[~]# smartctl -A /dev/nvme0
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.142+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        29 Celsius
Available Spare:                    100%
Available Spare Threshold:          32%
Percentage Used:                    0%
Data Units Read:                    7,697 [3.94 GB]
Data Units Written:                 11,172 [5.72 GB]
Host Read Commands:                 130,334
Host Write Commands:                190,834
Controller Busy Time:               0
Power Cycles:                       6
Power On Hours:                     0
Unsafe Shutdowns:                   1
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Why is there such a HUGE difference for just the system partitions?
It looks like the issue is not with ZFS but with PVE itself.
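
To put the two smartctl outputs side by side, here is a small Python sketch that converts the NVMe "Data Units Written" counters to bytes and derives the average size per write command. It relies on the NVMe convention that one data unit is 1000 blocks of 512 bytes, which matches the GB values smartctl prints above; the dictionary layout is just for illustration.

```python
NVME_DATA_UNIT_BYTES = 512 * 1000  # NVMe reports data units in thousands of 512-byte blocks


def written_bytes(data_units: int) -> int:
    """Convert an NVMe 'Data Units Written' counter into bytes."""
    return data_units * NVME_DATA_UNIT_BYTES


hosts = {
    "PVE": {"data_units": 239_730, "write_cmds": 5_155_750},
    "NAS": {"data_units": 11_172, "write_cmds": 190_834},
}

for name, s in hosts.items():
    total = written_bytes(s["data_units"])
    print(f"{name}: {total / 1e9:.1f} GB written, "
          f"~{total / s['write_cmds'] / 1000:.0f} kB per write command")

ratio = hosts["PVE"]["data_units"] / hosts["NAS"]["data_units"]
print(f"PVE wrote ~{ratio:.0f}x as much as the NAS system disk")
```

Both systems average a few tens of kB per write command, so the roughly 21x gap comes from the number of writes rather than their size, which would be consistent with a service that writes small amounts very frequently.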

After playing with iotop, I have a candidate that writes to the system partition 24/7, every couple of seconds: pmxcfs (https://pve.proxmox.com/wiki/Proxmox_Cluster_File_System_(pmxcfs))!

My question is: "Can I stop it if my PVE node is not in a cluster?"
I want to stop it and compare the amount of writes before and after in order to draw a conclusion.