Minimizing SSD wearout question?

I'm running a ZFS pool on 2 Samsung SSD 860 EVO (2TB) drives in a mirror (mirror-0).
I've seen that wearout has increased to 7% over the last year. I can't recall how much it was at the beginning of the year, and the disks were already used in another setup. Now I make sure disk usage is not very high (I'm at 25% usage). This is also where I store my VM OS disks and containers.
So my question is: based on the current zpool settings, is there anything else I can do to reduce disk wearout?

Code:
zfs get atime,sync,compression,xattr R1_1.6TB_SSD_EVO860
NAME                 PROPERTY     VALUE           SOURCE
R1_1.6TB_SSD_EVO860  atime        off             local
R1_1.6TB_SSD_EVO860  sync         disabled        local
R1_1.6TB_SSD_EVO860  compression  off             local
R1_1.6TB_SSD_EVO860  xattr        on              default

Trim runs via a monthly cron job (example at the end of this post).
I've heard that I could change the compression and xattr settings, but I guess this can't be changed live, right? Would that drastically impact CPU load?
I'm running a Ryzen 7 4750 with about 8 VMs of moderate CPU usage. Any other advice for minimizing wearout? Also, how is wearout calculated? Is it something read from the disk firmware, or just estimated by the OS (meaning the disks could have more wearout than 7%)?
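
For reference, the monthly trim is just a cron entry along these lines (file name and schedule are only an example); newer OpenZFS can also trim on the fly via the autotrim pool property:

Code:
# /etc/cron.d/zfstrim - trim the pool on the 1st of every month at 03:00
0 3 1 * * root /sbin/zpool trim R1_1.6TB_SSD_EVO860
# alternative to cron: let ZFS trim continuously
# zpool set autotrim=on R1_1.6TB_SSD_EVO860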
 
While disabling sync writes decreases SSD wear, I highly recommend you set it back to default. Disabling it might cause data loss, and in the worst case you will lose your entire pool (for example on the next power outage or kernel crash). A similar problem exists with increasing the transaction group interval: it will reduce SSD wear, but you will lose more data when your PVE crashes.

Next time, better buy enterprise SSDs, as is recommended everywhere; then wear and sync writes are a much smaller problem.
Enterprise SSDs are actually cheaper than consumer SSDs if you look at the price per TB of TBW (write endurance) instead of the price per TB of capacity.
For example, you might pay triple the price for an enterprise SSD, but it is rated for 9 times the writes, so per TB written it costs only a third as much.

There isn't really much you can do to prevent SSD wear without sacrificing data integrity.
Exceptions are enabling relatime or disabling atime, not using encryption, and disabling some heavy-writing PVE services (like the clustering ones, in case you only run a single node).
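
A rough example (pool name taken from the first post; only stop the HA services if this really is a standalone node that doesn't use HA):

Code:
# either drop access-time updates completely...
zfs set atime=off R1_1.6TB_SSD_EVO860
# ...or keep atime but only update it occasionally
zfs set atime=on R1_1.6TB_SSD_EVO860
zfs set relatime=on R1_1.6TB_SSD_EVO860
# on a single node without HA, these services mostly just generate writes
systemctl disable --now pve-ha-lrm pve-ha-crm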

Also, how is wearout calculated? Is it something read from the disk firmware, or just estimated by the OS (meaning the disks could have more wearout than 7%)?
Yes, PVE shows what the SMART monitoring of the SSD's firmware reports.
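
You can also read those counters directly with smartctl; on an 860 EVO the interesting attributes are Wear_Leveling_Count and Total_LBAs_Written (the device name below is just an example):

Code:
smartctl -A /dev/sda | grep -Ei 'wear_leveling|total_lbas'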
 
Samsung EVO with ZFS? You are lucky that they are not dead yet.

Consumer SSDs don't have supercapacitors, and ZFS does sync writes for its journal (the ZIL), so each journal write (even a small 4k write) rewrites a full flash cell (you can see write amplification of over 1000x).
That's why you need enterprise SSDs (or don't use ZFS, use classic LVM).
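
If you want to see how much is actually being written to the disks (and how small those writes are), zpool iostat gives a live view (pool name taken from the first post):

Code:
# per-vdev read/write ops and bandwidth, refreshed every 5 seconds
zpool iostat -v R1_1.6TB_SSD_EVO860 5
# request size histograms, useful to spot lots of tiny sync writes
zpool iostat -r R1_1.6TB_SSD_EVO860 5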
 
I have some tricks I use, but it's fair to say they are not supported. I only use them on my local at-home Proxmox rig.

1 - Symlink the pveproxy log to /dev/null, which stops the pveproxy log from being updated on disk every time the graph moves in the UI. How big a deal this is depends on how often you have the UI open.
2 - I emulate enterprise SSD behaviour by telling the SSD not to honour cache flushes for sync writes via a ZFS tunable. Note this is not the same as disabling sync on the ZFS dataset and isn't as risky: ZFS still treats them as sync requests, so they don't go to the dirty cache and are sent immediately to the SSD, but the drive is allowed to cache internally, which significantly decreases write amplification. The writes also complete quicker, which reduces the risk window for power loss/kernel panics. I do have power loss protection.

If you're interested I will post how, but with a disclaimer.

Even without the tricks, for light use EVOs won't just automatically die on ZFS; it would only be a real problem if you stick them in a busy production database server or something.

My 250 GB 860 EVOs have a Wear_Leveling_Count of 22 (2% of rated cycles) after 10,450 power-on hours (435 days). Every day of their life has been on ZFS under Proxmox.
 
Thanks for that, yes, that would be great to see how you do that :)
 
Thanks for that. Yes, I agree with you; when I have to buy disks I'll get enterprise SSDs... As mentioned, I already had these disks, which is why I've used them for the last year.
I've re-enabled sync (zfs set sync=standard R1_1.6TB_SSD_EVO860). Should I do anything else? I can't figure out how to check whether the disks are synced or not...
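
For reference, this is how I checked that the property is back to standard on the pool and all its child datasets:

Code:
zfs get -r sync R1_1.6TB_SSD_EVO860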
 
OK, so the disclaimer is that this is not a supported configuration of ZFS or Proxmox, and as such you might hit bugs or other issues that otherwise wouldn't normally occur. Be wary if you're doing this on something used in production.

To adjust the write cache behaviour on the SSDs for sync requests:

Create/edit the file /etc/modprobe.d/zfs.conf and add the following lines:

Code:
# disable disk flushes on ZIL writes; writes are still sync through ZFS, doesn't affect txg writes
options zfs zil_nocacheflush=1


Then reboot.

After booting you can verify it is active with the following command; it should return 1:

Code:
cat /sys/module/zfs/parameters/zil_nocacheflush

To disable pveproxy web request logging, simply do the following. This removes the ability to check who is accessing the GUI, so there are potential security repercussions. You can mitigate this by putting an nginx proxy in front of pveproxy and making nginx log in an SSD-friendly way using write buffers (sketch after the commands below).

Code:
cd /var/log/pveproxy
mv access.log access.log.old
ln -s /dev/null access.log
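
If you go the nginx route instead, the SSD-friendly part is just buffered access logging; inside the server block that proxies to pveproxy it would look roughly like this (the proxy configuration itself is not shown):

Code:
# keep access-log writes in a 64k in-memory buffer, flush at most every 5 minutes
access_log /var/log/nginx/pveproxy-access.log combined buffer=64k flush=5m;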


Finally, if you go the symlink route, edit /etc/logrotate.d/pve to disable log rotation for pveproxy, so the symlink doesn't get overwritten.
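
The entry to disable is the one matching /var/log/pveproxy/*.log; the options below are just typical logrotate settings, not necessarily the exact ones on your system. Commenting the whole stanza out is enough:

Code:
#/var/log/pveproxy/*.log {
#        daily
#        rotate 7
#        missingok
#        notifempty
#        compress
#        create 640 www-data www-data
#}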
 
So just having the GUI open logs everything to disk? And can you tell me what commands you use for that?

Also, I guess all the sensor info goes into a file, but is it actually written to disk, or is it just a symlink to the current state in memory? If it's a constant write, that's pretty bad, but probably an easy fix too.
 
Ironically, I managed to get a DC P4600 that was being sold at a crazy low price. So now I will have a genuine enterprise SSD at home.
 
In my case, the access log is relatively uninteresting. It doesn't protect against a determined targeted attack (very little helps in those cases), and as far as a "casual" attack is concerned, that isn't really a threat-scenario that matters much in my case. Having said that, conceivably there could occasionally be some use in having this information available. So, I am not quite ready to link to /dev/null.

The solution is easy: instead of disabling the log entirely, I simply put it on a tmpfs filesystem. That gives me access to the most recent log information unless the server has been rebooted. This is good enough for what I need. Of course, others could very well decide that their use case is very different.
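
For what it's worth, a minimal sketch of that mount in /etc/fstab, assuming the standard Debian www-data uid/gid of 33 so pveproxy can still write its log (the size is arbitrary):

Code:
tmpfs  /var/log/pveproxy  tmpfs  size=32m,uid=33,gid=33,mode=0755  0  0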

A similar configuration can be done for the systemd logging daemon by editing /etc/systemd/journald.conf. But I believe there are diminishing returns here. On my system, there aren't enough entries in the system logs to make it worthwhile to log to RAM instead of disk. Depending on what services you have installed, you might think differently. And of course, this doesn't just apply to the host but also to each of the guests, both in VMs and in containers.
 
I removed syslogd, as PVE 8/Debian 12 switched to journald-only. I then set journald to only log to RAM and limited it to 20MB, so logs are lost on a reboot. But I use filebeat to send all logs to a centralized log server (in my case Graylog, and later probably Wazuh), so there is still a persistent copy of all logs. From a security perspective this is even better, as an attacker can't destroy evidence by deleting logs. It also makes it much easier to analyse/monitor your logs when they are stored in a proper log DB with powerful search functions and automated log analysis and alerts (for example, I find it useful to get alerted whenever any host/guest writes an "oom" log entry).
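
The journald side of that is just a couple of lines in /etc/systemd/journald.conf, followed by a restart of the service:

Code:
# /etc/systemd/journald.conf
[Journal]
Storage=volatile
RuntimeMaxUse=20M

Apply it with systemctl restart systemd-journald.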
 
I tried to add "options zfs zil_nocacheflush=1" on a new host I just built, but the option doesn't seem to be enabled after a reboot. Any idea?
The whole system is on ZFS (3 x 4TB SSDs).

Code:
# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=10096738304
options zfs zil_nocacheflush=1

reboot

Code:
# cat /sys/module/zfs/parameters/zil_nocacheflush
0

Any idea?
 
Yeah, apologies, I forgot to add that command in the post, and now I cannot edit it anymore either.
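
For anyone else hitting this: the step that is most likely missing (it isn't confirmed here) is regenerating the initramfs. With the whole system on ZFS, the zfs module is loaded from the initramfs, so changes in /etc/modprobe.d only take effect after rebuilding it:

Code:
update-initramfs -u -k all
# and, if the host boots via proxmox-boot-tool:
proxmox-boot-tool refresh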
 
