pxm0x

Hi,
I've been reading about others facing a similar issue, but I wanted to share mine and see if there is any solution to it. So far I haven't found a solution that I could understand or implement... Sorry if it's obvious, but please help!

I've been running Proxmox for a while now. In Aug 2022 I bought two Samsung 980 Pro 1 TB SSDs and configured them as a ZFS RAID 1 mirror.

Code:
root@pve-1:~# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool   928G   259G   669G        -         -    41%    27%  1.00x    ONLINE  -

I haven't changed any settings for it, leaving everything at the defaults from the initial installation. I don't run a DB on it, just a firewall (OPNsense on ZFS in a VM) and some smaller VMs and containers.

Now I have to admit I didn't really pay attention to the rapidly increasing wear until I got SMART errors today! But the wear level is astonishing and very annoying, because it seems to have killed both SSDs. I know these aren't enterprise SSDs, but they're also not cheap junk, and quite frankly even an enterprise SSD is going to be in trouble with over 700 TB written in 16 months!

It seems one disk is already dead, with 774 TB written in ~16 months!!!

Are there sensible config changes I can still make to extend the remaining lifetime? I don't want to replace the SSDs now, but if I have to, what would be a sensible alternative? I'm not prepared to throw another 200+ Euro out of the window!

Some further info below.
Note: It says 3,197 power-on hours, which would be around 133 days? That's not possible, as the machine is running 24/7...

[Attached: SMART data screenshots for both SSDs]
 
One more fun fact: I have a similar installation on another machine with a SATA SSD, which has been running for 16,500 hours with 12% wear according to its SMART values...
 
Interesting, and doing some simple maths: 774 TB in 16 months is approx 1.6 TB per day, nearly 70 GB per hour, roughly 1 GB per minute, nearly 20 MB/s… IF it is evenly distributed like that, you should be able to run some diagnostics and try to figure out where the write pressure comes from (check the I/O stats on the VMs and also the ZFS I/O stats on the Proxmox host itself), something like the sketch below.
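A rough sketch of what I mean (pool name taken from your zpool output above; exact flags and output will vary with your versions):

Code:
root@pve-1:~# zpool iostat -v rpool 60    # per-vdev read/write bandwidth, sampled every 60s
root@pve-1:~# iotop -ao                   # accumulated writes per process, sort by DISK WRITE

Let both run for a while; if the ~20 MB/s average is real it should show up clearly in zpool iostat, and iotop should point at the process (or VM) responsible.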

As an aside: because of its copy-on-write characteristics, certain configurations of ZFS can lead to very large write amplification. More info would be needed on your ZFS record size, your alignment shift setting (ashift, the log2 of the sector size, which should match your disks' sector size, typically 12 for 4K sectors), and the filesystem inside your VMs.

I also recall some previous discussions about the Proxmox cluster/HA daemons, which generate a lot of writes continuously, and unnecessarily so if you don't actually run a cluster. Don't know if that's been addressed in Proxmox yet or not.
 
I installed mine about one year ago on a single 980 PRO SSD, with 3 VMs and 3 LXCs, running 24/7. Since it's a single node, I disabled all the clustering and replication services:

Code:
# systemctl stop pve-ha-lrm           # HA local resource manager
# systemctl disable pve-ha-lrm

# systemctl stop pve-ha-crm           # HA cluster resource manager
# systemctl disable pve-ha-crm

# systemctl stop corosync.service     # cluster communication
# systemctl disable corosync.service

# systemctl stop pvesr.timer          # storage replication timer
# systemctl disable pvesr.timer

Not sure what the impact of that was, but wearout is only 2%.

[Attached: SMART wearout screenshot]
 
I can add another data point. I'm running a 2-node cluster with all the above-mentioned services enabled, with default settings: ~7 TB written in approx 2.5 years, running 24/7. That's ZFS on a single disk (an enterprise SSD). All VMs are on separate storage, so it gives some idea of roughly what to expect from Proxmox itself in an active cluster environment. OP must have a fair bit of load generated from inside the VMs and/or something quite strange going on within Proxmox itself.
 
ZFS record size, alignment shift setting (ashift, which should match your disks' sector size, typically 12 for 4K sectors)
How can I check for that?
Proxmox cluster/HA daemons that generate a lot of writes continuously
Is this confirmed? Because I have another box that does not seem to have the issue, while I believe those HA daemons are on by default and I haven't changed anything there.

I installed mine about one year ago on a single 980 PRO SSD drive
ZFS on single disk (enterprise SSD)

I'm starting to suspect that the issue may come from the RAID 1 ZFS configuration, which it seems none of you with low SSD wear have deployed? As mentioned above, I also have another box with a single SATA SSD, and there the SSD wear is much lower.
 
I've taken a snapshot with iotop which looks like this:
[Attached: iotop screenshot]

So it seems that my OPNsense installation does cause quite a lot of writes; however, I don't know why!

Also, the zvol threads are writing quite a lot.

Any suggestions on how to find out why that is happening?

Edit: I found a process that wrote a lot inside the OPNsense VM. After turning it off I still see high activity, so it hasn't really gone away completely.
[Attached: updated iotop screenshot]
 
You want ashift 12 (4096 byte sectors):
Code:
root@pve01:~# zdb -C | grep ashift
            ashift: 12

Recordsize:
Code:
root@pve01:~# zfs get recordsize rpool
NAME   PROPERTY    VALUE    SOURCE
rpool  recordsize  128K     default

Volblocksize:
Proxmox GUI: Datacenter -> Storage -> [your pool] -> Block Size. Should be something like 8-32k typically.

Your VM disks are probably zvols, in which case the volblocksize setting (sort of) defines the smallest unit written to disk: due to its copy-on-write semantics, ZFS rewrites entire blocks. This contributes to so-called write amplification. Consider a scenario where you are doing many small writes: the larger the block size, the more data has to be rewritten irrespective of how much of it actually changed. The trade-off is that a smaller block size increases fragmentation and metadata overhead and makes compression less effective (compression is applied per block). But yeah, depending on setup and write patterns, write amplification in ZFS can be very substantial: 10-20x, or in extreme scenarios significantly worse still (another order of magnitude).
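If you want to check from the CLI which volblocksize your existing VM disks actually got, something like this should work (the second dataset name is just an example; adjust it to your pool and VM ID):

Code:
root@pve01:~# zfs list -t volume -o name,volblocksize
root@pve01:~# zfs get volblocksize rpool/data/vm-100-disk-0

Note that volblocksize is fixed when a zvol is created, so changing the Block Size in the storage settings only affects newly created disks; existing ones would have to be moved/recreated to pick up a new value.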

Regarding the HA daemons: check my other post; mine have generated less than 7 TB in 2.5 years, so that's unlikely to be your culprit, given you've written 100 times that in less time...
 
I've taken a snapshot with iotop which looks like this:


So it seems that my OPNsense installation does cause quite a lot of writes; however, I don't know why!

Also, the zvol threads are writing quite a lot.

Any suggestions on how to find out why that is happening?

Edit: I found a process that wrote a lot inside the OPNsense VM. After turning it off I still see high activity, so it hasn't really gone away completely.
OK, good that you found the smoking gun. Just need to keep narrowing it down.
 
PS. I used to run OPNsense but don't any more. Check whether you're logging stats with NetFlow (heavy writes). You can also configure OPNsense to keep /var in RAM, which of course means log files won't survive a reboot, but it should reduce writes significantly, particularly if something is misbehaving and constantly spamming your logs.
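(If I remember right, the RAM disk options are under System > Settings > Miscellaneous in OPNsense, in the disk/memory settings section, but double-check on a current version since menus move around.)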
 
So it seems that my OPNsense installation does cause quite a bit of write - however I don't know why!
While OPNsense shouldn't write that much (make sure to select RAM disks in OPNsense wherever possible), you shouldn't run nested ZFS. ZFS has massive overhead, and when you stack it the write amplification factors multiply (e.g. 5x inside the guest on top of 5x on the host would be 25x overall).
 
For OPNsense, it's not just NetFlow. The default for firewall rule logging is to keep 7 days of logs and not to log the default rules. That's what I leave it at, with no problems. Enabling logging of the default rules should only be needed temporarily when diagnosing rules; otherwise it's too write-intensive for no gain.
 
You want ashift 12 (4096 byte sectors):
Code:
root@pve01:~# zdb -C | grep ashift
            ashift: 12

Code:
root@pve-1:~# zdb -C | grep ashift
            ashift: 12
root@pve-1:~# zfs get recordsize rpool
NAME   PROPERTY    VALUE    SOURCE
rpool  recordsize  128K     default

Block size is 8k.

So I think all the settings look OK. Thanks for the detail you provided!
 
Thanks to all who replied.
The culprit was ntopng, which I had running to analyze some traffic data. It was writing tons of data, which is somewhat logical, but I am amazed it was so bad.

Still, I'm planning to redo the Proxmox server with a new SSD rated for 2,200 TBW (vs 600 TBW for the Samsung 980 Pro), which should last longer even if I encounter excessive writes again. It's a pity I killed the good Samsung so quickly...

Would you recommend changing the filesystem for OPNsense from ZFS to UFS? I still plan to run Proxmox on ZFS, but this time not in a RAID 1 configuration. It is not mission critical, and I'd rather do some backups and deal with a failure if it happens than throw money at it yet again. :)
 
I suggest you use enterprise/data-center-grade SSDs, which are much better suited for Proxmox.

You can even get used ones like the Micron 5100/5200 (Pro) or Intel S4510 pretty cheap. They come with 5+ PBW.

Edit: VMs can run with simple ext4 as long as the underlying storage is based on ZFS (or Ceph).
 
Would you recommend changing the filesystem for OPNsense from ZFS to UFS?
Yes.
I still plan to run Proxmox on ZFS, but this time not in a RAID 1 configuration. It is not mission critical, and I'd rather do some backups and deal with a failure if it happens than throw money at it yet again.
I would still use a ZFS mirror. It's usually not worth the data loss, downtime, and additional work of setting everything up again from scratch when you lose the single system disk. Also keep in mind that a single-disk ZFS pool won't protect you from bit rot.
Just make sure you have proper monitoring (Zabbix and so on) so you won't be surprised again.
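Even without a full monitoring stack you can do a quick manual check of the NVMe wear counters with smartmontools, something like (device name may differ on your system):

Code:
root@pve-1:~# smartctl -a /dev/nvme0

Look at the "Percentage Used" and "Data Units Written" lines in the output; one data unit is 1000 x 512 bytes, and smartctl prints the converted total in brackets.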
 
As mentioned already, as awesome as ZFS is, those features come at a cost, including write amplification and hardware requirements if you're going to make it fly. And some of those features, for example checksumming and compression, are pointless if applied twice on top of each other, while at the same time the downsides (e.g. write amplification) compound multiplicatively… Nested ZFS is rarely a good idea unless you have some very specific need for it. UFS, XFS, or ext4 inside the VMs is generally a better idea…
 
Does compression on the ZFS filesystem lower SSD wear by minimizing the amount of data written to disk (i.e. it's compressed in RAM before being written)? Or am I wrong?
 
