pxm0x

Hi,
I've been reading about others facing a similar issue, but I wanted to share mine and see if there is any solution to it. So far I haven't found a solution that I could understand or implement... Sorry if it's obvious, but please help!

I've been running Proxmox for a while now. In Aug 2022 I bought two Samsung 980 Pro 1 TB SSDs and configured them as a ZFS RAID 1 mirror.

Code:
root@pve-1:~# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool   928G   259G   669G        -         -    41%    27%  1.00x    ONLINE  -

I haven't changed any settings for it, leaving everything at the defaults from the initial installation. I don't run a DB on it, just a firewall (OPNsense on ZFS in a VM) and some smaller VMs and containers.

Now I have to admit I didn't really pay attention to the rapidly increasing wear until I got SMART errors today! But the wear level is astonishing and very annoying, because it seems to have killed both SSDs. I know these aren't enterprise SSDs, but they're also not cheap junk, and quite frankly even an enterprise SSD is going to be in trouble with over 700 TB written in 16 months!

It seems one disk is already dead, with 774 TB written in ~16 months!!!

Are there sensible config changes I can still make to extend the remaining lifetime? I don't want to replace the SSDs now, but if I have to, what would be a sensible alternative? I'm not prepared to throw another 200+ Euro out of the window!

Some further info below.
Note: It says 3,197 power-on hours, which would be around 133 days? That's not possible, as the machine is running 24/7...

[Attached: SMART data screenshots for both SSDs]
 
One more fun fact: I have a similar installation on another machine with a SATA SSD, which has been running for 16,500 hours with 12% wear according to its SMART values...
 
Interesting, and doing some simple maths: 774 TB in 16 months is approx 1.6 TB per day, nearly 70 GB per hour, roughly 1 GB per minute, nearly 20 MB/s… IF it is evenly distributed like that, you should be able to run some diagnostics and try to figure out where the write pressure comes from (check the I/O stats on the VMs and also the ZFS I/O stats on the Proxmox host itself), something like the sketch below.
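A rough sketch of what I mean (pool name taken from your zpool output above; exact flags and output will vary with your versions):

Code:
root@pve-1:~# zpool iostat -v rpool 60    # per-vdev read/write bandwidth, sampled every 60s
root@pve-1:~# iotop -ao                   # accumulated writes per process, sort by DISK WRITE

Let both run for a while; if the ~20 MB/s average is real it should show up clearly in zpool iostat, and iotop should point at the process (or VM) responsible.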

As an aside: because of its copy-on-write characteristics, certain configurations of ZFS can lead to very large write amplification. More info would be needed on your ZFS record size, your alignment shift setting (ashift, the log2 of the sector size, which should match your disks' sector size, typically 12 for 4K sectors), and the filesystem inside your VMs.

I also recall some previous discussions about the Proxmox cluster/HA daemons, which generate a lot of writes continuously, and unnecessarily so if you don't actually run a cluster. Don't know if that's been addressed in Proxmox yet or not.
 
I installed mine about one year ago on a single 980 PRO SSD, with 3 VMs and 3 LXCs, running 24/7. Since it's a single node, I disabled all the clustering and replication services:

Code:
# systemctl stop pve-ha-lrm           # HA local resource manager
# systemctl disable pve-ha-lrm

# systemctl stop pve-ha-crm           # HA cluster resource manager
# systemctl disable pve-ha-crm

# systemctl stop corosync.service     # cluster communication
# systemctl disable corosync.service

# systemctl stop pvesr.timer          # storage replication timer
# systemctl disable pvesr.timer

Not sure what the impact of that was, but wearout is only 2%.

[Attached: SMART wearout screenshot]
 
I can add another data point. I'm running a 2-node cluster with all the above-mentioned services enabled, with default settings: ~7 TB written in approx 2.5 years, running 24/7. That's ZFS on a single disk (an enterprise SSD). All VMs are on separate storage, so it gives some idea of roughly what to expect from Proxmox itself in an active cluster environment. OP must have a fair bit of load generated from inside the VMs and/or something quite strange going on within Proxmox itself.
 
ZFS record size, alignment shift setting (ashift, which should match your disks' sector size, typically 12 for 4K sectors)
How can I check for that?
Proxmox cluster/HA daemons that generate a lot of writes continuously
Is this confirmed? Because I have another box that does not seem to have the issue, while I believe those HA daemons are on by default and I haven't changed anything there.

I installed mine about one year ago on a single 980 PRO SSD drive
ZFS on single disk (enterprise SSD)

I'm starting to suspect that the issue may come from the RAID 1 ZFS configuration, which it seems none of you with low SSD wear have deployed? As mentioned above, I also have another box with a single SATA SSD, and there the SSD wear is much lower.
 
I've taken a snapshot with iotop which looks like this:
[Attached: iotop screenshot]

So it seems that my OPNsense installation does cause quite a lot of writes; however, I don't know why!

Also, the zvol threads are writing quite a lot.

Any suggestions on how to find out why that is happening?

Edit: I found a process that wrote a lot inside the OPNsense VM. After turning it off I still see high activity, so it hasn't really gone away completely.
[Attached: updated iotop screenshot]
 
You want ashift 12 (4096 byte sectors):
Code:
root@pve01:~# zdb -C | grep ashift
            ashift: 12

Recordsize:
Code:
root@pve01:~# zfs get recordsize rpool
NAME   PROPERTY    VALUE    SOURCE
rpool  recordsize  128K     default

Volblocksize:
Proxmox GUI: Datacenter -> Storage -> [your pool] -> Block Size. Should be something like 8-32k typically.

Your VM disks are probably zvols, in which case the volblocksize setting (sort of) defines the smallest unit written to disk: due to its copy-on-write semantics, ZFS rewrites entire blocks. This contributes to so-called write amplification. Consider a scenario where you are doing many small writes: the larger the block size, the more data has to be rewritten irrespective of how much of it actually changed. The trade-off is that a smaller block size increases fragmentation and metadata overhead and makes compression less effective (compression is applied per block). But yeah, depending on setup and write patterns, write amplification in ZFS can be very substantial: 10-20x, or in extreme scenarios significantly worse still (another order of magnitude).
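If you want to check from the CLI which volblocksize your existing VM disks actually got, something like this should work (the second dataset name is just an example; adjust it to your pool and VM ID):

Code:
root@pve01:~# zfs list -t volume -o name,volblocksize
root@pve01:~# zfs get volblocksize rpool/data/vm-100-disk-0

Note that volblocksize is fixed when a zvol is created, so changing the Block Size in the storage settings only affects newly created disks; existing ones would have to be moved/recreated to pick up a new value.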

Regarding the HA daemons: check my other post; mine have generated less than 7 TB in 2.5 years, so that's unlikely to be your culprit, given you've written 100 times that in less time...
 
I've taken a snapshot with iotop which looks like this:


So it seems that my OPNsense installation does cause quite a lot of writes; however, I don't know why!

Also, the zvol threads are writing quite a lot.

Any suggestions on how to find out why that is happening?

Edit: I found a process that wrote a lot inside the OPNsense VM. After turning it off I still see high activity, so it hasn't really gone away completely.
OK, good that you found the smoking gun. Just need to keep narrowing it down.
 
PS. I used to run OPNsense but don't any more. Check whether you're logging stats with NetFlow (heavy writes). You can also configure OPNsense to keep /var in RAM, which of course means log files won't survive a reboot, but it should reduce writes significantly, particularly if something is misbehaving and constantly spamming your logs.
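(If I remember right, the RAM disk options are under System > Settings > Miscellaneous in OPNsense, in the disk/memory settings section, but double-check on a current version since menus move around.)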
 
So it seems that my OPNsense installation does cause quite a bit of write - however I don't know why!
While OPNsense shouldn't write that much (make sure to select RAM disks in OPNsense wherever possible), you shouldn't run nested ZFS. ZFS has massive overhead, and when you stack it the write amplification factors multiply (e.g. 5x inside the guest on top of 5x on the host would be 25x overall).
 
For OPNsense, it's not just NetFlow. The default for firewall rule logging is to keep 7 days of logs and not to log the default rules. That's what I leave it at, with no problems. Enabling logging of the default rules should only be needed temporarily when diagnosing rules; otherwise it's too write-intensive for no gain.
 
You want ashift 12 (4096 byte sectors):
Code:
root@pve01:~# zdb -C | grep ashift
            ashift: 12

Code:
root@pve-1:~# zdb -C | grep ashift
            ashift: 12
root@pve-1:~# zfs get recordsize rpool
NAME   PROPERTY    VALUE    SOURCE
rpool  recordsize  128K     default

Block size is 8k.

So I think all the settings look OK. Thanks for the detail you provided!
 
Thanks to all who replied.
The culprit was ntopng, which I had running to analyze some traffic data. It was writing tons of data, which is somewhat logical, but I am amazed it was so bad.

Still, I'm planning to redo the Proxmox server with a new SSD rated for 2,200 TBW (vs 600 TBW for the Samsung 980 Pro), which should last longer even if I encounter excessive writes again. It's a pity I killed the good Samsung so quickly...

Would you recommend changing the filesystem for OPNsense from ZFS to UFS? I still plan to run Proxmox on ZFS, but this time not in a RAID 1 configuration. It is not mission critical, and I'd rather do some backups and deal with a failure if it happens than throw money at it yet again. :)
 
I suggest you use enterprise/data-center-grade SSDs, which are much better suited for Proxmox.

You can even get used ones like the Micron 5100/5200 (Pro) or Intel S4510 pretty cheap. They come with 5+ PBW.

Edit: VMs can run with simple ext4 as long as the underlying storage is based on ZFS (or Ceph).
 
Would you recommend changing the filesystem for OPNsense from ZFS to UFS?
Yes.
I still plan to run Proxmox on ZFS, but this time not in a RAID 1 configuration. It is not mission critical, and I'd rather do some backups and deal with a failure if it happens than throw money at it yet again.
I would still use a ZFS mirror. It's usually not worth the data loss, downtime, and additional work of setting everything up again from scratch when you lose the single system disk. Also keep in mind that a single-disk ZFS pool won't protect you from bit rot.
Just make sure you have proper monitoring (Zabbix and so on) so you won't be surprised again.
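Even without a full monitoring stack you can do a quick manual check of the NVMe wear counters with smartmontools, something like (device name may differ on your system):

Code:
root@pve-1:~# smartctl -a /dev/nvme0

Look at the "Percentage Used" and "Data Units Written" lines in the output; one data unit is 1000 x 512 bytes, and smartctl prints the converted total in brackets.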
 
As mentioned already, as awesome as ZFS is, those features come at a cost, including write amplification and hardware requirements if you're going to make it fly. And some of those features, for example checksumming and compression, are pointless if applied twice on top of each other, while at the same time the downsides (e.g. write amplification) compound multiplicatively… Nested ZFS is rarely a good idea unless you have some very specific need for it. UFS, XFS, or ext4 inside the VMs is generally a better idea…
 
Does compression on the ZFS filesystem lower SSD wear by minimizing the amount of data written to disk (i.e. it's compressed in RAM before being written)? Or am I wrong?
 
