NVMe SSD extremely high wearout

xlemassacre

Hi,

I noticed by chance today that my NVMe SSDs have extremely high wearout. The two disks have been in use for a year and are already at 36%.
I have already searched the forum for similar problems, but couldn't find anything specific that could help me.

Can someone help me get to the bottom of this and slow down the wearout?

[attached screenshots]

Code:
root@proxmox:~# zpool get all
NAME   PROPERTY                       VALUE                          SOURCE
rpool  size                           952G                           -
rpool  capacity                       48%                            -
rpool  altroot                        -                              default
rpool  health                         ONLINE                         -
rpool  guid                           5390239608252388090            -
rpool  version                        -                              default
rpool  bootfs                         rpool/ROOT/pve-1               local
rpool  delegation                     on                             default
rpool  autoreplace                    off                            default
rpool  cachefile                      -                              default
rpool  failmode                       wait                           default
rpool  listsnapshots                  off                            default
rpool  autoexpand                     off                            default
rpool  dedupratio                     1.00x                          -
rpool  free                           494G                           -
rpool  allocated                      458G                           -
rpool  readonly                       off                            -
rpool  ashift                         12                             local
rpool  comment                        -                              default
rpool  expandsize                     -                              -
rpool  freeing                        0                              -
rpool  fragmentation                  43%                            -
rpool  leaked                         0                              -
rpool  multihost                      off                            default
rpool  checkpoint                     -                              -
rpool  load_guid                      14967033024946245314           -
rpool  autotrim                       off                            default
rpool  compatibility                  off                            default
rpool  bcloneused                     0                              -
rpool  bclonesaved                    0                              -
rpool  bcloneratio                    1.00x                          -
rpool  feature@async_destroy          enabled                        local
rpool  feature@empty_bpobj            active                         local
rpool  feature@lz4_compress           active                         local
rpool  feature@multi_vdev_crash_dump  enabled                        local
rpool  feature@spacemap_histogram     active                         local
rpool  feature@enabled_txg            active                         local
rpool  feature@hole_birth             active                         local
rpool  feature@extensible_dataset     active                         local
rpool  feature@embedded_data          active                         local
rpool  feature@bookmarks              enabled                        local
rpool  feature@filesystem_limits      enabled                        local
rpool  feature@large_blocks           enabled                        local
rpool  feature@large_dnode            enabled                        local
rpool  feature@sha512                 enabled                        local
rpool  feature@skein                  enabled                        local
rpool  feature@edonr                  enabled                        local
rpool  feature@userobj_accounting     active                         local
rpool  feature@encryption             enabled                        local
rpool  feature@project_quota          active                         local
rpool  feature@device_removal         enabled                        local
rpool  feature@obsolete_counts        enabled                        local
rpool  feature@zpool_checkpoint       enabled                        local
rpool  feature@spacemap_v2            active                         local
rpool  feature@allocation_classes     enabled                        local
rpool  feature@resilver_defer         enabled                        local
rpool  feature@bookmark_v2            enabled                        local
rpool  feature@redaction_bookmarks    enabled                        local
rpool  feature@redacted_datasets      enabled                        local
rpool  feature@bookmark_written       enabled                        local
rpool  feature@log_spacemap           active                         local
rpool  feature@livelist               enabled                        local
rpool  feature@device_rebuild         enabled                        local
rpool  feature@zstd_compress          enabled                        local
rpool  feature@draid                  enabled                        local
rpool  feature@zilsaxattr             disabled                       local
rpool  feature@head_errlog            disabled                       local
rpool  feature@blake3                 disabled                       local
rpool  feature@block_cloning          disabled                       local
rpool  feature@vdev_zaps_v2           disabled                       local
 
1. Use enterprise-grade HW - yes, that Kingston KC3000 says "High-Performance", but it also says "for Desktop and Laptop PCs".
2. Definitely don't use ZFS/mirror on the above.

I believe your 1TB drive has a TBW rating of 800TB. Over the 5-year warranty that works out to approx. 0.44TB written per day, i.e. a DWPD of about 0.44. Nowhere near good enough.
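
To spell the math out (assuming that 800TB TBW figure and the 5-year warranty):

Code:
# Back-of-the-envelope DWPD for a 1TB drive rated at 800 TBW over a 5-year warranty
# DWPD = TBW / (warranty in days * capacity in TB)
echo "scale=2; 800 / (5 * 365 * 1)" | bc
# -> .43, i.e. roughly 0.44 full drive writes per day

For comparison, enterprise drives are usually rated somewhere between 1 and 3 DWPD.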
 
What exactly do you mean by "definitely don't use ZFS/mirror on the above"?
Is it because it's not enterprise-grade HW, or is it misconfigured?

Regarding the high amount of data units written: I guess the 57.7TB of data units read could be correct due to backups, but none of my workloads explains that high amount of writes.

Something is causing way too many write commands and I don't know where they are coming from.
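
To narrow it down, I plan to watch the writes with the usual stock tools (device names below are just examples, adjust to your system):

Code:
# Pool-level write rate per vdev, refreshed every 5 seconds
zpool iostat -v rpool 5

# Accumulated I/O per process (package "iotop"; -a = accumulate, -o = only active)
iotop -a -o

# Raw SMART counters on the NVMe drives (data_units_written only ever grows)
nvme smart-log /dev/nvme0
smartctl -a /dev/nvme1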
 
I have improved my monitoring and have now gotten a bit closer to the issue:

Between 23:30 and 03:00 there is more than 1TB of data written, even though my VMs show a constant I/O write rate.
[attached graphs]

There was no backup running at that time, so I'm pretty sure now that something is happening in Proxmox during that window that causes this amount of data to be written.

I also suspect that this has been happening since I updated to Proxmox 8 (currently 8.1.5). Since the update, I have also had occasional VM failures at this time of day.
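
Next I'll check what is actually scheduled in that window; nothing Proxmox-specific assumed here, just the standard places:

Code:
# systemd timers with their last/next trigger times
systemctl list-timers --all

# classic cron jobs
crontab -l
cat /etc/crontab /etc/cron.d/*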
 
I also saw that there is a lot of CPU activity on the Proxmox host during that time, but (except for two peaks) a constant CPU load on the VMs:
[attached graph]
 
Welcome to the wacky world of SSDs - as already stated, this is because of the use of non-enterprise SSDs, which btw I have done as well, but I went into that decision with my eyes open. It's not the VMs that cause this, it's the ZFS filesystem itself, which in the background does many more accesses to the drives on top of the VMs' ones, which is why they don't show in your graphs.

Hopefully by the time mine are worn out the enterprise equivalents will have come down in price and I will replace them with those. In my case I chose Samsung QVO drives, which from what I read are a little better for wear than their normal desktop SSDs. After about 9-10 months mine are at around 12-13%.

You should also read up on Steve Gibson's research on SSD wear vs. performance: during his recent work on SpinRite it was discovered that all SSDs, enterprise or not, will eventually develop sections of the disk that slow down considerably - I had one where many parts were 10x slower than other areas of the SSD. Using SpinRite you can run a check/rewrite across the entire disk that brings the speed back, but at the cost of some extra wear. Future versions will try to limit the rewrites to just the affected sections.
 
My eyes weren't that wide open when I bought the SSDs, but I guess this is an eye opener. I won't be replacing them now, but next time I'll probably have to go for enterprise SSDs.

I think, or at least I hope, that I have now found the reason for the high write volume at night. I was running a scheduled script that writes suspicious (public) IP addresses to the firewall via the Proxmox REST API. Since this can only be done per IP address and not in bulk, the API was called several thousand times, and presumably it saved the complete configuration state to the file system with each call. I have now disabled the script and will see if the writes stop the next night.
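
To check this, I'll log the NVMe data_units_written counter every hour overnight and compare the deltas. A rough sketch (the device path /dev/nvme0 and the script location are just examples for my setup):

Code:
#!/bin/sh
# e.g. /etc/cron.hourly/log-nvme-writes (hypothetical path; make it executable)
# Appends a timestamped data_units_written reading (1 unit = 512,000 bytes)
echo "$(date -Is) $(nvme smart-log /dev/nvme0 | grep data_units_written)" \
    >> /var/log/nvme-writes.log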
 
If you're in a real datacenter setup, you're into enterprise HW anyway (if not, change your job!).

In my personal home, non-datacenter opinion/experience: if you're not into REAL enterprise SSDs (expensive - and if it's not expensive, it either isn't real or it's a second-hand one in, at best, an unknown condition), then avoid any sort of RAID config, and definitely no ZFS. Yes, there's a price for this - you don't have any sort of auto-mirroring disk - but you can improvise in a variety of ways, as I do.

If we're into the numbers game, some stats from the last 6 months of one of my home PVEs, no RAID, no ZFS, etc., running 24/7:

A nothing-special Kingston 512GB NVMe (like this) - PVE boot drive, local & LVM-thin storage - 1 VM @24/7
A below-special Silicon Power 2TB SATA SSD (like this) as directory-type storage - 4 LXCs @10/7, 3 VMs @6/7, all VZDump backup files, ISO images, etc.

I run an average of 5 VZDumps a week (approx. 5-12GB each, to the 2TB drive above).

Both the above drives show 0% wear out.

In my experience over the years, the limited-to-no wear is also down to the percentage of disk space used: I try to never exceed 50% usage. This gives the best economical outcome, so just purchase double the size you actually need - you'll get more than double the lifespan out of it!
 
No, this is just a single-host setup for personal use.

Before Proxmox I had an Ubuntu setup with software RAID. When changing to Proxmox (with new hardware) a year ago, ZFS seemed like the best option for my case. I'll try to monitor things over the next weeks to get better numbers.

But 0% in 6 months is probably unrealistic for my workloads even without ZFS, since I'm running lots of applications and also some databases on a Kubernetes cluster in the 2 VMs.
 
Samsung QVO drives which from what I read are a little better for wear than their normal desktop SSDs
No, QLC NAND like in those QVOs is the worst you could buy.
SLC is better than eMLC, which is better than MLC, which is better than TLC, and QLC is the worst.
The power-loss protection and the bigger DRAM cache of enterprise SSDs greatly increase performance and reduce wear whenever you are doing sync writes.
And the better TBW/DWPD rating will help to counter the write amplification too.

Proper Samsung disks would be PM983 or PM9A3 and not any PROs, EVOs or QVOs.
 
What did your wear out look like then? Do also remember that PVE writes a lot of logging.
No, I don't know the numbers there.
But I still hope that the script I mentioned in my post (#8) was causing the really high wearout.

I checked the wear out last year from time to time and it was looking like ~10% per year, which was quite ok in my opinion.
But then I didn't look at it for a couple of months. Now I've seen this big leap...
 
The problem is usually write amplification. Here I see an average factor of 20, so writing 1TB of data inside a VM causes 20TB of combined writes to the SSDs. The SSDs will therefore die 20 times faster because of the compounding overhead / write amplification.
 
I have read about write amplification in other threads, but haven't found out whether it is a general drawback of ZFS or whether it can be reduced/optimised in some way.
 
Everything adds some overhead: each filesystem, volume management, the hardware itself, virtualization, mixed block sizes and so on. And these don't add up, they multiply. Let's say the overhead of each step is 2x, 5x, 3x, 4x, 2x. Then your write amplification isn't 2+5+3+4+2=16x but 2*5*3*4*2=240x. So the SSD would die 240 times faster and perform 240 times slower, not just 16 times. You can see how this can quickly get out of hand, which is why you want to keep overhead as low as possible and avoid things like nested filesystems and mixed block sizes. And ZFS is known for its massive overhead.

Especially when doing small sync writes, it really helps to keep the internal write amplification of the SSDs low by using an enterprise SSD that can cache these writes in DRAM thanks to its power-loss protection. Consumer SSDs are missing this PLP, so they can't make use of such caching and have to do the writes directly to the NAND without being able to optimize them in the DRAM cache first for less wear.
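
If you want to put a rough number on your own write amplification, compare what a guest thinks it wrote against what the SSDs actually recorded over the same period. A sketch (vda inside the VM and /dev/nvme0 on the host are just example device names):

Code:
# Inside a VM: total data written to its virtual disk so far
# (/proc/diskstats column 10 = sectors written, 512 bytes each)
awk '$3 == "vda" {printf "%.1f GB written\n", $10 * 512 / 1e9}' /proc/diskstats

# On the PVE host: lifetime NVMe writes (1 data unit = 512,000 bytes)
nvme smart-log /dev/nvme0 | grep data_units_written

Take both readings at the start and the end of a day and divide the host-side delta by the sum of the guest-side deltas; that ratio is roughly the overall write amplification.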
 
OK, then I guess my configuration is a bit messed up. In my VMs I have an ext4 filesystem with 4K block size, and Proxmox shows the following:

Code:
root@proxmox:~# zfs get volblocksize
NAME                          PROPERTY      VALUE     SOURCE
rpool                         volblocksize  -         -
rpool/ROOT                    volblocksize  -         -
rpool/ROOT/pve-1              volblocksize  -         -
rpool/data                    volblocksize  -         -
rpool/data/vm-100-disk-0      volblocksize  8K        -
rpool/data/vm-102-disk-0      volblocksize  16K       default
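
From what I've read, the volblocksize is fixed when a zvol is created, so an existing disk only gets a new value by being recreated (e.g. moved to another storage and back). If I understand the storage docs correctly, the zfspool storage type has a blocksize option that applies to newly created disks; local-zfs below is just the default storage name, not necessarily mine:

Code:
# Set the default block size for new zvols on this storage
# (only affects disks that are created or moved afterwards)
pvesm set local-zfs --blocksize 16k

# Verify on a freshly created/moved disk (dataset name is just an example)
zfs get volblocksize rpool/data/vm-100-disk-1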
 
Just for reference, this is the wearout on my Intel SATA enterprise SSDs after 4800 hours of use (about half your time):
[attached screenshot]

As you can see, the wearout is at 0%.
Granted, I'm only running a homelab with around 10 VMs and 9 containers, but it should give you an idea.

Compared to the 800TBW of your disks, these drives have a write endurance of over 10PB (so 10,000 TBW).
Combined with less wear thanks to caching and optimizing the writes before they hit the flash, that gives them just that much more lifespan.
 
Those are pretty good numbers :cool:

I've been trying to find out which SSDs would be suitable for my server, but it's not easy to get an overview of the SSDs available.
If I understand correctly, there are two or three values that are important and that differentiate a consumer SSD from an enterprise SSD: the TBW/DWPD rating and the DRAM cache.
But most manufacturers don't seem to provide much information on the DRAM cache. Are there any specific points/values to look out for there?
 
