Quick wear out on raid1 - Looking for suggestions

en4ble · Aug 27, 2024

This been first deployment using raid1 using wd red ssd (not the best tier but would never expect such a quick wear out).

Server been up for 75 days with wear out at 30% already (almost .5 % per day).

Purpose of those drives was primary OS (only) but they also hold couple (11) virtual security appliances (pfsense) (this could potentially be the reason of the high wear - but haven't tested that theory).

iostat for the raid:

Code:

 zpool iostat -v
                                          capacity     operations     bandwidth
pool                                    alloc   free   read  write   read  write
--------------------------------------  -----  -----  -----  -----  -----  -----
rpool                                    171G   293G     24    214   351K  10.1M
  mirror-0                               171G   293G     24    214   351K  10.1M
    ata-WDC_WDS500G1R0A-68A4W0_24070L800900-part3      -      -     12    107   174K  5.03M
    ata-WDC_WDS500G1R0A-68A4W0_24070L800864-part3      -      -     12    107   176K  5.03M
--------------------------------------  -----  -----  -----  -----  -----  -----

And they seem heavy read/write based on smart:
241 Host_Writes_GiB 0x0030 253 253 --- Old_age Offline - 32044
242 Host_Reads_GiB 0x0030 253 253 --- Old_age Offline - 1091

I would appreciate some opinions on best approach here to perhaps minimize the wearout - something I may not be aware with raid1 we could potentially turn off?

Options and my ask:

1)migrate the vFWs to other pools so they don't use the raid - and monitor if the % is slowing down.
2)migrate to enterprise grade ssds - perhaps hot swap. Would love options on best SSDs for this based on the use case and read/write its doing now
3)any options that could be looked at with raid1 to perhaps improve hw longevity?
4)any other suggestions perhaps?!

Thank You in advance for replies!

mram · Aug 27, 2024

You might want to use the Proxmox post-install helper script and turn off Cluster mode to save some wear and tear.

https://tteck.github.io/Proxmox/

en4ble · Aug 27, 2024

mram said:
You might want to use the Proxmox post-install helper script and turn off Cluster mode to save some wear and tear.

https://tteck.github.io/Proxmox/

mram - thank you for this resource (awesome), even if you don't run cluster by default there are some settings ON?!

esi_y · Aug 27, 2024

Have a look at:

https://github.com/isasmendiagus/pmxcfs-ram

Also, consider that there's write amplification going on with ZFS on SSDs - not a great filesystem for this century.

https://forums.freebsd.org/threads/my-zfs-ssd-experiences.86810/

mram · Aug 27, 2024

IIRC it helps disable logging of services that you are not using.

mram · Aug 27, 2024

Used pair of cheap Intel S3700 / S3710 400GB makes a great Proxmox ZFS mirrored boot drive IMHO.

esi_y · Aug 27, 2024

en4ble said:
2)migrate to enterprise grade ssds - perhaps hot swap. Would love options on best SSDs for this based on the use case and read/write its doing now

You don't really have much meaningful choices, price/value ratio-wise, perhaps Kingston DC600M for SATA SSD. For M.2 NVMe, you are limited to Microns 7300 or 7450 especially if you need 2280 size.

en4ble · Aug 27, 2024

mram said:
IIRC it helps disable logging of services that you are not using.

@mram I'm sorry what do you mean IIRC? Still talking about the post-install script?!

en4ble · Aug 27, 2024

esi_y said:
Have a look at:

https://github.com/isasmendiagus/pmxcfs-ram

Also, consider that there's write amplification going on with ZFS on SSDs - not a great filesystem for this century.

https://forums.freebsd.org/threads/my-zfs-ssd-experiences.86810/

@esi_y I like the cfs-ram and what it does but man....guessing last resort. ty for the input!

en4ble · Aug 27, 2024

mram said:
Used pair of cheap Intel S3700 / S3710 400GB makes a great Proxmox ZFS mirrored boot drive IMHO.

Could you go on lower storage when performing hotswap? I was under the impression I need same size or bigger when upgrading? Those are 500GB WD Reds.

en4ble · Aug 27, 2024

mram said:
You might want to use the Proxmox post-install helper script and turn off Cluster mode to save some wear and tear.

https://tteck.github.io/Proxmox/

again, if i'm not running cluster - this is still recommended? "You might want to use the Proxmox post-install helper script and turn off Cluster mode to save some wear and tear."

_gabriel · Aug 27, 2024

I bet on VM pfSense, if ZFS over ZFS + if you enable things like ntopng.
https://forum.proxmox.com/threads/high-data-unites-written-ssd-wearout-in-proxmox.139119/post-665450

for me PVE itself isn't angry.

en4ble · Aug 27, 2024

_gabriel said:
I bet on VM pfSense, if ZFS over ZFS + if you enable things like ntopng.
https://forum.proxmox.com/threads/high-data-unites-written-ssd-wearout-in-proxmox.139119/post-665450

for me PVE itself isn't angry.

@_gabriel yes its my guess as well. I have a maintenance in 1 day so I'll move them over to nvmes, Should never go for raid1 just do 1 lvm for os and 1 lvm for pfsense and be done :/

louie1961 · Aug 27, 2024

I use consumer NVMe SSDs in my Proxmox machines, without a problem. I do disable the corosync, pve-ha-crm, and pve-ha-lrm services to minimize drive writes (no clusters here). These drives are just about a year old and have zero % wear out, so I would say something is not right with your set up. I also really don't store any data on my Proxmox nodes. All application/persistence data, docker volumes and VM/CT backups, etc. reside on my Synology NAS. This particular machine has the OS and the VMs all on the same mirrored ZFS drives.

en4ble · Aug 27, 2024

louie1961 said:
I use consumer NVMe SSDs in my Proxmox machines, without a problem. I do disable the corosync, pve-ha-crm, and pve-ha-lrm services to minimize drive writes (no clusters here). These drives are just about a year old and have zero % wear out, so I would say something is not right with your set up. I also really don't store any data on my Proxmox nodes. All application/persistence data, docker volumes and VM/CT backups, etc. reside on my Synology NAS. This particular machine has the OS and the VMs all on the same mirrored ZFS drives.

View attachment 73787

View attachment 73788

Thanks for the info man. I was shocked seeing wd red going down like that. Again I do run 11 vPfsense appliances that are being utilized heavily. Other than that its a vanilla proxmox deployment

I should be able to verify if its in fact pfsense in about a day. I'll move those off the raid1 and I think I'll run the post install script that was mentioned and remove the "cluster" since its also running solo.

en4ble · Aug 27, 2024

louie1961 said:
I use consumer NVMe SSDs in my Proxmox machines, without a problem. I do disable the corosync, pve-ha-crm, and pve-ha-lrm services to minimize drive writes (no clusters here). These drives are just about a year old and have zero % wear out, so I would say something is not right with your set up. I also really don't store any data on my Proxmox nodes. All application/persistence data, docker volumes and VM/CT backups, etc. reside on my Synology NAS. This particular machine has the OS and the VMs all on the same mirrored ZFS drives.

View attachment 73787

View attachment 73788

@louie1961 what is your iostat for this raid? Could you please share?

louie1961 · Aug 28, 2024

en4ble said:
@louie1961 what is your iostat for this raid? Could you please share?

I have never used this command before, so I am not sure what the significance of these metrics are. But here you go

Kingneutron · Aug 28, 2024

Personally, I'll never use WD RED SSD ever again. Had a 1TB in my mostly-off-except-weekends ZFS server and it died right after the warranty expired.

I would also not recommend trying to boot/run Proxmox on sub-1TB SSD if you can help it, larger sizes typically have higher TBW ratings.

You can get by on a budget with a 256GB SSD for OS and a small lvm-thin, but you have to turn off cluster services + turn off atime everywhere (including in-guests), set swappiness to 0 -- and look into zram and log2ram. And have enough RAM in the server so things don't swap

The no-name nvme ~238GB that came with my Qotom firewall appliance is only at 1% wear (mostly 24/7 operation since Feb) with the above mitigations in place. I may upgrade it to a 1TB when it gets to ~50% wear but for now it's no worries

Disk model: YSO256GTLCW-E3C-2

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 40 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 1%
Data Units Read: 23,287,249 [11.9 TB]
Data Units Written: 5,732,131 [2.93 TB]
Host Read Commands: 159,202,847
Host Write Commands: 267,908,696
Controller Busy Time: 1,327
Power Cycles: 62
Power On Hours: 4,614
Unsafe Shutdowns: 47
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 56 Celsius

I put in a 2nd Lexar NM790 1TB (with heat sink) to run the VMs on, and it's just recently crossed over to 1% wear.

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 40 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 1%
Data Units Read: 15,424,533 [7.89 TB]
Data Units Written: 10,967,567 [5.61 TB]
Host Read Commands: 65,840,472
Host Write Commands: 239,758,570
Controller Busy Time: 481
Power Cycles: 24
Power On Hours: 4,519
Unsafe Shutdowns: 16
Media and Data Integrity Errors: 0
Error Information Log Entries: 1
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 40 Celsius
Temperature Sensor 2: 37 Celsius

EDIT:
2024.0518 enabled RAM disk logging for less writes to disk
[[
By default, pfsense is wrtting logs very often, so you can turn on a RAM disk option so it writes to RAM and flushes to SSD much less often.

In pfsense: "System" -> "Advanced" -> "Miscellaneous" -> "RAM Disk Settings"
]]

If you did the default install of pfsense to zfs boot/root, you should probably move the VM disk storage to lvm-thin so you're not doing COW+COW write amplification. And enable logging to RAM in-VM as described

en4ble · Aug 28, 2024

Kingneutron said:
Personally, I'll never use WD RED SSD ever again. Had a 1TB in my mostly-off-except-weekends ZFS server and it died right after the warranty expired.

I would also not recommend trying to boot/run Proxmox on sub-1TB SSD if you can help it, larger sizes typically have higher TBW ratings.

You can get by on a budget with a 256GB SSD for OS and a small lvm-thin, but you have to turn off cluster services + turn off atime everywhere (including in-guests), set swappiness to 0 -- and look into zram and log2ram. And have enough RAM in the server so things don't swap

The no-name nvme ~238GB that came with my Qotom firewall appliance is only at 1% wear (mostly 24/7 operation since Feb) with the above mitigations in place. I may upgrade it to a 1TB when it gets to ~50% wear but for now it's no worries

Disk model: YSO256GTLCW-E3C-2

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 40 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 1%
Data Units Read: 23,287,249 [11.9 TB]
Data Units Written: 5,732,131 [2.93 TB]
Host Read Commands: 159,202,847
Host Write Commands: 267,908,696
Controller Busy Time: 1,327
Power Cycles: 62
Power On Hours: 4,614
Unsafe Shutdowns: 47
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 56 Celsius

I put in a 2nd Lexar NM790 1TB (with heat sink) to run the VMs on, and it's just recently crossed over to 1% wear.

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 40 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 1%
Data Units Read: 15,424,533 [7.89 TB]
Data Units Written: 10,967,567 [5.61 TB]
Host Read Commands: 65,840,472
Host Write Commands: 239,758,570
Controller Busy Time: 481
Power Cycles: 24
Power On Hours: 4,519
Unsafe Shutdowns: 16
Media and Data Integrity Errors: 0
Error Information Log Entries: 1
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 40 Celsius
Temperature Sensor 2: 37 Celsius

EDIT:
2024.0518 enabled RAM disk logging for less writes to disk
[[
By default, pfsense is wrtting logs very often, so you can turn on a RAM disk option so it writes to RAM and flushes to SSD much less often.

In pfsense: "System" -> "Advanced" -> "Miscellaneous" -> "RAM Disk Settings"
]]

If you did the default install of pfsense to zfs boot/root, you should probably move the VM disk storage to lvm-thin so you're not doing COW+COW write amplification. And enable logging to RAM in-VM as described

@Kingneutron I really appreciate your feedback.

Yes, I know about the RAM disk in pfsense. This will be part of my migration day after tomorrow. But good info in case someone did not know and can improve on that.

Will be moving those vpfsense onto lvm-thin as you described.

So I know about the cluster services (someone mentioned the post install script - I will run it) but I'm not sure what is "turn off atime" - never heard that one. Could you elaborate please on this aspect?! Thank you. EDIT: think I found some info: https://www.unixtutorial.org/zfs-performance-basics-disable-atime/

But still would appreciate some guidance on how to disable it globally? I can see a lot of entries for "atime"

Code:

mount | grep "atime"

sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)

proc on /proc type proc (rw,relatime)

udev on /dev type devtmpfs (rw,nosuid,relatime,size=528274036k,nr_inodes=132068509,mode=755,inode64)

devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)

tmpfs on /run type tmpfs (rw,nosuid,nodev,noexec,relatime,size=105661512k,mode=755,inode64)

rpool/ROOT/pve-1 on / type zfs (rw,relatime,xattr,posixacl,casesensitive)

securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)

tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k,inode64)

cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)

pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)

efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)

bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)

systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=232193)

hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)

mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)

debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)

tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)

fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)

configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)

ramfs on /run/credentials/systemd-sysctl.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)

ramfs on /run/credentials/systemd-sysusers.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)

ramfs on /run/credentials/systemd-tmpfiles-setup-dev.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)

rpool on /rpool type zfs (rw,relatime,xattr,noacl,casesensitive)

rpool/var-lib-vz on /var/lib/vz type zfs (rw,relatime,xattr,noacl,casesensitive)

rpool/ROOT on /rpool/ROOT type zfs (rw,relatime,xattr,noacl,casesensitive)

rpool/data on /rpool/data type zfs (rw,relatime,xattr,noacl,casesensitive)

ramfs on /run/credentials/systemd-tmpfiles-setup.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)

binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,nosuid,nodev,noexec,relatime)

lxcfs on /var/lib/lxcfs type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)

sunrpc on /run/rpc_pipefs type rpc_pipefs (rw,relatime)

/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)

tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=105661508k,nr_inodes=26415377,mode=700,inode64)

Swap is off and I have plenty of ram so no worries there.

Worse case if I'm not able to stop the degradation I think I'll just need to figure out the hot swap way to get some heavy enterprise ssds but I'm really hoping I can slow it down with migration off the raid1. This is my first experience with zfs and probably my last

Thank you again!

en4ble · Aug 28, 2024

louie1961 said:
I have never used this command before, so I am not sure what the significance of these metrics are. But here you go

View attachment 73807

Thanks for the screengrab, not sure which one is your raid device with os. I ran this:

Code:

zpool iostat -v

Quick wear out on raid1 - Looking for suggestions

Member

Renowned Member

Member

Renowned Member

Renowned Member

Renowned Member

Renowned Member

Member

Member

Member

Member

Distinguished Member

Member

Well-Known Member

Member

Member

Well-Known Member

Renowned Member

Member

Member

We value your privacy