Is this solved? Is this abnormal wearout normal with PVE on SSDs?
I don't mind doing a new setup. I created the zpool as you suggested, and I think I get it.
So you are suggesting using 2 small SSDs (mirrored) for root, and then 2 large SSDs (mirrored) for all my VMs. Root will be LVM (ext4), and the large SSDs will also be LVM (ext4) with a zpool on top of that? Do I have that right?
Since I have 3 SSDs (and hopefully 4 soon), can I do the same, but with 4 SSDs in RAID10?
I am only able to set up Proxmox with the installer. What you are suggesting is a manual setup. I am not comfortable doing that unless I have a guide I can follow.
I appreciate you helping me. Thank you!!!
Hi,
Just wondering if you ever figured this out?
I am thinking of following a similar format.
My current setup uses a 1 TB NVMe and a 1 TB SATA SSD in ZFS RAID1. I put it in on October 7th, 2019, and after a couple of months it is showing about 13.9 TBW so far. That's close to 7 TB a month being written.
I have 1 VM running 24/7, Windows 2019 Standard, that is backed up nightly with a vzdump snapshot to an NFS share; the backup is about 66 GB.
It seems like the wear is going fast, though. The SATA SSD has an endurance of about 300 TBW and shows 90% remaining life; the NVMe has about 600 TBW endurance and shows 2% used so far.
I am going to switch to 2 smaller 120 GB mirrored SATA SSDs for the OS and 2 x 960 GB SATA SSDs for the VMs and storage.
I really wish I could find out what is causing the heavy writes.
Thanks so much.
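If it helps anyone reading along, a few standard tools can narrow down where the writes come from; this is just a sketch, the pool name is an example, and iotop/pidstat need to be installed separately:
Code:
# Per-vdev write rates on the pool, sampled every 5 seconds
zpool iostat -v rpool 5
# Per-process accumulated disk writes (package: iotop)
iotop -ao
# Per-process I/O statistics every 5 seconds (package: sysstat)
pidstat -d 5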
Hi,
yes, I have figured this out. Since my post here, the system has been running in production non-stop. I never had any issues or any drive failures. I just ordered new SSDs to replace the Samsung 840 Pros with Intel 3700s. This is more of a precaution, since the drives have been in full-time operation since approx. Dec 2015. I also have a database running on it.
I highly suggest setting up 2 (small) SSDs in a mirror for the root partition, and then using however many drives you need, in a mirror or RAID10, for the VMs. The advantage is that if you have to upgrade Proxmox, you can just back up all the config files, export the zpool that holds all the VMs, upgrade or reinstall Proxmox, copy the config files back, import the zpool, and you are ready to go. I have already done that a couple of times and it is fast. You can basically redo a system in about 2 hours.
I set up the root zpool with the Proxmox installer, but then created the "storage" zpool manually.
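In rough outline, that workflow looks something like the sketch below. The pool name "storage" is the one from this thread; the backup directory is only an example and of course has to end up somewhere other than the root disks you are about to reinstall.
Code:
# Before reinstalling: save the PVE configs and release the VM pool
mkdir -p /root/pve-backup
cp /etc/pve/storage.cfg /root/pve-backup/
cp /etc/pve/qemu-server/*.conf /root/pve-backup/
zpool export storage            # cleanly detach the pool with the VM disks
# (copy /root/pve-backup off the machine before wiping the root disks)

# After reinstalling Proxmox on the root mirror:
zpool import storage            # re-attach the VM pool
cp /root/pve-backup/storage.cfg /etc/pve/
cp /root/pve-backup/*.conf /etc/pve/qemu-server/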
Something to do with the ZFS block size (volblocksize) being the default 8k versus something like 64k.
Did you find anything like that?
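For reference, a quick way to check and adjust that; the dataset and storage names here are only examples, and note that volblocksize is fixed when a zvol is created, so existing VM disks keep whatever they were made with:
Code:
# Check the block size of an existing VM disk (zvol)
zfs get volblocksize storage/vm-100-disk-0
# Raise the default for newly created disks on a PVE ZFS storage
pvesm set storage --blocksize 16k
# Or create a zvol manually with an explicit block size
zfs create -V 32G -b 64k storage/vm-100-disk-1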
Code:
root@VMNode01:/# zdb | grep ashift
            ashift: 12
            ashift: 9
            ashift: 9
root@VMNode01:/# lsblk -dt /dev/sd?
NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED    RQ-SIZE  RA WSAME
sda          0    512      0     512     512    0 deadline     128 128    0B
sdb          0    512      0     512     512    0 deadline     128 128    0B
sdc          0    512      0     512     512    0 noop         128 128    0B
sdd          0    512      0     512     512    0 noop         128 128    0B
sde          0    512      0     512     512    0 noop         128 128    0B
sdf          0    512      0     512     512    0 noop         128 128    0B
sdg          0    512      0     512     512    1 deadline     128 128    0B
Then the zpool layout:
Code:
root@VMNode01:/# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0

errors: No known data errors

  pool: storage
 state: ONLINE
  scan: none requested
config:

        NAME                                               STATE     READ WRITE CKSUM
        storage                                            ONLINE       0     0     0
          mirror-0                                         ONLINE       0     0     0
            ata-Samsung_SSD_850_PRO_512GB_S250NWAG904013M  ONLINE       0     0     0
            ata-Samsung_SSD_850_PRO_512GB_S250NXAG936968X  ONLINE       0     0     0
          mirror-1                                         ONLINE       0     0     0
            ata-Samsung_SSD_850_PRO_512GB_S250NWAG905249J  ONLINE       0     0     0
            ata-Samsung_SSD_850_PRO_512GB_S250NWAG905234V  ONLINE       0     0     0

errors: No known data errors
And finally the SMART data for all 6 drives:
Code:
root@VMNode01:/# smartctl -a /dev/sda
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.2.3-2-pve] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ADATA SX930
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
...
root@VMNode01:/# smartctl -a /dev/sdb
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.2.3-2-pve] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ADATA SX930
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
...
root@VMNode01:/# smartctl -a /dev/sdc
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.2.3-2-pve] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 850 PRO 512GB
Firmware Version: EXM02B6Q
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
...
root@VMNode01:/# smartctl -a /dev/sdd
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.2.3-2-pve] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 850 PRO 512GB
Firmware Version: EXM02B6Q
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
...
root@VMNode01:/# smartctl -a /dev/sde
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.2.3-2-pve] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 850 PRO 512GB
Firmware Version: EXM02B6Q
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
...
root@VMNode01:/# smartctl -a /dev/sdf
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.2.3-2-pve] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 850 PRO 512GB
Firmware Version: EXM02B6Q
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
...
Hi, I would like to re-open this thread. A few months ago I migrated my ESXi setup to Proxmox VE, and I am seeing very strange SSD wear. My old ESXi setup used an LSI 9211 and two 750 GB MX300 SSDs in RAID 1. Let's look at the stats from one of those drives:
Code:
Device Model:     Crucial_CT750MX300SSD1
Sector Size:      512 bytes logical/physical
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       45223
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       193920876835
202 Percent_Lifetime_Remain 0x0030   054   054   001    Old_age   Offline      -       46
45223 hours is about 5.16 years, and since the disk reports 512-byte sectors, the Total_LBAs_Written value works out to 90.30 TB over those five years, or roughly 1.46 TB/month.
So this disk is somewhere in the middle of its lifetime (rated endurance is 220 TBW). All fine, great and clear.
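For reference, the conversion behind those numbers (the counters are the ones from the SMART output above; strictly speaking the results are TiB, which the thread writes as TB):
Code:
# Total_LBAs_Written * 512-byte logical sectors = bytes written by the host
echo "193920876835 * 512 / 1024^4" | bc -l   # -> ~90.30 TiB total
# Spread over 45223 power-on hours (~61.9 months of 730 h each):
echo "90.30 / (45223 / 730)" | bc -l         # -> ~1.46 TiB per month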
All of those virtual machines have been migrated to PVE 7.0, using two 2 TB MX500s in a ZFS mirror:
Code:
  pool: storage
 state: ONLINE
  scan: scrub repaired 0B in 00:49:47 with 0 errors on Sun Jan 9 01:13:49 2022
config:

        NAME                        STATE     READ WRITE CKSUM
        storage                     ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            ata-CT2000MX500SSD1_XX  ONLINE       0     0     0
            ata-CT2000MX500SSD1_XX  ONLINE       0     0     0
Now let's see some stats from one of these disks:
Code:
Device Model:     CT2000MX500SSD1
Firmware Version: M3CR033
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1093
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       67185503643
202 Percent_Lifetime_Remain 0x0030   090   090   001    Old_age   Offline      -       10
1093 hours is about 45.5 days. Using the 512-byte logical sector size, the total written comes to 31.29 TB (!!!!) in those 45 days (20.86 TB/month); if I instead compute with the 4096-byte physical size, it gives the insane number of 250.29 TB (166.82 TB/month).
According to the spec, the 2 TB MX500 has an endurance of 700 TBW. Neither of these numbers matches 10% of 700 TB. In any case, 10% disk wearout in just 45 days seems completely out of proportion to the load under the ESXi setup.
In the comments of the calculator at https://www.virten.net/2016/12/ssd-total-bytes-written-calculator/ somebody also mentions this way of showing disk stats:
Code:
root@pve:~# smartctl /dev/sdh -l devstat | less
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.11.22-5-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4               1  ---  Lifetime Power-On Resets
0x01  0x010  4            1093  ---  Power-on Hours
0x01  0x018  6      2771135243  ---  Logical Sectors Written
0x01  0x020  6       839078384  ---  Number of Write Commands
It reports a different number of written sectors (why?!). With a 512-byte sector size that is 1.29 TB (0.86 TB/month); with 4096 it gives 10.32 TB (6.88 TB/month). Finally, one of the numbers is at least in the ballpark of that 10% wearout.
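To line the two counters up directly, something like the following can be run (the device name and the 512-byte assumption are simply the ones used in this thread; which of the two counters reflects real host writes is exactly the open question here):
Code:
# Dump both counters for comparison (replace /dev/sdh with your device)
smartctl -A /dev/sdh | grep -E 'Total_LBAs_Written|Power_On_Hours'
smartctl -l devstat /dev/sdh | grep -E 'Logical Sectors Written|Power-on Hours'
# Convert either sector count, assuming 512-byte logical sectors:
echo "67185503643 * 512 / 1024^4" | bc -l   # SMART attribute 246 -> ~31.3 TiB
echo "2771135243 * 512 / 1024^4" | bc -l    # devstat counter     -> ~1.3 TiB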
I like the Proxmox ZFS setup, but the way ZFS is wearing out these SSDs is totally insane. Should I migrate to LVM to solve this?
One more note: PVE itself is installed on a different (NVMe) drive, so none of the Proxmox services' writes land on this pool, which holds only the VMs.
Can I use consumer or pro-sumer SSDs, as these are much cheaper than enterprise-class SSDs?
No. Never. These SSDs won't provide the required performance, reliability or endurance. See the fio results from before and/or run your own fio tests.
Hi. "Should I migrate to LVM to solve this?"
To quote the PVE ZFS Benchmark paper FAQ page 8 again:
ZFS has a high overhead, especially if you've got DBs that do a lot of small sync writes. Then the virtualization, nested filesystems etc. add overhead too, and the resulting write amplification multiplies rather than adds up, so the factors compound. According to my ZFS benchmarks with enterprise SSDs, I've seen a write amplification between factor 3 (big async sequential writes) and factor 81 (4k random sync writes), with an average of factor 20 on my homeserver measured over months. So you can get really heavy wear even when not writing that much. My homeserver, for example, only writes 45GB per day, but because of the factor 20 write amplification this hits the SSDs with 900GB per day.
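A rough way to estimate your own amplification factor is to sample the SSD's SMART write counter over a known period and compare it to what the guests actually write. This is just a sketch: /dev/sda is an example device, and in practice you would record the values from a cron job rather than sleep:
Code:
# Sample the SMART counter twice, one day apart
before=$(smartctl -A /dev/sda | awk '/Total_LBAs_Written/ {print $10}')
sleep 86400
after=$(smartctl -A /dev/sda | awk '/Total_LBAs_Written/ {print $10}')
echo "$(( (after - before) * 512 / 1024**3 )) GiB hit this SSD in 24h"
# Divide that by the GiB/day your VMs report writing to get the amplification factor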
LVM will cause far fewer writes, but then you have no redundancy or bit-rot protection, so you can never be sure that the data on the disks, or the data in all of your backups, isn't silently corrupted over time.
Hi, there is no warning about this on that wiki page.
root@pve:~# smartctl -A /dev/nvme0
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.74-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 37 Celsius
Available Spare: 100%
Available Spare Threshold: 32%
Percentage Used: 0%
Data Units Read: 30,774 [15.7 GB]
Data Units Written: 239,730 [122 GB]
Host Read Commands: 206,192
Host Write Commands: 5,155,750
Controller Busy Time: 0
Power Cycles: 10
Power On Hours: 15
Unsafe Shutdowns: 3
Media and Data Integrity Errors: 0
Error Information Log Entries: 14
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
root@truenas[~]# smartctl -A /dev/nvme0
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.142+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 29 Celsius
Available Spare: 100%
Available Spare Threshold: 32%
Percentage Used: 0%
Data Units Read: 7,697 [3.94 GB]
Data Units Written: 11,172 [5.72 GB]
Host Read Commands: 130,334
Host Write Commands: 190,834
Controller Busy Time: 0
Power Cycles: 6
Power On Hours: 0
Unsafe Shutdowns: 1
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
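For comparison, the NVMe "Data Units Written" counter is reported in units of 1000 x 512-byte blocks, which is how smartctl derives the bracketed GB figures above:
Code:
# Data Units Written * 512,000 bytes = total host writes
echo "239730 * 512000 / 10^9" | bc -l   # PVE node    -> ~122.7 GB in 15 power-on hours
echo "11172 * 512000 / 10^9" | bc -l    # TrueNAS box -> ~5.7 GB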