ProxMox 4.x is killing my SSDs

Honestly man, I think you should read something about SSDs (e.g. Wikipedia has a nice article about "write amplification"). Call it "broken by design", but write amplification is a common feature of every flash memory due to the way it works: before writing to flash, you have to erase it (which counts as a write). And this is done in so-called "erase blocks", which are a much bigger unit than a "sector" (typical erase-block sizes range from 128 kB up to 4 MB, depending on vendor and SSD size).

You might want to write just 100 bytes of data (be it a log message or whatever), but if you write it to an SSD that has no never-written-to blocks left, the drive actually has to rewrite at least 128 kB. All SSDs are affected by this, albeit some more than others (it depends on erase-block size, controller logic, etc.). This write-amplification factor can be quite high when writing small files, and close to 1 for big files, but it is never 1! That means it does not matter how much data the OS wrote to disk (iotop, iostat, etc.); what matters is how much data the SSD controller actually wrote to flash (a value you can find in the SMART table).
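
As a rough worst-case illustration (my numbers, not the OP's): a 100-byte log line that forces a rewrite of one 128 kB erase block means 131072 / 100 ≈ 1310 bytes of flash writes per byte of payload. You can compare the host-side and drive-side counters yourself; a minimal sketch, assuming the disk is /dev/sda and that it exposes lifetime writes via SMART attribute 241 (names and units are vendor-specific):

Code:
# host view: field 10 of /proc/diskstats is sectors written since boot
awk '$3 == "sda" { printf "%.1f GiB written (host view)\n", $10 * 512 / 1024^3 }' /proc/diskstats

# drive view: many SSDs report lifetime writes in SMART attribute 241
smartctl -A /dev/sda | grep -i -e Total_LBAs_Written -e Lifetime_Writes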

The SSD controller tries to fight this problem with "garbage collection", "wear leveling", sequential writes, etc. Moreover, there are ways the user can help:
1. do not write in small chunks (e.g. log files line by line)
2. keep plenty of free/overprovisioned space (I recommend 20-25%)
3. run the trim command regularly (see the sketch below)
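
For point 3, a weekly cron job is usually enough. A minimal sketch, assuming util-linux's fstrim is present (it is on Debian/Proxmox) and your filesystems support discard:

Code:
# install a weekly job that trims all mounted filesystems supporting it
cat > /etc/cron.weekly/fstrim <<'EOF'
#!/bin/sh
/sbin/fstrim --all
EOF
chmod +x /etc/cron.weekly/fstrim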

I'm familiar with all of that. "Broken by design" was maybe a little harsh ... let's call it "wrong tool for the job" (of storing any kind of server data). If I reach my DWPD by writing 1 MB in random 1-byte writes, something is terribly wrong.
 
Okay, checked the disk I changed yesterday.

Yesterday, I cloned the failed drive to the new one, so with the drive being 60 GB in size, these SMART values made sense:

Code:
SSD_Life_Left : 99
Lifetime_Writes_GiB : 56
Lifetime_Reads_GiB : 1

Today, I read this:

Code:
SSD_Life_Left : 99
Lifetime_Writes_GiB : 106
Lifetime_Reads_GiB : 1

So if I understand this correctly, 50 GiB of data was written to the disk in just one day ...
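
If you want to track that rate, a daily one-liner is enough; a sketch, assuming the disk is /dev/sda and that it exposes Lifetime_Writes_GiB like mine does:

Code:
# e.g. as /etc/cron.daily/ssd-writes: append one timestamped reading per day
echo "$(date -Iseconds) $(smartctl -A /dev/sda | awk '/Lifetime_Writes_GiB/ { print $NF }') GiB" >> /var/log/ssd-writes.log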

Looking at iotop, the only processes doing IOs that I see are mainly:
  • [jbd2/dm-0-8]: the ext4 journal flush, which accounts for most IOs
  • ceph-[osd|mon] processes: they write logs to the SSDs, but the stored data goes to other drives and I don't use this SSD as a Ceph journal drive (in fact, I don't use a separate drive for the OSD journals)
  • rrdcached: graph and stats data for ProxMox, but I don't see how it would have written that much data in just one day, even though I have 15 nodes and 100 running VMs (see the accounting sketch below)
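
To put numbers on each of those, iotop can accumulate per-process totals instead of showing instantaneous rates; a sketch, assuming a reasonably recent iotop:

Code:
# -a accumulates, -o shows only processes doing IO, -k prints kB;
# two samples 600 s apart give total writes per process over 10 minutes
iotop -b -o -a -k -d 600 -n 2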

@Rhinox: Regarding write amplification, what settings would you recommend to compensate?

Regards.
 
The best thing you can do is avoid writing small files (i.e. every log message as it arrives), where write amplification is highest. Use memory buffers for logging, or redirect logs to tmpfs and rotate them to disk frequently. In addition to that: trim, a big SSD, a lot of free space, etc.
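
The tmpfs variant is a one-line fstab entry; a sketch, with the caveat that logs still in RAM are lost on power failure and some daemons expect their /var/log subdirectories to exist at boot:

Code:
# /etc/fstab: keep /var/log in RAM (the size is an assumption, adjust to taste)
tmpfs  /var/log  tmpfs  defaults,noatime,mode=0755,size=256m  0  0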
 
  • >>ceph-[osd|mon] processes: they write logs to the SSDs, but the stored data goes to other drives and I don't use this SSD as a Ceph journal drive (in fact, I don't use a separate drive for the OSD journals)
What is the size of the ceph-mon log? I'm seeing 2 MB/s in your iotop; is it constant? I don't have more than 150 kB/s in my ceph cluster, but...
 
Interesting thread! My two cents:
- write amplification exists on rotational disks and RAID arrays too (to my knowledge, the job is always read/modify/write)
- buffered logs and noatime/nodiratime can help, but I don't know if the Proxmox daemons will be pleased with that
- another improvement: use TRIM, so that the wear-leveling algorithm can do its job well (but your case is far beyond this optimisation)
- another idea, much more probable to my mind: look at the SMART logs of your drives; maybe PVE 4.3 is triggering complete disk 'surface' checks too often? (see the check below)
- and another: do you store backups of VMs on the SSD?
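
The surface-scan idea is easy to check, because the drive records its own self-tests; assuming the disk is /dev/sda:

Code:
# a pile of recent "Extended offline" entries here would support the
# theory that something is running surface scans behind your back
smartctl -l selftest /dev/sda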
 
For reference, the SSD with Proxmox on it (at home):
- used for the Proxmox installation root
- used for VMs (for the moment, 2 VMs with very low activity)
I didn't go through the manufacturer's spec for the SMART values, but everything seems OK:

Code:
root@ruche:~# smartctl -x /dev/sda
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.4.19-1-pve] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     KINGSTON SKC400S37128G
Serial Number:    50026B7267004702
Firmware Version: SAFM00.Y
User Capacity:    128,035,676,160 bytes [128 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Oct 13 00:46:13 2016 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM level is:     254 (maximum performance)
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (   30) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      (   2) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   050    -    0
  2 Throughput_Performance  P-S---   100   100   050    -    0
  3 Spin_Up_Time            POS---   100   100   050    -    0
  5 Reallocated_Sector_Ct   PO--C-   100   100   050    -    0
  7 Unknown_SSD_Attribute   PO-R--   100   100   050    -    0
  8 Unknown_SSD_Attribute   P-S---   100   100   050    -    0
  9 Power_On_Hours          -O--C-   100   100   000    -    787
12 Power_Cycle_Count       -O--C-   100   100   000    -    22
168 Unknown_Attribute       -O--C-   100   100   000    -    0
170 Unknown_Attribute       PO----   100   100   010    -    232
173 Unknown_Attribute       -O--C-   100   100   000    -    65538
175 Program_Fail_Count_Chip PO--C-   100   100   050    -    0
187 Reported_Uncorrect      -O--C-   100   100   000    -    0
192 Power-Off_Retract_Count -O--C-   100   100   000    -    17
194 Temperature_Celsius     PO---K   076   066   030    -    24 (Min/Max 22/34)
196 Reallocated_Event_Count -O----   100   100   010    -    0
197 Current_Pending_Sector  -O--CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    PO-R--   100   100   050    -    0
218 Unknown_Attribute       PO-R--   100   100   050    -    0
231 Temperature_Celsius     PO--C-   100   100   000    -    100
233 Media_Wearout_Indicator PO-R--   100   100   000    -    132
240 Unknown_SSD_Attribute   PO--C-   100   100   000    -    0
241 Total_LBAs_Written      -O--C-   100   100   000    -    71
242 Total_LBAs_Read         -O--C-   100   100   000    -    41
244 Unknown_Attribute       -O----   100   100   000    -    1
245 Unknown_Attribute       -O----   100   100   000    -    2
246 Unknown_Attribute       -O--C-   100   100   000    -    35200
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O     51  Comprehensive SMART error log
0x03       GPL     R/O     64  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      6  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  SATA NCQ Queued Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x30       GPL,SL  R/O      8  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (64 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Commands not supported

Device Statistics (GP Log 0x04)
Page Offset Size         Value  Description
  1  =====  =                =  == General Statistics (rev 2) ==
  1  0x008  4               22  Lifetime Power-On Resets
  1  0x018  6        150125784  Logical Sectors Written
  1  0x020  6          5077842  Number of Write Commands
  1  0x028  6         87040380  Logical Sectors Read
  1  0x030  6         16670076  Number of Read Commands
  4  =====  =                =  == General Errors Statistics (rev 1) ==
  4  0x008  4            29456  Number of Reported Uncorrectable Errors
  4  0x010  4                0  Resets Between Cmd Acceptance and Completion
  5  =====  =                =  == Temperature Statistics (rev 1) ==
  5  0x008  1               32  Current Temperature
  5  0x010  1               32  Average Short Term Temperature
  5  0x018  1               32  Average Long Term Temperature
  5  0x020  1               50  Highest Temperature
  5  0x028  1                5  Lowest Temperature
  5  0x030  1               50  Highest Average Short Term Temperature
  5  0x038  1               16  Lowest Average Short Term Temperature
  5  0x040  1               50  Highest Average Long Term Temperature
  5  0x048  1               16  Lowest Average Long Term Temperature
  5  0x050  4                0  Time in Over-Temperature
  5  0x058  1               50  Specified Maximum Operating Temperature
  5  0x060  4                0  Time in Under-Temperature
  5  0x068  1                5  Specified Minimum Operating Temperature
  6  =====  =                =  == Transport Statistics (rev 1) ==
  6  0x008  4               42  Number of Hardware Resets
  6  0x018  4                0  Number of Interface CRC Errors
  7  =====  =                =  == Solid State Device Statistics (rev 1) ==
  7  0x008  1                5  Percentage Used Endurance Indicator

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  4           42  Transition from drive PhyRdy to drive PhyNRdy
0x000a  4           55  Device-to-host register FISes sent due to a COMRESET
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC
 
I spent several days at my previous job understanding SSD technology in order to choose the right disk for the right usage. The disk I use is not a professional one, but it has a good warranty :)
The disk has been running Proxmox for around 20 days.
 
Then there is something definitely wrong if your SSD collected ~30,000 uncorrectable errors and its endurance indicator dropped 5%, all in just 20 days! But the smartctl readings might not be accurate. You could try a vendor-specific tool instead...
 
  • >>ceph-[osd|mon] processes: they write logs to the SSDs, but the stored data goes to other drives and I don't use this SSD as a Ceph journal drive (in fact, I don't use a separate drive for the OSD journals)
What is the size of the ceph-mon log? I'm seeing 2 MB/s in your iotop; is it constant? I don't have more than 150 kB/s in my ceph cluster, but...

My ceph-mon.log is about 50k for each mon I have ... not much at all.
The ceph-mon process usually writes about 150-200 kB/s ... it must have peaked when I grabbed the stat.
 
Interesting thread! My two cents:
- write amplification exists on rotational disks and RAID arrays too (to my knowledge, the job is always read/modify/write)
- buffered logs and noatime/nodiratime can help, but I don't know if the Proxmox daemons will be pleased with that
- another improvement: use TRIM, so that the wear-leveling algorithm can do its job well (but your case is far beyond this optimisation)
- another idea, much more probable to my mind: look at the SMART logs of your drives; maybe PVE 4.3 is triggering complete disk 'surface' checks too often?
- and another: do you store backups of VMs on the SSD?

I'll try some of your tips.
No, no backups on those disks.
In fact those disks are nearly empty; I have a little less than 5 GB occupied by ProxMox itself, and the remaining space is unused.
 
I did a little summary of the SSDs installed in my nodes ... all the same model (Corsair Force LS 60GB).

As you can see, SSD_Life_Left and Lifetime_Writes_GiB correlate quite well except in some rare cases ... but this does not always hold against Power_On_Hours, which is odd since all those nodes have the same activity.
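
For anyone wanting to build the same summary, a loop like this gathers the three attributes from every node; a sketch with hypothetical node names, assuming passwordless ssh and identical attribute names:

Code:
for node in node01 node02 node03; do
    echo -n "$node: "
    ssh "$node" smartctl -A /dev/sda |
        awk '/SSD_Life_Left|Lifetime_Writes_GiB|Power_On_Hours/ { printf "%s=%s ", $2, $NF }'
    echo
done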
 

Attachments

  • ssds.pdf (19.4 KB)
@hybrid512: So, around 4 years of lifespan at those rates. What about the DWPD and the warranty?

I just compared a cheap Samsung 840 EVO (256 GB) with an enterprise Samsung SV843/MZ7WD960 (960 GB) using the output of this smartctl TUI:

The cheap 840 EVO has similarly bad values to the ones described here (running 4.3 with a desktop system on it):

Code:
------------------------------
SSD Status:   /dev/sda
------------------------------
On time:      887 hr
------------------------------
Data written:
           MB: 1,796,261.178
           GB: 1,754.161
           TB: 1.713
------------------------------
Mean write rate:
        MB/hr: 2,025.097
------------------------------
Drive health: 88 %
------------------------------

And the enterprise SSD, which ran Proxmox 3.x for almost 2 years and was upgraded last week to 4.3:

Code:
root@proxmox4 /tmp > ./samsung_ssd_get_lifetime_writes.bash
------------------------------
SSD Status:   /dev/sda
------------------------------
On time:      17,954 hr
------------------------------
Data written:
           MB: 12,152,309.427
           GB: 11,867.489
           TB: 11.589
------------------------------
Mean write rate:
        MB/hr: 676.858
------------------------------
Drive health: 99 %
------------------------------

I'll monitor both and report back.
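
For what it's worth, the tool's figures boil down to one SMART attribute; a sketch of the same calculation, assuming attribute 241 (Total_LBAs_Written) counts 512-byte sectors, which I believe is the case for these Samsung models:

Code:
# lifetime host writes derived from SMART attribute 241
LBAS=$(smartctl -A /dev/sda | awk '$1 == 241 { print $NF }')
echo "scale=3; $LBAS * 512 / 1024^4" | bc    # TiB written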
 
Here one can clearly see the difference between consumer and enterprise SSDs. Thanks for posting those values, LnxBil.
 
Then there is something definitely wrong if your SSD collected ~30,000 uncorrectable errors and its endurance indicator dropped 5%, all in just 20 days! But the smartctl readings might not be accurate. You could try a vendor-specific tool instead...
I may be mistaken, but I have always seen errors on the SATA buses of the Atom-based "servers" I build at home; I think the uncorrectable errors come from that. After all, it is a high-speed bus, subject to interference, and this is consumer-grade hardware :)
Concerning the endurance, I don't know; my values are consistent with this reference output: https://www.smartmontools.org/attachment/ticket/673/smartctl-KINGSTON-SKC400S37128G.txt. I don't know what it really represents.
 
Just to follow up, the cheap 840 EVO disk almost three weeks later:

Code:
------------------------------
SSD Status:   /dev/sda
------------------------------
On time:      1,394 hr
------------------------------
Data written:
           MB: 2,745,038.933
           GB: 2,680.702
           TB: 2.617
------------------------------
Mean write rate:
        MB/hr: 1,969.181
------------------------------
Drive health: 79 %
------------------------------

The disk itself runs ZFS (like the enterprise disks) and shows an increase in data written of almost 1 TB in that time. Maybe it really is a shitty SSD. The enterprise SSD has much less data written (and it is a server, not a desktop system). I have just started monitoring filesystem writes vs. block-device writes and I see a huge difference, but still not that big.
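
A crude way to watch that gap is to sample both layers at once; a sketch, assuming a ZFS root pool named rpool and the disk at sda (both assumptions, adjust to your setup):

Code:
zpool iostat -v rpool 60 2    # write bandwidth as ZFS sees it, per vdev
iostat -d -k sda 60 2         # kB/s the kernel sends to the device (needs sysstat)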
 
Woah. But the EVOs are not the best SSDs out there. The Pros are, but I remember huge firmware problems some years ago, and lots of our EVOs slowing down massively here on upgraded Mac minis.

By the way, which tool is this?
 
Woah. But the EVOs are not the best SSDs out there. The Pros are, but I remember huge firmware problems some years ago, and lots of our EVOs slowing down massively here on upgraded Mac minis.

I thought they wouldn't be good, but that bad? A real bummer. The laptop I crammed that SSD into is an ultrabook that was not made for easy access, so it took me a good hour to install the SSD. If the wearout continues at that speed, I'll need to replace the SSD next year.

By the way, which tool is this?

A simple frontend for smartctl, from this website:
http://www.jdgleaver.co.uk/blog/2014/05/23/samsung_ssds_reading_total_bytes_written_under_linux.html
 
