High SSD wear after a few days

eddi1984

Hello folks,

I have a problem with high SSD write usage.

It has been approx. 5 days since I installed 3 Samsung 850 Pro 512GB SSDs in RAIDZ with Proxmox 4.0. The SSDs are connected to the onboard SATA ports, no RAID card ...

I use ZFS as the filesystem (not zvol). I use the full RAID array to store my VMs (qcow2, VirtIO SCSI, fixed memory). I store my backups on a separate spinning 1TB disk.
I have 48GB of memory in total and have assigned 36GB to the VMs. I can upgrade the memory to 72GB if needed.

My local storage has 75GB on it, which includes VMs, snapshots, and ISOs. I have had to redo the server 2 times (change of RAID setup).

My problem is the high SSD write usage that I am seeing. I copied some SMART values below.

If I estimate the usage I have had up till now, I would say I wrote 1TB or less in the 94 hours of uptime of the drives.
According to SMART attribute 177 below, I wrote 4.5TB. That is 49GB/hour!!!
That means approx. 3.5TB was written that did not come from me copying stuff onto the server, creating VMs, etc. It was created by the system and the VMs running.

Sorry for the formatting:
Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       94
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       4
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       9

For reference ONLY, the Samsung 840 Pro used in my laptop:
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   ---    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   ---    Old_age   Always       -       2607
 12 Power_Cycle_Count       0x0032   099   099   ---    Old_age   Always       -       721
177 Wear_Leveling_Count     0x0013   099   099   ---    Pre-fail  Always       -       17


The problem is, if this keeps up, I will reach the 300TBW warranty limit in only 260 days!!! That is quite short of the 10 years I could have (I am hoping to keep the drives in operation for at least 5 years and stay within the warranty period/TBW).

My Proxmox hypervisor currently has 3 OSes installed on it: SBS2011, Win2k3 Std, and Win7 Pro. None of them are very active. I included pictures of the average daily IO, hope that helps.

According to this article (http://www.anandtech.com/show/8239/update-on-samsung-850-pro-endurance-vnand-die-size), a Samsung 850 Pro can apparently do 6000 P/E cycles (based on his test of SMART attribute 177, the calculated result was 6000 cycles). That would give me a total of 6.3 years, based on ~50GB/hour written.

The problem with that is that the SSDs can die much sooner than the 6000-cycle count. I am monitoring SMART, however, it makes me a bit uncomfortable to gamble like that in a production environment.

The good thing is that I can still return the SSDs if this does not work out. However, I want to use the SSDs if that is possible. The Win2k3 VM has a database, so IO is needed.

How can I fix this? Where should I look? I do not find the wiki too helpful, unless I am looking in all the wrong places.

I am comfortable with the command line, however, I do not have too much experience with Linux in general (more of a Windows guy).

Please help.

THANK YOU!!!


Attachments: Proxmox Status 25Nov.PNG, WinSBS2011-IO.PNG, Win2k3-IO.PNG
 
Is there some way you can use zpool iostat or some other utility to double check and make sure the SMART reading is accurate?
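For example, something like this would show live per-vdev throughput to compare against the SMART counters (a minimal sketch; it assumes the pool is named rpool, as later posts show):

Code:
# refresh every 5 seconds, broken down per vdev
zpool iostat -v rpool 5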
 
...

If I estimate the usage I have had up till now, I would say I wrote 1TB or less in the 94 hours of uptime of the drives.
According to SMART attribute 177 below, I wrote 4.5TB. That is 49GB/hour!!!
That means approx. 3.5TB was written that did not come from me copying stuff onto the server, creating VMs, etc. It was created by the system and the VMs running.
...
Hi,
do you see many writes on the disks with iostat? (apt-get install sysstat)

Like
Code:
iostat -dm 5 sdb sdc sdd
BTW, on the Ceph mailing list some people have used Samsung (non-DC) drives as journal SSDs - they have died suddenly without warning before!

Udo
 
BTW, on the Ceph mailing list some people have used Samsung (non-DC) drives as journal SSDs - they have died suddenly without warning before!

Udo

That's what I am worried about: a sudden, no-notice, quick death that takes everything with it ... worst case scenario ...

I will run Process Explorer for 24 hours and see where the IO writes are coming from. I will also try iostat.

I am debating returning the SSDs and getting 4x 15k SAS drives instead, using (smaller) SSDs for L2ARC and ZIL (if needed).
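For reference, adding such devices to an existing pool later is a single zpool operation; a rough sketch with placeholder pool and device names:

Code:
# attach a small SSD as read cache (L2ARC) and a mirrored pair as SLOG
zpool add tank cache /dev/disk/by-id/ata-SMALL_SSD_1
zpool add tank log mirror /dev/disk/by-id/ata-SMALL_SSD_2 /dev/disk/by-id/ata-SMALL_SSD_3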

I will report back.

Thank you so far!
 
This is the result of iostat

Code:
root@VMNode01:/# iostat -dm 5 sda sdb sdc
Linux 4.2.3-2-pve (VMNode01)    11/25/2015      _x86_64_        (16 CPU)


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda             105.41         0.73         1.92      61085     161841
sdb             105.39         0.73         1.92      61560     161816
sdc             106.82         0.73         1.92      61044     161893

Hope that helps.

EDIT:

I also wanted to add the output of "zpool iostat"

Code:
root@VMNode01:/# zpool iostat
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool        114G  1.28T     32    109  2.12M  3.71M
 
Don't journal drives wear the fastest?
Might not be a good reference point.
What you need to do is work out whether the volume of data being written to the SSDs is what you expected, and whether it is sent to the SSDs in a way that maximizes their life. I know we don't have TRIM, but things like block size tuning might be relevant.
 
Don't journal drives wear the fastest?
Might not be a good reference point.
[...]

Yes they do, because each write that goes to an OSD gets put on the journal first. Then, after a sync interval, the journal gets trimmed again. This process is done for every replicated copy of a file placed on other OSDs.

Now let's say you have a file replicated 3 times, then your journal devices eventually get 6 writes.
The kicker is that most journals back multiple OSDs. So if you e.g. have a journal SSD backing 5 HDDs, you end up with a 10x increase in writes over a single OSD, so you wear them out a lot faster because you are writing a lot more data to them.


The 840 EVO has generally been plagued by performance issues; this is just the latest of them:
http://www.thessdreview.com/daily-n...date-and-magician-4-6-software-now-available/

Although Samsung does not list a TBW for the 840 EVO 512GB, they list 40GB/day for 22 years, or roughly 300 TBW (the 128GB ones are rated for only 100 TBW btw, which is probably where their bad rep as a journal comes from).

I think it is telling that on the new 850 EVOs, Samsung is now using the same controller as on the Pro variants.
 
Yes they do, because each write that goes to an OSD gets put on the journal first. Then, after a sync interval, the journal gets trimmed again. This process is done for every replicated copy of a file placed on other OSDs.

Now let's say you have a file replicated 3 times, then your journal devices eventually get 6 writes.
The kicker is that most journals back multiple OSDs. So if you e.g. have a journal SSD backing 5 HDDs, you end up with a 10x increase in writes over a single OSD, so you wear them out a lot faster because you are writing a lot more data to them.


The 840 EVO has generally been plagued by performance issues; this is just the latest of them:
http://www.thessdreview.com/daily-n...date-and-magician-4-6-software-now-available/

Although Samsung does not list a TBW for the 840 EVO 512GB, they list 40GB/day for 22 years, or roughly 300 TBW (the 128GB ones are rated for only 100 TBW btw, which is probably where their bad rep as a journal comes from).

I think it is telling that on the new 850 EVOs, Samsung is now using the same controller as on the Pro variants.

I have an 840 EVO in my gaming PC, but it doesn't do enough to be a benchmark.
Here is an endurance test on the 840 Pro; it comes up pretty well: http://techreport.com/review/27062/the-ssd-endurance-experiment-only-two-remain-after-1-5pb/2

The secret with commodity SSDs is to keep spares, and make sure you have smartmontools or similar email you if bad sectors start to appear. Does a RAIDZx of pure SSDs need an SSD SLOG at all? Does the ZIL wear out an SSD fast, or is it spread across the drives?
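If it helps, a minimal smartd configuration along those lines could look like this (a sketch; the mail address is a placeholder, and it assumes smartmontools plus a working mail setup on the host):

Code:
# /etc/smartd.conf - monitor all SMART attributes on each SSD, mail on trouble,
# send a test mail at daemon start, and run a short self-test nightly at 02:00
/dev/sda -a -m admin@example.com -M test -s (S/../.././02)
/dev/sdb -a -m admin@example.com -M test -s (S/../.././02)
/dev/sdc -a -m admin@example.com -M test -s (S/../.././02)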
 
So, I started Process Explorer last night and also iostat on Proxmox.

For some reason, Proxmox crashed and rebooted. The syslog does not tell me why it happened.
I restarted both programs again today and will have them run for a few hours during work hours to get some numbers. Will report back.

To address some posts from before:

I do not have EVOs, I use the Pros. The TLC used in the EVOs is definitely not good for a production/server environment. I have smartmontools set up, however, my biggest fear is the sudden, no-notice death ... that is the reason why I cannot sleep at night ... and I love to sleep ...

I can see that a lot of journaling is going on. Is there a way to slow it down or turn it off? I am not familiar with journals and their function, maybe someone can enlighten me. However, I thought that journaling is only used by extX filesystems, not ZFS.

Would it make sense for me to use 15k spindles (maybe even in RAID10) for the root partition (I assume that is where the journaling is happening) and then use SSDs for the actual VMs? How would that affect the performance of the VMs - will they still be as fast as in a pure SSD environment, or will journaling on 15k spindles slow down the VMs being stored and run off the SSDs?

Thanks so far.
 
You seem to have a high write I/O profile. The iostat output Udo asked for shows 161GB written to each SSD, 61GB read, and about 2MB/sec written since the last reboot.
Btw, did you buy these drives new?
What is your current smartctl status (13 hours after your first post)?
 
I purchased the drives brand new last Saturday (Nov. 21st).

This is the smartctl output for each drive:

Code:

root@VMNode01:/# smartctl -a /dev/sda
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.2.3-2-pve] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 850 PRO 512GB
Serial Number:    S250NWAG904013M
LU WWN Device Id: 5 002538 870136306
Firmware Version: EXM02B6Q
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Nov 26 09:39:19 2015 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED


General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         (    0) seconds.
Offline data collection
capabilities:              (0x53) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 272) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.


SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       108
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       4
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       10
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   074   069   000    Old_age   Always       -       26
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       2
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       1725710600


SMART Error Log Version: 1
No Errors Logged


SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


root@VMNode01:/# smartctl -a /dev/sdb
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.2.3-2-pve] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 850 PRO 512GB
Serial Number:    S250NWAG905249J
LU WWN Device Id: 5 002538 8701373c2
Firmware Version: EXM02B6Q
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Nov 26 09:39:24 2015 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED


General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         (    0) seconds.
Offline data collection
capabilities:              (0x53) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 272) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.


SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       89
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       3
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       9
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   073   068   000    Old_age   Always       -       27
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       2
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       1207517928


SMART Error Log Version: 1
No Errors Logged


SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


root@VMNode01:/# smartctl -a /dev/sdc
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.2.3-2-pve] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 850 PRO 512GB
Serial Number:    S250NXAG936968X
LU WWN Device Id: 5 002538 8400a6bf7
Firmware Version: EXM02B6Q
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Nov 26 09:39:26 2015 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED


General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         (    0) seconds.
Offline data collection
capabilities:              (0x53) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 272) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.


SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       108
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       3
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       10
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   074   069   000    Old_age   Always       -       26
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       1
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       1726482600


SMART Error Log Version: 1
No Errors Logged


SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


EDIT: You will see that sdb has fewer hours than sda or sdc. That is because I had a mirror first (due to hardware limitations that I fixed later), and then I switched to RAIDZ1.


This is my zpool iostat (every 60 seconds):
Code:
zpool iostat rpool 60

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool        116G  1.28T     12     65   769K  1.52M
rpool        116G  1.28T      3     58   142K  1.31M
rpool        116G  1.28T      3     64  94.9K  1.35M
rpool        116G  1.28T      1     68  87.5K  1.55M
rpool        116G  1.28T      1     79  79.5K  2.06M
rpool        116G  1.28T      2     78   118K  2.17M
rpool        116G  1.28T      5     67   505K  1.56M
rpool        116G  1.28T      0     69  65.3K  1.74M
rpool        116G  1.28T      0     64  21.3K  1.13M
rpool        116G  1.28T      0     52  9.93K   862K
rpool        116G  1.28T      0     41  4.93K   491K
rpool        116G  1.28T      0     41  3.20K   437K
rpool        116G  1.28T      1     54  86.1K  1.03M



This is the iostat output for each drive this morning (every 60 seconds):
Code:
root@VMNode01:/# iostat -dm 60 sda sdb sdc
Linux 4.2.3-2-pve (VMNode01)     11/26/2015     _x86_64_    (16 CPU)


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              48.34         0.28         0.82       2687       7908
sdb              48.35         0.28         0.82       2684       7909
sdc              48.96         0.28         0.82       2689       7911


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              35.38         0.02         0.77          1         46
sdb              35.77         0.02         0.77          1         46
sdc              35.82         0.02         0.77          1         46


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              45.55         0.03         0.97          1         58
sdb              45.42         0.03         0.98          1         58
sdc              45.67         0.02         0.97          1         58


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              28.68         0.01         0.62          0         37
sdb              28.33         0.01         0.62          0         37
sdc              29.57         0.01         0.61          0         36


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              25.42         0.00         0.57          0         34
sdb              25.28         0.00         0.57          0         34
sdc              25.83         0.00         0.57          0         34


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              53.93         0.72         0.67         43         40
sdb              54.95         0.72         0.67         43         40
sdc              55.00         0.70         0.67         41         40


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              31.98         0.02         0.64          1         38
sdb              32.05         0.02         0.64          1         38
sdc              32.65         0.02         0.64          1         38


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              39.45         0.01         0.77          0         46
sdb              40.30         0.01         0.78          0         46
sdc              40.45         0.01         0.77          0         46


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              43.12         0.01         0.62          0         36
sdb              44.70         0.01         0.62          0         37
sdc              43.27         0.01         0.62          0         36


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              38.40         0.04         0.56          2         33
sdb              39.63         0.04         0.57          2         34
sdc              39.22         0.04         0.56          2         33


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              34.30         0.00         0.50          0         29
sdb              35.63         0.00         0.51          0         30
sdc              34.78         0.00         0.50          0         30


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              52.67         0.01         0.73          0         43
sdb              53.00         0.01         0.73          0         44
sdc              52.75         0.01         0.73          0         44


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              63.18         0.02         1.01          1         60
sdb              63.45         0.02         1.01          1         60
sdc              64.03         0.02         1.01          1         60


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              49.58         0.02         0.75          1         45
sdb              48.68         0.02         0.75          1         45
sdc              49.75         0.02         0.76          1         45


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              39.77         0.01         0.58          0         34
sdb              39.90         0.01         0.58          0         34
sdc              40.13         0.01         0.58          0         34


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              51.12         0.16         0.79          9         47
sdb              51.77         0.21         0.79         12         47
sdc              51.62         0.21         0.79         12         47


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              48.45         0.01         0.69          0         41
sdb              47.88         0.02         0.69          1         41
sdc              49.00         0.02         0.69          1         41


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              51.95         0.01         0.75          0         44
sdb              52.00         0.01         0.75          0         44
sdc              52.53         0.00         0.75          0         44


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              41.90         0.01         0.58          0         34
sdb              42.07         0.01         0.58          0         34
sdc              42.78         0.01         0.58          0         34


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              43.33         0.01         0.65          0         39
sdb              43.30         0.01         0.65          0         39
sdc              43.50         0.01         0.65          0         39


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              64.74         0.57         0.73         34         43
sdb              64.27         0.55         0.73         32         43
sdc              66.39         0.58         0.73         34         43


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              66.88         0.26         0.88         15         52
sdb              66.72         0.27         0.88         16         52
sdc              67.35         0.24         0.88         14         53


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              86.67         0.04         1.50          2         90
sdb              87.68         0.05         1.51          2         90
sdc              87.35         0.04         1.51          2         90


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              50.77         0.07         0.85          4         50
sdb              50.07         0.06         0.85          3         50
sdc              50.73         0.06         0.85          3         51


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              28.03         0.05         0.68          2         40
sdb              28.13         0.04         0.69          2         41
sdc              27.83         0.04         0.69          2         41


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              33.42         0.03         0.73          1         44
sdb              33.85         0.04         0.74          2         44
sdc              34.12         0.04         0.73          2         44


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              35.40         0.03         0.74          1         44
sdb              35.42         0.03         0.75          1         44
sdc              35.63         0.03         0.75          1         44


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              47.82         0.03         1.22          1         72
sdb              48.42         0.03         1.22          1         73
sdc              49.40         0.03         1.22          1         73


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              38.15         0.03         0.96          1         57
sdb              38.37         0.03         0.96          1         57
sdc              39.23         0.04         0.96          2         57


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              38.10         0.03         0.99          1         59
sdb              38.33         0.03         0.99          1         59
sdc              38.72         0.02         0.99          1         59


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              40.98         0.15         0.93          9         56
sdb              41.28         0.20         0.94         11         56
sdc              41.72         0.14         0.93          8         55


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              41.87         0.01         0.68          0         40
sdb              42.03         0.01         0.68          0         40
sdc              42.43         0.01         0.68          0         40


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              29.03         0.00         0.50          0         29
sdb              29.13         0.00         0.50          0         29
sdc              29.92         0.00         0.50          0         29


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              18.07         0.00         0.33          0         19
sdb              18.47         0.00         0.33          0         19
sdc              18.38         0.00         0.33          0         19


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              15.90         0.00         0.25          0         15
sdb              16.28         0.00         0.25          0         15
sdc              16.60         0.00         0.25          0         15


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              23.35         0.03         0.53          2         32
sdb              22.83         0.03         0.53          1         32
sdc              23.40         0.02         0.53          1         32


Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              34.28         0.00         0.63          0         37
sdb              34.52         0.00         0.63          0         37
sdc              34.58         0.00         0.63          0         37
 
Last edited:
That's strange. For each ~1MB written to the pool, ~120MB reach the drives. In a raidz config this would be expected to be ~3x.

Ignore attribute 177 for counting the writes, because an SSD has write amplification. Attribute 241 gives you the total number of 512-byte LBAs written, so ~820GB on each drive. 10 P/E cycles (attr 177) on a 512GB drive means 5.1TB, so ~6x write amplification.
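As a rough cross-check of those numbers, attribute 241 can be read and converted to bytes for each drive (a sketch; the device names are the ones from this thread, and the awk field assumes the usual smartctl attribute table layout):

Code:
# host writes per drive: attribute 241 counts 512-byte LBAs
for d in sda sdb sdc; do
    smartctl -A /dev/$d | awk -v d=$d '/Total_LBAs_Written/ {printf "%s: %.0f GB written by host\n", d, $10*512/1e9}'
done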

Did you set the recordsize on the ZFS dataset to a low value, by chance?

Off-topic: why don't you like zvols? There is not a single advantage to using qcow2 on a ZFS dataset.
 
Did you set the recordsize on the ZFS dataset to a low value, by chance?

I did not set anything differently. It was just a straight install from USB with default settings.
What would you recommend I change? (I do not know what the default recordsize is, and I also don't know what is recommended and how to change it.)
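For what it's worth, the recordsize is just a dataset property that can be inspected and changed at any time (a sketch; rpool is the pool from this thread, the 64K value is only an example, and a change only affects newly written data):

Code:
# show the current recordsize for the pool and its datasets (default is 128K)
zfs get -r recordsize rpool

# example: lower it on the dataset holding the VM images
zfs set recordsize=64K rpool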


Off-topic: why don't you like zvols? There is not a single advantage to using qcow2 on a ZFS dataset.
I don't have a preference. I like ZFS because of the way it handles data (securely). How would you recommend I do the setup? I was originally planning on using 4 SSDs in RAID10 (and for safety I will actually use RAIDZ2). So the plan is (if I stick to SSDs) to do the setup again as soon as I can find the same SSD in town (sold out at the moment).
Are you suggesting using zvols and raw VM files? Do I get that right?

EDIT:
Ignore attribute 177 for counting the writes, because an SSD has write amplification. Attribute 241 gives you the total number of 512-byte LBAs written, so ~820GB on each drive. 10 P/E cycles (attr 177) on a 512GB drive means 5.1TB, so ~6x write amplification.
Can I change the amplification to 3x instead of 6x? Or is this something that is drive-specific?
 
ZVOL and "raw" vm files. There is no actual file, the volume is a device.
Also, do not do raidz (worse z2/3) on SSDs, it looks like it has huge write amplification:https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-discuss/hUlryHtJMnw

1. Use "raid10" like setup, which is striped mirrors. It is fast and pretty safe. You get cheap backups with ZFS anyway (using incremental send/receive) so you can backup often (e.g. google zrep)
2. check the pool ashift value (zdb | grep ashift). This is informational. Also please post the output of lsblk -dt /dev/sd?
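Creating such a striped-mirror ("RAID10"-like) pool looks roughly like this (a sketch, not the installer's layout; the pool name tank and the device paths are placeholders, and the ashift should be chosen to match the drives' sector/page size as discussed below):

Code:
# two mirrored pairs striped together ("RAID10" style)
zpool create -o ashift=12 tank \
    mirror /dev/disk/by-id/ata-SSD_1 /dev/disk/by-id/ata-SSD_2 \
    mirror /dev/disk/by-id/ata-SSD_3 /dev/disk/by-id/ata-SSD_4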


For zvol/raw, do this:

Code:
# zfs create rpool/vms

In the UI, add a storage of type ZFS and select the pool to be "rpool/vms". That's all. You can do this now, no need to reinstall.
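If you prefer the command line over the UI, the same storage entry can likely be added with pvesm (a sketch; the storage ID ssd-vms is just an example name):

Code:
# register rpool/vms as a ZFS storage for VM disks and container volumes
pvesm add zfspool ssd-vms -pool rpool/vms -content images,rootdir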
 
ZVOL and "raw" vm files. There is no actual file, the volume is a device.
Do you have some suggestions where I find info on that, so I can learn and maybe do a test setup?

Also, do not do raidz (worse z2/3) on SSDs, it looks like it has huge write amplification:https://groups.google.com/a/zfsonlin...ss/hUlryHtJMnw
I will go back to my original plan and use RAID10.





zdb | grep ashift
This does not give me any output. It also does not give an error ...

Just doing zdb produces this output:
root@VMNode01:/# zdb
cannot open '/etc/zfs/zpool.cache': No such file or directory
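When the cache file is missing, the ashift can usually still be read from a vdev label on disk (a sketch; the partition path is a guess, point it at one of the pool's data partitions):

Code:
# the on-disk vdev label records the ashift the pool was created with
zdb -l /dev/sda2 | grep ashift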




lsblk -dt /dev/sd?
Code:
root@VMNode01:/# lsblk -dt /dev/sd?
NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED    RQ-SIZE  RA WSAME
sda          0    512      0     512     512    0 deadline     128 128    0B
sdb          0    512      0     512     512    0 deadline     128 128    0B
sdc          0    512      0     512     512    0 deadline     128 128    0B
sdd          0    512      0     512     512    1 deadline     128 128    0B



EDIT:
For zvol/raw, do this:

Code:
# zfs create rpool/vms

In the UI, add a storage of type ZFS and select the pool to be "rpool/vms". That's all. You can do this now, no need to reinstall.
I will try this in a test environment first, so I can get comfortable with it and hopefully understand it better.
Does that mean that root is installed on ext4 (LVM), and on top of that I create zpools?
 
Not good. The sector size reported by the flash devices is 512 bytes, so ashift should be 9. I assume this pool was created by the Proxmox installer. For RAIDZ this means, at the very least, wasted space.

If you don't mind a suggestion, I would go with a pair of small SSDs (I use a single one, 32GB) for the root mirror (you don't need ZFS for that) and create your own setup/layout of the "storage" ZFS pool (your big SSDs). All you need to back up for a new install is the /etc/pve folder (/etc/pve/qemu-server for KVM and /etc/pve/lxc for containers).
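For the config backup mentioned there, a plain archive of the directory is enough (a sketch; the output path is arbitrary):

Code:
# /etc/pve is the pmxcfs-backed config tree; archive it before reinstalling
tar czf /root/pve-config-backup.tar.gz /etc/pve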
 
If you don't mind a suggestion, I would go with a pair of small SSDs (I use a single one, 32GB) for the root mirror (you don't need ZFS for that) and create your own setup/layout of the "storage" ZFS pool (your big SSDs). All you need to back up for a new install is the /etc/pve folder (/etc/pve/qemu-server for KVM and /etc/pve/lxc for containers).

I don't mind doing a new setup. I created the zpool as you suggested, and I think I get it.

So you are suggesting to use 2 small SSDs (mirror) for root, and then 2 large SSDs (mirror) for all my VMs. Root will be LVM (ext4), and the large SSDs will also be LVM (ext4) with a zpool on top of that? Do I get it right?
Since I have 3 SSDs (and hopefully 4 soon), can I do the same, but with 4 SSDs in RAID10?

I am only able to set up Proxmox with the installer. What you are suggesting is a manual setup. I am not comfortable doing that unless I have some guide I can follow.

I appreciate you helping me. Thank you!!!
 
