zfs: finding the bottleneck

michaeljk · Sep 5, 2016

We switched from a HP hardware raid (HP P420i) system to our new Dell R730xd system with 128GB RAM and ZFS. The integrated PERC H730 Mini controller has been configured for HBA mode. There are 30 VMs running (KVM with zvol plugin, thin-provisioning), with assigned memory of 33GB (sum). Most VMs are using the IDE-Bus (no Virtio) because we migrated them from physical machines some years ago.

When the Dell system is freshly booted, it behaves really fine - it needs some time to load the data from the Toshiba SATA harddisks, but then it's stable. ARC is limited to 64GB and the system begins slowly to fill this until the limit is reached. After some hours/days with disk activity (e.g. rsync backup jobs at night) the first I/O problems are beginning - some VMs behave "sluggy" and unresponsive, others are working. Today, one of the bigger VMs showed very high I/O wait (> 80-90%) and was nearly unresponsive. After a reboot of the Dell host, performance is back again.

Unfortunatly I cannot find the bottleneck in the system - in my opinion, there could be 3 main reasons:

1) The H730 controller cannot deliver the performance which is needed
2) The SATA drives are too slow
3) There's not enough RAM for the VMs

Some info about the system:

zpool status

Code:

  pool: rpool
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     0
     mirror-0  ONLINE       0     0     0
       sda2    ONLINE       0     0     0
       sdb2    ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
  scan: none requested
config:

    NAME                                                     STATE     READ WRITE CKSUM
    tank                                                     ONLINE       0     0     0
     mirror-0                                               ONLINE       0     0     0
       ata-TOSHIBA_DT01ACA200_13S8D22AS                     ONLINE       0     0     0
       ata-TOSHIBA_DT01ACA200_43N2AXSGS                     ONLINE       0     0     0
     mirror-1                                               ONLINE       0     0     0
       ata-TOSHIBA_DT01ACA200_43N2EV5GS                     ONLINE       0     0     0
       ata-TOSHIBA_DT01ACA200_43N2JP8GS                     ONLINE       0     0     0
     mirror-2                                               ONLINE       0     0     0
       ata-TOSHIBA_DT01ACA200_43O0ZHLAS                     ONLINE       0     0     0
       ata-TOSHIBA_DT01ACA200_43O2H8VGS                     ONLINE       0     0     0
     mirror-3                                               ONLINE       0     0     0
       ata-TOSHIBA_DT01ACA200_43O2K1GGS                     ONLINE       0     0     0
       ata-TOSHIBA_DT01ACA200_53VE7WTGS                     ONLINE       0     0     0
    logs
     mirror-4                                               ONLINE       0     0     0
       ata-SAMSUNG_MZ7KM120HAFD-00005_S2HPNX0H500037-part1  ONLINE       0     0     0
       ata-SAMSUNG_MZ7KM120HAFD-00005_S2HPNX0H500035-part1  ONLINE       0     0     0

errors: No known data errors

pveperf /tank

Code:

CPU BOGOMIPS:      115208.52
REGEX/SECOND:      2461409
HD SIZE:           5874.93 GB (tank)
FSYNCS/SECOND:     5621.81
DNS EXT:           32.33 ms
DNS INT:           97.47 ms

zpool iostat -v 2 300

Code:

                                                            capacity     operations    bandwidth

pool                                                     alloc   free   read  write   read  write
-------------------------------------------------------  -----  -----  -----  -----  -----  -----
rpool                                                    1.97G   109G      0     61      0   237K
  mirror                                                 1.97G   109G      0     61      0   237K
    sda2                                                     -      -      0     23      0   266K
    sdb2                                                     -      -      0     23      0   266K
-------------------------------------------------------  -----  -----  -----  -----  -----  -----
tank                                                     1.29T  5.96T     33  2.56K   118K  14.5M
  mirror                                                  329G  1.49T      9    537  23.2K  2.93M
    ata-TOSHIBA_DT01ACA200_13S8D22AS                         -      -      3     40  20.0K  2.94M
    ata-TOSHIBA_DT01ACA200_43N2AXSGS                         -      -      5     42  24.0K  2.94M
  mirror                                                  329G  1.49T      6    821  13.5K  4.17M
    ata-TOSHIBA_DT01ACA200_43N2EV5GS                         -      -      2     51  14.0K  4.19M
    ata-TOSHIBA_DT01ACA200_43N2JP8GS                         -      -      3     54  14.0K  4.19M
  mirror                                                  329G  1.49T      4    502  22.0K  2.88M
    ata-TOSHIBA_DT01ACA200_43O0ZHLAS                         -      -      2     25  16.0K  2.90M
    ata-TOSHIBA_DT01ACA200_43O2H8VGS                         -      -      1     26  10.0K  2.90M
  mirror                                                  329G  1.49T     13    752  59.2K  3.83M
    ata-TOSHIBA_DT01ACA200_43O2K1GGS                         -      -      5     50  56.0K  3.85M
    ata-TOSHIBA_DT01ACA200_53VE7WTGS                         -      -      4     48  24.0K  3.85M
logs                                                         -      -      -      -      -      -
  mirror                                                 4.97M  15.9G      0      5      0   674K
    ata-SAMSUNG_MZ7KM120HAFD-00005_S2HPNX0H500037-part1      -      -      0      5      0   674K
    ata-SAMSUNG_MZ7KM120HAFD-00005_S2HPNX0H500035-part1      -      -      0      5      0   674K
-------------------------------------------------------  -----  -----  -----  -----  -----  -----

vmstat 1

Code:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  1      0 64548264 18282908 100120    0    0   584   633   82  643  8  2 81  9  0
 2  0      0 64549680 18283156 100216    0    0   304   596 34982 74746  9  1 89  1  0
 3  0      0 64549676 18283648 100016    0    0   260    60 34683 75484  9  1 89  1  0
 2  0      0 64544416 18284772  99816    0    0  1792 65060 35431 78989 10  3 86  2  0
 4  0      0 64544316 18285244 100224    0    0  1812  2080 34002 74553 10  1 87  1  0
 4  0      0 64540464 18285784 100052    0    0  3976    64 34523 74093 11  1 87  1  0
 4  0      0 64539544 18286044 100328    0    0   444   220 35031 73226 14  2 84  1  0
 3  2      0 64537408 18286492 100116    0    0  2568   392 34367 73940 11  2 86  1  0
 3  0      0 64536724 18286848 100220    0    0   576 21924 34567 77946 11  2 86  1  0
 2  0      0 64535928 18287112 100352    0    0   828  2732 34894 75539 11  1 87  1  0
 2  0      0 64533608 18287368 100260    0    0    96  1112 34360 74720  8  1 90  0  0
 4  0      0 64533348 18287576 100316    0    0   176  1592 33807 72820  9  1 89  1  0
 2  1      0 64532112 18288024 100204    0    0  1132   108 33895 74384 10  1 88  1  0
 2  0      0 64512964 18293328 100008    0    0 16332 18192 36548 84743  9  2 88  1  0
 2  1      0 64507796 18295756  99480    0    0  5444  4576 34483 77125 10  1 87  2  0

uptime + arc-stats

Code:

 18:50:27 up  2:58,  2 users,  load average: 1.97, 2.00, 1.95

    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
18:50:31     3     0      0     0    0     0    0     0    0    45G   64G  
18:50:32    28     7     25     5   19     2  100     2   11    45G   64G  
18:50:33   443    67     15    62   14     5   50     3   60    45G   64G  
18:50:34  2.6K    47      1    13    0    34   64     5   22    45G   64G  
18:50:35  4.5K    15      0    15    0     0    0     0    0    45G   64G  
18:50:36  1.0K   290     28    33    4   257   71     1    4    45G   64G  
18:50:37   694   169     24    11    2   158   49     2    0    45G   64G  
18:50:38    90    18     20    11   13     7  100     2  100    45G   64G  
18:50:39     3     1     33     1   33     0    0     0    0    45G   64G  
18:50:40    94     7      7     7    7     0    0     0    0    45G   64G

/etc/modprobe.d/zfs.conf

Code:

options zfs zfs_arc_max=68719476736

free

Code:

             total       used       free     shared    buffers     cached
Mem:     131915736   94668292   37247444      55156   23380992     100564
-/+ buffers/cache:   71186736   60729000
Swap:      8388604          0    8388604

michaeljk · Sep 5, 2016

pveversion -v

Code:

proxmox-ve: 4.2-60 (running kernel: 4.4.15-1-pve)
pve-manager: 4.2-17 (running version: 4.2-17/e1400248)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.15-1-pve: 4.4.15-60
lvm2: 2.02.116-pve2
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-43
qemu-server: 4.0-85
pve-firmware: 1.1-8
libpve-common-perl: 4.0-72
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-56
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-qemu-kvm: 2.6-1
pve-container: 1.0-72
pve-firewall: 2.0-29
pve-ha-manager: 1.0-33
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.3-4
lxcfs: 2.0.2-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5.7-pve10~bpo8

zpool list

Code:

NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool   111G  1.97G   109G         -     0%     1%  1.00x  ONLINE  -
tank   7.25T  1.29T  5.96T         -    10%    17%  1.00x  ONLINE  -

There's no deduplication in use, lz4 compression is active for all volumes. atime is off for /tank, ashift=12.

My guess is that this is (hopefully) a RAM-problem. ARC fills up very slowly, so perhaps 64GB is already oversized? Will the VMs use more then the configured 33GB (because of caches, buffering, ...)? If so, where can I check the sum of RAM used without ZFS arc?

fireon · Sep 5, 2016

Where is you cache, i only see logdisk. Have a look at my config:

Code:

pool  alloc  free  read  write  read  write 
-----------------------------------------------------  -----  -----  -----  -----  -----  ----- 
rpool  2.77G  25.0G  0  31  1.08K  149K 
 mirror  2.77G  25.0G  0  31  1.08K  149K 
 sdh3  -  -  0  16  633  161K 
 sdi3  -  -  0  15  587  161K 
-----------------------------------------------------  -----  -----  -----  -----  -----  ----- 
v-machines  3.44T  2.00T  5  124  178K  657K 
 mirror  1.26T  569G  0  29  42.5K  122K 
 ata-WDC_WD2001FFSX-68JNUN0_WD-WMC5C0D0KRWP  -  -  0  7  21.7K  187K 
 ata-WDC_WD2001FFSX-68JNUN0_WD-WMC5C0D0AZMA  -  -  0  7  21.7K  187K 
 mirror  1.34T  487G  1  35  49.8K  146K 
 ata-WDC_WD2001FFSX-68JNUN0_WD-WMC5C0D688XW  -  -  0  18  26.4K  225K 
 ata-WDC_WD2001FFSX-68JNUN0_WD-WMC5C0D63WM0  -  -  0  18  26.4K  225K 
 mirror  867G  989G  2  57  85.4K  247K 
 ata-WDC_WD2001FFSX-68JNUN0_WD-WMC5C0D6KCJD  -  -  1  22  44.8K  372K 
 ata-WDC_WD2001FFSX-68JNUN0_WD-WMC5C0D8180T  -  -  1  22  45.0K  372K 
logs  -  -  -  -  -  - 
 ata-SAMSUNG_MZ7KM240HAGR-00005_S2HRNXAH300789-part1  2.20M  49.7G  0  2  4  142K 
cache  -  -  -  -  -  - 
 ata-SAMSUNG_MZ7KM240HAGR-00005_S2HRNXAH300789-part2  67.8G  106G  2  7  19.5K  280K 
-----------------------------------------------------  -----  -----  -----  -----  -----  -----

https://pve.proxmox.com/wiki/Storage:_ZFS#Create_a_new_pool_with_Cache_.28L2ARC.29

I don't know that the hbamode from PERC H730 really set the controller to an real SAS controller. On my controller i had do erase the bios completly. On ZFS forums that is highly recommended. We use also DELL with the same H730, but only with HW Raidfunction. But we use DELL other models also with ZFS, but only with real SATA and real SAS controler.

LnxBil · Sep 6, 2016

SATA is not (in general) a good choice for ZFS per se. Please check the IO response time with something like this per disk:

Code:

iostat -x 5

IO errors on SATA will not immediately be answered (for devices with are not suited for enterprise usage), such that single IO operation can vary significantly in time and that will slow down everything. "Good" SATA Enterprise disks will return an disk error much quicker.

michaeljk · Sep 6, 2016

iostat looks good:

Code:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.39    0.00    2.16   11.90    0.00   71.55

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.20     0.00    2.20   12.20    11.20   130.40    19.67     0.00    0.06    0.00    0.07   0.06   0.08
sdb               0.20     0.00    2.20   12.40    14.40   130.40    19.84     0.00    0.05    0.00    0.06   0.05   0.08
sdl               0.00     0.00    0.00   11.00     0.00   374.40    68.07     0.00    0.29    0.00    0.29   0.15   0.16
sdk               0.00     0.00    0.00   11.00     0.00   374.40    68.07     0.00    0.29    0.00    0.29   0.15   0.16
sdf               1.40     0.00   36.20   23.60   226.40  1151.20    46.07     0.49    8.24   13.41    0.31   3.53  21.12
sdj               0.20     0.00   35.60   37.80   252.00  2446.40    73.53     0.54    7.36   14.83    0.32   2.68  19.68
sdh               0.00     0.00   38.40   38.00   246.40  2446.40    70.49     0.43    5.68   10.98    0.32   1.88  14.40
sdi               0.40     0.00   29.40  130.80   234.40   732.80    12.07     5.27   23.18   97.90    6.39   6.23  99.84
sdc               0.20     0.00   38.80  171.80   231.20   789.60     9.69     2.35    7.72   24.95    3.83   4.30  90.64
sde               0.40     0.00   40.20   22.60   244.80  1151.20    44.46     0.68   10.82   16.72    0.32   3.77  23.68
sdd               0.20     0.00   31.20  329.60   416.00  1776.00    12.15     1.38    3.83   29.23    1.43   1.82  65.52
sdg               0.20     0.00   42.80  329.60   260.80  1776.00    10.94     1.16    3.13   16.62    1.38   1.69  63.12

After the first night with some rsync-tasks, the Proxmox website shows an I/O delay between 4-15%, Load Average 3,5-6, RAM usage has increased (total: 125.80 GB, used: 97.99).

free

Code:

             total       used       free     shared    buffers     cached
Mem:     131915736  130358716    1557020      51248   27503396      75072
-/+ buffers/cache:  102780248   29135488
Swap:      8388604     184076    8204528

arc-stats

Code:

    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
16:28:19     3     1     33     1   33     0    0     0    0    63G   63G  
16:28:20  2.1K   179      8   179    8     0    0     2   14    63G   63G  
16:28:21  6.8K   925     13   799   11   126  100     4    3    63G   63G  
16:28:22   586    19      3    18    3     1  100     0    0    63G   63G  
16:28:23   679    19      2    19    2     0    0     5  100    63G   63G  
16:28:24   277   254     91     6   20   248  100     1  100    63G   63G  
16:28:25   107    12     11    10    9     2  100     0    0    63G   63G

So it seems that the complete memory has been used and now the system begins to feel "sluggy" within some VMs (longer login time when using ssh or executing commands). We also used a zfs cache partition before, but this made no difference. Why does Proxmox display "97.99 GB" used on the website, is it the sum of all VM usage? If this is really the case, we have not enough RAM (128 - 97.99 = 30.01, but ZFS arc is configured with 64 GB).

LnxBil · Sep 6, 2016

Yes, the IO times via iostat are not bad, yet 44ms response is tolerable - ideally you have something like 9ms average seek time.

Your free output shows that almost all your ram is used and some parts are swapped out (could have happend before). If you also have swap on ZFS, this can be huge impact on performance, yet there is a separate tank for that, so that is not very likely. (I also had crashes with that constellation. There are some best-practice guidelines for running swap on ZFS - https://pve.proxmox.com/wiki/Storage:_ZFS#SWAP_on_ZFS, yet that is not all. Try to disable sync and metadata caching on swap also.)

I also suggest to install zram-config from the ubuntu package repository which create a compressed ram-based swap before using the "real deal".

Normally, ZFS should release memory if there is a lack of ram, so this should happen if the memory is really needed. Increased io delay means that you want to access stuff that is not fast read. Could you try to perform (at night or not-so-loaded times) a fio disk check on a volume with random IO pattern to see where the maximum of your system lies? Then it'll be easier to determine if you already reached the maximum capacity of your system of not.

michaeljk · Sep 6, 2016

Swap is configured on rpool, which are 2x Samsung SSDs - configured by Proxmox installer with mirror (RAID1). I also set "vm.swappiness = 10" in "/etc/sysctl.conf" like mentioned in the wiki. Yesterday (after the fresh reboot) we had no swap usage at all, so this is a change from today. I'm still wondering why the VMs or ZFS are using more then the configured RAM. arcstat.py shows the 64GB limit, but for which reason does the system use the other 64GB RAM, if I only have configured 33GB of them for the VMs? Will the OS take all the remaining 31G for caches / buffering?

michaeljk · Sep 7, 2016

A few hours ago, I installed the new Proxmox updates and reduced the ARC size to 16 GB. After reboot and now 5 hours uptime, Proxmox gui shows 38.93 GiB of 125.80 GiB RAM used (which would be approx. 33GB used by VMs and the rest for OS).

free

Code:

             total       used       free     shared    buffers     cached
Mem:     131915736   78231260   53684476      55160   37287760     101672
-/+ buffers/cache:   40841828   91073908
Swap:      8388604          0    8388604

IO delay looks good, I will take a look on the system tomorrow after the rsync tasks have been completed. If memory consumption keeps on this level, I could try to increase ARC size to 32GB.

LnxBil · Sep 7, 2016

What do you rsync in a ZFS environment?

michaeljk · Sep 7, 2016

RAM usage seems to be quite stable this time (interface shows 45.20 GiB of 125.80 GiB used now).

free

Code:

total       used       free     shared    buffers     cached
Mem:     131915736  131585728     330008      52012   84086552     116752
-/+ buffers/cache:   47382424   84533312
Swap:      8388604         28    8388576

What do you rsync in a ZFS environment?

As I mentioned, we have just recently switched to the new server. I know that snapshots would be more efficient now, but before we rely on a zfs only solution I want to make sure that the system runs stable.

Could you try to perform (at night or not-so-loaded times) a fio disk check on a volume with random IO pattern to see where the maximum of your system lies? Then it'll be easier to determine if you already reached the maximum capacity of your system of not.

Can you provide me a command line for this? (fio can use many parameters and I want to make it compareable) - I'll test it then on the former HW-Raid system and the new Dell server.

mir · Sep 7, 2016

# This job file tries to mimic the Intel IOMeter File Server Access Pattern
[global]
description=Emulation of Intel IOmeter File Server Access Pattern

[iometer]
bssplit=512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10
rw=randrw
rwmixread=80
direct=1
size=4g
ioengine=libaio
# IOMeter defines the server loads as the following:
# iodepth=1 Linear
# iodepth=4 Very Light
# iodepth=8 Light
# iodepth=64 Moderate
# iodepth=256 Heavy
iodepth=64

michaeljk · Sep 7, 2016

Thank you, I'll do the fio tests as soon as possible. There's another thing that I noticed today:

Code:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sde               0.00     0.00    0.50    0.00     2.00     0.00     8.00     0.00    8.00    8.00    0.00   8.00   0.40
sdf               0.00     0.00    1.00    0.00     4.00     0.00     8.00     0.01   14.00   14.00    0.00  14.00   1.40
sdj               0.00     0.00    4.00    0.00    16.00     0.00     8.00     0.00    0.50    0.50    0.00   0.50   0.20
sdi               0.00     0.00    6.50  159.50    32.00   972.00    12.10     4.27   14.34  215.08    6.16   6.02 100.00
sdd               0.00     0.00    0.50    0.00     2.00     0.00     8.00     0.00    8.00    8.00    0.00   8.00   0.40
sdc               0.00     0.00    0.00  160.00     0.00   970.00    12.12     4.56    6.21    0.00    6.21   6.25 100.00
sdh               0.00     0.00    0.50    0.00     2.00     0.00     8.00     0.00    8.00    8.00    0.00   8.00   0.40
sdg               0.00     0.00    3.50    0.00    16.00     0.00     9.14     0.01    2.86    2.86    0.00   2.86   1.00

It seems that /dev/sdi and /dev/sdc are very busy all the time, both belong to mirror-0 on /tank. The pool is balanced (329G on each mirror):

Code:

tank                                                     1.29T  5.96T    772      0  4.35M  32.0K
  mirror                                                  329G  1.49T    186      0  1.24M      0
    ata-TOSHIBA_DT01ACA200_13S8D22AS                         -      -     57      0   524K      0
    ata-TOSHIBA_DT01ACA200_43N2AXSGS                         -      -     64      0   794K      0
  mirror                                                  329G  1.49T    265      0  1.34M      0
    ata-TOSHIBA_DT01ACA200_43N2EV5GS                         -      -    119      0   612K      0
    ata-TOSHIBA_DT01ACA200_43N2JP8GS                         -      -    113      0   786K      0
  mirror                                                  329G  1.49T    174      0   932K      0
    ata-TOSHIBA_DT01ACA200_43O0ZHLAS                         -      -     53      0   326K      0
    ata-TOSHIBA_DT01ACA200_43O2H8VGS                         -      -    118      0   626K      0
  mirror                                                  329G  1.49T    144      0   876K      0
    ata-TOSHIBA_DT01ACA200_43O2K1GGS                         -      -     67      0   534K      0
    ata-TOSHIBA_DT01ACA200_53VE7WTGS                         -      -     59      0   348K      0

Perhaps a hardware error (cable, plugs, ...)? A smartctl short test didn't show any errors on these two devices.

michaeljk · Sep 7, 2016

I have modified the ARC configuration again, increased it to 32GB and rebooted the server. Now 18GB ARC is used and iostat looks better:

Code:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.80    0.00    0.78    0.76    0.00   92.66

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdh               0.00     0.00    4.00    0.00    18.00     0.00     9.00     0.02    6.00    6.00    0.00   5.00   2.00
sdg               0.00     0.00    5.50    0.00    24.00     0.00     8.73     0.03    5.09    5.09    0.00   4.73   2.60
sdi               0.00     0.00    3.00    0.00    14.00     0.00     9.33     0.07   24.67   24.67    0.00  18.67   5.60
sde               0.00     0.00    2.50    0.00    14.00     0.00    11.20     0.01    5.60    5.60    0.00   5.60   1.40
sdd               0.00     0.00    4.00    0.00    20.00     0.00    10.00     0.03    7.50    7.50    0.00   6.00   2.40
sdc               0.00     0.00    5.50    0.00    30.00     0.00    10.91     0.06   10.91   10.91    0.00   8.36   4.60
sdf               0.50     0.00    2.50    0.00    18.00     0.00    14.40     0.01    2.40    2.40    0.00   2.40   0.60
sdj               0.00     0.00    4.00    0.00    18.00     0.00     9.00     0.03    9.00    9.00    0.00   7.00   2.80

Proxmox interface shows 34.83 GiB of 125.80 GiB used. When I booted the server I saw that it uses the "megaraid_sas" driver for the communication to the SATA drives - could this be a problem?

LnxBil · Sep 8, 2016

You can recall all the messages from kernel from boot with the command dmesg, please send the corresponding entries.

michaeljk · Sep 8, 2016

dmesg |grep mega

Code:

[    2.310088] megasas: 06.810.09.00-rc1
[    2.310747] megaraid_sas 0000:03:00.0: FW now in Ready state
[    2.311337] megaraid_sas 0000:03:00.0: firmware supports msix    : (96)
[    2.311340] megaraid_sas 0000:03:00.0: current msix/online cpus    : (24/24)
[    2.311342] megaraid_sas 0000:03:00.0: RDPQ mode    : (disabled)
[    2.311599] megaraid_sas 0000:03:00.0: Current firmware maximum commands: 928    LDIO threshold: 0
[    2.336767] megaraid_sas 0000:03:00.0: Init cmd success
[    2.360800] megaraid_sas 0000:03:00.0: firmware type    : Extended VD(240 VD)firmware
[    2.360804] megaraid_sas 0000:03:00.0: controller type    : MR(1024MB)
[    2.360806] megaraid_sas 0000:03:00.0: Online Controller Reset(OCR)    : Enabled
[    2.360807] megaraid_sas 0000:03:00.0: Secure JBOD support    : No
[    2.384959] megaraid_sas 0000:03:00.0: INIT adapter done
[    2.385047] megaraid_sas 0000:03:00.0: Jbod map is not supported megasas_setup_jbod_map 4941
[    2.390189] megaraid_sas 0000:03:00.0: pci id        : (0x1000)/(0x005d)/(0x1028)/(0x1f49)
[    2.390191] megaraid_sas 0000:03:00.0: unevenspan support    : yes
[    2.390192] megaraid_sas 0000:03:00.0: firmware crash dump    : no
[    2.390193] megaraid_sas 0000:03:00.0: jbod sync map        : no

lshw -class disk -class storage

Code:

*-storage
       description: RAID bus controller
       product: MegaRAID SAS-3 3108 [Invader]
       vendor: LSI Logic / Symbios Logic
       physical id: 0
       bus info: pci@0000:03:00.0
       logical name: scsi0
       version: 02
       width: 64 bits
       clock: 33MHz
       capabilities: storage pm pciexpress vpd msi msix bus_master cap_list
       configuration: driver=megaraid_sas latency=0
       resources: irq:37 ioport:2000(size=256) memory:91d00000-91d0ffff memory:91c00000-91cfffff
     *-disk:0
          description: ATA Disk
          product: SAMSUNG MZ7KM120
          physical id: 0.0.0
          bus info: scsi@0:0.0.0
          logical name: /dev/sda
          version: 003Q
          serial: S2HPNX0H500074
          size: 111GiB (120GB)
          capacity: 111GiB (120GB)
          capabilities: gpt-1.00 partitioned partitioned:gpt
          configuration: ansiversion=6 guid=16e88499-2e4d-4da7-bbb3-733ab450e991 logicalsectorsize=512 sectorsize=512
[...]
     *-disk:4
          description: ATA Disk
          product: TOSHIBA DT01ACA2
          vendor: Toshiba
          physical id: 0.2.0
          bus info: scsi@0:0.2.0
          logical name: /dev/sdc
          version: ABB0
          serial: 13S8D22AS
          size: 1863GiB (2TB)
          capacity: 1863GiB (2TB)
          capabilities: gpt-1.00 partitioned partitioned:gpt
          configuration: ansiversion=6 guid=c6fe59cd-9ad5-a244-ac50-d48c87ca0273 logicalsectorsize=512 sectorsize=4096

LnxBil · Sep 8, 2016

Does not look wrong, so I think it's not the problem.

When is the fio test ready?

michaeljk · Sep 8, 2016

This is the fio test from inside a VM (virtio) on the Dell system:

Code:

iometer: (g=0): rw=randrw, bs=512-64K/512-64K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
iometer: Laying out IO file(s) (1 file(s) / 4096MB)
Jobs: 1 (f=1): [m] [100.0% done] [120.7M/30887K /s] [97.5K/24.5K iops] [eta 00m:00s]
iometer: (groupid=0, jobs=1): err= 0: pid=8543
  Description  : [Emulation of Intel IOmeter File Server Access Pattern]
  read : io=3270.5MB, bw=250916KB/s, iops=82257 , runt= 13347msec
    slat (usec): min=2 , max=4026 , avg= 5.79, stdev= 8.85
    clat (usec): min=51 , max=245691 , avg=613.04, stdev=2497.63
     lat (usec): min=57 , max=245697 , avg=619.15, stdev=2497.72
    clat percentiles (usec):
     |  1.00th=[  306],  5.00th=[  414], 10.00th=[  454], 20.00th=[  490],
     | 30.00th=[  510], 40.00th=[  524], 50.00th=[  540], 60.00th=[  548],
     | 70.00th=[  564], 80.00th=[  588], 90.00th=[  708], 95.00th=[  964],
     | 99.00th=[ 1160], 99.50th=[ 1400], 99.90th=[ 4080], 99.95th=[ 8640],
     | 99.99th=[128512]
    bw (KB/s)  : min=44195, max=798092, per=100.00%, avg=254456.00, stdev=191066.62
  write: io=845329KB, bw=63335KB/s, iops=20543 , runt= 13347msec
    slat (usec): min=2 , max=8037 , avg= 7.58, stdev=25.62
    clat (usec): min=69 , max=245478 , avg=618.81, stdev=2378.01
     lat (usec): min=74 , max=245481 , avg=626.75, stdev=2378.31
    clat percentiles (usec):
     |  1.00th=[  338],  5.00th=[  430], 10.00th=[  466], 20.00th=[  494],
     | 30.00th=[  516], 40.00th=[  532], 50.00th=[  540], 60.00th=[  556],
     | 70.00th=[  572], 80.00th=[  596], 90.00th=[  724], 95.00th=[  972],
     | 99.00th=[ 1176], 99.50th=[ 1416], 99.90th=[ 3984], 99.95th=[ 8512],
     | 99.99th=[128512]
    bw (KB/s)  : min=10813, max=214445, per=100.00%, avg=64236.31, stdev=49754.91
    lat (usec) : 100=0.01%, 250=0.37%, 500=24.16%, 750=66.45%, 1000=4.83%
    lat (msec) : 2=3.99%, 4=0.10%, 10=0.06%, 20=0.01%, 50=0.01%
    lat (msec) : 250=0.02%
  cpu          : usr=24.76%, sys=65.88%, ctx=13447, majf=1, minf=20
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=1097886/w=274197/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=3270.5MB, aggrb=250915KB/s, minb=250915KB/s, maxb=250915KB/s, mint=13347msec, maxt=13347msec
  WRITE: io=845329KB, aggrb=63334KB/s, minb=63334KB/s, maxb=63334KB/s, mint=13347msec, maxt=13347msec

Disk stats (read/write):
  vda: ios=1072322/267635, merge=0/19, ticks=96688/30600, in_queue=126848, util=99.05%

And this is from the old hardware raid system (fio in file on /var/lib/vz):

Code:

iometer: (g=0): rw=randrw, bs=512-64K/512-64K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
iometer: Laying out IO file(s) (1 file(s) / 4096MB)
Jobs: 1 (f=1): [m] [100.0% done] [4409K/1084K /s] [3771 /935  iops] [eta 00m:00s]  
iometer: (groupid=0, jobs=1): err= 0: pid=849352
  Description  : [Emulation of Intel IOmeter File Server Access Pattern]
  read : io=3276.3MB, bw=12651KB/s, iops=4120 , runt=265193msec
    slat (usec): min=6 , max=2303 , avg=16.20, stdev= 6.76
    clat (usec): min=1 , max=1210.1K, avg=15489.43, stdev=22600.41
     lat (usec): min=39 , max=1210.1K, avg=15506.08, stdev=22600.40
    clat percentiles (usec):
     |  1.00th=[   37],  5.00th=[ 1368], 10.00th=[ 2160], 20.00th=[ 3440],
     | 30.00th=[ 4768], 40.00th=[ 6496], 50.00th=[ 8768], 60.00th=[11968],
     | 70.00th=[16192], 80.00th=[22912], 90.00th=[36096], 95.00th=[50944],
     | 99.00th=[92672], 99.50th=[116224], 99.90th=[203776], 99.95th=[264192],
     | 99.99th=[700416]
    bw (KB/s)  : min= 4298, max=60676, per=100.00%, avg=12658.06, stdev=9024.45
  write: io=839475KB, bw=3165.6KB/s, iops=1030 , runt=265193msec
    slat (usec): min=7 , max=407 , avg=17.78, stdev= 6.56
    clat (usec): min=1 , max=206641 , avg=58.90, stdev=1132.63
     lat (usec): min=40 , max=206666 , avg=77.15, stdev=1132.63
    clat percentiles (usec):
     |  1.00th=[   34],  5.00th=[   36], 10.00th=[   37], 20.00th=[   37],
     | 30.00th=[   38], 40.00th=[   38], 50.00th=[   39], 60.00th=[   39],
     | 70.00th=[   40], 80.00th=[   42], 90.00th=[   49], 95.00th=[   55],
     | 99.00th=[   83], 99.50th=[  100], 99.90th=[  462], 99.95th=[ 6944],
     | 99.99th=[43264]
    bw (KB/s)  : min= 1029, max=16126, per=100.00%, avg=3167.48, stdev=2281.73
    lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=21.24%
    lat (usec) : 100=1.90%, 250=0.10%, 500=0.08%, 750=0.04%, 1000=0.07%
    lat (msec) : 2=3.62%, 4=12.41%, 10=24.08%, 20=17.41%, 50=14.92%
    lat (msec) : 100=3.50%, 250=0.59%, 500=0.03%, 750=0.01%, 1000=0.01%
    lat (msec) : 2000=0.01%
  cpu          : usr=5.12%, sys=10.37%, ctx=1034656, majf=0, minf=21
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=1092808/w=273254/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=3276.3MB, aggrb=12650KB/s, minb=12650KB/s, maxb=12650KB/s, mint=265193msec, maxt=265193msec
  WRITE: io=839474KB, aggrb=3165KB/s, minb=3165KB/s, maxb=3165KB/s, mint=265193msec, maxt=265193msec

Disk stats (read/write):
    dm-2: ios=1094015/274665, merge=0/0, ticks=16931776/14852, in_queue=16947887, util=100.00%, aggrios=1093694/275210, aggrmerge=993/753, aggrticks=16926927/14655, aggrin_queue=16941045, aggrutil=100.00%
  sda: ios=1093694/275210, merge=993/753, ticks=16926927/14655, in_queue=16941045, util=100.00%

LnxBil · Sep 8, 2016

I'd say pretty normal for that setup. Let's see hat @mir has to say about that.

michaeljk · Sep 8, 2016

I have two options for upgrading this server: +64GB RAM (= 192GB) or changing the 7200 RPM SATA drives to 15k SAS 600GB. But before doing modifications, I must be sure that the integrated controller can be safely used for ZFS. Otherwise we need to order a new HP system with hardware raid (LVM-thin) instead, but loose all the benefits of zfs

mir · Sep 8, 2016

I see nothing unusual with both results. The result from the Dell system is more than adequate for a large number of VM's.

zfs: finding the bottleneck

Renowned Member

Renowned Member

Distinguished Member

Distinguished Member

Renowned Member

Distinguished Member

Renowned Member

Renowned Member

Distinguished Member

Renowned Member

Famous Member

Renowned Member

Renowned Member

Distinguished Member

Renowned Member

Distinguished Member

Renowned Member

Distinguished Member

Renowned Member

Famous Member