Help interpreting ZFS stats (Grafana / Telegraf metrics)

hi,

we've got notifications from our monitoring (Icinga2), which is a VM on a PVE 5.2 host with 6 x 1 TB WD Red 2.5" drives (WDC WD10JFCX-68N6GN0) as RAIDz2, because of timeouts (check_icmp).
After a longer investigation we found out that these alerts were false positives: the monitoring VM itself wasn't able to execute its checks.
Our metrics from Icinga2 and Telegraf (both in InfluxDB) show that I/O goes up at exactly 06:25, which is when cron.daily runs.
The CPU is an E3-1270 v5 @ 3.5 GHz and we have 64 GB DDR4 ECC. The ARC is limited to a minimum of 6 GB and a maximum of 12 GB of RAM.
Luckily we are also collecting ZFS stats via Telegraf, but I'm not sure how to interpret them. Maybe someone can help us out.
We are using a Supermicro X11SSH-TF and an LSI/Broadcom controller (SAS3008). The only thing we could still do: we have a single M.2 slot free .. maybe we can use it as a cache?

Any suggestions?
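
In case it helps: this is how we plan to narrow down the 06:25 spike. On PVE/Debian the cron.daily start time is defined in /etc/crontab (06:25 is the Debian default), and the commands below are read-only:
Code:
# when and how cron.daily is triggered, and what it will run
grep cron /etc/crontab
ls -l /etc/cron.daily/

# per-vdev read/write load while the daily jobs run
zpool iostat -v rpool 5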
 

Attachments

  • Bildschirmfoto 2018-10-10 um 00.14.13.png
hi,


Code:
zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 35h54m with 0 errors on Mon Sep 10 12:18:32 2018
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     0
      raidz2-0  ONLINE       0     0     0
        sda2    ONLINE       0     0     0
        sdb2    ONLINE       0     0     0
        sdc2    ONLINE       0     0     0
        sdd2    ONLINE       0     0     0
        sde2    ONLINE       0     0     0
        sdf2    ONLINE       0     0     0
        sdg2    ONLINE       0     0     0
        sdh2    ONLINE       0     0     0
errors: No known data errors

Code:
arc_summary
------------------------------------------------------------------------
ZFS Subsystem Report                Fri Oct 12 12:06:06 2018
ARC Summary: (HEALTHY)
    Memory Throttle Count:            0

ARC Misc:
    Deleted:                40.23M
    Mutex Misses:                1.57k
    Evict Skips:                24.43k

ARC Size:                64.42%    7.73    GiB
    Target Size: (Adaptive)        64.77%    7.77    GiB
    Min Size (Hard Limit):        50.00%    6.00    GiB
    Max Size (High Water):        2:1    12.00    GiB

ARC Size Breakdown:
    Recently Used Cache Size:    30.34%    2.16    GiB
    Frequently Used Cache Size:    69.66%    4.96    GiB

ARC Hash Breakdown:
    Elements Max:                3.66M
    Elements Current:        49.55%    1.81M
    Collisions:                73.65M
    Chain Max:                7
    Chains:                    168.25k

ARC Total accesses:                    5.57G
    Cache Hit Ratio:        98.88%    5.51G
    Cache Miss Ratio:        1.12%    62.29M
    Actual Hit Ratio:        98.35%    5.48G

    Data Demand Efficiency:        99.48%    4.62G
    Data Prefetch Efficiency:    58.03%    39.18M

    CACHE HITS BY CACHE LIST:
      Anonymously Used:        0.37%    20.12M
      Most Recently Used:        5.04%    277.45M
      Most Frequently Used:        94.43%    5.20G
      Most Recently Used Ghost:    0.09%    4.85M
      Most Frequently Used Ghost:    0.08%    4.64M

    CACHE HITS BY DATA TYPE:
      Demand Data:            83.47%    4.60G
      Prefetch Data:        0.41%    22.74M
      Demand Metadata:        15.72%    866.07M
      Prefetch Metadata:        0.40%    21.79M

    CACHE MISSES BY DATA TYPE:
      Demand Data:            38.41%    23.92M
      Prefetch Data:        26.40%    16.44M
      Demand Metadata:        33.98%    21.16M
      Prefetch Metadata:        1.21%    754.86k

DMU Prefetch Efficiency:                    284.27M
    Hit Ratio:            19.50%    55.45M
    Miss Ratio:            80.50%    228.83M

ZFS Tunables:
    dbuf_cache_hiwater_pct                            10
    dbuf_cache_lowater_pct                            10
    dbuf_cache_max_bytes                              104857600
    dbuf_cache_max_shift                              5
    dmu_object_alloc_chunk_shift                      7
    ignore_hole_birth                                 1
    l2arc_feed_again                                  1
    l2arc_feed_min_ms                                 200
    l2arc_feed_secs                                   1
    l2arc_headroom                                    2
    l2arc_headroom_boost                              200
    l2arc_noprefetch                                  1
    l2arc_norw                                        0
    l2arc_write_boost                                 8388608
    l2arc_write_max                                   8388608
    metaslab_aliquot                                  524288
    metaslab_bias_enabled                             1
    metaslab_debug_load                               0
    metaslab_debug_unload                             0
    metaslab_fragmentation_factor_enabled             1
    metaslab_lba_weighting_enabled                    1
    metaslab_preload_enabled                          1
    metaslabs_per_vdev                                200
    send_holes_without_birth_time                     1
    spa_asize_inflation                               24
    spa_config_path                                   /etc/zfs/zpool.cache
    spa_load_verify_data                              1
    spa_load_verify_maxinflight                       10000
    spa_load_verify_metadata                          1
    spa_slop_shift                                    5
    zfetch_array_rd_sz                                1048576
    zfetch_max_distance                               8388608
    zfetch_max_streams                                8
    zfetch_min_sec_reap                               2
    zfs_abd_scatter_enabled                           1
    zfs_abd_scatter_max_order                         10
    zfs_admin_snapshot                                1
    zfs_arc_average_blocksize                         8192
    zfs_arc_dnode_limit                               0
    zfs_arc_dnode_limit_percent                       10
    zfs_arc_dnode_reduce_percent                      10
    zfs_arc_grow_retry                                0
    zfs_arc_lotsfree_percent                          10
    zfs_arc_max                                       12884901888
    zfs_arc_meta_adjust_restarts                      4096
    zfs_arc_meta_limit                                0
    zfs_arc_meta_limit_percent                        75
    zfs_arc_meta_min                                  0
    zfs_arc_meta_prune                                10000
    zfs_arc_meta_strategy                             1
    zfs_arc_min                                       6442450944
    zfs_arc_min_prefetch_lifespan                     0
    zfs_arc_p_dampener_disable                        1
    zfs_arc_p_min_shift                               0
    zfs_arc_pc_percent                                0
    zfs_arc_shrink_shift                              0
    zfs_arc_sys_free                                  0
    zfs_autoimport_disable                            1
    zfs_checksums_per_second                          20
    zfs_compressed_arc_enabled                        1
    zfs_dbgmsg_enable                                 0
    zfs_dbgmsg_maxsize                                4194304
    zfs_dbuf_state_index                              0
    zfs_deadman_checktime_ms                          5000
    zfs_deadman_enabled                               1
    zfs_deadman_synctime_ms                           1000000
    zfs_dedup_prefetch                                0
    zfs_delay_min_dirty_percent                       60
    zfs_delay_scale                                   500000
    zfs_delays_per_second                             20
    zfs_delete_blocks                                 20480
    zfs_dirty_data_max                                4294967296
    zfs_dirty_data_max_max                            4294967296
    zfs_dirty_data_max_max_percent                    25
    zfs_dirty_data_max_percent                        10
    zfs_dirty_data_sync                               67108864
    zfs_dmu_offset_next_sync                          0
    zfs_expire_snapshot                               300
    zfs_flags                                         0
    zfs_free_bpobj_enabled                            1
    zfs_free_leak_on_eio                              0
    zfs_free_max_blocks                               100000
    zfs_free_min_time_ms                              1000
    zfs_immediate_write_sz                            32768
    zfs_max_recordsize                                1048576
    zfs_mdcomp_disable                                0
    zfs_metaslab_fragmentation_threshold              70
    zfs_metaslab_segment_weight_enabled               1
    zfs_metaslab_switch_threshold                     2
    zfs_mg_fragmentation_threshold                    85
    zfs_mg_noalloc_threshold                          0
    zfs_multihost_fail_intervals                      5
    zfs_multihost_history                             0
    zfs_multihost_import_intervals                    10
    zfs_multihost_interval                            1000
    zfs_multilist_num_sublists                        0
    zfs_no_scrub_io                                   0
    zfs_no_scrub_prefetch                             0
    zfs_nocacheflush                                  0
    zfs_nopwrite_enabled                              1
    zfs_object_mutex_size                             64
    zfs_pd_bytes_max                                  52428800
    zfs_per_txg_dirty_frees_percent                   30
    zfs_prefetch_disable                              0
    zfs_read_chunk_size                               1048576
    zfs_read_history                                  0
    zfs_read_history_hits                             0
    zfs_recover                                       0
    zfs_recv_queue_length                             16777216
    zfs_resilver_delay                                2
    zfs_resilver_min_time_ms                          3000
    zfs_scan_idle                                     50
    zfs_scan_ignore_errors                            0
    zfs_scan_min_time_ms                              1000
    zfs_scrub_delay                                   4
    zfs_send_corrupt_data                             0
    zfs_send_queue_length                             16777216
    zfs_sync_pass_deferred_free                       2
    zfs_sync_pass_dont_compress                       5
    zfs_sync_pass_rewrite                             2
    zfs_sync_taskq_batch_pct                          75
    zfs_top_maxinflight                               32
    zfs_txg_history                                   0
    zfs_txg_timeout                                   5
    zfs_vdev_aggregation_limit                        131072
    zfs_vdev_async_read_max_active                    3
    zfs_vdev_async_read_min_active                    1
    zfs_vdev_async_write_active_max_dirty_percent     60
    zfs_vdev_async_write_active_min_dirty_percent     30
    zfs_vdev_async_write_max_active                   10
    zfs_vdev_async_write_min_active                   2
    zfs_vdev_cache_bshift                             16
    zfs_vdev_cache_max                                16384
    zfs_vdev_cache_size                               0
    zfs_vdev_max_active                               1000
    zfs_vdev_mirror_non_rotating_inc                  0
    zfs_vdev_mirror_non_rotating_seek_inc             1
    zfs_vdev_mirror_rotating_inc                      0
    zfs_vdev_mirror_rotating_seek_inc                 5
    zfs_vdev_mirror_rotating_seek_offset              1048576
    zfs_vdev_queue_depth_pct                          1000
    zfs_vdev_raidz_impl                               [fastest] original scalar sse2 ssse3 avx2
    zfs_vdev_read_gap_limit                           32768
    zfs_vdev_scheduler                                noop
    zfs_vdev_scrub_max_active                         2
    zfs_vdev_scrub_min_active                         1
    zfs_vdev_sync_read_max_active                     10
    zfs_vdev_sync_read_min_active                     10
    zfs_vdev_sync_write_max_active                    10
    zfs_vdev_sync_write_min_active                    10
    zfs_vdev_write_gap_limit                          4096
    zfs_zevent_cols                                   80
    zfs_zevent_console                                0
    zfs_zevent_len_max                                128
    zil_replay_disable                                0
    zil_slog_bulk                                     786432
    zio_delay_max                                     30000
    zio_dva_throttle_enabled                          1
    zio_requeue_io_start_cut_in_line                  1
    zio_taskq_batch_pct                               75
    zvol_inhibit_dev                                  0
    zvol_major                                        230
    zvol_max_discard_blocks                           16384
    zvol_prefetch_bytes                               131072
    zvol_request_sync                                 0
    zvol_threads                                      32
    zvol_volmode                                      1
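
For reference, the zfs_arc_min / zfs_arc_max values above (6442450944 / 12884901888 bytes) are the 6 GB / 12 GB limits from the first post. The usual way to set them persistently on PVE is via module options, roughly like this (plus an update-initramfs -u when the root pool is on ZFS):
Code:
# /etc/modprobe.d/zfs.conf: persistent ARC limits (values taken from the tunables above)
options zfs zfs_arc_min=6442450944
options zfs zfs_arc_max=12884901888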


cu denny
 
hi,

The problem is that we have issues with the VMs and see a lot of
Code:
[Sun Oct 14 02:06:07 2018] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 52s! [check_ ....

nearly every day. Every check command that needs to be executed produces the same message. I think it happens because the underlying ZFS isn't fast enough to handle all the requests ...
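
What I still want to verify is whether the lockups really line up with storage pressure; something along these lines should show it (vmstat inside the VM, iostat from the sysstat package on the host):
Code:
# inside the monitoring VM: a high 'wa' (iowait) column during the lockups points at slow storage
vmstat 5

# on the PVE host: per-disk utilisation and wait times while the checks hang
iostat -x 5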
 
I've seen a lot of these soft lockups myself, along with atrocious performance on RAIDz2. RAIDz2/RAID6 just isn't very good at small-block I/O.

You might be able to skirt around the issue with a couple of mirrored SSDs for the ZIL (SLOG) and L2ARC, but ultimately I would recommend moving away from RAIDz2 towards a pool of mirrored vdevs. Without adding any hardware, I was able to move some of my clients over to a pool of mirrors and resolve nearly all of their performance issues and the soft lockup messages.
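
For illustration only, a six-disk pool of mirrors looks roughly like this (placeholder device names; creating a pool is destructive, so an existing rpool would have to be rebuilt):
Code:
# sketch of a striped-mirror layout, not a ready-to-run migration command
zpool create -o ashift=12 tank \
  mirror /dev/disk/by-id/diskA /dev/disk/by-id/diskB \
  mirror /dev/disk/by-id/diskC /dev/disk/by-id/diskD \
  mirror /dev/disk/by-id/diskE /dev/disk/by-id/diskF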
 
hi,

That is exactly what we are trying to do. The only sad thing is that we have to reinstall the hypervisor. On a different host I chose two striped raidz1 vdevs and it works well with the exact same hardware.

But one question: we have a single free M.2 slot, does it make sense to use it as a write cache? What happens if this drive dies?
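
To be clear, by "two striped raidz1 vdevs" I mean a single pool built from two raidz1 groups, roughly like this (placeholder device names, only to show the layout):
Code:
# layout sketch only: two raidz1 vdevs striped in one pool
zpool create -o ashift=12 tank \
  raidz1 /dev/disk/by-id/disk1 /dev/disk/by-id/disk2 /dev/disk/by-id/disk3 \
  raidz1 /dev/disk/by-id/disk4 /dev/disk/by-id/disk5 /dev/disk/by-id/disk6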
 
hi,

If you have a lot of sync writes, a SLOG can of course help.

When the SLOG fails, ZFS falls back to the ZIL on the pool disks, so you won't lose any data, except if in that same time frame you also lose the whole storage and the open txg hasn't been flushed to the pool yet, but that is very unlikely, I guess ;)
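
If you try it, adding (and later removing) a single log or cache device is straightforward; the device name below is only an example for the M.2 SSD:
Code:
# add the M.2 SSD (or a partition of it) as SLOG (example device name)
zpool add rpool log /dev/disk/by-id/nvme-example-part1
# a second partition could be used as L2ARC instead of, or in addition to, the SLOG
zpool add rpool cache /dev/disk/by-id/nvme-example-part2
# log and cache devices can be removed again without data loss
zpool remove rpool /dev/disk/by-id/nvme-example-part1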
 
The ZIL is a special storage area where most (but not all) sync write I/O lands. If you do not have a SLOG, the ZIL lives on the same disks as the ZFS pool. When you have a dedicated SLOG device, the ZIL is located only on that device. When a sync write is issued, the data goes first into RAM and onto the SLOG (and the application is told: OK, the data is on disk now). When the ZFS write buffers are flushed to the pool (every 5 seconds by default), all data, including the sync data, is written to the pool disks (excluding the SLOG).
So at any moment the sync data is present in two locations: RAM and the SLOG. For this reason, losing the SLOG is not a problem, because the data is also in RAM and will be written to the pool disks on the next buffer flush. From the moment the SLOG is broken, all future sync writes go directly to the pool disks in sync mode (with a corresponding write speed degradation).
SLOG data is only read back after a kernel crash or power failure, and only if there is data present on the SLOG that is not yet present on the pool disks.
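
Two quick read-only checks related to the above: the txg flush interval (the "5 seconds by default") and which datasets actually issue sync writes:
Code:
# transaction group timeout in seconds (matches zfs_txg_timeout = 5 in the tunables above)
cat /sys/module/zfs/parameters/zfs_txg_timeout
# per-dataset sync behaviour (standard / always / disabled)
zfs get -r sync rpool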
 
