ZFS write (High IO and High Delay)

masterdaweb

Hello guys,


I'd like to hear from you about the write speed of your ZFS setup.

I'm using an SSD. When a VM is being cloned, IO goes up to 30-40%.

From iotop I can see that txg_sync is at 99%, and the write rate oscillates between kilobytes and a couple of megabytes per second.

I don't know what is causing this write bottleneck. I'm using an SSD and read speed is fine.


Code:
root@br01:~# ./arc_summary.py

------------------------------------------------------------------------
ZFS Subsystem Report                            Fri Apr 20 07:26:21 2018
ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                36.77M
        Mutex Misses:                           5.22k
        Evict Skips:                            3.56M

ARC Size:                               68.25%  1023.77 MiB
        Target Size: (Adaptive)         70.58%  1.03    GiB
        Min Size (Hard Limit):          68.27%  1.00    GiB
        Max Size (High Water):          1:1     1.46    GiB

ARC Size Breakdown:
        Recently Used Cache Size:       12.51%  125.04  MiB
        Frequently Used Cache Size:     87.49%  874.37  MiB

ARC Hash Breakdown:
        Elements Max:                           50.19k
        Elements Current:               63.38%  31.81k
        Collisions:                             1.30M
        Chain Max:                              3
        Chains:                                 119

ARC Total accesses:                                     751.51M
        Cache Hit Ratio:                93.78%  704.78M
        Cache Miss Ratio:               6.22%   46.73M
        Actual Hit Ratio:               93.69%  704.10M

        Data Demand Efficiency:         81.61%  210.64M
        Data Prefetch Efficiency:       15.22%  2.60M

        CACHE HITS BY CACHE LIST:
          Most Recently Used:           23.62%  166.44M
          Most Frequently Used:         76.29%  537.66M
          Most Recently Used Ghost:     0.99%   6.96M
          Most Frequently Used Ghost:   0.03%   229.03k

        CACHE HITS BY DATA TYPE:
          Demand Data:                  24.39%  171.90M
          Prefetch Data:                0.06%   395.49k
          Demand Metadata:              75.43%  531.64M
          Prefetch Metadata:            0.12%   838.44k

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  82.89%  38.74M
          Prefetch Data:                4.71%   2.20M
          Demand Metadata:              11.64%  5.44M
          Prefetch Metadata:            0.76%   355.06k


DMU Prefetch Efficiency:                                        257.96M
        Hit Ratio:                      2.95%   7.62M
        Miss Ratio:                     97.05%  250.34M



ZFS Tunables:
        dbuf_cache_hiwater_pct                            10
        dbuf_cache_lowater_pct                            10
        dbuf_cache_max_bytes                              49152000
        dbuf_cache_max_shift                              5
        dmu_object_alloc_chunk_shift                      7
        ignore_hole_birth                                 1
        l2arc_feed_again                                  1
        l2arc_feed_min_ms                                 200
        l2arc_feed_secs                                   1
        l2arc_headroom                                    2
        l2arc_headroom_boost                              200
        l2arc_noprefetch                                  1
        l2arc_norw                                        0
        l2arc_write_boost                                 8388608
        l2arc_write_max                                   8388608
        metaslab_aliquot                                  524288
        metaslab_bias_enabled                             1
        metaslab_debug_load                               0
        metaslab_debug_unload                             0
        metaslab_fragmentation_factor_enabled             1
        metaslab_lba_weighting_enabled                    1
        metaslab_preload_enabled                          1
        metaslabs_per_vdev                                200
        send_holes_without_birth_time                     1
        spa_asize_inflation                               24
        spa_config_path                                   /etc/zfs/zpool.cache
        spa_load_verify_data                              1
        spa_load_verify_maxinflight                       10000
        spa_load_verify_metadata                          1
        spa_slop_shift                                    5
        zfetch_array_rd_sz                                1048576
        zfetch_max_distance                               8388608
        zfetch_max_streams                                8
        zfetch_min_sec_reap                               2
        zfs_abd_scatter_enabled                           1
        zfs_abd_scatter_max_order                         10
        zfs_admin_snapshot                                1
        zfs_arc_average_blocksize                         8192
        zfs_arc_dnode_limit                               0
        zfs_arc_dnode_limit_percent                       10
        zfs_arc_dnode_reduce_percent                      10
        zfs_arc_grow_retry                                0
        zfs_arc_lotsfree_percent                          10
        zfs_arc_max                                       1572864000
        zfs_arc_meta_adjust_restarts                      4096
        zfs_arc_meta_limit                                1610612736
        zfs_arc_meta_limit_percent                        75
        zfs_arc_meta_min                                  0
        zfs_arc_meta_prune                                10000
        zfs_arc_meta_strategy                             1
        zfs_arc_min                                       1073741824
        zfs_arc_min_prefetch_lifespan                     0
        zfs_arc_p_aggressive_disable                      1
        zfs_arc_p_dampener_disable                        1
        zfs_arc_p_min_shift                               0
        zfs_arc_pc_percent                                0
        zfs_arc_shrink_shift                              0
        zfs_arc_sys_free                                  0
        zfs_autoimport_disable                            1
        zfs_compressed_arc_enabled                        1
        zfs_dbgmsg_enable                                 0
        zfs_dbgmsg_maxsize                                4194304
        zfs_dbuf_state_index                              0
        zfs_deadman_checktime_ms                          5000
        zfs_deadman_enabled                               1
        zfs_deadman_synctime_ms                           1000000
        zfs_dedup_prefetch                                0
        zfs_delay_min_dirty_percent                       60
        zfs_delay_scale                                   500000
        zfs_delete_blocks                                 20480
        zfs_dirty_data_max                                3354684620
        zfs_dirty_data_max_max                            4294967296
        zfs_dirty_data_max_max_percent                    25
        zfs_dirty_data_max_percent                        10
        zfs_dirty_data_sync                               67108864
        zfs_dmu_offset_next_sync                          0
        zfs_expire_snapshot                               300
        zfs_flags                                         0
        zfs_free_bpobj_enabled                            1
        zfs_free_leak_on_eio                              0
        zfs_free_max_blocks                               100000
        zfs_free_min_time_ms                              1000
        zfs_immediate_write_sz                            32768
        zfs_max_recordsize                                1048576
        zfs_mdcomp_disable                                0
        zfs_metaslab_fragmentation_threshold              70
        zfs_metaslab_segment_weight_enabled               1
        zfs_metaslab_switch_threshold                     2
        zfs_mg_fragmentation_threshold                    85
        zfs_mg_noalloc_threshold                          0
        zfs_multihost_fail_intervals                      5
        zfs_multihost_history                             0
        zfs_multihost_import_intervals                    10
        zfs_multihost_interval                            1000
        zfs_multilist_num_sublists                        0
        zfs_no_scrub_io                                   0
        zfs_no_scrub_prefetch                             0
        zfs_nocacheflush                                  0
        zfs_nopwrite_enabled                              1
        zfs_object_mutex_size                             64
        zfs_pd_bytes_max                                  52428800
        zfs_per_txg_dirty_frees_percent                   30
        zfs_prefetch_disable                              0
        zfs_read_chunk_size                               1048576
        zfs_read_history                                  0
        zfs_read_history_hits                             0
        zfs_recover                                       0
        zfs_resilver_delay                                2
        zfs_resilver_min_time_ms                          3000
        zfs_scan_idle                                     50
        zfs_scan_min_time_ms                              1000
        zfs_scrub_delay                                   4
        zfs_send_corrupt_data                             0
        zfs_sync_pass_deferred_free                       2
        zfs_sync_pass_dont_compress                       5
        zfs_sync_pass_rewrite                             2
        zfs_sync_taskq_batch_pct                          75
        zfs_top_maxinflight                               32
        zfs_txg_history                                   0
        zfs_txg_timeout                                   5
        zfs_vdev_aggregation_limit                        131072
        zfs_vdev_async_read_max_active                    3
        zfs_vdev_async_read_min_active                    1
        zfs_vdev_async_write_active_max_dirty_percent     60
        zfs_vdev_async_write_active_min_dirty_percent     30
        zfs_vdev_async_write_max_active                   10
        zfs_vdev_async_write_min_active                   2
        zfs_vdev_cache_bshift                             16
        zfs_vdev_cache_max                                16384
        zfs_vdev_cache_size                               0
        zfs_vdev_max_active                               1000
        zfs_vdev_mirror_non_rotating_inc                  0
        zfs_vdev_mirror_non_rotating_seek_inc             1
        zfs_vdev_mirror_rotating_inc                      0
        zfs_vdev_mirror_rotating_seek_inc                 5
        zfs_vdev_mirror_rotating_seek_offset              1048576
        zfs_vdev_queue_depth_pct                          1000
        zfs_vdev_raidz_impl                               [fastest] original scalar sse2 ssse3 avx2
        zfs_vdev_read_gap_limit                           32768
        zfs_vdev_scheduler                                noop
        zfs_vdev_scrub_max_active                         2
        zfs_vdev_scrub_min_active                         1
        zfs_vdev_sync_read_max_active                     10
        zfs_vdev_sync_read_min_active                     10
        zfs_vdev_sync_write_max_active                    10
        zfs_vdev_sync_write_min_active                    10
        zfs_vdev_write_gap_limit                          4096
        zfs_zevent_cols                                   80
        zfs_zevent_console                                0
        zfs_zevent_len_max                                128
        zil_replay_disable                                0
        zil_slog_bulk                                     786432
        zio_delay_max                                     30000
        zio_dva_throttle_enabled                          1
        zio_requeue_io_start_cut_in_line                  1
        zio_taskq_batch_pct                               75
        zvol_inhibit_dev                                  0
        zvol_major                                        230
        zvol_max_discard_blocks                           16384
        zvol_prefetch_bytes                               131072
        zvol_request_sync                                 0
        zvol_threads                                      32
        zvol_volmode                                      1
 
From what I can see, you have a very small ZFS cache, so I assume there is not much RAM available on the system either. Writes then need to be synced to the underlying storage more frequently. You can try setting a bandwidth limit for cloning (see man datacenter.cfg), but more RAM for caching is advisable.
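To put rough numbers on the cache size, the tunables from the arc_summary output above can be converted to GiB. This is only a sketch of the arithmetic in Python: the values are copied from that output, and the RAM estimate assumes zfs_dirty_data_max was left at its default of zfs_dirty_data_max_percent of physical memory.

```python
GIB = 1024 ** 3

# Values copied from the "ZFS Tunables" section of the arc_summary output.
zfs_arc_max = 1572864000
zfs_dirty_data_max = 3354684620
zfs_dirty_data_max_percent = 10
zfs_delay_min_dirty_percent = 60


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / GIB


# Upper limit for the ARC.
arc_max_gib = to_gib(zfs_arc_max)

# Upper limit for dirty (not yet written) data held in memory.
dirty_max_gib = to_gib(zfs_dirty_data_max)

# Assumption: dirty_data_max is still the default percentage of RAM,
# so dividing by that percentage gives an estimate of physical memory.
ram_estimate_gib = to_gib(zfs_dirty_data_max * 100 / zfs_dirty_data_max_percent)

# Amount of dirty data at which ZFS starts delaying writers.
delay_start_gib = to_gib(zfs_dirty_data_max * zfs_delay_min_dirty_percent / 100)

print(f"ARC max:          {arc_max_gib:.2f} GiB")
print(f"Dirty data max:   {dirty_max_gib:.2f} GiB")
print(f"Estimated RAM:    {ram_estimate_gib:.1f} GiB")
print(f"Delay starts at:  {delay_start_gib:.2f} GiB of dirty data")
```

If the default assumption holds, the ARC ceiling of about 1.46 GiB is small compared with the estimated memory of the machine.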
 
Hello Alwin,

I've read that the ARC is only used for caching read operations, not writes.

I think this problem is caused by my SSDs having 512-byte sectors, with ZFS set to ashift 9 and a zpool block size of 128K.

I have another setup with 4K SSDs, ashift 12 and a zpool block size of 128K. That setup runs smoothly with very good write speed, but the setup above is terrible.
 
I've read that ARC is only used for caching Read operations, not write.
I assume that there is not much RAM available on the system either.

It also depends on your hardware, of course. Test the storage with fio and compare it with your other system(s). Also try a qemu-img clone manually and experiment with the options.

Update your system to the latest available package versions to pick up improvements.
 
What is your zpool config?
How much memory do you have in total?

ashift 9 = 512 bytes
ashift 12 = 4096 bytes


And what is the difference between the two setups?
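
For reference, ashift is the base-2 logarithm of the sector size ZFS assumes for a vdev, which is where the two values above come from. A small Python sketch of that arithmetic, also showing how many such sectors a full, uncompressed 128K record spans (128K is the block size mentioned in this thread; the helper name is made up for illustration):

```python
def sector_size(ashift):
    """Sector size in bytes that ZFS assumes for a given ashift (2 ** ashift)."""
    return 1 << ashift


RECORD_SIZE = 128 * 1024  # 128K, the block size mentioned in this thread

for ashift in (9, 12):
    size = sector_size(ashift)
    sectors = RECORD_SIZE // size
    print(f"ashift {ashift}: {size}-byte sectors, {sectors} sectors per 128K record")
```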

Hello @6uellerbpanda ,

I have 32 GB RAM.

Server 1 (Running ZFS smoothly, fast Read and fast Write operations):
Xeon E3 1230 v5
32 GB RAM
SSD 480 GB (4K sector size) - ashift 12 and zvol/zpool 128K

Server 2 (Running ZFS terribly, fast Read but ...... very poor Write operations):
Xeon E3 1230 v5
32 GB RAM
SSD 480 GB (512 bytes sector size) - ashift 9 and zvol/zpool 128K


I really have no idea what is causing the slow writes on server 2. My guess is that SSDs with 512-byte sectors are terrible for a ZFS setup.

Let me know if you have any idea, I really appreciate your help.
 
I have the same problems.

Server 1 running zfs (also on boot disks), fast read, poor write (& high IO):
A1SRM 2758F
32 GB RAM
SSD 480GB (512 sector size) - ashift 12

Server 2 running zfs (not on boot disks), fast read, poor write (& high IO):
A1SRM 2558
32 GB RAM
HD 7,3TB x3 zfs pool (512 sector size) - ashift 12

Server 2 was previously running nas4free; I imported the pool into Proxmox. Under nas4free the write speed was excellent, so I'm looking at the Linux ZFS implementation (in its current state) as a possible explanation.

I'm looking forward to finding a solution for this.
 
# zpool config
Code:
zpool status

# server 2
Is the sector size really correct? Did you check with smartctl? 512-byte sectors with ashift 9 shouldn't have any impact, but the other way around (4K sectors with ashift 9) would.
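
The smartctl check could be scripted roughly like this. It is a sketch in Python, not a finished tool: the function name is hypothetical, the single-value sample line is the format shown by smartctl 6.6 in this thread, and the two-value "Sector Sizes:" format is an assumption about what smartctl prints when logical and physical sizes differ.

```python
import re


def parse_sector_sizes(smartctl_output):
    """Return (logical, physical) sector sizes in bytes, or None if not found."""
    for line in smartctl_output.splitlines():
        if not line.startswith("Sector Size"):
            continue
        sizes = [int(n) for n in re.findall(r"(\d+) bytes", line)]
        if len(sizes) == 1:
            # One value reported for both logical and physical size.
            return sizes[0], sizes[0]
        if len(sizes) >= 2:
            return sizes[0], sizes[1]
    return None


# Sample line in the single-value format.
sample = "Sector Size:      512 bytes logical/physical"
logical, physical = parse_sector_sizes(sample)

# An ashift is too small when 2 ** ashift is below the physical sector size.
ashift = 9
ashift_too_small = (1 << ashift) < physical
print(f"logical={logical} physical={physical} ashift_too_small={ashift_too_small}")
```

Note that this only reflects what the drive reports about itself.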
 
zpool status
Code:
root@br01:~# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 1h31m with 0 errors on Sun Apr  8 01:55:16 2018
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          sda2      ONLINE       0     0     0

errors: No known data errors

zpool iostat rpool
Code:
root@br01:~# zpool iostat rpool
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool        296G   148G     21     51  1.10M  2.72M

zfs get all
Code:
root@br01:~# zfs get all
NAME                 PROPERTY               VALUE                  SOURCE
rpool                type                   filesystem             -
rpool                creation               Thu Feb 15  2:43 2018  -
rpool                used                   298G                   -
rpool                available              132G                   -
rpool                referenced             104K                   -
rpool                compressratio          1.83x                  -
rpool                mounted                yes                    -
rpool                quota                  none                   default
rpool                reservation            none                   default
rpool                recordsize             128K                   default
rpool                mountpoint             /rpool                 default
rpool                sharenfs               off                    default
rpool                checksum               on                     default
rpool                compression            on                     local
rpool                atime                  off                    local
rpool                devices                on                     default
rpool                exec                   on                     default
rpool                setuid                 on                     default
rpool                readonly               off                    default
rpool                zoned                  off                    default
rpool                snapdir                hidden                 default
rpool                aclinherit             restricted             default
rpool                createtxg              1                      -
rpool                canmount               on                     default
rpool                xattr                  on                     default
rpool                copies                 1                      default
rpool                version                5                      -
rpool                utf8only               off                    -
rpool                normalization          none                   -
rpool                casesensitivity        sensitive              -
rpool                vscan                  off                    default
rpool                nbmand                 off                    default
rpool                sharesmb               off                    default
rpool                refquota               none                   default
rpool                refreservation         none                   default
rpool                guid                   605379444072230747     -
rpool                primarycache           all                    default
rpool                secondarycache         all                    default
rpool                usedbysnapshots        0B                     -
rpool                usedbydataset          104K                   -
rpool                usedbychildren         298G                   -
rpool                usedbyrefreservation   0B                     -
rpool                logbias                latency                default
rpool                dedup                  off                    default
rpool                mlslabel               none                   default
rpool                sync                   disabled               local
rpool                dnodesize              legacy                 default
rpool                refcompressratio       1.00x                  -
rpool                written                104K                   -
rpool                logicalused            543G                   -
rpool                logicalreferenced      44K                    -
rpool                volmode                default                default
rpool                filesystem_limit       none                   default
rpool                snapshot_limit         none                   default
rpool                filesystem_count       none                   default
rpool                snapshot_count         none                   default
rpool                snapdev                hidden                 default
rpool                acltype                off                    default
rpool                context                none                   default
rpool                fscontext              none                   default
rpool                defcontext             none                   default
rpool                rootcontext            none                   default
rpool                relatime               off                    default
rpool                redundant_metadata     all                    default
rpool                overlay                off                    default
rpool                zfs:zfs_nocacheflush   1                      local

smartctl -a /dev/sda
Code:
root@br01:~# smartctl -a /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.16-1-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     KINGSTON SA400S37480G
Serial Number:    50026B77760049E6
LU WWN Device Id: 5 000000 000000000
Firmware Version: SBFK71E0
User Capacity:    480,103,981,056 bytes [480 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   Unknown(0x0ff8) (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Apr 24 09:40:51 2018 -03
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (65535) seconds.
Offline data collection
capabilities:                    (0x79) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  30) minutes.
Conveyance self-test routine
recommended polling time:        (   6) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000a   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1296
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       34
148 Unknown_Attribute       0x0000   255   255   000    Old_age   Offline      -       0
149 Unknown_Attribute       0x0000   255   255   000    Old_age   Offline      -       0
167 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       0
168 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
169 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       15
170 Unknown_Attribute       0x0013   100   100   010    Pre-fail  Always       -       17
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
173 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       5243031
181 Program_Fail_Cnt_Total  0x0012   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0000   255   255   000    Old_age   Offline      -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       26
194 Temperature_Celsius     0x0023   063   050   000    Pre-fail  Always       -       37 (Min/Max 21/50)
196 Reallocated_Event_Count 0x0000   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
218 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       4
231 Temperature_Celsius     0x0013   100   100   000    Pre-fail  Always       -       92
233 Media_Wearout_Indicator 0x0013   100   100   000    Pre-fail  Always       -       41780
241 Total_LBAs_Written      0x0012   100   100   000    Old_age   Always       -       10720
242 Total_LBAs_Read         0x0012   100   100   000    Old_age   Always       -       5052
244 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       80
245 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       151
246 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       1845600

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
But you have different SSDs in the two servers, don't you?
Yes. The config I posted above is from the server with the slow write issue. It runs an SSD with 512-byte sectors.

The other server, which is running fine, has an enterprise SSD (4K sector size).

I decided not to use ZFS with 512-byte-sector SSDs. I've already tried everything, with no success.
 
