High I/O wait with SSDs

ralph.vigne · Jan 16, 2021

Hello,

I recently built my new "server" for my home, and as the title states, experiencing very bad IO performance. The system is configured to have 2 SSDs as a ZFS mirror for the system itself (rpool) and has 4 more spinning disks of which again 2 are configured as a ZFS mirror.

The hardware of the system is as follows:

CPU: Intel i5-4440 (4 cores) on a Gigabyte H97-D3H-CF with all RAID features disabled
RAM: 16 GB (non-ECC, Kingston)
2 SSDS: Crucial BX500 (1TB) in ZFS Mirror (https://www.crucial.com/ssd/bx500/ct1000bx500ssd1)
2 HDD: WED RED 4TB WD40EFAZ again in a ZFS mirror
2 HDD: WD RED 4TB each a separate ZFS volume

Note: At this point I like to add that I'm aware that my SSDs are not “enterprise-grade”, but I was sort of hoping to still use them for my home server, as there is actually very little load on it, and since they are used in a mirrored pool the risk of data-loss seemed acceptable to me. I also know that they are not among the fastest SDDs but based on its specs it should be at least similar fast as a spinning disk, right?

During install I selected to use ZFS and configured the two SSDs as RAID1 without error. I was also able to import each volume without issues, and it seems everything is working, only the performance of the SSDs is unbelievably bad. Making a backup of my MySQL server with about 250MB of data takes in the order of 10 minutes (read from SSD and write to SSD) and IO wait is about 30-35% all the time (till the backup is finished). The actual CPU load is in the order of 4-5% and RAM is about 85-90% without swapping anything. I tried shutting down all the guests (3 containers and 1 VM) but performance is pretty much unaffected. Thus, I think I can exclude my workload (Home Assistant and Plex) to be the reason for the clogged-up system which btw. was running fine on an Intel Celeron before.

To show some data about my ZFS settings I ran the following commands:

Code:

root@duckburg:~# zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
media1    3.62T  2.40T  1.23T        -         -     0%    66%  1.00x    ONLINE  -
media2    3.62T  1.47T  2.16T        -         -     0%    40%  1.00x    ONLINE  -
rpool      928G   515G   413G        -         -    33%    55%  1.00x    ONLINE  -
zfs-data  2.72T   426G  2.30T        -         -     0%    15%  1.00x    ONLINE  -


root@duckburg:~# zpool status rpool
pool: rpool
state: ONLINE
scan: scrub repaired 0B in 0 days 00:31:52 with 0 errors on Sun Jan 10 00:55:57 2021
config:

       NAME                                        STATE     READ WRITE CKSUM
       rpool                                       ONLINE       0     0     0
         mirror-0                                  ONLINE       0     0     0
           ata-CT1000BX500SSD1_1951E230BD67-part3  ONLINE       0     0     0
           ata-CT1000BX500SSD1_2012E295BD75-part3  ONLINE       0     0     0

errors: No known data errors

Code:

root@duckburg:~# zpool get all | grep ashift
media1    ashift                         12                             local
media2    ashift                         12                             local
rpool     ashift                         12                             local
zfs-data  ashift                         12                             local

Code:

root@duckburg:~# zpool get all rpool
NAME   PROPERTY                       VALUE                          SOURCE
rpool  size                           928G                           -
rpool  capacity                       55%                            -
rpool  altroot                        -                              default
rpool  health                         ONLINE                         -
rpool  guid                           17740624895035484119           -
rpool  version                        -                              default
rpool  bootfs                         rpool/ROOT/pve-1               local
rpool  delegation                     on                             default
rpool  autoreplace                    off                            default
rpool  cachefile                      -                              default
rpool  failmode                       wait                           default
rpool  listsnapshots                  off                            default
rpool  autoexpand                     off                            default
rpool  dedupditto                     0                              default
rpool  dedupratio                     1.00x                          -
rpool  free                           413G                           -
rpool  allocated                      515G                           -
rpool  readonly                       off                            -
rpool  ashift                         12                             local
rpool  comment                        -                              default
rpool  expandsize                     -                              -
rpool  freeing                        0                              -
rpool  fragmentation                  33%                            -
rpool  leaked                         0                              -
rpool  multihost                      off                            default
rpool  checkpoint                     -                              -
rpool  load_guid                      5700784901473667627            -
rpool  autotrim                       off                            default
rpool  feature@async_destroy          enabled                        local
rpool  feature@empty_bpobj            active                         local
rpool  feature@lz4_compress           active                         local
rpool  feature@multi_vdev_crash_dump  enabled                        local
rpool  feature@spacemap_histogram     active                         local
rpool  feature@enabled_txg            active                         local
rpool  feature@hole_birth             active                         local
rpool  feature@extensible_dataset     active                         local
rpool  feature@embedded_data          active                         local
rpool  feature@bookmarks              enabled                        local
rpool  feature@filesystem_limits      enabled                        local
rpool  feature@large_blocks           enabled                        local
rpool  feature@large_dnode            enabled                        local
rpool  feature@sha512                 enabled                        local
rpool  feature@skein                  enabled                        local
rpool  feature@edonr                  enabled                        local
rpool  feature@userobj_accounting     active                         local
rpool  feature@encryption             enabled                        local
rpool  feature@project_quota          active                         local
rpool  feature@device_removal         enabled                        local
rpool  feature@obsolete_counts        enabled                        local
rpool  feature@zpool_checkpoint       enabled                        local
rpool  feature@spacemap_v2            active                         local
rpool  feature@allocation_classes     enabled                        local
rpool  feature@resilver_defer         enabled                        local
rpool  feature@bookmark_v2            enabled                        local

Note: I also tried removing one disk from the pool, but performance did not improve (or change at all to be honest). Only adding the disk to the pool again took like 36 hours of resilvering the data.

I did run smartctl, but it didn't report any issues and SSD wear also seems to be ok.

Code:

root@duckburg:/zfs-data# smartctl -a /dev/sde
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.78-2-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron Client SSDs
Device Model:     CT1000BX500SSD1
[...]

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
[...]

Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       18
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   087   087   000    Old_age   Always       -       202
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       10
180 Unused_Reserve_NAND_Blk 0x0033   100   100   000    Pre-fail  Always       -       13
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   065   048   000    Old_age   Always       -       35 (Min/Max 25/52)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_ECC_Cnt 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   087   087   001    Old_age   Offline      -       13
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       63437345129
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       1982417035
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       13068894592
249 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       0
250 Read_Error_Retry_Rate   0x0032   100   100   000    Old_age   Always       -       923940
251 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       32
252 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       0
253 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       0
254 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       209
223 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       10

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      4571         -
# 2  Extended offline    Interrupted (host reset)      90%      4570         -
# 3  Short offline       Interrupted (host reset)      90%      4569         -

Selective Self-tests/Logging not supported

and for the second disk results are very similar, as both disks were new at the time of assembly.

I read (https://jrs-s.net/2018/03/13/zvol-vs-qcow2-with-kvm/) that ZFS can be best benchmarked by using a tool named ifo (which I never heard of before). But I run the recommended commands and thought maybe it helps someone to understand my issue if I copy the output here. The tests have been run with all guests turned off and a fresh rebooted system.

The test I ran were synchronous 4K writes:

Command: fio --name=random-write --ioengine=libaio --iodepth=4 --rw=randwrite --bs=4k --direct=0 --size=256m --numjobs=4 --end_fsync=1

On the SSD it printed the following results:

Code:

Run status group 0 (all jobs):
WRITE: bw=2734KiB/s (2799kB/s), 683KiB/s-811KiB/s (700kB/s-831kB/s), io=1024MiB (1074MB), run=323169-383596msec

Note: During this test I observed IO waits of up 99.15% on the Proxmox summary page while my CPU load was around 1%.

While running the same test on the HDD mirror was way way faster and yielded the following results:

Code:

Run status group 0 (all jobs):
WRITE: bw=59.7MiB/s (62.6MB/s), 14.9MiB/s-14.9MiB/s (15.6MB/s-15.7MB/s), io=1024MiB (1074MB), run=17125-17162msec

Note: During this test IO waits topped at 13% in the summary page.

Based on this results, it seems my SSD is 20x slower that my HDD? Can this be true, are they really that bad?

Any advice is very much appreciated!

Cheers,
Ralph

PS: Since this is my first post, I also like to say a big thank you to the people developing Proxmox VE - amazing job!
PPS: The full output of the commands is attached in a separate file. If you need any information, I'll be happy to provide what ever you need.

Dunuin · Jan 16, 2021

What is your ARC size? 8GB?

Did you also tried that test with bigger async writes? Your SSDs can't use the RAM cache for sync writes because it got no powerloss protection so its normal that sync writes are super slow, especially if you are using small 4k writes.

2 HDD: WD RED 4TB each a separate ZFS volume

ZFS can't repair corrupted data if there is no mirror or parity data. A raidz1 with 4 discs might be more usefull. Same capacity, all 4 drives may fail without dataloss and ZFS could repair corrupted data. Only write speed isn't that good.

apoc · Jan 16, 2021

We recently had a longer thread where the solution of the issue was the blocksize of the ZVOL dataset.
IIRC the default is 8KB which caused a lot of overhead. The author of the thread switched to 128kb and the problem went away.
HTH

6uellerbpanda · Jan 16, 2021

pl post output of

Code:

arc_summary

and

Code:

zpool get all rpool

ralph.vigne · Jan 16, 2021

@Dunuin : it's the default set by Proxmox, at least I did not remember changing any of it.

Code:

root@duckburg:~# cat /proc/spl/kstat/zfs/arcstats |grep c_
c_min                           4    520479104
c_max                           4    8327665664
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    400818432
arc_meta_limit                  4    6245749248
arc_dnode_limit                 4    624574924
arc_meta_max                    4    406443400
arc_meta_min                    4    16777216
async_upgrade_sync              4    215
arc_need_free                   4    0
arc_sys_free                    4    260239552
arc_raw_size                    4    0

Based on this output I guess that answer to your question is yes, it is 8GB.

Also, yes I'm aware of that - ZFS will not be capable of fixing data if not replicated in some sort (RAID,ZRAID, ...) but thanks for reminding

@tburger: I think I read it, but couldn't quite follow. I didn't understand what the right settings for my drive was. There seems to be this legacy thing "Industry-standard, 512-byte sector size support", then there is the 4K sector size which seem to be the current industry standard, and finally there is the data sheet from Crucial (https://www.ediatlanta.com/images/crucial-bx500-ssd-datasheet.pdf) mentioning 128KB as size for their performance test. Maybe you can tell which setting I should use? Also I got confused between the records size in ZFS and block size in the OS. Am I right that I need to create a new dataset with the adapted block size and then copy my data from the old to the new record size?

I also read that one should adapt the record size of the VM to that setting. Maybe you know, is this a setting in Proxomx or is this something need to figure out in the guest OS?

@6uellerbpanda:

Code:

root@duckburg:~# arc_summary

------------------------------------------------------------------------
ZFS Subsystem Report Sat Jan 16 21:13:30 2021
Linux 5.4.78-2-pve 0.8.5-pve1
Machine: duckburg (x86_64) 0.8.5-pve1

ARC status: HEALTHY
Memory throttle count: 0

ARC size (current): 61.8 % 4.8 GiB
Target size (adaptive): 100.0 % 7.8 GiB
Min size (hard limit): 6.2 % 496.4 MiB
Max size (high water): 16:1 7.8 GiB
Most Frequently Used (MFU) cache size: 32.7 % 1.5 GiB
Most Recently Used (MRU) cache size: 67.3 % 3.1 GiB
Metadata cache size (hard limit): 75.0 % 5.8 GiB
Metadata cache size (current): 6.9 % 408.9 MiB
Dnode cache size (hard limit): 10.0 % 595.6 MiB
Dnode cache size (current): 4.5 % 26.9 MiB

ARC hash breakdown:
Elements max: 509.0k
Elements current: 99.9 % 508.3k
Collisions: 170.3k
Chain max: 5
Chains: 52.4k

ARC misc:
Deleted: 83
Mutex misses: 0
Eviction skips: 1.3k

ARC total accesses (hits + misses): 9.4M
Cache hit ratio: 96.2 % 9.0M
Cache miss ratio: 3.8 % 356.2k
Actual hit ratio (MFU + MRU hits): 93.8 % 8.8M
Data demand efficiency: 97.1 % 5.3M
Data prefetch efficiency: 63.7 % 529.0k

Cache hits by cache type:
Most frequently used (MFU): 81.0 % 7.3M
Most recently used (MRU): 16.4 % 1.5M
Most frequently used (MFU) ghost: 0.0 % 0
Most recently used (MRU) ghost: 0.0 % 0
Anonymously used: 2.5 % 228.9k

Cache hits by data type:
Demand data: 57.2 % 5.2M
Demand prefetch data: 3.7 % 336.8k
Demand metadata: 38.8 % 3.5M
Demand prefetch metadata: 0.2 % 19.1k

Cache misses by data type:
Demand data: 43.5 % 154.8k
Demand prefetch data: 54.0 % 192.3k
Demand metadata: 1.7 % 5.9k
Demand prefetch metadata: 0.9 % 3.2k

DMU prefetch efficiency: 5.5M
Hit ratio: 1.3 % 73.1k
Miss ratio: 98.7 % 5.4M

L2ARC not detected, skipping section

Solaris Porting Layer (SPL):
spl_hostid 0
spl_hostid_path /etc/hostid
spl_kmem_alloc_max 1048576
spl_kmem_alloc_warn 65536
spl_kmem_cache_expire 2
spl_kmem_cache_kmem_limit 2048
spl_kmem_cache_kmem_threads 4
spl_kmem_cache_magazine_size 0
spl_kmem_cache_max_size 32
spl_kmem_cache_obj_per_slab 8
spl_kmem_cache_obj_per_slab_min 1
spl_kmem_cache_reclaim 0
spl_kmem_cache_slab_limit 16384
spl_max_show_tasks 512
spl_panic_halt 0
spl_schedule_hrtimeout_slack_us 0
spl_taskq_kick 0
spl_taskq_thread_bind 0
spl_taskq_thread_dynamic 1
spl_taskq_thread_priority 1
spl_taskq_thread_sequential 4

Tunables:
dbuf_cache_hiwater_pct 10
dbuf_cache_lowater_pct 10
dbuf_cache_max_bytes 260239552
dbuf_cache_shift 5
dbuf_metadata_cache_max_bytes 130119776
dbuf_metadata_cache_shift 6
dmu_object_alloc_chunk_shift 7
dmu_prefetch_max 134217728
ignore_hole_birth 1
l2arc_feed_again 1
l2arc_feed_min_ms 200
l2arc_feed_secs 1
l2arc_headroom 2
l2arc_headroom_boost 200
l2arc_noprefetch 1
l2arc_norw 0
l2arc_write_boost 8388608
l2arc_write_max 8388608
metaslab_aliquot 524288
metaslab_bias_enabled 1
metaslab_debug_load 0
metaslab_debug_unload 0
metaslab_df_max_search 16777216
metaslab_df_use_largest_segment 0
metaslab_force_ganging 16777217
metaslab_fragmentation_factor_enabled 1
metaslab_lba_weighting_enabled 1
metaslab_preload_enabled 1
send_holes_without_birth_time 1
spa_asize_inflation 24
spa_config_path /etc/zfs/zpool.cache
spa_load_print_vdev_tree 0
spa_load_verify_data 1
spa_load_verify_metadata 1
spa_load_verify_shift 4
spa_slop_shift 5
vdev_removal_max_span 32768
vdev_validate_skip 0
zap_iterate_prefetch 1
zfetch_array_rd_sz 1048576
zfetch_max_distance 8388608
zfetch_max_streams 8
zfetch_min_sec_reap 2
zfs_abd_scatter_enabled 1
zfs_abd_scatter_max_order 10
zfs_abd_scatter_min_size 1536
zfs_admin_snapshot 0
zfs_arc_average_blocksize 8192
zfs_arc_dnode_limit 0
zfs_arc_dnode_limit_percent 10
zfs_arc_dnode_reduce_percent 10
zfs_arc_grow_retry 0
zfs_arc_lotsfree_percent 10
zfs_arc_max 0
zfs_arc_meta_adjust_restarts 4096
zfs_arc_meta_limit 0
zfs_arc_meta_limit_percent 75
zfs_arc_meta_min 0
zfs_arc_meta_prune 10000
zfs_arc_meta_strategy 1
zfs_arc_min 0
zfs_arc_min_prefetch_ms 0
zfs_arc_min_prescient_prefetch_ms 0
zfs_arc_p_dampener_disable 1
zfs_arc_p_min_shift 0
zfs_arc_pc_percent 0
zfs_arc_shrink_shift 0
zfs_arc_sys_free 0
zfs_async_block_max_blocks 100000
zfs_autoimport_disable 1
zfs_checksum_events_per_second 20
zfs_commit_timeout_pct 5
zfs_compressed_arc_enabled 1
zfs_condense_indirect_commit_entry_delay_ms 0
zfs_condense_indirect_vdevs_enable 1
zfs_condense_max_obsolete_bytes 1073741824
zfs_condense_min_mapping_bytes 131072
zfs_dbgmsg_enable 1
zfs_dbgmsg_maxsize 4194304
zfs_dbuf_state_index 0
zfs_ddt_data_is_special 1
zfs_deadman_checktime_ms 60000
zfs_deadman_enabled 1
zfs_deadman_failmode wait
zfs_deadman_synctime_ms 600000
zfs_deadman_ziotime_ms 300000
zfs_dedup_prefetch 0
zfs_delay_min_dirty_percent 60
zfs_delay_scale 500000
zfs_delete_blocks 20480
zfs_dirty_data_max 1665533132
zfs_dirty_data_max_max 4163832832
zfs_dirty_data_max_max_percent 25
zfs_dirty_data_max_percent 10
zfs_dirty_data_sync_percent 20
zfs_disable_ivset_guid_check 0
zfs_dmu_offset_next_sync 0
zfs_expire_snapshot 300
zfs_flags 0
zfs_free_bpobj_enabled 1
zfs_free_leak_on_eio 0
zfs_free_min_time_ms 1000
zfs_immediate_write_sz 32768
zfs_initialize_value 16045690984833335022
zfs_key_max_salt_uses 400000000
zfs_lua_max_instrlimit 100000000
zfs_lua_max_memlimit 104857600
zfs_max_missing_tvds 0
zfs_max_recordsize 1048576
zfs_metaslab_fragmentation_threshold 70
zfs_metaslab_segment_weight_enabled 1
zfs_metaslab_switch_threshold 2
zfs_mg_fragmentation_threshold 95
zfs_mg_noalloc_threshold 0
zfs_multihost_fail_intervals 10
zfs_multihost_history 0
zfs_multihost_import_intervals 20
zfs_multihost_interval 1000
zfs_multilist_num_sublists 0
zfs_no_scrub_io 0
zfs_no_scrub_prefetch 0
zfs_nocacheflush 0
zfs_nopwrite_enabled 1
zfs_object_mutex_size 64
zfs_obsolete_min_time_ms 500
zfs_override_estimate_recordsize 0
zfs_pd_bytes_max 52428800
zfs_per_txg_dirty_frees_percent 5
zfs_prefetch_disable 0
zfs_read_chunk_size 1048576
zfs_read_history 0
zfs_read_history_hits 0
zfs_reconstruct_indirect_combinations_max 4096
zfs_recover 0
zfs_recv_queue_length 16777216
zfs_removal_ignore_errors 0
zfs_removal_suspend_progress 0
zfs_remove_max_segment 16777216
zfs_resilver_disable_defer 0
zfs_resilver_min_time_ms 3000
zfs_scan_checkpoint_intval 7200
zfs_scan_fill_weight 3
zfs_scan_ignore_errors 0
zfs_scan_issue_strategy 0
zfs_scan_legacy 0
zfs_scan_max_ext_gap 2097152
zfs_scan_mem_lim_fact 20
zfs_scan_mem_lim_soft_fact 20
zfs_scan_strict_mem_lim 0
zfs_scan_suspend_progress 0
zfs_scan_vdev_limit 4194304
zfs_scrub_min_time_ms 1000
zfs_send_corrupt_data 0
zfs_send_queue_length 16777216
zfs_send_unmodified_spill_blocks 1
zfs_slow_io_events_per_second 20
zfs_spa_discard_memory_limit 16777216
zfs_special_class_metadata_reserve_pct 25
zfs_sync_pass_deferred_free 2
zfs_sync_pass_dont_compress 8
zfs_sync_pass_rewrite 2
zfs_sync_taskq_batch_pct 75
zfs_trim_extent_bytes_max 134217728
zfs_trim_extent_bytes_min 32768
zfs_trim_metaslab_skip 0
zfs_trim_queue_limit 10
zfs_trim_txg_batch 32
zfs_txg_history 100
zfs_txg_timeout 5
zfs_unlink_suspend_progress 0
zfs_user_indirect_is_special 1
zfs_vdev_aggregate_trim 0
zfs_vdev_aggregation_limit 1048576
zfs_vdev_aggregation_limit_non_rotating 131072
zfs_vdev_async_read_max_active 3
zfs_vdev_async_read_min_active 1
zfs_vdev_async_write_active_max_dirty_percent 60
zfs_vdev_async_write_active_min_dirty_percent 30
zfs_vdev_async_write_max_active 10
zfs_vdev_async_write_min_active 2
zfs_vdev_cache_bshift 16
zfs_vdev_cache_max 16384
zfs_vdev_cache_size 0
zfs_vdev_default_ms_count 200
zfs_vdev_initializing_max_active 1
zfs_vdev_initializing_min_active 1
zfs_vdev_max_active 1000
zfs_vdev_min_ms_count 16
zfs_vdev_mirror_non_rotating_inc 0
zfs_vdev_mirror_non_rotating_seek_inc 1
zfs_vdev_mirror_rotating_inc 0
zfs_vdev_mirror_rotating_seek_inc 5
zfs_vdev_mirror_rotating_seek_offset 1048576
zfs_vdev_ms_count_limit 131072
zfs_vdev_queue_depth_pct 1000
zfs_vdev_raidz_implcycle [fastest] original scalar sse2 ssse3 avx2
zfs_vdev_read_gap_limit 32768
zfs_vdev_removal_max_active 2
zfs_vdev_removal_min_active 1
zfs_vdev_scheduler unused
zfs_vdev_scrub_max_active 2
zfs_vdev_scrub_min_active 1
zfs_vdev_sync_read_max_active 10
zfs_vdev_sync_read_min_active 10
zfs_vdev_sync_write_max_active 10
zfs_vdev_sync_write_min_active 10
zfs_vdev_trim_max_active 2
zfs_vdev_trim_min_active 1
zfs_vdev_write_gap_limit 4096
zfs_zevent_cols 80
zfs_zevent_console 0
zfs_zevent_len_max 64
zfs_zil_clean_taskq_maxalloc 1048576
zfs_zil_clean_taskq_minalloc 1024
zfs_zil_clean_taskq_nthr_pct 100
zil_maxblocksize 131072
zil_nocacheflush 0
zil_replay_disable 0
zil_slog_bulk 786432
zio_deadman_log_all 0
zio_dva_throttle_enabled 1
zio_requeue_io_start_cut_in_line 1
zio_slow_io_ms 30000
zio_taskq_batch_pct 75
zvol_inhibit_dev 0
zvol_major 230
zvol_max_discard_blocks 16384
zvol_prefetch_bytes 131072
zvol_request_sync 0
zvol_threads 32
zvol_volmode 1

VDEV cache disabled, skipping section

ZIL committed transactions: 735.0k
Commit requests: 26.2k
Flushes to stable storage: 26.2k
Transactions to SLOG storage pool: 0 Bytes 0
Transactions to non-SLOG storage pool: 3.9 GiB 55.2k

and

Code:

root@duckburg:~# zpool get all rpool
NAME   PROPERTY                       VALUE                          SOURCE
rpool size 928G -
rpool capacity 55% -
rpool altroot - default
rpool health ONLINE -
rpool guid 17740624895035484119 -
rpool version - default
rpool bootfs rpool/ROOT/pve-1 local
rpool delegation on default
rpool autoreplace off default
rpool cachefile - default
rpool failmode wait default
rpool listsnapshots off default
rpool autoexpand off default
rpool dedupditto 0 default
rpool dedupratio 1.00x -
rpool free 413G -
rpool allocated 515G -
rpool readonly off -
rpool ashift 12 local
rpool comment - default
rpool expandsize - -
rpool freeing 0 -
rpool fragmentation 33% -
rpool leaked 0 -
rpool multihost off default
rpool checkpoint - -
rpool load_guid 10769748209767863697 -
rpool autotrim off default
rpool feature@async_destroy enabled local
rpool feature@empty_bpobj active local
rpool feature@lz4_compress active local
rpool feature@multi_vdev_crash_dump enabled local
rpool feature@spacemap_histogram active local
rpool feature@enabled_txg active local
rpool feature@hole_birth active local
rpool feature@extensible_dataset active local
rpool feature@embedded_data active local
rpool feature@bookmarks enabled local
rpool feature@filesystem_limits enabled local
rpool feature@large_blocks enabled local
rpool feature@large_dnode enabled local
rpool feature@sha512 enabled local
rpool feature@skein enabled local
rpool feature@edonr enabled local
rpool feature@userobj_accounting active local
rpool feature@encryption enabled local
rpool feature@project_quota active local
rpool feature@device_removal enabled local
rpool feature@obsolete_counts enabled local
rpool feature@zpool_checkpoint enabled local
rpool feature@spacemap_v2 active local
rpool feature@allocation_classes enabled local
rpool feature@resilver_defer enabled local
rpool feature@bookmark_v2 enabled local

apoc · Jan 16, 2021

ralph.vigne said:
There seems to be this legacy thing "Industry-standard, 512-byte sector size support", then there is the 4K sector size which seem to be the current industry standard

You are talking about physical sector sizes here.
4kb alignment is important which is reflected by the ASHIFT=12 setting on a ZFS pool. As long as you use PVE default, this is fine.

What I was referring to: the block size/record size ZFS uses on a particular ZVOL.
A ZVOL is a virtual block device which can be used for various things. On PVE it represents a VM disk. This has nothing to do with in-guest file system settings.
It is a parameter which tells how to handle storage of data within ZFS.
I can't tell.if that helps in your case but your description reminded me of that thread.

I have not tried to change a recordsize of an existing ZVOL. I think this is something to be done on creation. Someone mentioned you can migrate to a different record size via ZFS send / ZFS receive but I did not try / use / qualify that information.

Dunuin · Jan 17, 2021

tburger said:
I have not tried to change a recordsize of an existing ZVOL. I think this is something to be done on creation. Someone mentioned you can migrate to a different record size via ZFS send / ZFS receive but I did not try / use / qualify that information.

Yes, volblocksize of a zvol (virtual HDD) can only be set at creation. You can set the volblocksize for newly created zvols by changing the value "Datacenter -> Storage -> YourPoolName -> Edit -> Blocksize". So, if you want to change it for existing zvols, you need to...
- create a new virtual HDD
- boot from an linux live CD
- copy the old virtual HDD content (on blocklevel) to the new virtual HDD using "dd" or "zfs send | zfs receive"
- unmount the old virtual HDD (or UUIDs woudn't be unique)
- test if everything is working
- destroy the old virtual HDD if everything works

But if you just want to test if it will speed up things, just create a new VM after changing the volblocksize to something higher like 128K.

Also I got confused between the records size in ZFS and block size in the OS.

There are several block sizes your system is using:
1.) physical sector size of your SSD: your SSD will most likely lie to you. All SSDs will tell you they are using 512B or 4K but interally there will be something much higher used.
2.) logical sector size of your SSD: your SSD will most likely lie to you. All SSDs will tell you they are using 512B or 4K but interally there will be something much higher used.
3.) ashift of your ZFS pool: this tells ZFS what blocksize it should use to communicate with the SSDs. Ashift of 12 or 13 for 4K or 8K should be fine.
4.) record size: this is the block size datasets are using to store stuff. It is ONLY used for files on that dataset. Zvols will ignore this setting and use volblocksize instead.
5.) volblocksize: this is the block size of a zvol
6.) the virtual virtio SCSI controller will use a block size too. By default it should be 512B. So your VMs SCSI controller will try to write in 512B blocks to that zvol.
7.) blocksize of your guests filesystem: a ext4 for example will most likely use 4K blocks to store stuff.

Mixing block sizes will create overhead, especially if a blocksize, lower in the list, will use a smaller blocksize than a blocksize higher in the list.

Right now it should look like this if using LXCs:
512B (SSD sector size) <-- 4K (ashift of pool) <-- 128K (record size of dataset)

And like this if using a Linux-VM with ext4:
512B (SSD sector size) <-- 4K (ashift of pool) <-- 8K (volblocksize of zvol) <- 512B (virtio SCSI) <- 4K (ext4 inside guest)

6uellerbpanda · Jan 17, 2021

ok I thought on zfs on linux (I'm on freebsd) you would see the vol/recordsize of the zpool but I looks like not.

as the others already suggested check your vol/recordsize and the default 128kb should be fine.

I'm in general rather reluctant to zfs benchmarks 'cause of the magic (caching,compression,tgx groups,...) involved in the background but I do agree that spinning hdd shouldn't be faster then your ssd.

if you run the fio stuff you should also check the output of arcstat and zpool iostat to see how it is hitting the cache and disks.

Did you run the benchmark from inside the VM or on the pve itself ?

6uellerbpanda · Jan 17, 2021

also check your bios version. according to gigabyte version F4 has a fix "Improve SSD Performance" - whatever that means

ralph.vigne · Jan 18, 2021

@6uellerbpanda: I checked the BIOS, and I'm running F7 which seems to be the latest version.

I ran the tests in PVE itself with all guests stopped. I used the PVE installer to create the ZFS file system (including the mirroring settings) during install and didn't change any of the default settings.

@Dunuin: I tried what you proposed and change the settings of the volblocksize in Datacenter -> Storage -> local-zfs to 128K in GUI. Next I created a new VM (with a new volume of course) and tried the fio command again, but unfortunately with the same results. :-( When I check ZFS it states using zfs get it shows volblocksize 128K. The datasets I created before this change are showing volblocksize 8K, so the settings seem to have worked.

Next I created a new storage (ZFS with the folder icon) and set the blocksize to 128K. I again created a new VM with its storage now in the new 128K ZFS storage. I ran the ifo again, and it seems to stay the same. 128K is shown on zfs get, but the performance measured by the fio command is still slower than the HDDs.

6uellerbpanda · Jan 19, 2021

that's certainly strange....is it possible for you to remove the rpool and use ext4 for testing ? to rule out the layers below...

ralph.vigne · Jan 20, 2021

First, thank you so much for your time!

So, I did the following:

First I removed one SSD from the pool and reinstalled Proxmox on this disk using ext4. Booting this new installation, I ran

fio --name test-write --ioengine=libaio --iodepth=16 --rw=randwrite --bs=128k --dir ect=0 --size=256m --numjobs=8 --end_fsync=1

which gave me the following results:

Code:

Run status group 0 (all jobs):
  WRITE: bw=420MiB/s (440MB/s), 52.5MiB/s-112MiB/s (55.0MB/s-118MB/s), io=2048MiB (2147MB), run=2276-4878msec

This is about what I expected based on the specs of the SSD. I ran it again with a smaller block size e.g. 8K with similar results.

Next I destroyed the remaining ZFS pool (after backing it up of course) and recreated it with the default settings zpool create tank /dev/sde . Here is a list of the ZFS properties:

Code:

root@duckburg:/tank# zfs get all
NAME  PROPERTY              VALUE                  SOURCE
tank type filesystem -
tank creation Wed Jan 20 10:48 2021 -
tank used 2.00G -
tank available 897G -
tank referenced 2.00G -
tank compressratio 1.00x -
tank mounted yes -
tank quota none default
tank reservation none default
tank recordsize 128K default
tank mountpoint /tank default
tank sharenfs off default
tank checksum on default
tank compression off default
tank atime on default
tank devices on default
tank exec on default
tank setuid on default
tank readonly off default
tank zoned off default
tank snapdir hidden default
tank aclinherit restricted default
tank createtxg 1 -
tank canmount on default
tank xattr on default
tank copies 1 default
tank version 5 -
tank utf8only off -
tank normalization none -
tank casesensitivity sensitive -
tank vscan off default
tank nbmand off default
tank sharesmb off default
tank refquota none default
tank refreservation none default
tank guid 5626506341808755361 -
tank primarycache all default
tank secondarycache all default
tank usedbysnapshots 0B -
tank usedbydataset 2.00G -
tank usedbychildren 78K -
tank usedbyrefreservation 0B -
tank logbias latency default
tank objsetid 54 -
tank dedup off default
tank mlslabel none default
tank sync standard default
tank dnodesize legacy default
tank refcompressratio 1.00x -
tank written 2.00G -
tank logicalused 2.00G -
tank logicalreferenced 2.00G -
tank volmode default default
tank filesystem_limit none default
tank snapshot_limit none default
tank filesystem_count none default
tank snapshot_count none default
tank snapdev hidden default
tank acltype off default
tank context none default
tank fscontext none default
tank defcontext none default
tank rootcontext none default
tank relatime off default
tank redundant_metadata all default
tank overlay off default
tank encryption off default
tank keylocation none default
tank keyformat none default
tank pbkdf2iters 0 default
tank special_small_blocks 0 default

Next I ran the test again:

Code:

Run status group 0 (all jobs):
  WRITE: bw=6746KiB/s (6907kB/s), 843KiB/s-844KiB/s (863kB/s-865kB/s), io=2048MiB (2147MB), run=310449-310896msec

This is pretty much the same results I got with Proxmox installed on ZFS directly. Using ifo with a block size of 8K made it even worse.

Since the two SSDs are of the same type and connected to the same main board, I assume the only difference is ZFS, right? Maybe one of you can tell me how to tweak the zpool create command to improve my SSD performance?

apoc · Jan 20, 2021

ralph.vigne said:
zpool create command to improve my SSD performance?

Did you use ashift=12 as parameter? This ensures 4k alignment.
Some ssds seem also need to align to 8k which would be ashift=13 (which I would not find unlikely considering the TLC nature of the ssds as I understood the specs)

Dunuin · Jan 20, 2021

tank atime on default

You could deactivate atime. Should increase performance and lifetime of that SSD because with that disabled not every read will create a write.

6uellerbpanda · Jan 20, 2021

ralph.vigne said:
Since the two SSDs are of the same type and connected to the same main board, I assume the only difference is ZFS, right?

looks like it

pl show the output of arcstat 1 and zpool iostat rpool 1 when you run the fio command.

ralph.vigne · Jan 20, 2021

@tburger: I checked on the pool, and was surprised to see that ashift was set to 0. I tried recreating the pool using zpool create -o ashift=12 tank /dev/sde but there was very little impact in performance:

Code:

Run status group 0 (all jobs):
  WRITE: bw=6684KiB/s (6845kB/s), 836KiB/s-837KiB/s (856kB/s-857kB/s), io=2048MiB (2147MB), run=313302-313753msec

Then I tried zpool create -o ashift=13 tank /dev/sde but it seems to have made it even worse :-(

Code:

  WRITE: bw=4987KiB/s (5107kB/s), 623KiB/s-624KiB/s (638kB/s-639kB/s), io=2048MiB (2147MB), run=420052-420497msec

Note: As I destroyed the pool between the first and the second run I can assume there was no io-buffer impacting the second run, right?

What I noticed though, was that when using ashift=12 the io-delay was around 50% for 8 minutes, and when using ashift=13 it was only around 25-30% for around 10 minutes.

Also, I checked for the physical sector size using lsblk -o NAME,MOUNTPOINT,PHY-SEC and saw that in the output, all 4 HDDs are listed with 4096 sector size while both SSDs only show 512. While we already expected this, I am surprised that it works perfectly with ext4 and 512, while it doesn't for ZFS. Not sure if this provides any insights, but I thought I mention it just in case it hints anything.

@6uellerbpanda:
I did run arcstat 1 and zpool iostat tank 1 in two additional SSH sesssions while running the test command in a third one. I used the pool with ashift=13 for this test with the following results:

Code:

Run status group 0 (all jobs):
  WRITE: bw=5286KiB/s (5413kB/s), 661KiB/s-1970KiB/s (677kB/s-2017kB/s), io=2048MiB (2147MB), run=133057-396714msec

As for the command zpool iostat tank 1 it started with

Code:

              capacity     operations     bandwidth 
pool        alloc   free   read  write   read  write
---------- ----- ----- ----- ----- ----- -----
tank        2.00G   926G      0     34      0  4.37M

The values for write operations fluctuating between 47 and 20, with 35 being the majority, and bandwidth did fluctuate between 2.6M and 6M.

The command arcstat 1 created the following outputs:

Code:

time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
22:13:25     0     0      0     0    0     0    0     0    0   2.2G 7.8G

and as soon as I started the fio command, it changed to

Code:

22:13:29  4.9K     0      0     0    0     0    0     0    0   4.1G 7.8G

In the following lines, the value of arcsz increased over time up to 5.1G only to slowly decrease back again to 2.2.G when the test finished.

Do this values indicate any misconfiguration to you?

@Dunuin: Thanks for pointing this out. I will set atime off for all my datasets as I assume this timestamp is irrelevant for my workflows (mostly NAS). In this test case the results got a little better, but are still way below expectations.

Code:

Run status group 0 (all jobs):
  WRITE: bw=9028KiB/s (9244kB/s), 1128KiB/s-1155KiB/s (1156kB/s-1183kB/s), io=2048MiB (2147MB), run=226972-232307msec
root@duckburg:/tank#

Here is the graph of the CPU utilization during all 4 test runs:

6uellerbpanda · Jan 21, 2021

ralph.vigne said:
Do this values indicate any misconfiguration to you?

one line is not much

pl post the whole output.

I think nobody yet asked but did you also check the logs dmesg, syslog, etc.. ?

apoc · Jan 21, 2021

found this one and wanted to share:
https://github.com/openzfs/zfs/issues/10253

It describes a strange behavior when scrubbing and having parallel IO. It seems NCQ (Queue-Depth=1) somewhat had an effect.
Maybe that is something you'd like to try as well..

just have built that for myself:

Code:

#!/bin/bash
queuedepth=1

for i in a b c d e f g h i j k l m n o p q r s t u v w x y z ;
    do
        if test -d /sys/block/sd$i ;
            then
                echo "working on sd$i"
                echo "    before: "`cat /sys/block/sd$i/device/queue_depth`
                echo $queuedepth > /sys/block/sd$i/device/queue_depth
                echo "    after:  "`cat /sys/block/sd$i/device/queue_depth`
                sleep 1
        fi
done

spirit · Jan 22, 2021

Based on this results, it seems my SSD is 20x slower that my HDD? Can this be true, are they really that bad?

Yes it can be true with sync writes && cheap consumer ssd. zfs && ceph need fast sync write for their journal

https://www.sebastien-han.fr/blog/2...-if-your-ssd-is-suitable-as-a-journal-device/

you should use datacenter ssd for zfs, at least for the slog/zil.

ralph.vigne · Jan 22, 2021

@6uellerbpanda i attached the complete logs of fio, arcstat and zpool iostat. I also attached the output of dmesg and syslog. Sorry for not attaching it before, I thought it would be OK to summarize it.

@tburger I ran you script, and got the following output:

Code:

working on sda
    before: 32
after: 1
working on sdb
before: 32
after: 1
working on sdc
before: 32
after: 1
working on sdd
before: 32
after: 1
working on sde
before: 32
after: 1
working on sdf
before: 32
after: 1

Note: sd[a-d] are ZFS volumes which have not been imported since the server was started, thus no disk access (should) have happened. sdf is the ext4 formated disk that is currently used for the system. sde is the ZFS volume I use for testing the IO capacity (i.e. tank).

What I also noticed, but maybe this is expected, after running the script a second time, all values were set to 1.

@spirit Thank you very much for your input. I read the page you linked, and tried disabling the write-cache, with surprising results.
I had 8013kB/s with the cache enabled and between 16.2MB/s and 19.8MB/s after disabling it. Why is my disk faster with cache disabled?

I am not aware of any ZIL or SLOG storage. Are they created by default? Also, I thought I read somewhere in the forum they are (mostly) only relevant if you have deduplication turned on? Anyway, you're probably right, and my consumer grade SSDs sucks when hammered by ZFS. I guess I need to run it on ext4 (or maybe Btrfs?) as an alternative and keep ZFS for the spinning disks, as it works perfectly fine on those.

High I/O wait with SSDs

Member

Attachments

Distinguished Member

Famous Member

Renowned Member

Member

​

Famous Member

Distinguished Member

Renowned Member

Renowned Member

Member

Renowned Member

Member

Famous Member

Distinguished Member

Renowned Member

Member

Renowned Member

Famous Member

Distinguished Member

Member

Attachments

We value your privacy