About high load and hard drives

strix

Hello everyone,

I have been using Proxmox for the last two years, but I am facing one small problem that I cannot figure out how to solve, so I need some help or advice on what I can do in this kind of situation.

I am running Proxmox with one VM inside, with 16 vCPUs (2 sockets, 8 cores), 90 GB of RAM, 4 x 4 TB WD Gold datacenter HDDs in RAID 10 and 2 x 512 GB NVMe drives for caching. The four disks are in a ZFS RAID 10 pool and, by the way, they are encrypted. My disk settings for the VM are: vm-100-disk-1, cache=writeback, iops_rd=30000, iops_rd_max=32000, iops_wr=30000, iops_wr_max=32000, mbps_rd=30, mbps_rd_max=32, mbps_wr=30, mbps_wr_max=32, size=8T.
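For reference, in the Proxmox VM config the disk line with these limits would look roughly like this (the bus scsi0 and the storage name are placeholders here, only the options after the first comma are the ones listed above):

Code:
# /etc/pve/qemu-server/100.conf (excerpt) - scsi0 and the storage name are placeholders
scsi0: wd-1092G-dc:vm-100-disk-1,cache=writeback,iops_rd=30000,iops_rd_max=32000,iops_wr=30000,iops_wr_max=32000,mbps_rd=30,mbps_rd_max=32,mbps_wr=30,mbps_wr_max=32,size=8T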

The guest runs CentOS 7.6 and I host some websites and projects through my control panel. The problem is that when I try to upload or restore a big file (2 GB, 4 GB or more), I/O requests spike for no apparent reason and the server load goes very high, without any issue on CPU or RAM, but the services stop responding for several minutes until the upload or restore finishes. That is a huge problem for me because I do not want my projects to stop working.

In order to dig a bit further into the issue I ran some tests on the disks.

I ran the following test on the encrypted LUKS partition:

Code:
hdparm -Tt /dev/sda3
/dev/sda3:
Timing cached reads: 12832 MB in 1.99 seconds = 6442.37 MB/sec
Timing buffered disk reads: 38 MB in 3.09 seconds = 12.31 MB/sec

This shows that buffered disk reads are coming in at only 12.31 MB/sec.

I had a feeling that the LUKS encryption layer might not be particularly performant, so I compared this with a test on the XFS partition:
Code:
hdparm -Tt /dev/sda2
/dev/sda2:
Timing cached reads: 11598 MB in 1.99 seconds = 5824.03 MB/sec
Timing buffered disk reads: 112 MB in 3.00 seconds = 37.32 MB/sec

As you can see, throughput roughly triples, but it is still quite low at 37.32 MB/sec.
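To separate the cipher overhead from the disks themselves, cryptsetup also has an in-memory benchmark that never touches the drives:

Code:
# Pure in-memory cipher benchmark, no disk I/O involved.
# If the aes-xts numbers here are far above 12-37 MB/s, the bottleneck
# is the disks / ZFS layer rather than the LUKS encryption itself.
cryptsetup benchmark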

Let's also take a look at writes with the following test:
Code:
dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 55.9095 s, 19.2 MB/s

That's only 19 MB/s for writes to the disk, which is quite slow.
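dd only measures a single sequential stream, so a random-write test with fio (if installed) would give a picture closer to what a busy control panel actually does. A sketch along these lines, with the file path, size and iodepth only as examples:

Code:
# 4k random writes with O_DIRECT, bypassing the page cache.
fio --name=randwrite --filename=/root/fio-test --rw=randwrite \
    --bs=4k --size=1G --direct=1 --ioengine=libaio \
    --iodepth=16 --runtime=60 --time_based --group_reporting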

This means that disk utilization quickly spikes and saturates during operations that involve heavy disk I/O.

You can also see that my server regularly hits high utilization numbers in this sar report:

Code:
sar -dp | column -t
Linux 3.10.0-962.3.2.lve1.5.24.8.el7.x86_64 - 01/23/2019 _x86_64_ (16 CPU)
12:00:02 AM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
12:10:01 AM sdb 35.84 12.99 1065.57 30.09 0.32 8.98 6.19 22.18
12:10:01 AM sda 687.99 17214.39 9997.64 39.55 8.32 12.06 1.14 78.49
12:10:01 AM sdc 100.17 5.00 21273.32 212.42 74.61 744.62 5.80 58.14
12:10:01 AM luks-d113de61-2823-4b76-becf-b67bb8c4a986 764.43 17214.21 9997.64 35.60 85.24 111.47 1.04 79.54
12:10:01 AM centos-root 299.17 16900.24 1211.67 60.54 4.20 14.00 2.62 78.50
12:10:01 AM centos-swap 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12:10:01 AM centos-home 448.83 313.97 8786.72 20.28 81.04 180.52 0.58 25.99
12:10:01 AM backup 326.92 5.00 21273.32 65.09 1132.16 3463.07 1.78 58.14
12:20:01 AM sdb 66.84 27.86 1265.58 19.35 1.69 25.24 14.46 96.62
12:20:01 AM sda 497.77 725.88 30242.73 62.21 62.13 124.59 2.01 99.94
12:20:01 AM sdc 8.83 6.73 576.89 66.11 1.68 187.72 3.46 3.05
12:20:01 AM luks-d113de61-2823-4b76-becf-b67bb8c4a986 1159.81 726.24 30242.73 26.70 3717.22 3204.86 0.86 99.96
12:20:01 AM centos-root 39.85 529.83 738.42 31.82 53.19 1333.06 25.03 99.75
12:20:01 AM centos-swap 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12:20:01 AM centos-home 1084.74 196.41 29504.80 27.38 3663.99 3377.69 0.90 98.08
12:20:01 AM backup 35.05 6.73 576.89 16.65 8.81 251.51 0.87 3.05
12:30:26 AM sdb 332.55 80.13 6446.01 19.62 1.28 3.83 2.16 71.84
12:30:26 AM sda 253.49 5061.01 28166.41 131.08 60.90 239.89 2.40 60.80
12:30:26 AM sdc 45.60 987.64 1361.17 51.51 5.55 121.66 12.48 56.91

The column to look at is the last one, %util.
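sar only shows 10-minute averages, so to watch the same counters live while an upload or restore is running, iostat from the same sysstat package can be used:

Code:
# Extended stats in MB, refreshed every 2 seconds.
# Watch %util and await on sda and the luks-* device while uploading a big file.
iostat -xm 2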

As far as I can check, everything looks fine inside the CentOS system, but I think I am doing something wrong on the Proxmox side, because the disk performance is very bad and slow, as you can see from the tests.

Any help or advice about the Proxmox settings is welcome, thanks for your time.
 
Code:
 zpool status

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0h6m with 0 errors on Sun Jan 13 00:30:58 2019
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdg2    ONLINE       0     0     0
            sdf2    ONLINE       0     0     0

errors: No known data errors

  pool: wd-1092G-dc
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 22h7m with 0 errors on Sun Jan 13 22:31:38 2019
config:

        NAME                                                 STATE     READ WRITE CKSUM
        wd-1092G-dc                                          DEGRADED     0     0     0
          mirror-0                                           ONLINE       0     0     0
            ata-WDC_WD6002FRYZ-01WD5B0_K1JSXNHD              ONLINE       0     0     0
            ata-WDC_WD6002FRYZ-01WD5B0_K1JSMWXD              ONLINE       0     0     0
          mirror-1                                           ONLINE       0     0     0
            ata-WDC_WD6002FRYZ-01WD5B1_K1KATHAD              ONLINE       0     0     0
            ata-WDC_WD6002FRYZ-01WD5B0_K1J6APRD              ONLINE       0     0     0
        logs
          mirror-2                                           DEGRADED     0     0     0
            nvme-INTEL_SSDPEKKW512G8_BTHH811211LM512D-part1  ONLINE       0     0     0
            nvme-INTEL_SSDPEKKW512G8_BTHH811214PF512D-part1  UNAVAIL      0    30     0  corrupted data
        cache
          nvme-INTEL_SSDPEKKW512G8_BTHH811211LM512D-part2    ONLINE       0     0     0
          nvme-INTEL_SSDPEKKW512G8_BTHH811214PF512D-part2    ONLINE   3.90M 49.6M     0

errors: No known data errors
 
Code:
 arc_summary
------------------------------------------------------------------------
ZFS Subsystem Report                            Thu Jan 24 11:45:38 2019
ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                54.23M
        Mutex Misses:                           18.86k
        Evict Skips:                            1.04M

ARC Size:                               28.41%  18.18   GiB
        Target Size: (Adaptive)         28.51%  18.25   GiB
        Min Size (Hard Limit):          25.00%  16.00   GiB
        Max Size (High Water):          4:1     64.00   GiB

ARC Size Breakdown:
        Recently Used Cache Size:       17.95%  2.86    GiB
        Frequently Used Cache Size:     82.05%  13.07   GiB

ARC Hash Breakdown:
        Elements Max:                           19.71M
        Elements Current:               80.23%  15.81M
        Collisions:                             92.70M
        Chain Max:                              8
        Chains:                                 2.75M

ARC Total accesses:                                     1.10G
        Cache Hit Ratio:                90.17%  987.85M
        Cache Miss Ratio:               9.83%   107.74M
        Actual Hit Ratio:               89.74%  983.21M

        Data Demand Efficiency:         79.59%  443.81M
        Data Prefetch Efficiency:       31.27%  22.26M

        CACHE HITS BY CACHE LIST:
          Most Recently Used:           17.54%  173.28M
          Most Frequently Used:         81.99%  809.93M
          Most Recently Used Ghost:     1.46%   14.46M
          Most Frequently Used Ghost:   0.03%   250.90k

        CACHE HITS BY DATA TYPE:
          Demand Data:                  35.76%  353.21M
          Prefetch Data:                0.70%   6.96M
          Demand Metadata:              63.40%  626.26M
          Prefetch Metadata:            0.14%   1.43M

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  84.09%  90.60M
          Prefetch Data:                14.20%  15.30M
          Demand Metadata:              1.60%   1.72M
          Prefetch Metadata:            0.11%   122.69k

L2 ARC Summary: (DEGRADED)
        Low Memory Aborts:                      167
        Free on Write:                          5.52k
        R/W Clashes:                            0
        Bad Checksums:                          4.09M
        IO Errors:                              4.09M

L2 ARC Size: (Adaptive)                         118.65  GiB
        Compressed:                     94.93%  112.64  GiB
        Header Size:                    0.92%   1.09    GiB
 
Code:
L2 ARC Evicts:
        Lock Retries:                           35
        Upon Reading:                           0

L2 ARC Breakdown:                               107.74M
        Hit Ratio:                      28.93%  31.17M
        Miss Ratio:                     71.07%  76.56M
        Feeds:                                  413.21k

L2 ARC Writes:
        Writes Sent:                    100.00% 403.67k

DMU Prefetch Efficiency:                                        305.89M
        Hit Ratio:                      15.22%  46.57M
        Miss Ratio:                     84.78%  259.32M

ZFS Tunables:
        dbuf_cache_hiwater_pct                            10
        dbuf_cache_lowater_pct                            10
        dbuf_cache_max_bytes                              104857600
        dbuf_cache_max_shift                              5
        dmu_object_alloc_chunk_shift                      7
        ignore_hole_birth                                 1
        l2arc_feed_again                                  1
        l2arc_feed_min_ms                                 200
        l2arc_feed_secs                                   1
        l2arc_headroom                                    2
        l2arc_headroom_boost                              200
        l2arc_noprefetch                                  1
        l2arc_norw                                        0
        l2arc_write_boost                                 8388608
        l2arc_write_max                                   8388608
        metaslab_aliquot                                  524288
        metaslab_bias_enabled                             1
        metaslab_debug_load                               0
        metaslab_debug_unload                             0
        metaslab_fragmentation_factor_enabled             1
        metaslab_lba_weighting_enabled                    1
        metaslab_preload_enabled                          1
        metaslabs_per_vdev                                200
        send_holes_without_birth_time                     1
        spa_asize_inflation                               24
        spa_config_path                                   /etc/zfs/zpool.cache
        spa_load_verify_data                              1
        spa_load_verify_maxinflight                       10000
        spa_load_verify_metadata                          1
        spa_slop_shift                                    5
        zfetch_array_rd_sz                                1048576
        zfetch_max_distance                               8388608
        zfetch_max_streams                                8
        zfetch_min_sec_reap                               2
        zfs_abd_scatter_enabled                           1
        zfs_abd_scatter_max_order                         10
        zfs_admin_snapshot                                1
        zfs_arc_average_blocksize                         8192
        zfs_arc_dnode_limit                               0
        zfs_arc_dnode_limit_percent                       10
        zfs_arc_dnode_reduce_percent                      10
        zfs_arc_grow_retry                                0
        zfs_arc_lotsfree_percent                          10
        zfs_arc_max                                       68719476736
        zfs_arc_meta_adjust_restarts                      4096
        zfs_arc_meta_limit                                0
        zfs_arc_meta_limit_percent                        75
        zfs_arc_meta_min                                  0
        zfs_arc_meta_prune                                10000
        zfs_arc_meta_strategy                             1
        zfs_arc_min                                       17179869184
        zfs_arc_min_prefetch_lifespan                     0
        zfs_arc_p_dampener_disable                        1
        zfs_arc_p_min_shift                               0
        zfs_arc_pc_percent                                0
        zfs_arc_shrink_shift                              0
        zfs_arc_sys_free                                  0
        zfs_autoimport_disable                            1
        zfs_checksums_per_second                          20
        zfs_compressed_arc_enabled                        1
        zfs_dbgmsg_enable                                 0
        zfs_dbgmsg_maxsize                                4194304
        zfs_dbuf_state_index                              0
        zfs_deadman_checktime_ms                          5000
        zfs_deadman_enabled                               1
        zfs_deadman_synctime_ms                           1000000
        zfs_dedup_prefetch                                0
        zfs_delay_min_dirty_percent                       60
        zfs_delay_scale                                   500000
        zfs_delays_per_second                             20
        zfs_delete_blocks                                 20480
        zfs_dirty_data_max                                4294967296
        zfs_dirty_data_max_max                            4294967296
        zfs_dirty_data_max_max_percent                    25
        zfs_dirty_data_max_percent                        10
        zfs_dirty_data_sync                               67108864
        zfs_dmu_offset_next_sync                          0
        zfs_expire_snapshot                               300
        zfs_flags                                         0
        zfs_free_bpobj_enabled                            1
        zfs_free_leak_on_eio                              0
        zfs_free_max_blocks                               100000
        zfs_free_min_time_ms                              1000
        zfs_immediate_write_sz                            32768
        zfs_max_recordsize                                1048576
        zfs_mdcomp_disable                                0
        zfs_metaslab_fragmentation_threshold              70
        zfs_metaslab_segment_weight_enabled               1
        zfs_metaslab_switch_threshold                     2
        zfs_mg_fragmentation_threshold                    85
        zfs_mg_noalloc_threshold                          0
        zfs_multihost_fail_intervals                      5
        zfs_multihost_history                             0
        zfs_multihost_import_intervals                    10
        zfs_multihost_interval                            1000
        zfs_multilist_num_sublists                        0
        zfs_no_scrub_io                                   0
        zfs_no_scrub_prefetch                             0
        zfs_nocacheflush                                  0
        zfs_nopwrite_enabled                              1
        zfs_object_mutex_size                             64
        zfs_pd_bytes_max                                  52428800
        zfs_per_txg_dirty_frees_percent                   30
        zfs_prefetch_disable                              0
        zfs_read_chunk_size                               1048576
        zfs_read_history                                  0
        zfs_read_history_hits                             0
        zfs_recover                                       0
        zfs_recv_queue_length                             16777216
        zfs_resilver_delay                                2
        zfs_resilver_min_time_ms                          3000
        zfs_scan_idle                                     50
        zfs_scan_ignore_errors                            0
        zfs_scan_min_time_ms                              1000
        zfs_scrub_delay                                   4
        zfs_send_corrupt_data                             0
        zfs_send_queue_length                             16777216
        zfs_sync_pass_deferred_free                       2
        zfs_sync_pass_dont_compress                       5
        zfs_sync_pass_rewrite                             2
        zfs_sync_taskq_batch_pct                          75
        zfs_top_maxinflight                               32
        zfs_txg_history                                   0
        zfs_txg_timeout                                   5
        zfs_vdev_aggregation_limit                        131072
        zfs_vdev_async_read_max_active                    3
        zfs_vdev_async_read_min_active                    1
        zfs_vdev_async_write_active_max_dirty_percent     60
        zfs_vdev_async_write_active_min_dirty_percent     30
        zfs_vdev_async_write_max_active                   10
        zfs_vdev_async_write_min_active                   2
        zfs_vdev_cache_bshift                             16
        zfs_vdev_cache_max                                16384
        zfs_vdev_cache_size                               0
        zfs_vdev_max_active                               1000
        zfs_vdev_mirror_non_rotating_inc                  0
        zfs_vdev_mirror_non_rotating_seek_inc             1
        zfs_vdev_mirror_rotating_inc                      0
        zfs_vdev_mirror_rotating_seek_inc                 5
        zfs_vdev_mirror_rotating_seek_offset              1048576
        zfs_vdev_queue_depth_pct                          1000
        zfs_vdev_raidz_impl                               [fastest] original scalar sse2 ssse3
        zfs_vdev_read_gap_limit                           32768
        zfs_vdev_scheduler                                noop
        zfs_vdev_scrub_max_active                         2
        zfs_vdev_scrub_min_active                         1
        zfs_vdev_sync_read_max_active                     10
        zfs_vdev_sync_read_min_active                     10
        zfs_vdev_sync_write_max_active                    10
        zfs_vdev_sync_write_min_active                    10
        zfs_vdev_write_gap_limit                          4096
        zfs_zevent_cols                                   80
        zfs_zevent_console                                0
        zfs_zevent_len_max                                640
        zil_replay_disable                                0
        zil_slog_bulk                                     786432
        zio_delay_max                                     30000
        zio_dva_throttle_enabled                          1
        zio_requeue_io_start_cut_in_line                  1
        zio_taskq_batch_pct                               75
        zvol_inhibit_dev                                  0
        zvol_major                                        230
        zvol_max_discard_blocks                           16384
        zvol_prefetch_bytes                               131072
        zvol_request_sync                                 0
        zvol_threads                                      32
        zvol_volmode                                      1

The problem appears every time I restore or upload something bigger than about 1 GB.

(Sorry for the three posts, but the forum did not allow me to post it all in one because it is too big.)
 
I would guess that you get high IO wait due to the slowness of ZFS with only 4 HDDs on asynchronous writes.

So all writes for which the guest VM does not require confirmation that they were written to disk go via buffers (RAM) to the HDDs, never touching your fast SLOGs. By the way, one of your SLOG devices is UNAVAIL.
As long as these buffers do not fill up, you do not experience slowdowns, but once they do fill, you get the raw speed of 4 HDDs in ZFS RAID 10, which seems to be too slow for your case (and it really is slow). If you search the forum, you will find it full of reports of high IO wait with ZFS where only a few spindles are used. My personal experience is the same: just one Windows server VM restarting on two HDDs in a ZFS mirror, with two Intel DC SSDs as SLOG and L2ARC, can push IO wait up to 50%, while with plain mdadm software RAID, without the Intel DC SSDs, there is no such high IO wait.

While you can try to reduce IO wait with some tuning, I doubt you will ever get it down to satisfactory levels with just 4 HDDs in ZFS RAID 10.
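If you want to experiment anyway, the usual knob is the async write buffer (your tunables show zfs_dirty_data_max at 4 GiB). Shrinking it makes ZFS throttle writers earlier, so instead of one long freeze you get smaller, more frequent slowdowns. A rough sketch, with the value only as an example:

Code:
# Runtime change only, reverts on reboot - test with a big upload afterwards.
echo 1073741824 > /sys/module/zfs/parameters/zfs_dirty_data_max   # 1 GiB instead of 4 GiB

# To keep it across reboots, add something like this to /etc/modprobe.d/zfs.conf:
# options zfs zfs_dirty_data_max=1073741824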

Hopefully I am wrong and someone can show us how to tune it to such an extent that IO wait becomes negligible, as it would be with mdadm.
 
I do hope you have an HBA and not a HW RAID controller.

Also, your SLOG is degraded.

First, please check https://forum.proxmox.com/threads/zil-l2arc-question.47266/#post-222861 regarding your L2ARC. This is also capping you.

ZFS mirrors are good for random IO but not so good for sequential IO.
In your case you only get the write performance of one HDD per vdev, and the bigger the HDD, the more latency you get. So in principle, with your HDD setup you won't achieve much of whatever you expect from it.

But you also have an option: remove the L2ARC to gain more ARC, and make the IOs synchronous, because then your write IOs will go to the SLOG, i.e. your SSD.
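A rough sketch of the L2ARC removal, using the cache partitions from the zpool status above (please double-check the device names before running anything):

Code:
# L2ARC (cache) devices can be removed from a live pool without downtime.
zpool remove wd-1092G-dc nvme-INTEL_SSDPEKKW512G8_BTHH811211LM512D-part2
zpool remove wd-1092G-dc nvme-INTEL_SSDPEKKW512G8_BTHH811214PF512D-part2
zpool status wd-1092G-dc    # the cache section should be gone afterwards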
 
To make all writes go to the fast SLOG, as panda said, you could do zfs set sync=always rpool/datasetOrZVOL.
Please test and report back, first after you disable the L2ARC and then again after you enable sync=always.
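A possible sequence, assuming the VM disk is a zvol named vm-100-disk-1 on the wd-1092G-dc pool (adjust the name to whatever zfs list shows):

Code:
zfs list -t volume                              # find the zvol that backs the VM disk
zfs set sync=always wd-1092G-dc/vm-100-disk-1   # force every write through the SLOG
zfs get sync wd-1092G-dc/vm-100-disk-1          # confirm the setting
# revert later with: zfs set sync=standard wd-1092G-dc/vm-100-disk-1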
 
Am I the only one who sees that the ZFS pool is in a degraded state? I would recommend fixing that first and trying again.
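The failing piece is the second SLOG mirror member from the zpool status above; something along these lines should clear the DEGRADED state (the replacement device is a placeholder):

Code:
# Option 1: replace the failed log partition with a healthy one.
zpool replace wd-1092G-dc nvme-INTEL_SSDPEKKW512G8_BTHH811214PF512D-part1 <new-device>

# Option 2: detach it and run with a single, non-mirrored SLOG for now.
zpool detach wd-1092G-dc nvme-INTEL_SSDPEKKW512G8_BTHH811214PF512D-part1

zpool status wd-1092G-dc    # the pool should report ONLINE again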
 
@6uellerbpanda has some valid arguments, and there are some improvements you can make. From your data I can see that your ARC has cached a lot of data. Also, all of your virtual HDDs are encrypted, so the probability that your data will be reused from the cache is low (only about 35% of the hits are data).
A better idea is to configure all, or at least some, of your vdisks to cache only metadata in the ARC, and all data or only metadata in the L2ARC. That way more memory can be allocated to your VM, and the VM will manage its own cache better than the more distant layers can.
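As a sketch, assuming the vdisk is the zvol vm-100-disk-1 on the wd-1092G-dc pool (adjust the dataset name to what zfs list shows):

Code:
zfs set primarycache=metadata wd-1092G-dc/vm-100-disk-1    # ARC keeps only metadata
zfs set secondarycache=all    wd-1092G-dc/vm-100-disk-1    # L2ARC may keep everything
zfs get primarycache,secondarycache wd-1092G-dc/vm-100-disk-1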

Another point is the block size used by your ZFS vdisks. With the default value (8k), each 8k write is split across the 2 stripes, i.e. 4k per vdev, so you end up with a lot of metadata. You can try 16-32k instead.
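Note that volblocksize is fixed when a zvol is created, so this only helps for newly created or moved disks. A sketch, with the dataset name as a guess:

Code:
zfs get volblocksize wd-1092G-dc/vm-100-disk-1   # check the current value (default 8K)

# In Proxmox the value for new disks comes from the "blocksize" option of the
# ZFS storage entry in /etc/pve/storage.cfg, for example:
#   zfspool: wd-1092G-dc
#           pool wd-1092G-dc
#           blocksize 16k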

Good luck.
 
