ZFS and high iowait / server load

mmenaz

Hi, I have an i5-4590 CPU, 16 GB of RAM, 2x 1 TB VelociRaptors in RAID1, and an Intel S3710 200 GB used as ZIL (16 GB) and L2ARC (110 GB). The ARC is limited to 4 GB of RAM since I don't have much to spare.
When I start a VM, the server load skyrockets to > 30-40. I have 3 VMs and I have to start them one after the other, because if they start at the same time the server becomes unusable (100% CPU).
Here are some details when I start only the first VM (4 GB RAM assigned, Win2003 32-bit, VirtIO) and I see about 32% iowait:
Code:
# iostat -x 2 5
avg-cpu:  %user  %nice %system %iowait  %steal  %idle
  7.02  0.00  3.62  32.41  0.00  56.94

Device:  rrqm/s  wrqm/s  r/s  w/s  rkB/s  wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda  0.04  0.00  124.86  21.15  8762.27  1118.12  135.34  1.18  8.09  9.27  1.13  4.20  61.31
sdb  0.03  0.00  123.95  21.12  8611.94  1118.12  134.15  1.18  8.14  9.34  1.15  4.21  61.10
sdc  0.00  0.00  7.59  205.97  203.32 12381.30  117.86  0.20  0.94  20.82  0.20  0.56  11.89
sdd  0.00  0.00  1.51  0.01  12.17  0.03  16.06  0.00  0.59  0.54  6.40  0.48  0.07
zd0  0.00  0.00  0.35  0.00  2.79  0.00  16.17  0.00  0.00  0.00  0.00  0.00  0.00
sda and sdb are the RAID1 mirror, while sdc holds the ZIL and L2ARC.
Code:
root@prox02:~# zpool iostat
  capacity  operations  bandwidth
pool  alloc  free  read  write  read  write
----------  -----  -----  -----  -----  -----  -----
rpool  417G  511G  208  100  13.6M  2.01M
#
and
Code:
root@prox02:~# zpool iostat -v
  capacity  operations  bandwidth
pool  alloc  free  read  write  read  write
--------------------------------------------------  -----  -----  -----  -----  -----  -----
rpool  417G  511G  216  101  14.1M  2.04M
  mirror  417G  511G  216  74  14.1M  1.11M
  sda2  -  -  104  21  7.15M  1.13M
  sdb2  -  -  103  21  7.01M  1.13M
logs  -  -  -  -  -  -
  sdc1  23.2M  16.6G  0  27  2.93K  945K
cache  -  -  -  -  -  -
  ata-INTEL_SSDSC2BA200G4_BTHV52640CXE200MGN-part2  4.43G  107G  6  136  174K  9.60M
--------------------------------------------------  -----  -----  -----  -----  -----  -----

I'm very disappointed and worried by this. It's the first time I've tried Proxmox ZFS on real hardware, and the pveperf fsync index was very good (> 5000).
Thanks for the help
 
To understand ZFS, use the arc_summary and arcstat tools. High iowait can simply mean there is not enough ARC. If you can run a test, make the ARC bigger and start one VM to see whether it boots faster.
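For example, the ARC ceiling on ZFS on Linux is the zfs_arc_max module parameter, so a test could look roughly like this (just a sketch; the 8 GiB value is only an illustration, adjust to what your RAM allows):
Code:
# current ARC limits, in bytes
cat /sys/module/zfs/parameters/zfs_arc_min
cat /sys/module/zfs/parameters/zfs_arc_max
# temporarily raise the ceiling to 8 GiB for the test (reverts on reboot)
echo $((8 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max
# to make it permanent, put it in /etc/modprobe.d/zfs.conf instead:
#   options zfs zfs_arc_max=8589934592
# and refresh the initramfs: update-initramfs -u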
 
I think that if you are using raw or qcow2 files, it is to be expected that ZFS will not be very fast, especially at first start, when, for example, MySQL is creating a log file. On the upside you can take snapshots at any time, but with qcow2 this is redundant, so you should really be using raw files then. Once the image file is no longer expanding rapidly, performance should be somewhat better. Check that your secondary cache is not set to metadata only.
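As a quick check (my own sketch, assuming the default Proxmox pool name rpool), the cache policy is an ordinary ZFS property:
Code:
# what the ARC and L2ARC are allowed to cache for this pool
zfs get primarycache,secondarycache rpool
# "all" caches data and metadata, "metadata" caches metadata only;
# caching everything in the L2ARC (the default) would be:
# zfs set secondarycache=all rpool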

Also, if you are serious about your filesystem and want to use a ZIL, you should use a mirrored ZIL and prefer SLC cells in your SSD.

ZFS is not a racehorse on smaller systems; it really excels in complex enterprise filesystems with high demands for data safety and built-in replication.
 
To understand ZFS, use the arc_summary and arcstat tools. High iowait can simply mean there is not enough ARC. If you can run a test, make the ARC bigger and start one VM to see whether it boots faster.
I will try tonight. Last night I took the opposite route and shrank the ARC to have more free RAM.
Here is the output of the commands you mentioned after 1 day of usage; I would love it if you could point me to the important values to look at.
More RAM for ARC = more ECC memory = a lot of money, and some motherboards have a limit.
In the abstract, would an 8 GB ZFS limit (32 GB total) be enough for a RAID1 or RAID10 "small" server?
Thanks in advance
Code:
root@prox02:~# arc_summary.py

------------------------------------------------------------------------
ZFS Subsystem Report  Tue Dec 29 11:46:41 2015
ARC Summary: (HEALTHY)
  Memory Throttle Count:  0

ARC Misc:
  Deleted:  1.78m
  Mutex Misses:  79
  Evict Skips:  79

ARC Size:  100.05% 4.01  GiB
  Target Size: (Adaptive)  100.00% 4.00  GiB
  Min Size (Hard Limit):  49.94%  2.00  GiB
  Max Size (High Water):  2:1  4.00  GiB

ARC Size Breakdown:
  Recently Used Cache Size:  37.56%  1.51  GiB
  Frequently Used Cache Size:  62.44%  2.50  GiB

ARC Hash Breakdown:
  Elements Max:  572.64k
  Elements Current:  99.99%  572.58k
  Collisions:  834.91k
  Chain Max:  5
  Chains:  65.28k

ARC Total accesses:  11.34m
  Cache Hit Ratio:  79.26%  8.98m
  Cache Miss Ratio:  20.74%  2.35m
  Actual Hit Ratio:  76.20%  8.64m

  Data Demand Efficiency:  82.29%  9.88m
  Data Prefetch Efficiency:  53.07%  682.81k

  CACHE HITS BY CACHE LIST:
  Most Recently Used:  25.21%  2.26m
  Most Frequently Used:  70.93%  6.37m
  Most Recently Used Ghost:  2.81%  252.20k
  Most Frequently Used Ghost:  2.69%  241.78k

  CACHE HITS BY DATA TYPE:
  Demand Data:  90.46%  8.13m
  Prefetch Data:  4.03%  362.34k
  Demand Metadata:  5.48%  492.59k
  Prefetch Metadata:  0.02%  2.18k

  CACHE MISSES BY DATA TYPE:
  Demand Data:  74.37%  1.75m
  Prefetch Data:  13.63%  320.47k
  Demand Metadata:  11.91%  280.09k
  Prefetch Metadata:  0.09%  2.10k

L2 ARC Summary: (HEALTHY)
  Low Memory Aborts:  0
  Free on Write:  440
  R/W Clashes:  1
  Bad Checksums:  0
  IO Errors:  0

L2 ARC Size: (Adaptive)  64.64  GiB
  Compressed:  55.86%  36.11  GiB
  Header Size:  0.07%  48.28  MiB

L2 ARC Evicts:
  Lock Retries:  0
  Upon Reading:  0

L2 ARC Breakdown:  2.35m
  Hit Ratio:  51.58%  1.21m
  Miss Ratio:  48.42%  1.14m
  Feeds:  45.75k

L2 ARC Writes:
  Writes Sent:  100.00% 38.22k

File-Level Prefetch: (HEALTHY)
DMU Efficiency:  13.20m
  Hit Ratio:  51.21%  6.76m
  Miss Ratio:  48.79%  6.44m

  Colinear:  6.44m
  Hit Ratio:  0.03%  2.01k
  Miss Ratio:  99.97%  6.44m

  Stride:  6.64m
  Hit Ratio:  99.98%  6.64m
  Miss Ratio:  0.02%  1.62k

DMU Misc:
  Reclaim:  6.44m
  Successes:  0.99%  63.78k
  Failures:  99.01%  6.38m

  Streams:  123.50k
  +Resets:  0.92%  1.13k
  -Resets:  99.08%  122.37k
  Bogus:  0


ZFS Tunable:
  metaslab_debug_load  0
  zfs_arc_min_prefetch_lifespan  0
  zfetch_max_streams  8
  zfs_nopwrite_enabled  1
  zfetch_min_sec_reap  2
  zfs_dbgmsg_enable  0
  zfs_dirty_data_max_max_percent  25
  zfs_arc_p_aggressive_disable  1
  spa_load_verify_data  1
  zfs_zevent_cols  80
  zfs_dirty_data_max_percent  10
  zfs_sync_pass_dont_compress  5
  l2arc_write_max  8388608
  zfs_vdev_scrub_max_active  2
  zfs_vdev_sync_write_min_active  10
  zvol_prefetch_bytes  131072
  metaslab_aliquot  524288
  zfs_no_scrub_prefetch  0
  zfs_arc_shrink_shift  0
  zfetch_block_cap  256
  zfs_txg_history  0
  zfs_delay_scale  500000
  zfs_vdev_async_write_active_min_dirty_percent  30
  metaslab_debug_unload  0
  zfs_read_history  0
  zvol_max_discard_blocks  16384
  zfs_recover  0
  l2arc_headroom  2
  zfs_deadman_synctime_ms  1000000
  zfs_scan_idle  50
  zfs_free_min_time_ms  1000
  zfs_dirty_data_max  1670581043
  zfs_vdev_async_read_min_active  1
  zfs_mg_noalloc_threshold  0
  zfs_dedup_prefetch  0
  zfs_vdev_max_active  1000
  l2arc_write_boost  8388608
  zfs_resilver_min_time_ms  3000
  zfs_vdev_async_write_max_active  10
  zil_slog_limit  1048576
  zfs_prefetch_disable  0
  zfs_resilver_delay  2
  metaslab_lba_weighting_enabled  1
  zfs_mg_fragmentation_threshold  85
  l2arc_feed_again  1
  zfs_zevent_console  0
  zfs_immediate_write_sz  32768
  zfs_dbgmsg_maxsize  4194304
  zfs_free_leak_on_eio  0
  zfs_deadman_enabled  1
  metaslab_bias_enabled  1
  zfs_arc_p_dampener_disable  1
  zfs_metaslab_fragmentation_threshold  70
  zfs_no_scrub_io  0
  metaslabs_per_vdev  200
  zfs_dbuf_state_index  0
  zfs_vdev_sync_read_min_active  10
  metaslab_fragmentation_factor_enabled  1
  zvol_inhibit_dev  0
  zfs_vdev_async_write_active_max_dirty_percent  60
  zfs_vdev_cache_size  0
  zfs_vdev_mirror_switch_us  10000
  zfs_dirty_data_sync  67108864
  spa_config_path  /etc/zfs/zpool.cache
  zfs_dirty_data_max_max  4176452608
  zfs_arc_lotsfree_percent  10
  zfs_zevent_len_max  64
  zfs_scan_min_time_ms  1000
  zfs_arc_sys_free  0
  zfs_arc_meta_strategy  1
  zfs_vdev_cache_bshift  16
  zfs_arc_meta_adjust_restarts  4096
  zfs_max_recordsize  1048576
  zfs_vdev_scrub_min_active  1
  zfs_vdev_read_gap_limit  32768
  zfs_arc_meta_limit  0
  zfs_vdev_sync_write_max_active  10
  l2arc_norw  0
  zfs_arc_meta_prune  10000
  metaslab_preload_enabled  1
  l2arc_nocompress  0
  zvol_major  230
  zfs_vdev_aggregation_limit  131072
  zfs_flags  0
  spa_asize_inflation  24
  zfs_admin_snapshot  0
  l2arc_feed_secs  1
  zfs_sync_pass_deferred_free  2
  zfs_disable_dup_eviction  0
  zfs_arc_grow_retry  0
  zfs_read_history_hits  0
  zfs_vdev_async_write_min_active  1
  zfs_vdev_async_read_max_active  3
  zfs_scrub_delay  4
  zfs_delay_min_dirty_percent  60
  zfs_free_max_blocks  100000
  zfs_vdev_cache_max  16384
  zio_delay_max  30000
  zfs_top_maxinflight  32
  spa_slop_shift  5
  zfs_vdev_write_gap_limit  4096
  spa_load_verify_metadata  1
  spa_load_verify_maxinflight  10000
  l2arc_noprefetch  1
  zfs_vdev_scheduler  noop
  zfs_expire_snapshot  300
  zfs_sync_pass_rewrite  2
  zil_replay_disable  0
  zfs_nocacheflush  0
  zfs_arc_max  4299967296
  zfs_arc_min  2147483648
  zfs_read_chunk_size  1048576
  zfs_txg_timeout  5
  zfs_pd_bytes_max  52428800
  l2arc_headroom_boost  200
  zfs_send_corrupt_data  0
  l2arc_feed_min_ms  200
  zfs_arc_meta_min  0
  zfs_arc_average_blocksize  8192
  zfetch_array_rd_sz  1048576
  zfs_autoimport_disable  1
  zfs_arc_p_min_shift  0
  zio_requeue_io_start_cut_in_line  1
  zfs_vdev_sync_read_max_active  10
  zfs_mdcomp_disable  0
  zfs_arc_num_sublists_per_state  4

root@prox02:~# arcstat.py
  time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz  c  
11:47:21  1  0  0  0  0  0  0  0  0  4.0G  4.0G  
root@prox02:~#
root@prox02:~# uptime
 11:48:35 up 12:15,  1 user,  load average: 0.27, 0.66, 0.70
 
I think that if you are using raw or qcow2 files, it is to be expected that ZFS will not be very fast, especially at first start, when, for example, MySQL is creating a log file. On the upside you can take snapshots at any time, but with qcow2 this is redundant, so you should really be using raw files then. Once the image file is no longer expanding rapidly, performance should be somewhat better.
I have ZFS as the root filesystem (I installed Proxmox choosing ZFS), so if I create raw disks I can't have snapshots :(

Check that your secondary cache is not set to metadata only.
AFAIU (see the output in my reply above), the L2ARC is caching data too.

Also, if you are serious about your filesystem and want to use a ZIL, you should use a mirrored ZIL and prefer SLC cells in your SSD.
ZFS is not a racehorse on smaller systems; it really excels in complex enterprise filesystems with high demands for data safety and built-in replication.

So far I've considered ZFS a way to get RAID without a dedicated, expensive and proprietary card with BBU. I've not been able to buy an HBA that worked (LSI is not sold here, rebranded cards need cross-flashing and I was not able to do it, an Areca ARC-1320 produces I/O errors, the RAID boards I've tested have very bad JBOD modes, and I had no more money left to try the Adaptec HBA line), so I'm using the onboard SATA ports, which are often limited in number on server-class motherboards (e.g. only 2 SATA3 ports).
What are the risks of having just one ZIL device (I think you suggest 2 SSDs in RAID1 for that)? Isn't the data checksummed and a corrupted ZIL bypassed?
Thanks a lot
 
What are the risks of having just one ZIL device (I think you suggest 2 SSDs in RAID1 for that)? Isn't the data checksummed and a corrupted ZIL bypassed?
In theory you could lose up to 5 seconds of data if the ZIL melts down instantaneously. If you buy good hardware, and your setup is for home use or for a business where losing 5 seconds' worth of data is acceptable, I see no reason for a mirrored ZIL. I have survived for years at home without using an Intel DC SSD, just paying close attention to the SMART readings.
 
Hmm... even 1 second of lost synchronous database writes can cause a lot of work trying to recover your database if you don't have recent backups, so I wouldn't underestimate that particular risk. When a synchronous write returns, the database program has every right to assume the data is actually on the final disk, not half of it.

I looked at your setup and ZFS's logistics a bit. It seems that you are over-dimensioning your primary cache (the one in RAM) by setting it to 4 GB, because in the most optimistic scenario you couldn't write more than approx. 1 GB of data to the hard disks in 5 seconds, which is the default time ZFS takes to gather one transaction group of writes. If you allow ZFS to gather 4 GB in RAM, it will need more than 20 seconds to flush it. Maybe there is a way to limit the maximum transaction group size to avoid that.
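There is such a knob in ZFS on Linux: the dirty data gathered per transaction group is bounded by zfs_dirty_data_max, and the interval by zfs_txg_timeout (both appear in the tunables dump above). A rough sketch, with purely illustrative values:
Code:
# current limits: max dirty bytes per txg and the txg interval in seconds
cat /sys/module/zfs/parameters/zfs_dirty_data_max
cat /sys/module/zfs/parameters/zfs_txg_timeout
# example: cap dirty data at 1 GiB so a single txg cannot grow to ~20 s of flushing
echo $((1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max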

Of course ZFS is an excellent software RAID platform, but so is mdraid. The ZIL feature of ZFS is nice, though. On the downside, you are now putting a copy-on-write filesystem (qcow2) on top of a copy-on-write filesystem (ZFS). Performance with LVM on mdraid, using your SSD as an HDD cache, would be better, I think.
 
Hmm... even 1 second of lost synchronous database writes can cause a lot of work trying to recover your database if you don't have recent backups, so I wouldn't underestimate that particular risk. When a synchronous write returns, the database program has every right to assume the data is actually on the final disk, not half of it.

So if you want a safe database you have to take a backup every second? Or something similar? I don't think so. There is no file system that can save the world.

As for ZFS, it will not corrupt the database, because of COW. I can guarantee that.

I looked at your setup and ZFS's logistics a bit. It seems that you are over-dimensioning your primary cache (the one in RAM) by setting it to 4 GB, because in the most optimistic scenario you couldn't write more than approx. 1 GB of data to the hard disks in 5 seconds, which is the default time ZFS takes to gather one transaction group of writes. If you allow ZFS to gather 4 GB in RAM, it will need more than 20 seconds to flush it. Maybe there is a way to limit the maximum transaction group size to avoid that.

The primary cache (ARC) is a read cache, not a write cache. But more free RAM for ZFS or the system caches does help with writes. Still, every system has its own limits, whatever you tune.

Of course ZFS is an excellent software RAID platform, but so is mdraid. The ZIL feature of ZFS is nice, though. On the downside, you are now putting a copy-on-write filesystem (qcow2) on top of a copy-on-write filesystem (ZFS). Performance with LVM on mdraid, using your SSD as an HDD cache, would be better, I think.

mdraid, LVM, extX and everything else have no protection against corruption. Whatever you choose, a RAID controller or any filesystem with little or no protection will give you more speed, but not protection. So you have to balance that against your needs.



In the abstract, would an 8 GB ZFS limit (32 GB total) be enough for a RAID1 or RAID10 "small" server?

It depends on how much I/O your system needs.


Looking at your ZFS stats, I can only point to the metadata limit: "zfs_arc_meta_limit 0". I'm not sure how ZFS behaves with this setting. Can you post the result of # cat /proc/spl/kstat/zfs/arcstats ?

Maybe # echo 1610612736 > /sys/module/zfs/parameters/zfs_arc_meta_limit
can solve the high I/O. ZFS can struggle waiting for I/O for both data and metadata. Both are important, but from the stats I see low metadata usage in the cache.
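For example, one could check the metadata figures and apply the limit like this (sketch only; 1610612736 bytes is the 1.5 GiB value suggested above, not a tested recommendation):
Code:
# current metadata usage vs. limit, in bytes
grep -E '^arc_meta_(used|limit|max)' /proc/spl/kstat/zfs/arcstats
# apply the suggested limit at runtime
echo 1610612736 > /sys/module/zfs/parameters/zfs_arc_meta_limit
# to keep it across reboots, add to /etc/modprobe.d/zfs.conf:
#   options zfs zfs_arc_meta_limit=1610612736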
 
A VM boot is very aggressive in terms of both I/O and CPU. If you've assigned 4 cores per VM, then a parallel boot will start the load at 12 (4*3), and that is normal. Of course, the real load will be much higher because there are other tasks running on your 4 physical cores too. So 30 is not that abnormal.

When you start the hypervisor, you have 3 VMs doing random I/O on boot, all fighting over 2 mirrored hard drives. The cache has almost nothing to offer at this stage, because it is cold. Your VelociRaptors seem to deliver around 140 read IOPS, which matches your iostat output. Nothing abnormal here either.

Are you using compression and/or deduplication on your pool? These also add CPU load (compression) and ARC usage (deduplication).
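Both are quick to check (sketch, assuming the pool is rpool as in the output above):
Code:
# compression and deduplication settings on the pool
zfs get compression,dedup rpool
# the DEDUP column shows the pool-wide dedup ratio (1.00x = effectively off)
zpool list rpool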
 
It depends on how much I/O your system needs.
Looking at your ZFS stats, I can only point to the metadata limit: "zfs_arc_meta_limit 0". I'm not sure how ZFS behaves with this setting. Can you post the result of # cat /proc/spl/kstat/zfs/arcstats ?
Maybe # echo 1610612736 > /sys/module/zfs/parameters/zfs_arc_meta_limit
can solve the high I/O. ZFS can struggle waiting for I/O for both data and metadata. Both are important, but from the stats I see low metadata usage in the cache.
The setup is a standard, default installation from the ISO; I've changed nothing in the ZFS configuration except the max ARC RAM limit (I reduced it).
The output you requested is the following:
Code:
root@prox02:~# cat /proc/spl/kstat/zfs/arcstats
6 1 0x01 91 4368 975685431 130089859920481
name  type data
hits  4  29744043
misses  4  8311223
demand_data_hits  4  26076114
demand_data_misses  4  5132450
demand_metadata_hits  4  2420264
demand_metadata_misses  4  868396
prefetch_data_hits  4  1245428
prefetch_data_misses  4  2307785
prefetch_metadata_hits  4  2237
prefetch_metadata_misses  4  2592
mru_hits  4  6308129
mru_ghost_hits  4  818038
mfu_hits  4  22314496
mfu_ghost_hits  4  802230
deleted  4  6628942
mutex_miss  4  374
evict_skip  4  372799
evict_not_enough  4  10266
evict_l2_cached  4  737476688896
evict_l2_eligible  4  74970418688
evict_l2_ineligible  4  176916439040
evict_l2_skip  4  0
hash_elements  4  1086662
hash_elements_max  4  1169348
hash_collisions  4  5064611
hash_chains  4  200442
hash_chain_max  4  7
p  4  1324784308
c  4  4299967296
c_min  4  2147483648
c_max  4  4299967296
size  4  4299970552
hdr_size  4  37624648
data_size  4  4004736512
metadata_size  4  132363264
other_size  4  22981992
anon_size  4  1982464
anon_evictable_data  4  0
anon_evictable_metadata  4  0
mru_size  4  1021892096
mru_evictable_data  4  976353792
mru_evictable_metadata  4  609280
mru_ghost_size  4  3272820736
mru_ghost_evictable_data  4  3006398464
mru_ghost_evictable_metadata  4  266422272
mfu_size  4  3113225216
mfu_evictable_data  4  3026416640
mfu_evictable_metadata  4  0
mfu_ghost_size  4  1026906112
mfu_ghost_evictable_data  4  932577280
mfu_ghost_evictable_metadata  4  94328832
l2_hits  4  3995186
l2_misses  4  4315977
l2_feeds  4  135655
l2_rw_clash  4  1
l2_read_bytes  4  230490854912
l2_write_bytes  4  221574266880
l2_writes_sent  4  116174
l2_writes_done  4  116174
l2_writes_error  4  0
l2_writes_lock_retry  4  2
l2_evict_lock_retry  4  0
l2_evict_reading  4  1
l2_evict_l1cached  4  37563
l2_free_on_write  4  913
l2_cdata_free_on_write  4  234
l2_abort_lowmem  4  0
l2_cksum_bad  4  0
l2_io_error  4  0
l2_size  4  134756319744
l2_asize  4  83488300544
l2_hdr_size  4  102264136
l2_compress_successes  4  3381857
l2_compress_zeros  4  0
l2_compress_failures  4  345122
memory_throttle_count  4  0
duplicate_buffers  4  0
duplicate_buffers_size  4  0
duplicate_reads  4  0
memory_direct_count  4  0
memory_indirect_count  4  0
arc_no_grow  4  0
arc_tempreserve  4  0
arc_loaned_bytes  4  0
arc_prune  4  0
arc_meta_used  4  295234040
arc_meta_limit  4  4299967296
arc_meta_max  4  373742256
arc_meta_min  4  16777216
arc_need_free  4  0
arc_sys_free  4  261025792
root@prox02:~
If you see something interesting, please teach me how to interpret it :)
Thanks a lot
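As a starting point (my own sketch, not part of the replies above): the figures most people look at are the overall and demand-data hit ratios, and how much of the ARC is metadata, all derived from the counters in that file:
Code:
# overall ARC hit ratio
awk '/^hits /{h=$3} /^misses /{m=$3} END{printf "ARC hit ratio: %.1f%%\n", 100*h/(h+m)}' /proc/spl/kstat/zfs/arcstats
# hit ratio for demand (non-prefetch) data reads
awk '/^demand_data_hits /{h=$3} /^demand_data_misses /{m=$3} END{printf "demand data hit ratio: %.1f%%\n", 100*h/(h+m)}' /proc/spl/kstat/zfs/arcstats
# share of the ARC currently holding metadata
awk '/^size /{s=$3} /^arc_meta_used /{m=$3} END{printf "metadata share of ARC: %.1f%%\n", 100*m/s}' /proc/spl/kstat/zfs/arcstats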
 
Hi all. I have some problems with ZFS I/O. What does the community recommend?

I have a Proxmox Virtual Environment 5.0-23 installation on a Dell T130 without a real RAID controller. The disks are in SATA AHCI mode. Disk configuration: hard drives in a mirror, set up when the installer starts, and SSDs in stripe mode, configured after install.
 
Hmm... even 1 second of lost synchronous database writes can cause a lot of work trying to recover your database if you don't have recent backups, so I wouldn't underestimate that particular risk. When a synchronous write returns, the database program has every right to assume the data is actually on the final disk, not half of it.


It's not like you said. Any decent DB engine has a journal. Every DB write goes first to this journal and then to the disk in a transactional way. When the DB crashes (server restart or power problem), the DB engine will see that there are writes in the journal that were not committed to the disk (transactions).
Then the DB engine will start writing to disk the data that was not entirely committed.
 
It's not like you said. Any decent DB engine has a journal. Every DB write goes first to this journal and then to the disk in a transactional way. When the DB crashes (server restart or power problem), the DB engine will see that there are writes in the journal that were not committed to the disk (transactions).
Then the DB engine will start writing to disk the data that was not entirely committed.

Yes, this is called 'crash recovery'. Yet you can still lose some transactions if they were not written to disk, but that is just reality. The same goes for every storage system. If you want high availability at the database layer, just use database technology to work around this: streaming replication, hot standby, Data Guard, or whatever it is called in $YourDatabaseProduct.
 
