Adding SSD for cache - ZIL / L2ARC

amil

Active Member
Dec 16, 2015
Spain
Hello,

I've installed the latest Proxmox 4.1 via ISO on one of my dedicated servers. The specs are:

CPU1: Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz (Cores 8)
Memory: 24061 MB
Disk /dev/sda: 750 GB (=> 698 GiB)
Disk /dev/sdb: 750 GB (=> 698 GiB)
Disk /dev/sdc: 120 GB (=> 111 GiB)
Total capacity 1509 GiB with 3 Disks

I did the installation with ZFS RAID1 over sda and sdb, and left sdc (the SSD) untouched to use later as cache (ZIL / L2ARC). I now have the following issues/doubts:

1. In the Proxmox 4.1 ISO installer, if I choose ZFS RAID1 with all the disks (including the SSD), I suppose it builds a mirror over the 3 disks by default, BUT the SSD then acts as a mirror member, not as cache. So I think Proxmox will NOT configure the SSD as cache with this option, right? Or can I easily configure the SSD as cache during the installation?

2. For now, on my Proxmox install the SSD does not appear; it is not initialized (I can't see it with cfdisk /dev/sdc, for example). So the next step should be to initialize the disk and, according to everything I have been reading, create 2 partitions, one for the ZIL and another for the L2ARC, on the command line with something like:

gpart destroy -F sdc               # first delete the current partitions on sdc
gpart create -s GPT sdc
gpart add -s 90G -t freebsd-zfs    # and here the two partitions


If I am going in the right direction, what size should each partition be on a 120 GB SSD?

3. The next step, I suppose, will be something like this(?):

zpool add [pool] cache [drive]

4. Partitions on the current SATA drives

This is my current partition scheme:
Filesystem        Size  Used Avail Use% Mounted on
udev               10M     0   10M   0% /dev
tmpfs             2.4G  392K  2.4G   1% /run
rpool/ROOT/pve-1  661G  678M  661G   1% /
tmpfs             5.0M     0  5.0M   0% /run/lock
tmpfs             9.4G     0  9.4G   0% /run/shm
rpool             661G  128K  661G   1% /rpool
rpool/ROOT        661G  128K  661G   1% /rpool/ROOT

Note that only sda and sdb are in RAID1 and I have only one zpool (rpool), left as the installer created it by default. I have been checking, and most people create different pools for containers/VMs, backup space, etc. Should I create some separate pools then? Can anybody guide me on this too? :)

Thank you all in advance for any help and suggestions, since I am new to the Proxmox world (I came from SolusVM & OpenVZ virtualization).


Best regards!

 
1. In the Proxmox 4.1 ISO installer, if I choose ZFS RAID1 with all the disks (including the SSD), I suppose it builds a mirror over the 3 disks by default, BUT the SSD then acts as a mirror member, not as cache. So I think Proxmox will NOT configure the SSD as cache with this option, right? Or can I easily configure the SSD as cache during the installation?

# zpool status

will show you the current status of your configuration.

2. For now, on my Proxmox install the SSD does not appear; it is not initialized (I can't see it with cfdisk /dev/sdc, for example). So the next step should be to initialize the disk and, according to everything I have been reading, create 2 partitions, one for the ZIL and another for the L2ARC, on the command line with something like:

gpart destroy -F sdc               # first delete the current partitions on sdc
gpart create -s GPT sdc
gpart add -s 90G -t freebsd-zfs    # and here the two partitions


If I am going in the right direction, what size should each partition be on a 120 GB SSD?

Did "cfdisk /dev/sdc" report that no hard drive exists, or that the drive has no MBR partition table?


3. The next step, I suppose, will be something like this(?):

zpool add [pool] cache [drive]


To add L2ARC cache: zpool add [pool] cache [vdev]
To add ZIL: zpool add [pool] log [vdev]
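
For example, with your layout it would probably look something like this (only a sketch; the by-id names are placeholders, use the ones of your own SSD partitions):

Code:
# assuming the SSD already has a small GPT partition for the log (ZIL)
# and a bigger one for the cache (L2ARC)
zpool add rpool log   /dev/disk/by-id/scsi-SATA_YOUR_SSD_SERIAL-part1
zpool add rpool cache /dev/disk/by-id/scsi-SATA_YOUR_SSD_SERIAL-part2

# check the result
zpool status rpool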

4. Partitions on the current SATA drives

This is my current partition scheme:
Filesystem        Size  Used Avail Use% Mounted on
udev               10M     0   10M   0% /dev
tmpfs             2.4G  392K  2.4G   1% /run
rpool/ROOT/pve-1  661G  678M  661G   1% /
tmpfs             5.0M     0  5.0M   0% /run/lock
tmpfs             9.4G     0  9.4G   0% /run/shm
rpool             661G  128K  661G   1% /rpool
rpool/ROOT        661G  128K  661G   1% /rpool/ROOT

Note that only sda and sdb are in RAID1 and I have only one zpool (rpool), left as the installer created it by default. I have been checking, and most people create different pools for containers/VMs, backup space, etc. Should I create some separate pools then? Can anybody guide me on this too? :)

If you want to create another pool you need to add more hard drives. But you can create ZFS filesystems (datasets) inside the existing pool and use different settings for each of them. That also makes it easy to do backups with snapshots. A quick sketch is below.
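
For example (the dataset names here are only examples):

Code:
# create a dataset for backups with its own settings
zfs create rpool/backup
zfs set compression=lz4 rpool/backup

# take and list snapshots of that dataset
zfs snapshot rpool/backup@before-upgrade
zfs list -t snapshot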
 
Hi Nemesiz, thank you so much for your reply and your help :)

# zpool status

  pool: rpool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: none requested
config:

        NAME                                                  STATE     READ WRITE CKSUM
        rpool                                                 ONLINE       0     0     0
          mirror-0                                            ONLINE       0     0     0
            sda2                                              ONLINE       0     0     0
            sdb2                                              ONLINE       0     0     0
        logs
          sdc1                                                ONLINE       0     0     0
        cache
          scsi-SATA_OCZ-VERTEX3_OCZ-4HLP444KN617Y5S9-part2    ONLINE       0     0     0


(Please note that it seems I was able to get the SSD cache working correctly.) :)



Did "cfdisk /dev/sdc" report that no hard drive exists, or that the drive has no MBR partition table?

Here, at first cfdisk /dev/sdc reported that no drive existed. Later I updated the license and ran an upgrade (which installed the new kernel from the enterprise repo). I was able to reboot (this server is not a production server yet, of course), and now the disk seems to be detected correctly (and at the moment there are no other errors that make me think the disk could be failing).

Well, I had some issues partitioning the SSD drive once it was detected by the system. At first I tried the gpart commands without success (they gave me an error when creating the GPT), so I created the GPT label and the partitions with parted:

# parted /dev/sdc
(parted) mklabel gpt

(parted) mkpart primary 0GB 10GB
(parted) mkpart primary 10G 120G

(parted) print
Model: ATA OCZ-VERTEX3 (scsi)
Disk /dev/sdc: 120GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  10.0GB  9999MB  zfs          primary
 2      10.0GB  120GB   110GB   zfs          primary

So the 10 GB partition goes to the LOG (ZIL) and the rest, about 110 GB, goes to the L2ARC cache.
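
For reference, I think the same partitioning could be done non-interactively with something like this (I have not tested this exact one-liner, so take it as a sketch):

Code:
parted -s /dev/sdc mklabel gpt mkpart primary 1MiB 10GiB mkpart primary 10GiB 100%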

Next I went to /dev/disk/by-id/ and found the IDs for the new partitions:

scsi-SATA_OCZ-VERTEX3_OCZ-4HLxxxxxxxxxx-part1
scsi-SATA_OCZ-VERTEX3_OCZ-4HLxxxxxxxxxx-part2


(I also have the ata-OCZ-VERTEX3_OCZ-4HL... aliases)

Reading doc at: https://pve.proxmox.com/wiki/Storage:_ZFS#Create_a_new_pool_with_Cache_and_Log_on_one_Disk

I can see that:

"As <device> it is possible to use more devices, like it's shown in 'Create a new pool with RAID*'. Important: identify device with /dev/disk/by-id/scsi-*<device>"

With this point clear, I was able to do:

zpool add -f rpool cache /dev/disk/by-id/scsi-SATA_OCZ-VERTEX3_OCZ-4HLxxxxxxxxxx-part2 log /dev/disk/by-id/scsi-SATA_OCZ-VERTEX3_OCZ-4HLxxxxxxxxxx-part1

root@:/dev/disk/by-id# zpool iostat -v 2 300
                                                 capacity     operations    bandwidth
pool                                           alloc   free   read  write   read  write
---------------------------------------------  -----  -----  -----  -----  -----  -----
rpool                                          14.0G   682G      0     54  34.8K  2.11M
  mirror                                       14.0G   682G      0     39  34.7K  1022K
    sda2                                           -      -      0     15  17.1K  1.09M
    sdb2                                           -      -      0     15  19.0K  1.09M
logs                                               -      -      -      -      -      -
  sdc1                                          128K  9.25G      0     15     41  1.11M
cache                                              -      -      -      -      -      -
  scsi-SATA_OCZ-VERTEX3_OCZ-4HLxxxxxx-part2    11.0G  91.5G      0      7  8.51K   758K
---------------------------------------------  -----  -----  -----  -----  -----  -----

So it seems that the SSD cache is now fully working. Please make any necessary remarks (e.g. about the disk partitioning, i.e. whether I should leave a small amount of free space at the start / end of the disk).

I did some basic tests (next I will do better and more specific benchmarks to compare), but the first impression is amazing!
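
For the proper benchmarks I am thinking of something like fio, run both inside the VM and directly on the host (just a first sketch, the job parameters are only my guess):

Code:
apt-get install fio

# random 4k sync writes for 60 seconds, to see how the SSD log helps with sync-heavy workloads
fio --name=randwrite --filename=/rpool/fio-test.bin --size=2G \
    --rw=randwrite --bs=4k --ioengine=libaio --iodepth=16 \
    --sync=1 --runtime=60 --time_based --group_reporting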

Testing the SSD cache:
I started by creating a new VM (KVM) for testing, and I have a few considerations to discuss:

This test virtual machine runs Windows Server 2012 R2 (1 vCore, 4096 MB RAM, 64 GB hard disk, obtained from the Microsoft eval center), and as I said the first impression is amazing: the system boots up in 5 seconds, and file writes inside the virtual machine are very fast too.
But with only this one virtual machine, the SSD cache (the L2ARC, I mean) is already at 11 GB and increasing.


For example, right now the virtual machine is powered off (no virtual machines are running) and the memory usage is:

free -m
             total       used       free     shared    buffers     cached
Mem:         24099      13799      10299          0          3         97
-/+ buffers/cache:       13698      10400
Swap:        23551           0      23551


So it seems that the SSD cache files persist on disk, and speaking in terms of VM "density", this would limit this server to about 7-8 VMs (taking the test VM as a reference) before the SSD cache is full. Am I right?
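
I guess I can keep an eye on how the L2ARC fills up and whether it actually gets hits with something like this (my assumption of where ZFS on Linux exposes these counters):

Code:
grep -E '^l2_(size|asize|hits|misses)' /proc/spl/kstat/zfs/arcstats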

If you want to create another pool you need to add more hard drives. But you can create ZFS filesystems (datasets) inside the existing pool and use different settings for each of them. That also makes it easy to do backups with snapshots.

Yes, I was referring to creating separate ZFS filesystems (datasets, like "virtual partitions", I don't know the exact name) using the space of rpool. I.e., inside my current rpool, create one "ZFS box" for backups, for example, another to store ISOs, space for VMs... (I think these appear as "storage" in Proxmox). I'm sorry for not being clear; I hope anybody can understand what I mean ;) Something like the sketch below.
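
For example, this is roughly what I have in mind (the dataset and storage names are only examples, and I am not 100% sure of the exact pvesm syntax, so take it as a sketch):

Code:
# separate datasets inside the existing rpool
zfs create rpool/vmdata
zfs create rpool/backup
zfs create rpool/isos

# register the VM dataset as a ZFS storage in Proxmox
pvesm add zfspool zfs-vmdata -pool rpool/vmdata -content images,rootdir

# backups and ISOs go on directory storages pointing at the dataset mountpoints
pvesm add dir backup -path /rpool/backup -content backup
pvesm add dir isos -path /rpool/isos -content iso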


Thanks for all comments and suggestions!
 
L2ARC is a second-level cache for ZFS. To work, it uses ARC memory to track the L2ARC data, and it needs time to fill up.

OCZ-VERTEX3 is not the best option to use for the ZIL. I used an OCZ-VERTEX4 for it and I started to see the SMART Media_Wearout_Indicator dropping very fast, about 1% per month.
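
You can watch the wear with smartctl (assuming smartmontools is installed; the attribute names differ between SSD models):

Code:
apt-get install smartmontools
smartctl -A /dev/sdc | grep -i -E 'wear|life|lbas_written'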


As for "free -m"
Code:
             total       used       free     shared    buffers     cached
Mem:         48281      38444       9836        311       5555        734
-/+ buffers/cache:      32155      16125
Swap:         5119        255       4864
 
Don't imagine that the ZFS cache will cache everything.

My system has been up for 19 days; here is the ZFS summary:
Code:
------------------------------------------------------------------------
ZFS Subsystem Report  Thu Dec 17 08:12:25 2015
ARC Summary: (HEALTHY)
  Memory Throttle Count:  0

ARC Misc:
  Deleted:  15.41m
  Mutex Misses:  202
  Evict Skips:  202

ARC Size:  63.06%  6.31  GiB
  Target Size: (Adaptive)  63.39%  6.34  GiB
  Min Size (Hard Limit):  0.31%  32.00  MiB
  Max Size (High Water):  320:1  10.00  GiB

ARC Size Breakdown:
  Recently Used Cache Size:  80.92%  5.13  GiB
  Frequently Used Cache Size:  19.08%  1.21  GiB

ARC Hash Breakdown:
  Elements Max:  1.49m
  Elements Current:  18.76%  279.02k
  Collisions:  14.12m
  Chain Max:  5
  Chains:  4.13k

ARC Total accesses:  290.25m
  Cache Hit Ratio:  91.86%  266.62m
  Cache Miss Ratio:  8.14%  23.63m
  Actual Hit Ratio:  91.28%  264.95m

  Data Demand Efficiency:  92.35%  180.00m
  Data Prefetch Efficiency:  19.73%  5.25m

  CACHE HITS BY CACHE LIST:
  Most Recently Used:  27.18%  72.46m
  Most Frequently Used:  72.20%  192.49m
  Most Recently Used Ghost:  0.68%  1.81m
  Most Frequently Used Ghost:  0.19%  498.63k

  CACHE HITS BY DATA TYPE:
  Demand Data:  62.35%  166.23m
  Prefetch Data:  0.39%  1.04m
  Demand Metadata:  37.03%  98.72m
  Prefetch Metadata:  0.24%  631.67k

  CACHE MISSES BY DATA TYPE:
  Demand Data:  58.29%  13.78m
  Prefetch Data:  17.84%  4.22m
  Demand Metadata:  22.69%  5.36m
  Prefetch Metadata:  1.18%  278.86k


File-Level Prefetch: (HEALTHY)
DMU Efficiency:  861.66m
  Hit Ratio:  96.16%  828.57m
  Miss Ratio:  3.84%  33.09m

  Colinear:  33.09m
  Hit Ratio:  0.04%  13.11k
  Miss Ratio:  99.96%  33.08m

  Stride:  829.50m
  Hit Ratio:  99.46%  825.04m
  Miss Ratio:  0.54%  4.46m

DMU Misc:
  Reclaim:  33.08m
  Successes:  10.32%  3.41m
  Failures:  89.68%  29.66m

  Streams:  3.53m
  +Resets:  0.11%  3.88k
  -Resets:  99.89%  3.53m
  Bogus:  0


ZFS Tunable:
  metaslab_debug_load  0
  zfs_arc_min_prefetch_lifespan  0
  zfetch_max_streams  8
  zfs_nopwrite_enabled  1
  zfetch_min_sec_reap  2
  zfs_dbgmsg_enable  0
  zfs_dirty_data_max_max_percent  25
  zfs_arc_p_aggressive_disable  1
  spa_load_verify_data  1
  zfs_zevent_cols  80
  zfs_dirty_data_max_percent  10
  zfs_sync_pass_dont_compress  5
  l2arc_write_max  104857600
  zfs_vdev_scrub_max_active  2
  zfs_vdev_sync_write_min_active  10
  zvol_prefetch_bytes  131072
  metaslab_aliquot  524288
  zfs_no_scrub_prefetch  0
  zfs_arc_shrink_shift  0
  zfetch_block_cap  256
  zfs_txg_history  0
  zfs_delay_scale  500000
  zfs_vdev_async_write_active_min_dirty_percent  30
  metaslab_debug_unload  0
  zfs_read_history  0
  zvol_max_discard_blocks  16384
  zfs_recover  0
  l2arc_headroom  2
  zfs_deadman_synctime_ms  1000000
  zfs_scan_idle  50
  zfs_free_min_time_ms  1000
  zfs_dirty_data_max  5062646169
  zfs_vdev_async_read_min_active  1
  zfs_mg_noalloc_threshold  0
  zfs_dedup_prefetch  0
  zfs_vdev_max_active  1000
  l2arc_write_boost  104857600
  zfs_resilver_min_time_ms  3000
  zfs_vdev_async_write_max_active  10
  zil_slog_limit  1048576
  zfs_prefetch_disable  0
  zfs_resilver_delay  2
  metaslab_lba_weighting_enabled  1
  zfs_mg_fragmentation_threshold  85
  l2arc_feed_again  1
  zfs_zevent_console  0
  zfs_immediate_write_sz  32768
  zfs_dbgmsg_maxsize  4194304
  zfs_free_leak_on_eio  0
  zfs_deadman_enabled  1
  metaslab_bias_enabled  1
  zfs_arc_p_dampener_disable  1
  zfs_metaslab_fragmentation_threshold  70
  zfs_no_scrub_io  0
  metaslabs_per_vdev  200
  zfs_dbuf_state_index  0
  zfs_vdev_sync_read_min_active  10
  metaslab_fragmentation_factor_enabled  1
  zvol_inhibit_dev  0
  zfs_vdev_async_write_active_max_dirty_percent  60
  zfs_vdev_cache_size  0
  zfs_vdev_mirror_switch_us  10000
  zfs_dirty_data_sync  67108864
  spa_config_path  /etc/zfs/zpool.cache
  zfs_dirty_data_max_max  12656615424
  zfs_arc_lotsfree_percent  10
  zfs_zevent_len_max  128
  zfs_scan_min_time_ms  1000
  zfs_arc_sys_free  0
  zfs_arc_meta_strategy  1
  zfs_vdev_cache_bshift  16
  zfs_arc_meta_adjust_restarts  4096
  zfs_max_recordsize  1048576
  zfs_vdev_scrub_min_active  1
  zfs_vdev_read_gap_limit  32768
  zfs_arc_meta_limit  5368709120
  zfs_vdev_sync_write_max_active  10
  l2arc_norw  0
  zfs_arc_meta_prune  10000
  metaslab_preload_enabled  1
  l2arc_nocompress  0
  zvol_major  230
  zfs_vdev_aggregation_limit  131072
  zfs_flags  0
  spa_asize_inflation  24
  zfs_admin_snapshot  0
  l2arc_feed_secs  1
  zfs_sync_pass_deferred_free  2
  zfs_disable_dup_eviction  0
  zfs_arc_grow_retry  0
  zfs_read_history_hits  0
  zfs_vdev_async_write_min_active  1
  zfs_vdev_async_read_max_active  3
  zfs_scrub_delay  4
  zfs_delay_min_dirty_percent  60
  zfs_free_max_blocks  100000
  zfs_vdev_cache_max  16384
  zio_delay_max  30000
  zfs_top_maxinflight  32
  spa_slop_shift  5
  zfs_vdev_write_gap_limit  4096
  spa_load_verify_metadata  1
  spa_load_verify_maxinflight  10000
  l2arc_noprefetch  0
  zfs_vdev_scheduler  noop
  zfs_expire_snapshot  300
  zfs_sync_pass_rewrite  2
  zil_replay_disable  0
  zfs_nocacheflush  0
  zfs_arc_max  10737418240
  zfs_arc_min  0
  zfs_read_chunk_size  1048576
  zfs_txg_timeout  5
  zfs_pd_bytes_max  52428800
  l2arc_headroom_boost  200
  zfs_send_corrupt_data  0
  l2arc_feed_min_ms  200
  zfs_arc_meta_min  0
  zfs_arc_average_blocksize  8192
  zfetch_array_rd_sz  1048576
  zfs_autoimport_disable  1
  zfs_arc_p_min_shift  0
  zio_requeue_io_start_cut_in_line  1
  zfs_vdev_sync_read_max_active  10
  zfs_mdcomp_disable  0
  zfs_arc_num_sublists_per_state  8

It is the first time I have seen the ZFS cache use less than it is set up to use. So strange things can happen.
 
OCZ-VERTEX3 is not the best option to use for the ZIL. I used an OCZ-VERTEX4 for it and I started to see the SMART Media_Wearout_Indicator dropping very fast, about 1% per month.

You are right Nemesiz, the OCZ-VERTEX3 is an old SSD, and with newer, modern SSDs I could get better performance and better compatibility (more space), but at the moment this is the only dedicated server where I have an SSD, so it's 100% testing time :)

It is giving me a headache, though: the test virtual machine has been powered off (for more than 12 hours) and the real memory usage of the system is still very high (remember, no VM or CT running).

Code:
free -m
             total       used       free     shared    buffers     cached
Mem:         24099      13857      10241          0          3        123
-/+ buffers/cache:       13730      10368
Swap:        23551           0      23551

I don't think this can be normal, since before I created my first test VM the RAM usage was low; and now, with the VM stopped, 13 GB of RAM is in use...

I have read in the forum (I don't know if it could be related):
https://forum.proxmox.com/threads/proxmox-4-1-zfs-lxc-memory-limits.25168/

Don't imagine that the ZFS cache will cache everything.

My system has been up for 19 days; here is the ZFS summary:

How can I generate (or where can I find the log for) a ZFS subsystem report like yours?

Thank you!
 
Thank you so much Nemesiz for the tools, I'll give them a try soon.

So, you consider the hardware node's memory usage to be normal (when all virtual machines are halted)?

I just wrote a new post with the free -m output in different scenarios.


Greetings!
 
The OS does its job to make the system run faster: it caches libraries and other stuff in free RAM and frees it only on request. So don't think of it as wasted RAM.

My server has 48 GB of RAM. 27 GB is assigned to VMs and 10 GB to the ZFS ARC, so approximately 10 GB should be free. But in reality the server sometimes starts to use swap space; at that time I see only ~1-2 GB free. Later the free space rises back to ~8 GB.
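
If you want to see how much RAM the ARC really uses, and cap it, you can check the kstats and the module options (paths as on ZFS on Linux; the 10 GB value is just my own setting):

Code:
# current ARC size and configured maximum, in bytes
awk '$1 == "size" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats

# limit the ARC to 10 GB (takes effect after reboot; with root on ZFS
# you may also need to run update-initramfs -u)
echo "options zfs zfs_arc_max=10737418240" >> /etc/modprobe.d/zfs.conf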
 
Great Nemesiz, your case gives me more insight into this :)

The next thing I'll test will be creating multiple VMs to find the server's limits, to check the performance and get an idea of the approximate server density that will be stable and ready for production.

Thank you so much again for all comments.
 
