High RAM load on node with no VM running

amil

I am doing some performance checks on a dedicated server with the latest Proxmox 4.1 (including today's kernel update from the enterprise repo). Once rebooted, I started the checks:


Node just restarted, no VMs running:

Code:
root@hn2:~# free -m
  total  used  free  shared  buffers  cached
Mem:  24098  870  23227  0  5  75
-/+ buffers/cache:  789  23309
Swap:  23551  0  23551

With 1 OpenVZ container powered ON (CentOS 7 with cPanel, 4 vCores, 4 GB RAM max), node memory load:
Code:
root@hn2:~# free -m
  total  used  free  shared  buffers  cached
Mem:  24098  1797  22301  0  4  293
-/+ buffers/cache:  1499  22599
Swap:  23551  0  23551


With 1 OpenVZ container powered OFF (CentOS 7 with cPanel, 4 vCores, 4 GB RAM max), node memory load:
Code:
root@hn2:~# free -m
  total  used  free  shared  buffers  cached
Mem:  24098  1335  22762  0  4  81
-/+ buffers/cache:  1249  22849
Swap:  23551  0  23551


With 1 VM (KVM) powered ON (Windows Server 2012 R2, 1 vCore, 4 GB RAM max), node memory load:
Code:
root@hn2:~# free -m
  total  used  free  shared  buffers  cached
Mem:  24098  17728  6370  0  4  98
-/+ buffers/cache:  17625  6473
Swap:  23551  0  23551


With 1 VM (KVM) powered OFF (Windows Server 2012 R2, 1 vCore, 4 GB RAM max), node memory load:
Code:
root@hn2:~# free -m
  total  used  free  shared  buffers  cached
Mem:  24098  13517  10581  0  4  95
-/+ buffers/cache:  13417  10681
Swap:  23551  0  23551


So the dedicated server (node) starts out using only about 800 MB of RAM, but after a power-on and power-off cycle (so once again no virtual machines are running on the server), used RAM sits at about 13 GB and does not decrease.

For more info: I have two SATA disks (as a ZFS RAID1 mirror) and an SSD configured as cache (ZIL / L2ARC).
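
For reference, this is roughly how the SSD partitions were attached to the pool (the device names here are just illustrative, not my exact ones):

Code:
# one SSD partition as separate log device (ZIL), another as L2ARC cache
zpool add rpool log /dev/disk/by-id/ata-SSD-EXAMPLE-part1
zpool add rpool cache /dev/disk/by-id/ata-SSD-EXAMPLE-part2
zpool status rpool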

I came from OpenVZ & CentOS (managed with SolusVM), and there, of course, when an OpenVZ container is halted, the memory on the hardware node is freed as long as the container is not active.

So I don't know whether I can really expect this memory usage on the hardware node (with the virtual machines halted), or whether I am really in trouble.

Thanks for all help, suggestions and considerations.
 
Why do you want to have free RAM? Seriously!

NEVER ever do cache drop in production. This drops performance significantly! This memory is used/freed if it is needed elsewhere. Linux normally uses all available memory for caching of different things (metadata, raw blocks from disk, inodes). Please refer to the official documentation at https://www.kernel.org/doc/Documentation/sysctl/vm.txt and look for drop_caches.
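
If you want to understand what that knob actually does, these are the documented values (a quick sketch, straight from that vm.txt page):

Code:
# free page cache only
echo 1 > /proc/sys/vm/drop_caches
# free reclaimable slab objects (dentries and inodes)
echo 2 > /proc/sys/vm/drop_caches
# free both
echo 3 > /proc/sys/vm/drop_caches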

Also, swapping is not bad in general - but a lot of swap-ins and swap-outs is indeed bad.

Besides, drop_caches does not clean the entire ARC (for me only a very, very small part), because it is not a normal Linux buffer; it's stored in SPL. On a pure ZFS system, I have never observed the general buffers/cached entry in free showing big numbers the way it would on non-ZFS systems.
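
To see how much RAM the ARC really holds, look at the SPL kstats instead of free (a quick check; paths as on a standard ZFS-on-Linux install):

Code:
# current ARC size and its min/max targets, in bytes
grep -E '^(size|c_min|c_max)' /proc/spl/kstat/zfs/arcstats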
 
Hi,
ZFS uses the RAM as cache!
If you clear the cache, you will see much more free RAM:
Code:
echo 3 > /proc/sys/vm/drop_caches
Udo

Code:
root@hn2:~# free -m
  total  used  free  shared  buffers  cached
Mem:  24098  17661  6436  0  4  99
-/+ buffers/cache:  17557  6541
Swap:  23551  0  23551
root@hn2:~# cat /proc/sys/vm/drop_caches
0
root@hn2:~# echo 3 > /proc/sys/vm/drop_caches
root@hn2:~# cat /proc/sys/vm/drop_caches
3
root@hn2:~# free -m
  total  used  free  shared  buffers  cached
Mem:  24098  5078  19020  0  0  50
-/+ buffers/cache:  5028  19070
Swap:  23551  0  23551

Hi Udo, thank you so much for your reply! This command works like a charm.

But reading LnxBil's post I understand that this is not good on production servers (or at least not frequently), so I suppose it is a bad idea to run this command from a daily (or maybe weekly) cron job, right?
 
Why do you want to have free RAM? Seriously!

NEVER ever do cache drop in production. This drops performance significantly! This memory is used/freed if it is needed elsewhere. Linux normally uses all available memory for caching of different things (metadata, raw blocks from disk, inodes). Please refer to the official documentation at https://www.kernel.org/doc/Documentation/sysctl/vm.txt and look for drop_caches.

Also, swapping is not bad in general - but a lot of swap-ins and swap-outs is indeed bad.

Besides, drop_caches does not clean the entire ARC (for me only a very, very small part), because it is not a normal Linux buffer; it's stored in SPL. On a pure ZFS system, I have never observed the general buffers/cached entry in free showing big numbers the way it would on non-ZFS systems.

Hi LnxBil, thank you so much for your reply. I am really just testing the system: as I said previously, I come from OpenVZ on ext4 partitions and I notice a big difference in this respect, but it seems I am heading in the right direction ;)

I don't know whether, when a VM is deleted, this memory is freed automatically or whether a cache clean is needed. You recommend not abusing cache dropping on production systems, but as you know, on a production server many VPSs can be stopped or "terminated" while still holding cache. So do you do any weekly swap/cache dropping (automated with cron), do you do it by hand when swap usage gets bad, or do you not recommend it in any situation?

I understand from this (please correct me if I am wrong) that if I clear the cache (via /proc/sys/vm/drop_caches) it can decrease VM performance, but for running VMs the cache will be regenerated, right?

Thank you so much, guys, for all the help and recommendations!
 
Why do you want to have free RAM? Seriously!
Really, I want to test all the benefits and, of course, get an approximation of what server density I can achieve with the hardware I have.

Of course this is not yet a production system, and the next step will be to create more and more VMs (OpenVZ and KVM with different OSes) to check server degradation and the use of real RAM (mainly the ARC cache, as I understand it), and also to check the cache and log on the SSD.

Maybe it is not the best or most effective method, but it is a good starting point to check how ZFS and its caches work and, just as important, to estimate the real capacity when adding multiple VMs.
 

Hi Udo, thank you so much for your reply! This command works like a charm.

But reading LnxBil's post I understand that this is not good on production servers (or at least not frequently), so I suppose it is a bad idea to run this command from a daily (or maybe weekly) cron job, right?
Hi,
you don't need to flush the cache...
I only do this for performance measurements, to avoid measuring caching instead of "real" IO.

Udo
 
Hi,

After I created 10 test VMs (all of them running), I had a long freeze (no access to the VMs, the Proxmox GUI, or SSH). The system responded again after about 10 minutes; here are the logs (I did not reboot):

dmesg log:
Code:
INFO: task ntpd:3237 blocked for more than 120 seconds.
  Tainted: P  -- ------------  2.6.32-43-pve #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ntpd  D ffff880635124740  0  3237  1  0 0x00000000
ffff88062176f080 0000000000000082 ffffffffa012ab04 0000000000000000
ffff880635124740 ffff880635124740 ffff880635124740 ffff880635124740
00ffffff00000000 ffff88062176f018 0000000000000000 0000000000000001
Call Trace:
[<ffffffffa012ab04>] ? dmu_zfetch+0x2b4/0xd80 [zfs]
[<ffffffff815ac085>] rwsem_down_failed_common+0x95/0x1e0
[<ffffffffa01119b8>] ? dbuf_rele_and_unlock+0x268/0x400 [zfs]
[<ffffffffa0111517>] ? dbuf_read+0x6b7/0x8f0 [zfs]
[<ffffffff815ac226>] rwsem_down_read_failed+0x26/0x30
[<ffffffff812a6a84>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff815ab904>] ? down_read+0x24/0x2b
[<ffffffffa011a7e1>] dmu_buf_hold_array_by_dnode+0x51/0x480 [zfs]
[<ffffffffa011ada6>] dmu_buf_hold_array+0x66/0x90 [zfs]
[<ffffffffa011c4f2>] dmu_write_bio+0x72/0x1a0 [zfs]
[<ffffffffa01c250f>] zvol_request+0x23f/0x610 [zfs]
[<ffffffffa09377ab>] ? kvm_unmap_rmapp+0x4b/0x70 [kvm]
[<ffffffff8129e4ce>] ? radix_tree_tag_clear+0x1e/0x220
[<ffffffff81279edc>] generic_make_request+0x28c/0x420
[<ffffffff8127a0f3>] submit_bio+0x83/0x1c0
[<ffffffff811539a5>] ? test_set_page_writeback+0xe5/0x1e0
[<ffffffff81184a49>] swap_writepage+0xd9/0x120
[<ffffffff81157abc>] pageout.constprop.22+0x16c/0x2d0
[<ffffffff81159da0>] shrink_page_list.constprop.21+0x7c0/0xa60
[<ffffffff8115a3b3>] shrink_inactive_list+0x373/0xa70
[<ffffffff81153bea>] ? determine_dirtyable_memory+0x1a/0x30
[<ffffffff8115ad7e>] shrink_lruvec+0x2ce/0x600
[<ffffffff8115b285>] shrink_zone+0x1d5/0x410
[<ffffffff811581f8>] ? shrink_slab+0x1a8/0x1f0
[<ffffffff8115cdf6>] do_try_to_free_pages+0x516/0xa50
[<ffffffff8115d53a>] try_to_free_pages+0x8a/0x110
[<ffffffff81152437>] __alloc_pages_nodemask+0x6b7/0xad0
[<ffffffff81154840>] ? __do_page_cache_readahead+0xd0/0x210
[<ffffffff81191174>] alloc_pages_current+0xa4/0x110
[<ffffffff8113d1e3>] __page_cache_alloc+0x43/0xb0
[<ffffffff8113f83b>] filemap_fault+0x31b/0x590
[<ffffffff811ca200>] ? pollwake+0x0/0x70
[<ffffffff8116bf60>] __do_fault+0x80/0x5d0
[<ffffffff811ca200>] ? pollwake+0x0/0x70
[<ffffffff81170bbb>] handle_pte_fault+0x9b/0x1210
[<ffffffff810a78cb>] ? do_schedule_next_timer+0x4b/0xe0
[<ffffffff81171f4c>] handle_mm_fault+0x21c/0x300
[<ffffffff8109789f>] ? get_signal_to_deliver+0xbf/0x480
[<ffffffff8104d35a>] __do_page_fault+0x18a/0x4c0
[<ffffffff815af55b>] do_page_fault+0x3b/0xa0
[<ffffffff815ac905>] page_fault+0x25/0x30

Here are the current ZFS stats (note that the node has NOT been rebooted since the freeze):
Code:
grep c_max /proc/spl/kstat/zfs/arcstats
c_max  4  12634667008



root@hn2:/sys/module/zfs/parameters# cat /proc/spl/kstat/zfs/arcstats |grep c_
c_min  4  33554432
c_max  4  12634667008
arc_no_grow  4  0
arc_tempreserve  4  0
arc_loaned_bytes  4  0
arc_prune  4  0
arc_meta_used  4  199822560
arc_meta_limit  4  9476000256
arc_meta_max  4  239988408
arc_meta_min  4  16777216
arc_need_free  4  0
arc_sys_free  4  394829824

cat /sys/module/zfs/parameters/zfs_arc_max
0

Some other considerations:

1. Seconds before the freeze, RAM was almost full (this server has 24 GB of RAM). In these logs I can see that c_max is 12634667008, i.e. about 12 GB, half of the real RAM (see the sketch below this list for capping it).

2. I have a 120 GB SSD for the ZIL (log) & L2ARC:
Code:
# zpool iostat -v 2 300
  capacity  operations  bandwidth
pool  alloc  free  read  write  read  write
--------------------------------------------------  -----  -----  -----  -----  -----  -----
rpool  72.5G  623G  14  96  1.12M  4.45M
  mirror  72.5G  623G  14  59  1.12M  1.80M
  sda2  -  -  6  26  594K  1.92M
  sdb2  -  -  6  26  565K  1.92M
logs  -  -  -  -  -  -
  sdc1  932K  9.25G  0  36  34  2.65M
cache  -  -  -  -  -  -
  scsi-SATA_OCZ-VERTEX3_OCZ-4HLPxxxxxxxx-part2  23.4G  79.1G  1  6  74.0K  685K
--------------------------------------------------  -----  -----  -----  -----  -----  -----

The SSD cache is working and it still has plenty of free space.

It looks like this heavy freeze happens when the server's real RAM fills up. I assumed that if RAM (the ARC cache) is full, the SSD cache (L2ARC) would help here, and I could understand a performance decrease, but not such a heavy freeze.
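
As a first countermeasure I am thinking of capping the ARC explicitly instead of leaving zfs_arc_max at 0 (the default, half of the RAM). From what I read in the Proxmox/ZFS-on-Linux docs it would look roughly like this - the 8 GB value is only an example for my 24 GB box, not a recommendation:

Code:
# runtime change (takes effect immediately, lost on reboot)
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max

# persistent: module option plus initramfs refresh (ZFS root)
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
update-initramfs -u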

Thanks!
 
Searching about this issue I found this:
I fixed this by lowering the mark for flushing the cache from 40% to 10% by setting "vm.dirty_ratio=10" in /etc/sysctl.conf.
(and maybe vm.dirty_background_ratio=5?)
I'll give it a try. All considerations regarding performance / ZFS parameters or limits that I currently have are very welcome!
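
For reference, this is what I plan to try on the node (just transcribing the quoted values; I don't know yet whether they help in my case):

Code:
cat >> /etc/sysctl.conf << 'EOF'
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
EOF
# apply without reboot
sysctl -p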
 
Hi LnxBil, thank you so much for your reply. I am really just testing the system: as I said previously, I come from OpenVZ on ext4 partitions and I notice a big difference in this respect, but it seems I am heading in the right direction ;)

I don't know whether, when a VM is deleted, this memory is freed automatically or whether a cache clean is needed. You recommend not abusing cache dropping on production systems, but as you know, on a production server many VPSs can be stopped or "terminated" while still holding cache. So do you do any weekly swap/cache dropping (automated with cron), do you do it by hand when swap usage gets bad, or do you not recommend it in any situation?

I understand from this (please correct me if I am wrong) that if I clear the cache (via /proc/sys/vm/drop_caches) it can decrease VM performance, but for running VMs the cache will be regenerated, right?

Thank you so much, guys, for all the help and recommendations!

Cache is freed automatically if the space is needed elsewhere. I NEVER drop caches or do unswapping. Linux is very capable of doing this on its own. I monitor my swap-in/swap-out to get an estimate of whether I need to change the memory load on the server, but normally I do not run into problems here. If you're swapping regularly, your customers will notice it, because the machines will get slow. Same with cache drops: this will significantly reduce performance. Just let the Linux memory management work.
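
If you want to watch that yourself, something simple is enough (a quick sketch; the interesting columns are si/so):

Code:
# print memory/swap statistics every 5 seconds;
# sustained non-zero si (swap-in) / so (swap-out) is the bad sign
vmstat 5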
 
Your hanging process seems like the exact problem I discovered myself yesterday, and others agreed that it's a real problem:
https://forum.proxmox.com/threads/zfs-swap-crashes-system.25208/
Yes, you are absolutely right. Reading the post, I could test the @Nemesiz suggestion:
Code:
zfs set primarycache=metadata rpool/swap
zfs set secondarycache=metadata rpool/swap
zfs set compression=off rpool/swap
zfs set sync=disabled rpool/swap
zfs set checksum=on rpool/swap

And (as suggested by @Q-wulf) the previous default value:
echo 0 > /proc/sys/vm/drop_caches
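
To double-check that the properties really took effect on the swap zvol, I verified them with zfs get (just a sanity check):

Code:
zfs get primarycache,secondarycache,compression,sync,checksum rpool/swap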

And now the node is NOT freezing! Good news for a Friday morning ;) Thanks to @Nemesiz.


I can also see now (I don't know if it is directly related) that system RAM decreases as I power off the VMs; this looks like a normal situation.

On the other hand, as I said, when RAM (and so the ARC) is almost full (the 24 GB), the node does not freeze, but I can see a hard performance drop on the VMs. Please note that I have an SSD as L2ARC cache and log, with free space left on the SSD cache:

Code:
# zpool iostat -v 2 300
  capacity  operations  bandwidth
pool  alloc  free  read  write  read  write
--------------------------------------------------  -----  -----  -----  -----  -----  -----
rpool  102G  594G  19  91  1.00M  3.46M
  mirror  102G  594G  19  63  1.00M  1.49M
  sda2  -  -  9  23  548K  1.60M
  sdb2  -  -  7  23  492K  1.60M
logs  -  -  -  -  -  -
  sdc1  48K  9.25G  0  27  17  1.97M
cache  -  -  -  -  -  -
  scsi-SATA_OCZ-VERTEX3_OCZ-4HLxxxxxxxxxxxxxx-part2  31.9G  70.6G  2  5  92.7K  502K
--------------------------------------------------  -----  -----  -----  -----  -----  -----
Yeah, I know that the OCZ Vertex 3 is not the most modern or most suitable SSD for these needs. :D

I did not run serious benchmark tests, but I can confirm that when RAM is almost full the performance drop is incredible; colloquially speaking, it feels like a server with SATA drives on ext4 partitions, without any kind of cache, and of course under high load.

Of course the ARC in RAM is much faster than the SSD (L2ARC cache), but I don't know whether this is to be expected or whether I can tune some options (mount options, ZFS tuning) to try to get better performance here.

Some practical examples to give an idea:

With enough free RAM on the node (so working almost entirely from the ARC in RAM), Linux containers reboot in 1-2 seconds and Windows 2012 R2 in 4-5 seconds - very good. With RAM full and the L2ARC working, Linux reboots take 10-20 seconds up to a minute or more, and Windows 2012 can take a minute or more.

Copying files on Windows guests: with ARC/RAM free, copying many files (thousands of them) locally can reach 300+ MB/s of data transfer (not continuous, but very good) - really fast. On a saturated system: max. 80 MB/s, usually 4-7 MB/s, later only some KB/s... so the same operation takes about 10 times longer or more.

I'm sorry for these "home benchmarks", but I think they give an idea of how the server degrades, so you can judge whether this is to be expected as "normal" or whether I can do some further tuning ;)


Greetings!
 
There are a lot of threads in this forum that analyse this behavior; please read them, especially regarding how L2ARC can actually slow down your system and how the ZIL is only used for synchronous writes.
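
One way to check whether the L2ARC is earning its keep is to look at the l2_* counters in arcstats (a rough check; the header bytes shown there are RAM taken away from the ARC itself):

Code:
# L2ARC hit/miss counters and how much ARC memory its headers consume
grep -E '^l2_(hits|misses|size|hdr_size)' /proc/spl/kstat/zfs/arcstats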
 
Hi LnxBil, thank you so much for your comments. Since I'm new to the Proxmox & ZFS world, I am following the excellent Proxmox install & config guide and of course looking through forum posts; really, many of the main issues can already be found in the forum. I am not a native English speaker, so maybe I am missing some important considerations because of the language - anyway, that is no excuse :)

But allow me a few questions, to make sure I am clear on the root cause of this. In these terms, the issue is that when the server's RAM gets full, the L2ARC slows things down, and the ZIL is only used on sync writes, as you mention.

For example, I found this thread, which may not be the same case; in any case, the setting
sync=disabled rpool/swap

This makes the server more stable (the node doesn't freeze), but the performance is poor. I don't know whether completely disabling swap would be a good idea (I understand that the current swap lives on the SATA disks and that this is the root cause).
Maybe the option cache=none when creating VMs could be good...

In any case, as you can see, I am quite lost about the main root cause of this and how to attempt to fix it.

Best regards!
 
Hi amil,

There is no simple answer to performance bottlenecks. The main reason is that ZFS is in general not as fast as a simple file system like ext4. The benefits of ZFS are snapshots, CoW, checksummed data etc., which cost performance. Optimizing a two-disk setup even with an SSD is hard, and I could not get a fast system, which I have also read in this and other forums. I stick to no SSD and get a more stable and faster setup. I use ZFS on my laptop for easier backups, but it feels a lot slower than ext4 - even with L2ARC on the built-in SSD.

Please read up on L2ARC and how it affects (reduces) the ARC. The ZIL is also not used if you're not doing sync write operations.

For server hardware, I can only state that you make ZFS faster by throwing more hardware at it. The actual hardware requirements depend on the desired workload, and I have not read any benchmarks regarding ZFS vs. ext4, but the difference is not negligible IMHO.

Best,
LnxBil
 
Hi LnxBil, thank you for your response and the useful details.

I've made some of the suggested changes, run tests, and can confirm:

1. Limiting the ARC is a must (the docs recommend half of the real memory). If the ARC is limited and almost full, I can't create new VMs/containers: that's fine, I get a message in Proxmox and system performance stays okay.

2. About swap: since swap lives in a zvol on the root pool (on the slow SATA drives), it also hurts performance badly. While testing, I completely disabled swap on the server (swapoff), commented it out in /etc/fstab, and set vm.swappiness=0 on the node, as sketched below. The results were successful; I don't know whether completely disabling swap in order to increase performance is a good (or the best) option.
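
Concretely, what I did was roughly this (a sketch of my steps; whether dropping swap entirely is wise is exactly my open question):

Code:
# turn off all active swap devices (the swap zvol included)
swapoff -a
# then comment out the swap entry in /etc/fstab by hand, and:
echo "vm.swappiness = 0" >> /etc/sysctl.conf
sysctl -p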

I agree with you: more hardware = more performance. In my opinion, comparing the ZFS model and its benefits with traditional ext4, as long as the server has enough free real RAM (and ARC), the performance is amazing and fast.

So I think that on the same hardware ZFS will always work better than a simple ext4 filesystem as long as there is enough free ARC. The server density on ZFS servers should be lower, so you can allocate fewer containers/VMs on a ZFS node, but they will run faster... of course this is just my opinion and a reflection, and I have not done tests yet to confirm it.


Greetings!
 
