Frequent CPU stalls in KVM guests during high IO on host

gkovacs

Since we upgraded our cluster from PVE 3.4 to 4.3, all our OpenVZ containers have been converted to KVM virtual machines. In many of these guests we get frequent console alerts about CPU stalls, usually when the cluster node is under high IO load (for example when backing up or restoring VMs to/from NFS). These CPU stall intervals usually last a couple of minutes, during which the VM does not respond on the network, though the console still works. If the high IO activity on the host stops, the VMs recover in a few seconds.

It happens on all our nodes running PVE 4.3. We are running ZFS (SSD RAIDZ or HDD RAID10) on our nodes as local storage for VMs, and NFS for backups.

Strangely, it seems not to affect Ubuntu 16.04 guests running kernel 4.4.0, but it does affect older Ubuntu releases and all Debians (6/7/8) running kernels 2.6.32, 3.x and 4.7.

Below is how it looks in the guest's /var/log/messages.

Debian 8, kernel 4.7 (2 episodes within 2 minutes)
Code:
Nov 28 14:42:39 php-slave-03 kernel: [315902.792165] rcu_sched       S ffff88023fc572c0     0     7      2 0x00000000
Nov 28 14:42:39 php-slave-03 kernel: [315902.792171]  ffff880236210e40 ffff88023622a180 ffff88023fc50080 0000000000000186
Nov 28 14:42:39 php-slave-03 kernel: [315902.792174]  ffff880236218000 ffff880236217e50 0000000104b3aadb ffff880236217dc0
Nov 28 14:42:39 php-slave-03 kernel: [315902.792177]  ffff88023fc50080 0000000000000001 ffffffffbe9db651 ffff88023fc50080
Nov 28 14:42:39 php-slave-03 kernel: [315902.792180] Call Trace:
Nov 28 14:42:39 php-slave-03 kernel: [315902.792263]  [<ffffffffbe9db651>] ? schedule+0x31/0x80
Nov 28 14:42:39 php-slave-03 kernel: [315902.792271]  [<ffffffffbe9de801>] ? schedule_timeout+0x161/0x2c0
Nov 28 14:42:39 php-slave-03 kernel: [315902.792293]  [<ffffffffbe4e68d0>] ? trace_raw_output_tick_stop+0x70/0x70
Nov 28 14:42:39 php-slave-03 kernel: [315902.792306]  [<ffffffffbe4bdd82>] ? prepare_to_swait+0x52/0x60
Nov 28 14:42:39 php-slave-03 kernel: [315902.792318]  [<ffffffffbe4e11bb>] ? rcu_gp_kthread+0x3db/0x840
Nov 28 14:42:39 php-slave-03 kernel: [315902.792321]  [<ffffffffbe9db173>] ? __schedule+0x293/0x740
Nov 28 14:42:39 php-slave-03 kernel: [315902.792324]  [<ffffffffbe4e0de0>] ? force_qs_rnp+0x130/0x130
Nov 28 14:42:39 php-slave-03 kernel: [315902.792336]  [<ffffffffbe49b87f>] ? kthread+0xdf/0x100
Nov 28 14:42:39 php-slave-03 kernel: [315902.792340]  [<ffffffffbe9df7ef>] ? ret_from_fork+0x1f/0x40
Nov 28 14:42:39 php-slave-03 kernel: [315902.792343]  [<ffffffffbe49b7a0>] ? kthread_park+0x50/0x50
Nov 28 14:42:39 php-slave-03 kernel: [315902.792353] Task dump for CPU 0:
Nov 28 14:42:39 php-slave-03 kernel: [315902.792355] kworker/0:1     R  running task        0  1397      2 0x00000008
Nov 28 14:42:39 php-slave-03 kernel: [315902.792403] Workqueue: events_freezable_power_ disk_events_workfn
Nov 28 14:42:39 php-slave-03 kernel: [315902.792406]  0000000000000000 00000000bfdc3b8f ffffffffbe579667 ffff88023fc18040
Nov 28 14:42:39 php-slave-03 kernel: [315902.792409]  ffffffffbee54240 0000000000000000 ffff880044c52200 ffffffffbe4e1fcc
Nov 28 14:42:39 php-slave-03 kernel: [315902.792412]  ffffffffbe4ee381 001dcd6500000000 00011f4fc3d5852b 0000000000000046
Nov 28 14:42:39 php-slave-03 kernel: [315902.792416] Call Trace:
Nov 28 14:42:39 php-slave-03 kernel: [315902.792421]  <IRQ>  [<ffffffffbe579667>] ? rcu_dump_cpu_stacks+0x67/0x86
Nov 28 14:42:39 php-slave-03 kernel: [315902.792434]  [<ffffffffbe4e1fcc>] ? rcu_check_callbacks+0x70c/0x7b0
Nov 28 14:42:39 php-slave-03 kernel: [315902.792441]  [<ffffffffbe4ee381>] ? timekeeping_update+0xf1/0x150
Nov 28 14:42:39 php-slave-03 kernel: [315902.792448]  [<ffffffffbe4efa43>] ? update_wall_time+0x2e3/0x7b0
Nov 28 14:42:39 php-slave-03 kernel: [315902.792455]  [<ffffffffbe4f7920>] ? tick_sched_do_timer+0x30/0x30
Nov 28 14:42:39 php-slave-03 kernel: [315902.792458]  [<ffffffffbe4e8422>] ? update_process_times+0x32/0x60
Nov 28 14:42:39 php-slave-03 kernel: [315902.792460]  [<ffffffffbe4f7340>] ? tick_sched_handle.isra.14+0x20/0x50
Nov 28 14:42:39 php-slave-03 kernel: [315902.792462]  [<ffffffffbe4f7958>] ? tick_sched_timer+0x38/0x70
Nov 28 14:42:39 php-slave-03 kernel: [315902.792466]  [<ffffffffbe4e8fea>] ? __hrtimer_run_queues+0xea/0x280
Nov 28 14:42:39 php-slave-03 kernel: [315902.792468]  [<ffffffffbe4e9469>] ? hrtimer_interrupt+0x99/0x190
Nov 28 14:42:39 php-slave-03 kernel: [315902.792602]  [<ffffffffc043f680>] ? ata_scsiop_inq_std+0x140/0x140 [libata]
Nov 28 14:42:39 php-slave-03 kernel: [315902.792608]  [<ffffffffbe9e1eb9>] ? smp_apic_timer_interrupt+0x39/0x50
Nov 28 14:42:39 php-slave-03 kernel: [315902.792611]  [<ffffffffbe9e01e2>] ? apic_timer_interrupt+0x82/0x90
Nov 28 14:42:39 php-slave-03 kernel: [315902.792612]  <EOI>  [<ffffffffc043f680>] ? ata_scsiop_inq_std+0x140/0x140 [libata]
Nov 28 14:42:39 php-slave-03 kernel: [315902.792626]  [<ffffffffbe9df1a1>] ? _raw_spin_unlock_irqrestore+0x11/0x20
Nov 28 14:42:39 php-slave-03 kernel: [315902.792636]  [<ffffffffc0443db5>] ? ata_scsi_queuecmd+0x155/0x360 [libata]
Nov 28 14:42:39 php-slave-03 kernel: [315902.792713]  [<ffffffffc03afae8>] ? scsi_dispatch_cmd+0xd8/0x220 [scsi_mod]
Nov 28 14:42:39 php-slave-03 kernel: [315902.792725]  [<ffffffffc03b2ae3>] ? scsi_request_fn+0x473/0x600 [scsi_mod]
Nov 28 14:42:39 php-slave-03 kernel: [315902.792735]  [<ffffffffbe6eb1af>] ? __blk_run_queue+0x2f/0x40
Nov 28 14:42:39 php-slave-03 kernel: [315902.792738]  [<ffffffffbe6f3eb8>] ? blk_execute_rq_nowait+0xa8/0x160
Nov 28 14:42:39 php-slave-03 kernel: [315902.792741]  [<ffffffffbe6f3fe7>] ? blk_execute_rq+0x77/0x120
Nov 28 14:42:39 php-slave-03 kernel: [315902.792750]  [<ffffffffbe6e53a4>] ? bio_phys_segments+0x14/0x20
Nov 28 14:42:39 php-slave-03 kernel: [315902.792753]  [<ffffffffbe6f3d7a>] ? blk_rq_map_kern+0xaa/0x120
Nov 28 14:42:39 php-slave-03 kernel: [315902.792755]  [<ffffffffbe6edae2>] ? blk_get_request+0x72/0xf0
Nov 28 14:42:39 php-slave-03 kernel: [315902.792765]  [<ffffffffc03af61c>] ? scsi_execute+0x12c/0x1d0 [scsi_mod]
Nov 28 14:42:39 php-slave-03 kernel: [315902.792774]  [<ffffffffc03b140f>] ? scsi_execute_req_flags+0x8f/0xf0 [scsi_mod]
Nov 28 14:42:39 php-slave-03 kernel: [315902.792793]  [<ffffffffc039268e>] ? sr_check_events+0xbe/0x2d0 [sr_mod]
Nov 28 14:42:39 php-slave-03 kernel: [315902.793041]  [<ffffffffc031d054>] ? cdrom_check_events+0x14/0x30 [cdrom]
Nov 28 14:42:39 php-slave-03 kernel: [315902.793046]  [<ffffffffbe6fec52>] ? disk_check_events+0x62/0x150
Nov 28 14:42:39 php-slave-03 kernel: [315902.793049]  [<ffffffffbe495afb>] ? process_one_work+0x14b/0x400
Nov 28 14:42:39 php-slave-03 kernel: [315902.793052]  [<ffffffffbe4965a5>] ? worker_thread+0x65/0x4a0
Nov 28 14:42:39 php-slave-03 kernel: [315902.793054]  [<ffffffffbe496540>] ? rescuer_thread+0x340/0x340
Nov 28 14:42:39 php-slave-03 kernel: [315902.793056]  [<ffffffffbe49b87f>] ? kthread+0xdf/0x100
Nov 28 14:42:39 php-slave-03 kernel: [315902.793060]  [<ffffffffbe9df7ef>] ? ret_from_fork+0x1f/0x40
Nov 28 14:42:39 php-slave-03 kernel: [315902.793062]  [<ffffffffbe49b7a0>] ? kthread_park+0x50/0x50



Nov 28 14:44:10 php-slave-03 kernel: [315993.500650] Task dump for CPU 0:
Nov 28 14:44:10 php-slave-03 kernel: [315993.500652] kworker/0:1     R  running task        0  1397      2 0x00000008
Nov 28 14:44:10 php-slave-03 kernel: [315993.500683] Workqueue: events_freezable_power_ disk_events_workfn
Nov 28 14:44:10 php-slave-03 kernel: [315993.500686]  ffff88023fc16ac0 0000000000000000 ffffe8ffffc02900 0000000000000000
Nov 28 14:44:10 php-slave-03 kernel: [315993.500688]  ffffffffbe495afb 000000003fc16ac0 ffff88023fc16ac0 ffff880191568270
Nov 28 14:44:10 php-slave-03 kernel: [315993.500690]  0000000000000008 ffff88023fc16ae0 ffff880191568240 ffff880044c52200
Nov 28 14:44:10 php-slave-03 kernel: [315993.500693] Call Trace:
Nov 28 14:44:10 php-slave-03 kernel: [315993.500709]  [<ffffffffbe495afb>] ? process_one_work+0x14b/0x400
Nov 28 14:44:10 php-slave-03 kernel: [315993.500713]  [<ffffffffbe4965a5>] ? worker_thread+0x65/0x4a0
Nov 28 14:44:10 php-slave-03 kernel: [315993.500715]  [<ffffffffbe496540>] ? rescuer_thread+0x340/0x340
Nov 28 14:44:10 php-slave-03 kernel: [315993.500718]  [<ffffffffbe49b87f>] ? kthread+0xdf/0x100
Nov 28 14:44:10 php-slave-03 kernel: [315993.500729]  [<ffffffffbe9df7ef>] ? ret_from_fork+0x1f/0x40
Nov 28 14:44:10 php-slave-03 kernel: [315993.500731]  [<ffffffffbe49b7a0>] ? kthread_park+0x50/0x50
Nov 28 14:44:16 php-slave-03 kernel: [315999.128486] Modules linked in: nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_addrtype nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack iptable_filter ip_tables x_tables crct10dif_pclmul crc32_pclmul ghash_clmulni_intel jitterentropy_rng hmac joydev drbg hid_generic ansi_cprng usbhid aesni_intel cirrus hid aes_x86_64 lrw ttm gf128mul glue_helper drm_kms_helper ablk_helper drm cryptd ppdev evdev i2c_piix4 acpi_cpufreq serio_raw virtio_balloon shpchp tpm_tis parport_pc pcspkr tpm parport button autofs4 ext4 crc16 jbd2 mbcache sg sr_mod cdrom ata_generic virtio_net virtio_blk ata_piix libata uhci_hcd crc32c_intel ehci_hcd scsi_mod usbcore psmouse virtio_pci virtio_ring virtio usb_common floppy

This looks like the same issue that has been experienced by some Linode users after moving to KVM:
https://forum.linode.com/viewtopic.php?p=67775&sid=6159312034f76c59f8981ad183a96165

Has anyone seen this?
Any idea how to prevent it?
 
This is the essential CPU stall console message; since it passed relatively quickly, it didn't freeze up the kernel:

Code:
[301703.260023] INFO: rcu_sched detected stalls on CPUs/tasks: { 1} (detected by 2, t=8978 jiffies, g=2886633, c=2886632, q=265)
[301711.937277]  [<ffffffff810d259a>] ? rcu_check_callbacks+0x6aa/0x6d0
[301816.940024] INFO: rcu_sched detected stalls on CPUs/tasks: { 1} (detected by 0, t=9837 jiffies, g=2887160, c=2887159, q=219)
[301816.944017]  [<ffffffff810d259a>] ? rcu_check_callbacks+0x6aa/0x6d0

We get these on many VMs during nightly backups to NFS. I suspect the host kernel starves the guests of IO and iowait locks up the CPU, but I haven't yet observed this in real time on a guest, so I'm only speculating.

The guest here is a Debian 7 KVM VPS: 4 CPUs, 4 GB RAM and a 160 GB virtio disk with cache=none, hosted on a ZFS RAID10 pool (local-zfs).

Is this an issue with ZFS?
Any idea how to prevent this?
 
Do you really have only 4 GB of RAM, or was that a typo? Anyway, have you monitored the system during backup using iotop, iostat, top, etc. to find at least some culprit?
 
If you read carefully, the 4 GB of RAM belongs to the KVM guest that produced the errors. These CPU stalls are happening on all of our nodes, regardless of hardware configuration.

Yes, we tried monitoring both the hosts and the guests, but no particular cause has shown up.
 
Aah, yes, I see now, thanks :)

Do all the nodes have the same HW configuration? I would try to reproduce the situation on a test box if you have one.

I was getting some CPU stalls some time ago, but that was on a bare-metal server, and it turned out to be a problem with a tape drive and HBA.
 
Well, our nodes are very different hardware-wise, yet the problem surfaces on all of them (though not on all guests). We have experienced the issue on single and dual socket servers with 32GB to 96GB RAM, but there is one thing in common: all of them use ZFS (KVM guests are running on local-zfs).

So here is a configuration that I think would reproduce the issue:

- Proxmox 4.3 installed on ZFS RAID10 (4 or 6 spinning disks) or RAIDZ (3 SSDs)
- all packages updated from pve-no-subscription
- NFS storage created from storage provided by KVM guest 1
- snapshot backup job for all guests to NFS storage
- KVM guest 1 (lvm): OpenMediaVault installed on separate disk (providing NFS storage for backups)
- KVM guest 2 (local-zfs): Debian 7 (4 CPU, 4 GB RAM, at least 50GB of data)
- KVM guest 3,4,5 (local-zfs): doesn't matter, but should use lots of storage

So in our case KVM guest 2 definitely shows CPU stalls when the backup job is running...
 
Hi

The same problem happened to me. It seems to be a problem only with VMs that came from 3.4: I have a VM from 3.4 (which never had a problem there), and a couple of weeks ago I updated to 4.3. After a couple of days it started, every couple of hours, to end up with a broken web server, showing just this error:

rcu_sched detected stalls on cpus/tasks ...

and the whole VM stops working. What I noticed on my dual Xeon host server is that all of the 24 GB RAM is 99% used on the node, but not in the VM.

I have another VM created with v4.3, with no errors; it works fine.

Maybe I will install v4.4 and rebuild the entire web server from scratch (oh my...) and hope these errors are gone.
 
Thank you for this report. In our case, almost all of our VMs were created under 4.3 (with their contents synced from OpenVZ), so I doubt that the version of the node that created the VM has any effect.

One possible solution
On the other hand, we have found one thing that probably helps Linux guests: increase the sysctl value of vm.min_free_kbytes on both your Proxmox host and your guests. On hosts we set 262144 kbytes if total system RAM is 32 GB or less; for our high-memory servers (48-64-72-96 GB) we use 524288 kbytes. In VMs we use 65536 kbytes for 1-2 GB RAM, 131072 kbytes for 4-8 GB RAM, or the above host values if more. Don't go above these values, as that could cause serious memory allocation problems!
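The tiers above can be sketched as a small helper function. This is a hypothetical script of mine, not something shipped with Proxmox; the thresholds are just the ones quoted in this post:

```shell
#!/bin/sh
# Hypothetical helper mirroring the tiers above: takes a RAM size in GB
# and prints the suggested vm.min_free_kbytes value. The thresholds are
# assumptions drawn from this post, not official guidance.
suggest_min_free_kbytes() {
    ram_gb=$1
    if [ "$ram_gb" -le 2 ]; then
        echo 65536      # guests with 1-2 GB RAM
    elif [ "$ram_gb" -le 8 ]; then
        echo 131072     # guests with 4-8 GB RAM
    elif [ "$ram_gb" -le 32 ]; then
        echo 262144     # hosts with up to 32 GB RAM
    else
        echo 524288     # high-memory hosts (48-96 GB)
    fi
}

suggest_min_free_kbytes 32   # prints 262144
```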

This is how you can see the value currently applied to your system (same command on host and in guests), and how to set it higher:
Code:
root@proxmox:~# sysctl vm.min_free_kbytes
vm.min_free_kbytes = 49060
root@proxmox:~# sysctl vm.min_free_kbytes=131072
vm.min_free_kbytes = 131072

You probably want the change to be permanent, so add the value to /etc/sysctl.conf:
Code:
root@proxmox:~# echo "vm.min_free_kbytes=131072" >> /etc/sysctl.conf

Setting the above values decreased the frequency of the guest CPU stalls considerably. I'm really curious whether it works for you as well.
 
Hi gkovacs, and thank you for the quick answer.

On my Proxmox host:
Code:
sysctl vm.min_free_kbytes
vm.min_free_kbytes = 19863

On Debian Server VM:
Code:
sysctl vm.min_free_kbytes
vm.min_free_kbytes = 67584

I have a total of 24 GB ECC RAM on the host, and for the VM I set a minimum of 12 GB and a maximum of 16 GB.

But I will try your proposal now and see how it reacts, because I have serious problems: my VM stops working because of this error, and all the sites on my production server are down until I reset the VM or the entire Proxmox host server.

I have now set the same value on both the host and the VM:
Code:
echo "vm.min_free_kbytes=262144" >> /etc/sysctl.conf
(given the min 12 GB / max 16 GB defined in the VM's RAM options; I hope that value is OK?)
 
I think it's fine. I have a couple of other sysctl tweaks that are all supposed to decrease IO load through better use of memory, thus preventing stalls and OOM kills:

Code:
vm.min_free_kbytes = 262144
vm.dirty_background_ratio = 5
vm.swappiness = 1

This is about why dirty_background_ratio needs to be small:
https://major.io/2008/08/07/reduce-disk-io-for-small-reads-using-memory/

This is about why swappiness needs to be 1:
https://community.hortonworks.com/articles/33522/swappiness-setting-recommendation.html
https://www.percona.com/blog/2014/04/28/oom-relation-vm-swappiness0-new-kernel/

Looking forward to hearing about your results. BTW are you using ZFS as well? What is your disk subsystem? What is your host doing when the stalls happen, vzdump backups maybe?
 
@gkovacs in the past I've increased min_free_kbytes on some servers that were having io stall issues during vzdump operations and it helped.

Any thoughts as to why this is necessary?

@e100 were you having problems / applying this solution on the host, in the guests or both?

I have come to apply the same solution to a similar set of problems as well. Unfortunately I have no idea why it helps, but IO buffers and KVM memory allocation, together with the heavy memory use of ZFS in our case, somehow don't play nice with the small default values.

Some reading:
http://stackoverflow.com/questions/21374491/vm-min-free-kbytes-why-keep-minimum-reserved-memory
https://blogs.linbit.com/p/28/kernel-min_free_kbytes/
http://askubuntu.com/questions/4177...n-almost-full-ram-possibly-disk-cache-problem
 
@gkovacs Thanks for your reply.

I will test these 2 tweaks too, but are they for the host or for the VM (ratio and swappiness)?

My host is ZFS RAIDZ2, 4x 1TB WD Se HDDs (WD1002F9YZ), 24 GB ECC RAM, 2 Intel Xeons with 2 cores each (4 threads each, 8 threads total).

CPU on both is about 10-30%.
RAM on the host shows almost 95% full, but on the VM about 60-70% (with this vm.min_free_kbytes tweak too); so far, no rcu stalls :)

When this error happens, the host works fine, and the other VMs too, no problems, but the web server VM is blocked: nothing works, nothing is online.

When I make a backup of the web server VM, at about 80% the host shows an IO error, but I can't remember it. I will test now and try to make a backup of the VM, and I will post here what happens during the vzdump backup.

In the VM, memory shows about 20% during this error and CPU about 10%, but the entire VM is down; I can't even SSH to it. On the host, memory is at about 99.x%, but it works: I can log in to the Proxmox control panel and reset the VM.
 
I will test these 2 tweaks too, but are they for the host or for the VM (ratio and swappiness)?

These tweaks are for the host mainly, but I'm using vm.swappiness=1 in all my VMs as well to avoid unnecessary swapping.

My host is ZFS RAIDZ2, 4x 1TB WD Se HDDs (WD1002F9YZ), 24 GB ECC RAM, 2 Intel Xeons with 2 cores each (4 threads each, 8 threads total).

Just as I expected: this CPU stall issue is happening when the host is using ZFS, and IO load is high.
 
@gkovacs my ratio on the host was 5, but swappiness was 60.

I have changed swappiness to 1 on the host now, and will watch how it works.

On the VM, ratio was 10 and swappiness 60 as well.

I have now changed them to 5 and 1 on both, will see how it goes, and rebooted the entire host server.

But about ZFS: I have read many articles describing it as good, way better than ext4 and the like. So what is wrong, and why this issue with RAM?

Are there some tricks to stop the host from using all the RAM, or to tell it how much RAM to use? I have also read that ZFS is a RAM eater, but how do I stop it from being one?
 
I applied this change to the host. I use drbd on many hosts, I think that same drbd link you provided is where I got the idea to change min_free_kbytes.

On any host with 32GB RAM or more, and in all my VMs, I disabled disk swap and added zram swap.
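For reference, a zram swap device can be set up roughly like this with util-linux's zramctl. This is a hedged sketch, not the poster's exact setup: the 4G size is an assumption to size for your workload, and the old swap device path in the last comment is purely illustrative.

```shell
#!/bin/sh
# Sketch: add a compressed in-RAM swap device (run as root).
# The 4G size is an assumption; adjust it for your workload.
modprobe zram                      # load the zram module
dev=$(zramctl --find --size 4G)    # claim a free /dev/zramN and size it
mkswap "$dev"                      # format it as swap
swapon --priority 100 "$dev"       # prefer it over any disk swap
# then disable the old disk swap, e.g.: swapoff /dev/zvol/rpool/swap
```

These are privileged one-off commands; to survive reboots, most distributions offer a zram service package instead of a manual script.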

Got tired of high-RAM-usage VMs that sat idle all weekend getting swapped to disk (in the host) and performing like crap on Monday. I even experienced this swap issue with swappiness=1 on the host.

I figured if my VMs are swapping then I have something configured wrong, and I got tired of tracking down random performance issues when some process in a VM got swapped. I monitor logs for OOM events and adjust configs so they do not repeat. Better to have one client's process fail than hundreds of clients experiencing painful slowness.
 
But about ZFS: I have read many articles describing it as good, way better than ext4 and the like. So what is wrong, and why this issue with RAM?

Are there some tricks to stop the host from using all the RAM, or to tell it how much RAM to use? I have also read that ZFS is a RAM eater, but how do I stop it from being one?

1. ZFS ARC size
We aggressively limit the ZFS ARC size, as an unlimited ARC has led to several spontaneous reboots in the past. Basically, we add up all the memory the system uses outside caches and buffers (like the maximum RAM of all KVM guests combined), subtract that from total host RAM, and set the ARC maximum to a bit less than the remainder, so the ARC only has to compete with the system cache. For example: on a 32GB server the maximum RAM allocation of KVM guests is 25 GB, so we cap the ARC at 5GB (leaving 2GB for everything else). We also set a lower limit of 1GB on the ARC, as it has been reported to help performance for some reason.
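As a sanity check, the arithmetic from the 32 GB example can be written out in shell. The numbers are the ones from this example host, not universal recommendations; substitute your own:

```shell
#!/bin/sh
# ARC sizing rule from the example above: total host RAM minus the
# combined KVM guest maximums, minus ~2 GB of headroom. All numbers
# here are from the 32 GB example host in this post.
total_gb=32       # total host RAM
guests_gb=25      # sum of all KVM guests' maximum RAM
headroom_gb=2     # left for everything else
arc_max_gb=$(( total_gb - guests_gb - headroom_gb ))   # 5 GB here
arc_max_bytes=$(( arc_max_gb * 1024 * 1024 * 1024 ))
arc_min_bytes=$(( 1024 * 1024 * 1024 ))                # 1 GB lower limit
echo "options zfs zfs_arc_max=$arc_max_bytes"
echo "options zfs zfs_arc_min=$arc_min_bytes"
```

The two echoed lines match the /etc/modprobe.d/zfs.conf options quoted in this post for a 5GB/1GB ARC.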

To do that, add the following lines to /etc/modprobe.d/zfs.conf:
Code:
options zfs zfs_arc_max=5368709120
options zfs zfs_arc_min=1073741824
and after that run this and reboot:
Code:
# update-initramfs -u

Looking at the ARC of this very server with arc_summary.py you can see it stays between the limits:
Code:
ARC Size:                               30.72%  1.54    GiB
        Target Size: (Adaptive)         30.72%  1.54    GiB
        Min Size (Hard Limit):          20.00%  1.00    GiB
        Max Size (High Water):          5:1     5.00    GiB

ARC Size Breakdown:
        Recently Used Cache Size:       35.27%  554.85  MiB
        Frequently Used Cache Size:     64.73%  1018.10 MiB


2. SWAP on ZFS zvol

You also have to make sure that swap behaves well if it resides on a ZFS zvol (the default installation places it there). Most important is preventing the ARC from caching the swap volume, but the other tweaks matter as well (and are endorsed by the ZFS on Linux community):
https://github.com/zfsonlinux/zfs/wiki/FAQ

Execute these commands in your shell (the # prompt is left out so you can copy all lines at once):
Code:
zfs set primarycache=metadata rpool/swap
zfs set secondarycache=metadata rpool/swap
zfs set compression=zle rpool/swap
zfs set checksum=off rpool/swap
zfs set sync=always rpool/swap
zfs set logbias=throughput rpool/swap

You can verify these settings by running:
Code:
# zfs get all rpool/swap
 
@gkovacs Thank you so much for the info.

With the first options I was able to back up my VM web server again, though RAM on the host goes to 99% during the backup. Still, the VM backed up successfully to my external FreeNAS NFS share.

With no backup running, RAM on the host is now stable at 60-70%, and the VM about 40% full, which is OK.

It worked OK for a day, so I scheduled the backups to run automatically again, and now the entire host restarts after an hour or so of backing up. Exactly the problem you described.

My current arc_summary.py output is:
Code:
ARC Size:                               61.65%  7.26    GiB
        Target Size: (Adaptive)         100.00% 11.77   GiB
        Min Size (Hard Limit):          0.27%   32.00   MiB
        Max Size (High Water):          376:1   11.77   GiB

ARC Size Breakdown:
        Recently Used Cache Size:       50.00%  5.89    GiB
        Frequently Used Cache Size:     50.00%  5.89    GiB
I will add these restrictions too, but first: I don't have a zfs.conf file in /etc/modprobe.d on my host. Should I create one? Or maybe this file is located somewhere else in my current v4.3-1/e7cdc165?

I want to update to the new v4.4 this week and see how it works. I also want to add a Samsung SSD 850 Pro 256GB for L2ARC and ZIL. I had this configuration on v3.4 and it worked great, no problems at all, and without this tuning. The first partition was 8GB for the ZIL, and the rest (the 2nd partition) for L2ARC.

Then I updated to v4.3 and added the same configuration, and it worked OK until I restarted the host (or it restarted itself); after that came an error that the HDD with ID ... was not found. I tried a couple of times, completely formatted the SSD, re-partitioned it a couple of times, but no luck: it works OK, the L2ARC fills up and works, until I restart or try to restart the host, and then the sad error again that the HDD ID is not found (or some such). So I decided to go without the SSD, and it worked until this backup problem filling up the RAM appeared.

Now I believe my host is stable, and I will try these ZFS options too, but on the new version. And I think I will buy a new SSD for ZIL and L2ARC.

If I have an SSD for ZIL and L2ARC, do I still need to add these ZFS settings, and the others:
vm.min_free_kbytes = 262144
vm.dirty_background_ratio = 5
vm.swappiness = 1
?

What is your recommendation? Should I go with a 256GB SSD again, and still add these restrictions for ZFS after that, or are the restrictions unnecessary if there is an SSD for ZIL and L2ARC?
 
I will add these restrictions too, but first: I don't have a zfs.conf file in /etc/modprobe.d on my host. Should I create one?
Yes, create the file with the two lines. Include the values in bytes, so 5GB equals 5x1024x1024x1024 bytes.

What is your recommendation? Should I go with a 256GB SSD again, and still add these restrictions for ZFS after that, or are the restrictions unnecessary if there is an SSD for ZIL and L2ARC?
Limit the ARC and set the values for rpool/swap as soon as you can, and your server will stop resetting. You can also upgrade Proxmox to 4.4 right now; it will work fine. You should keep all the tweaks for stability.

You can add the SSD as ZIL and L2ARC any time later, it's not related to these issues.
 
