options zfs zfs_arc_max still needed for stability nowadays?

mailinglists

Hi guys,

Ever since I encountered a crash because ZFS did not release ARC memory fast enough, I have limited its use to a value I know will always be available and will not be requested by other processes.

I am wondering: is it safe nowadays not to define a maximum ARC size? Will the ARC shrink fast enough not to bother other applications?
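For reference, a quick way to see what the module is currently set to; if zfs_arc_max reports 0, no explicit cap is set and OpenZFS falls back to its built-in default (half of RAM on the versions I have used). The exact arcstat columns vary a bit between versions:

Code:
# 0 means no explicit cap is set and the built-in default applies
cat /sys/module/zfs/parameters/zfs_arc_max
# One-line snapshot of the current ARC size and target
arcstat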
 
I will try not setting it on this new cluster. We'll see if it causes problems. :)
Opinions still wanted.
 
Do you mean you ran out of memory and the OOM killer started killing tasks?

I have 128 GB of memory and am also wondering if I really need to commit 64 GB just to the ZFS ARC.
 
If I remember correctly, it might have even crashed the server.

If you want to commit less, just set the ARC maximum lower.

Example
Code:
# Cap the ARC at 2 GiB, keep the floor just below it, then rebuild the initramfs so the options apply at boot
echo 'options zfs zfs_arc_max="2147483648"' >> /etc/modprobe.d/zfs.conf
echo 'options zfs zfs_arc_min="2047483648"' >> /etc/modprobe.d/zfs.conf
update-initramfs -u
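If I remember correctly, the cap can also be applied at runtime without a reboot, though the ARC only shrinks toward it gradually; a sketch with the same 2 GiB value:

Code:
# Apply the 2 GiB cap immediately; the ARC shrinks toward it over time
echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max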

I have a 2 GB ARC max on some systems that have just 24 GB of RAM.
However, on a few more systems that also have only 24 GB, I will leave the max undefined, let the ARC grow up to 12 GB (the default of half the RAM) and hopefully shrink back down when necessary.

If I encounter a crash or the OOM killer, I will report back to this thread. :)
 
Hi @mailinglists, I have some systems with just 16 GB of physical RAM running only 2 VMs and nothing else, and the OOM killer sometimes kills one of them.
I have not set min or max ARC values.
So, have you experienced any crashes since your change that lets the ARC grow?
Every report on "small" configs is appreciated :)
 
I have had no OOM kills so far, even when I let the ARC grow and it has to shrink due to memory pressure from the VMs. I am still cautious about doing it on all the servers.
 
I have had no OOM kills so far, even when I let the ARC grow and it has to shrink due to memory pressure from the VMs. I am still cautious about doing it on all the servers.
That is also my experience. I think you will run into a problem if you hit the ARC minimum, space cannot be reclaimed, and you have no swap left anymore.

Crashes with ZFS in the old days were common if you used swap on ZFS. That setup is not used anymore; the recommendation is to use a "real" partition or disk for swap - or zram.
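A quick way to check where swap actually lives on a host (generic commands, not output from any machine in this thread); swap entries on /dev/zd* devices are ZFS zvols and should be moved:

Code:
# List active swap devices with their type and size
swapon --show
# Any swap device named /dev/zd* is backed by a ZFS zvol
ls -l /dev/zd* 2>/dev/null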
 
That is also my experience. I think you will run into a problem if you hit the ARC minimum, space cannot be reclaimed, and you have no swap left anymore.

Crashes with ZFS in the old days were common if you used swap on ZFS. That setup is not used anymore; the recommendation is to use a "real" partition or disk for swap - or zram.
Hi @LnxBil, thanks for your reply and feedback :)
Do you mean swap on the host (Proxmox) or swap in the VMs?
In my setup there is no swap on Proxmox (v6 or v7), only inside the VMs.
 
The ZFS dirty cache can trigger the OOM killer; it is much less polite than the ARC. I found that adding a swap device killed off the problem, but other mitigation options are decreasing the dirty cache limit, disabling transparent huge pages, and throttling write speeds in the guests, which limits how much the dirty cache fills up.
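A rough sketch of the first two mitigations; the 512 MiB figure is only an example value I picked, not a recommendation from this thread, and both changes are lost on reboot unless made persistent:

Code:
# Lower the ZFS dirty data ceiling to ~512 MiB (the default is a fraction of RAM)
echo 536870912 > /sys/module/zfs/parameters/zfs_dirty_data_max
# Disable transparent huge pages until the next reboot
echo never > /sys/kernel/mm/transparent_hugepage/enabled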
 
Hi,
I just hit the OOM killer today, due to the ARC not retreating fast enough, so I guess it is still a problem nowadays.
I have swap on the same host and it was almost empty when the OOM killer hit. (I use mdadm software RAID for it, because ZFS has problems holding swap.)
It's Sunday today, so I will continue working on this on Monday; any suggestions are welcome.

Here are some logs:

Code:
[Sat Apr  9 05:11:50 2022] zfs invoked oom-killer: gfp_mask=0x42dc0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO), order=2, oom_score_adj=0
[Sat Apr  9 05:11:50 2022] CPU: 0 PID: 15835 Comm: zfs Tainted: P           O      5.4.166-1-pve #1
[Sat Apr  9 05:11:50 2022] Hardware name: Gigabyte Technology Co., Ltd. P35C-DS3R/P35C-DS3R, BIOS F4 09/07/2007
[Sat Apr  9 05:11:50 2022] Call Trace:
[Sat Apr  9 05:11:50 2022]  dump_stack+0x6d/0x8b
[Sat Apr  9 05:11:50 2022]  dump_header+0x4f/0x1e1
[Sat Apr  9 05:11:50 2022]  oom_kill_process.cold.33+0xb/0x10
[Sat Apr  9 05:11:50 2022]  out_of_memory+0x1bf/0x4e0
[Sat Apr  9 05:11:50 2022]  __alloc_pages_slowpath+0xd40/0xe30
[Sat Apr  9 05:11:50 2022]  __alloc_pages_nodemask+0x2df/0x330
[Sat Apr  9 05:11:50 2022]  kmalloc_large_node+0x42/0x90
[Sat Apr  9 05:11:50 2022]  __kmalloc_node+0x267/0x330
[Sat Apr  9 05:11:50 2022]  ? lru_cache_add_active_or_unevictable+0x39/0xb0
[Sat Apr  9 05:11:50 2022]  spl_kmem_zalloc+0xd1/0x120 [spl]
[Sat Apr  9 05:11:50 2022]  zfsdev_ioctl+0x2b/0xe0 [zfs]
[Sat Apr  9 05:11:50 2022]  do_vfs_ioctl+0xa9/0x640
[Sat Apr  9 05:11:50 2022]  ? handle_mm_fault+0xc9/0x1f0
[Sat Apr  9 05:11:50 2022]  ksys_ioctl+0x67/0x90
[Sat Apr  9 05:11:50 2022]  __x64_sys_ioctl+0x1a/0x20
[Sat Apr  9 05:11:50 2022]  do_syscall_64+0x57/0x190
[Sat Apr  9 05:11:50 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Sat Apr  9 05:11:50 2022] RIP: 0033:0x7fd487202427
[Sat Apr  9 05:11:50 2022] Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48
[Sat Apr  9 05:11:50 2022] RSP: 002b:00007ffc47527a58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Sat Apr  9 05:11:50 2022] RAX: ffffffffffffffda RBX: 00007ffc47527a80 RCX: 00007fd487202427
[Sat Apr  9 05:11:50 2022] RDX: 00007ffc47527a80 RSI: 0000000000005a12 RDI: 0000000000000003
[Sat Apr  9 05:11:50 2022] RBP: 00007ffc47527a70 R08: 00007fd4869a9010 R09: 00007fd48724d7c0
[Sat Apr  9 05:11:50 2022] R10: 0000000000000022 R11: 0000000000000246 R12: 00005592e5ac92e0
[Sat Apr  9 05:11:50 2022] R13: 00005592e5ac92e0 R14: 00005592e5acb0c0 R15: 00007ffc4752b190
[Sat Apr  9 05:11:50 2022] Mem-Info:
[Sat Apr  9 05:11:50 2022] active_anon:481511 inactive_anon:228731 isolated_anon:0
                            active_file:7086 inactive_file:2212 isolated_file:0
                            unevictable:40042 dirty:3 writeback:0 unstable:0
                            slab_reclaimable:12674 slab_unreclaimable:328071
                            mapped:15297 shmem:10562 pagetables:4905 bounce:0
                            free:112881 free_pcp:56 free_cma:0
[Sat Apr  9 05:11:50 2022] Node 0 active_anon:1926044kB inactive_anon:914924kB active_file:28344kB inactive_file:8848kB unevictable:160168kB isolated(anon):0kB isolated(file):0kB mapped:61188kB dirty:12kB writeback:0kB shmem:42248kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 98304kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[Sat Apr  9 05:11:50 2022] Node 0 DMA free:15904kB min:132kB low:164kB high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[Sat Apr  9 05:11:50 2022] lowmem_reserve[]: 0 3672 7882 7882 7882
[Sat Apr  9 05:11:50 2022] Node 0 DMA32 free:335132kB min:31420kB low:39272kB high:47124kB active_anon:379736kB inactive_anon:208312kB active_file:832kB inactive_file:912kB unevictable:0kB writepending:0kB present:3914624kB managed:3820032kB mlocked:0kB kernel_stack:128kB pagetables:5648kB bounce:0kB free_pcp:28kB local_pcp:0kB free_cma:0kB
[Sat Apr  9 05:11:50 2022] lowmem_reserve[]: 0 0 4210 4210 4210
[Sat Apr  9 05:11:50 2022] Node 0 Normal free:100488kB min:36028kB low:45032kB high:54036kB active_anon:1546172kB inactive_anon:706612kB active_file:27512kB inactive_file:8196kB unevictable:160168kB writepending:12kB present:4456448kB managed:4311500kB mlocked:160168kB kernel_stack:5016kB pagetables:13972kB bounce:0kB free_pcp:196kB local_pcp:0kB free_cma:0kB
[Sat Apr  9 05:11:50 2022] lowmem_reserve[]: 0 0 0 0 0
[Sat Apr  9 05:11:50 2022] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15904kB
[Sat Apr  9 05:11:50 2022] Node 0 DMA32: 16795*4kB (UE) 33569*8kB (UE) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 335732kB
[Sat Apr  9 05:11:50 2022] Node 0 Normal: 14436*4kB (UE) 5182*8kB (UE) 1*16kB (H) 1*32kB (H) 1*64kB (H) 1*128kB (H) 1*256kB (H) 1*512kB (H) 1*1024kB (H) 0*2048kB 0*4096kB = 101232kB
[Sat Apr  9 05:11:50 2022] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Sat Apr  9 05:11:50 2022] 34494 total pagecache pages
[Sat Apr  9 05:11:50 2022] 11623 pages in swap cache
[Sat Apr  9 05:11:50 2022] Swap cache stats: add 44971456, delete 44958457, find 198686281/220264633
[Sat Apr  9 05:11:50 2022] Free swap  = 8810524kB
[Sat Apr  9 05:11:50 2022] Total swap = 10476540kB
[Sat Apr  9 05:11:50 2022] 2096765 pages RAM
[Sat Apr  9 05:11:50 2022] 0 pages HighMem/MovableOnly
[Sat Apr  9 05:11:50 2022] 59906 pages reserved
[Sat Apr  9 05:11:50 2022] 0 pages cma reserved
[Sat Apr  9 05:11:50 2022] 0 pages hwpoisoned
[Sat Apr  9 05:11:50 2022] Tasks state (memory values in pages):
[Sat Apr  9 05:11:50 2022] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Sat Apr  9 05:11:50 2022] [   1695]     0  1695    28686     2692   258048     9734             0 systemd-journal
[Sat Apr  9 05:11:50 2022] [   1720]     0  1720     5813      483    65536      278         -1000 systemd-udevd
[Sat Apr  9 05:11:50 2022] [   2262]     0  2262      958      333    45056       22             0 mdadm
[Sat Apr  9 05:11:50 2022] [   2590]   100  2590    23270      631    81920      184             0 systemd-timesyn
[Sat Apr  9 05:11:50 2022] [   2591]   106  2591     1705      410    53248       72             0 rpcbind
[Sat Apr  9 05:11:50 2022] [   2616]     0  2616    41547      485    81920      139             0 zed
[Sat Apr  9 05:11:50 2022] [   2617]     0  2617    56455      674    90112      179             0 rsyslogd
[Sat Apr  9 05:11:50 2022] [   2618]     0  2618    68958      334    77824       42             0 pve-lxc-syscall
[Sat Apr  9 05:11:50 2022] [   2619]     0  2619      535      143    40960        0         -1000 watchdog-mux
[Sat Apr  9 05:11:50 2022] [   2627]     0  2627     3182      551    61440      360             0 smartd
[Sat Apr  9 05:11:50 2022] [   2629]     0  2629    21333      269    57344       51             0 lxcfs
[Sat Apr  9 05:11:50 2022] [   2632]   104  2632     2297      624    53248       49          -900 dbus-daemon
[Sat Apr  9 05:11:50 2022] [   2633]     0  2633     4907      718    77824      184             0 systemd-logind
[Sat Apr  9 05:11:50 2022] [   2636]     0  2636     1022      352    45056        2             0 qmeventd
[Sat Apr  9 05:11:50 2022] [   2640]     0  2640     1810      537    49152       30             0 ksmtuned
[Sat Apr  9 05:11:50 2022] [   3018]     0  3018     1823      135    53248       85             0 lxc-monitord
[Sat Apr  9 05:11:50 2022] [   3019]     0  3019    75821     1505   143360      357             0 proxmox-backup-
[Sat Apr  9 05:11:50 2022] [   3043]     0  3043      568      126    40960       17             0 none
[Sat Apr  9 05:11:50 2022] [   3051]     0  3051     1722       48    57344       13             0 iscsid
[Sat Apr  9 05:11:50 2022] [   3052]     0  3052     1848     1306    57344        0           -17 iscsid
[Sat Apr  9 05:11:50 2022] [   3082]     0  3082     3962      738    65536      176         -1000 sshd
[Sat Apr  9 05:11:50 2022] [   3135]     0  3135     1402      335    45056       12             0 agetty
[Sat Apr  9 05:11:50 2022] [   3187]    34  3187   533684    27377  2949120   106328             0 proxmox-backup-
[Sat Apr  9 05:11:50 2022] [   3237]     0  3237   183257     1115   184320      600             0 rrdcached
[Sat Apr  9 05:11:50 2022] [   3279]     0  3279    10867      508    73728      181             0 master
[Sat Apr  9 05:11:50 2022] [   3281]   107  3281    10992      545    81920      185             0 qmgr
[Sat Apr  9 05:11:50 2022] [   3292]     0  3292   230080     3620   483328     2960             0 pmxcfs
[Sat Apr  9 05:11:50 2022] [   3297]     0  3297     2125      551    53248       32             0 cron
[Sat Apr  9 05:11:50 2022] [   3299]     0  3299   143011    44163   421888        0             0 corosync
[Sat Apr  9 05:11:50 2022] [   3497]     0  3497    76714     6409   315392    15437             0 pve-firewall
[Sat Apr  9 05:11:50 2022] [   3499]     0  3499    85639    12332   397312    19061             0 pvestatd
[Sat Apr  9 05:11:50 2022] [   3686]     0  3686    88962      944   417792    29754             0 pvedaemon
[Sat Apr  9 05:11:50 2022] [   3697]     0  3697    84613     1465   376832    22824             0 pve-ha-crm
[Sat Apr  9 05:11:50 2022] [   3699]    33  3699    89319     2816   450560    29189             0 pveproxy
[Sat Apr  9 05:11:50 2022] [   3705]    33  3705    17683     2174   180224    11192             0 spiceproxy
[Sat Apr  9 05:11:50 2022] [   3707]     0  3707    84508     2018   380928    22312             0 pve-ha-lrm
[Sat Apr  9 05:11:50 2022] [   3816]     0  3816  1176393   554941  5963776    47362             0 kvm
[Sat Apr  9 05:11:50 2022] [   4101]     0  4101   914644   244900  3543040    38241             0 kvm
[Sat Apr  9 05:11:50 2022] [  10472]   110 10472     1607      517    49152      103          -500 nrpe
[Sat Apr  9 05:11:50 2022] [  15012]     0 15012    91989     7742   450560    25233             0 pvedaemon worke
[Sat Apr  9 05:11:50 2022] [  31182]     0 31182    91960     7477   450560    25492             0 pvedaemon worke
[Sat Apr  9 05:11:50 2022] [  13549]     0 13549    91982     8304   450560    24684             0 pvedaemon worke
[Sat Apr  9 05:11:50 2022] [  11690]     0 11690    21543      440    69632        0             0 pvefw-logger
[Sat Apr  9 05:11:50 2022] [  11697]    33 11697    17964     2098   180224    11025             0 spiceproxy work
[Sat Apr  9 05:11:50 2022] [  12961]    33 12961    92689     5873   458752    27747             0 pveproxy worker
[Sat Apr  9 05:11:50 2022] [  12962]    33 12962    92669     6186   458752    27463             0 pveproxy worker
[Sat Apr  9 05:11:50 2022] [  12964]    33 12964    92698     5908   458752    27704             0 pveproxy worker
[Sat Apr  9 05:11:50 2022] [  12166]   107 12166    10958      696    81920        0             0 pickup
[Sat Apr  9 05:11:50 2022] [  14343]     0 14343     1314      189    45056        0             0 sleep
[Sat Apr  9 05:11:50 2022] [  15835]     0 15835     2708      574    57344        0             0 zfs
[Sat Apr  9 05:11:50 2022] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/qemu.slice/140.scope,task=kvm,pid=3816,uid=0
[Sat Apr  9 05:11:50 2022] Out of memory: Killed process 3816 (kvm) total-vm:4705572kB, anon-rss:2215472kB, file-rss:4288kB, shmem-rss:4kB, UID:0 pgtables:5824kB oom_score_adj:0
[Sat Apr  9 05:11:51 2022] oom_reaper: reaped process 3816 (kvm), now anon-rss:0kB, file-rss:96kB, shmem-rss:4kB
 
What about the ARC status? I don't see it in the output / dmesg. I'm interested in the lower setting and how low the ARC actually was.

Your slab was very fragmented; maybe that is also part of the problem.

What about using hugepages for all VMs whose memory is a multiple of 2 MB? That helps a lot and saves some memory on huge systems.
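If it happens again, grabbing something like this right after the kill would show how low the ARC actually went (standard OpenZFS tooling, nothing Proxmox-specific):

Code:
# Full ARC report
arc_summary
# Or just the floor, ceiling and current size in bytes
awk '$1 == "c_min" || $1 == "c_max" || $1 == "size" {print $1, $3}' /proc/spl/kstat/zfs/arcstats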
 
Node 0 Normal: 14436*4kB (UE) 5182*8kB (UE) 1*16kB (H) 1*32kB (H) 1*64kB (H) 1*128kB (H) 1*256kB (H) 1*512kB (H) 1*1024kB (H) 0*2048kB 0*4096kB = 101232kB
If you want to allocate a contiguous 2 MB slice, you will fail and run at least into swapping, maybe also into an OOM situation, despite having over 100 MB free.
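You can watch that kind of fragmentation directly; the kernel's buddy allocator statistics list the free blocks per order (a generic check, not output from this host):

Code:
# Columns are free blocks of order 0 (4 KiB) up to order 10 (4 MiB);
# zeros on the right mean no large contiguous chunks are left
cat /proc/buddyinfo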
 
Sorry for the late reply and thank you for the answer.

Linked below is the current arc_summary. It is not from the time of the crash; however, the hypervisor has not been rebooted since the VM was killed.
If you want some other data, let me know. I had to link it because the post character limit was reached:
https://pastebin.com/K0T6tRck

This system has 8 GB of RAM. I have two VMs on it for NFS and rsync backups and nothing else, so I am not sure if using hugepages would help.

I haven't really looked into the hugepages feature yet, but after a really quick look, the hugepage size seems to be 2 MB already:

Code:
root@p39:~# cat /proc/meminfo | grep Huge
AnonHugePages:     94208 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB

But they are not used?

Code:
root@p39:~# cat /proc/sys/vm/nr_hugepages
0

Are you suggesting I should enable hugepages for each VM specifically (or globally), set them to 2 MB (the default?), and that this will help against crashes due to OOM? The logic being that because the pages are big, the page table gets fewer entries and the whole system uses less RAM? Or..?
 
The logic being that because the pages are big, the page table gets fewer entries and the whole system uses less RAM? Or..?
Yes, that is the case. Hugepages are a CPU extension that allows the virtual memory management (VMM) to use page sizes of 4 KB, 2 MB and 1 GB in order to keep the page tables small and fast. Unfortunately, hugepages are statically allocated and only usable by programs that can make use of them (KVM/QEMU can, as can some databases).

Are you suggesting I should enable hugepages for each VM specifically (or globally), set them to 2 MB (the default?), and that this will help against crashes due to OOM?
You will not get an OOM due to the VMs; their memory is hard-pinned and cannot be swapped out. With only 8 GB, you will not see a huge improvement and I would not enable them. I normally use hugepages on systems starting with at least 32 GB of RAM.

So hugepages are one solution, but not one that is advisable here. Your system is currently using 46 anonymous hugepages (94208 kB / 2048 kB), so a little optimization has already been done internally.
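For completeness, on a bigger host where it is worth it, the setup usually looks something like this (only a sketch; the VM ID and the page count are example values, and the VM's memory must be a multiple of the page size):

Code:
# Reserve 2048 x 2 MiB hugepages (4 GiB) on the host
sysctl vm.nr_hugepages=2048
# Back the VM's memory with 2 MB hugepages
qm set 140 --hugepages 2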

This system has 8 GB of RAM. I have two VMs on it for NFS and rsync backups and nothing else
Are the guests the same OS with the same patch level? Maybe you can get a little improvement from KSM. What is the current level (as shown in the host summary in the PVE management console)?
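If the command line is handier than the summary page, the kernel exposes the same KSM numbers under sysfs (standard paths, nothing PVE-specific):

Code:
# pages_shared = unique pages KSM deduplicated, pages_sharing = mappings using them
grep . /sys/kernel/mm/ksm/pages_shared /sys/kernel/mm/ksm/pages_sharing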
 
Hi @LnxBil ,

thank you for your reply and sorry for my late response (life happenz).

To answer further, we already use KSM.
[attached screenshot: KSM sharing stats from the host summary]

I guess the solution here would be to restrict the max ZFS ARC to less than total RAM - ( SUM(VM RAM) + 2 GB for the PVE host ).
 
I guess the solution here would be to restrict the max ZFS ARC to less than total RAM - ( SUM(VM RAM) + 2 GB for the PVE host ).
Also keep in mind that VMs will use more RAM than you assign to them because of virtualization overhead. If you look at the running processes with htop, you will see that the KVM process of a 4 GB VM, for example, could be using something like 4.7 GB of RAM.
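As a rough worked example of that budget (all numbers here are made up for illustration, including the ~20% overhead): on an 8 GB host with two 2 GB VMs, about 2 x 2.4 GB goes to the guests and ~2 GB to the host, leaving roughly 1 GB for the ARC, so something like:

Code:
# Cap the ARC at 1 GiB based on the budget above
echo 'options zfs zfs_arc_max="1073741824"' >> /etc/modprobe.d/zfs.conf
update-initramfs -u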
 
