ZFS uses RAM equal to the size of files being written to it

csystem

New Member
Aug 20, 2016
Hello,
I've got a server with 3x WD Red 3TB drives set up in RAIDZ1 and 32GB of ECC RAM. Another ZFS mirror is used for root.
The 3-disk storage array is shared via ZFS's built-in NFS with a couple of users, and it had been working perfectly for the last couple of months, until one of the users decided to copy over their entire backup hard drive. It isn't much data, around 300GB, but it's spread over at least 4 million files. For the first few hours it copied as expected. Then it started filling all of the RAM until the machine crashed.
Now, when I copy a file to either one of the ZFS volumes, RAM usage increases by roughly the size of the file being copied, e.g.:
Code:
root@rho:~# free -h
  total  used  free  shared  buffers  cached
Mem:  31G  4.4G  26G  49M  1.8M  91M
-/+ buffers/cache:  4.4G  26G
Swap:  8.0G  0B  8.0G
root@rho:~# free
  total  used  free  shared  buffers  cached
Mem:  32792720  4665476  28127244  50236  1828  93856
-/+ buffers/cache:  4569792  28222928
Swap:  8388604  0  8388604
root@rho:~# dd if=/dev/urandom of=testfile bs=1M count=500
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 42.6863 s, 12.3 MB/s
root@rho:~# free -h
  total  used  free  shared  buffers  cached
Mem:  31G  4.7G  26G  49M  1.8M  91M
-/+ buffers/cache:  4.7G  26G
Swap:  8.0G  0B  8.0G
root@rho:~# free
  total  used  free  shared  buffers  cached
Mem:  32792720  4975820  27816900  50236  1828  93856
-/+ buffers/cache:  4880136  27912584
Swap:  8388604  0  8388604
The RAM swell isn't exactly the same size as the file being copied, sometimes it's more, sometimes less, but it continues until all of the RAM is used and the system fails. It's not regular cache: it shows up green in htop, not orange, and the cached column in the output of free stays low.
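For reference, the ARC's own size (which is separate from the page cache that free reports) can be watched via /proc/spl/kstat/zfs/arcstats on ZFS on Linux. A rough sketch, assuming that path is present on your install:

Code:
# print the current ARC size and its configured min/max in MiB
awk '/^(size|c_max|c_min) / {print $1, $3/1024/1024 " MiB"}' /proc/spl/kstat/zfs/arcstats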

I've tried a scrub of the volume, but that had no effect and found zero errors.
Does anyone have any idea what is going on here? This problem makes the server completely unusable. There is a backup, so destroying the storage volume is possible if necessary.
If you need the output of a command or content of a log, please ask.

This is the zpool configuration:
Code:
root@rho:~# zpool get all
NAME  PROPERTY  VALUE  SOURCE
hdd  size  8.12T  -
hdd  capacity  37%  -
hdd  altroot  -  default
hdd  health  ONLINE  -
hdd  guid  8250828831358797934  default
hdd  version  -  default
hdd  bootfs  -  default
hdd  delegation  on  default
hdd  autoreplace  off  default
hdd  cachefile  -  default
hdd  failmode  wait  default
hdd  listsnapshots  off  default
hdd  autoexpand  off  default
hdd  dedupditto  0  default
hdd  dedupratio  1.00x  -
hdd  free  5.11T  -
hdd  allocated  3.02T  -
hdd  readonly  off  -
hdd  ashift  0  default
hdd  comment  -  default
hdd  expandsize  -  -
hdd  freeing  0  default
hdd  fragmentation  22%  -
hdd  leaked  0  default
hdd  feature@async_destroy  enabled  local
hdd  feature@empty_bpobj  active  local
hdd  feature@lz4_compress  active  local
hdd  feature@spacemap_histogram  active  local
hdd  feature@enabled_txg  active  local
hdd  feature@hole_birth  active  local
hdd  feature@extensible_dataset  enabled  local
hdd  feature@embedded_data  active  local
hdd  feature@bookmarks  enabled  local
hdd  feature@filesystem_limits  enabled  local
hdd  feature@large_blocks  enabled  local
rpool  size  222G  -
rpool  capacity  2%  -
rpool  altroot  -  default
rpool  health  ONLINE  -
rpool  guid  4109231484567507720  default
rpool  version  -  default
rpool  bootfs  rpool/ROOT/pve-1  local
rpool  delegation  on  default
rpool  autoreplace  off  default
rpool  cachefile  -  default
rpool  failmode  wait  default
rpool  listsnapshots  off  default
rpool  autoexpand  off  default
rpool  dedupditto  0  default
rpool  dedupratio  1.00x  -
rpool  free  217G  -
rpool  allocated  4.76G  -
rpool  readonly  off  -
rpool  ashift  12  local
rpool  comment  -  default
rpool  expandsize  -  -
rpool  freeing  0  default
rpool  fragmentation  0%  -
rpool  leaked  0  default
rpool  feature@async_destroy  enabled  local
rpool  feature@empty_bpobj  active  local
rpool  feature@lz4_compress  active  local
rpool  feature@spacemap_histogram  active  local
rpool  feature@enabled_txg  active  local
rpool  feature@hole_birth  active  local
rpool  feature@extensible_dataset  enabled  local
rpool  feature@embedded_data  active  local
rpool  feature@bookmarks  enabled  local
rpool  feature@filesystem_limits  enabled  local
rpool  feature@large_blocks  enabled  local
 
Does it help if you set the following in /etc/modprobe.d/zfs.conf (a reboot or reload of the zfs module is required):

Code:
# Don't let ZFS use less than 4GB or more than 16GB of RAM
options zfs zfs_arc_min=4294967296
options zfs zfs_arc_max=17179869184
 
I already had those set, with max at around 10G, but /sys/module/zfs/parameters/zfs_arc_max still shows 0. If I echo the value into it, the problem is gone. It just seems to ignore /etc/modprobe.d/zfs.conf, even after a reboot. Reloading the zfs module is impossible since / is on ZFS.
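For reference, setting it at runtime looks roughly like this (10 GiB, written as bytes, is just the example value I used):

Code:
# set the ARC cap at runtime; takes effect without a reboot
echo 10737418240 > /sys/module/zfs/parameters/zfs_arc_max
# confirm the module accepted it
cat /sys/module/zfs/parameters/zfs_arc_max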
 
ZFS is of course a kernel module, so rebuilding the initramfs did it. Although I still find it strange that without that limit ZFS keeps using memory until there is none left and the system crashes.
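For anyone else hitting this on Proxmox/Debian, rebuilding the initramfs is roughly:

Code:
# bake the /etc/modprobe.d/zfs.conf options into the initramfs so they apply at boot
update-initramfs -u -k all
reboot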
Here is a capture from the RAM graph showing the event:
[attached image: 15dmkug.jpg]
 
OpenZFS has these defaults for zfs_arc_max, which I expect are the same for ZoL:
  • 75% of memory on systems with less than 4 GB of memory
  • physmem minus 1 GB on systems with more than 4 GB of memory
Remember that you also have other processes using RAM on your host (VMs, CTs, etc.), so balance the size of zfs_arc_max against the RAM you expect to give to VMs and CTs. A rule of thumb:
OS: 1 GB

So the formula is: total_ram - 1 GB - expected_GB_for_vm/ct = zfs_arc_max; zfs_arc_max >= 4 GB.
If zfs_arc_max < 4 GB you need to add more RAM.
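As a rough sketch with the numbers from this thread (adjust for your own host):

Code:
# total_ram - 1 GB (OS) - RAM expected for VMs/CTs = zfs_arc_max
TOTAL_GIB=32
VM_CT_GIB=16
ARC_GIB=$((TOTAL_GIB - 1 - VM_CT_GIB))   # 15 GB in this example
echo "options zfs zfs_arc_max=$((ARC_GIB * 1024 * 1024 * 1024))"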
 
32 - 1 - 16 = 15 GB free for ZFS. I've set zfs_arc_max to 10GB for now to allow some breathing space, and I don't really care about a slight performance drop as long as it is stable.
That it failed makes perfect sense now, as ZFS without a limit would leave just 1 GB for the rest of the system.
Thank you mir and dietmar for your help.
 
  • 75% of memory on systems with less than 4 GB of memory
  • physmem minus 1 GB on systems with greater than 4 GB of memory

@mir Do you think such defaults are reasonable for a system like Proxmox VE? Maybe we should change that to reserve a bit more RAM for VMs?
 
No, I think those defaults are not optimal for Proxmox, since they were defined with a storage server in mind. In my opinion, if RAM > 4 GB the ARC size should never be higher than half the amount of RAM, e.g.

(total_ram - 1 GB) / 2 = arc_max; with 4 GB <= arc_max <= 16 GB
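Sketched out with the same example numbers, clamped to that 4-16 GB window:

Code:
# (total_ram - 1 GB) / 2, clamped to 4..16 GB
TOTAL_GIB=32
ARC_GIB=$(( (TOTAL_GIB - 1) / 2 ))       # 15 GB on a 32 GB host
[ "$ARC_GIB" -lt 4 ]  && ARC_GIB=4
[ "$ARC_GIB" -gt 16 ] && ARC_GIB=16
echo "options zfs zfs_arc_max=$((ARC_GIB * 1024 * 1024 * 1024))"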
 
OpenZFS has these defaults for zfs_arc_max, which I expect are the same for ZoL:
  • 75% of memory on systems with less than 4 GB of memory
  • physmem minus 1 GB on systems with more than 4 GB of memory

If I read the code in module/zfs/arc.c correctly, the defaults for ZoL are actually:

the sanity checks applied when updating the limits from the module parameters:

Code:
        /* Valid range: 64M - <all physical memory> */
        if ((zfs_arc_max) && (zfs_arc_max != arc_c_max) &&
            (zfs_arc_max > 64 << 20) && (zfs_arc_max < ptob(physmem)) &&
            (zfs_arc_max > arc_c_min)) {
                arc_c_max = zfs_arc_max;
                arc_c = arc_c_max;
                arc_p = (arc_c >> 1);
                arc_meta_limit = MIN(arc_meta_limit, arc_c_max);
        }
....
        /* Valid range: 0 - <all physical memory> */
        if ((zfs_arc_sys_free) && (zfs_arc_sys_free != arc_sys_free))
                arc_sys_free = MIN(MAX(zfs_arc_sys_free, 0), ptob(physmem));

initial defaults:
Code:
        /*
         * allmem is "all memory that we could possibly use".
         */
#ifdef _KERNEL
        uint64_t allmem = ptob(physmem);
#else
        uint64_t allmem = (physmem * PAGESIZE) / 2;
#endif

...

        /* Start out with 1/8 of all memory */
        arc_c = allmem / 8;

#ifdef _KERNEL
        /*
         * On architectures where the physical memory can be larger
         * than the addressable space (intel in 32-bit mode), we may
         * need to limit the cache to 1/8 of VM size.
         */
        arc_c = MIN(arc_c, vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) / 8);

        /*
         * Register a shrinker to support synchronous (direct) memory
         * reclaim from the arc.  This is done to prevent kswapd from
         * swapping out pages when it is preferable to shrink the arc.
         */
        spl_register_shrinker(&arc_shrinker);

        /* Set to 1/64 of all memory or a minimum of 512K */
        arc_sys_free = MAX(ptob(physmem / 64), (512 * 1024));
        arc_need_free = 0;
#endif

        /* Set min cache to allow safe operation of arc_adapt() */
        arc_c_min = 2ULL << SPA_MAXBLOCKSHIFT;
        /* Set max to 1/2 of all memory */
        arc_c_max = allmem / 2;

        arc_c = arc_c_max;

which would mean a default max ARC size of 1/2 of all memory, and either 512 KB or 1/64 of physical memory (whichever is larger) reserved for the system. Both limits can be overridden with arbitrary values up to all of the physical memory.
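Evaluated for a 32 GB host like the one in this thread, those defaults work out to roughly the following (a sketch, assuming exactly 32 GiB of addressable physical memory):

Code:
# defaults from the code above, for 32 GiB of physical memory
PHYS_BYTES=$((32 * 1024 * 1024 * 1024))
echo "arc_c_max    = allmem / 2           = $((PHYS_BYTES / 2)) bytes (~16 GiB)"
echo "arc_sys_free = max(phys/64, 512 KB) = $((PHYS_BYTES / 64)) bytes (~512 MiB)"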

I will try to reproduce the issue at hand - are there any syslogs you could provide? Does the OOM killer trigger?
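Something along these lines should show whether the OOM killer fired (exact log locations may differ per setup):

Code:
# check the kernel log and syslog for OOM killer activity
dmesg | grep -iE 'out of memory|oom-killer'
grep -iE 'out of memory|oom-killer' /var/log/syslog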
 
Here is the syslog containing the event: http://pastebin.com/rHBqv7xJ The file copy was initiated around 04:00 server time, with memory usage increasing until all 32G was used; the sudden out-of-memory reset is at 07:00 server time. Note that there are a few reboots after that as I was troubleshooting the issue.
 
Strange indeed. Does this system have swap?

As expected, if zfs_arc_max is not set (i.e., 0) the ARC will use up to half of the physical memory, not more. So something must have used the other half and not given any of it up, but the OOM killer did not trigger. Did any of the VMs crash or get killed before the linked log file starts?
 
The system has 8GB of swap, also on ZFS. The server was still under test and had been stable for over a month when it went into use as a file server only, until there is more time to set up the rest. That's when the data transfers started and ZFS ate the RAM. At the time of the crash only one KVM VM with 2GB of memory allocated and one LXC CT with 1GB of RAM were running. Before the file transfer, the system used about 4GB of RAM. In the event log I can only see the VM and CT start when the server finished booting at 07:01, and the VMs themselves appear to stop abruptly and then boot as normal. It's like somebody pulled the plug on the server and plugged it back in.
 
Now I see what went wrong. Half your RAM (about 16 GB) was taken by the ARC, some by the OS and the VM/CT, and the rest was used as file cache while copying the data.
 
Does it help if you set the following in /etc/modprobe.d/zfs.conf (reboot or reload zfs module required):

Code:
# Don't let ZFS use less than 4GB or more than 16GB of RAM
options zfs zfs_arc_min=4294967296
options zfs zfs_arc_max=17179869184

Shouldn't this parameter be set at Proxmox install time when the root filesystem is on ZFS, meaning the install wizard should ask the user for this value?

Just a thought, as this can be overlooked by many admins and lead to stability issues.
 
