PVE 5.0-25 and ZFS 0.7.2 Do Not Respect zfs_arc_max

chrone

Hi Proxmox Team,

The latest pve-no-subscription packages for Proxmox VE 5.0-25 with ZFS 0.7.2 do not respect the zfs_arc_max option. Is this a bug or intentional? Also, how can I check whether the new ZFS ARC is compressed or not?
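For the compressed-ARC part, I assume the new compressed_size / uncompressed_size counters in arcstats are the place to look, e.g.:

Code:
# sketch only: assumes the compressed-ARC kstats added in 0.7 are exposed here
grep -E 'compressed_size|uncompressed_size|overhead_size' /proc/spl/kstat/zfs/arcstats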

I tried to limit zfs_arc_max to 128MB, but according to arcstats and arc_summary ZFS still reports a 16GB ARC limit (c_max). We allocate only 2-4GB of RAM for the Proxmox host itself and leave the remaining 28-30GB for VMs, hence zfs_arc_max is set this small, to avoid random Proxmox reboots caused by the host running out of memory.

The initramfs for kernel 4.13.4-1-pve has been regenerated so it picks up the zfs_arc_max option from /etc/modprobe.d/zfs.conf, and the host has been rebooted as well.
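For reference, the steps were roughly the following (the exact way the file was written may differ, but the option and value match the output below):

Code:
# set the ARC limit via a module option (128M = 134217728 bytes)
echo "options zfs zfs_arc_max=134217728" > /etc/modprobe.d/zfs.conf
# rebuild the initramfs so the option is applied at early boot, then reboot
update-initramfs -u -k all
reboot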

Code:
root@node3:~# uname -a
Linux node3 4.13.4-1-pve #1 SMP PVE 4.13.4-25 (Fri, 13 Oct 2017 08:59:53 +0200) x86_64 GNU/Linux

root@node3:~# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=134217728

root@node3:~# cat /sys/module/zfs/parameters/zfs_arc_max
134217728

root@node3:~# cat /proc/spl/kstat/zfs/arcstats | grep -C1 c_max
c_min                           4    1053305600
c_max                           4    16852889600
size                            4    3133130464

root@node3:~# pveversion -v
proxmox-ve: 5.0-25 (running kernel: 4.13.4-1-pve)
pve-manager: 5.0-34 (running version: 5.0-34/b325d69e)
pve-kernel-4.4.83-1-pve: 4.4.83-96
pve-kernel-4.13.4-1-pve: 4.13.4-25
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.44-1-pve: 4.4.44-84
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-10
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.2-pve1~bpo90


root@node3:~# zpool status -v
  pool: dpool
 state: ONLINE
  scan: scrub repaired 0B in 4h36m with 0 errors on Sun Oct  8 05:00:07 2017
config:
        NAME        STATE     READ WRITE CKSUM
        dpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc1    ONLINE       0     0     0
            sdd1    ONLINE       0     0     0
errors: No known data errors
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0h32m with 0 errors on Sun Oct  8 00:56:34 2017
config:
        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0
errors: No known data errors

root@node3:~# zdb -C rpool
MOS Configuration:
        version: 5000
        name: 'rpool'
        state: 0
        txg: 43006237
        pool_guid: 4036384320052402193
        errata: 0
        hostid: 2831164162
        hostname: '(none)'
        com.delphix:has_per_vdev_zaps
        vdev_children: 1
        vdev_tree:
            type: 'root'
            id: 0
            guid: 4036384320052402193
            children[0]:
                type: 'mirror'
                id: 0
                guid: 1225544730253818612
                metaslab_array: 35
                metaslab_shift: 31
                ashift: 12
                asize: 256046268416
                is_log: 0
                create_txg: 4
                com.delphix:vdev_zap_top: 169
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 9301319602177735395
                    path: '/dev/sda2'
                    whole_disk: 0
                    DTL: 219
                    create_txg: 4
                    com.delphix:vdev_zap_leaf: 170
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 16903084243559480732
                    path: '/dev/sdb2'
                    whole_disk: 0
                    DTL: 218
                    create_txg: 4
                    com.delphix:vdev_zap_leaf: 171
        features_for_read:
            com.delphix:hole_birth
            com.delphix:embedded_data
            
root@node3:~# zdb -C dpool
MOS Configuration:
        version: 5000
        name: 'dpool'
        state: 0
        txg: 18482931
        pool_guid: 7700303140461795525
        errata: 0
        hostid: 2831164162
        hostname: 'node3'
        com.delphix:has_per_vdev_zaps
        vdev_children: 1
        vdev_tree:
            type: 'root'
            id: 0
            guid: 7700303140461795525
            create_txg: 4
            children[0]:
                type: 'mirror'
                id: 0
                guid: 804796955154172038
                metaslab_array: 35
                metaslab_shift: 33
                ashift: 12
                asize: 1000198897664
                is_log: 0
                create_txg: 4
                com.delphix:vdev_zap_top: 111
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 5333707630976611633
                    path: '/dev/sdc1'
                    whole_disk: 0
                    create_txg: 4
                    com.delphix:vdev_zap_leaf: 112
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 14834454220423811900
                    path: '/dev/sdd1'
                    whole_disk: 0
                    create_txg: 4
                    com.delphix:vdev_zap_leaf: 113
        features_for_read:
            com.delphix:hole_birth
            com.delphix:embedded_data
space map refcount mismatch: expected 118 != actual 116
 

Attachments

  • proxmox ve 5.0-25 zfs 0.7.2 zfs_arc_max issue.txt
hah, okay I didn't expect your limit to be this low ;) you should probably think about bumping it at least a bit (e.g., to 1G), as the ARC is not only used for read caching, but also as a buffer for async writes!

that being said, it is possible to get such a small ARC:

Code:
$ cat /proc/spl/kstat/zfs/arcstats | grep -C1 c_max                                   
c_min                           4    67108864
c_max                           4    134217728
size                            4    203793856
$ cat /sys/module/zfs/parameters/zfs_arc_m??
134217728
67108864

note that it isn't yet done shrinking the ARC completely, and I needed to set zfs_arc_min as well, otherwise it wouldn't shrink at all. I haven't checked, but I recall that 0.7 changed some of the defaults, so I guess your arc_max of 128M is simply below the default arc_min and is thus ignored.
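so something along these lines in /etc/modprobe.d/zfs.conf (matching the 64M/128M values in the output above), followed by rebuilding the initramfs and rebooting:

Code:
# cap the ARC between 64M and 128M; zfs_arc_min needs to be lowered too,
# otherwise an arc_max below the default arc_min gets ignored
options zfs zfs_arc_min=67108864 zfs_arc_max=134217728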
 

Oh, I see, that's why! What is the recommended amount of RAM to allocate for the Proxmox host and ZFS? Would 4GB suffice for 2x or 4x 1TB HDDs? That way I could allocate less RAM to the VMs, so Proxmox and ZFS don't run out of memory and the host doesn't reboot randomly.

I'll try again tomorrow with a 1GB ARC and will update you on this later. Thanks for the info. :)
 

Whoa, setting it to 1GB did the trick! Much appreciated for the help. :)

Code:
root@node1:~# cat /proc/spl/kstat/zfs/arcstats | grep -C1 c_max
c_min                           4    1053305472
c_max                           4    1073741824
size                            4    984503736
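For the record, the only change was bumping the value in /etc/modprobe.d/zfs.conf (presumably 1G works without touching zfs_arc_min because the default c_min shown above sits just under 1G on this box):

Code:
# 1G = 1073741824 bytes; just above the default arc_min on this host,
# so the limit is no longer ignored
options zfs zfs_arc_max=1073741824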
 
as the ARC is not only used for read caching, but also as a buffer for async writes!

Hi,

Maybe it is important to say that in this special case (low memory for the ARC), it is even more important to set min/max values for the metadata portion of the ARC. If the metadata ARC is very small, this will cause more disk access (for an ls -l, as an example). As I remember, in this low-memory use case I had reserved 60% of the ARC for metadata.
Also, you can set your zvol properties so that the ARC caches only metadata for them. It makes no sense to cache the same data twice: at the ZFS (zvol) level (i.e. caching data + metadata) and again inside the KVM guest (the same data and metadata cached once more). And if you run several KVM guests, in total you will waste many GB of RAM.

As a final idea: if you do not have sufficient RAM, use it to cache metadata instead of data.
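A rough sketch of what the module side of this could look like (the values are just an example for a 1G ARC with roughly 60% reserved for metadata; adjust them to your hardware):

Code:
# /etc/modprobe.d/zfs.conf (example values only, in bytes)
# 1G ARC cap, metadata allowed to use up to ~60% of it
options zfs zfs_arc_max=1073741824 zfs_arc_meta_min=536870912 zfs_arc_meta_limit=644245094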

Have a nice day!
 

Thanks for the input. That's what I thought too: the ZFS cache and the Linux page cache end up caching the same files twice.

I'll take a look at the zvol option later.
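I assume that means something like this per VM disk (the dataset name below is just a placeholder):

Code:
# check the current setting, then keep only metadata in the ARC for this zvol
zfs get primarycache rpool/data/vm-100-disk-1
zfs set primarycache=metadata rpool/data/vm-100-disk-1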
 
In case you use LXC containers, metadata caching is very important; if the metadata of your LXCs is not in the ZFS cache, performance will be very bad.
 
