Scaling past 1350 containers: seccomp errors & vmap allocation failure

andwoo8182

Hello,

I have been slowly but surely scaling one of my nodes, trying to find the limits of the hardware so I can settle on a sensible level to load the server at long term. I have encountered various constraints along the way and have recently got stuck at 1350 containers of the Proxmox Ubuntu 18.04 container image, at which point container starts throw a seccomp error. A previous occurrence of this error was solved by the sysctl net.core.bpf_jit_limit = 3000000000, as described in the updated LXD production setup doc: https://linuxcontainers.org/lxd/docs/master/production-setup
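For reference, this is roughly how I have it applied - the file name under /etc/sysctl.d/ is just my choice, and the second command only confirms the running value:

# /etc/sysctl.d/99-bpf-jit.conf
net.core.bpf_jit_limit = 3000000000

sysctl -p /etc/sysctl.d/99-bpf-jit.conf
sysctl net.core.bpf_jit_limit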

At 1350 containers, though, containers are now failing to start:

lxc-start 438 20200615053953.628 ERROR seccomp - seccomp.c:lxc_seccomp_load:1239 - Unknown error 524 - Error loading the seccomp policy

However, I think this is just a symptom of what I see in the syslog:

Jun 15 06:45:14 host kernel: vmap allocation for size 8192 failed: use vmalloc= to increase size
Jun 15 06:45:14 host kernel: lxc-start: vmalloc: allocation failure: 4096 bytes, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=ns,mems_allowed=0-1

Jun 15 06:45:14 host kernel: Call Trace:
Jun 15 06:45:14 host kernel: dump_stack+0x6d/0x9a
Jun 15 06:45:14 host kernel: warn_alloc.cold.119+0x7b/0xdd
Jun 15 06:45:14 host kernel: ? __get_vm_area_node+0x149/0x160
Jun 15 06:45:14 host kernel: ? bpf_jit_alloc_exec+0xe/0x10
Jun 15 06:45:14 host kernel: __vmalloc_node_range+0x1aa/0x270
Jun 15 06:45:14 host kernel: ? pcpu_block_refresh_hint+0xb0/0xf0
Jun 15 06:45:14 host kernel: ? bpf_jit_alloc_exec+0xe/0x10
Jun 15 06:45:14 host kernel: module_alloc+0x82/0xe0
Jun 15 06:45:14 host kernel: ? bpf_jit_alloc_exec+0xe/0x10
Jun 15 06:45:14 host kernel: bpf_jit_alloc_exec+0xe/0x10
Jun 15 06:45:14 host kernel: bpf_jit_binary_alloc+0x63/0xf0
Jun 15 06:45:14 host kernel: ? emit_mov_reg+0xf0/0xf0
Jun 15 06:45:14 host kernel: bpf_int_jit_compile+0x133/0x34d
Jun 15 06:45:14 host kernel: bpf_prog_select_runtime+0xcd/0x150
Jun 15 06:45:14 host kernel: bpf_prepare_filter+0x52e/0x5a0
Jun 15 06:45:14 host kernel: bpf_prog_create_from_user+0xc5/0x110
Jun 15 06:45:14 host kernel: ? hardlockup_detector_perf_cleanup.cold.9+0x1a/0x1a
Jun 15 06:45:14 host kernel: do_seccomp+0x2bf/0x8d0
Jun 15 06:45:14 host kernel: __x64_sys_seccomp+0x1a/0x20
Jun 15 06:45:14 host kernel: do_syscall_64+0x57/0x190
Jun 15 06:45:14 host kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun 15 06:45:14 host kernel: RIP: 0033:0x7fbfc709bf59
Jun 15 06:45:14 host kernel: Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 07 6f 0c 00 f7 d8 64 89 01 48
Jun 15 06:45:14 host kernel: RSP: 002b:00007ffd36a591b8 EFLAGS: 00000246 ORIG_RAX: 000000000000013d
Jun 15 06:45:14 host kernel: RAX: ffffffffffffffda RBX: 00005597e2acf440 RCX: 00007fbfc709bf59
Jun 15 06:45:14 host kernel: RDX: 00005597e2ade6f0 RSI: 0000000000000000 RDI: 0000000000000001
Jun 15 06:45:14 host kernel: RBP: 00005597e2ade6f0 R08: 00005597e2acf440 R09: 00005597e2ac8cc0
Jun 15 06:45:14 host kernel: R10: 00005597e2ad34a0 R11: 0000000000000246 R12: 00007ffd36a5925c
Jun 15 06:45:14 host kernel: R13: 0000000000000000 R14: 00000000ffffffff R15: 00005597e2ac8cc0
Jun 15 06:45:14 host kernel: Mem-Info:
Jun 15 06:45:14 host kernel: active_anon:46934939 inactive_anon:84738556 isolated_anon:0
active_file:20479475 inactive_file:18648470 isolated_file:0
unevictable:223734 dirty:590 writeback:0 unstable:0
slab_reclaimable:6646485 slab_unreclaimable:25509665
mapped:5764741 shmem:53598 pagetables:2035581 bounce:0
free:35623875 free_pcp:138359 free_cma:0

Jun 15 06:45:14 host kernel: Node 0 active_anon:96891592kB inactive_anon:176347476kB active_file:42523196kB inactive_file:38214056kB unevictable:285892kB isolated(anon):0kB isolated(file):0kB mapped:11951496kB dirty:1392kB writeback:0kB shmem:78572kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Jun 15 06:45:14 host kernel: Node 1 active_anon:90848164kB inactive_anon:162606748kB active_file:39394704kB inactive_file:36379824kB unevictable:609044kB isolated(anon):0kB isolated(file):0kB mapped:11107468kB dirty:968kB writeback:0kB shmem:135820kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Jun 15 06:45:14 host kernel: Node 0 DMA free:15872kB min:0kB low:12kB high:24kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15872kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Jun 15 06:45:14 host kernel: lowmem_reserve[]: 0 2557 515793 515793 515793
Jun 15 06:45:14 host kernel: Node 0 DMA32 free:2626636kB min:220kB low:2836kB high:5452kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:2732964kB managed:2665112kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:1608kB local_pcp:0kB free_cma:0kB
Jun 15 06:45:14 host kernel: lowmem_reserve[]: 0 0 513236 513236 513236
Jun 15 06:45:14 host kernel: Node 0 Normal free:57810656kB min:44820kB low:570372kB high:1095924kB active_anon:96891592kB inactive_anon:176347476kB active_file:42523196kB inactive_file:38214056kB unevictable:285892kB writepending:1392kB present:533970944kB managed:525553736kB mlocked:285892kB kernel_stack:881128kB pagetables:4131924kB bounce:0kB free_pcp:280648kB local_pcp:1284kB free_cma:0kB
Jun 15 06:45:14 host kernel: lowmem_reserve[]: 0 0 0 0 0
Jun 15 06:45:14 host kernel: Node 1 Normal free:82042336kB min:45064kB low:573476kB high:1101888kB active_anon:90848164kB inactive_anon:162606748kB active_file:39394704kB inactive_file:36379824kB unevictable:609044kB writepending:968kB present:536866816kB managed:528422156kB mlocked:609044kB kernel_stack:973480kB pagetables:4010400kB bounce:0kB free_pcp:271176kB local_pcp:1472kB free_cma:0kB
Jun 15 06:45:14 host kernel: lowmem_reserve[]: 0 0 0 0 0
Jun 15 06:45:14 host kernel: Node 0 DMA: 2*4kB (U) 1*8kB (U) 1*16kB (U) 1*32kB (U) 3*64kB (U) 0*128kB 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15872kB
Jun 15 06:45:14 host kernel: Node 0 DMA32: 5*4kB (UM) 3*8kB (M) 6*16kB (M) 6*32kB (M) 4*64kB (M) 6*128kB (M) 5*256kB (UM) 7*512kB (UM) 9*1024kB (UM) 7*2048kB (UM) 634*4096kB (M) = 2626636kB
Jun 15 06:45:14 host kernel: Node 0 Normal: 16997*4kB (UME) 19286*8kB (UM) 6398*16kB (UME) 2009*32kB (UME) 37*64kB (UME) 192*128kB (UME) 173*256kB (UM) 53*512kB (UM) 752*1024kB (UME) 356*2048kB (U) 13629*4096kB (M) = 57810820kB
Jun 15 06:45:14 host kernel: Node 1 Normal: 52417*4kB (UME) 37834*8kB (UME) 17518*16kB (UME) 25703*32kB (UME) 10200*64kB (UME) 6514*128kB (UME) 794*256kB (UME) 755*512kB (UE) 685*1024kB (UE) 156*2048kB (U) 18879*4096kB (M) = 82040852kB


I will try to post the contents of /proc/meminfo once I am able to bring my node back up after some work, but VmallocUsed was showing at around 22GB. I have searched for more info on the topic, but much of it centres on 32-bit constraints that are alleviated with a vmalloc= line in the GRUB boot loader, and I am unsure whether that applies here, as the 64-bit constraint is 34TB or so. The bpf_jit limit I increased previously seems to be involved. I make extensive use of swap (1-2TB in use) on a round-robin array of 8 NVMe drives, which allows my containers to dump most of their idle memory (after they have performed their workload) and lets me make much better use of RAM. The containers remain performant & I have done extensive IO tuning for my workload.
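For anyone curious about the round-robin swap: there's nothing special to it - the kernel stripes across swap devices that share the same priority, so the /etc/fstab entries just look something like this (device names illustrative):

/dev/nvme0n1p1 none swap sw,pri=10 0 0
/dev/nvme1n1p1 none swap sw,pri=10 0 0
# ...one line per NVMe drive, all with the same pri= value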

Does anyone have any advice as to where to look next?

Here are further details of my node:

Dual AMD Epyc 7742, 1TB RAM, 8TB SWAP (8x1TB NVME), 72x 2TB SSD

proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
pve-kernel-5.4.27-1-pve: 5.4.27-1
pve-kernel-5.4.24-1-pve: 5.4.24-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 2.0.1-1+pve8
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-6
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
 
Thanks for the reply - yeah, I wasn't sure of the syntax, so I had added vmalloc=32768M, and I believe I've done a full retest with that parameter in place, but I'll have to retry to confirm whether it makes a difference. I was testing again last night with Debian containers instead, and whilst I got to a higher number, I hit an 'Exchange full' error in lxc-start, which I believe meant that my bridge was full, so I need to spread the load onto the 2nd bridge. Shortly after that, however, I encountered the vmap allocation error too.

I can see the vmalloc=32768M in the boot log/syslog, but I'm not sure how to tell whether it has actually been applied or processed - the vmalloc total is still the massive 34TB in /proc/meminfo. I am also unsure whether that limit is even being hit, as currently, with 300 or so containers up, my VmallocUsed is very close to the value I see at the time of the error (22G): VmallocUsed: 19938412 kB
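For what it's worth, this is all I'm checking at the moment - the parameter shows up on the command line, but VmallocTotal doesn't move:

cat /proc/cmdline | grep -o 'vmalloc=[^ ]*'
grep Vmalloc /proc/meminfo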


MemTotal: 1056656876 kB
MemFree: 714265668 kB
MemAvailable: 745991756 kB
Buffers: 153768 kB
Cached: 30504816 kB
SwapCached: 35092504 kB
Active: 104507516 kB
Inactive: 108886276 kB
Active(anon): 87772140 kB
Inactive(anon): 95127544 kB
Active(file): 16735376 kB
Inactive(file): 13758732 kB
Unevictable: 370556 kB
Mlocked: 370556 kB
SwapTotal: 7814100640 kB
SwapFree: 7490270880 kB
Dirty: 728 kB
Writeback: 3432 kB
AnonPages: 153766172 kB
Mapped: 6103448 kB
Shmem: 148056 kB
KReclaimable: 7795996 kB
Slab: 35828356 kB
SReclaimable: 7795996 kB
SUnreclaim: 28032360 kB
KernelStack: 1112416 kB
PageTables: 1887168 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 8342429076 kB
Committed_AS: 551695736 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 19938412 kB
VmallocChunk: 0 kB
Percpu: 7383040 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 67421092 kB
DirectMap2M: 179888128 kB
DirectMap1G: 826277888 kB

I'll hopefully get to a full retest this evening & will report back with /proc/meminfo.

The containers are blockchain nodes, each holding a 30GB blockchain that, once it is synced, only receives/uploads a single block every few minutes and uses only 1-5% of its 1 allocated CPU. The main workload is a cryptographic calculation performed periodically & intermittently across all the containers - that calc maxes out the 1 CPU for 50-100 seconds, but immediately after completing it, provided there is enough memory pressure in the container, the container will drop around 70% of its RAM into the host swap, which I've tried to make as speedy as possible through the NVMe array. Then once swap is full & all the containers lightly read from it, the levels of RAM/swap more or less remain the same.
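For reference, the in-container memory pressure just comes from the per-container limits set through Proxmox, something like this (VMID and values illustrative, not my exact numbers):

pct set 101 --memory 1024 --swap 4096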

At 1350 containers I'm using around 700GB of the 1TB RAM, so I think there is some room left to go, but it's mostly trial and error at the moment, as the load is quite significant - and I think I am encountering excessive NIC queue interrupts, which I need to play around with.
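For the NIC side I'm currently just eyeballing it with the usual tools (eth0 is a placeholder for the actual interface):

grep eth0 /proc/interrupts # per-queue interrupt counts per CPU
ethtool -l eth0 # current vs. maximum channel (queue) counts
ethtool -L eth0 combined 8 # e.g. reduce the combined queue count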
 
I was testing again last night with Debian containers instead, and whilst

If the distro isn't that important, you could see if Alpine Linux is enough for you; it's very minimal - mostly busybox plus package management - it does not use any fancy new features and runs plain OpenRC as the PID 1 init daemon. This makes it consume fewer kernel resources, as it doesn't really use cgroups or the like internally. It has its drawbacks too, but if it works for your case it could make quite the difference. We provide Alpine Linux as a template to download in the Storage content tab.
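You can also fetch it from the CLI, roughly like this (the exact template name/version will differ - take it from the list the second command prints):

pveam update
pveam available --section system | grep alpine
pveam download local alpine-3.11-default_20200425_amd64.tar.xz # exact name from the list above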

The vmalloc is really a bit weird IMO; your VmallocTotal: 34359738367 kB is 32 TB, so enough for your setup - it seems this is the default value, as I have it too on the rather smaller 8 GB laptop I'm currently using.


You say you've got 72 SSDs - what storage tech are they using, ZFS?
 
Yeah, I saw the changelog for a recent update noting improvements for nodes with large numbers of containers, with particular mention of Alpine Linux - that definitely got me interested, but I think I still have a steep learning curve before I'm able to get a fully operational container under Alpine. I will start work on that on the side, though; I guess I could do some tests with bare Alpine containers in the meantime to see how the results differ.

Yeah, I think VmallocTotal is potentially a theoretical constraint under 64-bit, but I'm unsure. For my storage, yes, I am using ZFS (I encountered some early issues with cloning on LVM & just moved straight to ZFS, not knowing what I was in for). They are single-disk zpools, as I have no need for redundancy: a disk failure is easily rectified by cloning a template & inserting some minor unique files into the container. I'm unsure if the design is optimal, but it has worked well so far, although I have tweaked ZFS to reduce IO.
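Each pool really is just one command per disk, something along these lines (disk path illustrative, ashift depends on the drives):

zpool create -o ashift=12 zdata1 /dev/disk/by-id/ata-EXAMPLE-SSD-1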

The most recent relief from IO, which was becoming significant with load running away, came from disabling IOMMU and X2APIC in the BIOS. I lose 1 core, but apparently it results in significantly less memory overhead. That got me from 1050 to 1350 containers.
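A quick way to confirm the BIOS change actually took effect is to grep the boot log, e.g.:

dmesg | grep -i -e iommu -e x2apic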

The 72 drives are all connected via 3 x LSI 9305-24i/16i HBA & 2 Oculink connectors on the motherboard.
 
Yeah, I have set a max for the ZFS ARC of 64GB - I tried 32GB, but later found out that when I set a max I also need to set a dnode limit percentage, as I encountered some issues there with pruning. Here is my zfs.conf:

options zfs zfs_arc_max=68719476736
options zfs l2arc_noprefetch=0
options zfs zfs_arc_dnode_limit_percent=75
options zfs zfs_arc_meta_limit_percent=75
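A note for anyone copying this: the ARC max can be changed live via the module parameter, but for the modprobe.d file to apply at boot on a root-on-ZFS system the initramfs has to be refreshed too:

echo 68719476736 > /sys/module/zfs/parameters/zfs_arc_max
update-initramfs -u -k all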

The pools are also optimised on a few parameters, favouring IO over redundancy:

NAME PROPERTY VALUE SOURCE
zdata1 type filesystem -
zdata1 creation Sun Feb 16 13:56 2020 -
zdata1 used 960G -
zdata1 available 838G -
zdata1 referenced 336K -
zdata1 compressratio 1.12x -
zdata1 mounted yes -
zdata1 quota none default
zdata1 reservation none default
zdata1 recordsize 128K default
zdata1 mountpoint /zdata1 default
zdata1 sharenfs off default
zdata1 checksum on default
zdata1 compression lz4 local
zdata1 atime off local
zdata1 devices on default
zdata1 exec on default
zdata1 setuid on default
zdata1 readonly off default
zdata1 zoned off default
zdata1 snapdir hidden default
zdata1 aclinherit restricted default
zdata1 createtxg 1 -
zdata1 canmount on default
zdata1 xattr sa local
zdata1 copies 1 default
zdata1 version 5 -
zdata1 utf8only off -
zdata1 normalization none -
zdata1 casesensitivity sensitive -
zdata1 vscan off default
zdata1 nbmand off default
zdata1 sharesmb off default
zdata1 refquota none default
zdata1 refreservation none default
zdata1 guid 4965731962439637170 -
zdata1 primarycache all local
zdata1 secondarycache all local
zdata1 usedbysnapshots 0B -
zdata1 usedbydataset 336K -
zdata1 usedbychildren 960G -
zdata1 usedbyrefreservation 0B -
zdata1 logbias latency default
zdata1 objsetid 54 -
zdata1 dedup off local
zdata1 mlslabel none default
zdata1 sync disabled local
zdata1 dnodesize legacy default
zdata1 refcompressratio 1.00x -
zdata1 written 336K -
zdata1 logicalused 1.05T -
zdata1 logicalreferenced 160K -
zdata1 volmode default default
zdata1 filesystem_limit none default
zdata1 snapshot_limit none default
zdata1 filesystem_count none default
zdata1 snapshot_count none default
zdata1 snapdev hidden default
zdata1 acltype off default
zdata1 context none default
zdata1 fscontext none default
zdata1 defcontext none default
zdata1 rootcontext none default
zdata1 relatime off default
zdata1 redundant_metadata most local
zdata1 overlay off default
zdata1 encryption off default
zdata1 keylocation none default
zdata1 keyformat none default
zdata1 pbkdf2iters 0 default
zdata1 special_small_blocks 0 default
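The properties marked 'local' above are the ones I changed from the defaults; in command form that was roughly:

zfs set compression=lz4 zdata1
zfs set atime=off zdata1
zfs set xattr=sa zdata1
zfs set sync=disabled zdata1
zfs set redundant_metadata=most zdata1
zfs set primarycache=all zdata1
zfs set secondarycache=all zdata1
zfs set dedup=off zdata1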

@guletz, yeah, all drives are single disk zpools
 
Also, my sysctl conf:

vm.swappiness=100
kernel.keys.maxkeys = 100000000
kernel.keys.maxbytes = 200000000
kernel.dmesg_restrict = 1
vm.max_map_count = 262144
net.ipv6.conf.default.autoconf = 0
fs.inotify.max_queued_events = 167772160
fs.inotify.max_user_instances = 167772160 # def:128
fs.inotify.max_user_watches = 167772160 # def:8192
net.core.bpf_jit_limit = 300000000000
kernel.keys.root_maxbytes = 2000000000
kernel.keys.root_maxkeys = 1000000000
kernel.pid_max = 4194304
kernel.keys.gc_delay = 300
kernel.keys.persistent_keyring_expiry = 259200
fs.aio-max-nr = 524288
kernel.pty.max = 10000
net.core.somaxconn=10000
fs.file-max = 1048576
net.ipv4.ip_local_port_range = 12000 65535
kernel.pty.reserve = 2048
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_mem = 50576 64768 98152
net.core.netdev_max_backlog = 10000
kernel.unprivileged_bpf_disabled=1
 
So, after spending much time diagnosing increasing load issues that seemed disproportionate to the increasing container count, I eventually realised that my Router Advertisements were sending far too many multicast messages to configure IPv6, resulting in about 3x the load average I now have.
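For anyone else hitting this: on the receiving side the processing can be switched off with the usual sysctls - shown here for all/default only as a sketch, scope per interface as your bridge setup needs:

net.ipv6.conf.all.accept_ra = 0
net.ipv6.conf.default.accept_ra = 0
net.ipv6.conf.all.autoconf = 0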

Whilst doing that, I reduced my number of zpools, which greatly reduced running/sleeping tasks, but to no avail. I am still getting stuck at the same error at around 1350 containers:

Jul 02 15:27:55 host kernel: vmap allocation for size 8192 failed: use vmalloc=<size> to increase size
Jul 02 15:27:55 host kernel: warn_alloc: 2 callbacks suppressed
Jul 02 15:27:55 host kernel: lxc-start: vmalloc: allocation failure: 4096 bytes, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=ns,mems_allowed=0-1
Jul 02 15:27:55 host kernel: CPU: 52 PID: 1804830 Comm: lxc-start Tainted: P OE 5.4.44-1-pve #1
Jul 02 15:27:55 host kernel: Hardware name: Supermicro Super Server/H11DSi-NT, BIOS 2.0 09/25/2019
Jul 02 15:27:55 host kernel: Call Trace:
Jul 02 15:27:55 host kernel: dump_stack+0x6d/0x9a
Jul 02 15:27:55 host kernel: warn_alloc.cold.119+0x7b/0xdd
Jul 02 15:27:55 host kernel: ? __get_vm_area_node+0x149/0x160
Jul 02 15:27:55 host kernel: ? bpf_jit_alloc_exec+0xe/0x10
Jul 02 15:27:55 host kernel: __vmalloc_node_range+0x1aa/0x270
Jul 02 15:27:55 host kernel: ? pcpu_block_refresh_hint+0xb0/0xf0
Jul 02 15:27:55 host kernel: ? bpf_jit_alloc_exec+0xe/0x10
Jul 02 15:27:55 host kernel: module_alloc+0x82/0xe0
Jul 02 15:27:55 host kernel: ? bpf_jit_alloc_exec+0xe/0x10
Jul 02 15:27:55 host kernel: bpf_jit_alloc_exec+0xe/0x10
Jul 02 15:27:55 host kernel: bpf_jit_binary_alloc+0x63/0xf0
Jul 02 15:27:55 host kernel: ? emit_mov_reg+0xf0/0xf0
Jul 02 15:27:55 host kernel: bpf_int_jit_compile+0x133/0x34d
Jul 02 15:27:55 host kernel: bpf_prog_select_runtime+0xa8/0x130
Jul 02 15:27:55 host kernel: bpf_prepare_filter+0x52e/0x5a0
Jul 02 15:27:55 host kernel: bpf_prog_create_from_user+0xc5/0x110
Jul 02 15:27:55 host kernel: ? hardlockup_detector_perf_cleanup.cold.9+0x1a/0x1a
Jul 02 15:27:55 host kernel: do_seccomp+0x2bf/0x8d0
Jul 02 15:27:55 host kernel: __x64_sys_seccomp+0x1a/0x20
Jul 02 15:27:55 host kernel: do_syscall_64+0x57/0x190
Jul 02 15:27:55 host kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 02 15:27:55 host kernel: RIP: 0033:0x7f64d704af59
Jul 02 15:27:55 host kernel: Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 07 6f 0c 00 f7 d8 64 89 01 48
Jul 02 15:27:55 host kernel: RSP: 002b:00007ffd9587c558 EFLAGS: 00000246 ORIG_RAX: 000000000000013d
Jul 02 15:27:55 host kernel: RAX: ffffffffffffffda RBX: 000055ad70461440 RCX: 00007f64d704af59
Jul 02 15:27:55 host kernel: RDX: 000055ad70463250 RSI: 0000000000000000 RDI: 0000000000000001
Jul 02 15:27:55 host kernel: RBP: 000055ad70463250 R08: 000055ad70461440 R09: 000055ad7045acc0
Jul 02 15:27:55 host kernel: R10: 000055ad70465eb0 R11: 0000000000000246 R12: 00007ffd9587c5fc
Jul 02 15:27:55 host kernel: R13: 0000000000000000 R14: 00000000ffffffff R15: 000055ad7045acc0
Jul 02 15:27:55 host kernel: Mem-Info:
Jul 02 15:27:55 host kernel: active_anon:54587451 inactive_anon:90967442 isolated_anon:0
active_file:2235167 inactive_file:3460520 isolated_file:0
unevictable:164343 dirty:425 writeback:143 unstable:0
slab_reclaimable:4713349 slab_unreclaimable:21539956
mapped:5452772 shmem:47729 pagetables:1967324 bounce:0
free:51835905 free_pcp:134387 free_cma:0
Jul 02 15:27:55 host kernel: Node 0 active_anon:107404408kB inactive_anon:172444724kB active_file:4307976kB inactive_file:6898316kB unevictable:619824kB isolated(anon):0kB isolated(file):0kB mapped:10756116kB dirty:1108kB writeback:388kB shmem:128096kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Jul 02 15:27:55 host kernel: Node 1 active_anon:110945396kB inactive_anon:191425044kB active_file:4632692kB inactive_file:6943764kB unevictable:37548kB isolated(anon):0kB isolated(file):0kB mapped:11054972kB dirty:592kB writeback:184kB shmem:62820kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Jul 02 15:27:55 host kernel: Node 0 DMA free:15876kB min:0kB low:12kB high:24kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Jul 02 15:27:55 host kernel: lowmem_reserve[]: 0 2561 515798 515798 515798
Jul 02 15:27:55 host kernel: Node 0 DMA32 free:2625288kB min:220kB low:2840kB high:5460kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:2732964kB managed:2665112kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:2956kB local_pcp:0kB free_cma:0kB
Jul 02 15:27:55 host kernel: lowmem_reserve[]: 0 0 513236 513236 513236
Jul 02 15:27:55 host kernel: Node 0 Normal free:101709384kB min:44820kB low:570372kB high:1095924kB active_anon:107404408kB inactive_anon:172444724kB active_file:4307976kB inactive_file:6898316kB unevictable:619824kB writepending:1496kB present:533970944kB managed:525553736kB mlocked:619824kB kernel_stack:586248kB pagetables:4112108kB bounce:0kB free_pcp:261276kB local_pcp:1340kB free_cma:0kB
Jul 02 15:27:55 host kernel: lowmem_reserve[]: 0 0 0 0 0
Jul 02 15:27:55 host kernel: Node 1 Normal free:102993072kB min:45064kB low:573476kB high:1101888kB active_anon:110945396kB inactive_anon:191425044kB active_file:4632692kB inactive_file:6943764kB unevictable:37548kB writepending:776kB present:536866816kB managed:528422152kB mlocked:37548kB kernel_stack:522184kB pagetables:3757188kB bounce:0kB free_pcp:273316kB local_pcp:1384kB free_cma:0kB
Jul 02 15:27:55 host kernel: lowmem_reserve[]: 0 0 0 0 0
Jul 02 15:27:55 host kernel: Node 0 DMA: 1*4kB (U) 2*8kB (U) 1*16kB (U) 1*32kB (U) 3*64kB (U) 0*128kB 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15876kB
Jul 02 15:27:55 host kernel: Node 0 DMA32: 6*4kB (UM) 6*8kB (M) 8*16kB (M) 6*32kB (M) 6*64kB (M) 6*128kB (M) 5*256kB (UM) 8*512kB (UM) 9*1024kB (UM) 10*2048kB (UM) 632*4096kB (M) = 2625288kB
Jul 02 15:27:55 host kernel: Node 0 Normal: 7673547*4kB (UME) 1710859*8kB (UME) 258272*16kB (UME) 263691*32kB (UME) 157376*64kB (UME) 28369*128kB (UME) 5251*256kB (UME) 1575*512kB (UM) 565*1024kB (UME) 57*2048kB (UM) 6887*4096kB (M) = 101709924kB
Jul 02 15:27:55 host kernel: Node 1 Normal: 7677080*4kB (UME) 1580146*8kB (UME) 174051*16kB (UME) 172343*32kB (UME) 130469*64kB (UME) 21087*128kB (UME) 4590*256kB (UME) 1912*512kB (UME) 3157*1024kB (UME) 791*2048kB (UM) 8127*4096kB (M) = 102993344kB

I looked into using Alpine, but currently the application I need to run in these containers has not been successfully compiled on Alpine, so it doesn't look like that is an option.

root@host:~# cat /proc/meminfo
MemTotal: 1056656876 kB
MemFree: 142853604 kB
MemAvailable: 233692684 kB
Buffers: 14372 kB
Cached: 78175552 kB
SwapCached: 133641820 kB
Active: 272853792 kB
Inactive: 394824896 kB
Active(anon): 224174860 kB
Inactive(anon): 365528324 kB
Active(file): 48678932 kB
Inactive(file): 29296572 kB
Unevictable: 435424 kB
Mlocked: 435424 kB
SwapTotal: 7814100640 kB
SwapFree: 6110969248 kB
Dirty: 828 kB
Writeback: 400 kB
AnonPages: 491102496 kB
Mapped: 21304232 kB
Shmem: 197976 kB
KReclaimable: 19429068 kB
Slab: 106750752 kB
SReclaimable: 19429068 kB
SUnreclaim: 87321684 kB
KernelStack: 1106816 kB
PageTables: 7871644 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 8342429076 kB
Committed_AS: 2406483908 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 23569272 kB
VmallocChunk: 0 kB
Percpu: 23470080 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 132551588 kB
DirectMap2M: 247926784 kB
DirectMap1G: 693108736 kB

I can now easily retest this without any load concerns, but I am a little stuck as to where to look next. VmallocUsed as reported in meminfo has at times been higher when the container count was lower, and seems inconsistent; however, most of the attempts to go past 1350 have seen it sitting at around 22-23GB. There seems to be very little about this specific error in recent posts online.

Does anyone have any ideas as to where to look next?
 
From what I've read in various posts - related, but mostly unrelated - it sounds like it is either fragmentation, or zfs/cgroups usage of certain kernel memory areas. For reference, my current limits.conf entries:

* soft nofile 1048576 unset
* hard nofile 1048576 unset
root soft nofile 1048576 unset
root hard nofile 1048576 unset
* soft memlock unlimited unset
* hard memlock unlimited unset
root soft memlock unlimited unset
root hard memlock unlimited unset

Today I tried adding the root memlock limits, as they weren't previously specified and I wasn't sure if there was an interaction there, but no progress.
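To try to tell fragmentation apart from plain exhaustion, I've started watching these (reading /proc/vmallocinfo needs root; the awk line just sums vmalloc bytes per caller, assuming the usual range/size/caller field layout):

cat /proc/buddyinfo
awk '{sum[$3] += $2} END {for (c in sum) print sum[c], c}' /proc/vmallocinfo | sort -rn | head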
 
I tested with a privileged container & got the same result, so I'm guessing that points more to a kernel issue than to something LXC-specific around unprivileged containers (like the bpf_jit_limit, which seems to be involved).
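For anyone wanting to reproduce: I just created one test container with the unprivileged flag off, roughly like this (VMID and template name illustrative), and tried starting it once at the limit:

pct create 9999 local:vztmpl/ubuntu-18.04-standard_18.04.1-1_amd64.tar.gz --unprivileged 0
pct start 9999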
 
