We are currently experiencing a problem on our main server, which runs Proxmox 6.1. It seems to have started after we updated that node from 5.4 to 6.1 on 2020-03-17.
Our system mostly runs containers on local ZFS storage. There is also one old KVM guest running. Backups of the containers are stored on NFS storage (a Synology) using vzdump.
Yesterday we first noticed that processes were being killed. Since we run some Java applications (Jira, Confluence), those get killed first. The Proxmox UI only reported nearly full memory usage. That is odd, because the processes in these containers normally use nowhere near the 96 GB available on that machine, and in the UI overview of all containers the combined used memory was much lower.
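For reference, the OOM kills show up in the kernel log; something along these lines lists them (the exact message format depends on the kernel version):
Code:
# list recent OOM-killer activity from the kernel log
journalctl -k | grep -i -E 'out of memory|oom-kill'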
First: It's not ZFS.
Code:
root@amor:~# arcstat
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
16:32:08 40 0 0 0 0 0 0 0 0 3.3G 3.3G
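The same number can also be read straight from the ARC kstats, e.g.:
Code:
# current ARC size in GiB, taken from the "size" line of the kstat
awk '/^size / {printf "%.1f GiB\n", $3 / 1073741824}' /proc/spl/kstat/zfs/arcstats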
The processes on the system don't use that much memory either (this matches the per-container usage shown in the Proxmox UI):
Code:
root@amor:~# ps -ax -o rss | tail -n +2 | paste -s -d+ - | bc
19807784
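For readability, the same sum can be printed in GiB (about 19 GiB here), e.g.:
Code:
# sum the RSS column (kB, header suppressed by 'rss=') and convert to GiB
ps -ax -o rss= | awk '{sum += $1} END {printf "%.1f GiB\n", sum / 1048576}'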
slabtop also looked OK (around 13 GB usage). I only have a copy of /proc/slabinfo from that time, which I attach.
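Since I only have the raw file, a rough total can also be computed directly from /proc/slabinfo (assuming the slabinfo 2.1 layout: number of objects times object size):
Code:
# rough slab total: num_objs * objsize, printed in GiB
awk 'NR > 2 { sum += $3 * $4 } END { printf "%.1f GiB\n", sum / 1073741824 }' /proc/slabinfo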
meminfo showed how little memory was left:
Code:
root@amor:~# cat /proc/meminfo
MemTotal: 98885928 kB
MemFree: 1138156 kB
MemAvailable: 1664236 kB
Buffers: 0 kB
Cached: 1378444 kB
SwapCached: 192508 kB
Active: 12403076 kB
Inactive: 6676824 kB
Active(anon): 11869160 kB
Inactive(anon): 6370316 kB
Active(file): 533916 kB
Inactive(file): 306508 kB
Unevictable: 159248 kB
Mlocked: 159248 kB
SwapTotal: 8388604 kB
SwapFree: 1680 kB
Dirty: 16552 kB
Writeback: 1228 kB
AnonPages: 17710588 kB
Mapped: 704340 kB
Shmem: 539972 kB
KReclaimable: 1010396 kB
Slab: 13705372 kB
SReclaimable: 1010396 kB
SUnreclaim: 12694976 kB
KernelStack: 89024 kB
PageTables: 251980 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 57831568 kB
Committed_AS: 52493672 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 1688564 kB
VmallocChunk: 0 kB
Percpu: 60249024 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 82923100 kB
DirectMap2M: 17659904 kB
DirectMap1G: 2097152 kB
I wasn't able to figure out what was allocating all the memory. The processes, the ARC, and the slab caches together shouldn't even add up to 40 GB.
So I shut down the VM and each container one after another and checked the memory consumption after each step. But that only freed the memory used by the processes inside the containers. In the end I got:
Code:
# no running processes
root@amor:~# pct list | grep running | wc -l
0
# and little memory available
root@amor:~# free -m
total used free shared buff/cache available
Mem: 96568 78852 16076 75 1639 16107
Swap: 8191 1065 7126
root@amor:~# arcstat
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
20:42:15 38 0 0 0 0 0 0 0 0 4.0G 4.0G
root@amor:~# cat /proc/meminfo
MemTotal: 98885928 kB
MemFree: 16451828 kB
MemAvailable: 16485748 kB
Buffers: 0 kB
Cached: 1042656 kB
SwapCached: 124856 kB
Active: 947648 kB
Inactive: 431952 kB
Active(anon): 302944 kB
Inactive(anon): 123776 kB
Active(file): 644704 kB
Inactive(file): 308176 kB
Unevictable: 159244 kB
Mlocked: 159244 kB
SwapTotal: 8388604 kB
SwapFree: 7297552 kB
Dirty: 248 kB
Writeback: 0 kB
AnonPages: 410996 kB
Mapped: 100072 kB
Shmem: 77512 kB
KReclaimable: 638892 kB
Slab: 12985512 kB
SReclaimable: 638892 kB
SUnreclaim: 12346620 kB
KernelStack: 13152 kB
PageTables: 8508 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 57831568 kB
Committed_AS: 4248996 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 1639340 kB
VmallocChunk: 0 kB
Percpu: 62615808 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 85710428 kB
DirectMap2M: 14872576 kB
DirectMap1G: 2097152 kB
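For reference, the shutdown pass described above was essentially a loop like this (the KVM guest was stopped separately with qm shutdown):
Code:
# shut down each running container in turn and check memory after each one
for ct in $(pct list | awk '/running/ {print $1}'); do
    pct shutdown "$ct"
    free -m
    grep Percpu /proc/meminfo
done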
The slab sizes also didn't change significantly.
The number that caught my attention is the Percpu row. It roughly matches the "missing" memory, but if that is the problem, I'm not sure how to proceed or how to get detailed usage information for that memory. I just can't find useful information about it.
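The only lead I have found so far is that the per-CPU chunks seem to show up in /proc/vmallocinfo with pcpu_get_vm_areas as the caller, so counting them might at least show whether they keep piling up (I'm not sure how much this really tells):
Code:
# count the per-CPU chunk mappings the kernel currently holds
grep -c pcpu_get_vm_areas /proc/vmallocinfo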
To get the machine usable again I rebooted it, and the memory is available again, but we would like a better solution:
Code:
root@amor:~# arcstat
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
21:19:13 0 0 0 0 0 0 0 0 0 7.1G 47G
root@amor:~# cat /proc/meminfo
MemTotal: 98885928 kB
MemFree: 94535824 kB
MemAvailable: 94246216 kB
Buffers: 8356 kB
Cached: 391696 kB
SwapCached: 0 kB
Active: 1184412 kB
Inactive: 245936 kB
Active(anon): 1064432 kB
Inactive(anon): 101452 kB
Active(file): 119980 kB
Inactive(file): 144484 kB
Unevictable: 159248 kB
Mlocked: 159248 kB
SwapTotal: 8388604 kB
SwapFree: 8388604 kB
Dirty: 676 kB
Writeback: 136 kB
AnonPages: 1189644 kB
Mapped: 247864 kB
Shmem: 123324 kB
KReclaimable: 125632 kB
Slab: 723044 kB
SReclaimable: 125632 kB
SUnreclaim: 597412 kB
KernelStack: 15360 kB
PageTables: 29104 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 57831568 kB
Committed_AS: 5086664 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 1391068 kB
VmallocChunk: 0 kB
Percpu: 138240 kB
HardwareCorrupted: 0 kB
AnonHugePages: 6144 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 530012 kB
DirectMap2M: 7778304 kB
DirectMap1G: 94371840 kB
I later updated to the newest packages (Proxmox 6.1-8), still with the same kernel, and also limited the maximum ARC size for ZFS, though only because I don't think the 47 GB are needed there.
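For reference, the cap was set along these lines (the 16 GiB value below is only an example):
Code:
# example: limit the ZFS ARC to 16 GiB (17179869184 bytes); choose a value that fits the workload
echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf
update-initramfs -u    # module options are read from the initramfs at boot
# or apply at runtime without a reboot:
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max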
We are now running the following versions:
Code:
root@amor:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.18-2-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-7
pve-kernel-5.3: 6.1-5
pve-kernel-4.15: 5.4-6
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.4.117-2-pve: 4.4.117-110
pve-kernel-4.4.117-1-pve: 4.4.117-109
pve-kernel-4.4.98-5-pve: 4.4.98-105
pve-kernel-4.4.98-3-pve: 4.4.98-103
pve-kernel-4.4.62-1-pve: 4.4.62-88
pve-kernel-4.4.35-1-pve: 4.4.35-77
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-22
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-6
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
And the problem seems to persist:
Code:
root@amor:~# date && cat /proc/meminfo
Fri 27 Mar 2020 01:56:54 PM CET
MemTotal: 98885928 kB
MemFree: 10125640 kB
MemAvailable: 57181032 kB
...
Percpu: 4957056 kB
...
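In the meantime, something simple like this can at least log how Percpu grows over time (the log path is arbitrary):
Code:
# append a timestamped Percpu/MemFree sample every 10 minutes
while true; do
    printf '%s %s\n' "$(date -Is)" "$(grep -E '^(Percpu|MemFree)' /proc/meminfo | tr -s ' ' | paste -sd' ' -)" >> /var/log/percpu-watch.log
    sleep 600
done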
I'm looking forward to any help or suggestions!
Thanks,
Stephan