Proxmox reports: 88% memory allocation, but no VM / CT runs - Is this a memory leak caused by Ceph?

cmonty14 · Jun 24, 2019

Hi,
in Proxmox WebUI the monitor reports 88% memory allocation (see screenshot).

This data is confirmed by command free:
root@ld5505:~# free -h
total used free shared buff/cache available
Mem: 251G 221G 1,2G 125M 29G 130G
Swap: 29G 3,9G 25G

However, on this host only 1 VM is running with 130GB RAM:
root@ld5505:~# qm list
VMID NAME STATUS MEM(MB) BOOTDISK(GB) PID
102 vm102-sles12sp4 running 131072 20.00 128351

Comparing the RAM allocation with other tools show a different RAM usage: 130GB
Please check the second attachment displaying the output of glances.

This raises the following questions:
If the memory allocation displayed in Proxmox WebUI is correct, what is causing the memory leak?
If there's no memory leak as indicated by glances, why is Proxmox WebUI displaying an incorrect memory usage?

THX

sb-jw · Jun 24, 2019

All Services are running on these Node need RAM, take a look at CEPH OSDs, Mons and MDS - they normally take a bit more RAM if it is not configured.

cmonty14 · Jun 24, 2019

There's no doubt about other services / processes running on that node allocating memory.
But here we talk about an additional memory allocation of 100GB.

And there's a difference of used memory when using other monitoring tools, e.g. glances or netdata.

As a matter of fact the available memory is increasing in steps; this is documented with netdata (check the new attachment).

Tapio Lehtonen · Jun 24, 2019

https://www.linuxatemyram.com/
https://www.cloudways.com/blog/linux-ate-my-ram-memory-myth-busted/

Same thing applies to Proxmox.

cmonty14 · Jun 24, 2019

Thanks for this excurse in the theory of memory allocation on Linux systems.
But I have the impression that you simply ignore the facts documented by the monitoring tools glances and netdata.
One could argue that a monitor is inaccurate. But two different tools reporting the same metrics that are different to Proxmox WebUI?

netdata has earned many appraisals and if you rely on this tool the following question must be asked:
Is Proxmox WebUI monitoring capabilities accurate?

In my opinion the answer is: no!

And this raises another concern:
How will Proxmox judge where to start an VM / LXC instance if the measurement of resource allocation is incorrect?

THX

Vladimir Bulgaru · Jun 24, 2019

Are you using ZFS by any chance?

cmonty14 · Jun 24, 2019

Vladimir Bulgaru said:
Are you using ZFS by any chance?

Nope

cmonty14 · Jun 25, 2019

This issue is getting serious now because PVE reports 88% RAM allocation w/o any VM / CT running!
And the relevant node has 250GB RAM.

As this is not a single node issue I would conclude that something is wrong with PVE.

BobhWasatch · Jun 25, 2019

c.monty said:
This issue is getting serious now because PVE reports 88% RAM allocation w/o any VM / CT running!
And the relevant node has 250GB RAM.

As this is not a single node issue I would conclude that something is wrong with PVE.

In your original post "free" reports 130G "available", 221G "used", and only 1.2G "free". The glances output shows 131G free. Wow, danger! But look more closely. It would seem that glances is reporting as "free" what linux calls "available". Similarly, if you add up the "used", "buffers", and "cached" from glances it is 222.2G, which is pretty close to what linux is reporting as "used".

This makes sense because to Linux memory used as cache is trivially retrieved if it is needed (just take the page away for something else, it can be read from disk if needed again). Buffers are slightly more work to grab back (data has to be written), but that memory is still usable for something else if needed.

What Linux calls "free" is really completely unused, which will tend toward zero over time. This is what the LinuxAteMyRam page is trying to explain.

I think what is going on here is a confusing of terminology and nothing more. Your favorite monitoring tools use different terms to describe things from what Linux & Proxmox use.

cmonty14 · Jun 25, 2019

All right. I agree that the terminology is not equal on the different tools.
Therefore let's stick with what is displayed by command free.

I take node ld5505 as the current most painful example, because this node runs with ~80% RAM allocation w/o any CT / VM running:
root@ld5505:~# free -m
total used free shared buff/cache available
Mem: 257554 207179 1125 40 49250 65364
Swap: 30517 1573 28944

This is not acceptable. The system must release the memory to be used by a VM.

Or another node, ld5508 running 1 VM (with 128GB RAM) but allocating 98% of available RAM:
root@ld5508:~# free -m
total used free shared buff/cache available
Mem: 257555 250406 3837 119 3310 219750
Swap: 30517 0 30517

I cannot start a VM on this node although 214GB are available!

Checking the memory allocation by users/processors show: effectively all memory is allocated by Ceph.

I can provide real-time monitoring data / statistics provided by Netdata.

RokaKen · Jun 26, 2019

c.monty said:
... I cannot start a VM on this node although 214GB are available! ...

Really? Have you tried? What errors does the attempt produce?

I have, for example:

Code:

# free -h
              total        used        free      shared  buff/cache   available
Mem:           283G        104G         53G        210M        125G        208G
Swap:          8.0G        268K        8.0G

# qm list
      VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID       
       101 xxxxxxxxxxx        running    8192              48.00 13577     
       102 xxxxxxxxxxx        running    24576             80.00 18292     
       112 xxxxxxxxxxx        running    4096              35.00 18249     
       115 xxxxxxxxxxx        running    6144              40.00 18661     
       116 xxxxxxxxxxx        running    2048              60.00 18456     
       120 xxxxxxxxxxx        running    8192              40.00 16358     
       127 xxxxxxxxxxx        running    4096              40.00 19079     
       128 xxxxxxxxxxx        running    4096              35.00 16354     
       136 xxxxxxxxxxx        running    4096               8.00 16594     
       139 xxxxxxxxxxx        running    2048               8.00 16219     
       140 xxxxxxxxxxx        running    4096              32.00 16884

cmonty14 · Jun 27, 2019

Hi,
I have updated the title because the issue is getting worse.
And this issue is not a single occurrence, but multiple nodes are affected.

Therefore I post the current statistics regarding memory allocation:
root@ld5505:# free -h
total used free shared buff/cache available
Mem: 251G 217G 17G 52M 16G 67G
Swap: 29G 4,0G 25G

root@ld5505:# cat /proc/meminfo
MemTotal: 263736284 kB
MemFree: 1988980 kB
MemAvailable: 70378100 kB
Buffers: 2800 kB
Cached: 31832092 kB
SwapCached: 159440 kB
Active: 188551272 kB
Inactive: 62465396 kB
Active(anon): 175571580 kB
Inactive(anon): 7177280 kB
Active(file): 12979692 kB
Inactive(file): 55288116 kB
Unevictable: 55788 kB
Mlocked: 55788 kB
SwapTotal: 31250428 kB
SwapFree: 27066380 kB
Dirty: 1272 kB
Writeback: 0 kB
AnonPages: 219187256 kB
Mapped: 118140 kB
Shmem: 53768 kB
Slab: 3332240 kB
SReclaimable: 1966036 kB
SUnreclaim: 1366204 kB
KernelStack: 85344 kB
PageTables: 504836 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 163118568 kB
Committed_AS: 276330568 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 1300584 kB
DirectMap2M: 102098944 kB
DirectMap1G: 166723584 kB

What you can see here are 2 things:
1. The amount of memory marked as free is 17GB. This memory is available for new VMs / CTs only, means I cannot start any VM / CT that demands more memory
2. ~60GB are marked as inactive.

I'm very sure that all memory is allocated by Ceph and is used for caching of the OSDs.
However this cache must either be limited or released after some time.
If not, then there's a memory leak.

THX

Bengt Nolin · Jun 27, 2019

Does the memory get released if you restart the ceph services?

OSD cache can be controlled by "osd_memory_target = x" (in bytes) which was introduced in Luminous 12.2.10 (https://ceph.com/releases/v12-2-10-luminous-released/) I noted an increased in RAM usage after updating to that version (can't recall when it was introduced in PVE).

cmonty14 · Jun 27, 2019

I didn't try to restart any Ceph services.
Which services do you suggest to restart?

Bengt Nolin · Jun 27, 2019

I suspect the OSD services consume the most RAM so restart one of them and check for differences in memory usage. How many OSD:s does the host have?

cmonty14 · Jun 27, 2019

I have ~60 OSDs per node.
My understanding is that restarting the service would trigger pg replacement.
Not really an option i.m.o.

Bengt Nolin · Jun 27, 2019

Restarting one single OSD service (ceph-osd@42 for example) will not cause you problem since it will be down such a short time, and if some PG:s are moved they will be rebalanced when the OSD comes up again. This is one of the major points of ceph - OSD failure is a common thing. But to avoid any response from the cluster set "noout" before to avoid recovery. But remember to unset it after performing your tests.

At 60 OSD per node your memory consumption is not strange at all but wholly expected in my opinion. Each OSD would use at least 2-3 GB, preferably more (4-5).

cmonty14 · Jun 27, 2019

In Ceph's documentation the hardware requirements are:
Process Criteria Minimum Recommended
ceph-osd Processor

1x 64-bit AMD-64
1x 32-bit ARM dual-core or better
1x i386 dual-core

RAM ~1GB for 1TB of storage per daemon
Volume Storage 1x storage drive per daemon
Journal 1x SSD partition per daemon (optional)
Network 2x 1GB Ethernet NICs

I have 60 disk with 2TB each, means the memory requirement is 120GB.

What I need are "real life figures" of RAM requirement to judge whether or not I can run VMs / CTs on top on this node.

Bengt Nolin · Jun 27, 2019

Minimum requirements does not mean maximum or typical RAM usage, it means the minimum you can get away with. If there is more available more will be used. But you can limit the per-OSD usage with the "osd_memory_target = x" parameter I posted earlier, but I don't think you should set it below 2GB. By default it is 4GB if I recall correctly, which would mean your 60 OSD:s will eventually consume 240GB.

For reference you can read for example Red Hat's papers on Ceph where I have never seen a recommended value below 2GB per OSD. Here's a recent one: https://access.redhat.com/documenta...ph_Storage_Hardware_Selection_Guide-en-US.pdf

Also, the larger your cluster is (more OSD) the more memory will be used by the Ceph MON as well, which I don't know if you are running on the same host.

alexskysilk · Jun 27, 2019

c.monty said:
This is not acceptable. The system must release the memory to be used by a VM.

c.monty said:
I have ~60 OSDs per node.

rofl. you're funny.

You want to run 60OSDs AND VMs and are complaining about running out of memory? Even if you didnt bother reading the ceph documentation this isnt a reasonable expectation simple based on the enormous amount of CPU required to do what you're doing. However, since you should have read the documentation (specifically http://docs.ceph.com/docs/luminous/...ight=osd memory target#automatic-cache-sizing) you must know that the default cache allocation per OSD is 4294967296 (hint- thats 4G). 60x4=240.

Proxmox reports: 88% memory allocation, but no VM / CT runs - Is this a memory leak caused by Ceph?

Well-Known Member

Attachments

Famous Member

Well-Known Member

Attachments

Active Member

Well-Known Member

Renowned Member

Well-Known Member

Well-Known Member

Famous Member

Well-Known Member

Active Member

Well-Known Member

Well-Known Member

Well-Known Member

Well-Known Member

Well-Known Member

Well-Known Member

Well-Known Member

Well-Known Member

Distinguished Member

We value your privacy