PVE 5.1 - Memory leak

We upgraded our cluster to PVE 5.1 on the 16th of November and are experiencing what appear to be memory leaks. The host has 192 GB of RAM but only runs virtual routers, so there is plenty of unused RAM. The host is hyper-converged and was previously running Ceph FileStore OSDs, which were migrated to BlueStore OSDs on the 19th of November.

i.e. the memory leaks do not relate to FileStore or BlueStore OSDs, but to the PVE 5.1 or Ceph Luminous upgrade.

Each host only has 4 HDD OSDs, so maximum memory utilisation should be around 1 GB per HDD OSD (or 3 GB per SSD OSD).
(attachment: kvm5a_memory.jpg)

Herewith almost 6 months of history:
(attachment: kvm5a_memory_6months.jpg)
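
For reference, the expectation above comes from the Luminous BlueStore cache defaults. If I read the documentation correctly, the per-OSD cache can also be pinned explicitly in ceph.conf; this is a sketch only, with the values set to what I understand the stock defaults to be:
Code:
[osd]
# cache per HDD-backed BlueStore OSD (Luminous default: 1 GiB)
bluestore_cache_size_hdd = 1073741824
# cache per SSD-backed BlueStore OSD (Luminous default: 3 GiB)
bluestore_cache_size_ssd = 3221225472

The settings only take effect once the OSD daemons are restarted.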


Herewith the output of 'top':
Code:
top - 14:45:20 up 5 days, 18:27,  1 user,  load average: 5.23, 5.23, 5.34
Tasks: 604 total,   2 running, 601 sleeping,   0 stopped,   1 zombie
%Cpu(s):  9.8 us,  6.3 sy,  0.0 ni, 80.0 id,  3.5 wa,  0.0 hi,  0.4 si,  0.0 st
KiB Mem : 19799395+total, 10125188+free, 95258368 used,  1483688 buff/cache
KiB Swap: 26830438+total, 26830438+free,        0 used. 12658549+avail Mem

And 'htop' with sorting by memory utilisation:
Code:
  1  [||||||                    15.3%]   6  [|||||||                   17.4%]   11 [||                         3.3%]   16 [||                         2.1%]
  2  [|||||                     11.8%]   7  [|||||||||||||||||         49.4%]   12 [||                         1.7%]   17 [||                         1.6%]
  3  [||||                      10.7%]   8  [|||||||||||||||           44.2%]   13 [||                         2.1%]   18 [||                         2.5%]
  4  [|||||                     10.0%]   9  [|||||||||||||||           46.0%]   14 [||                         2.1%]   19 [||                         2.9%]
  5  [||||||||||||              34.7%]   10 [||||||||||||||||||||||||||80.4%]   15 [||                         2.9%]   20 [||                         2.5%]
  Mem[||||||||||||||||||||||||||||||||||||                        91.0G/189G]   Tasks: 75, 378 thr; 6 running
  Swp[                                                               0K/256G]   Load average: 5.40 5.24 5.33
                                                                                Uptime: 5 days, 18:28:16

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
30326 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  9:31.79 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30327 ceph       20   0 7065M 6282M 28936 S  0.8  3.2 59:55.45 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30328 ceph       20   0 7065M 6282M 28936 S  0.8  3.2 58:13.03 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30329 ceph       20   0 7065M 6282M 28936 S  0.8  3.2 38:58.86 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30331 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:01.11 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30332 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30394 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:01.49 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30395 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:05.99 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30396 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:15.79 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30397 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30398 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:04.58 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30399 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  8:09.85 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30400 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  4:30.37 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30401 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:14.82 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30421 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30422 ceph       20   0 7065M 6282M 28936 S  0.4  3.2 21:29.48 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30423 ceph       20   0 7065M 6282M 28936 S  1.2  3.2  1h10:23 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30424 ceph       20   0 7065M 6282M 28936 S  0.4  3.2 22:27.70 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30425 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  2:41.31 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30426 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:01.18 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30427 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30428 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:16.50 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30429 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30430 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30431 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30432 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30433 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30434 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30435 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30436 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30437 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30438 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30439 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30440 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:03.26 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30441 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30442 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:37.06 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30443 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:09.19 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30444 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:09.49 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30445 ceph       20   0 7065M 6282M 28936 S  0.4  3.2 10:49.98 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30446 ceph       20   0 7065M 6282M 28936 S  0.4  3.2 18:18.19 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30447 ceph       20   0 7065M 6282M 28936 S  0.4  3.2  5:44.48 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30448 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  8:29.75 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30449 ceph       20   0 7065M 6282M 28936 S  0.4  3.2 23:07.27 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30450 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  4:55.51 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30451 ceph       20   0 7065M 6282M 28936 S  0.4  3.2  7:31.26 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30452 ceph       20   0 7065M 6282M 28936 S  0.0  3.2 11:50.82 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30453 ceph       20   0 7065M 6282M 28936 S  0.0  3.2 10:49.41 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30454 ceph       20   0 7065M 6282M 28936 S  0.0  3.2 18:20.21 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30455 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  5:45.58 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30456 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  8:30.74 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30457 ceph       20   0 7065M 6282M 28936 S  0.4  3.2 23:07.64 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30458 ceph       20   0 7065M 6282M 28936 S  0.4  3.2  4:53.78 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30459 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  7:32.11 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30460 ceph       20   0 7065M 6282M 28936 S  0.4  3.2 11:50.66 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30461 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:04.07 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30462 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:03.63 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30463 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:36.13 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30464 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30465 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30466 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.18 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30467 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30468 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30469 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
30470 ceph       20   0 7065M 6282M 28936 S  0.0  3.2  0:00.04 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
 
I wonder if it's related to the following thread?

https://forum.proxmox.com/threads/zfs-memory-issues-leak.37878/#post-188934

The thread mentions ZFS, but later on you'll see it turns out not to be ZFS but probably the VirtIO network card driver. It's a long shot, but it might be related to this. It might be worth trying the 4.10 kernel, as 4.13 is very problematic at the moment.

I run a virtual router and on the 4.13 kernel my RAM filled up. I'm back on the 4.10 kernel, because Windows also BSODs on the 4.13 kernel, and so far all is good.
 

We could not reproduce the memory leak you reported (so far), but there is an upstream discussion about a vhost performance regression in 4.13 which also has an associated leak; that probably deserves a closer look.

Regarding the blue screen issue, we identified a likely culprit and there is a 4.13.8-2-pve kernel available on pvetest with a revert.
 
I would consider it very likely that the memory leaks are due to VirtIO network throughput. The virtual routers are Linux based and use VirtIO, and there is virtually no disk I/O... Memory utilisation is directly related to network throughput:
(attachment: kvm5a_memory_and_network.jpg)


Fabian, I'll install 4.13.8-2-pve on one of the nodes where we experienced this problem and report back whether or not it causes the VM to crash...
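
Roughly what I plan to run on that node, assuming the usual pvetest repository layout for PVE 5 on Debian Stretch and the standard pve-kernel package naming (please correct me if the package name differs):
Code:
# enable the pvetest repository (PVE 5.x / Debian Stretch)
echo "deb http://download.proxmox.com/debian/pve stretch pvetest" > /etc/apt/sources.list.d/pvetest.list
apt-get update

# install the test kernel and reboot into it
apt-get install pve-kernel-4.13.8-2-pve
reboot

# after the reboot, confirm the running kernel
uname -r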
 
I took a quick glance at the vhost patches upstream, and feedback looks very promising. I'll spin up a test kernel tomorrow with two candidates applied and upload it somewhere for further testing.
 
Many thanks. We jumped ahead of ourselves a little: with bcache now supporting partitions (as of 4.13), we set up BlueStore OSDs on HDDs with SSD caching...

i.e. it is not possible for us to revert to 4.10.
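
For reference, the bcache layering on those OSDs looks roughly like the following (a sketch only; device names are purely illustrative):
Code:
# create a cache set on an SSD partition and attach a spinning disk as the backing device
make-bcache -C /dev/sdf1 -B /dev/sdb

# the combined device shows up as /dev/bcache0; switch it to writeback caching
echo writeback > /sys/block/bcache0/bcache/cache_mode

# the BlueStore OSD is then created on top of /dev/bcache0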

Teaser: FileStore OSDs with SSD journaling will generally perform better than BlueStore OSDs with RocksDB and the RocksDB WAL (write-ahead log) on SSDs. BlueStore OSDs avoid the double write, but there is no file system caching (slower) and no SSD journal to absorb random writes, so I would expect many people to experience an overall slower cluster. I have a post I'm finalising which details how to set up BlueStore OSDs using SSDs for RocksDB, the RocksDB WAL and SSD caching, to get a cluster running faster than FileStore OSDs with SSD journaling.
 
Thanks Fabian, appreciate everyone at Proxmox's work on sorting this out.

I've got a host I can try a new 4.13 kernel on; the main one will need to stay on 4.10 until it's certain the issue is fixed, as I need that workload to stay running.
 
Hi,
off topic here, but you know that a separate WAL device in addition to a DB device on SSD makes no sense? http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
In short: if enough fast SSD storage is available, putting the DB on the SSD is enough.

Anyway, I'm interested in your howto.

Udo
 
Hi Udo,

My interpretation of the information available on BlueStore is that RocksDB's WAL will co-exist on the faster device when one designates an SSD partition for RocksDB while creating an HDD OSD. If one has NVMe, an SSD and a large slow HDD, one could place the data on the HDD, RocksDB on the SSD and RocksDB's WAL on the NVMe.
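
As an illustration of that layering, something along these lines should work with the Luminous-era ceph-disk tooling (a sketch only; device names are hypothetical, and if --block.wal is omitted the WAL simply lives alongside the DB):
Code:
# data on the slow HDD, RocksDB on an SSD partition, RocksDB WAL on an NVMe partition
ceph-disk prepare --bluestore /dev/sdb --block.db /dev/sdc1 --block.wal /dev/nvme0n1p1

# without --block.wal the WAL co-exists with the DB on the faster device
ceph-disk prepare --bluestore /dev/sdb --block.db /dev/sdc1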

The pain point I discovered coming from FileStore HDD OSDs with journals on SSD partitions (2:1 ratio) is that BlueStore is slower... A lot slower... This is due to us running a number of virtual Check Point vSEC security gateways (kernel 2.6.18). Kernels earlier than 2.6.32 do not send flush commands, so RBD continues to operate in writethrough mode and never switches to writeback. This results in a constant stream of tiny writes from these VMs that need to be committed to stable storage.

My initial understanding was that BlueStore OSDs would be faster, as data would only be written once. Whilst this is true, these tiny writes (about 1600 IOPS, totalling a measly 8 MB/s) constantly interrupt sequential writes on the HDDs and subsequently cause the hosts and VMs to spend a lot of their time waiting on storage I/O requests.
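
For context, the writethrough-until-flush behaviour described above corresponds to these librbd cache options (a sketch of the relevant ceph.conf [client] settings; to my understanding these are already the defaults, and note they apply to librbd rather than KRBD mappings):
Code:
[client]
# enable the librbd cache
rbd cache = true
# stay in writethrough mode until the guest issues its first flush;
# kernels older than 2.6.32 never flush, so the cache never switches to writeback
rbd cache writethrough until flush = true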
 
One can view individual block storage device activity by running 'apt-get install sysstat' and then watching the output of the following command:
Code:
iostat -xd sda sdb sdc sdd sde sdf 2 120

i.e. show stats at 2-second intervals, 120 times; rkB/s and wkB/s are the data volumes read from and written to the various block devices.

PS: Ignore the first printout; the subsequent ones show the activity since the previous sample.

PS: You can also simply run 'iostat -xd 2 120' to find individual VMs that are generating a lot of storage reads and/or writes. This assumes you use the kernel RBD module (aka 'KRBD', i.e. 'krbd 1' in /etc/pve/storage.cfg).
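
To tie the device names iostat reports back to individual VM disks when using KRBD, 'rbd showmapped' lists which RBD image backs each /dev/rbdX device, so you can then watch only the devices of interest:
Code:
# list mapped RBD images and their /dev/rbdX devices
rbd showmapped

# then watch just those devices, e.g.
iostat -xd rbd0 rbd1 2 120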
 
This thread has become somewhat convoluted and covers three separate topics.

The memory issue appears to have been identified upstream, and Fabian will hopefully provide a kernel with the patches applied today...
 
Just a short update: the memory leak patch was rejected upstream in favor of an upcoming alternative patch - let's hope it does not take too long!
 
