Intel Skylake video memory purge kills OSD process

Dan Nicolae

We have just updated to the latest version, Proxmox 4.4.5, and that is when the problem started.
Our configuration is a Ceph cluster of 6 servers, 3 of them with Intel Skylake CPUs. On those Skylake-based servers we see this:

Jan 4 09:32:20 ceph07 kernel: [139775.594411] Purging GPU memory, 0 bytes freed, 131072 bytes still pinned.
Jan 4 09:32:20 ceph07 kernel: [139775.594413] 3219456 and 0 bytes still available in the bound and unbound GPU page lists.
Jan 4 09:32:20 ceph07 kernel: [139775.594623] ksmtuned invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
Jan 4 09:32:20 ceph07 kernel: [139775.594623] ksmtuned cpuset=/ mems_allowed=0
Jan 4 09:32:20 ceph07 kernel: [139775.594627] CPU: 0 PID: 1345 Comm: ksmtuned Tainted: P O 4.4.35-1-pve #1
Jan 4 09:32:20 ceph07 kernel: [139775.594627] Hardware name: System manufacturer System Product Name/B150M-K, BIOS 1801 05/11/2016
Jan 4 09:32:20 ceph07 kernel: [139775.594628] 0000000000000286 0000000008f31c30 ffff880229223b50 ffffffff813f9743
Jan 4 09:32:20 ceph07 kernel: [139775.594630] ffff880229223d40 0000000000000000 ffff880229223bb8 ffffffff8120adcb
Jan 4 09:32:20 ceph07 kernel: [139775.594631] 0000000008f31c30 00000000ffffffff 0000000000000000 0000000000000000


.......


Jan 4 09:32:20 ceph07 kernel: [139775.594769] Out of memory: Kill process 133424 (ceph-osd) score 33 or sacrifice child
Jan 4 09:32:20 ceph07 kernel: [139775.594939] Killed process 133424 (ceph-osd) total-vm:1752220kB, anon-rss:517284kB, file-rss:18788kB
Jan 4 09:32:20 ceph07 bash[133421]: /bin/bash: line 1: 133424 Killed /usr/bin/ceph-osd -i 10 --pid-file /var/run/ceph/osd.10.pid -c /etc/pve/ceph.conf --cluster ceph -f

Jan 4 09:32:20 ceph07 systemd[1]: ceph-osd.10.1483511774.015513636.service: main process exited, code=exited, status=137/n/a
Jan 4 09:32:27 ceph07 pmxcfs[1439]: [status] notice: received log
Jan 4 09:32:58 ceph07 bash[133684]: 2017-01-04 09:32:58.174642 7f4d9cf0c700 -1 osd.11 30648 heartbeat_check: no reply from osd.10 since back 2017-01-04 09:32:17.968612 front 2017-01-04 09:32:17.968612 (cutoff 2017-01-04 09:32:18.174640)
Jan 4 09:32:59 ceph07 bash[133684]: 2017-01-04 09:32:59.174851 7f4d9cf0c700 -1 osd.11 30648 heartbeat_check: no reply from osd.10 since back 2017-01-04 09:32:17.968612 front 2017-01-04 09:32:17.968612 (cutoff 2017-01-04 09:32:19.174849)
Jan 4 09:32:59 ceph07 bash[133684]: 2017-01-04 09:32:59.274507 7f4d6ec11700 -1 osd.11 30648 heartbeat_check: no reply from osd.10 since back 2017-01-04 09:32:17.968612 front 2017-01-04 09:32:17.968612 (cutoff 2017-01-04 09:32:19.274506)

I know this is not a Proxmox/Ceph problem, but I would appreciate any suggestions.
 
Could it be that 8GB of RAM is not enough for a node with 2 OSD drives (HDD)? In the summary area of the Proxmox dashboard it says that 1.74GB of 7.68GB is in use.
 
After some hours of hell, I came to a conclusion that could help someone in the same situation.

Our Ceph cluster has 6 nodes, each with 2 OSDs (2TB HDDs). Four of them have 16GB of RAM, the other two only 8GB each.
There are no virtual machines running on the Ceph nodes.

Based on the error above, at first we thought the problem was the Skylake CPU (its integrated GPU) causing a memory leak. But the problem appeared only on the 8GB nodes, not on the Skylake-based nodes in general (a Skylake node with 16GB worked fine).

We installed another 8GB module in the nodes that had only 8GB (so that all Ceph nodes now have 16GB) and, guess what: problem solved.

Our conclusion,

According to the Ceph guidelines, an OSD node needs roughly 1GB of RAM for each 1TB of OSD capacity. Our nodes have 2 OSDs of 2TB each, so by the guidelines 4GB of RAM is needed for the OSDs alone. With the previous version of Proxmox/Ceph everything looked fine with 8GB of RAM in this configuration. After the update/upgrade, 8GB of RAM is no longer sufficient to run the services in a stable and safe way.
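
Just to make the arithmetic explicit (a rough sketch using only the 1GB-per-1TB rule of thumb; it does not count the OS, monitors or recovery overhead):

# assumption: ~1GB RAM per 1TB of raw OSD capacity (Ceph rule of thumb)
osds=2          # OSDs per node
tb_per_osd=2    # TB per OSD
echo "$((osds * tb_per_osd)) GB RAM for the OSDs alone"   # -> 4 GB, so 8GB total leaves little headroom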

The weird thing is that the Proxmox dashboard, in the summary area, never displayed more than 2GB of RAM in use...

Hope it helps.
 
Unfortunately, the problem persists. Not as often as before adding RAM, but it still appears from time to time.
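
To catch the moments when it happens, I am just grepping the kernel log for the OOM killer (standard Linux commands, nothing Ceph-specific):

# look for OOM kills in the kernel log and syslog
dmesg -T | grep -i "out of memory"
grep -iE "oom-killer|Out of memory" /var/log/syslog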
 
I guess it is a bug. Each ceph-osd consumes less than 500MB of RAM. There are 2 OSDs and 16GB of memory; it should be sufficient.

root@ceph05:~# ceph tell osd.6 heap stats
osd.6 tcmalloc heap stats:------------------------------------------------
MALLOC: 282135536 ( 269.1 MiB) Bytes in use by application
MALLOC: + 2416640 ( 2.3 MiB) Bytes in page heap freelist
MALLOC: + 58514832 ( 55.8 MiB) Bytes in central cache freelist
MALLOC: + 315376 ( 0.3 MiB) Bytes in transfer cache freelist
MALLOC: + 41797264 ( 39.9 MiB) Bytes in thread cache freelists
MALLOC: + 3240096 ( 3.1 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 388419744 ( 370.4 MiB) Actual memory used (physical + swap)
MALLOC: + 160595968 ( 153.2 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 549015712 ( 523.6 MiB) Virtual address space used
MALLOC:
MALLOC: 21061 Spans in use
MALLOC: 232 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
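
As the last two lines suggest, the freelist memory can be handed back to the OS manually. As a temporary workaround (not a fix for the OOM itself), the OSD can be asked to release it:

root@ceph05:~# ceph tell osd.6 heap release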
 
Hello, Fabian. Thanks for the answer. Today I found that topic and updated the kernel. I hope that it will be OK.
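
For anyone else hitting this, updating the kernel on Proxmox 4.x is just the standard dist-upgrade (a generic sketch; the exact pve-kernel version pulled in depends on your repositories):

apt-get update
apt-get dist-upgrade      # pulls in the newer pve-kernel package
# reboot, then confirm the running kernel:
uname -r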
 
Our cluster uses Ceph as storage, and this bug caused a lot of partition corruption, some of it impossible to recover, with data loss as the result. A lot of pain... :(
 
