We had just updated to the latest version of Proxmox, 4.4.5, when the problem started.
Our setup is a Ceph cluster of 6 servers, 3 of which have Intel Skylake CPUs. On those Skylake-based servers we see this:
Jan 4 09:32:20 ceph07 kernel: [139775.594411] Purging GPU memory, 0 bytes freed, 131072 bytes still pinned.
Jan 4 09:32:20 ceph07 kernel: [139775.594413] 3219456 and 0 bytes still available in the bound and unbound GPU page lists.
Jan 4 09:32:20 ceph07 kernel: [139775.594623] ksmtuned invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
Jan 4 09:32:20 ceph07 kernel: [139775.594623] ksmtuned cpuset=/ mems_allowed=0
Jan 4 09:32:20 ceph07 kernel: [139775.594627] CPU: 0 PID: 1345 Comm: ksmtuned Tainted: P O 4.4.35-1-pve #1
Jan 4 09:32:20 ceph07 kernel: [139775.594627] Hardware name: System manufacturer System Product Name/B150M-K, BIOS 1801 05/11/2016
Jan 4 09:32:20 ceph07 kernel: [139775.594628] 0000000000000286 0000000008f31c30 ffff880229223b50 ffffffff813f9743
Jan 4 09:32:20 ceph07 kernel: [139775.594630] ffff880229223d40 0000000000000000 ffff880229223bb8 ffffffff8120adcb
Jan 4 09:32:20 ceph07 kernel: [139775.594631] 0000000008f31c30 00000000ffffffff 0000000000000000 0000000000000000
.......
Jan 4 09:32:20 ceph07 kernel: [139775.594769] Out of memory: Kill process 133424 (ceph-osd) score 33 or sacrifice child
Jan 4 09:32:20 ceph07 kernel: [139775.594939] Killed process 133424 (ceph-osd) total-vm:1752220kB, anon-rss:517284kB, file-rss:18788kB
Jan 4 09:32:20 ceph07 bash[133421]: /bin/bash: line 1: 133424 Killed /usr/bin/ceph-osd -i 10 --pid-file /var/run/ceph/osd.10.pid -c /etc/pve/ceph.conf --cluster ceph -f
Jan 4 09:32:20 ceph07 systemd[1]: ceph-osd.10.1483511774.015513636.service: main process exited, code=exited, status=137/n/a
Jan 4 09:32:27 ceph07 pmxcfs[1439]: [status] notice: received log
Jan 4 09:32:58 ceph07 bash[133684]: 2017-01-04 09:32:58.174642 7f4d9cf0c700 -1 osd.11 30648 heartbeat_check: no reply from osd.10 since back 2017-01-04 09:32:17.968612 front 2017-01-04 09:32:17.968612 (cutoff 2017-01-04 09:32:18.174640)
Jan 4 09:32:59 ceph07 bash[133684]: 2017-01-04 09:32:59.174851 7f4d9cf0c700 -1 osd.11 30648 heartbeat_check: no reply from osd.10 since back 2017-01-04 09:32:17.968612 front 2017-01-04 09:32:17.968612 (cutoff 2017-01-04 09:32:19.174849)
Jan 4 09:32:59 ceph07 bash[133684]: 2017-01-04 09:32:59.274507 7f4d6ec11700 -1 osd.11 30648 heartbeat_check: no reply from osd.10 since back 2017-01-04 09:32:17.968612 front 2017-01-04 09:32:17.968612 (cutoff 2017-01-04 09:32:19.274506)
I know this is not a Proxmox/Ceph problem, but I would appreciate any suggestions.
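For context, the OOM killer fired on an order=2 request (a contiguous 16 KiB allocation), which can fail due to memory fragmentation even while plenty of memory is nominally free, and the trigger was ksmtuned. Below is a small diagnostic sketch we can run on the affected hosts; it only reads standard Linux proc/sysfs interfaces, and the KSM path may not exist on every kernel build:

```shell
#!/bin/sh
# Free-page counts per zone, broken down by allocation order.
# Mostly-zero columns toward the right indicate fragmentation,
# which is what makes order=2 allocations fail.
cat /proc/buddyinfo

# How much memory the kernel keeps in reserve; raising this can
# leave more headroom for higher-order allocations.
cat /proc/sys/vm/min_free_kbytes

# Whether KSM is currently running (1 = active). ksmtuned is the
# daemon that toggles this knob; the file is absent if the kernel
# was built without KSM.
cat /sys/kernel/mm/ksm/run 2>/dev/null || echo "KSM interface not present"
```

If fragmentation turns out to be the issue, the usual knobs to experiment with are `vm.min_free_kbytes` and disabling ksmtuned on the Ceph-only nodes, but that is a judgment call for each cluster.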