Since we updated to the new kernel (Linux 4.4.35-1-pve #1 SMP Thu Dec 22 14:58:39 CET 2016) and the 4.4 environment, we occasionally see the oom-killer killing Ceph OSD processes:
ceph02 kernel: [171428.546446] Out of memory: Kill process 6712 (ceph-osd) score 36 or sacrifice child
The stack trace in kernel.log always shows the InfiniBand driver stack (ipoib_cm_tx_start in ib_ipoib, allocating through mlx4_core) as the code path that triggers the oom-killer. This setup worked fine with the previous kernel version. Any ideas?
Code:
Jan 5 16:01:28 ceph02 kernel: [171428.546083] kworker/u48:2 invoked oom-killer: gfp_mask=0x24202c2, order=3, oom_score_adj=0
Jan 5 16:01:28 ceph02 kernel: [171428.546088] kworker/u48:2 cpuset=/ mems_allowed=0-1
Jan 5 16:01:28 ceph02 kernel: [171428.546094] CPU: 7 PID: 21193 Comm: kworker/u48:2 Tainted: P IO 4.4.35-1-pve #1
...
Jan 5 16:01:28 ceph02 kernel: [171428.546103] Workqueue: ipoib_wq ipoib_cm_tx_start [ib_ipoib]
Jan 5 16:01:28 ceph02 kernel: [171428.546105] 0000000000000286 00000000e5898a76 ffff88032b3f3770 ffffffff813f9743
Jan 5 16:01:28 ceph02 kernel: [171428.546107] ffff88032b3f3960 0000000000000000 ffff88032b3f37d8 ffffffff8120adcb
Jan 5 16:01:28 ceph02 kernel: [171428.546109] ffff88043f6dae10 ffffea00108e08c0 0000000100000001 ffff88043f6d7300
Jan 5 16:01:28 ceph02 kernel: [171428.546112] Call Trace:
Jan 5 16:01:28 ceph02 kernel: [171428.546121] [<ffffffff813f9743>] dump_stack+0x63/0x90
Jan 5 16:01:28 ceph02 kernel: [171428.546126] [<ffffffff8120adcb>] dump_header+0x67/0x1d5
Jan 5 16:01:28 ceph02 kernel: [171428.546130] [<ffffffff811925c5>] oom_kill_process+0x205/0x3c0
Jan 5 16:01:28 ceph02 kernel: [171428.546132] [<ffffffff81192a17>] out_of_memory+0x237/0x4a0
Jan 5 16:01:28 ceph02 kernel: [171428.546136] [<ffffffff81198d0e>] __alloc_pages_nodemask+0xcee/0xe20
Jan 5 16:01:28 ceph02 kernel: [171428.546148] [<ffffffffc0051cbc>] mlx4_alloc_icm+0x32c/0x5f0 [mlx4_core]
Jan 5 16:01:28 ceph02 kernel: [171428.546156] [<ffffffffc005207f>] mlx4_table_get+0x9f/0x110 [mlx4_core]
Jan 5 16:01:28 ceph02 kernel: [171428.546165] [<ffffffffc0062df6>] __mlx4_qp_alloc_icm+0xf6/0x120 [mlx4_core]
Jan 5 16:01:28 ceph02 kernel: [171428.546173] [<ffffffffc0062f71>] mlx4_qp_alloc+0x51/0x150 [mlx4_core]
Jan 5 16:01:28 ceph02 kernel: [171428.546182] [<ffffffffc0062b3d>] ? __mlx4_qp_reserve_range+0x4d/0x80 [mlx4_core]
Jan 5 16:01:28 ceph02 kernel: [171428.546187] [<ffffffffc04fba24>] create_qp_common.isra.27+0x644/0xef0 [mlx4_ib]
Jan 5 16:01:28 ceph02 kernel: [171428.546191] [<ffffffff811ed459>] ? kmem_cache_alloc_trace+0x1e9/0x210
Jan 5 16:01:28 ceph02 kernel: [171428.546195] [<ffffffffc04fc4ee>] mlx4_ib_create_qp+0x21e/0x2e0 [mlx4_ib]
Jan 5 16:01:28 ceph02 kernel: [171428.546203] [<ffffffffc01aab02>] ib_create_qp+0x32/0x1c0 [ib_core]
Jan 5 16:01:28 ceph02 kernel: [171428.546206] [<ffffffffc0525037>] ipoib_cm_tx_init+0xf7/0x370 [ib_ipoib]
Jan 5 16:01:28 ceph02 kernel: [171428.546209] [<ffffffff811cf09c>] ? vunmap_page_range+0x20c/0x330
Jan 5 16:01:28 ceph02 kernel: [171428.546213] [<ffffffffc05272b9>] ipoib_cm_tx_start+0x259/0x400 [ib_ipoib]
Jan 5 16:01:28 ceph02 kernel: [171428.546217] [<ffffffff81085d4b>] ? __local_bh_enable_ip+0x8b/0x90
Jan 5 16:01:28 ceph02 kernel: [171428.546220] [<ffffffff8109b148>] process_one_work+0x158/0x420
Jan 5 16:01:28 ceph02 kernel: [171428.546222] [<ffffffff8109bc29>] worker_thread+0x69/0x480
Jan 5 16:01:28 ceph02 kernel: [171428.546224] [<ffffffff8109bbc0>] ? rescuer_thread+0x330/0x330
Jan 5 16:01:28 ceph02 kernel: [171428.546227] [<ffffffff810a126a>] kthread+0xea/0x100
Jan 5 16:01:28 ceph02 kernel: [171428.546228] [<ffffffff810a1180>] ? kthread_park+0x60/0x60
Jan 5 16:01:28 ceph02 kernel: [171428.546232] [<ffffffff8185c60f>] ret_from_fork+0x3f/0x70
Jan 5 16:01:28 ceph02 kernel: [171428.546234] [<ffffffff810a1180>] ? kthread_park+0x60/0x60
Jan 5 16:01:28 ceph02 kernel: [171428.546235] Mem-Info:
Jan 5 16:01:28 ceph02 kernel: [171428.546241] active_anon:278696 inactive_anon:281339 isolated_anon:50
Jan 5 16:01:28 ceph02 kernel: [171428.546241] active_file:1628643 inactive_file:1610034 isolated_file:50
Jan 5 16:01:28 ceph02 kernel: [171428.546241] unevictable:880 dirty:43196 writeback:0 unstable:0
Jan 5 16:01:28 ceph02 kernel: [171428.546241] slab_reclaimable:105675 slab_unreclaimable:49398
Jan 5 16:01:28 ceph02 kernel: [171428.546241] mapped:23963 shmem:16203 pagetables:4195 bounce:0
Jan 5 16:01:28 ceph02 kernel: [171428.546241] free:21632 free_pcp:65 free_cma:0
...
Jan 5 16:01:28 ceph02 kernel: [171428.546446] Out of memory: Kill process 6712 (ceph-osd) score 36 or sacrifice child
Jan 5 16:01:28 ceph02 kernel: [171428.546669] Killed process 6712 (ceph-osd) total-vm:1700052kB, anon-rss:822856kB, file-rss:14048k
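
To make the numbers in the log easier to read, here is a small sketch that pulls the allocation order and the Mem-Info page counters out of lines like the ones above and converts them to sizes. It assumes 4 KiB pages (the usual x86_64 default; not stated in the log), and the function names are made up for illustration. An order=3 request asks for 2^3 = 8 physically contiguous pages, so with that page size the failing allocation is 32 KiB, while free:21632 works out to about 84.5 MiB. That may point to fragmentation rather than the node being truly out of memory, but that reading is an assumption, not something the log proves.

```python
import re

PAGE_SIZE_KIB = 4  # assumption: 4 KiB pages (x86_64 default), not stated in the log


def parse_oom_header(line):
    """Extract gfp_mask and order from an 'invoked oom-killer' log line."""
    m = re.search(r"gfp_mask=(0x[0-9a-fA-F]+), order=(\d+)", line)
    if m is None:
        return None
    return {"gfp_mask": int(m.group(1), 16), "order": int(m.group(2))}


def parse_page_counters(lines):
    """Collect 'name:count' pairs from Mem-Info lines into one dict."""
    counters = {}
    for line in lines:
        # Drop the syslog prefix up to and including the '[timestamp] ' part,
        # so the HH:MM:SS clock is not mistaken for a counter.
        body = line.split("] ", 1)[-1]
        for name, value in re.findall(r"(\w+):(\d+)", body):
            counters[name] = int(value)
    return counters


def contiguous_kib(order):
    """Size of one allocation of the given order: 2**order contiguous pages."""
    return (1 << order) * PAGE_SIZE_KIB


def pages_to_mib(pages):
    """Convert a page count to MiB using the assumed page size."""
    return pages * PAGE_SIZE_KIB / 1024


if __name__ == "__main__":
    header = parse_oom_header(
        "Jan 5 16:01:28 ceph02 kernel: [171428.546083] kworker/u48:2 invoked "
        "oom-killer: gfp_mask=0x24202c2, order=3, oom_score_adj=0"
    )
    counters = parse_page_counters([
        "Jan 5 16:01:28 ceph02 kernel: [171428.546241] active_file:1628643 "
        "inactive_file:1610034 isolated_file:50",
        "Jan 5 16:01:28 ceph02 kernel: [171428.546241] free:21632 free_pcp:65 free_cma:0",
    ])
    print("requested contiguous KiB:", contiguous_kib(header["order"]))
    print("free MiB:", pages_to_mib(counters["free"]))
    file_pages = counters["active_file"] + counters["inactive_file"]
    print("file cache MiB:", round(pages_to_mib(file_pages), 1))
```

Feeding it the full Mem-Info block from kernel.log gives all counters in one dict, which makes it easier to compare several oom-killer events side by side.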