Linux 4.4.35-1-pve oom-killer kills Ceph-OSD Process

lrab

Since updating to the new 4.4 kernel (Linux 4.4.35-1-pve #1 SMP Thu Dec 22 14:58:39 CET 2016), we occasionally see the oom-killer killing ceph-osd processes:

ceph02 kernel: [171428.546446] Out of memory: Kill process 6712 (ceph-osd) score 36 or sacrifice child

The stack trace in kernel.log always shows the InfiniBand driver involved in triggering the oom-killer. Everything worked fine with the previous kernel version. Any ideas?
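
For what it's worth, the order=3 in the trace below means the mlx4 ICM code is asking for 2^3 = 8 physically contiguous pages (32 KiB), and such higher-order allocations can fail on a fragmented box even when plenty of memory is free overall. A minimal diagnostic sketch (just a helper we put together, not an official tool) to check per-zone availability of order>=3 blocks via /proc/buddyinfo:

Code:
#!/usr/bin/env python3
# /proc/buddyinfo lists, per memory zone, how many free blocks exist at each
# order (order N = 2^N contiguous pages). The mlx4 allocation in the trace
# needs an order-3 block, so zeros from column 3 onward mean it must fail.
with open("/proc/buddyinfo") as f:
    for line in f:
        parts = line.split()            # "Node 0, zone Normal  c0 c1 c2 ..."
        node = parts[1].rstrip(",")
        zone = parts[3]
        counts = [int(c) for c in parts[4:]]
        print(f"node {node} zone {zone:8s} free order>=3 blocks: {sum(counts[3:])}")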

Code:
Jan  5 16:01:28 ceph02 kernel: [171428.546083] kworker/u48:2 invoked oom-killer: gfp_mask=0x24202c2, order=3, oom_score_adj=0
Jan  5 16:01:28 ceph02 kernel: [171428.546088] kworker/u48:2 cpuset=/ mems_allowed=0-1
Jan  5 16:01:28 ceph02 kernel: [171428.546094] CPU: 7 PID: 21193 Comm: kworker/u48:2 Tainted: P          IO    4.4.35-1-pve #1

...

Jan  5 16:01:28 ceph02 kernel: [171428.546103] Workqueue: ipoib_wq ipoib_cm_tx_start [ib_ipoib]
Jan  5 16:01:28 ceph02 kernel: [171428.546105]  0000000000000286 00000000e5898a76 ffff88032b3f3770 ffffffff813f9743
Jan  5 16:01:28 ceph02 kernel: [171428.546107]  ffff88032b3f3960 0000000000000000 ffff88032b3f37d8 ffffffff8120adcb
Jan  5 16:01:28 ceph02 kernel: [171428.546109]  ffff88043f6dae10 ffffea00108e08c0 0000000100000001 ffff88043f6d7300
Jan  5 16:01:28 ceph02 kernel: [171428.546112] Call Trace:
Jan  5 16:01:28 ceph02 kernel: [171428.546121]  [<ffffffff813f9743>] dump_stack+0x63/0x90
Jan  5 16:01:28 ceph02 kernel: [171428.546126]  [<ffffffff8120adcb>] dump_header+0x67/0x1d5
Jan  5 16:01:28 ceph02 kernel: [171428.546130]  [<ffffffff811925c5>] oom_kill_process+0x205/0x3c0
Jan  5 16:01:28 ceph02 kernel: [171428.546132]  [<ffffffff81192a17>] out_of_memory+0x237/0x4a0
Jan  5 16:01:28 ceph02 kernel: [171428.546136]  [<ffffffff81198d0e>] __alloc_pages_nodemask+0xcee/0xe20
Jan  5 16:01:28 ceph02 kernel: [171428.546148]  [<ffffffffc0051cbc>] mlx4_alloc_icm+0x32c/0x5f0 [mlx4_core]
Jan  5 16:01:28 ceph02 kernel: [171428.546156]  [<ffffffffc005207f>] mlx4_table_get+0x9f/0x110 [mlx4_core]
Jan  5 16:01:28 ceph02 kernel: [171428.546165]  [<ffffffffc0062df6>] __mlx4_qp_alloc_icm+0xf6/0x120 [mlx4_core]
Jan  5 16:01:28 ceph02 kernel: [171428.546173]  [<ffffffffc0062f71>] mlx4_qp_alloc+0x51/0x150 [mlx4_core]
Jan  5 16:01:28 ceph02 kernel: [171428.546182]  [<ffffffffc0062b3d>] ? __mlx4_qp_reserve_range+0x4d/0x80 [mlx4_core]
Jan  5 16:01:28 ceph02 kernel: [171428.546187]  [<ffffffffc04fba24>] create_qp_common.isra.27+0x644/0xef0 [mlx4_ib]
Jan  5 16:01:28 ceph02 kernel: [171428.546191]  [<ffffffff811ed459>] ? kmem_cache_alloc_trace+0x1e9/0x210
Jan  5 16:01:28 ceph02 kernel: [171428.546195]  [<ffffffffc04fc4ee>] mlx4_ib_create_qp+0x21e/0x2e0 [mlx4_ib]
Jan  5 16:01:28 ceph02 kernel: [171428.546203]  [<ffffffffc01aab02>] ib_create_qp+0x32/0x1c0 [ib_core]
Jan  5 16:01:28 ceph02 kernel: [171428.546206]  [<ffffffffc0525037>] ipoib_cm_tx_init+0xf7/0x370 [ib_ipoib]
Jan  5 16:01:28 ceph02 kernel: [171428.546209]  [<ffffffff811cf09c>] ? vunmap_page_range+0x20c/0x330
Jan  5 16:01:28 ceph02 kernel: [171428.546213]  [<ffffffffc05272b9>] ipoib_cm_tx_start+0x259/0x400 [ib_ipoib]
Jan  5 16:01:28 ceph02 kernel: [171428.546217]  [<ffffffff81085d4b>] ? __local_bh_enable_ip+0x8b/0x90
Jan  5 16:01:28 ceph02 kernel: [171428.546220]  [<ffffffff8109b148>] process_one_work+0x158/0x420
Jan  5 16:01:28 ceph02 kernel: [171428.546222]  [<ffffffff8109bc29>] worker_thread+0x69/0x480
Jan  5 16:01:28 ceph02 kernel: [171428.546224]  [<ffffffff8109bbc0>] ? rescuer_thread+0x330/0x330
Jan  5 16:01:28 ceph02 kernel: [171428.546227]  [<ffffffff810a126a>] kthread+0xea/0x100
Jan  5 16:01:28 ceph02 kernel: [171428.546228]  [<ffffffff810a1180>] ? kthread_park+0x60/0x60
Jan  5 16:01:28 ceph02 kernel: [171428.546232]  [<ffffffff8185c60f>] ret_from_fork+0x3f/0x70
Jan  5 16:01:28 ceph02 kernel: [171428.546234]  [<ffffffff810a1180>] ? kthread_park+0x60/0x60
Jan  5 16:01:28 ceph02 kernel: [171428.546235] Mem-Info:
Jan  5 16:01:28 ceph02 kernel: [171428.546241] active_anon:278696 inactive_anon:281339 isolated_anon:50
Jan  5 16:01:28 ceph02 kernel: [171428.546241]  active_file:1628643 inactive_file:1610034 isolated_file:50
Jan  5 16:01:28 ceph02 kernel: [171428.546241]  unevictable:880 dirty:43196 writeback:0 unstable:0
Jan  5 16:01:28 ceph02 kernel: [171428.546241]  slab_reclaimable:105675 slab_unreclaimable:49398
Jan  5 16:01:28 ceph02 kernel: [171428.546241]  mapped:23963 shmem:16203 pagetables:4195 bounce:0
Jan  5 16:01:28 ceph02 kernel: [171428.546241]  free:21632 free_pcp:65 free_cma:0

...

Jan  5 16:01:28 ceph02 kernel: [171428.546446] Out of memory: Kill process 6712 (ceph-osd) score 36 or sacrifice child
Jan  5 16:01:28 ceph02 kernel: [171428.546669] Killed process 6712 (ceph-osd) total-vm:1700052kB, anon-rss:822856kB, file-rss:14048k
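
A possible stop-gap until the allocation failure itself is understood: lower oom_score_adj for the OSD processes so the OOM killer prefers other victims. A rough sketch below; the -500 value is an assumption to tune (-1000 would exempt a process entirely), and on systemd hosts the OOMScoreAdjust= unit directive would be the cleaner, persistent way to do the same thing.

Code:
#!/usr/bin/env python3
# Stop-gap sketch (run as root): lower oom_score_adj for every running
# ceph-osd so the OOM killer prefers other victims. This does not fix the
# underlying order-3 allocation failure, it only shifts the blast radius.
import os

ADJ = "-500"  # assumption: -1000 would exempt the process from OOM kills entirely

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/comm") as f:
            comm = f.read().strip()
        if comm == "ceph-osd":
            with open(f"/proc/{pid}/oom_score_adj", "w") as f:
                f.write(ADJ)
            print(f"pid {pid}: oom_score_adj set to {ADJ}")
    except OSError:
        continue  # process exited meanwhile, or insufficient privileges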
 
