Ceph kernel 4.4.8 bug

homozavrus

Hi guys.
Today we hit a strange kernel bug, after which one of our OSDs went down and its process hung.
Stack trace:
Jul 07 09:28:37 node12 kernel: divide error: 0000 [#1] SMP
Jul 07 09:28:37 node12 kernel: Modules linked in: veth rbd libceph nfsv3 rpcsec_gss_krb5 nfsv4 binfmt_misc ip_set ip6table_filter ip6_tables iptable_filter ip_tables x_tables softdog nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi 8021q garp mrp bonding nfnetlink_log nfnetlink xfs libcrc32c ipmi_ssif intel_rapl x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul cryptd ast ttm drm_kms_helper drm i2c_algo_bit fb_sys_fops syscopyarea sysfillrect snd_pcm sysimgblt snd_timer snd soundcore sb_edac pcspkr mei_me edac_core joydev input_leds i2c_i801 lpc_ich mei shpchp ioatdma wmi ipmi_si 8250_fintek ipmi_msghandler mac_hid acpi_power_meter vhost_net vhost macvtap macvlan autofs4
Jul 07 09:28:37 node12 kernel: zfs(PO) zunicode(PO) zcommon(PO) znvpair(PO) spl(O) zavl(PO) hid_generic usbkbd usbmouse usbhid hid ixgbe(O) dca vxlan ahci mpt3sas ip6_udp_tunnel libahci udp_tunnel raid_class ptp scsi_transport_sas pps_core fjes
Jul 07 09:28:37 node12 kernel: CPU: 10 PID: 31581 Comm: ceph-osd Tainted: P O 4.4.8-1-pve #1
Jul 07 09:28:37 node12 kernel: Hardware name: Supermicro SYS-2028U-TNRT+/X10DRU-i+, BIOS 1.1 07/22/2015
Jul 07 09:28:37 node12 kernel: task: ffff881fee535280 ti: ffff881ef5268000 task.ti: ffff881ef5268000
Jul 07 09:28:37 node12 kernel: RIP: 0010:[<ffffffff810b598c>] [<ffffffff810b598c>] task_numa_find_cpu+0x2cc/0x710
Jul 07 09:28:37 node12 kernel: RSP: 0000:ffff881ef526bbd8 EFLAGS: 00010257
Jul 07 09:28:37 node12 kernel: RAX: 0000000000000000 RBX: ffff881ef526bc80 RCX: 000000000000000e
Jul 07 09:28:37 node12 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff883f8bdfb600
Jul 07 09:28:37 node12 kernel: RBP: ffff881ef526bc48 R08: 0000000000000009 R09: 0000000000000271
Jul 07 09:28:37 node12 kernel: R10: 0000000000000176 R11: 00000000000001b4 R12: ffff881f3fcba940
Jul 07 09:28:37 node12 kernel: R13: ffff883f8bdfb600 R14: 00000000000001b4 R15: 0000000000000011
Jul 07 09:28:37 node12 kernel: FS: 00007f945ec24700(0000) GS:ffff881fffa80000(0000) knlGS:0000000000000000
Jul 07 09:28:37 node12 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 07 09:28:37 node12 kernel: CR2: 00000000174212c0 CR3: 0000003f72832000 CR4: 00000000001426e0
Jul 07 09:28:37 node12 kernel: Stack:
Jul 07 09:28:37 node12 kernel: ffff881ef526bbe8 ffffffff81050b9d 0000000000000009 0000000000017180
Jul 07 09:28:37 node12 kernel: 000000000000000e 0000000000017180 00000000000000b9 fffffffffffffe3f
Jul 07 09:28:37 node12 kernel: 0000000000000009 ffff881fee535280 ffff881ef526bc80 0000000000000197
Jul 07 09:28:37 node12 kernel: Call Trace:
Jul 07 09:28:37 node12 kernel: [<ffffffff81050b9d>] ? native_smp_send_reschedule+0x4d/0x70
Jul 07 09:28:37 node12 kernel: [<ffffffff810b62b6>] task_numa_migrate+0x4e6/0xa00
Jul 07 09:28:37 node12 kernel: [<ffffffff810b6849>] numa_migrate_preferred+0x79/0x80
Jul 07 09:28:37 node12 kernel: [<ffffffff810bb348>] task_numa_fault+0x848/0xd10
Jul 07 09:28:37 node12 kernel: [<ffffffff810ba969>] ? should_numa_migrate_memory+0x59/0x130
Jul 07 09:28:37 node12 kernel: [<ffffffff811c03d4>] handle_mm_fault+0xc64/0x1a20
Jul 07 09:28:37 node12 kernel: [<ffffffff8170c3d4>] ? SYSC_recvfrom+0x144/0x160
Jul 07 09:28:37 node12 kernel: [<ffffffff818441aa>] ? __schedule+0x38a/0xa30
Jul 07 09:28:37 node12 kernel: [<ffffffff8106b4ed>] __do_page_fault+0x19d/0x410
Jul 07 09:28:37 node12 kernel: [<ffffffff8106b782>] do_page_fault+0x22/0x30
Jul 07 09:28:37 node12 kernel: [<ffffffff8184ab38>] page_fault+0x28/0x30
Jul 07 09:28:37 node12 kernel: Code: d0 4c 89 ef e8 26 d0 ff ff 49 8b 85 b0 00 00 00 49 8b 75 78 31 d2 49 0f af 84 24 d8 01 00 00 4c 8b 45 d0 48 8b 4d b0 48 83 c6 01 <48> f7 f6 4c 89 c6 48 89 da 48 8d 3c 01 48 29 c6 e8 9f cd ff ff
Jul 07 09:28:37 node12 kernel: RIP [<ffffffff810b598c>] task_numa_find_cpu+0x2cc/0x710
Jul 07 09:28:37 node12 kernel: RSP <ffff881ef526bbd8>
Jul 07 09:28:37 node12 kernel: ---[ end trace 1b119ce7b8e959c7 ]---

After some googling I found an active bug report against the Ubuntu 4.4-4.6 kernels:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1568729
It seems to be related to Ceph and the fair scheduler:
http://thread.gmane.org/gmane.comp.file-systems.ceph.user/30793/focus=30987
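
For what it's worth, the whole crash path in the trace (task_numa_fault -> task_numa_migrate -> task_numa_find_cpu) sits in the kernel's automatic NUMA balancing code, so my own guess (not verified) is that switching that feature off should at least dodge the crashing code path until a fixed kernel is available:

# sysctl kernel.numa_balancing
# sysctl -w kernel.numa_balancing=0

The second command only lasts until reboot; drop the setting into a file under /etc/sysctl.d/ to make it persistent.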

We have a 4-node Ceph/Proxmox cluster; all nodes run the same PVE version.
# pvecm status
Quorum information
------------------
Date: Thu Jul 7 17:20:18 2016
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000001
Ring ID: 8992
Quorate: Yes

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 172.16.110.20 (local)
0x00000002 1 172.16.110.30
0x00000003 1 172.16.110.40
0x00000004 1 172.16.110.50
# ceph -s
cluster 505fe7b7-0a8c-4adf-85bb-d82adda98a5a
health HEALTH_OK
monmap e4: 4 mons at {0=172.16.140.20:6789/0,1=172.16.140.30:6789/0,2=172.16.140.40:6789/0,3=172.16.140.50:6789/0}
election epoch 258, quorum 0,1,2,3 0,1,2,3
osdmap e1082: 64 osds: 64 up, 64 in
pgmap v3016545: 2112 pgs, 3 pools, 994 GB data, 252 kobjects
2952 GB used, 68552 GB / 71505 GB avail
2112 active+clean
client io 3021 B/s rd, 643 kB/s wr, 144 op/s

We caught this kernel panic today on:
# pveversion
pve-manager/4.2-11/2c626aa1 (running kernel: 4.4.8-1-pve)

Has anyone met the same bug?
Any chance of an upgrade to a patched kernel (once the bug is resolved by Ubuntu)?

P.S.
The bug is very rare; we first met it on one node after two months in production.
 
Hi There,

Unfortunately this bug is not rare :-\ If you are running a busy CEPH cluster or KVM environment and have dual sockets, in my experience you will hit it; it's just a matter of time. For us, we could only get about 4 hours of uptime before triggering it.

There is a fix for it in 4.7-rc6; it has not yet been backported to other kernel versions, but I can confirm this issue is resolved in 4.7-rc6.

If you are looking for an emergency temporary fix, you can disable all the CPU cores on one of your sockets; this stops the issue from triggering while you arrange to patch or upgrade your kernel.
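
A minimal sketch of that workaround, assuming the cores you want to drop all report physical_package_id 1 (check the mapping first; also note cpu0 can never be offlined, but it normally sits on package 0 anyway):

# grep . /sys/devices/system/cpu/cpu*/topology/physical_package_id
# for c in /sys/devices/system/cpu/cpu[0-9]*; do
>   [ "$(cat $c/topology/physical_package_id)" = "1" ] && echo 0 > $c/online
> done

This only lasts until the next reboot, which is fine for a stopgap.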

The relevant LKML thread is here:

https://lkml.org/lkml/2016/6/22/102
 
Hi There,
There is a fix for it in 4.7-rc6; it has not yet been backported to other kernel versions, but I can confirm this issue is resolved in 4.7-rc6.

Thank you for your advice :)
Where did you get a 4.7-rc6 build for Proxmox?
I think maybe I should roll back to one of the older kernels from the Proxmox 4.2 line, along the lines sketched below.
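
Something like this is what I have in mind, with the package version just a guess at whatever is still in the repository:

# dpkg --list | grep pve-kernel
# apt-get install pve-kernel-4.2.6-1-pve

then reboot and pick the old kernel from the GRUB menu (or pin it via GRUB_DEFAULT in /etc/default/grub followed by update-grub).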
One question for the Proxmox staff: any chance we will get a patched 4.4 kernel (once the patches are ported) or a 4.7 kernel for Proxmox in the near future?

Thank you all!
 
I compiled my own 4.7-rc6 kernel packages for this issue, as we needed to work around it quickly and to confirm the patch worked. I believe the patch will almost certainly be backported all the way to kernel 4.2, where this problem first appeared; it is present in all versions after 4.2.
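
For anyone wanting to do the same, the rough recipe on a Debian box looks like this (from memory, so treat it as a sketch and double-check the build dependencies):

# apt-get install build-essential bc libssl-dev fakeroot
# git clone --depth 1 --branch v4.7-rc6 git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
# cd linux
# cp /boot/config-$(uname -r) .config
# make olddefconfig
# make -j$(nproc) deb-pkg
# dpkg -i ../linux-image-4.7.0-rc6*.deb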

I believe that Proxmox uses the Ubuntu 16.04 sources for their kernel (I could be wrong). It's possible Ubuntu will backport this patch faster than the mainline/stable developers do, so it could land in the Proxmox kernel sooner.
 
Hmm, we haven't hit this bug ... yet ... on our cluster nodes with Supermicro X10DRI-O boards and 2x Xeon E5-2630 v3 ... but we also only fairly recently migrated from Proxmox v3 to v4, and our load isn't ramped up yet either.

But I did notice that there is a maintained PPA with mainline Ubuntu kernel builds: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.7-rc6-yakkety/
I haven't tried installing those debs to see whether they work on a Debian Jessie install ... I'll probably start staging that so we're prepared in case we hit this before the patch is backported to Proxmox's kernel.
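
If anyone else wants to stage it, the procedure should just be the usual one for mainline debs; untested by me on Jessie, and the exact filenames need to be taken from that page:

# wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.7-rc6-yakkety/<linux-image ... amd64.deb>
# dpkg -i linux-image-4.7*-generic_*_amd64.deb

then reboot into the new kernel.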
 
I will build a new kernel based on 4.4.15 with that patch applied and upload it to pvetest later on.
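
Once it is up, it can be pulled in by enabling the pvetest repository (assuming a standard PVE 4.x install on Jessie):

# echo "deb http://download.proxmox.com/debian jessie pvetest" > /etc/apt/sources.list.d/pvetest.list
# apt-get update && apt-get dist-upgrade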
 
Excellent. When do you expect this will be rolled into the subscription branch as an official update?
 
