Kernel Oops / Panic on 3.10-5 and 3.10-7 Kernels

fat_kid

New Member
Jan 27, 2015
Hi,
We have had a kernel panic on almost all of our hosts at different times. We have 7 Dell C8220 hosts with 192GB, 256GB or 512GB RAM and either Sandy Bridge E5-2680 or Ivy Bridge E5-2697v2 CPUs. This has happened on Proxmox kernels 3.10-5 and 3.10-7 as well as on a 3.10.5 kernel we compiled ourselves.

It seems to be more likely to happen if the host exceeds 80% memory usage, but this has not been confirmed.
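To test the 80% suspicion, memory usage could be logged periodically on each host and compared against the oops timestamps. Below is a minimal Python sketch. It assumes "used" means MemTotal minus MemFree, Buffers and Cached as reported in /proc/meminfo (3.10 kernels do not provide MemAvailable). The function names and the threshold constant are illustrative only.

```python
THRESHOLD_PERCENT = 80.0  # the unconfirmed figure mentioned above


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def used_percent(fields):
    """Assumed definition: used = MemTotal - MemFree - Buffers - Cached."""
    total = fields["MemTotal"]
    not_used = (fields.get("MemFree", 0)
                + fields.get("Buffers", 0)
                + fields.get("Cached", 0))
    return 100.0 * (total - not_used) / total


def host_used_percent(path="/proc/meminfo"):
    """Read the live value on a Linux host."""
    with open(path) as f:
        return used_percent(parse_meminfo(f.read()))
```

Calling host_used_percent() from a cron job and writing the result to syslog whenever it exceeds THRESHOLD_PERCENT would give a timeline to line up against the panics.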

Here's a stack trace from syslog:
Code:
Mar  5 14:02:37 vmh02 kernel: [846539.132897] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
Mar  5 14:02:37 vmh02 kernel: [846539.132973] IP: [<ffffffff810ce44e>] isolate_migratepages_range+0x262/0x5f4
Mar  5 14:02:37 vmh02 kernel: [846539.133021] PGD 0
Mar  5 14:02:37 vmh02 kernel: [846539.133059] Oops: 0000 [#1] SMP
Mar  5 14:02:37 vmh02 kernel: [846539.133113] Modules linked in: nfsv3 dlm configfs ip_set iptable_filter ip_tables x_tables vhost_net nfnetlink_log nfnetlink tun macvtap macvlan sctp libcrc32c ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi rpcsec_gss_krb5 nfsv4 nfsd auth_rpcgss oid_registry nfs_acl nfs lockd dns_resolver fscache sunrpc bridge 8021q garp stp llc bonding dell_rbu fuse joydev hid_generic usbhid hid coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 ablk_helper cryptd lrw gf128mul ehci_pci snd_pcm snd_page_alloc snd_timer ttm snd drm_kms_helper glue_helper soundcore drm ehci_hcd mperf usbcore processor ioatdma syscopyarea microcode sysfillrect iTCO_wdt iTCO_vendor_support sysimgblt lpc_ich mfd_core i2c_i801 usb_common thermal_sys shpchp dcdbas evdev wmi pcspkr button ext3 mbcache jbd dm_mod sg sd_mod crc_t10dif ixgbe ahci mdio libahci igb libata i2c_algo_bit i2c_core scsi_mod dca ptp pps_core [last unloaded:
Mar  5 14:02:37 vmh02 kernel: configfs]
Mar  5 14:02:37 vmh02 kernel: [846539.134075] CPU: 5 PID: 120869 Comm: kvm Not tainted 3.10.5 #1
Mar  5 14:02:37 vmh02 kernel: [846539.134115] Hardware name: Dell Inc. PowerEdge C5220/N/A, BIOS 1.2.1 05/27/2013
Mar  5 14:02:37 vmh02 kernel: [846539.134178] task: ffff883f0fc807b0 ti: ffff883281434000 task.ti: ffff883281434000
Mar  5 14:02:37 vmh02 kernel: [846539.134241] RIP: 0010:[<ffffffff810ce44e>]  [<ffffffff810ce44e>] isolate_migratepages_range+0x262/0x5f4
Mar  5 14:02:37 vmh02 kernel: [846539.134312] RSP: 0018:ffff883281435698  EFLAGS: 00010282
Mar  5 14:02:37 vmh02 kernel: [846539.134350] RAX: 0000000000000000 RBX: 00000000027ee925 RCX: 0000000000000008
Mar  5 14:02:37 vmh02 kernel: [846539.134412] RDX: 0600000000008000 RSI: 0000000000000006 RDI: 0000000000000000
Mar  5 14:02:37 vmh02 kernel: [846539.134475] RBP: ffff883281435788 R08: 000000008bc30018 R09: 0000000000000006
Mar  5 14:02:37 vmh02 kernel: [846539.134539] R10: 000000000000001c R11: 0000000000003906 R12: ffff88407fff8d80
Mar  5 14:02:37 vmh02 kernel: [846539.134602] R13: 0000000000000000 R14: ffffea008bc30018 R15: 0000000000000000
Mar  5 14:02:37 vmh02 kernel: [846539.134665] FS:  00007f435c8f2700(0000) GS:ffff88207f6a0000(0000) knlGS:fffff88002469000
Mar  5 14:02:37 vmh02 kernel: [846539.134729] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar  5 14:02:37 vmh02 kernel: [846539.134768] CR2: 000000000000001c CR3: 0000001dfeecb000 CR4: 00000000000427e0
Mar  5 14:02:37 vmh02 kernel: [846539.134841] DR0: 0000000000000001 DR1: 0000000000000002 DR2: 0000000000000001
Mar  5 14:02:37 vmh02 kernel: [846539.134903] DR3: 000000000000000a DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar  5 14:02:37 vmh02 kernel: [846539.134965] Stack:
Mar  5 14:02:37 vmh02 kernel: [846539.134995]  ffff88207fffcd80 ffffea008bc2c000 0000000000013f74 0000000000000126
Mar  5 14:02:37 vmh02 kernel: [846539.135075]  ffff883281435798 0000000000000000 00000000027eea00 ffff88407fff9200
Mar  5 14:02:37 vmh02 kernel: [846539.135155]  0000000000000000 ffff883f0fc807b0 ffff883281435830 0000000000000000
Mar  5 14:02:37 vmh02 kernel: [846539.135246] Call Trace:
Mar  5 14:02:37 vmh02 kernel: [846539.135281]  [<ffffffff810ce969>] ? compact_zone+0x108/0x307
Mar  5 14:02:37 vmh02 kernel: [846539.135320]  [<ffffffff810ced8a>] ? compact_zone_order+0x94/0xa7
Mar  5 14:02:37 vmh02 kernel: [846539.135361]  [<ffffffff810cee35>] ? try_to_compact_pages+0x98/0xec
Mar  5 14:02:37 vmh02 kernel: [846539.135404]  [<ffffffff813707b0>] ? __alloc_pages_direct_compact+0xa9/0x19a
Mar  5 14:02:37 vmh02 kernel: [846539.135450]  [<ffffffff810bc380>] ? __alloc_pages_nodemask+0x404/0x776
Mar  5 14:02:37 vmh02 kernel: [846539.135496]  [<ffffffff810f6b7c>] ? memcg_check_events+0xb4/0x1b2
Mar  5 14:02:37 vmh02 kernel: [846539.135539]  [<ffffffff810e9115>] ? alloc_pages_vma+0xbf/0xfe
Mar  5 14:02:37 vmh02 kernel: [846539.135580]  [<ffffffff810f4d83>] ? do_huge_pmd_anonymous_page+0x139/0x2b9
Mar  5 14:02:37 vmh02 kernel: [846539.135627]  [<ffffffff810d396b>] ? handle_mm_fault+0x108/0x20d
Mar  5 14:02:37 vmh02 kernel: [846539.135668]  [<ffffffff810d3dde>] ? __get_user_pages+0x2e7/0x47e
Mar  5 14:02:37 vmh02 kernel: [846539.135712]  [<ffffffffa0259787>] ? vmx_set_tsc_khz+0x36/0x36 [kvm_intel]
Mar  5 14:02:37 vmh02 kernel: [846539.135772]  [<ffffffffa038876c>] ? __gfn_to_pfn_memslot+0x16d/0x332 [kvm]
Mar  5 14:02:37 vmh02 kernel: [846539.135819]  [<ffffffffa0388971>] ? __gfn_to_pfn+0x2b/0x50 [kvm]
Mar  5 14:02:37 vmh02 kernel: [846539.135866]  [<ffffffffa039c961>] ? try_async_pf+0x38/0x1b1 [kvm]
Mar  5 14:02:37 vmh02 kernel: [846539.135911]  [<ffffffffa038a361>] ? kvm_host_page_size+0x73/0x7b [kvm]
Mar  5 14:02:37 vmh02 kernel: [846539.135970]  [<ffffffffa03a0ed4>] ? tdp_page_fault+0xf4/0x1dc [kvm]
Mar  5 14:02:37 vmh02 kernel: [846539.136017]  [<ffffffffa03a97ff>] ? kvm_ioapic_send_eoi+0x28/0x5f [kvm]
Mar  5 14:02:37 vmh02 kernel: [846539.136063]  [<ffffffffa039d918>] ? kvm_mmu_page_fault+0x1e/0xbb [kvm]
Mar  5 14:02:37 vmh02 kernel: [846539.136111]  [<ffffffffa0258360>] ? vmx_invpcid_supported+0x16/0x16 [kvm_intel]
Mar  5 14:02:37 vmh02 kernel: [846539.146611]  [<ffffffffa0258360>] ? vmx_invpcid_supported+0x16/0x16 [kvm_intel]
Mar  5 14:02:37 vmh02 kernel: [846539.146678]  [<ffffffffa025e099>] ? vmx_handle_exit+0x6e2/0x728 [kvm_intel]
Mar  5 14:02:37 vmh02 kernel: [846539.146723]  [<ffffffffa025fad9>] ? vmx_vcpu_run+0x3d9/0x17d1 [kvm_intel]
Mar  5 14:02:37 vmh02 kernel: [846539.146775]  [<ffffffffa03a9778>] ? apic_update_ppr+0x15/0x74 [kvm]
Mar  5 14:02:37 vmh02 kernel: [846539.146821]  [<ffffffffa03a9778>] ? apic_update_ppr+0x15/0x74 [kvm]
Mar  5 14:02:37 vmh02 kernel: [846539.146865]  [<ffffffffa0258360>] ? vmx_invpcid_supported+0x16/0x16 [kvm_intel]
Mar  5 14:02:37 vmh02 kernel: [846539.146934]  [<ffffffffa039a637>] ? kvm_arch_vcpu_ioctl_run+0xad0/0xe65 [kvm]
Mar  5 14:02:37 vmh02 kernel: [846539.147000]  [<ffffffffa025aba9>] ? vmx_vcpu_load+0x28/0x145 [kvm_intel]
Mar  5 14:02:37 vmh02 kernel: [846539.147045]  [<ffffffff8106f66d>] ? futex_wake+0xd6/0xee
Mar  5 14:02:37 vmh02 kernel: [846539.147089]  [<ffffffffa0395e6c>] ? kvm_arch_vcpu_load+0xc1/0x18c [kvm]
Mar  5 14:02:37 vmh02 kernel: [846539.147135]  [<ffffffffa0388bd5>] ? kvm_vcpu_ioctl+0x118/0x48d [kvm]
Mar  5 14:02:37 vmh02 kernel: [846539.147178]  [<ffffffff8110adda>] ? vfs_ioctl+0x1e/0x31
Mar  5 14:02:37 vmh02 kernel: [846539.147217]  [<ffffffff8110b60a>] ? do_vfs_ioctl+0x3ea/0x42c
Mar  5 14:02:37 vmh02 kernel: [846539.147257]  [<ffffffff810713db>] ? SyS_futex+0x133/0x168
Mar  5 14:02:37 vmh02 kernel: [846539.147298]  [<ffffffff813741a3>] ? __schedule+0x4c5/0x51b
Mar  5 14:02:37 vmh02 kernel: [846539.147338]  [<ffffffff8110b69a>] ? SyS_ioctl+0x4e/0x7c
Mar  5 14:02:37 vmh02 kernel: [846539.147379]  [<ffffffff810b39fc>] ? fire_user_return_notifiers+0x35/0x3d
Mar  5 14:02:37 vmh02 kernel: [846539.147423]  [<ffffffff81379f92>] ? system_call_fastpath+0x16/0x1b
Mar  5 14:02:37 vmh02 kernel: [846539.147463] Code: 00 00 00 41 f7 06 ff ff ff 01 0f 85 46 02 00 00 41 8b 46 18 85 c0 0f 89 3a 02 00 00 49 8b 16 4c 89 f0 80 e6 80 74 04 49 8b 46 30 <8b> 40 1c ff c8 0f 85 20 02 00 00 49 8b 46 08 48 85 c0 74 0d 48
Mar  5 14:02:37 vmh02 kernel: [846539.147863] RIP  [<ffffffff810ce44e>] isolate_migratepages_range+0x262/0x5f4
Mar  5 14:02:37 vmh02 kernel: [846539.147909]  RSP <ffff883281435698>
Mar  5 14:02:37 vmh02 kernel: [846539.147943] CR2: 000000000000001c
Mar  5 14:02:37 vmh02 kernel: [846539.148549] ---[ end trace 142db42c5150a15d ]---
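
Read from the bottom up, the call trace goes from the kvm vCPU ioctl through a guest page fault into do_huge_pmd_anonymous_page, then into direct compaction (try_to_compact_pages, compact_zone), and faults in isolate_migratepages_range. Every entry is marked "?", so the kernel does not consider any of them a reliable frame. For comparing this trace against traces from other hosts, a rough Python sketch like the following could reduce an oops to its ordered list of symbols. The regular expression and the function name are my own assumptions about the log layout shown above, not part of any kernel tooling.

```python
import re

# Matches "[<address>] ? symbol+0xoff/0xsize" and captures the symbol name.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def call_trace_symbols(log_text):
    """Return the symbols listed after 'Call Trace:', innermost first."""
    symbols = []
    in_trace = False
    for line in log_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME_RE.search(line)
        if match is None:
            break  # first non-frame line (e.g. "Code: ...") ends the trace
        symbols.append(match.group(1))
    return symbols
```

Two oopses whose symbol lists both begin with compact_zone and pass through do_huge_pmd_anonymous_page would be a reasonable hint that they are the same problem, although that is not proof.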


We've found a Bugzilla report that seems to have a very similar trace. Has the patch mentioned there been put into the Proxmox kernel?
https://bugzilla.novell.com/show_bug.cgi?id=871160

The above trace was captured while the machine was still responsive enough for us to read the syslog; however, we have also had complete crashes. See the attached photo of a panic.
20150203_134604_resized.jpg

Has anyone else seen this? Particularly on Dell C8220s? Or on systems with high memory usage?

Rich
 