BUG: unable to handle kernel paging request

Deli Veli

New Member
Dec 6, 2018
With the latest Proxmox 5.3-1 I am running into a bug that halts all KVM guests. The host is an Intel NUC 8. Below are the syslog and the pveversion output.

Code:
Mar 26 00:21:09 pmox2 kernel: [630425.268008] BUG: unable to handle kernel paging request at ffffffffc17edb60

Mar 26 00:21:09 pmox2 kernel: [630425.268452] IP: vmx_handle_exit+0x200/0x1560 [kvm_intel]
Mar 26 00:21:09 pmox2 kernel: [630425.268882] PGD 4de60e067 P4D 4de60e067 PUD 4de610067 PMD 854fdd067 PTE 854d73061
Mar 26 00:21:09 pmox2 kernel: [630425.269319] Oops: 0003 [#3] SMP PTI
Mar 26 00:21:09 pmox2 kernel: [630425.269757] Modules linked in: tcp_diag inet_diag nfsv3 nfs_acl nfs lockd grace fscache ip_set ip6table_filter ip6_tables iptable_filter softdog nfnetlink_log nfnetlink dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc zfs(PO) aesni_intel zunicode(PO) aes_x86_64 crypto_simd zavl(PO) glue_helper cryptd icp(PO) intel_cstate arc4 wmi_bmof intel_wmi_thunderbolt snd_soc_skl snd_soc_skl_ipc snd_hda_ext_core iwlmvm snd_soc_sst_dsp snd_soc_sst_ipc snd_soc_acpi mac80211 snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine ir_rc6_decoder snd_hda_intel iwlwifi btusb i915 btrtl btbcm snd_hda_codec
Mar 26 00:21:09 pmox2 kernel: [630425.272339]  btintel snd_hda_core pcspkr intel_rapl_perf rtsx_pci_ms drm_kms_helper bluetooth memstick snd_hwdep drm ecdh_generic i2c_algo_bit snd_pcm cfg80211 fb_sys_fops syscopyarea snd_timer sysfillrect snd sysimgblt soundcore mei_me intel_pch_thermal mei shpchp wmi rc_rc6_mce ir_lirc_codec lirc_dev ite_cir rc_core video mac_hid acpi_pad zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq rtsx_pci_sdmmc e1000e(O) ptp pps_core i2c_i801 rtsx_pci ahci libahci
Mar 26 00:21:09 pmox2 kernel: [630425.275059] CPU: 6 PID: 15012 Comm: kvm Tainted: P      D    O     4.15.18-12-pve #1
Mar 26 00:21:09 pmox2 kernel: [630425.276014] Hardware name: Intel(R) Client Systems NUC8i7BEH/NUC8BEB, BIOS BECFL357.86A.0041.2018.0719.1931 07/19/2018
Mar 26 00:21:09 pmox2 kernel: [630425.277001] RIP: 0010:vmx_handle_exit+0x200/0x1560 [kvm_intel]
Mar 26 00:21:09 pmox2 kernel: [630425.277998] RSP: 0018:ffffbf2f838cbd18 EFLAGS: 00010286
Mar 26 00:21:09 pmox2 kernel: [630425.278996] RAX: ffffffffc17edb60 RBX: ffff99be75f28000 RCX: 00000000000000e0
Mar 26 00:21:09 pmox2 kernel: [630425.280000] RDX: 00000000000000ef RSI: 00000000000000ef RDI: ffff99be75f28000
Mar 26 00:21:09 pmox2 kernel: [630425.281007] RBP: ffffbf2f838cbdc0 R08: 0000000000000000 R09: ffff99be90890000
Mar 26 00:21:09 pmox2 kernel: [630425.282020] R10: 0000000000000001 R11: 0000000000000000 R12: 000613033f431951
Mar 26 00:21:09 pmox2 kernel: [630425.283042] R13: 0000000000000000 R14: ffff99be678152b0 R15: ffff99be75f2c280
Mar 26 00:21:09 pmox2 kernel: [630425.284071] FS:  00007f98a3bff700(0000) GS:ffff99c43dd80000(0000) knlGS:0000000000000000
Mar 26 00:21:09 pmox2 kernel: [630425.285105] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 26 00:21:09 pmox2 kernel: [630425.286141] CR2: ffffffffc17edb60 CR3: 0000000547382004 CR4: 00000000003626e0
Mar 26 00:21:09 pmox2 kernel: [630425.287192] Call Trace:
Mar 26 00:21:09 pmox2 kernel: [630425.288257]  ? kvm_arch_vcpu_ioctl_run+0x95c/0x16d0 [kvm]
Mar 26 00:21:09 pmox2 kernel: [630425.289332]  kvm_vcpu_ioctl+0x339/0x620 [kvm]
Mar 26 00:21:09 pmox2 kernel: [630425.290400]  ? kvm_vcpu_ioctl+0x339/0x620 [kvm]
Mar 26 00:21:09 pmox2 kernel: [630425.291464]  do_vfs_ioctl+0xa6/0x620
Mar 26 00:21:09 pmox2 kernel: [630425.292561]  ? kvm_on_user_return+0x70/0xa0 [kvm]
Mar 26 00:21:09 pmox2 kernel: [630425.293661]  SyS_ioctl+0x79/0x90
Mar 26 00:21:09 pmox2 kernel: [630425.294763]  ? exit_to_usermode_loop+0xa5/0xd0
Mar 26 00:21:09 pmox2 kernel: [630425.295864]  do_syscall_64+0x73/0x130
Mar 26 00:21:09 pmox2 kernel: [630425.296955]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Mar 26 00:21:09 pmox2 kernel: [630425.298058] RIP: 0033:0x7f98b3ebe017
Mar 26 00:21:09 pmox2 kernel: [630425.299163] RSP: 002b:00007f98a3bfc538 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Mar 26 00:21:09 pmox2 kernel: [630425.300299] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f98b3ebe017
Mar 26 00:21:09 pmox2 kernel: [630425.301425] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000018
Mar 26 00:21:09 pmox2 kernel: [630425.302561] RBP: 0000000000000000 R08: 00007f98a64c3ac0 R09: 000000000000ffff
Mar 26 00:21:09 pmox2 kernel: [630425.303684] R10: 00007f98cc8dc000 R11: 0000000000000246 R12: 00007f98a6662000
Mar 26 00:21:09 pmox2 kernel: [630425.304819] R13: 00007f98cc8db000 R14: 0000000000000000 R15: 00007f98a6662000
Mar 26 00:21:09 pmox2 kernel: [630425.305956] Code: 00 b8 ff 01 00 00 0f 79 d0 0f 96 c0 84 c0 0f 84 5e fe ff ff be ff 01 00 00 bf 12 08 00 00 e8 f8 1f ff ff e9 4a fe ff ff ba 04 44 <00> 00 0f 78 d0 41 80 be ea 59 00 00 00 48 89 45 b0 0f 85 4c fe
Mar 26 00:21:09 pmox2 kernel: [630425.307187] RIP: vmx_handle_exit+0x200/0x1560 [kvm_intel] RSP: ffffbf2f838cbd18
Mar 26 00:21:09 pmox2 kernel: [630425.308414] CR2: ffffffffc17edb60
Mar 26 00:21:09 pmox2 kernel: [630425.309641] ---[ end trace 205d4ecee86f6aff ]---

pveversion -v
Code:
proxmox-ve: 5.3-1 (running kernel: 4.15.18-12-pve)
pve-manager: 5.3-11 (running version: 5.3-11/d4907f84)
pve-kernel-4.15: 5.3-3
pve-kernel-4.15.18-12-pve: 4.15.18-35
pve-kernel-4.15.18-10-pve: 4.15.18-32
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-47
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-12
libpve-storage-perl: 5.0-39
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-23
pve-cluster: 5.0-33
pve-container: 2.0-35
pve-docs: 5.3-3
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-18
pve-firmware: 2.0-6
pve-ha-manager: 2.0-8
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-2
pve-xtermjs: 3.10.1-2
qemu-server: 5.0-47
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo1

Once this bug occurs I can still SSH into Proxmox, but all KVM guests are unresponsive. After I reboot the host, all guests are back to normal until the bug resurfaces.

Any suggestions on how to resolve this issue?

Thank you in advance,
 
Hi,
is this problem specific to the latest kernel? Have you tried to reproduce the issue on older kernels? When did you first encounter the issue?
 
Hi @Chris thank you for your response.

I had issues with the previous kernel as well, but I don't have its exact version, and I'm not sure the issue was identical.

No, I have not tried to reproduce this issue on older kernels. I can give it a shot if you tell me which kernel I should try. At the end of the day I need a stable version.

We first encountered the issue today at 00:21; however, these are new servers that were configured recently and have only been up and running for the past 7 days.

Should I install an older version of Proxmox on at least one of the servers to get it stable?
 
OK, I got another crash today with "unable to handle kernel paging request" on the same server. Below is the stack trace:

Code:
Mar 26 19:53:22 pmox2 pvestatd[1543]: command '/sbin/lvs --separator : --noheadings --units b --unbuffered --nosuffix --options vg_name,lv_name,lv_size,lv_attr,pool_lv,data_percent,metadata_percent,snap_percent,uuid,tags,metadata_size' failed: got signal 9
Mar 26 19:53:22 pmox2 kernel: [32531.044959] BUG: unable to handle kernel paging request at 0000000000001004
Mar 26 19:53:22 pmox2 kernel: [32531.044965] IP: kernfs_dop_revalidate+0x38/0xd0
Mar 26 19:53:22 pmox2 kernel: [32531.044966] PGD 0 P4D 0
Mar 26 19:53:22 pmox2 kernel: [32531.044969] Oops: 0000 [#1] SMP PTI
Mar 26 19:53:22 pmox2 kernel: [32531.044970] Modules linked in: tcp_diag inet_diag nfsv3 nfs_acl nfs lockd grace fscache ip_set ip6table_filter ip6_tables iptable_filter softdog nfnetlink_log nfnetlink dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec_realtek snd_hda_codec_generic kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc zfs(PO) aesni_intel aes_x86_64 zunicode(PO) crypto_simd glue_helper zavl(PO) cryptd icp(PO) intel_cstate arc4 wmi_bmof intel_wmi_thunderbolt snd_soc_skl snd_soc_skl_ipc snd_hda_ext_core iwlmvm snd_soc_sst_dsp snd_soc_sst_ipc snd_soc_acpi mac80211 i915 snd_soc_core rtsx_pci_ms ir_rc6_decoder snd_compress ac97_bus memstick snd_pcm_dmaengine drm_kms_helper btusb snd_hda_intel
Mar 26 19:53:22 pmox2 kernel: [32531.044999]  drm btrtl iwlwifi btbcm btintel i2c_algo_bit intel_rapl_perf fb_sys_fops snd_hda_codec syscopyarea sysfillrect bluetooth sysimgblt snd_hda_core snd_hwdep snd_pcm ecdh_generic snd_timer pcspkr cfg80211 snd soundcore mei_me mei intel_pch_thermal shpchp wmi rc_rc6_mce ir_lirc_codec lirc_dev ite_cir rc_core video mac_hid acpi_pad zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm sunrpc iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq rtsx_pci_sdmmc e1000e(O) ptp pps_core i2c_i801 rtsx_pci ahci libahci
Mar 26 19:53:22 pmox2 kernel: [32531.045024] CPU: 4 PID: 30602 Comm: lvs Tainted: P           O     4.15.18-12-pve #1
Mar 26 19:53:22 pmox2 kernel: [32531.045025] Hardware name: Intel(R) Client Systems NUC8i7BEH/NUC8BEB, BIOS BECFL357.86A.0041.2018.0719.1931 07/19/2018
Mar 26 19:53:22 pmox2 kernel: [32531.045028] RIP: 0010:kernfs_dop_revalidate+0x38/0xd0
Mar 26 19:53:22 pmox2 kernel: [32531.045029] RSP: 0018:ffffa99861e3fc28 EFLAGS: 00010246
Mar 26 19:53:22 pmox2 kernel: [32531.045030] RAX: 0000000000000000 RBX: ffff941218683a80 RCX: 0000000000000030
Mar 26 19:53:22 pmox2 kernel: [32531.045032] RDX: ffff9411c3461740 RSI: 0000000000000000 RDI: ffffffff9b53bf80
Mar 26 19:53:22 pmox2 kernel: [32531.045033] RBP: ffffa99861e3fc40 R08: ffff940f6e300042 R09: ffff941218683a80
Mar 26 19:53:22 pmox2 kernel: [32531.045034] R10: ffff940f6e30003c R11: ffff941218683ab8 R12: 0000000000001000
Mar 26 19:53:22 pmox2 kernel: [32531.045036] R13: ffffa99861e3fcc0 R14: ffff941219a92920 R15: ffff941218683900
Mar 26 19:53:22 pmox2 kernel: [32531.045037] FS:  00007f192f5393c0(0000) GS:ffff94123dd00000(0000) knlGS:0000000000000000
Mar 26 19:53:22 pmox2 kernel: [32531.045039] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 26 19:53:22 pmox2 kernel: [32531.045040] CR2: 0000000000001004 CR3: 000000084f3ac003 CR4: 00000000003626e0
Mar 26 19:53:22 pmox2 kernel: [32531.045041] Call Trace:
Mar 26 19:53:22 pmox2 kernel: [32531.045044]  lookup_fast+0x29a/0x300
Mar 26 19:53:22 pmox2 kernel: [32531.045045]  ? __inode_permission+0x5b/0x160
Mar 26 19:53:22 pmox2 kernel: [32531.045047]  walk_component+0x49/0x360
Mar 26 19:53:22 pmox2 kernel: [32531.045049]  ? path_init+0x1bd/0x300
Mar 26 19:53:22 pmox2 kernel: [32531.045051]  path_lookupat+0x73/0x220
Mar 26 19:53:22 pmox2 kernel: [32531.045052]  filename_lookup+0xb8/0x1a0
Mar 26 19:53:22 pmox2 kernel: [32531.045054]  ? __check_object_size+0xb3/0x190
Mar 26 19:53:22 pmox2 kernel: [32531.045056]  ? strncpy_from_user+0x4d/0x170
Mar 26 19:53:22 pmox2 kernel: [32531.045058]  user_path_at_empty+0x36/0x40
Mar 26 19:53:22 pmox2 kernel: [32531.045059]  ? user_path_at_empty+0x36/0x40
Mar 26 19:53:22 pmox2 kernel: [32531.045061]  SyS_access+0xb4/0x220
Mar 26 19:53:22 pmox2 kernel: [32531.045063]  do_syscall_64+0x73/0x130
Mar 26 19:53:22 pmox2 kernel: [32531.045065]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Mar 26 19:53:22 pmox2 kernel: [32531.045067] RIP: 0033:0x7f192e3959c7
Mar 26 19:53:22 pmox2 kernel: [32531.045068] RSP: 002b:00007ffc60479c38 EFLAGS: 00000246 ORIG_RAX: 0000000000000015
Mar 26 19:53:22 pmox2 kernel: [32531.045069] RAX: ffffffffffffffda RBX: 00005606aba3e530 RCX: 00007f192e3959c7
Mar 26 19:53:22 pmox2 kernel: [32531.045071] RDX: 00746e657665752f RSI: 0000000000000000 RDI: 00007ffc60479c40
Mar 26 19:53:22 pmox2 kernel: [32531.045072] RBP: 00007ffc60479cd0 R08: 0000000000000000 R09: 0000000000001000
Mar 26 19:53:22 pmox2 kernel: [32531.045073] R10: 00000000000001a0 R11: 0000000000000246 R12: 00005606aba3d6c0
Mar 26 19:53:22 pmox2 kernel: [32531.045075] R13: 00007f192f5572c0 R14: 00007ffc60479c40 R15: 00005606aba3d8d0
Mar 26 19:53:22 pmox2 kernel: [32531.045076] Code: 00 48 8b 57 30 31 c0 48 85 d2 74 57 55 48 89 e5 41 55 41 54 53 4c 8b a2 50 02 00 00 48 89 fb 48 c7 c7 80 bf 53 9b e8 18 56 6c 00 <41> 8b 44 24 04 85 c0 78 1b 48 8b 43 18 48 8b 40 30 48 85 c0 74
Mar 26 19:53:22 pmox2 kernel: [32531.045093] RIP: kernfs_dop_revalidate+0x38/0xd0 RSP: ffffa99861e3fc28
Mar 26 19:53:22 pmox2 kernel: [32531.045094] CR2: 0000000000001004
Mar 26 19:53:22 pmox2 kernel: [32531.045096] ---[ end trace ddde44c1d5fa934f ]---

What could be the reason, and what can I do to make it stable?
 
Hmm, multiple 'unable to handle kernel paging request' oopses at different addresses (ffffffffc17edb60 and 0000000000001004) in seemingly vastly different areas of the kernel (vmx_handle_exit in kvm_intel, and kernfs_dop_revalidate while running `lvs`) look like they could be a hardware issue.
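To compare the oopses side by side, here is a minimal Python sketch that pulls the faulting address, function, module, process name and error code out of syslog lines like the ones posted above. The helper names (`summarize_oopses`, `decode_oops_code`) and the regular expressions are my own illustration, not part of any Proxmox or kernel tool, and the patterns are only matched against the line format seen in this thread. The error code after "Oops:" is the x86 page-fault error code; only its three lowest bits are decoded here.

```python
import re

# Lowest three bits of the x86 page-fault error code printed after "Oops:".
# Each tuple is (meaning when the bit is 0, meaning when the bit is 1).
PF_BITS = {
    0: ("page not present", "protection violation"),
    1: ("read", "write"),
    2: ("kernel mode", "user mode"),
}

BUG_RE = re.compile(r"BUG: unable to handle kernel paging request at ([0-9a-f]+)")
# \b keeps this from matching the later "RIP:" lines.
IP_RE = re.compile(r"\bIP: (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?")
OOPS_RE = re.compile(r"Oops: ([0-9a-f]{4}) \[#\d+\]")
COMM_RE = re.compile(r"Comm: (\S+)")


def decode_oops_code(code):
    """Return a human-readable description of the low error-code bits."""
    return [names[(code >> bit) & 1] for bit, names in PF_BITS.items()]


def summarize_oopses(log_text):
    """Collect one summary dict per 'BUG: unable to handle...' report."""
    reports = []
    for line in log_text.splitlines():
        m = BUG_RE.search(line)
        if m:
            reports.append({"address": m.group(1), "function": None,
                            "module": None, "code": None, "comm": None})
            continue
        if not reports:
            continue  # nothing to attach this line to yet
        cur = reports[-1]
        m = IP_RE.search(line)
        if m and cur["function"] is None:
            cur["function"], cur["module"] = m.group(1), m.group(2)
        m = OOPS_RE.search(line)
        if m:
            cur["code"] = int(m.group(1), 16)
        m = COMM_RE.search(line)
        if m:
            cur["comm"] = m.group(1)
    return reports


# A few lines copied from the two traces in this thread.
SAMPLE = """\
Mar 26 00:21:09 pmox2 kernel: [630425.268008] BUG: unable to handle kernel paging request at ffffffffc17edb60
Mar 26 00:21:09 pmox2 kernel: [630425.268452] IP: vmx_handle_exit+0x200/0x1560 [kvm_intel]
Mar 26 00:21:09 pmox2 kernel: [630425.269319] Oops: 0003 [#3] SMP PTI
Mar 26 00:21:09 pmox2 kernel: [630425.275059] CPU: 6 PID: 15012 Comm: kvm Tainted: P      D    O     4.15.18-12-pve #1
Mar 26 19:53:22 pmox2 kernel: [32531.044959] BUG: unable to handle kernel paging request at 0000000000001004
Mar 26 19:53:22 pmox2 kernel: [32531.044965] IP: kernfs_dop_revalidate+0x38/0xd0
Mar 26 19:53:22 pmox2 kernel: [32531.044969] Oops: 0000 [#1] SMP PTI
Mar 26 19:53:22 pmox2 kernel: [32531.045024] CPU: 4 PID: 30602 Comm: lvs Tainted: P           O     4.15.18-12-pve #1
"""

for report in summarize_oopses(SAMPLE):
    print(report["comm"], report["function"], report["module"],
          report["address"], decode_oops_code(report["code"]))
```

If the two reports come out with different processes, different functions and different kinds of access, that fits the pattern of unrelated code paths failing, which is what makes hardware a plausible suspect here.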

Maybe try running memtest on the system for a longer period.

Hope this helps!
 
I ran the following:
root@pmox1:~# memtester 1024 5

Here are the results:

Code:
root@pmox1:~# memtester 1024 5
memtester version 4.3.0 (64-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 1024MB (1073741824 bytes)
got  1024MB (1073741824 bytes), trying mlock ...locked.
Loop 1/5:
  Stuck Address       : ok         
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok         
  Block Sequential    : ok         
  Checkerboard        : ok         
  Bit Spread          : ok         
  Bit Flip            : ok         
  Walking Ones        : ok         
  Walking Zeroes      : ok         
  8-bit Writes        : ok
  16-bit Writes       : ok

Loop 2/5:
  Stuck Address       : ok         
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok         
  Block Sequential    : ok         
  Checkerboard        : ok         
  Bit Spread          : ok         
  Bit Flip            : testing  49FAILURE: 0x00000040 != 0x02000040 at offset 0x04fc8910.
  Walking Ones        : ok         
  Walking Zeroes      : ok         
  8-bit Writes        : ok
  16-bit Writes       : ok

Loop 3/5:
  Stuck Address       : ok         
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok         
  Block Sequential    : ok         
  Checkerboard        : ok         
  Bit Spread          : ok         
  Bit Flip            : ok         
  Walking Ones        : ok         
  Walking Zeroes      : ok         
  8-bit Writes        : ok
  16-bit Writes       : ok

Loop 4/5:
  Stuck Address       : ok         
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok         
  Block Sequential    : ok         
  Checkerboard        : ok         
  Bit Spread          : ok         
  Bit Flip            : ok         
  Walking Ones        : ok         
  Walking Zeroes      : ok         
  8-bit Writes        : \

Then it hung.
 
I would actually recommend running memtest86 (after rebooting the server you can select it in GRUB), since it has access to the complete memory.

but
Bit Flip : testing 49FAILURE: 0x00000040 != 0x02000040 at offset 0x04fc8910.
and
8-bit Writes : \

could indicate a hardware problem (likely a faulty piece of RAM, but it could just as well be a faulty power supply).

If possible, try replacing one component at a time and see whether the issue goes away (but run memtest86 first).
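To see what the memtester failure line actually says, here is a small Python sketch that XORs the two values from a "FAILURE: ... != ..." line and lists which bit positions differ. The function name `flipped_bits` and the regular expression are my own, written only against the line format quoted above; it is an illustration, not a memtester feature.

```python
import re

# Matches the failure report as it appears in the memtester output above.
FAILURE_RE = re.compile(
    r"FAILURE: 0x([0-9a-fA-F]+) != 0x([0-9a-fA-F]+) at offset 0x([0-9a-fA-F]+)"
)


def flipped_bits(line):
    """Return the bit positions that differ between the two reported values,
    or None if the line is not a memtester failure line."""
    m = FAILURE_RE.search(line)
    if m is None:
        return None
    first, second = int(m.group(1), 16), int(m.group(2), 16)
    diff = first ^ second  # bits set here are the ones that disagree
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


line = "Bit Flip : testing 49FAILURE: 0x00000040 != 0x02000040 at offset 0x04fc8910."
print(flipped_bits(line))  # prints [25]
```

The two values differ in exactly one bit (bit 25), so this is a single-bit error rather than wholesale corruption. That is consistent with a marginal memory cell or a power problem, but it does not tell the two apart.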

Hope this helps!
 
