Kernel BUG PVE 4

Sakis

Hi all,

My nodes in a Proxmox 4 cluster frequently experience a kernel bug and remain in a strange state.
Most services, including the KVM guests, keep running fine, but ssh and pveproxy become unresponsive. I can't even get an iLO console, so I am completely locked out and have to reset the node every time. Of course this also means I cannot migrate the HA KVM guests off that node.

I am running the older kernel because of this related issue: https://forum.proxmox.com/threads/new-pve-4-guest-kvm-cant-see-more-than-one-core.24802/#post-124242 (one way to keep the node on that kernel is sketched after the version list below).

Code:
proxmox-ve: 4.0-21 (running kernel: 4.1.3-1-pve)
pve-manager: 4.0-57 (running version: 4.0-57/cc7c2b53)
pve-kernel-4.1.3-1-pve: 4.1.3-7
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.2.3-2-pve: 4.2.3-21
lvm2: 2.02.116-pve1
corosync-pve: 2.3.5-1
libqb0: 0.17.2-1
pve-cluster: 4.0-24
qemu-server: 4.0-35
pve-firmware: 1.1-7
libpve-common-perl: 4.0-36
libpve-access-control: 4.0-9
libpve-storage-perl: 4.0-29
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-12
pve-container: 1.0-21
pve-firewall: 2.0-13
pve-ha-manager: 1.0-13
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.4-3
lxcfs: 0.10-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve6~jessie
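
One way to keep the node booting the older 4.1.3-1-pve kernel while the newer ones stay installed is to pin it as the default in GRUB. This is only a sketch, not necessarily how anyone in this thread does it, and the menu entry title shown in the comment is just an example; copy the real one from the grep output on your own system.

Code:
# list the boot entries GRUB knows about (nested kernel entries are indented)
grep -E "menuentry '|submenu '" /boot/grub/grub.cfg

# in /etc/default/grub set GRUB_DEFAULT to the wanted "submenu>entry" pair,
# e.g. (example title only, take the exact strings from the grep output above):
# GRUB_DEFAULT="Advanced options for Proxmox Virtual Environment GNU/Linux>Proxmox Virtual Environment GNU/Linux, with Linux 4.1.3-1-pve"

# regenerate the GRUB configuration so the change takes effect on the next boot
update-grub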

And this is the kernel bug I get:

Code:
Jan  3 17:46:08 node1 kernel: [527484.297828] ------------[ cut here ]------------
Jan  3 17:46:08 node1 kernel: [527484.297913] kernel BUG at mm/migrate.c:569!
Jan  3 17:46:08 node1 kernel: [527484.297975] invalid opcode: 0000 [#1] SMP
Jan  3 17:46:08 node1 kernel: [527484.298045] Modules linked in: nfsv3 ip_set ip6table_filter ip6_tables softdog nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bonding ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables nfnetlink_log x_tables nfnetlink xfs intel_powerclamp coretemp amdkfd amd_iommu_v2 kvm_intel kvm radeon snd_pcm ipmi_ssif iTCO_wdt ttm snd_timer crct10dif_pclmul gpio_ich crc32_pclmul snd iTCO_vendor_support ghash_clmulni_intel aesni_intel aes_x86_64 ipmi_si soundcore drm_kms_helper lrw joydev gf128mul pcspkr glue_helper ablk_helper drm hpilo psmouse cryptd serio_raw i7core_edac ipmi_msghandler i2c_algo_bit lpc_ich shpchp wmi acpi_power_meter mac_hid 8250_fintek edac_core vhost_net vhost macvtap macvlan autofs4 zfs(PO) zunicode(PO) zcommon(PO) znvpair(PO) spl(O) zavl(PO) bnx2x hid_generic usbkbd usbmouse ptp pps_core usbhid mdio hid netxen_nic hpsa pata_acpi libcrc32c
Jan  3 17:46:08 node1 kernel: [527484.299854] CPU: 10 PID: 2616 Comm: cfs_loop Tainted: P           O    4.1.3-1-pve #1
Jan  3 17:46:08 node1 kernel: [527484.299968] Hardware name: HP ProLiant DL580 G7, BIOS P65 10/01/2013
Jan  3 17:46:08 node1 kernel: [527484.300058] task: ffff885fa96aa840 ti: ffff885fa39f4000 task.ti: ffff885fa39f4000
Jan  3 17:46:08 node1 kernel: [527484.300165] RIP: 0010:[<ffffffff811e6680>]  [<ffffffff811e6680>] migrate_page+0x50/0x60
Jan  3 17:46:08 node1 kernel: [527484.300295] RSP: 0000:ffff885fa39f7c70  EFLAGS: 00010202
Jan  3 17:46:08 node1 kernel: [527484.300372] RAX: 020a380000002009 RBX: ffffea00f8ea27c0 RCX: 0000000000000000
Jan  3 17:46:08 node1 kernel: [527484.300475] RDX: ffffea004707e780 RSI: ffffea00f8ea27c0 RDI: ffff887fa6aa3e80
Jan  3 17:46:08 node1 kernel: [527484.300579] RBP: ffff885fa39f7ce8 R08: 000000000000007d R09: 0000000000000000
Jan  3 17:46:08 node1 kernel: [527484.300682] R10: ffffffff81cae204 R11: ffffea00f8ea2800 R12: ffffea004707e780
Jan  3 17:46:08 node1 kernel: [527484.300785] R13: 0000000000000001 R14: ffff887fa6aa3e80 R15: 00000000fffffff5
Jan  3 17:46:08 node1 kernel: [527484.300889] FS:  00007fe9ebd03700(0000) GS:ffff883fbf800000(0000) knlGS:0000000000000000
Jan  3 17:46:08 node1 kernel: [527484.301006] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan  3 17:46:08 node1 kernel: [527484.301090] CR2: 00007fe9f2cfcbf8 CR3: 0000007fafdc5000 CR4: 00000000000007e0
Jan  3 17:46:08 node1 kernel: [527484.301193] Stack:
Jan  3 17:46:08 node1 kernel: [527484.301224]  ffffffff811e6886 ffff885fa39f7d78 000000004707e780 0000000000000000
Jan  3 17:46:08 node1 kernel: [527484.301347]  00000000fffffff5 ffff885fa39f7ce8 ffffffff811c143c 000000000007121a
Jan  3 17:46:08 node1 kernel: [527484.301471]  0000000000000302 ffffffff811c00f0 ffffea00f8ea27c0 ffff885fa39f7d78
Jan  3 17:46:08 node1 kernel: [527484.301593] Call Trace:
Jan  3 17:46:08 node1 kernel: [527484.301635]  [<ffffffff811e6886>] ? move_to_new_page+0x1f6/0x230
Jan  3 17:46:08 node1 kernel: [527484.301727]  [<ffffffff811c143c>] ? try_to_unmap+0x5c/0x90
Jan  3 17:46:08 node1 kernel: [527484.301806]  [<ffffffff811c00f0>] ? page_remove_rmap+0x120/0x120
Jan  3 17:46:08 node1 kernel: [527484.301893]  [<ffffffff811e71b2>] migrate_pages+0x7d2/0x810
Jan  3 17:46:08 node1 kernel: [527484.301974]  [<ffffffff811e4e70>] ? remove_migration_ptes+0x50/0x50
Jan  3 17:46:08 node1 kernel: [527484.302065]  [<ffffffff811e78d7>] migrate_misplaced_page+0xc7/0x150
Jan  3 17:46:08 node1 kernel: [527484.302159]  [<ffffffff811b5841>] handle_mm_fault+0xad1/0x1730
Jan  3 17:46:08 node1 kernel: [527484.306563]  [<ffffffff81066ce4>] __do_page_fault+0x1c4/0x490
Jan  3 17:46:08 node1 kernel: [527484.311000]  [<ffffffff81066fdf>] do_page_fault+0x2f/0x80
Jan  3 17:46:08 node1 kernel: [527484.315387]  [<ffffffff8180ae98>] page_fault+0x28/0x30
Jan  3 17:46:08 node1 kernel: [527484.319718] Code: 10 e8 f5 f5 ff ff 85 c0 75 11 48 89 de 4c 89 e7 89 45 ec e8 f3 fa ff ff 8b 45 ec 48 83 c4 10 5b 41 5c 5d c3 0f 1f 80 00 00 00 00 <0f> 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90
Jan  3 17:46:08 node1 kernel: [527484.328964] RIP  [<ffffffff811e6680>] migrate_page+0x50/0x60
Jan  3 17:46:08 node1 kernel: [527484.333507]  RSP <ffff885fa39f7c70>
Jan  3 17:46:08 node1 kernel: [527484.352886] ---[ end trace e541faf8bb18e2a4 ]---
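
For what it's worth, the trace goes through migrate_misplaced_page(), which is the automatic NUMA balancing code migrating pages between sockets on a multi-socket box like this DL580 G7. One thing that could be tried (only an assumption based on the trace, not a confirmed fix) is to disable automatic NUMA balancing; it can be changed at runtime and does not affect running guests.

Code:
# check whether automatic NUMA balancing is enabled (1 = enabled)
cat /proc/sys/kernel/numa_balancing

# turn it off at runtime (no reboot needed, running guests keep running)
echo 0 > /proc/sys/kernel/numa_balancing

# make the setting persistent across reboots (file name is arbitrary)
echo "kernel.numa_balancing = 0" > /etc/sysctl.d/90-numa-balancing.conf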
 
I also have a similar issue with the cluster not responding properly. The VMs still work, but Proxmox is not communicating across the cluster.

I had a similar issue with v3.x, but I could clear it by restarting the cman and pve-cluster services on each node in the cluster. That fixed the issue and did not affect the running guest instances. On v4.0, however, the services have changed and my old procedure no longer works. I was finally able to clear the issue by restarting corosync, pveproxy and pve-manager, BUT this also caused all running guests to shut down and then restart. A surprising and unfortunate result.

I guess I will try upgrading to v4.1 and hopefully never run into this again, but I would like to know if there is a proper process for clearing this issue without restarting any running guest instances.
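
For the upgrade itself, a plain apt dist-upgrade should be enough once a PVE 4 repository is configured. The no-subscription repository line and the file name below are just an example; use the enterprise repository instead if you have a subscription.

Code:
# example PVE 4 no-subscription repository on Debian Jessie
echo "deb http://download.proxmox.com/debian jessie pve-no-subscription" > /etc/apt/sources.list.d/pve-no-subscription.list

apt-get update
apt-get dist-upgrade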

Regards,

-Glen
 
This is because you restarted "pve-manager", which is a "fake" wrapper service that starts/stops all VMs when it is started/stopped.
You don't need to restart it (restarting pveproxy is enough).
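
Assuming the standard PVE 4 service names, a restart sequence that does not touch running guests would look roughly like this (a sketch, try it on one node first):

Code:
# restart the cluster stack first (does not touch running guests)
systemctl restart corosync
systemctl restart pve-cluster

# then the API / web services
systemctl restart pvedaemon
systemctl restart pveproxy
systemctl restart pvestatd

# do NOT restart pve-manager: stopping it shuts down all guests on the node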
 
We had a similar issue in the past with Proxmox 3. When we upgraded our clusters to PVE 4 (and also for new PVE 4 clusters) we changed the network completely: https://pve.proxmox.com/wiki/Proxmox_VE_4.x_Cluster#Requirements

We are using dedicated NICs for cluster communication, on a separate VLAN with no gateway. Since then we have never had any problems with clustering.
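
For example, the dedicated cluster network in /etc/network/interfaces can look like the excerpt below. The interface name, VLAN ID and addresses are placeholders, and the vlan package must be installed for the ethX.Y notation. Corosync then has to be bound to that network (e.g. via the ring0_addr entries in /etc/pve/corosync.conf), as described in the wiki page above.

Code:
# /etc/network/interfaces (excerpt) - dedicated cluster VLAN, no gateway
auto eth2.50
iface eth2.50 inet static
        address 10.10.10.11
        netmask 255.255.255.0
        # deliberately no "gateway" line: cluster traffic stays inside this VLAN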
 
