Hi all,
My nodes in a Proxmox 4 cluster frequently experience a kernel bug and are left in a strange state.
Most services, including the KVM guests, keep running fine, but ssh and pveproxy are unresponsive. I can't even get an iLO console. I have totally lost control of the node and have to reset it every time. Of course, I am also unable to move the HA KVM guests off this node.
I am using an older kernel because of this related issue: https://forum.proxmox.com/threads/new-pve-4-guest-kvm-cant-see-more-than-one-core.24802/#post-124242
Code:
proxmox-ve: 4.0-21 (running kernel: 4.1.3-1-pve)
pve-manager: 4.0-57 (running version: 4.0-57/cc7c2b53)
pve-kernel-4.1.3-1-pve: 4.1.3-7
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.2.3-2-pve: 4.2.3-21
lvm2: 2.02.116-pve1
corosync-pve: 2.3.5-1
libqb0: 0.17.2-1
pve-cluster: 4.0-24
qemu-server: 4.0-35
pve-firmware: 1.1-7
libpve-common-perl: 4.0-36
libpve-access-control: 4.0-9
libpve-storage-perl: 4.0-29
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-12
pve-container: 1.0-21
pve-firewall: 2.0-13
pve-ha-manager: 1.0-13
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.4-3
lxcfs: 0.10-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve6~jessie
And this is the kernel bug I get:
Code:
Jan 3 17:46:08 node1 kernel: [527484.297828] ------------[ cut here ]------------
Jan 3 17:46:08 node1 kernel: [527484.297913] kernel BUG at mm/migrate.c:569!
Jan 3 17:46:08 node1 kernel: [527484.297975] invalid opcode: 0000 [#1] SMP
Jan 3 17:46:08 node1 kernel: [527484.298045] Modules linked in: nfsv3 ip_set ip6table_filter ip6_tables softdog nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bonding ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables nfnetlink_log x_tables nfnetlink xfs intel_powerclamp coretemp amdkfd amd_iommu_v2 kvm_intel kvm radeon snd_pcm ipmi_ssif iTCO_wdt ttm snd_timer crct10dif_pclmul gpio_ich crc32_pclmul snd iTCO_vendor_support ghash_clmulni_intel aesni_intel aes_x86_64 ipmi_si soundcore drm_kms_helper lrw joydev gf128mul pcspkr glue_helper ablk_helper drm hpilo psmouse cryptd serio_raw i7core_edac ipmi_msghandler i2c_algo_bit lpc_ich shpchp wmi acpi_power_meter mac_hid 8250_fintek edac_core vhost_net vhost macvtap macvlan autofs4 zfs(PO) zunicode(PO) zcommon(PO) znvpair(PO) spl(O) zavl(PO) bnx2x hid_generic usbkbd usbmouse ptp pps_core usbhid mdio hid netxen_nic hpsa pata_acpi libcrc32c
Jan 3 17:46:08 node1 kernel: [527484.299854] CPU: 10 PID: 2616 Comm: cfs_loop Tainted: P O 4.1.3-1-pve #1
Jan 3 17:46:08 node1 kernel: [527484.299968] Hardware name: HP ProLiant DL580 G7, BIOS P65 10/01/2013
Jan 3 17:46:08 node1 kernel: [527484.300058] task: ffff885fa96aa840 ti: ffff885fa39f4000 task.ti: ffff885fa39f4000
Jan 3 17:46:08 node1 kernel: [527484.300165] RIP: 0010:[<ffffffff811e6680>] [<ffffffff811e6680>] migrate_page+0x50/0x60
Jan 3 17:46:08 node1 kernel: [527484.300295] RSP: 0000:ffff885fa39f7c70 EFLAGS: 00010202
Jan 3 17:46:08 node1 kernel: [527484.300372] RAX: 020a380000002009 RBX: ffffea00f8ea27c0 RCX: 0000000000000000
Jan 3 17:46:08 node1 kernel: [527484.300475] RDX: ffffea004707e780 RSI: ffffea00f8ea27c0 RDI: ffff887fa6aa3e80
Jan 3 17:46:08 node1 kernel: [527484.300579] RBP: ffff885fa39f7ce8 R08: 000000000000007d R09: 0000000000000000
Jan 3 17:46:08 node1 kernel: [527484.300682] R10: ffffffff81cae204 R11: ffffea00f8ea2800 R12: ffffea004707e780
Jan 3 17:46:08 node1 kernel: [527484.300785] R13: 0000000000000001 R14: ffff887fa6aa3e80 R15: 00000000fffffff5
Jan 3 17:46:08 node1 kernel: [527484.300889] FS: 00007fe9ebd03700(0000) GS:ffff883fbf800000(0000) knlGS:0000000000000000
Jan 3 17:46:08 node1 kernel: [527484.301006] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 3 17:46:08 node1 kernel: [527484.301090] CR2: 00007fe9f2cfcbf8 CR3: 0000007fafdc5000 CR4: 00000000000007e0
Jan 3 17:46:08 node1 kernel: [527484.301193] Stack:
Jan 3 17:46:08 node1 kernel: [527484.301224] ffffffff811e6886 ffff885fa39f7d78 000000004707e780 0000000000000000
Jan 3 17:46:08 node1 kernel: [527484.301347] 00000000fffffff5 ffff885fa39f7ce8 ffffffff811c143c 000000000007121a
Jan 3 17:46:08 node1 kernel: [527484.301471] 0000000000000302 ffffffff811c00f0 ffffea00f8ea27c0 ffff885fa39f7d78
Jan 3 17:46:08 node1 kernel: [527484.301593] Call Trace:
Jan 3 17:46:08 node1 kernel: [527484.301635] [<ffffffff811e6886>] ? move_to_new_page+0x1f6/0x230
Jan 3 17:46:08 node1 kernel: [527484.301727] [<ffffffff811c143c>] ? try_to_unmap+0x5c/0x90
Jan 3 17:46:08 node1 kernel: [527484.301806] [<ffffffff811c00f0>] ? page_remove_rmap+0x120/0x120
Jan 3 17:46:08 node1 kernel: [527484.301893] [<ffffffff811e71b2>] migrate_pages+0x7d2/0x810
Jan 3 17:46:08 node1 kernel: [527484.301974] [<ffffffff811e4e70>] ? remove_migration_ptes+0x50/0x50
Jan 3 17:46:08 node1 kernel: [527484.302065] [<ffffffff811e78d7>] migrate_misplaced_page+0xc7/0x150
Jan 3 17:46:08 node1 kernel: [527484.302159] [<ffffffff811b5841>] handle_mm_fault+0xad1/0x1730
Jan 3 17:46:08 node1 kernel: [527484.306563] [<ffffffff81066ce4>] __do_page_fault+0x1c4/0x490
Jan 3 17:46:08 node1 kernel: [527484.311000] [<ffffffff81066fdf>] do_page_fault+0x2f/0x80
Jan 3 17:46:08 node1 kernel: [527484.315387] [<ffffffff8180ae98>] page_fault+0x28/0x30
Jan 3 17:46:08 node1 kernel: [527484.319718] Code: 10 e8 f5 f5 ff ff 85 c0 75 11 48 89 de 4c 89 e7 89 45 ec e8 f3 fa ff ff 8b 45 ec 48 83 c4 10 5b 41 5c 5d c3 0f 1f 80 00 00 00 00 <0f> 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90
Jan 3 17:46:08 node1 kernel: [527484.328964] RIP [<ffffffff811e6680>] migrate_page+0x50/0x60
Jan 3 17:46:08 node1 kernel: [527484.333507] RSP <ffff885fa39f7c70>
Jan 3 17:46:08 node1 kernel: [527484.352886] ---[ end trace e541faf8bb18e2a4 ]---
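
In case it helps anyone compare traces across nodes, here is a rough Python sketch that pulls the BUG location and the call-trace function names out of a log like the one above. The names `summarize_oops`, `BUG_RE` and `FRAME_RE` are made up for illustration, and the sample is just a few lines copied from my log. Frames that the kernel prints with a leading `?` are flagged as not reliable.

```python
import re

# "kernel BUG at mm/migrate.c:569!" -> source file and line number
BUG_RE = re.compile(r"kernel BUG at (\S+):(\d+)!")
# "[<ffffffff811e71b2>] migrate_pages+0x7d2/0x810", optionally with "? " before the name
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z0-9_.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_oops(log_text):
    """Return ((file, line) or None, [(function, reliable), ...])."""
    bug = None
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        m = BUG_RE.search(line)
        if m:
            bug = (m.group(1), int(m.group(2)))
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            f = FRAME_RE.search(line)
            if f is None:
                # First non-frame line (e.g. "Code: ...") ends the trace.
                in_trace = False
                continue
            # A frame without the "? " prefix is treated as reliable.
            frames.append((f.group(2), f.group(1) is None))
    return bug, frames


sample = """\
Jan 3 17:46:08 node1 kernel: [527484.297913] kernel BUG at mm/migrate.c:569!
Jan 3 17:46:08 node1 kernel: [527484.301593] Call Trace:
Jan 3 17:46:08 node1 kernel: [527484.301635] [<ffffffff811e6886>] ? move_to_new_page+0x1f6/0x230
Jan 3 17:46:08 node1 kernel: [527484.301893] [<ffffffff811e71b2>] migrate_pages+0x7d2/0x810
Jan 3 17:46:08 node1 kernel: [527484.302065] [<ffffffff811e78d7>] migrate_misplaced_page+0xc7/0x150
Jan 3 17:46:08 node1 kernel: [527484.319718] Code: 10 e8 f5
"""

if __name__ == "__main__":
    bug, frames = summarize_oops(sample)
    print("BUG at:", bug)
    for name, reliable in frames:
        print(("  " if reliable else "? ") + name)
```

Running the same function over the kernel log of each affected node should show whether they all hit the same `mm/migrate.c` line through the same call path.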