I have been having regular crashes for the past few months now. Sometimes my Proxmox box can't even manage to run for a week. The symptoms are usually as follows: one of my VMs goes to 100% CPU load and the Proxmox web interface shows a (?) for all servers. I'm not sure in which order these two things happen. The affected VM is unavailable at that point. The other VMs usually keep running, but with a performance hit (this is a small server with only a single dual-core CPU). The performance hit is worst on the containers, as the host load usually jumps to 15 or more. Whenever I try to do something about it, such as shutting down the affected VM, the whole Proxmox box usually goes offline within 15 minutes.
For a long time I thought this was a memory issue, as my box always had near 100% memory usage. Recently I worked on that (replaced a VM with a container, decreased the memory of some others and shut down a VM I don't use very often), and now I have about 50% memory usage, but it still happens.
Today, the same thing: one VM became completely unresponsive, the container too, and the load on the host climbed over the course of about 10 hours to 500 (!!).
This is what I found in the logs of the host:
Code:
[249494.540871] BUG: unable to handle kernel paging request at ffff880017a06d80
[249494.540915] IP: _raw_spin_lock_irqsave+0x22/0x40
[249494.540943] PGD 0 P4D 0
[249494.540972] Oops: 0002 [#1] SMP NOPTI
[249494.541002] Modules linked in: veth nfsv3 nfs_acl nfs lockd grace fscache ebtable_filter ebtables ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables xt_mac xt_NFLOG ipt_REJECT nf_reject_ipv4 xt_physdev xt_tcpudp xt_comment xt_addrtype xt_multiport xt_conntrack xt_set xt_mark ip_set_hash_net ip_set iptable_filter openvswitch nsh nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack softdog nfnetlink_log nfnetlink dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c wmi_bmof ppdev snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic edac_mce_amd radeon ttm drm_kms_helper drm snd_hda_intel i2c_algo_bit kvm_amd kvm snd_hda_codec fb_sys_fops syscopyarea snd_hda_core sysfillrect snd_hwdep irqbypass sysimgblt pcspkr snd_pcm
[249494.541091] k10temp serio_raw snd_timer snd soundcore pl2303 usbserial wmi parport_pc parport mac_hid shpchp zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables autofs4 uas usb_storage pata_acpi psmouse i2c_piix4 pata_atiixp r8169 mii ahci libahci
[249494.541162] CPU: 1 PID: 2290 Comm: kvm Tainted: P O 4.15.18-4-pve #1
[249494.541194] Hardware name: MICRO-STAR INTERNATIONAL CO.,LTD MS-7596/785GM-E51 (MS-7596), BIOS V2.12 02/18/2011
[249494.541228] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
[249494.541257] RSP: 0018:ffffb122c5297a20 EFLAGS: 00010046
[249494.541287] RAX: 0000000000000000 RBX: 0000000000000286 RCX: ffff8fdec5a02088
[249494.541319] RDX: 0000000000000001 RSI: ffff8fdcd23990a0 RDI: ffff880017a06d80
[249494.541352] RBP: ffffb122c5297a28 R08: ffff8fdecbcb0a00 R09: 0000000000000042
[249494.541383] R10: ffff8fdecbcb0a38 R11: 000000000000028e R12: ffff880017a06d80
[249494.541415] R13: ffff8fdc2e399010 R14: ffff8fdc2e399000 R15: 0000000000000009
[249494.541446] FS: 00007f82a77b9fc0(0000) GS:ffff8fdf1fc40000(0000) knlGS:0000000000000000
[249494.541478] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[249494.541507] CR2: ffff880017a06d80 CR3: 00000003c82c0000 CR4: 00000000000006e0
[249494.541538] Call Trace:
[249494.541571] remove_wait_queue+0x17/0x60
[249494.541602] poll_freewait+0x6f/0xb0
[249494.541631] do_sys_poll+0x3a8/0x5d0
[249494.541689] ? ioapic_service+0x11f/0x140 [kvm]
[249494.541719] ? compat_poll_select_copy_remaining+0x140/0x140
[249494.541749] ? compat_poll_select_copy_remaining+0x140/0x140
[249494.541779] ? compat_poll_select_copy_remaining+0x140/0x140
[249494.541809] ? compat_poll_select_copy_remaining+0x140/0x140
[249494.541840] ? compat_poll_select_copy_remaining+0x140/0x140
[249494.541870] ? compat_poll_select_copy_remaining+0x140/0x140
[249494.541900] ? compat_poll_select_copy_remaining+0x140/0x140
[249494.541930] ? compat_poll_select_copy_remaining+0x140/0x140
[249494.541960] ? compat_poll_select_copy_remaining+0x140/0x140
[249494.541990] SyS_ppoll+0x166/0x180
[249494.542019] ? SyS_ppoll+0x166/0x180
[249494.542049] ? SyS_ioctl+0x63/0x90
[249494.542079] do_syscall_64+0x73/0x130
[249494.542109] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[249494.542138] RIP: 0033:0x7f828ef63741
[249494.542167] RSP: 002b:00007ffd3ca6bc20 EFLAGS: 00000293 ORIG_RAX: 000000000000010f
[249494.542199] RAX: ffffffffffffffda RBX: 00007f823dba7f00 RCX: 00007f828ef63741
[249494.542231] RDX: 00007ffd3ca6bc30 RSI: 000000000000000c RDI: 00007f823dba7f00
[249494.542262] RBP: 000000000000000c R08: 0000000000000008 R09: 0000000000000000
[249494.542294] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
[249494.542325] R13: 00007f8283065e80 R14: 0000558804f958e0 R15: 0000558804f95900
[249494.542356] Code: b1 6c ff 5d c3 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 53 9c 58 0f 1f 44 00 00 48 89 c3 fa 66 0f 1f 44 00 00 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 06 48 89 d8 5b 5d c3 89 c6 e8 09 67 71 ff
[249494.542413] RIP: _raw_spin_lock_irqsave+0x22/0x40 RSP: ffffb122c5297a20
[249494.542442] CR2: ffff880017a06d80
[249494.542471] ---[ end trace 88a1ae6808741842 ]---
[249498.378265] BUG: unable to handle kernel paging request at ffffb122c5297c08
[249498.378307] IP: pollwake+0x53/0x90
[249498.378335] PGD 40f535067 P4D 40f535067 PUD 40f542067 PMD 406f2f067 PTE 0
[249498.378368] Oops: 0000 [#2] SMP NOPTI
[249498.378397] Modules linked in: veth nfsv3 nfs_acl nfs lockd grace fscache ebtable_filter ebtables ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables xt_mac xt_NFLOG ipt_REJECT nf_reject_ipv4 xt_physdev xt_tcpudp xt_comment xt_addrtype xt_multiport xt_conntrack xt_set xt_mark ip_set_hash_net ip_set iptable_filter openvswitch nsh nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack softdog nfnetlink_log nfnetlink dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c wmi_bmof ppdev snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic edac_mce_amd radeon ttm drm_kms_helper drm snd_hda_intel i2c_algo_bit kvm_amd kvm snd_hda_codec fb_sys_fops syscopyarea snd_hda_core sysfillrect snd_hwdep irqbypass sysimgblt pcspkr snd_pcm
[249498.378508] k10temp serio_raw snd_timer snd soundcore pl2303 usbserial wmi parport_pc parport mac_hid shpchp zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables autofs4 uas usb_storage pata_acpi psmouse i2c_piix4 pata_atiixp r8169 mii ahci libahci
[249498.378576] CPU: 0 PID: 2378 Comm: kvm Tainted: P D O 4.15.18-4-pve #1
[249498.378608] Hardware name: MICRO-STAR INTERNATIONAL CO.,LTD MS-7596/785GM-E51 (MS-7596), BIOS V2.12 02/18/2011
[249498.378642] RIP: 0010:pollwake+0x53/0x90
[249498.378670] RSP: 0018:ffffb122c53278e8 EFLAGS: 00010002
[249498.378700] RAX: ffffb122c5297bf0 RBX: 0000000000000000 RCX: 0000000000000001
[249498.378731] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff8fdc2e3990a0
[249498.378762] RBP: ffffb122c5327918 R08: 0000000000000001 R09: 0000000000000000
[249498.378794] R10: ffffb381401f1008 R11: ffff8fde8db90008 R12: 0000000000000000
[249498.378825] R13: ffff8fdeab3df7f8 R14: ffff8fdeab3df810 R15: 0000000000000000
[249498.378857] FS: 0000000000000000(0000) GS:ffff8fdf1fc00000(0000) knlGS:0000000000000000
[249498.378888] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[249498.378917] CR2: ffffb122c5297c08 CR3: 00000003c82c0000 CR4: 00000000000006f0
[249498.378949] Call Trace:
[249498.378981] __wake_up_common+0x8d/0x140
[249498.379011] __wake_up_locked_key+0x1b/0x20
[249498.379040] eventfd_signal+0x5c/0x80
[249498.379096] ioeventfd_write+0x60/0x80 [kvm]
[249498.379135] __kvm_io_bus_write+0x8b/0xc0 [kvm]
[249498.379175] kvm_io_bus_write+0x54/0x80 [kvm]
[249498.379216] write_mmio+0x7e/0x110 [kvm]
[249498.379257] emulator_read_write_onepage+0x114/0x300 [kvm]
[249498.379298] emulator_read_write+0xd0/0x180 [kvm]
[249498.379338] ? kvm_vcpu_read_guest_page+0xe1/0x110 [kvm]
[249498.379379] emulator_write_emulated+0x15/0x20 [kvm]
[249498.379420] segmented_write+0x5f/0x80 [kvm]
[249498.379462] writeback+0x12f/0x260 [kvm]
[249498.379504] x86_emulate_insn+0x72e/0xd40 [kvm]
[249498.379545] x86_emulate_instruction+0x1f2/0x6e0 [kvm]
[249498.379587] kvm_mmu_page_fault+0xcc/0x160 [kvm]
[249498.379619] npf_interception+0x4c/0xa0 [kvm_amd]
[249498.379650] handle_exit+0x128/0xa10 [kvm_amd]
[249498.379691] kvm_arch_vcpu_ioctl_run+0x935/0x16c0 [kvm]
[249498.379722] ? svm_vcpu_load+0x115/0x140 [kvm_amd]
[249498.379763] ? kvm_arch_vcpu_load+0x68/0x250 [kvm]
[249498.379802] kvm_vcpu_ioctl+0x339/0x620 [kvm]
[249498.379841] ? kvm_vcpu_ioctl+0x339/0x620 [kvm]
[249498.379871] ? __switch_to_asm+0x34/0x70
[249498.379900] ? __switch_to_asm+0x40/0x70
[249498.379929] ? __switch_to_asm+0x34/0x70
[249498.379958] ? __switch_to_asm+0x40/0x70
[249498.379988] do_vfs_ioctl+0xa6/0x620
[249498.380017] SyS_ioctl+0x79/0x90
[249498.380047] do_syscall_64+0x73/0x130
[249498.380076] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[249498.380106] RIP: 0033:0x7f828ef64dd7
[249498.380134] RSP: 002b:00007f82817fc538 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[249498.380166] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f828ef64dd7
[249498.380197] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000025
[249498.380229] RBP: 00007f828338a000 R08: 0000558804bb5350 R09: 000000000000ffff
[249498.380260] R10: 00007f82a78a0000 R11: 0000000000000246 R12: 0000000000000000
[249498.380291] R13: 00007f82a789f000 R14: 0000000000000000 R15: 00007f828338a000
[249498.380323] Code: 8b 47 08 48 c7 45 d8 00 00 00 00 48 c7 45 e0 00 00 00 00 48 c7 45 d0 00 00 00 00 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 <48> 8b 78 18 48 c7 45 e0 d0 96 eb ac 48 89 7d d8 48 8d 7d d0 c7
[249498.380380] RIP: pollwake+0x53/0x90 RSP: ffffb122c53278e8
[249498.380409] CR2: ffffb122c5297c08
[249498.380438] ---[ end trace 88a1ae6808741843 ]---
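For anyone who wants to pull the key facts out of a long dmesg dump like the one above, here is a minimal sketch in Python. It assumes the log lines look exactly like those in my paste (a `BUG:` header, followed by `IP:` and `CPU: ... PID: ... Comm:` lines); the function name `summarize_oopses` and the regular expressions are my own invention for illustration, not part of any Proxmox or kernel tool.

```python
import re

# Header line of each oops, e.g.
# "[249494.540871] BUG: unable to handle kernel paging request at ffff880017a06d80"
BUG_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\] BUG: (?P<msg>.+?) at (?P<addr>[0-9a-f]+)$"
)
# "[249494.540915] IP: _raw_spin_lock_irqsave+0x22/0x40"
IP_RE = re.compile(r"^\[\s*\d+\.\d+\] IP: (?P<ip>\S+)")
# "[249494.541162] CPU: 1 PID: 2290 Comm: kvm Tainted: ..."
CPU_RE = re.compile(
    r"^\[\s*\d+\.\d+\] CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) Comm: (?P<comm>\S+)"
)


def summarize_oopses(log_text):
    """Return one dict per 'BUG:' entry found in dmesg-style text."""
    oopses = []
    current = None
    for line in log_text.splitlines():
        line = line.strip()
        m = BUG_RE.match(line)
        if m:
            # A new oops starts here; later lines fill in its details.
            current = {
                "time": float(m["ts"]),
                "message": m["msg"],
                "address": m["addr"],
            }
            oopses.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first BUG: line
        m = IP_RE.match(line)
        if m:
            current.setdefault("ip", m["ip"])
            continue
        m = CPU_RE.match(line)
        if m:
            current.setdefault("cpu", int(m["cpu"]))
            current.setdefault("pid", int(m["pid"]))
            current.setdefault("comm", m["comm"])
    return oopses


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
[249494.540871] BUG: unable to handle kernel paging request at ffff880017a06d80
[249494.540915] IP: _raw_spin_lock_irqsave+0x22/0x40
[249494.541162] CPU: 1 PID: 2290 Comm: kvm Tainted: P O 4.15.18-4-pve #1
[249498.378265] BUG: unable to handle kernel paging request at ffffb122c5297c08
[249498.378307] IP: pollwake+0x53/0x90
[249498.378576] CPU: 0 PID: 2378 Comm: kvm Tainted: P D O 4.15.18-4-pve #1
"""

for oops in summarize_oopses(SAMPLE):
    print(oops)
```

Run against the full log, this should list two oopses about four seconds apart, both in `kvm` processes (PIDs 2290 and 2378), which is at least a compact way to compare crashes across reboots.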
PVE 5.2-9
Code:
jeroen@proxmox:~$ uname -a
Linux proxmox 4.15.18-4-pve #1 SMP PVE 4.15.18-23 (Thu, 30 Aug 2018 13:04:08 +0200) x86_64 GNU/Linux
I hope someone has an idea what's going on...
Thanks,
Jeroen