Regular crashes

Discussion in 'Proxmox VE: Installation and configuration' started by jeroenrnl, Oct 1, 2018.

  1. jeroenrnl

    jeroenrnl New Member

    Joined:
    Oct 1, 2018
    Messages:
    3
    Likes Received:
    0
    I have been having regular crashes for the past few months. Sometimes my Proxmox box can't even manage to run for a week. The symptoms are usually as follows: one of my VMs goes to 100% CPU load and the Proxmox web interface shows a (?) for all servers. I'm not sure in which order this happens. That particular server is unavailable at that point. The other VMs usually keep running but with a performance hit (this is a small server with only one dual-core CPU); the performance hit is worst on the containers, as the host load usually jumps to 15 or more. Whenever I try to do something about it, such as shutting down the affected server, the whole Proxmox box usually goes offline within 15 minutes.

    I have thought for a long time that this was a memory issue, as my box always had near 100% memory usage. But recently I worked on that (replaced a VM with a container, decreased memory on some others, and shut down a VM I don't use often) and now I have about 50% memory usage, but it still happens.

    Today, the same thing: one VM became completely unresponsive, the container too, and the load on the host climbed over the course of about 10 hours to 500 (!!).

    This is what I found in the logs of the host:

    Code:
    [249494.540871] BUG: unable to handle kernel paging request at ffff880017a06d80
    [249494.540915] IP: _raw_spin_lock_irqsave+0x22/0x40
    [249494.540943] PGD 0 P4D 0 
    [249494.540972] Oops: 0002 [#1] SMP NOPTI
    [249494.541002] Modules linked in: veth nfsv3 nfs_acl nfs lockd grace fscache ebtable_filter ebtables ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables xt_mac xt_NFLOG ipt_REJECT nf_reject_ipv4 xt_physdev xt_tcpudp xt_comment xt_addrtype xt_multiport xt_conntrack xt_set xt_mark ip_set_hash_net ip_set iptable_filter openvswitch nsh nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack softdog nfnetlink_log nfnetlink dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c wmi_bmof ppdev snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic edac_mce_amd radeon ttm drm_kms_helper drm snd_hda_intel i2c_algo_bit kvm_amd kvm snd_hda_codec fb_sys_fops syscopyarea snd_hda_core sysfillrect snd_hwdep irqbypass sysimgblt pcspkr snd_pcm
    [249494.541091]  k10temp serio_raw snd_timer snd soundcore pl2303 usbserial wmi parport_pc parport mac_hid shpchp zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables autofs4 uas usb_storage pata_acpi psmouse i2c_piix4 pata_atiixp r8169 mii ahci libahci
    [249494.541162] CPU: 1 PID: 2290 Comm: kvm Tainted: P           O     4.15.18-4-pve #1
    [249494.541194] Hardware name: MICRO-STAR INTERNATIONAL CO.,LTD MS-7596/785GM-E51 (MS-7596), BIOS V2.12 02/18/2011
    [249494.541228] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
    [249494.541257] RSP: 0018:ffffb122c5297a20 EFLAGS: 00010046
    [249494.541287] RAX: 0000000000000000 RBX: 0000000000000286 RCX: ffff8fdec5a02088
    [249494.541319] RDX: 0000000000000001 RSI: ffff8fdcd23990a0 RDI: ffff880017a06d80
    [249494.541352] RBP: ffffb122c5297a28 R08: ffff8fdecbcb0a00 R09: 0000000000000042
    [249494.541383] R10: ffff8fdecbcb0a38 R11: 000000000000028e R12: ffff880017a06d80
    [249494.541415] R13: ffff8fdc2e399010 R14: ffff8fdc2e399000 R15: 0000000000000009
    [249494.541446] FS:  00007f82a77b9fc0(0000) GS:ffff8fdf1fc40000(0000) knlGS:0000000000000000
    [249494.541478] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [249494.541507] CR2: ffff880017a06d80 CR3: 00000003c82c0000 CR4: 00000000000006e0
    [249494.541538] Call Trace:
    [249494.541571]  remove_wait_queue+0x17/0x60
    [249494.541602]  poll_freewait+0x6f/0xb0
    [249494.541631]  do_sys_poll+0x3a8/0x5d0
    [249494.541689]  ? ioapic_service+0x11f/0x140 [kvm]
    [249494.541719]  ? compat_poll_select_copy_remaining+0x140/0x140
    [249494.541749]  ? compat_poll_select_copy_remaining+0x140/0x140
    [249494.541779]  ? compat_poll_select_copy_remaining+0x140/0x140
    [249494.541809]  ? compat_poll_select_copy_remaining+0x140/0x140
    [249494.541840]  ? compat_poll_select_copy_remaining+0x140/0x140
    [249494.541870]  ? compat_poll_select_copy_remaining+0x140/0x140
    [249494.541900]  ? compat_poll_select_copy_remaining+0x140/0x140
    [249494.541930]  ? compat_poll_select_copy_remaining+0x140/0x140
    [249494.541960]  ? compat_poll_select_copy_remaining+0x140/0x140
    [249494.541990]  SyS_ppoll+0x166/0x180
    [249494.542019]  ? SyS_ppoll+0x166/0x180
    [249494.542049]  ? SyS_ioctl+0x63/0x90
    [249494.542079]  do_syscall_64+0x73/0x130
    [249494.542109]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    [249494.542138] RIP: 0033:0x7f828ef63741
    [249494.542167] RSP: 002b:00007ffd3ca6bc20 EFLAGS: 00000293 ORIG_RAX: 000000000000010f
    [249494.542199] RAX: ffffffffffffffda RBX: 00007f823dba7f00 RCX: 00007f828ef63741
    [249494.542231] RDX: 00007ffd3ca6bc30 RSI: 000000000000000c RDI: 00007f823dba7f00
    [249494.542262] RBP: 000000000000000c R08: 0000000000000008 R09: 0000000000000000
    [249494.542294] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
    [249494.542325] R13: 00007f8283065e80 R14: 0000558804f958e0 R15: 0000558804f95900
    [249494.542356] Code: b1 6c ff 5d c3 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 53 9c 58 0f 1f 44 00 00 48 89 c3 fa 66 0f 1f 44 00 00 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 06 48 89 d8 5b 5d c3 89 c6 e8 09 67 71 ff 
    [249494.542413] RIP: _raw_spin_lock_irqsave+0x22/0x40 RSP: ffffb122c5297a20
    [249494.542442] CR2: ffff880017a06d80
    [249494.542471] ---[ end trace 88a1ae6808741842 ]---
    [249498.378265] BUG: unable to handle kernel paging request at ffffb122c5297c08
    [249498.378307] IP: pollwake+0x53/0x90
    [249498.378335] PGD 40f535067 P4D 40f535067 PUD 40f542067 PMD 406f2f067 PTE 0
    [249498.378368] Oops: 0000 [#2] SMP NOPTI
    [249498.378397] Modules linked in: veth nfsv3 nfs_acl nfs lockd grace fscache ebtable_filter ebtables ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables xt_mac xt_NFLOG ipt_REJECT nf_reject_ipv4 xt_physdev xt_tcpudp xt_comment xt_addrtype xt_multiport xt_conntrack xt_set xt_mark ip_set_hash_net ip_set iptable_filter openvswitch nsh nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack softdog nfnetlink_log nfnetlink dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c wmi_bmof ppdev snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic edac_mce_amd radeon ttm drm_kms_helper drm snd_hda_intel i2c_algo_bit kvm_amd kvm snd_hda_codec fb_sys_fops syscopyarea snd_hda_core sysfillrect snd_hwdep irqbypass sysimgblt pcspkr snd_pcm
    [249498.378508]  k10temp serio_raw snd_timer snd soundcore pl2303 usbserial wmi parport_pc parport mac_hid shpchp zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables autofs4 uas usb_storage pata_acpi psmouse i2c_piix4 pata_atiixp r8169 mii ahci libahci
    [249498.378576] CPU: 0 PID: 2378 Comm: kvm Tainted: P      D    O     4.15.18-4-pve #1
    [249498.378608] Hardware name: MICRO-STAR INTERNATIONAL CO.,LTD MS-7596/785GM-E51 (MS-7596), BIOS V2.12 02/18/2011
    [249498.378642] RIP: 0010:pollwake+0x53/0x90
    [249498.378670] RSP: 0018:ffffb122c53278e8 EFLAGS: 00010002
    [249498.378700] RAX: ffffb122c5297bf0 RBX: 0000000000000000 RCX: 0000000000000001
    [249498.378731] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff8fdc2e3990a0
    [249498.378762] RBP: ffffb122c5327918 R08: 0000000000000001 R09: 0000000000000000
    [249498.378794] R10: ffffb381401f1008 R11: ffff8fde8db90008 R12: 0000000000000000
    [249498.378825] R13: ffff8fdeab3df7f8 R14: ffff8fdeab3df810 R15: 0000000000000000
    [249498.378857] FS:  0000000000000000(0000) GS:ffff8fdf1fc00000(0000) knlGS:0000000000000000
    [249498.378888] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [249498.378917] CR2: ffffb122c5297c08 CR3: 00000003c82c0000 CR4: 00000000000006f0
    [249498.378949] Call Trace:
    [249498.378981]  __wake_up_common+0x8d/0x140
    [249498.379011]  __wake_up_locked_key+0x1b/0x20
    [249498.379040]  eventfd_signal+0x5c/0x80
    [249498.379096]  ioeventfd_write+0x60/0x80 [kvm]
    [249498.379135]  __kvm_io_bus_write+0x8b/0xc0 [kvm]
    [249498.379175]  kvm_io_bus_write+0x54/0x80 [kvm]
    [249498.379216]  write_mmio+0x7e/0x110 [kvm]
    [249498.379257]  emulator_read_write_onepage+0x114/0x300 [kvm]
    [249498.379298]  emulator_read_write+0xd0/0x180 [kvm]
    [249498.379338]  ? kvm_vcpu_read_guest_page+0xe1/0x110 [kvm]
    [249498.379379]  emulator_write_emulated+0x15/0x20 [kvm]
    [249498.379420]  segmented_write+0x5f/0x80 [kvm]
    [249498.379462]  writeback+0x12f/0x260 [kvm]
    [249498.379504]  x86_emulate_insn+0x72e/0xd40 [kvm]
    [249498.379545]  x86_emulate_instruction+0x1f2/0x6e0 [kvm]
    [249498.379587]  kvm_mmu_page_fault+0xcc/0x160 [kvm]
    [249498.379619]  npf_interception+0x4c/0xa0 [kvm_amd]
    [249498.379650]  handle_exit+0x128/0xa10 [kvm_amd]
    [249498.379691]  kvm_arch_vcpu_ioctl_run+0x935/0x16c0 [kvm]
    [249498.379722]  ? svm_vcpu_load+0x115/0x140 [kvm_amd]
    [249498.379763]  ? kvm_arch_vcpu_load+0x68/0x250 [kvm]
    [249498.379802]  kvm_vcpu_ioctl+0x339/0x620 [kvm]
    [249498.379841]  ? kvm_vcpu_ioctl+0x339/0x620 [kvm]
    [249498.379871]  ? __switch_to_asm+0x34/0x70
    [249498.379900]  ? __switch_to_asm+0x40/0x70
    [249498.379929]  ? __switch_to_asm+0x34/0x70
    [249498.379958]  ? __switch_to_asm+0x40/0x70
    [249498.379988]  do_vfs_ioctl+0xa6/0x620
    [249498.380017]  SyS_ioctl+0x79/0x90
    [249498.380047]  do_syscall_64+0x73/0x130
    [249498.380076]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    [249498.380106] RIP: 0033:0x7f828ef64dd7
    [249498.380134] RSP: 002b:00007f82817fc538 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    [249498.380166] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f828ef64dd7
    [249498.380197] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000025
    [249498.380229] RBP: 00007f828338a000 R08: 0000558804bb5350 R09: 000000000000ffff
    [249498.380260] R10: 00007f82a78a0000 R11: 0000000000000246 R12: 0000000000000000
    [249498.380291] R13: 00007f82a789f000 R14: 0000000000000000 R15: 00007f828338a000
    [249498.380323] Code: 8b 47 08 48 c7 45 d8 00 00 00 00 48 c7 45 e0 00 00 00 00 48 c7 45 d0 00 00 00 00 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 <48> 8b 78 18 48 c7 45 e0 d0 96 eb ac 48 89 7d d8 48 8d 7d d0 c7 
    [249498.380380] RIP: pollwake+0x53/0x90 RSP: ffffb122c53278e8
    [249498.380409] CR2: ffffb122c5297c08
    [249498.380438] ---[ end trace 88a1ae6808741843 ]---
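
    A trace this long is easier to compare across crashes when reduced to a few fields. The following is a minimal sketch, not a Proxmox or kernel tool: the function name `summarize_oopses` and the regular expressions are assumptions based only on the line formats visible in the log above (`BUG:`, `IP:`, and `CPU: ... PID: ... Comm: ...`).

```python
import re

# Patterns modelled on the dmesg lines shown above; other kernel
# versions may format these lines differently (assumption).
BUG_RE = re.compile(r"\[\s*(?P<ts>\d+\.\d+)\] BUG: (?P<msg>.*)")
IP_RE = re.compile(r"\] IP: (?P<sym>\S+)")
COMM_RE = re.compile(r"\] CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) Comm: (?P<comm>\S+)")


def summarize_oopses(log_text):
    """Return one dict per 'BUG:' report found in a dmesg-style log."""
    reports = []
    for line in log_text.splitlines():
        m = BUG_RE.search(line)
        if m:
            # A new report starts at every 'BUG:' line.
            reports.append({"time": float(m["ts"]), "bug": m["msg"].strip()})
            continue
        if not reports:
            continue  # ignore anything before the first report
        current = reports[-1]
        m = IP_RE.search(line)
        if m and "ip" not in current:
            current["ip"] = m["sym"]  # faulting function+offset
            continue
        m = COMM_RE.search(line)
        if m and "comm" not in current:
            current.update(cpu=int(m["cpu"]), pid=int(m["pid"]), comm=m["comm"])
    return reports
```

    Fed the log above (for example, saved from `dmesg` to a text file), this should yield two reports, both with `comm` equal to `kvm`: one faulting in `_raw_spin_lock_irqsave` and one in `pollwake`.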
    
    PVE 5.2-9
    Code:
    jeroen@proxmox:~$ uname -a
    Linux proxmox 4.15.18-4-pve #1 SMP PVE 4.15.18-23 (Thu, 30 Aug 2018 13:04:08 +0200) x86_64 GNU/Linux
    
    I hope someone has an idea what's going on...

    Thanks,
    Jeroen
     
  2. wolfgang

    wolfgang Proxmox Staff Member
    Staff Member

    Joined:
    Oct 1, 2014
    Messages:
    4,076
    Likes Received:
    251
    Hi,

    this looks like a driver bug.
    Do you have this problem only with this kernel?
     
  3. jeroenrnl

    jeroenrnl New Member

    Joined:
    Oct 1, 2018
    Messages:
    3
    Likes Received:
    0
    Hi Wolfgang, thanks for your reply. No, I've had it with previous kernels. I can see from my logs that I faced the same problem with at least 4.15.17-3-pve. Unfortunately I don't have any older logs, but I'm sure I've had this problem for longer than that. As I said, I thought it was caused by running out of memory, so I focussed on that.
     
  4. wolfgang

    wolfgang Proxmox Staff Member
    Staff Member

    Joined:
    Oct 1, 2014
    Messages:
    4,076
    Likes Received:
    251
    Do you use swap?
    Can you send a Hardware List of this machine?
     
  5. jeroenrnl

    jeroenrnl New Member

    Joined:
    Oct 1, 2018
    Messages:
    3
    Likes Received:
    0
    Yes. I am using swap on this machine. I have a 16GB USB stick that I use for swap.

    Since my last message I have done a few things:
    - Stopped using swap, mostly because I wanted to do a bad sector check on the flash disk, but I've left it like that for now
    - Done a bad sector check on the USB stick --> all ok
    - I found out that the NFS server, which runs on one of the VMs and is used by some VMs and also via a shared folder in one of the LXCs, was still configured with the default 8 threads. I can imagine that this can cause a high load, so I have changed it to 32 threads

    Hardware list: see attached output of lshw.
     

    Attached Files:

  6. wolfgang

    wolfgang Proxmox Staff Member
    Staff Member

    Joined:
    Oct 1, 2014
    Messages:
    4,076
    Likes Received:
    251
    I would guess this is the origin of your problems.
    USB has a huge overhead, and swap becomes a problem if it is not fast enough.
    I would recommend using a regular SATA, SAS, or NVMe disk for swap.
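
    To see which devices currently back swap and whether one of them sits behind USB, something like the following could be used. It is only a rough sketch: `parse_proc_swaps` and `is_usb_backed` are hypothetical helper names, and the USB check is a heuristic that assumes the resolved sysfs path of a USB-attached block device contains a `/usb` component.

```python
import os


def parse_proc_swaps(text):
    """Parse the contents of /proc/swaps into a list of dicts."""
    entries = []
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        entries.append({
            "path": fields[0],
            "type": fields[1],           # "partition" or "file"
            "size_kib": int(fields[2]),
            "used_kib": int(fields[3]),
            "priority": int(fields[4]),
        })
    return entries


def is_usb_backed(device_path, sys_class_block="/sys/class/block"):
    """Heuristic: True if the device's sysfs path runs through a USB bus.

    Only meaningful for block devices; swap files are reported as False.
    """
    name = os.path.basename(os.path.realpath(device_path))
    sys_path = os.path.realpath(os.path.join(sys_class_block, name))
    return "/usb" in sys_path


# Example use on a live Linux host:
#   with open("/proc/swaps") as f:
#       for entry in parse_proc_swaps(f.read()):
#           print(entry["path"], "USB" if is_usb_backed(entry["path"]) else "not USB")
```

    If a swap partition is reported as USB-backed, moving it to a SATA, SAS, or NVMe disk as suggested above would be the thing to try.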
     