BUG: soft lockup

Discussion in 'Proxmox VE: Installation and configuration' started by e100, Feb 15, 2019.

  1. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    Hello again everyone, it has been too long since my last post here.

    I have had one server randomly locking up for over a month now, and now a second server is also having this problem.
    Unfortunately I have not captured all of the kernel messages that would help diagnose this, but I have a couple of screenshots from the two servers.

    Both servers were installed onto ZFS using the Proxmox ISO.
    I have tried the following:
    * Installed the latest Proxmox updates
    * Updated the BIOS
    * Disabled ZFS disk swap, since that is known to cause a deadlock: https://github.com/zfsonlinux/zfs/issues/7734

    We did have zram enabled; I removed it after the most recent lockup and am waiting to see whether that helps.
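
    For reference, disabling the zvol swap amounted to roughly this (a sketch only; rpool/swap is the installer-default name and may differ, check /etc/fstab and zfs list for the actual dataset):
    Code:
    swapoff /dev/zvol/rpool/swap                  # stop swapping to the zvol
    sed -i '/zvol\/rpool\/swap/d' /etc/fstab      # keep it off across reboots
    zfs destroy rpool/swap                        # optional: reclaim the space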

    Any suggestions?

    pveversion -v
    Code:
    proxmox-ve: 5.3-1 (running kernel: 4.15.18-10-pve)
    pve-manager: 5.3-8 (running version: 5.3-8/2929af8e)
    pve-kernel-4.15: 5.3-1
    pve-kernel-4.13: 5.2-2
    pve-kernel-4.15.18-10-pve: 4.15.18-32
    pve-kernel-4.15.18-9-pve: 4.15.18-30
    pve-kernel-4.15.17-3-pve: 4.15.17-14
    pve-kernel-4.13.16-4-pve: 4.13.16-51
    pve-kernel-4.13.16-3-pve: 4.13.16-50
    pve-kernel-4.13.16-2-pve: 4.13.16-48
    pve-kernel-4.13.13-4-pve: 4.13.13-35
    pve-kernel-4.13.13-1-pve: 4.13.13-31
    pve-kernel-4.13.8-3-pve: 4.13.8-30
    pve-kernel-4.13.8-2-pve: 4.13.8-28
    pve-kernel-4.10.15-1-pve: 4.10.15-15
    pve-kernel-4.10.11-1-pve: 4.10.11-9
    corosync: 2.4.4-pve1
    criu: 2.11.1-1~bpo90
    glusterfs-client: 3.8.8-1
    ksm-control-daemon: 1.2-2
    libjs-extjs: 6.0.1-2
    libpve-access-control: 5.1-3
    libpve-apiclient-perl: 2.0-5
    libpve-common-perl: 5.0-43
    libpve-guest-common-perl: 2.0-19
    libpve-http-server-perl: 2.0-11
    libpve-storage-perl: 5.0-36
    libqb0: 1.0.3-1~bpo9
    lvm2: 2.02.168-pve6
    lxc-pve: 3.1.0-2
    lxcfs: 3.0.2-2
    novnc-pve: 1.0.0-2
    proxmox-widget-toolkit: 1.0-22
    pve-cluster: 5.0-33
    pve-container: 2.0-33
    pve-docs: 5.3-1
    pve-edk2-firmware: 1.20181023-1
    pve-firewall: 3.0-17
    pve-firmware: 2.0-6
    pve-ha-manager: 2.0-6
    pve-i18n: 1.0-9
    pve-libspice-server1: 0.14.1-1
    pve-qemu-kvm: 2.12.1-1
    pve-xtermjs: 3.10.1-1
    pve-zsync: 1.7-2
    qemu-server: 5.0-45
    smartmontools: 6.5+svn4324-1
    spiceterm: 3.0-5
    vncterm: 1.5-3
    zfsutils-linux: 0.7.12-pve1~bpo1
    
    Server Specs:
    Supermicro X10DRI-T, two E5-2620 v4, 128GB RAM (screenshot: vm1-lockup.png)
    Supermicro X9DRL-3F/i, two E5-2650 v1, 128GB RAM (screenshot: vm4.png)
     
  2. t.lamprecht

    t.lamprecht Proxmox Staff Member
    Staff Member

    Joined:
    Jul 28, 2015
    Messages:
    1,136
    Likes Received:
    147
    Hi!
    This looks like it could be a kernel regression. Could you try booting from an older kernel, i.e. one that was installed two or more months ago?
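
    For reference, an older kernel can be picked from the GRUB "Advanced options" sub-menu at boot, or pinned roughly like this (the entry names below are only an illustration, take the exact strings from your own grub.cfg):
    Code:
    # list installed kernels and the matching GRUB menu entries
    dpkg -l 'pve-kernel-*' | grep '^ii'
    grep -E "^\s*(submenu|menuentry) " /boot/grub/grub.cfg | cut -d"'" -f2

    # then set GRUB_DEFAULT in /etc/default/grub to "<submenu title>><entry title>",
    # e.g. (illustrative only, copy your real titles from the output above):
    # GRUB_DEFAULT="Advanced options for ...>..., with Linux 4.15.18-9-pve"
    update-grub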

    If you could capture more kernel logs it would be really great; with the available info it is a bit hard to tell which direction this issue is coming from...
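
    Since nothing reaches the disk once the box locks up, streaming the kernel log to another machine with netconsole tends to work better than local files. A rough sketch (IPs, port, NIC name and MAC are placeholders for your own values):
    Code:
    # on the affected node: raise the console log level and load netconsole
    dmesg -n 8
    modprobe netconsole netconsole=6665@192.168.1.21/eno1,6666@192.168.1.10/aa:bb:cc:dd:ee:ff
    #                              local port@local IP/NIC , remote port@remote IP/remote MAC

    # on the receiving machine (netcat syntax varies between flavours):
    nc -u -l 6666 | tee vm1-netconsole.log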
     
  3. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    If I am not mistaken, the zfs module was upgraded recently and I have already run zpool upgrade.
    I do not think it would be OK to boot a kernel with an older zfs module after that, right?
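
    To double-check which features the pool is actually using after the upgrade, I could run something like this (rpool being the default pool name from the installer):
    Code:
    # features marked "active" are the ones an older zfs module would have to
    # support in order to import the pool again
    zpool get all rpool | grep feature@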

    I went digging in the logs; the relevant excerpts are attached as text files.
    All of these occurred while we still had zfs swap and zram enabled, before we upgraded the BIOS.

    First boot after BIOS update:
    Feb 8 11:58:08 vm1 kernel: [ 0.000000] DMI: Supermicro X10DRi/X10DRI-T, BIOS 3.1 09/14/2018

    After upgrading the BIOS the issue still occurs, but when it happens nothing is written to the logs.
    Feb 10th: zfs swap and zram enabled
    Feb 10 02:10:32 vm1 qm[17374]: <root@pam> update VM 901: -lock backup
    Feb 10 02:11:14 vm1 vzdump[27762]: <root@pam> end task UPID:vm1:00006C73:00C5D66B:5C5FB22A:vzdump::root@pam: OK
    -- Server locked up about 5:14, errors only on console, nothing in logs
    Feb 10 05:28:58 vm1 kernel: [ 0.000000] Linux version 4.15.18-10-pve (build@pve) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.18-32 (Sat, 19 Jan 2019 10:09:37 +0100) ()

    Disabled zfs swap on Feb 11th, left zram enabled
    Feb 15th:
    Feb 15 02:10:49 vm1 qm[2320]: <root@pam> update VM 901: -lock backup
    Feb 15 02:11:22 vm1 vzdump[23487]: <root@pam> end task UPID:vm1:00005BC1:02762985:5C6649A9:vzdump::root@pam: OK
    -- Server locked up about 2:14, errors only on console, nothing in logs
    Feb 15 02:45:37 vm1 kernel: [ 0.000000] Linux version 4.15.18-10-pve (build@pve) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.18-32 (Sat, 19 Jan 2019 10:09:37 +0100) ()

    Disabled zram on Feb 15th so server is now running with no swap.

    While this problem has occurred at other times, it most often happens during, or within a few hours of, a vzdump backup.
     

    Attached Files:

  4. t.lamprecht

    t.lamprecht Proxmox Staff Member
    Staff Member

    Joined:
    Jul 28, 2015
    Messages:
    1,136
    Likes Received:
    147
    So it seems a bit like either zram alone is the issue in your case, or both are (separate?) issues, since swap on ZFS is currently really not considered stable (as per the linked issue). If possible, you may want to create a separate partition and use that directly as swap. Most of the kernel logs also point at the memory subsystem, at the point where it tries to get (swap in?) pages, so yes, that is surely the problem area.
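
    If you go that route, it is roughly the following (the device name is just a placeholder, use a partition you have actually set aside for this):
    Code:
    mkswap /dev/sdX9                 # format the spare partition as swap
    swapon /dev/sdX9                 # enable it right away
    blkid /dev/sdX9                  # note the UUID, then make it permanent:
    echo 'UUID=<uuid-from-blkid> none swap sw 0 0' >> /etc/fstab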

    How much memory is roughly available for the OS (total memory - guest memory)?
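
    A rough way to add that up on the node itself (standard PVE config paths; this counts configured maximum memory, not what the guests currently use):
    Code:
    grep -h '^memory:' /etc/pve/qemu-server/*.conf /etc/pve/lxc/*.conf 2>/dev/null \
        | awk '{sum += $2} END {print sum " MiB assigned to guests"}'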
     
  5. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    The server has 128GB RAM; the virtual servers combined are assigned just under 60GB.
    We have zfs_arc_max set to 20GB
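
    For completeness, the limit is set the usual way through the module options, roughly like this:
    Code:
    # /etc/modprobe.d/zfs.conf -- 20 GiB = 20 * 1024^3 bytes
    options zfs zfs_arc_max=21474836480

    # picked up at next boot; with root on ZFS also refresh the initramfs:
    update-initramfs -u
    # or apply immediately without a reboot:
    echo 21474836480 > /sys/module/zfs/parameters/zfs_arc_max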

    We have not had any issues since turning off zram on the 15th.
    It needs to run stable for at least a month to have confidence that turning off zram fixed anything.

    I am considering just running swapless moving forward.
     
  6. t.lamprecht

    t.lamprecht Proxmox Staff Member
    Staff Member

    Joined:
    Jul 28, 2015
    Messages:
    1,136
    Likes Received:
    147
    OK, then memory pressure is surely not an issue here.

    Sounds reasonable. Looking at the kernel git, zram seems a bit unloved there: either it is simply feature-complete and really stable (a good thing), or it does not see too much widespread use and thus a few bugs with certain setups still hide in it. Our Ubuntu Bionic 4.15-based kernel saw not a single zram backport (nor did the upstream stable kernel trees of 4.14 and 4.19), and the commits between 4.15 and 5.0-rc8 are not many either, but none of them stuck out to me as an obvious fix candidate for your issue...

    If you have a good amount of spare resources, as you do, running swapless sounds reasonable.
     
  7. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    We have not had any issues since turning off zRam over a month ago.
     
  8. t.lamprecht

    t.lamprecht Proxmox Staff Member
    Staff Member

    Joined:
    Jul 28, 2015
    Messages:
    1,136
    Likes Received:
    147
  9. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    While running swapoff on a couple of nodes, the swapoff task would hang and was unable to turn swap off on the zram devices. It would sit there indefinitely, generating hung-task messages in the kernel log.

    I believe these systems are still running.
    Could we get any diagnostic data from these systems that might help discover the source of this problem?
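
    For example, something along these lines should still be possible on the stuck nodes (assuming sysrq is permitted there and swapoff is the only hung process of that name):
    Code:
    # dump the stacks of all blocked (D-state) tasks into the kernel log
    echo 1 > /proc/sys/kernel/sysrq
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 300

    # kernel stack of the stuck swapoff itself
    cat /proc/$(pgrep -x swapoff)/stack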
     