Proxmox 5.3 ZFS Kernel Panic on Guests in SCSI driver

Discussion in 'Proxmox VE: Installation and configuration' started by Derek Rasmussen, Feb 21, 2019.

  1. Derek Rasmussen

    Derek Rasmussen New Member
    Proxmox Subscriber

    Joined:
    Jan 29, 2018
    Messages:
    8
    Likes Received:
    0
    Hello,

    We have been running ZFS in production for a while on various virtualization hosts and have been very happy with it.

    However, we have recently been getting kernel panics like the one in the screenshot below. It seems to be tied to an issue between Proxmox and ZFS when using a SCSI disk. We recently switched certain hosts to SCSI disks in order to get discard support and minimize disk usage. For what it is worth, it only seems to happen when discard is on: hosts that use SCSI disks on ZFS without discard don't kernel panic. We also see this on Debian Jessie installs with discard enabled. If we should disable discard on the disks, let me know; likewise if someone knows a fix or the reason.

    [Screenshot attachment: upload_2019-2-21_10-1-17.png — guest kernel panic trace]
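    If you want to correlate the panics with discard activity: discard requests from the guest only reach the zvol when discard=on is set on the disk, and a burst of them can be forced from inside the guest with fstrim. A rough sketch (VMID and disk are the ones from the config below; adjust as needed):
    Code:
    # on the host: re-attach the disk with discard enabled (omit discard=on to fall back to the default, off)
    qm set 115 --scsi0 local-zfs:vm-115-disk-0,discard=on
    # inside the guest: issue discards for all mounted filesystems that support them
    fstrim -av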

    Here are the node settings in proxmox.

    agent: 1
    boot: cdn
    bootdisk: scsi0
    cores: 2
    cpu: host
    ide2: none,media=cdrom
    memory: 6144
    net0: virtio=3A:2A:43:16:AD:A5,bridge=vmbr1,tag=401
    numa: 0
    ostype: l26
    scsi0: local-zfs:vm-115-disk-0,discard=on,size=32G
    smbios1: uuid=72ef68a8-0911-44f0-98db-3e45d754d918
    sockets: 2
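    Note that there is no scsihw line, so the guest gets the default emulated SCSI controller. If it helps, the controller QEMU is actually started with can be checked like this (a sketch, using the VMID above):
    Code:
    # print the generated KVM command line and pull out the SCSI-related parts
    qm showcmd 115 | tr ' ' '\n' | grep -i scsi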

    Here is some information on the guest.

    Debian Jessie 8.11
    Linux 3.16.0-7-amd64 #1 SMP Debian 3.16.59-1 (2018-10-03) x86_64 GNU/Linux

    Here is the host information.
    pve-manager/5.3-9/ba817b29 (running kernel: 4.15.18-10-pve)
    32× Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
    4× SEAGATE ST600MM0026 @ 10.5k RPM
    128 GB DDR3 1600 MHz

    zpool status
    pool: rpool
    state: ONLINE
    scan: scrub repaired 0B in 0h0m with 0 errors on Sun Feb 10 00:24:18 2019
    config:
    NAME          STATE     READ WRITE CKSUM
    rpool         ONLINE       0     0     0
      raidz2-0    ONLINE       0     0     0
        sda3      ONLINE       0     0     0
        sdb3      ONLINE       0     0     0
        sdc3      ONLINE       0     0     0
        sdd3      ONLINE       0     0     0


    zfs get all rpool
    NAME PROPERTY VALUE SOURCE
    rpool type filesystem -
    rpool creation Sat Feb 9 20:25 2019 -
    rpool used 286G -
    rpool available 757G -
    rpool referenced 151K -
    rpool compressratio 1.41x -
    rpool mounted yes -
    rpool quota none default
    rpool reservation none default
    rpool recordsize 128K default
    rpool mountpoint /rpool default
    rpool sharenfs off default
    rpool checksum on default
    rpool compression on local
    rpool atime off local
    rpool devices on default
    rpool exec on default
    rpool setuid on default
    rpool readonly off default
    rpool zoned off default
    rpool snapdir hidden default
    rpool aclinherit restricted default
    rpool createtxg 1 -
    rpool canmount on default
    rpool xattr on default
    rpool copies 1 default
    rpool version 5 -
    rpool utf8only off -
    rpool normalization none -
    rpool casesensitivity sensitive -
    rpool vscan off default
    rpool nbmand off default
    rpool sharesmb off default
    rpool refquota none default
    rpool refreservation none default
    rpool guid 15275965653382357426 -
    rpool primarycache all default
    rpool secondarycache all default
    rpool usedbysnapshots 0B -
    rpool usedbydataset 151K -
    rpool usedbychildren 286G -
    rpool usedbyrefreservation 0B -
    rpool logbias latency default
    rpool dedup off default
    rpool mlslabel none default
    rpool sync standard local
    rpool dnodesize legacy default
    rpool refcompressratio 1.00x -
    rpool written 151K -
    rpool logicalused 282G -
    rpool logicalreferenced 44K -
    rpool volmode default default
    rpool filesystem_limit none default
    rpool snapshot_limit none default
    rpool filesystem_count none default
    rpool snapshot_count none default
    rpool snapdev hidden default
    rpool acltype off default
    rpool context none default
    rpool fscontext none default
    rpool defcontext none default
    rpool rootcontext none default
    rpool relatime off default
    rpool redundant_metadata all default
    rpool overlay off default

    Guest disk pool information
    zfs get all rpool/data/vm-115-disk-0
    NAME PROPERTY VALUE SOURCE
    rpool/data/vm-115-disk-0 type volume -
    rpool/data/vm-115-disk-0 creation Mon Feb 18 14:14 2019 -
    rpool/data/vm-115-disk-0 used 28.9G -
    rpool/data/vm-115-disk-0 available 757G -
    rpool/data/vm-115-disk-0 referenced 28.9G -
    rpool/data/vm-115-disk-0 compressratio 1.58x -
    rpool/data/vm-115-disk-0 reservation none default
    rpool/data/vm-115-disk-0 volsize 32G local
    rpool/data/vm-115-disk-0 volblocksize 8K default
    rpool/data/vm-115-disk-0 checksum on default
    rpool/data/vm-115-disk-0 compression on inherited from rpool
    rpool/data/vm-115-disk-0 readonly off default
    rpool/data/vm-115-disk-0 createtxg 148457 -
    rpool/data/vm-115-disk-0 copies 1 default
    rpool/data/vm-115-disk-0 refreservation none default
    rpool/data/vm-115-disk-0 guid 17035248028212799176 -
    rpool/data/vm-115-disk-0 primarycache all default
    rpool/data/vm-115-disk-0 secondarycache all default
    rpool/data/vm-115-disk-0 usedbysnapshots 0B -
    rpool/data/vm-115-disk-0 usedbydataset 28.9G -
    rpool/data/vm-115-disk-0 usedbychildren 0B -
    rpool/data/vm-115-disk-0 usedbyrefreservation 0B -
    rpool/data/vm-115-disk-0 logbias latency default
    rpool/data/vm-115-disk-0 dedup off default
    rpool/data/vm-115-disk-0 mlslabel none default
    rpool/data/vm-115-disk-0 sync standard inherited from rpool
    rpool/data/vm-115-disk-0 refcompressratio 1.58x -
    rpool/data/vm-115-disk-0 written 28.9G -
    rpool/data/vm-115-disk-0 logicalused 31.3G -
    rpool/data/vm-115-disk-0 logicalreferenced 31.3G -
    rpool/data/vm-115-disk-0 volmode default default
    rpool/data/vm-115-disk-0 snapshot_limit none default
    rpool/data/vm-115-disk-0 snapshot_count none default
    rpool/data/vm-115-disk-0 snapdev hidden default
    rpool/data/vm-115-disk-0 context none default
    rpool/data/vm-115-disk-0 fscontext none default
    rpool/data/vm-115-disk-0 defcontext none default
    rpool/data/vm-115-disk-0 rootcontext none default
    rpool/data/vm-115-disk-0 redundant_metadata all default
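    For what it's worth, the zvol properties most relevant to the discard question are probably volblocksize, refreservation and compression; they can be queried directly instead of dumping everything:
    Code:
    zfs get volsize,volblocksize,refreservation,compression,logicalused rpool/data/vm-115-disk-0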
     
  2. Derek Rasmussen

    Derek Rasmussen New Member
    Proxmox Subscriber

    Joined:
    Jan 29, 2018
    Messages:
    8
    Likes Received:
    0
    Also, for what it is worth, I haven't tried discard on non-ZFS servers, so ZFS may not be the issue.
     
  3. wolfgang

    wolfgang Proxmox Staff Member
    Staff Member

    Joined:
    Oct 1, 2014
    Messages:
    4,598
    Likes Received:
    306
    Hi,

    Have you installed the intel-microcode package?
    If not, please do so and check whether this happens again.
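    A rough sketch of installing and verifying it on the host (this assumes the Debian non-free component is enabled in your APT sources):
    Code:
    apt update
    apt install intel-microcode
    # reboot the host, then confirm the new microcode was applied at boot
    dmesg | grep -i microcode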
     
  4. NdK73

    NdK73 Member

    Joined:
    Jul 19, 2012
    Messages:
    69
    Likes Received:
    3
    Quite old thread, but I traced the panic to the
    Code:
    agent: 1
    Keeping everything else the same, disabling guest agent support avoids the panic. On the other hand, if I replace SCSI with VirtIO and keep agent support enabled, it also works.
    It makes no difference whether the agent is actually installed in the guest or not.
    Hope this helps troubleshoot the problem.

    PS: I have images on LVM with a shared DAS. No ZFS involved.
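    If anyone wants to test the same thing from the CLI, the agent flag can be toggled like this (a sketch; 115 is just an example VMID, and the change only takes effect after a full stop/start of the VM):
    Code:
    qm set 115 --agent 0   # disable the guest agent option
    qm set 115 --agent 1   # re-enable it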
     
  5. wolfgang

    wolfgang Proxmox Staff Member
    Staff Member

    Joined:
    Oct 1, 2014
    Messages:
    4,598
    Likes Received:
    306
    Hi NdK73,

    Is your host on the current version?
    Do you have the fixed microcode installed?
     
  6. NdK73

    NdK73 Member

    Joined:
    Jul 19, 2012
    Messages:
    69
    Likes Received:
    3
    The hosts are updated to the latest no-subscription release.
    Should the intel-microcode package be installed on the host or in the guest?
     
  7. wolfgang

    wolfgang Proxmox Staff Member
    Staff Member

    Joined:
    Oct 1, 2014
    Messages:
    4,598
    Likes Received:
    306
    On the Host.
     
  8. NdK73

    NdK73 Member

    Joined:
    Jul 19, 2012
    Messages:
    69
    Likes Received:
    3
    I installed it, and something seems to have changed. Now, instead of a kernel panic, it just says that a process (usually exim4) "blocked for more than 120 seconds" and performance dropped. But that could be a problem with that particular VM. I am going to test with other VMs.
     
  9. NdK73

    NdK73 Member

    Joined:
    Jul 19, 2012
    Messages:
    69
    Likes Received:
    3
    I can definitely confirm that:
    1) the "blocked for more than 120 seconds" messages are an issue of the VM used for testing;
    2) installing the intel-microcode package does not solve the kernel panic for another VM.
     
  10. wolfgang

    wolfgang Proxmox Staff Member
    Staff Member

    Joined:
    Oct 1, 2014
    Messages:
    4,598
    Likes Received:
    306
    After the installation did you reboot the host system?
    A reboot is necessary.
     
  11. NdK73

    NdK73 Member

    Joined:
    Jul 19, 2012
    Messages:
    69
    Likes Received:
    3
    Yes. I was fairly sure already (I had to reboot for an unrelated network issue), but to be 100% sure I rebooted the host again and re-tested.
    Installing the intel-microcode package definitely does not solve the issue. It seems the guest agent and the emulated LSI 53C895A controller are incompatible.
    It also seems unrelated to the workload: I had ClamAV (in the guest) do a full scan of the disks, which saturated the CPU and disk reads, but the panic happened later, after the scan had finished.
     
  12. Derek Rasmussen

    Derek Rasmussen New Member
    Proxmox Subscriber

    Joined:
    Jan 29, 2018
    Messages:
    8
    Likes Received:
    0
    Hello everyone,

    Yes, the intel-microcode package is up to date, and with the QEMU guest agent option removed the issue does not seem to occur.

    We have to generate heavy I/O load on the host for the panic to occur. It usually happens when moving VMs or restoring VMs from backup.

    This occurs on all our systems, from ZFS with all SSDs to LVM with spinning disks. We generally use LSI 3008 controllers in IT mode with ZFS; our older hosts, however, use LVM with LSI MR9271-8i cards.

    To reproduce it, you should just have to enable the QEMU guest agent on a guest with SCSI disks, generate disk activity in that guest, and restore one or two backups to the host at the same time.
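    Roughly, something like this (the fio parameters, backup file name and target VMID are just placeholders for whatever you have at hand):
    Code:
    # inside the guest (SCSI disk, discard=on, agent enabled): keep the disk busy for a while
    fio --name=repro --filename=/var/tmp/repro.dat --size=2G --rw=randwrite --bs=4k --runtime=600 --time_based

    # meanwhile on the host: restore one or two backups to generate heavy storage load
    qmrestore /var/lib/vz/dump/vzdump-qemu-999-2019_02_20-00_00_00.vma.lzo 999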

    Thanks,
    Derek
     
  13. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    1,131
    Likes Received:
    92
    Does the issue also occur if you change the guest's SCSI controller to virtio-scsi?
    This can be done in the GUI -> VM -> Hardware -> SCSI Controller.
    In the config file this results in:
    Code:
    scsihw: virtio-scsi-pci
    
    This is just to rule out that the emulation of the SCSI controller (the default is an LSI controller, as can be seen in the first screenshot in this thread) is simply not well tested with discard etc.
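    The same change can also be made from the CLI, for example for the VM from the first post (the VM needs a full stop/start afterwards):
    Code:
    qm set 115 --scsihw virtio-scsi-pci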
     