Proxmox 5.3 ZFS Kernel Panic on Guests in SCSI driver

Hello,

We have been running ZFS in production for a while on various virtual hosts and have been super happy with it.

However, we have recently been getting kernel panics such as the one in the screenshot below. It seems to be tied to an issue between Proxmox and ZFS when using a SCSI disk. We switched certain hosts to SCSI disks recently in order to get discard support and minimize disk usage. For what it is worth, it only seems to happen when discard is on: hosts that use SCSI disks on ZFS without discard don't kernel panic. We also get this on Debian Jessie installs with discard on. If we should disable discard on the disks, let me know, or if someone knows a fix or the reason, please let me know as well.

[Screenshot of the guest kernel panic: upload_2019-2-21_10-1-17.png]
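
For reference, disabling discard again on an affected disk would just be a matter of re-setting the drive without the flag, roughly like this from the host CLI (a sketch based on the VM config below; it only takes effect after a full stop/start of the guest):

Code:
# re-set scsi0 without the discard flag (the default is to ignore discards)
qm set 115 --scsi0 local-zfs:vm-115-disk-0,size=32G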

Here is the VM configuration in Proxmox.

agent: 1
boot: cdn
bootdisk: scsi0
cores: 2
cpu: host
ide2: none,media=cdrom
memory: 6144
net0: virtio=3A:2A:43:16:AD:A5,bridge=vmbr1,tag=401
numa: 0
ostype: l26
scsi0: local-zfs:vm-115-disk-0,discard=on,size=32G
smbios1: uuid=72ef68a8-0911-44f0-98db-3e45d754d918
sockets: 2
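
For what it's worth, this is roughly how we check whether discard is actually reclaiming space on the backing zvol (a sketch; the dataset name matches the disk above):

Code:
# inside the guest: trim the root filesystem
fstrim -v /

# on the host: compare the zvol's space usage before and after
zfs get used,logicalused,referenced rpool/data/vm-115-disk-0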

Here is some information on the guest.

Debian Jessie 8.11
Linux 3.16.0-7-amd64 #1 SMP Debian 3.16.59-1 (2018-10-03) x86_64 GNU/Linux

Here is the host information.
pve-manager/5.3-9/ba817b29 (running kernel: 4.15.18-10-pve)
32x Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
4x SEAGATE ST600MM0026 @ 10.5k RPM in RAIDZ2
128 GB DDR3 1600 MHz

zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0h0m with 0 errors on Sun Feb 10 00:24:18 2019
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            sda3      ONLINE       0     0     0
            sdb3      ONLINE       0     0     0
            sdc3      ONLINE       0     0     0
            sdd3      ONLINE       0     0     0


zfs get all rpool
NAME PROPERTY VALUE SOURCE
rpool type filesystem -
rpool creation Sat Feb 9 20:25 2019 -
rpool used 286G -
rpool available 757G -
rpool referenced 151K -
rpool compressratio 1.41x -
rpool mounted yes -
rpool quota none default
rpool reservation none default
rpool recordsize 128K default
rpool mountpoint /rpool default
rpool sharenfs off default
rpool checksum on default
rpool compression on local
rpool atime off local
rpool devices on default
rpool exec on default
rpool setuid on default
rpool readonly off default
rpool zoned off default
rpool snapdir hidden default
rpool aclinherit restricted default
rpool createtxg 1 -
rpool canmount on default
rpool xattr on default
rpool copies 1 default
rpool version 5 -
rpool utf8only off -
rpool normalization none -
rpool casesensitivity sensitive -
rpool vscan off default
rpool nbmand off default
rpool sharesmb off default
rpool refquota none default
rpool refreservation none default
rpool guid 15275965653382357426 -
rpool primarycache all default
rpool secondarycache all default
rpool usedbysnapshots 0B -
rpool usedbydataset 151K -
rpool usedbychildren 286G -
rpool usedbyrefreservation 0B -
rpool logbias latency default
rpool dedup off default
rpool mlslabel none default
rpool sync standard local
rpool dnodesize legacy default
rpool refcompressratio 1.00x -
rpool written 151K -
rpool logicalused 282G -
rpool logicalreferenced 44K -
rpool volmode default default
rpool filesystem_limit none default
rpool snapshot_limit none default
rpool filesystem_count none default
rpool snapshot_count none default
rpool snapdev hidden default
rpool acltype off default
rpool context none default
rpool fscontext none default
rpool defcontext none default
rpool rootcontext none default
rpool relatime off default
rpool redundant_metadata all default
rpool overlay off default

Guest disk zvol information
zfs get all rpool/data/vm-115-disk-0
NAME PROPERTY VALUE SOURCE
rpool/data/vm-115-disk-0 type volume -
rpool/data/vm-115-disk-0 creation Mon Feb 18 14:14 2019 -
rpool/data/vm-115-disk-0 used 28.9G -
rpool/data/vm-115-disk-0 available 757G -
rpool/data/vm-115-disk-0 referenced 28.9G -
rpool/data/vm-115-disk-0 compressratio 1.58x -
rpool/data/vm-115-disk-0 reservation none default
rpool/data/vm-115-disk-0 volsize 32G local
rpool/data/vm-115-disk-0 volblocksize 8K default
rpool/data/vm-115-disk-0 checksum on default
rpool/data/vm-115-disk-0 compression on inherited from rpool
rpool/data/vm-115-disk-0 readonly off default
rpool/data/vm-115-disk-0 createtxg 148457 -
rpool/data/vm-115-disk-0 copies 1 default
rpool/data/vm-115-disk-0 refreservation none default
rpool/data/vm-115-disk-0 guid 17035248028212799176 -
rpool/data/vm-115-disk-0 primarycache all default
rpool/data/vm-115-disk-0 secondarycache all default
rpool/data/vm-115-disk-0 usedbysnapshots 0B -
rpool/data/vm-115-disk-0 usedbydataset 28.9G -
rpool/data/vm-115-disk-0 usedbychildren 0B -
rpool/data/vm-115-disk-0 usedbyrefreservation 0B -
rpool/data/vm-115-disk-0 logbias latency default
rpool/data/vm-115-disk-0 dedup off default
rpool/data/vm-115-disk-0 mlslabel none default
rpool/data/vm-115-disk-0 sync standard inherited from rpool
rpool/data/vm-115-disk-0 refcompressratio 1.58x -
rpool/data/vm-115-disk-0 written 28.9G -
rpool/data/vm-115-disk-0 logicalused 31.3G -
rpool/data/vm-115-disk-0 logicalreferenced 31.3G -
rpool/data/vm-115-disk-0 volmode default default
rpool/data/vm-115-disk-0 snapshot_limit none default
rpool/data/vm-115-disk-0 snapshot_count none default
rpool/data/vm-115-disk-0 snapdev hidden default
rpool/data/vm-115-disk-0 context none default
rpool/data/vm-115-disk-0 fscontext none default
rpool/data/vm-115-disk-0 defcontext none default
rpool/data/vm-115-disk-0 rootcontext none default
rpool/data/vm-115-disk-0 redundant_metadata all default
 
Hi,

have you installed the intel-microcode package?
If not, please do and check whether it happens again.
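
On the PVE host that would be something like this (a sketch; the package comes from Debian's non-free repository, so that needs to be enabled):

Code:
apt-get update
apt-get install intel-microcode
# reboot the host, then check which microcode revision was loaded
dmesg | grep -i microcode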
 
Quite an old thread, but I traced the panic to this setting:
Code:
agent: 1
Keeping everything else the same, disabling guest agent support avoids the panic. On the other hand, if I replace SCSI with VirtIO while keeping agent support enabled, it works.
It makes no difference whether the agent is actually installed in the guest or not.
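
To summarize the combinations I tested in config terms (a rough sketch; the disk line itself is a placeholder, only the agent setting and the bus type matter):

Code:
agent: 1 + scsi0: <storage>:<disk>    -> panics
agent: 0 + scsi0: <storage>:<disk>    -> works
agent: 1 + virtio0: <storage>:<disk>  -> works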
Hope it helps troubleshoot the problem.

PS: I have images on LVM with a shared DAS. No ZFS involved.
 
Hi NdK73,

Is your host on the current version?
Have you got the fixed microcode installed?
 
I installed it, and it seems something has changed. Now, instead of a kernel panic, it just says that a process (usually exim4) "blocked for more than 120 seconds", and performance dropped. But that could be a problem with that VM. Going to test with other VMs.
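
In case someone wants to look for the same symptom, the messages end up in the kernel log, so something like this finds them (a minimal sketch, run inside the guest):

Code:
dmesg -T | grep -i "blocked for more than"
# or via the journal on systemd-based guests
journalctl -k | grep -i "blocked for more than"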
 
I can definitely confirm that:
1) "blocked for 120 seconds" is an issue of the VM used for testing
2) installing intel-microcode package does not solve the kernel panic for another VM
 
After the installation, did you reboot the host system?
A reboot is necessary.
 
Yes. I was quite sure (I had to reboot for an unrelated network issue anyway), but to be 100% sure I rebooted the host again and re-tested.
Installing the intel-microcode package definitely does not solve the issue. It seems the guest agent and the emulated LSI 53C895A controller are incompatible.
It also seems unrelated to the workload: I had ClamAV (in the guest) do a full scan of the disks, which saturated the CPU and disk reads, but the panic happened later, after the scan had finished.
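
For the record, the scan was just a recursive ClamAV run over the guest's filesystem, roughly this (exact options from memory, so treat it as a sketch):

Code:
clamscan -r /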
 
Hello everyone,

Yes, the intel-microcode package is up to date, and with the qemu-guest-agent removed the issue does not seem to occur.

We have to generate heavy IO load on the host for the panic to occur. It usually happens when moving VMs or restoring VMs from backup.

This occurs on all our systems, from ZFS with all SSDs to LVM with spinning disks. We generally use LSI 3008 controllers in IT mode with ZFS; however, our older hosts use LVM with LSI MR9271-8i cards.

To reproduce, you should just have to enable the QEMU guest agent on a VM with SCSI disks, then have disk activity going on in the guest while restoring one or two backups to the host at the same time.
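
Roughly, the reproduction looks like this on our side (a sketch; the backup archive names and the target VM IDs are placeholders, use whatever backups you have at hand):

Code:
# on the host: restore one or two backups in parallel to generate IO load
qmrestore /var/lib/vz/dump/<backup-1>.vma.lzo <new-vmid-1> --storage local-zfs &
qmrestore /var/lib/vz/dump/<backup-2>.vma.lzo <new-vmid-2> --storage local-zfs &

# inside a running guest with agent: 1 and a scsi disk: generate disk activity
dd if=/dev/zero of=/tmp/ioload bs=1M count=8192 oflag=direct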

Thanks,
Derek
 
Does the issue also occur if you change the guest's SCSI controller to VirtIO SCSI?
This can be done in the GUI -> VM -> Hardware -> SCSI Controller.
In the config file this results in:
Code:
scsihw: virtio-scsi-pci

This is just to rule out the emulated SCSI controller (the default is an LSI controller, as can be seen in the first screenshot in this thread), since that emulation may not be too well tested with discards etc.
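
If it is easier to test from the CLI, something along these lines should switch the controller type (a sketch using the VM ID from the first post; the change only takes effect after the VM has been fully stopped and started again):

Code:
qm set 115 --scsihw virtio-scsi-pci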
 
