After Update - Kernelpanic an filesystem damage on all VM's with UEFI

fireon · Nov 1, 2019

Hello all,

since the latest update after some hour's, day's or minutes, VM's with ovmf are running in to a kernel panic. Really strange. Turning the repairprocess with an clonezilla iso, the liveiso get also an kernelpanic... what's going on? So after reboot into old kernel, it looks great again...

Code:

pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.21-2-pve)
pve-manager: 6.0-9 (running version: 6.0-9/508dcee0)
pve-kernel-5.0: 6.0-9
pve-kernel-helper: 6.0-9
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-5.0.21-2-pve: 5.0.21-7
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-5
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-65
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-8
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-7
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.1-3
pve-xtermjs: 3.13.2-1
pve-zsync: 2.0-1
qemu-server: 6.0-9
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve1

Here are one VM-config:

Code:

agent: 1
bios: ovmf
bootdisk: scsi0
cores: 8
cpu: kvm64,flags=+pcid;+spec-ctrl
description: UCS Active Directory Slave DC%0A%0A- BlueSpice MediaWiki%0A- Wekan Kanbanboard
efidisk0: SSD-vmdata:vm-110-disk-0,size=1M
ide2: none,media=cdrom
memory: 5120
name: app.supertux.lan
net0: virtio=FA:8C:56:C1:BE:3C,bridge=vmbr0,firewall=1
numa: 1
onboot: 1
ostype: l26
scsi0: SSD-vmdata:vm-110-disk-1,discard=on,size=40G,ssd=1
scsi1: SSD-vmdata:vm-110-disk-2,discard=on,size=8G,ssd=1
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=1315bc07-3f6d-4e4c-a5eb-c6c46429e28c
sockets: 1
vga: qxl
vmgenid: 570312bc-e910-440a-adc2-33cd2cd9add2

The only way to solve this filesystemerrors was to recover the whole backup.
I have remotelogging activated. So here the kernellog from the crash. (attached)

Maybe bad kernel 5.0.21-3-pve

Thanks a lot.

wolfgang · Nov 5, 2019

Hi,
Please reboot your system and run on 5.0.21-3.
There are mad kernels in the Kernel 5.0.21-2 series.
Or more precise a bug in the ZFS module.

t.lamprecht · Nov 5, 2019

Hmm, could not reproduce here, on an Intel host with a Ubuntu 19.10 OVMF installation (I tested the kernel with some OVMF VMs, but that one was still available to just run).

What I see in your log is that the error happens in the page fault code-path, and that the kernel cannot immediately allocate memory so it needs to free stuff first, possible high memory usage?

Can you tell me a bit more about the host HW?
I'll try to boot a plain Debian VM, to match your setup more closely.

t.lamprecht · Nov 5, 2019

If you use ZFS the issue could also be the following:
The data breakage could have happened when using the ZFS-problematic kernels with ABI 5.0.21-1 and 5.0.21-2, but only the reboot to a newer kernel, and thus reboot of the VM (?) made the issue show up, so while one may suspect the new kernel, it could have been the previous one and just a side-effect of kernel-update related reboot. An educate guess, only valid in the case the host really uses ZFS.

fireon · Nov 6, 2019

t.lamprecht said:
If you use ZFS the issue could also be the following:
The data breakage could have happened when using the ZFS-problematic kernels with ABI 5.0.21-1 and 5.0.21-2, but only the reboot to a newer kernel, and thus reboot of the VM (?) made the issue show up, so while one may suspect the new kernel, it could have been the previous one and just a side-effect of kernel-update related reboot. An educate guess, only valid in the case the host really uses ZFS.

Very strange...

Hello all, and thanks for the reply. On the hostmachine i had high memoryusage @the first crash, yes. I have no swap but zram. Normaly this should not be a problem (have this a long time), because if mem needed the hostkernel releases that. But @the second crash after reboot with the kernel 5.0.21-3 one VM (was repaired before) crashed again after some minutes another VM, and at this time there was no hight memory usage.

I Reboot the system in actual kernel.

t.lamprecht · Nov 6, 2019

FYI, probably relate issue: https://forum.proxmox.com/threads/pve-kernel-5-0-21-4-pve-cause-debian-guests-to-reboot-loop.59377/

What's the host Hardware? Vendor, CPU, ..? I try and try but cannot bring it to fail, for now we have one user which cannot reroduce it on an AMD but on an Intel, all others had intel too - so this may be relevant.

fireon · Nov 8, 2019

Supermicro Board and Intel(R) Xeon(R) CPU E3-1265L v3 @ 2.50GHz. (VM's death) ZFS and LVM-Thin
HP ML310 Intel(R) Xeon(R) CPU E31220 @ 3.10GHz with QCOW2 (VM's death)
Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz (VM's not death) ZFS
Intel(R) Core(TM) i3-4170 CPU @ 3.70GHz (VM's not death) ZFS

LXC's and local zfs do not seem to be affected.

fireon · Nov 11, 2019

Kernel 5.3 look like the solution. It is working the last two day's stable on all affected servers.

Search

Search

After Update - Kernelpanic an filesystem damage on all VM's with UEFI

fireon

Distinguished Member

Attachments

wolfgang

Proxmox Retired Staff

t.lamprecht

Proxmox Staff Member

t.lamprecht

Proxmox Staff Member

fireon

Distinguished Member

t.lamprecht

Proxmox Staff Member

fireon

Distinguished Member

fireon

Distinguished Member