Kernel segfault on host while using spice display in Linux VM

Hello,

I'm having a serious issue with a couple of Linux VMs (Ubuntu 20.04 Desktop, Linux Mint 20.1). Both use a SPICE display:

Code:
agent: 1,fstrim_cloned_disks=1
audio0: device=ich9-intel-hda,driver=spice
boot: order=scsi0;ide2
cores: 4
cpu: host,flags=+md-clear;+pcid;+spec-ctrl;+ssbd;+pdpe1gb;+aes
hotplug: disk,network,usb
ide2: none,media=cdrom
machine: q35
memory: 20480
name: desk-lnx01
net0: virtio=32:D7:0D:59:AE:5B,bridge=vmbr0,tag=40
numa: 1
ostype: l26
scsi0: ceph:vm-2514-disk-0,discard=on,iothread=1,size=150G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=62de4b34-3e27-4789-b508-85507b14f030
sockets: 2
vga: qxl2,memory=64
vmgenid: f1e477d9-8ecb-4f53-bfd5-8c5ffc88f2b9

The VM randomly dies while the user is connected to the SPICE display via virt-viewer.
On the host there are messages like this:

Code:
Jul  1 10:21:40 pve03 kernel: [78486.212509] SPICE Worker[107911]: segfault at 100000027 ip 00007f85326b232a sp 00007f7ff33fa0a0 error 4 in libc-2.28.so[7f8532653000+147000]
Jul  1 10:21:40 pve03 kernel: [78486.212516] Code: 0f 85 9a 02 00 00 48 39 5a 10 0f 85 90 02 00 00 48 89 50 18 48 89 42 10 48 81 f9 ff 03 00 00 76 3f 48 8b 53 20 48 85 d2 74 36 <48> 39 5a 28 0f 85 2b 07 00 00 48 8b 4b 28 48 39 59 20 0f 85 1d 07
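For what it's worth, the kernel line above already encodes where inside libc the crash happened: the instruction pointer minus the mapping base printed in brackets gives the file offset. A minimal sketch of the arithmetic (the addr2line path is the usual Debian location and an assumption; the libc6-dbg package would be needed to resolve it to a symbol):

```shell
# Offset of the faulting instruction inside libc-2.28.so, taken from the
# "segfault at ... ip ... in libc-2.28.so[base+size]" line above.
ip=0x7f85326b232a
base=0x7f8532653000
offset=$(printf '0x%x\n' $((ip - base)))
echo "$offset"
# With libc debug symbols installed (libc6-dbg), addr2line can map the
# offset to a function and source line:
#   addr2line -f -e /lib/x86_64-linux-gnu/libc-2.28.so "$offset"
```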

In the VM there are messages like:
Code:
Jul  1 12:29:33 desk-amorozov /usr/lib/gdm3/gdm-x-session[1451]: (EE) qxl(0): EXECBUFFER failed
Jul  1 12:29:33 desk-amorozov kernel: [ 5897.437745] [drm:qxl_execbuffer_ioctl [qxl]] *ERROR* got unwritten 183

I'm not sure whether they are related, since they show up now and then, not only when the VM dies.

The host is running Proxmox 6.4 with the latest updates (I can't upgrade to 7.x at the moment):

Code:
proxmox-ve: 6.4-1 (running kernel: 5.4.189-2-pve)
pve-manager: 6.4-15 (running version: 6.4-15/af7986e6)
pve-kernel-5.4: 6.4-18
pve-kernel-helper: 6.4-18
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.189-2-pve: 5.4.189-2
pve-kernel-5.4.178-1-pve: 5.4.178-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.22-pve1
ceph-fuse: 14.2.22-pve1
corosync: 3.1.5-pve2~bpo10+1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve2~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-4
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.14-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-2
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.7-pve1

It happens on any of the 3 nodes of the cluster (all are exactly the same hardware and firmware revisions). I have tried different combinations of system memory, display memory and single/dual monitor settings for SPICE, with the exact same results. I use SPICE for another two VMs in this same cluster and they do work correctly (Windows 10 and Ubuntu 18.04).

Has anyone had this issue? Anything that I may check in the host or the VMs to debug this problem?

Thanks in advance.
 
Time for an update...

TL;DR: the same VM works fine on a different system (an old AMD Opteron) but does not work on any of the 3 Intel servers... yet! (I hope so!)

Long story:

On the Intel servers, still on PVE 6.4, I tried kernel 5.11 and the same segfaults happened. Then I installed an old Opteron server with PVE 7.2 + kernel 5.15, moved the VM there, and it ran flawlessly, so I thought the problem might be related to the kernel or even the QEMU version. Last week I could finally upgrade that cluster to PVE 7.3 + kernel 5.15 and yesterday migrated the VM back to the cluster. The segfaults are back :(

I've also tried with kernel 5.19, same problem.

There is something about these systems (Supermicro SYS-6029U) that triggers this problem with this very VM (other VMs that also use SPICE have always worked flawlessly on them). I'm currently planning BIOS upgrades and will try disabling C-states.
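If BIOS access is a hurdle, C-states can also be capped from the kernel side. A sketch, assuming the stock Debian/Proxmox GRUB setup (the exact C-state depth worth testing is a guess):

```shell
# /etc/default/grub on the PVE host: limit C-states via standard kernel
# parameters, then run 'update-grub' and reboot the node.
GRUB_CMDLINE_LINUX="intel_idle.max_cstate=1 processor.max_cstate=1"
```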

Any ideas are welcome :)
 
Small update: at some point a kernel 5.19 update changed the error message. Now, instead of a segfault, I get one of these two errors:

Code:
QEMU[86165]: corrupted double-linked list (not small)
QEMU[45506]: malloc(): largebin double linked list corrupted (nextsize)
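Those two messages are classic glibc heap-corruption aborts, which usually fire long after the actual bad write. If you can reproduce the crash with a manually started QEMU (e.g. the command line that `qm showcmd <vmid>` prints), glibc's malloc checking makes it abort at the first inconsistent heap operation instead, putting the backtrace closer to the culprit. A sketch, assuming a manual launch outside pvedaemon:

```shell
# Make glibc abort at the first inconsistent heap operation.
export MALLOC_CHECK_=3      # effective on glibc <= 2.33 (PVE 6/7)
# On glibc >= 2.34 the debug hooks moved to a separate preload library:
#   LD_PRELOAD=/lib/x86_64-linux-gnu/libc_malloc_debug.so \
#   GLIBC_TUNABLES=glibc.malloc.check=3 \
#   qemu-system-x86_64 ...   # same arguments as 'qm showcmd' prints
echo "MALLOC_CHECK_=$MALLOC_CHECK_"
```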
 
Sorry for the late reply!

Microcode is updated using the package:

Code:
root@pve03:~# journalctl --no-hostname -o short-monotonic --boot -0   | sed -n '1,/PM: Preparing system for sleep/p' | grep 'microcode\|smp'
[    0.000000] kernel: microcode: microcode updated early to revision 0x5003302, date = 2021-12-10
[    0.612943] kernel: smpboot: Allowing 64 CPUs, 0 hotplug CPUs
[    1.424490] kernel: smpboot: CPU0: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz (family: 0x6, model: 0x55, stepping: 0x7)
[    1.430971] kernel: smp: Bringing up secondary CPUs ...
[    0.008718] kernel: smpboot: CPU 16 Converting physical 0 to logical die 1
[    1.818553] kernel: smp: Brought up 2 nodes, 64 CPUs
[    1.818553] kernel: smpboot: Max logical packages: 2
[    1.818553] kernel: smpboot: Total of 64 processors activated (294457.74 BogoMIPS)
[    3.798595] kernel: microcode: sig=0x50657, pf=0x80, revision=0x5003302
[    3.801866] kernel: microcode: Microcode Update Driver: v2.2.

I've been unable to update the BIOS, as it can't be done remotely, and I have yet to schedule the update.
 
Hi, we still have the same problem. Have you found a solution? We have tried everything, but after using SPICE for a long time (we have Windows machines) the machine randomly stops working. We have more than 200 servers around Italy and it is a very noisy bug.
Thanks again
 
By "same problem" you mean "SPICE produces a segfault in the host's kernel and the VM stops abruptly"?

In my case the customer moved the two users with this issue to PCs, and they no longer use SPICE. Other users in this same cluster still use SPICE but have never had this issue. What causes this problem is still a mystery to me.
 