[SOLVED] VM stalls then finally reboots every few days

baas54

New Member
Nov 21, 2023
I have a problem with a KVM virtual machine that reboots every couple of days.
The VM uses PCI passthrough for a storage controller.

Proxmox host:
pveversion
pve-manager/8.1.4/ec5affc9e41f1d79 (running kernel: 6.5.13-1-pve)

The PVE host logs that one vCPU stalls for a long time, and after a while the OS in the VM reboots without logging anything inside the VM.
Syslog on the PVE host:

Mar 12 07:22:53 pve3 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 21s! [kvm:2030]
Mar 12 07:22:53 pve3 kernel: Modules linked in: dm_snapshot bluetooth ecdh_generic ecc msr xt_ipvs ip_vs nf_conntrack_netlink xt_nat vxlan ip6_udp_tunnel udp_tunnel xt_policy xt_mark xt_bpf xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE xfrm_user xfrm_algo xt_addrtype nft_compat overlay cfg80211 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs tcp_diag inet_diag veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables bonding tls openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 softdog sunrpc nfnetlink_log nfnetlink binfmt_misc snd_sof_pci_intel_icl x86_pkg_temp_thermal snd_sof_intel_hda_common intel_powerclamp coretemp soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence kvm_intel snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_hda_codec_hdmi snd_sof kvm snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi
Mar 12 07:22:53 pve3 kernel: crct10dif_pclmul mei_hdcp mei_pxp intel_rapl_msr soundwire_generic_allocation soundwire_bus polyval_generic ghash_clmulni_intel sha256_ssse3 snd_soc_core sha1_ssse3 aesni_intel snd_compress ac97_bus snd_pcm_dmaengine crypto_simd cryptd intel_cstate snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec pcspkr i915 cmdlinepart snd_hda_core wmi_bmof snd_hwdep snd_pcm snd_timer snd spi_nor mtd soundcore ee1004 8250_dw drm_buddy ttm drm_display_helper cec processor_thermal_device_pci_legacy rc_core processor_thermal_device mei_me processor_thermal_rfim drm_kms_helper processor_thermal_mbox processor_thermal_rapl mei intel_rapl_common i2c_algo_bit int340x_thermal_zone intel_soc_dts_iosf acpi_tad acpi_pad mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq simplefb dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c spi_pxa2xx_platform
Mar 12 07:22:53 pve3 kernel: dw_dmac dw_dmac_core crc32_pclmul xhci_pci nvme igc nvme_core xhci_pci_renesas spi_intel_pci i2c_i801 spi_intel nvme_common ahci intel_lpss_pci i2c_smbus intel_lpss libahci idma64 xhci_hcd video wmi
Mar 12 07:22:53 pve3 kernel: CPU: 1 PID: 2030 Comm: kvm Tainted: P O 6.5.13-1-pve #1
Mar 12 07:22:53 pve3 kernel: Hardware name: Default string Default string/MW-NAS-N5105, BIOS 5.19 03/28/2023
Mar 12 07:22:53 pve3 kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x21/0x60
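
For anyone hitting the same thing: a rough way to see how often these lockups occur and on which CPU is to count the watchdog lines in the kernel log. This is only a sketch; the function name soft_lockup_summary is made up for illustration, and it assumes the messages have the same "soft lockup - CPU#N" wording as in the log above.

```shell
# Hypothetical helper: reads kernel log lines on stdin and prints
# "CPU#N <count>" for every CPU that reported a soft lockup.
soft_lockup_summary() {
    grep -o 'soft lockup - CPU#[0-9]*' | sort | uniq -c | awk '{print $NF, $1}'
}

# Example use on the PVE host (kernel messages from the journal):
#   journalctl -k --no-pager | soft_lockup_summary
```

If the same CPU shows up again and again, that may point more towards a hardware or firmware problem than towards load on the host.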

I don't know where to begin debugging this behaviour.

Steps taken so far:
1. Reduced the memory for this particular VM to give PVE more memory (this PVE host runs only this VM and one container).
2. Removed IOThread on all virtual disks.
 
This motherboard is a "TOPTON" Chinese NAS motherboard. It is quite difficult to find accurate support information or BIOS files for this model.

A post somewhere on this topic led me to a solution: make the OS apply a microcode update while booting.
Since I did this, the system has been stable and has now run fine for 10 days. Fingers crossed.

What I did:

Added the Debian firmware repositories to /etc/apt/sources.list:

# extras for microcode updates
deb http://deb.debian.org/debian bookworm main contrib non-free-firmware
deb http://security.debian.org/debian-security bookworm-security main contrib non-free-firmware
deb http://deb.debian.org/debian bookworm-updates main contrib non-free-firmware

then followed by:

apt update
apt install intel-microcode

Then reboot. The system was stable after this.
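
To double-check that the firmware component is really enabled in the APT sources, something like the following can help. It is a minimal sketch: the function name has_firmware_component is made up for illustration, and the default path assumes the lines were added to /etc/apt/sources.list as described above.

```shell
# Hypothetical helper: succeeds if at least one active (uncommented)
# "deb" line in the given sources file enables non-free-firmware.
has_firmware_component() {
    grep -Eq '^[[:space:]]*deb[[:space:]].*non-free-firmware' "${1:-/etc/apt/sources.list}"
}

# Example use:
#   has_firmware_component && echo "non-free-firmware is enabled"
```

Whether the package itself ended up installed can be checked with dpkg -s intel-microcode.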

dmesg reports after this change:

[ 0.000000] microcode: updated early: 0x1d -> 0x24000024, date = 2022-09-02
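
To pull the old and new revision out of that line in a script, a small filter like this should do. It is only a sketch: the function name microcode_update_info is made up for illustration, and it assumes the "updated early: OLD -> NEW" wording shown above, which may differ on other kernel versions.

```shell
# Hypothetical helper: reads dmesg output on stdin and prints
# "old=<rev> new=<rev>" for an early microcode update line.
microcode_update_info() {
    sed -n 's/.*microcode: updated early: \(0x[0-9a-fA-F]*\) -> \(0x[0-9a-fA-F]*\).*/old=\1 new=\2/p'
}

# Example use on the PVE host:
#   dmesg | microcode_update_info
# The revision currently loaded can also be read with:
#   grep -m1 microcode /proc/cpuinfo
```

If the filter prints nothing, either no early update was applied at boot or the kernel words the message differently.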
 
