CPU soft lockup after Win10 VM shutdown

atylv

Dear All,

I am currently running a Win10 VM on pve-6.2-4 with kernel 5.4.41 and pve-qemu-kvm_5.0.0. My hardware is a Xeon E-2244G, 2x 16G ECC RAM, and a GTX 1660 Super (passed through). After I use Win10 for a short time (about 10-20 min, by which point dmesg shows "perf: interrupt took too long (2504 > 2500), lowering kernel.perf_event_max_sample_rate to 79750") and then shut it down, a CPU gets stuck in a soft lockup in the kvm process. The dmesg output is listed below:

[Fri May 29 00:01:03 2020] watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [kvm:6504]
[Fri May 29 00:01:03 2020] Modules linked in: tcp_diag(E) inet_diag(E) ebtable_filter(E) ebtables(E) ip_set(E) ip6table_raw(E) iptable_raw(E) ip6table_filter(E) ip6_tables(E) xt_nat(E) xt_tcpudp(E) veth(E) xt_conntrack(E) xt_MASQUERADE(E) nf_conntrack_netlink(E) xfrm_user(E) xfrm_algo(E) xt_addrtype(E) iptable_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) aufs(E) iptable_filter(E) bpfilter(E) overlay(E) softdog(E) nfnetlink_log(E) nfnetlink(E) intel_rapl_msr(E) intel_rapl_common(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) zfs(POE) aesni_intel(E) zunicode(POE) crypto_simd(E) zlua(POE) cryptd(E) glue_helper(E) zavl(POE) icp(POE) intel_cstate(E) ipmi_ssif(E) intel_rapl_perf(E) pcspkr(E) wmi_bmof(E) snd_hda_intel(E) 8250_dw(E) snd_intel_dspcfg(E) joydev(E) snd_hda_codec(E) input_leds(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) snd_timer(E) snd(E) soundcore(E) mei_me(E) mei(E) ie31200_edac(E)
[Fri May 29 00:01:03 2020] intel_pch_thermal(E) zcommon(POE) ipmi_si(E) ipmi_devintf(E) znvpair(POE) ipmi_msghandler(E) spl(OE) vhost_net(E) vhost(E) tap(E) ib_iser(E) acpi_tad(E) mac_hid(E) rdma_cm(E) iw_cm(E) ib_cm(E) ib_core(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E) scsi_transport_iscsi(E) vfio_pci(E) vfio_virqfd(E) irqbypass(E) vfio_iommu_type1(E) vfio(E) sunrpc(E) ip_tables(E) x_tables(E) autofs4(E) btrfs(E) xor(E) zstd_compress(E) raid6_pq(E) usbmouse(E) dm_thin_pool(E) dm_persistent_data(E) dm_bio_prison(E) usbkbd(E) dm_bufio(E) libcrc32c(E) hid_generic(E) usbhid(E) hid(E) ast(E) drm_vram_helper(E) ttm(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) i2c_i801(E) drm(E) igb(E) intel_lpss_pci(E) ahci(E) xhci_pci(E) dca(E) intel_lpss(E) i2c_algo_bit(E) libahci(E) idma64(E) virt_dma(E) xhci_hcd(E) wmi(E) video(E) pinctrl_cannonlake(E) pinctrl_intel(E)
[Fri May 29 00:01:03 2020] CPU: 5 PID: 6504 Comm: kvm Tainted: P OE 5.4.41-1-pve #1
[Fri May 29 00:01:03 2020] Hardware name: Supermicro Super Server/X11SCL-IF, BIOS 1.3 02/21/2020
[Fri May 29 00:01:03 2020] RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
[Fri May 29 00:01:03 2020] Code: c0 5d c3 b8 01 00 00 00 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 c6 07 00 0f 1f 40 00 48 89 f7 57 9d <0f> 1f 44 00 00 5d c3 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 8b 07
[Fri May 29 00:01:03 2020] RSP: 0018:ffffb13789767ac8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[Fri May 29 00:01:03 2020] RAX: 0000000000000000 RBX: ffff9a169a1fa4a4 RCX: 0000000000000000
[Fri May 29 00:01:03 2020] RDX: 001f000000000000 RSI: 0000000000000246 RDI: 0000000000000246
[Fri May 29 00:01:03 2020] RBP: ffffb13789767ac8 R08: 0000000000000000 R09: ffffffff9a372900
[Fri May 29 00:01:03 2020] R10: ffff9a1692cc92a0 R11: 0000000000000001 R12: 0000000000000001
[Fri May 29 00:01:03 2020] R13: ffff9a169a1fa428 R14: ffff9a169a1fa400 R15: 0000000000000246
[Fri May 29 00:01:03 2020] FS: 00007fe39a3ff700(0000) GS:ffff9a169eb40000(0000) knlGS:0000000000000000
[Fri May 29 00:01:03 2020] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri May 29 00:01:03 2020] CR2: 00007fe39edff9d0 CR3: 000000028ce0a004 CR4: 00000000003626e0
[Fri May 29 00:01:03 2020] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Fri May 29 00:01:03 2020] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Fri May 29 00:01:03 2020] Call Trace:
[Fri May 29 00:01:03 2020] __synchronize_hardirq+0x6f/0xd0
[Fri May 29 00:01:03 2020] __free_irq+0x145/0x2c0
[Fri May 29 00:01:03 2020] free_irq+0x32/0x70
[Fri May 29 00:01:03 2020] vfio_intx_set_signal+0x39/0x1d0 [vfio_pci]
[Fri May 29 00:01:03 2020] vfio_intx_disable+0x3a/0x60 [vfio_pci]
[Fri May 29 00:01:03 2020] vfio_pci_set_intx_trigger+0x117/0x180 [vfio_pci]
[Fri May 29 00:01:03 2020] vfio_pci_set_irqs_ioctl+0x87/0xb0 [vfio_pci]
[Fri May 29 00:01:03 2020] vfio_pci_disable+0x58/0x4a0 [vfio_pci]
[Fri May 29 00:01:03 2020] ? vfio_pci_disable+0x4a0/0x4a0 [vfio_pci]
[Fri May 29 00:01:03 2020] vfio_pci_release+0x4d/0x50 [vfio_pci]
[Fri May 29 00:01:03 2020] vfio_device_fops_release+0x22/0x40 [vfio]
[Fri May 29 00:01:03 2020] __fput+0xc6/0x260
[Fri May 29 00:01:03 2020] ____fput+0xe/0x10
[Fri May 29 00:01:03 2020] task_work_run+0x9d/0xc0
[Fri May 29 00:01:03 2020] do_exit+0x367/0xab0
[Fri May 29 00:01:03 2020] do_group_exit+0x47/0xb0
[Fri May 29 00:01:03 2020] get_signal+0x140/0x850
[Fri May 29 00:01:03 2020] ? __fpu__restore_sig+0x48d/0x610
[Fri May 29 00:01:03 2020] ? __set_current_blocked+0x3b/0x60
[Fri May 29 00:01:03 2020] do_signal+0x34/0x6e0
[Fri May 29 00:01:03 2020] ? __x64_sys_futex+0x143/0x17f
[Fri May 29 00:01:03 2020] ? restore_altstack+0x51/0x70
[Fri May 29 00:01:03 2020] exit_to_usermode_loop+0x90/0x130
[Fri May 29 00:01:03 2020] do_syscall_64+0x160/0x190
[Fri May 29 00:01:03 2020] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Fri May 29 00:01:03 2020] RIP: 0033:0x7fe7d771629c
[Fri May 29 00:01:03 2020] Code: Bad RIP value.
[Fri May 29 00:01:03 2020] RSP: 002b:00007fe39a3fa308 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[Fri May 29 00:01:03 2020] RAX: fffffffffffffe00 RBX: 00007fe7ca139280 RCX: 00007fe7d771629c
[Fri May 29 00:01:03 2020] RDX: 0000000000000002 RSI: 0000000000000080 RDI: 000055ac99eb08c0
[Fri May 29 00:01:03 2020] RBP: 0000000000000000 R08: 000055ac99eb08c0 R09: 000055ac99eb07c0
[Fri May 29 00:01:03 2020] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
[Fri May 29 00:01:03 2020] R13: 000055ac99eb08c0 R14: 0000000000000000 R15: 00007fe7ca1392a8

I have no idea what is going on and I wonder if anyone could kindly help me! Thanks!

Best,
Harold
 
Hi,

The debug message suggests that you have a problem with the PCIe passthrough.
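One way to see why the trace points at passthrough is to count which kernel modules own the frames in the call trace. The following is a minimal sketch, not an official tool: the function name `summarize_call_trace` and the regular expression are assumptions made for illustration, and the sample lines are copied from the log in the first post.

```python
import re
from collections import Counter

# Matches a call-trace frame such as
#   "[Fri May 29 00:01:03 2020] vfio_intx_disable+0x3a/0x60 [vfio_pci]"
# The trailing "[module]" is absent for functions built into the kernel image.
FRAME_RE = re.compile(
    r"^(?:\[[^\]]*\]\s*)?"        # optional dmesg timestamp
    r"(?P<guess>\?\s+)?"          # "? " marks a frame the unwinder is unsure about
    r"(?P<func>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def summarize_call_trace(dmesg_text):
    """Count reliable call-trace frames per module ("kernel" = built in)."""
    counts = Counter()
    in_trace = False
    for line in dmesg_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME_RE.match(line)
        if match is None:
            in_trace = False      # first non-frame line ends the trace
            continue
        if match.group("guess"):
            continue              # skip "?" frames
        counts[match.group("module") or "kernel"] += 1
    return counts


# A few lines taken from the log above.
SAMPLE = """\
[Fri May 29 00:01:03 2020] Call Trace:
[Fri May 29 00:01:03 2020] __synchronize_hardirq+0x6f/0xd0
[Fri May 29 00:01:03 2020] __free_irq+0x145/0x2c0
[Fri May 29 00:01:03 2020] free_irq+0x32/0x70
[Fri May 29 00:01:03 2020] vfio_intx_set_signal+0x39/0x1d0 [vfio_pci]
[Fri May 29 00:01:03 2020] vfio_intx_disable+0x3a/0x60 [vfio_pci]
[Fri May 29 00:01:03 2020] ? vfio_pci_disable+0x4a0/0x4a0 [vfio_pci]
[Fri May 29 00:01:03 2020] vfio_device_fops_release+0x22/0x40 [vfio]
[Fri May 29 00:01:03 2020] RIP: 0033:0x7fe7d771629c
"""

if __name__ == "__main__":
    for module, count in summarize_call_trace(SAMPLE).most_common():
        print(module, count)
```

In the posted trace, the frames just below the generic IRQ code (`free_irq`, `__synchronize_hardirq`) belong to `vfio_pci` and `vfio`, that is, to the teardown of the passed-through device's INTx interrupt when the kvm process exits.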
 
PCIe passthrough with GPU is tricky and there is no standard method.
Which GPU is it, and what does your VM config look like?
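When sharing the VM config, the passthrough-related part is the `hostpciN` lines. Below is a minimal sketch of pulling them out of a config file's text. It assumes the usual Proxmox VE layout of `key: value` lines (normally stored under `/etc/pve/qemu-server/<vmid>.conf`). The function name `passthrough_entries` and the sample config are hypothetical and are not the original poster's settings.

```python
def passthrough_entries(config_text):
    """Return {"hostpciN": {"device": ..., "options": {...}}} from a VM config."""
    entries = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            break                         # snapshot sections follow the live config
        if not line or line.startswith("#"):
            continue                      # skip blanks and description comments
        key, sep, value = line.partition(":")
        if not sep or not key.startswith("hostpci"):
            continue
        parts = [p.strip() for p in value.strip().split(",")]
        device = parts[0]
        if device.startswith("host="):    # the "host=" prefix is optional
            device = device[len("host="):]
        options = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
        entries[key] = {"device": device, "options": options}
    return entries


# Hypothetical example config, for illustration only.
EXAMPLE_CONFIG = """\
bios: ovmf
cores: 4
hostpci0: 01:00,pcie=1,x-vga=1
machine: q35
memory: 8192
"""

if __name__ == "__main__":
    print(passthrough_entries(EXAMPLE_CONFIG))
```

Posting these lines together with the host's kernel command line usually gives helpers enough context to judge the passthrough setup.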
 
[Attached image: 1591175236747.png]

The GPU is a GTX 1660 Super. I know NVIDIA cards can be tricky, but I did not expect one to affect the host system.
 