Troubleshooting a hard lockup caused by the Linux kernel or PVE

xiaopo

New Member
Jul 11, 2022
1. PVE configuration and passthrough

PVE 7.4.3
Linux pve 5.15.102-1-pve #1 SMP PVE 5.15.102-1 (2023-03-14T13:48Z) x86_64 GNU/Linux

Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt initcall_blacklist=sysfb_init video=simplefb:off pcie_acs_override=downstream,multifunction split_lock_detect=off"
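Before reading the traces it may be worth confirming that the options configured in GRUB actually reached the running kernel. The following is a minimal Python sketch of that check, not a Proxmox tool: the helper names `parse_cmdline`, `read_running_cmdline` and `missing_options` are made up for illustration, and the only system interface it assumes is the standard `/proc/cmdline` file.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {option: value}; bare flags map to None.

    If an option is repeated, the last occurrence wins in this simple sketch.
    """
    options = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        options[key] = value if sep else None
    return options


def read_running_cmdline(path="/proc/cmdline"):
    """Return the options the running kernel was actually booted with."""
    with open(path) as fh:
        return parse_cmdline(fh.read())


def missing_options(configured, running):
    """Options from `configured` that are absent or different in `running`."""
    return {
        key: value
        for key, value in configured.items()
        if key not in running or running[key] != value
    }


# The value of GRUB_CMDLINE_LINUX_DEFAULT shown above.
CONFIGURED = parse_cmdline(
    "quiet intel_iommu=on iommu=pt initcall_blacklist=sysfb_init "
    "video=simplefb:off pcie_acs_override=downstream,multifunction "
    "split_lock_detect=off"
)

if __name__ == "__main__":
    # An empty dict means every configured option is active on this boot.
    print(missing_options(CONFIGURED, read_running_cmdline()))
```

If the result is not empty, the GRUB configuration was probably not regenerated or the host was not rebooted after the change.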

The Windows desktop VM has a 1660S graphics card and a USB keyboard and mouse passed through; the card's DP output drives the monitor directly.

The Linux desktop VM has a Tesla P4 graphics card passed through for Emby video decoding.

2. Crash symptoms

Both the Windows desktop VM and the Linux desktop VM freeze, the keyboard and mouse stop responding, and the PVE host can no longer be reached over SSH.

The kernel traces show the same call path every time it crashes.

The traces of the last two lockups are as follows:

August 2 lockup:
Code:
Aug  2 14:37:09 pve kernel: [1260278.267429] NMI watchdog: Watchdog detected hard LOCKUP on cpu 3
Aug  2 14:37:09 pve kernel: [1260278.267433] Modules linked in: tcp_diag inet_diag cmac nls_utf8 cifs cifs_arc4 cifs_md4 fscache netfs veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables bonding tls softdog nfnetlink_log nfnetlink ipmi_ssif intel_rapl_msr intel_rapl_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec_realtek kvm_intel snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi kvm snd_hda_intel ast crct10dif_pclmul snd_intel_dspcfg ghash_clmulni_intel snd_usb_audio drm_vram_helper snd_intel_sdw_acpi aesni_intel snd_hda_codec drm_ttm_helper snd_usbmidi_lib ttm crypto_simd snd_rawmidi snd_hda_core cryptd snd_seq_device drm_kms_helper snd_hwdep mc cec snd_pcm rc_core rapl rndis_host snd_timer fb_sys_fops syscopyarea cdc_ether mei_me snd sysfillrect usbnet isst_if_mbox_pci isst_if_mmio sysimgblt intel_cstate mii soundcore pcspkr joydev efi_pstore input_leds acpi_ipmi isst_if_common intel_pch_thermal mei ioatdma
Aug  2 14:37:09 pve kernel: [1260278.267474]  ipmi_si ipmi_devintf zfs(PO) ipmi_msghandler acpi_power_meter acpi_pad zunicode(PO) zzstd(O) mac_hid zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfio_pci vfio_pci_core vfio_virqfd irqbypass vfio_iommu_type1 vfio drm sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic xor zstd_compress raid6_pq simplefb dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio usbkbd libcrc32c usbmouse hid_generic usbhid hid crc32_pclmul nvme xhci_pci igb i2c_i801 xhci_pci_renesas nvme_core i2c_algo_bit i2c_smbus ahci dca libahci xhci_hcd intel_pmt wmi
Aug  2 14:37:09 pve kernel: [1260278.267507] CPU: 3 PID: 4137035 Comm: CPU 11/KVM Tainted: P        W  O      5.15.102-1-pve #1
Aug  2 14:37:09 pve kernel: [1260278.267510] Hardware name: Supermicro X12DAi-N6/X12DAi-N6, BIOS 1.1b 09/10/2021
Aug  2 14:37:09 pve kernel: [1260278.267511] RIP: 0010:_raw_spin_lock+0x0/0x30
Aug  2 14:37:09 pve kernel: [1260278.267516] Code: 00 f0 0f b1 17 75 05 c3 cc cc cc cc 55 89 c6 48 89 e5 e8 43 5b 39 ff 66 90 5d c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 <0f> 1f 44 00 00 31 c0 ba 01 00 00 00 f0 0f b1 17 75 05 c3 cc cc cc
Aug  2 14:37:09 pve kernel: [1260278.267518] RSP: 0018:ff724a9600540d18 EFLAGS: 00000046
Aug  2 14:37:09 pve kernel: [1260278.267520] RAX: ff724a96000c5000 RBX: 0000000000000004 RCX: ff22181d4004b400
Aug  2 14:37:09 pve kernel: [1260278.267521] RDX: ff22181d4004b400 RSI: 0000000000000000 RDI: ff22181d4020dcc0
Aug  2 14:37:09 pve kernel: [1260278.267522] RBP: ff724a9600540dc8 R08: 00000000000003ac R09: ff22181d4020dcc0
Aug  2 14:37:09 pve kernel: [1260278.267523] R10: 0000000000000010 R11: 0000000000000004 R12: 00000000000003ac
Aug  2 14:37:09 pve kernel: [1260278.267524] R13: 0000000000000000 R14: ff22181d401d4e00 R15: ff22181d4020dcc0
Aug  2 14:37:09 pve kernel: [1260278.267525] FS:  00007fa3dbdff700(0000) GS:ff22185bbf2c0000(0000) knlGS:ffffd4815cb15000
Aug  2 14:37:09 pve kernel: [1260278.267527] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug  2 14:37:09 pve kernel: [1260278.267528] CR2: 00001bbd00881000 CR3: 00000039b81ec004 CR4: 0000000000773ee0
Aug  2 14:37:09 pve kernel: [1260278.267529] PKRU: 55555554
Aug  2 14:37:09 pve kernel: [1260278.267530] Call Trace:
Aug  2 14:37:09 pve kernel: [1260278.267532]  <IRQ>
Aug  2 14:37:09 pve kernel: [1260278.267532]  ? qi_submit_sync+0x328/0x5c0
Aug  2 14:37:09 pve kernel: [1260278.267537]  qi_flush_iotlb+0x84/0xa0
Aug  2 14:37:09 pve kernel: [1260278.267539]  intel_flush_iotlb_all+0x59/0x160
Aug  2 14:37:09 pve kernel: [1260278.267541]  iommu_dma_flush_iotlb_all+0x1a/0x30
Aug  2 14:37:09 pve kernel: [1260278.267544]  iova_domain_flush+0x1b/0x30
Aug  2 14:37:09 pve kernel: [1260278.267546]  fq_flush_timeout+0x39/0xc0
Aug  2 14:37:09 pve kernel: [1260278.267547]  ? fq_ring_free+0x170/0x170
Aug  2 14:37:09 pve kernel: [1260278.267549]  call_timer_fn+0x29/0x120
Aug  2 14:37:09 pve kernel: [1260278.267554]  __run_timers.part.0+0x1e1/0x270
Aug  2 14:37:09 pve kernel: [1260278.267555]  ? ktime_get+0x43/0xc0
Aug  2 14:37:09 pve kernel: [1260278.267557]  ? lapic_next_deadline+0x2c/0x40
Aug  2 14:37:09 pve kernel: [1260278.267561]  ? clockevents_program_event+0xa8/0x130
Aug  2 14:37:09 pve kernel: [1260278.267564]  run_timer_softirq+0x2a/0x60
Aug  2 14:37:09 pve kernel: [1260278.267565]  __do_softirq+0xd6/0x2ea
Aug  2 14:37:09 pve kernel: [1260278.267568]  irq_exit_rcu+0x94/0xc0
Aug  2 14:37:09 pve kernel: [1260278.267570]  sysvec_apic_timer_interrupt+0x80/0x90
Aug  2 14:37:09 pve kernel: [1260278.267574]  </IRQ>
Aug  2 14:37:09 pve kernel: [1260278.267575]  <TASK>
Aug  2 14:37:09 pve kernel: [1260278.267575]  asm_sysvec_apic_timer_interrupt+0x1b/0x20
Aug  2 14:37:09 pve kernel: [1260278.267577] RIP: 0010:vmx_do_interrupt_nmi_irqoff+0x10/0x20 [kvm_intel]
Aug  2 14:37:09 pve kernel: [1260278.267590] Code: 41 5b 41 5a 41 59 41 58 5e 5f 5a 59 58 5d e9 47 da c7 dc 0f 1f 80 00 00 00 00 55 48 89 e5 48 83 e4 f0 6a 18 55 9c 6a 10 ff d7 <0f> 1f 00 48 89 ec 5d e9 24 da c7 dc 0f 1f 40 00 0f 1f 44 00 00 55
Aug  2 14:37:09 pve kernel: [1260278.267591] RSP: 0018:ff724a9606cefcd8 EFLAGS: 00000082
Aug  2 14:37:09 pve kernel: [1260278.267593] RAX: 0000000000000e30 RBX: ff22181ef2ce8000 RCX: 0000000000000000
Aug  2 14:37:09 pve kernel: [1260278.267594] RDX: ffffffff00000000 RSI: 0001000000000000 RDI: ffffffff9e000e30

Most recent lockup, August 19:
Code:
Aug 19 09:28:29 pve kernel: [777177.076338] NMI watchdog: Watchdog detected hard LOCKUP on cpu 32
Aug 19 09:28:29 pve kernel: [777177.076340] Modules linked in: tcp_diag inet_diag cmac nls_utf8 cifs cifs_arc4 cifs_md4 fscache netfs veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables bonding tls softdog nfnetlink_log nfnetlink ipmi_ssif intel_rapl_msr intel_rapl_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi ast drm_vram_helper drm_ttm_helper snd_hda_intel snd_usb_audio ttm snd_intel_dspcfg snd_usbmidi_lib snd_intel_sdw_acpi drm_kms_helper snd_rawmidi snd_hda_codec snd_seq_device snd_hda_core cec snd_hwdep mc rc_core zfs(PO) snd_pcm rndis_host fb_sys_fops snd_timer rapl cdc_ether syscopyarea mei_me zunicode(PO) snd sysfillrect usbnet isst_if_mbox_pci isst_if_mmio sysimgblt intel_cstate isst_if_common mii soundcore efi_pstore pcspkr joydev ioatdma intel_pch_thermal mei
Aug 19 09:28:29 pve kernel: [777177.076381]  input_leds zzstd(O) zlua(O) acpi_ipmi zavl(PO) ipmi_si icp(PO) ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad zcommon(PO) mac_hid znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfio_pci vfio_pci_core vfio_virqfd irqbypass vfio_iommu_type1 vfio drm sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic xor zstd_compress raid6_pq simplefb hid_generic usbmouse usbkbd dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c usbhid hid crc32_pclmul nvme xhci_pci i2c_i801 xhci_pci_renesas igb nvme_core i2c_smbus i2c_algo_bit ahci dca xhci_hcd libahci intel_pmt wmi
Aug 19 09:28:29 pve kernel: [777177.076415] CPU: 32 PID: 0 Comm: swapper/32 Tainted: P        W  O      5.15.102-1-pve #1
Aug 19 09:28:29 pve kernel: [777177.076417] Hardware name: Supermicro X12DAi-N6/X12DAi-N6, BIOS 1.1b 09/10/2021
Aug 19 09:28:29 pve kernel: [777177.076418] RIP: 0010:qi_submit_sync+0x2db/0x5c0
Aug 19 09:28:29 pve kernel: [777177.076424] Code: 4d 8b 8e 10 01 00 00 31 db 41 f6 46 25 08 0f 95 c3 49 8b 41 10 83 c3 04 42 83 3c 20 03 0f 84 a3 01 00 00 49 8b 06 44 8b 68 34 <41> f6 c5 70 0f 85 5c 01 00 00 41 f6 c5 10 74 18 49 8b 06 8b 80 80
Aug 19 09:28:29 pve kernel: [777177.076425] RSP: 0018:ff594f7a00b24d20 EFLAGS: 00000093
Aug 19 09:28:29 pve kernel: [777177.076427] RAX: ff594f7a000c5000 RBX: 0000000000000004 RCX: ff3e37f40004b400
Aug 19 09:28:29 pve kernel: [777177.076428] RDX: ff3e37f40004b400 RSI: 0000000000000000 RDI: ff3e37f40020d1c0
Aug 19 09:28:29 pve kernel: [777177.076429] RBP: ff594f7a00b24dc8 R08: 000000000000014c R09: ff3e37f40020d1c0
Aug 19 09:28:29 pve kernel: [777177.076430] R10: 0000000000000010 R11: 0000000000000004 R12: 000000000000014c
Aug 19 09:28:29 pve kernel: [777177.076431] R13: 0000000000000000 R14: ff3e37f4001d4000 R15: ff3e37f40020d1c0
Aug 19 09:28:29 pve kernel: [777177.076432] FS:  0000000000000000(0000) GS:ff3e38327fa00000(0000) knlGS:0000000000000000
Aug 19 09:28:29 pve kernel: [777177.076433] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 19 09:28:29 pve kernel: [777177.076434] CR2: 000000001f1a5080 CR3: 0000000128eb8002 CR4: 0000000000773ee0
Aug 19 09:28:29 pve kernel: [777177.076436] PKRU: 55555554
Aug 19 09:28:29 pve kernel: [777177.076436] Call Trace:
Aug 19 09:28:29 pve kernel: [777177.076437]  <IRQ>
Aug 19 09:28:29 pve kernel: [777177.076438]  ? enqueue_entity+0x17d/0x760
Aug 19 09:28:29 pve kernel: [777177.076446]  qi_flush_iotlb+0x84/0xa0
Aug 19 09:28:29 pve kernel: [777177.076447]  intel_flush_iotlb_all+0x59/0x160
Aug 19 09:28:29 pve kernel: [777177.076450]  iommu_dma_flush_iotlb_all+0x1a/0x30
Aug 19 09:28:29 pve kernel: [777177.076452]  iova_domain_flush+0x1b/0x30
Aug 19 09:28:29 pve kernel: [777177.076454]  fq_flush_timeout+0x39/0xc0
Aug 19 09:28:29 pve kernel: [777177.076456]  ? fq_ring_free+0x170/0x170
Aug 19 09:28:29 pve kernel: [777177.076458]  call_timer_fn+0x29/0x120
Aug 19 09:28:29 pve kernel: [777177.076462]  __run_timers.part.0+0x1e1/0x270
Aug 19 09:28:29 pve kernel: [777177.076463]  ? ktime_get+0x43/0xc0
Aug 19 09:28:29 pve kernel: [777177.076465]  ? lapic_next_deadline+0x2c/0x40
Aug 19 09:28:29 pve kernel: [777177.076469]  ? clockevents_program_event+0xa8/0x130
Aug 19 09:28:29 pve kernel: [777177.076473]  run_timer_softirq+0x2a/0x60
Aug 19 09:28:29 pve kernel: [777177.076474]  __do_softirq+0xd6/0x2ea
Aug 19 09:28:29 pve kernel: [777177.076478]  irq_exit_rcu+0x94/0xc0
Aug 19 09:28:29 pve kernel: [777177.076480]  sysvec_apic_timer_interrupt+0x80/0x90
Aug 19 09:28:29 pve kernel: [777177.076483]  </IRQ>
Aug 19 09:28:29 pve kernel: [777177.076484]  <TASK>
Aug 19 09:28:29 pve kernel: [777177.076484]  asm_sysvec_apic_timer_interrupt+0x1b/0x20
Aug 19 09:28:29 pve kernel: [777177.076486] RIP: 0010:cpuidle_enter_state+0xd9/0x620
Aug 19 09:28:29 pve kernel: [777177.076491] Code: 3d 04 78 5e 7c e8 37 36 6d ff 49 89 c7 0f 1f 44 00 00 31 ff e8 78 43 6d ff 80 7d d0 00 0f 85 5e 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 f6 0f 88 6a 01 00 00 4d 63 ee 49 83 fd 09 0f 87 e5 03 00 00
Aug 19 09:28:29 pve kernel: [777177.076492] RSP: 0018:ff594f7a003a7e38 EFLAGS: 00000246
Aug 19 09:28:29 pve kernel: [777177.076493] RAX: ff3e38327fa30bc0 RBX: ff8b4f79fa637d00 RCX: 0000000000000000
Aug 19 09:28:29 pve kernel: [777177.076494] RDX: 0000000000016176 RSI: 00000000471c676c RDI: 0000000000000000
Aug 19 09:28:29 pve kernel: [777177.076495] RBP: ff594f7a003a7e88 R08: 0002c2d2999fe5b0 R09: 00000000000927c0
Aug 19 09:28:29 pve kernel: [777177.076496] R10: 0000000000000004 R11: 071c71c71c71c71c R12: ffffffff84ed4ca0
Aug 19 09:28:29 pve kernel: [777177.076497] R13: 0000000000000002 R14: 0000000000000002 R15: 0002c2d2999fe5b0
Aug 19 09:28:29 pve kernel: [777177.076499]  ? cpuidle_enter_state+0xc8/0x620
Aug 19 09:28:29 pve kernel: [777177.076502]  cpuidle_enter+0x2e/0x50

Both lockups go through qi_submit_sync, iommu_dma_flush_iotlb_all, and fq_flush_timeout, all of which are part of the IOMMU code, so I suspect the lockups are related to having the IOMMU enabled. Is that right? Or should I downgrade to PVE 6 with a Linux kernel older than 5.7?
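The overlap between the two traces can also be checked mechanically instead of by eye. The Python sketch below is only an illustration: `irq_frames` and `common_frames` are hypothetical helper names, the two strings are shortened copies of the `<IRQ>` sections from the logs above (syslog prefix dropped), and frames marked with `?` are skipped because the kernel flags them as unreliable.

```python
import re

# One stack frame, e.g. "[1260278.267537]  qi_flush_iotlb+0x84/0xa0".
# A leading "? " marks a frame the unwinder considers unreliable.
FRAME = re.compile(r"\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def irq_frames(log_text):
    """Return the reliable function names between <IRQ> and </IRQ>, in order."""
    frames, inside = [], False
    for line in log_text.splitlines():
        if "</IRQ>" in line:
            break
        if "<IRQ>" in line:
            inside = True
            continue
        if inside:
            match = FRAME.search(line)
            if match and not match.group(1):  # skip "? " frames
                frames.append(match.group(2))
    return frames


def common_frames(trace_a, trace_b):
    """Frames of trace_a that also appear in trace_b, keeping trace_a's order."""
    seen = set(irq_frames(trace_b))
    return [name for name in irq_frames(trace_a) if name in seen]


AUG_2 = """\
[1260278.267532]  <IRQ>
[1260278.267532]  ? qi_submit_sync+0x328/0x5c0
[1260278.267537]  qi_flush_iotlb+0x84/0xa0
[1260278.267539]  intel_flush_iotlb_all+0x59/0x160
[1260278.267541]  iommu_dma_flush_iotlb_all+0x1a/0x30
[1260278.267544]  iova_domain_flush+0x1b/0x30
[1260278.267546]  fq_flush_timeout+0x39/0xc0
[1260278.267574]  </IRQ>
"""

AUG_19 = """\
[777177.076437]  <IRQ>
[777177.076438]  ? enqueue_entity+0x17d/0x760
[777177.076446]  qi_flush_iotlb+0x84/0xa0
[777177.076447]  intel_flush_iotlb_all+0x59/0x160
[777177.076450]  iommu_dma_flush_iotlb_all+0x1a/0x30
[777177.076452]  iova_domain_flush+0x1b/0x30
[777177.076454]  fq_flush_timeout+0x39/0xc0
[777177.076483]  </IRQ>
"""

if __name__ == "__main__":
    print(common_frames(AUG_2, AUG_19))
```

For these excerpts the shared frames are qi_flush_iotlb, intel_flush_iotlb_all, iommu_dma_flush_iotlb_all, iova_domain_flush and fq_flush_timeout, which is the IOMMU flush path started from a timer. That matches the observation above but does not by itself show whether the root cause is the IOMMU driver, the hardware, or a passed-through device.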