Hi,
I'm testing Proxmox 8 on a Dell Precision T7960 workstation with the new Xeon w5-3435X (Sapphire Rapids), and I get a kernel Oops every time I try to launch a VM.
The computer has been configured for PCI(e) passthrough following the docs (1 / 2), but the Oops occurs even when starting a VM without any PCI device attached to it, so I guess passthrough is not the cause.
One strange thing, though: I had to enable IOMMU by putting
intel_iommu=on iommu=pt
on the cmdline, even though the doc says this is only necessary for pre-5.15 kernels.
I tried kernels 6.2.16-3-pve (default from the ISO), 6.2.16-15-pve and 6.2.16-16-pve (the most recent at the moment).
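As a side note on the IOMMU flags above, here is a minimal sketch of how one could confirm that they actually reached the running kernel. It only reads /proc/cmdline and looks for the two flags quoted in this post; the helper name `iommu_flags` is made up for illustration and is not part of any Proxmox tool.

```python
from pathlib import Path

# The two flags quoted in the post. The function name and the returned
# dictionary shape are illustrative assumptions, not an existing tool's API.
WANTED = ("intel_iommu=on", "iommu=pt")


def iommu_flags(cmdline: str) -> dict[str, bool]:
    """Report which wanted flags appear as whole tokens in a kernel cmdline."""
    tokens = cmdline.split()
    return {flag: flag in tokens for flag in WANTED}


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    path = Path("/proc/cmdline")
    if path.exists():
        for flag, present in iommu_flags(path.read_text()).items():
            print(f"{flag}: {'present' if present else 'missing'}")
    else:
        print("/proc/cmdline not found (not a Linux system?)")
```

Matching whole tokens rather than substrings avoids false positives such as `amd_iommu=pt` being counted as `iommu=pt`.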
Every time I launch a VM, it starts up correctly at first, then the kernel Oopses a few seconds later (a Windows 10 VM previously restored from PBS gets as far as its circular loading animation).
Here is the kernel output:
[ 143.155575] BUG: unable to handle page fault for address: ff2c3744a37f7cff
[ 143.155583] #PF: supervisor write access in kernel mode
[ 143.155586] #PF: error_code(0x0003) - permissions violation
[ 143.155588] PGD 117801067 P4D 117802067 PUD 1001f3063 PMD 1236c4063 PTE 80000001237f7161
[ 143.155593] Oops: 0003 [#1] PREEMPT SMP NOPTI
[ 143.155596] CPU: 0 PID: 631 Comm: z_wr_iss Tainted: P O 6.2.16-3-pve #1
[ 143.155598] Hardware name: Dell Inc. Precision 7960 Tower/01G0M6, BIOS 1.1.10 07/27/2023
[ 143.155600] RIP: 0010:kfpu_begin+0x31/0xa0 [zcommon]
[ 143.155612] Code: 3f 48 89 e5 fa 0f 1f 44 00 00 48 8b 15 88 89 00 00 65 8b 05 6d d5 78 3f 48 98 48 8b 0c c2 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5d 31 c0 31 d2 31 c9 31 f6 31 ff c3 cc cc cc cc 0f 1f 44
[ 143.155616] RSP: 0018:ff69853287b0f930 EFLAGS: 00010082
[ 143.155619] RAX: 00000000ffffffff RBX: ff2c3744fd314000 RCX: ff2c3744a37f5000
[ 143.155621] RDX: 00000000ffffffff RSI: ff2c3744fd314000 RDI: ff69853287b0fa80
[ 143.155623] RBP: ff69853287b0f930 R08: 0000000000000000 R09: 0000000000000000
[ 143.155625] R10: 0000000000000000 R11: 0000000000000000 R12: ff2c3744fd315000
[ 143.155627] R13: ff69853287b0fa80 R14: 0000000000001000 R15: 0000000000000000
[ 143.155629] FS: 0000000000000000(0000) GS:ff2c3753cfe00000(0000) knlGS:0000000000000000
[ 143.155632] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 143.155635] CR2: ff2c3744a37f7cff CR3: 000000010e620002 CR4: 0000000000773ef0
[ 143.155637] PKRU: 55555554
[ 143.155638] Call Trace:
[ 143.155641] <TASK>
[ 143.155644] fletcher_4_avx512f_native+0x1d/0xb0 [zcommon]
[ 143.155658] abd_fletcher_4_iter+0x71/0xe0 [zcommon]
[ 143.155668] abd_iterate_func+0x104/0x1e0 [zfs]
[ 143.155789] ? __pfx_abd_fletcher_4_iter+0x10/0x10 [zcommon]
[ 143.155795] ? __pfx_abd_fletcher_4_native+0x10/0x10 [zfs]
[ 143.155912] abd_fletcher_4_native+0x89/0xd0 [zfs]
[ 143.156005] ? txg_all_lists_empty+0x4f/0xa0 [zfs]
[ 143.156091] ? zio_vdev_io_done+0x4e/0x240 [zfs]
[ 143.156169] zio_checksum_compute+0x154/0x550 [zfs]
[ 143.156240] ? __kmem_cache_alloc_node+0x19d/0x340
[ 143.156247] ? spl_kmem_alloc+0xc3/0x120 [spl]
[ 143.156257] ? spl_kmem_alloc+0xc3/0x120 [spl]
[ 143.156263] ? __kmalloc_node+0x52/0xe0
[ 143.156266] ? spl_kmem_alloc+0xc3/0x120 [spl]
[ 143.156273] zio_checksum_generate+0x4d/0x80 [zfs]
[ 143.156344] zio_execute+0x94/0x170 [zfs]
[ 143.156414] taskq_thread+0x2ac/0x4d0 [spl]
[ 143.156422] ? __pfx_default_wake_function+0x10/0x10
[ 143.156426] ? __pfx_zio_execute+0x10/0x10 [zfs]
[ 143.156497] ? __pfx_taskq_thread+0x10/0x10 [spl]
[ 143.156504] kthread+0xe6/0x110
[ 143.156508] ? __pfx_kthread+0x10/0x10
[ 143.156511] ret_from_fork+0x29/0x50
[ 143.156514] </TASK>
[ 143.156515] Modules linked in: tcp_diag inet_diag ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables bonding tls softdog sunrpc nfnetlink_log binfmt_misc nfnetlink intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit snd_sof_pci_intel_tgl x86_pkg_temp_thermal snd_sof_intel_hda_common intel_powerclamp soundwire_intel soundwire_generic_allocation coretemp soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_ctl_led snd_sof_utils snd_soc_hdac_hda kvm_intel snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_bus snd_hda_codec_realtek kvm snd_hda_codec_generic snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine crct10dif_pclmul polyval_clmulni snd_hda_intel polyval_generic ghash_clmulni_intel snd_intel_dspcfg dell_wmi pmt_crashlog sha512_ssse3 pmt_telemetry ledtrig_audio snd_intel_sdw_acpi intel_sdsi pmt_class aesni_intel snd_virtuoso
[ 143.156549] snd_hda_codec crypto_simd snd_oxygen_lib snd_mpu401_uart cryptd dell_wmi_ddv snd_hda_core snd_rawmidi rapl snd_hwdep snd_seq_device dell_smbios snd_pcm dell_wmi_sysman intel_cstate sparse_keymap dcdbas ucsi_ccg pcspkr cmdlinepart firmware_attributes_class video snd_timer typec_ucsi dell_wmi_descriptor isst_if_mmio wmi_bmof isst_if_mbox_pci spi_nor idxd snd mei_me typec isst_if_common intel_vsec idxd_bus soundcore mtd mei input_leds mac_hid vhost_net vhost vhost_iotlb tap vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor hid_generic usbmouse usbkbd usbhid hid raid6_pq libcrc32c simplefb rtsx_pci_sdmmc nvme xhci_pci i2c_nvidia_gpu xhci_pci_renesas nvme_core crc32_pclmul atlantic i2c_ccgx_ucsi spi_intel_pci nvme_common i2c_i801 e1000e ahci rtsx_pci spi_intel i2c_smbus macsec xhci_hcd libahci wmi
[ 143.156601] pinctrl_alderlake
[ 143.156610] CR2: ff2c3744a37f7cff
[ 143.156612] ---[ end trace 0000000000000000 ]---
[ 143.316004] RIP: 0010:kfpu_begin+0x31/0xa0 [zcommon]
[ 143.316030] Code: 3f 48 89 e5 fa 0f 1f 44 00 00 48 8b 15 88 89 00 00 65 8b 05 6d d5 78 3f 48 98 48 8b 0c c2 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5d 31 c0 31 d2 31 c9 31 f6 31 ff c3 cc cc cc cc 0f 1f 44
[ 143.316034] RSP: 0018:ff69853287b0f930 EFLAGS: 00010082
[ 143.316037] RAX: 00000000ffffffff RBX: ff2c3744fd314000 RCX: ff2c3744a37f5000
[ 143.316039] RDX: 00000000ffffffff RSI: ff2c3744fd314000 RDI: ff69853287b0fa80
[ 143.316041] RBP: ff69853287b0f930 R08: 0000000000000000 R09: 0000000000000000
[ 143.316043] R10: 0000000000000000 R11: 0000000000000000 R12: ff2c3744fd315000
[ 143.316044] R13: ff69853287b0fa80 R14: 0000000000001000 R15: 0000000000000000
[ 143.316046] FS: 0000000000000000(0000) GS:ff2c3753cfe00000(0000) knlGS:0000000000000000
[ 143.316048] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 143.316050] CR2: ff2c3744a37f7cff CR3: 000000010e620002 CR4: 0000000000773ef0
[ 143.316052] PKRU: 55555554
[ 143.316053] note: z_wr_iss[631] exited with irqs disabled
[ 143.316074] note: z_wr_iss[631] exited with preempt_count 1
Complete dmesg: https://pastebin.com/BVSJQrYL
Other tests: https://pastebin.com/jdVH7Yy0 and https://pastebin.com/v4Pk2BHQ
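For anyone comparing the three logs, a rough sketch like the one below pulls the faulting symbol, its module and the task name out of an Oops. The sample lines are copied from the trace above; the function name `summarize_oops` and the regular expressions are my own assumptions about the log layout, not an official parser.

```python
import re

# A few lines copied verbatim from the Oops above.
SAMPLE = """\
[ 143.155596] CPU: 0 PID: 631 Comm: z_wr_iss Tainted: P O 6.2.16-3-pve #1
[ 143.155600] RIP: 0010:kfpu_begin+0x31/0xa0 [zcommon]
[ 143.155644] fletcher_4_avx512f_native+0x1d/0xb0 [zcommon]
"""

# "RIP: <segment>:<symbol>+0x<offset>/0x<size> [<module>]" (module is optional).
RIP_RE = re.compile(
    r"RIP: \w+:(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<module>\w+)\])?"
)
# "Comm: <task name>" on the CPU/PID line.
COMM_RE = re.compile(r"Comm: (?P<comm>\S+)")


def summarize_oops(log: str) -> dict:
    """Return the first faulting symbol, its module and the task name, if found."""
    rip = RIP_RE.search(log)
    comm = COMM_RE.search(log)
    return {
        "symbol": rip.group("symbol") if rip else None,
        "module": rip.group("module") if rip else None,
        "task": comm.group("comm") if comm else None,
    }


print(summarize_oops(SAMPLE))
# {'symbol': 'kfpu_begin', 'module': 'zcommon', 'task': 'z_wr_iss'}
```

On the trace above this reports `kfpu_begin` in `zcommon`, running in task `z_wr_iss`, so the faulting frame is logged in a ZFS module rather than in `kvm` or `vfio` code.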
In the BIOS (which is up to date), all relevant virtualization options are already enabled: VT, VT for Direct I/O, TXT (tried with and without) and Pre-Boot DMA protection + OS kernel DMA support (tried with and without). I also tried an option to "limit memory to less than 1 TB" (I have 64 GB) because it is supposed to improve compatibility with some PCIe adapters, but no luck.
I also tested PVE 7 on this computer to check whether that would work. There was no kernel Oops, but I was not able to launch a VM at all: a QEMU error prevented the start. I did not spend much more time on it since it was just a test, but it makes me suspect that the problem is somehow linked to the (rather new) hardware configuration.
Any ideas? What can I do?
Thanks.