Agreed, I am just not 100% sure this is my issue. From reading the threads, it sounds like this change was made in kernel 6.8. I have tested all the way up to 6.8.12-18 and have had no issues. But if I try 6.8.12-29, it crashes after some time. I was actually able to update my BIOS to the newest version. It ran much longer, but it still crashed.
I am interested in the difference between 18 and 29.
Same here: with the update to 6.8.12-29-pve I'm seeing many "DMAR: ERROR: DMA PTE for vPFN" messages in the journal. After some time the node crashes or gets killed by fencing. Going back to 6.8.12-22-pve solves the issue. My hardware is a "Dell PowerEdge R740" with a "MegaRAID SAS controller".
To me it looks like a regression between 6.8.12-22-pve and 6.8.12-29-pve, reproducible on multiple Dell R740 nodes with the megaraid_sas controller.
The part of the log showing one of the kernel messages:
Jun 04 03:14:30 clrz19-17 kernel: DMAR: ERROR: DMA PTE for vPFN 0x50700 already set (to 50700003 not 396de8001)
Jun 04 03:14:30 clrz19-17 kernel: ------------[ cut here ]------------
Jun 04 03:14:30 clrz19-17 kernel: WARNING: CPU: 75 PID: 2165752 at drivers/iommu/intel/iommu.c:2231 __domain_mapping+0x30c/0x510
Jun 04 03:14:30 clrz19-17 kernel: Modules linked in: veth rbd xt_mac ceph libceph rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace netfs ebtable_filter ebtables ip6table_raw ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables iptable_raw ipt_REJECT nf_reject_ipv4 xt_physdev xt_addrtype xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment xt_tcpudp xt_set xt_mark iptable_filter ip_set_hash_net ip_set sctp scsi_dh_alua dm_service_time iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nf_tables libcrc32c vxlan ip6_udp_tunnel udp_tunnel 8021q garp mrp iTCO_wdt intel_pmc_bxt iTCO_vendor_support sunrpc binfmt_misc bonding tls nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common isst_if_common skx_edac skx_edac_common nfit x86_pkg_temp_thermal ipmi_ssif intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd cmdlinepart dell_smbios rapl
Jun 04 03:14:30 clrz19-17 kernel: dcdbas spi_nor mgag200 mei_me intel_cstate dell_wmi_descriptor wmi_bmof pcspkr i2c_algo_bit mtd acpi_power_meter mei intel_pch_thermal ipmi_si acpi_ipmi ipmi_devintf ipmi_msghandler mac_hid cdc_ether usbnet mii zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap dm_multipath efi_pstore dmi_sysfs ip_tables x_tables autofs4 xhci_pci xhci_pci_renesas crc32_pclmul i40e ahci i2c_i801 spi_intel_pci xhci_hcd tg3 megaraid_sas spi_intel lpc_ich i2c_smbus libahci wmi
Jun 04 03:14:30 clrz19-17 kernel: CPU: 75 PID: 2165752 Comm: kworker/u194:4 Tainted: P O 6.8.12-29-pve #1
Jun 04 03:14:30 clrz19-17 kernel: Hardware name: Dell Inc. PowerEdge R740/0WXD1Y, BIOS 2.24.0 03/27/2025
Jun 04 03:14:30 clrz19-17 kernel: Workqueue: writeback wb_workfn (flush-252:1)
Jun 04 03:14:30 clrz19-17 kernel: RIP: 0010:__domain_mapping+0x30c/0x510
Jun 04 03:14:30 clrz19-17 kernel: Code: 48 89 c2 4c 89 5d b0 48 c7 c7 80 38 44 95 e8 eb 31 6c ff 8b 05 89 ad 9c 01 4c 8b 5d b0 85 c0 74 09 83 e8 01 89 05 78 ad 9c 01 <0f> 0b e9 f3 fe ff ff 41 80 e2 7f e9 d1 fe ff ff 41 83 c8 01 49 63
Jun 04 03:14:30 clrz19-17 kernel: RSP: 0018:ffffd27489297028 EFLAGS: 00010202
Jun 04 03:14:30 clrz19-17 kernel: RAX: 0000000000000004 RBX: 0000000000000001 RCX: 0000000000000000
Jun 04 03:14:30 clrz19-17 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Jun 04 03:14:30 clrz19-17 kernel: RBP: ffffd274892970c0 R08: 0000000000000000 R09: 0000000000000000
Jun 04 03:14:30 clrz19-17 kernel: R10: 0000000000000000 R11: 0000000000000002 R12: ffff8d2a92f95700
Jun 04 03:14:30 clrz19-17 kernel: R13: ffff8d2a9253c800 R14: 0000000396de8001 R15: ffff8d2a9253c800
Jun 04 03:14:30 clrz19-17 kernel: FS: 0000000000000000(0000) GS:ffff8ea6bf880000(0000) knlGS:0000000000000000
Jun 04 03:14:30 clrz19-17 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 04 03:14:30 clrz19-17 kernel: CR2: 000073c9c865e0a0 CR3: 0000001557836002 CR4: 00000000007726f0
Jun 04 03:14:30 clrz19-17 kernel: PKRU: 55555554
Jun 04 03:14:30 clrz19-17 kernel: Call Trace:
Jun 04 03:14:30 clrz19-17 kernel: <TASK>
Jun 04 03:14:30 clrz19-17 kernel: intel_iommu_map_pages+0xe1/0x140
Jun 04 03:14:30 clrz19-17 kernel: __iommu_map+0x11e/0x280
Jun 04 03:14:30 clrz19-17 kernel: iommu_map_sg+0xbf/0x1f0
Jun 04 03:14:30 clrz19-17 kernel: iommu_dma_map_sg+0x45f/0x4f0
Jun 04 03:14:30 clrz19-17 kernel: __dma_map_sg_attrs+0x32/0xd0
Jun 04 03:14:30 clrz19-17 kernel: dma_map_sg_attrs+0xe/0x30
Jun 04 03:14:30 clrz19-17 kernel: scsi_dma_map+0x47/0x70
Jun 04 03:14:30 clrz19-17 kernel: megasas_build_and_issue_cmd_fusion+0x209/0x1890 [megaraid_sas]
Jun 04 03:14:30 clrz19-17 kernel: megasas_queue_command+0x11d/0x1b0 [megaraid_sas]
Jun 04 03:14:30 clrz19-17 kernel: scsi_queue_rq+0x3fe/0xcc0
Jun 04 03:14:30 clrz19-17 kernel: blk_mq_dispatch_rq_list+0x137/0x810
Jun 04 03:14:30 clrz19-17 kernel: ? sbitmap_get+0x73/0x180
Jun 04 03:14:30 clrz19-17 kernel: __blk_mq_sched_dispatch_requests+0x41f/0x5d0
Jun 04 03:14:30 clrz19-17 kernel: ? dd_insert_requests+0x13e/0x450
Jun 04 03:14:30 clrz19-17 kernel: ? sbitmap_get_shallow+0x68/0x140
Jun 04 03:14:30 clrz19-17 kernel: blk_mq_sched_dispatch_requests+0x2f/0x80
Jun 04 03:14:30 clrz19-17 kernel: blk_mq_run_hw_queue+0x259/0x350
Jun 04 03:14:30 clrz19-17 kernel: blk_mq_flush_plug_list.part.0+0x187/0x5c0
Jun 04 03:14:30 clrz19-17 kernel: blk_add_rq_to_plug+0x14d/0x1b0
Jun 04 03:14:30 clrz19-17 kernel: blk_mq_submit_bio+0x596/0x690
Jun 04 03:14:30 clrz19-17 kernel: __submit_bio+0xb3/0x1c0
Jun 04 03:14:30 clrz19-17 kernel: submit_bio_noacct_nocheck+0x17a/0x390
Jun 04 03:14:30 clrz19-17 kernel: submit_bio_noacct+0x1ca/0x660
Jun 04 03:14:30 clrz19-17 kernel: ? bio_add_page+0xa3/0xd0
Jun 04 03:14:30 clrz19-17 kernel: submit_bio+0xb2/0x110
Jun 04 03:14:30 clrz19-17 kernel: ext4_bio_write_folio+0x1b7/0x6f0
Jun 04 03:14:30 clrz19-17 kernel: mpage_submit_folio+0x94/0xc0
Jun 04 03:14:30 clrz19-17 kernel: mpage_map_and_submit_buffers+0x1ae/0x340
Jun 04 03:14:30 clrz19-17 kernel: ext4_do_writepages+0x770/0xe10
Jun 04 03:14:30 clrz19-17 kernel: ext4_writepages+0xb5/0x190
Jun 04 03:14:30 clrz19-17 kernel: do_writepages+0xcd/0x1f0
Jun 04 03:14:30 clrz19-17 kernel: ? wakeup_preempt+0x6b/0x80
Jun 04 03:14:30 clrz19-17 kernel: ? ttwu_do_activate+0x75/0x250
Jun 04 03:14:30 clrz19-17 kernel: __writeback_single_inode+0x44/0x370
Jun 04 03:14:30 clrz19-17 kernel: writeback_sb_inodes+0x211/0x510
Jun 04 03:14:30 clrz19-17 kernel: __writeback_inodes_wb+0x54/0x100
Jun 04 03:14:30 clrz19-17 kernel: ? queue_io+0x115/0x120
Jun 04 03:14:30 clrz19-17 kernel: wb_writeback+0x2df/0x350
Jun 04 03:14:30 clrz19-17 kernel: wb_workfn+0x368/0x4d0
Jun 04 03:14:30 clrz19-17 kernel: ? __schedule+0x433/0x1500
Jun 04 03:14:30 clrz19-17 kernel: ? add_timer+0x20/0x40
Jun 04 03:14:30 clrz19-17 kernel: process_one_work+0x182/0x3a0
Jun 04 03:14:30 clrz19-17 kernel: worker_thread+0x18b/0x330
Jun 04 03:14:30 clrz19-17 kernel: ? __pfx_worker_thread+0x10/0x10
Jun 04 03:14:30 clrz19-17 kernel: kthread+0xf2/0x120
Jun 04 03:14:30 clrz19-17 kernel: ? __pfx_kthread+0x10/0x10
Jun 04 03:14:30 clrz19-17 kernel: ret_from_fork+0x47/0x70
Jun 04 03:14:30 clrz19-17 kernel: ? __pfx_kthread+0x10/0x10
Jun 04 03:14:30 clrz19-17 kernel: ret_from_fork_asm+0x1b/0x30
Jun 04 03:14:30 clrz19-17 kernel: </TASK>
Jun 04 03:14:30 clrz19-17 kernel: ---[ end trace 0000000000000000 ]---
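For anyone comparing these messages across nodes, here is a minimal Python sketch that pulls the three numbers out of a "DMA PTE for vPFN ... already set" line and splits the two PTE values into page frame and permission bits. It assumes 4 KiB pages and the Intel IOMMU second-level PTE layout where bit 0 is read and bit 1 is write; the function names `decode_pte` and `parse_dmar_error` are made up for this example, and any other bits in the entries are ignored.

```python
import re

# Matches the message format seen in the journal excerpt above.
DMAR_RE = re.compile(
    r"DMA PTE for vPFN (0x[0-9a-fA-F]+) already set "
    r"\(to ([0-9a-fA-F]+) not ([0-9a-fA-F]+)\)"
)

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def decode_pte(value):
    """Split a raw PTE value into page frame number and R/W bits (assumed layout)."""
    return {
        "pfn": value >> PAGE_SHIFT,
        "read": bool(value & 0x1),   # assumption: bit 0 = read
        "write": bool(value & 0x2),  # assumption: bit 1 = write
    }


def parse_dmar_error(line):
    """Return decoded fields for one journal line, or None if it does not match."""
    match = DMAR_RE.search(line)
    if match is None:
        return None
    vpfn, existing, attempted = (int(group, 16) for group in match.groups())
    return {
        "vpfn": vpfn,
        "existing": decode_pte(existing),    # value already present in the page table
        "attempted": decode_pte(attempted),  # value the kernel tried to write
    }


if __name__ == "__main__":
    sample = (
        "Jun 04 03:14:30 clrz19-17 kernel: DMAR: ERROR: DMA PTE for vPFN "
        "0x50700 already set (to 50700003 not 396de8001)"
    )
    info = parse_dmar_error(sample)
    print(hex(info["vpfn"]), info["existing"], info["attempted"])
```

Under those assumptions, the line above decodes to an existing entry with page frame 0x50700 (the same number as the vPFN) marked read and write, while the rejected entry points at page frame 0x396de8 and is read-only. Feeding it the output of `journalctl -k` line by line should make it easier to see whether the same vPFNs keep showing up.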