We recently upgraded several 2-node Proxmox clusters from kernel 6.8.12-20-pve to 6.8.12-29-pve and started seeing instability on one node in multiple clusters.
Environment:
- Proxmox VE 8.x
- Dell PowerEdge R350
- Dell PERC H345 RAID controller (megaraid_sas)
- 2-node clusters with Corosync + QDevice + GlusterFS
- Intel VT-d enabled
Symptoms after the upgrade:
- Corosync node drops and rejoins
- Unexpected node reboots in some locations
- Filesystem recovery required after reboot on a few hosts
- MegaRAID controller resets reported by the kernel
The kernel log shows:
    DMAR: ERROR: DMA PTE for vPFN already set
followed by traces involving:
    intel_iommu_map_pages
    iommu_dma_map_sg
    scsi_dma_map
    megasas_build_and_issue_cmd_fusion
and later:
    megaraid_sas: resetting fusion adapter scsi0
Interestingly, the issue was observed across multiple servers with nearly identical hardware after the kernel upgrade.
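
For anyone who wants to check their own hosts for the same signature, here is a rough sketch that counts the two messages quoted above in a kernel log dump. The patterns are built only from the excerpts in this post, so the exact wording on other kernels may differ; the script name and the event labels are made up for illustration.

```python
import re
import sys
from collections import Counter

# Message fragments copied from the log excerpts in this post.
# Assumption: real log lines may carry extra detail (addresses, timestamps),
# so the patterns match loosely rather than on the full line.
PATTERNS = {
    "dmar_pte_error": re.compile(r"DMAR: ERROR: DMA PTE for vPFN .*already set"),
    "megaraid_reset": re.compile(r"megaraid_sas: resetting fusion adapter"),
}


def count_events(log_text: str) -> Counter:
    """Count the lines that match each pattern in a kernel log dump."""
    counts = Counter({name: 0 for name in PATTERNS})
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Reads the log from stdin and prints one count per event type.
    for name, n in count_events(sys.stdin.read()).items():
        print(f"{name}: {n}")
```

One way to feed it is `journalctl -k -b | python3 scan_kernel_log.py`, which pipes the kernel messages of the current boot into the script.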
As a mitigation we added:
    intel_iommu=on iommu=pt
to the kernel command line.
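
To confirm that the parameters are actually active after a reboot, the running command line can be read from `/proc/cmdline`. The following is only a minimal sketch: the helper names are invented for this example, and the parser is deliberately simple (it splits on whitespace and does not handle quoted values).

```python
from pathlib import Path

# The two parameters from the mitigation described in this post.
EXPECTED = {"intel_iommu": "on", "iommu": "pt"}


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {key: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value  # if a key repeats, the last occurrence wins
    return params


def missing_params(cmdline: str, expected: dict = EXPECTED) -> list:
    """Return the expected key=value pairs that are absent or set differently."""
    params = parse_cmdline(cmdline)
    return [f"{k}={v}" for k, v in expected.items() if params.get(k) != v]


if __name__ == "__main__":
    current = Path("/proc/cmdline").read_text()
    missing = missing_params(current)
    if missing:
        print("missing:", " ".join(missing))
    else:
        print("all expected parameters are active")
```

An empty result from `missing_params` means both options were picked up by the running kernel; anything else usually means the bootloader configuration was edited but not regenerated.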
Since applying this change:
- No new DMA PTE errors have been observed
- No new MegaRAID controller resets have been observed
- Clusters have remained stable
Has anyone else seen similar behavior with:
- Dell R350 (or similar 15G Dell servers)
- PERC H345 / MegaRAID controllers
- Proxmox 8.x kernels in the 6.8 series
- Intel IOMMU / VT-d enabled
Any feedback or similar experiences would be appreciated.