At first, I was convinced that the hardware must be okay, because it had already been replaced in March 2024.
In March, I also experienced random crashes, but the server was still running ESXi, which isn't officially supported on this hardware.
So I decided to migrate to Proxmox. I had crashes while installing Proxmox, so the hardware was replaced, and I had no more issues with this server until last week.
Finally, after another complete hardware replacement yesterday, I have had no crashes for more than 16 hours.
So maybe there is an issue with the Asus Pro WS W680 boards, with the chipset, and/or with the i9-13900 (non-K) as the hardware gets older.
Status update:
#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008 -> still crashes (2024-05-03)
#6: Installed intel-microcode from Debian sid -> still crashes (2024-05-03, current revision: 0x00000122 <- updated early from: 0x0000011f)
#7: Disabled KSM -> still crashes (2024-05-03)
#8: Dealt with MSRs by setting options kvm ignore_msrs=1 report_ignored_msrs=0 in kvm.conf -> still crashes
#9: Went back to kernel 6.5 and left all the modifications in place -> still crashes
#10: Set pcie_aspm=off and pcie_port_pm=off -> still crashes
#11: Set intel_idle.max_cstate=0 and processor.max_cstate=1 -> still crashes
#12: Set intel_pstate=disable -> still crashes
#13: Turned off APST using nvme_core.default_ps_max_latency_us=0 -> still crashes
#14: Disabled GPU power management via i915.enable_dc=0 -> still crashes
#15: Changed /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline -> still crashes
#16: Lowered the RAM from DDR5-4400 to DDR5-4200 -> still crashes
#17: Reverted some of the changes, disabled ASPM in the BIOS -> still crashes
-> #18: Let Hetzner change the complete hardware, revert most of the changes <- working
BIOS is 2008, intel-microcode is the standard package from Debian stable, and the kernel is now 6.8.4-2-pve with the initially used cmdline:
Code:
pcie_aspm.policy=performance split_lock_detect=off
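
With this many boot parameters set and reverted, it is easy to lose track of what the running kernel was actually booted with. Below is a minimal Python sketch for checking that. It assumes a Linux host where /proc/cmdline is readable. The function names (parse_cmdline, mismatched_params, read_running_cmdline) are made up for illustration and are not part of any Proxmox tooling.

```python
from __future__ import annotations

from pathlib import Path


def parse_cmdline(cmdline: str) -> dict[str, str | None]:
    """Split a kernel command line into {parameter: value}.

    Flags without '=' (e.g. 'quiet') map to None.
    Simplification: quoted values containing spaces are not handled.
    """
    params: dict[str, str | None] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def mismatched_params(cmdline: str, expected: dict[str, str | None]) -> list[str]:
    """Return the expected parameters that are missing or have another value."""
    active = parse_cmdline(cmdline)
    return [
        key
        for key, value in expected.items()
        if key not in active or active[key] != value
    ]


def read_running_cmdline() -> str:
    """Read the command line of the currently running kernel (Linux only)."""
    return Path("/proc/cmdline").read_text().strip()


# The two parameters from the cmdline quoted above.
EXPECTED = {
    "pcie_aspm.policy": "performance",
    "split_lock_detect": "off",
}
```

Calling mismatched_params(read_running_cmdline(), EXPECTED) on the host should return an empty list if both parameters are active. Any name it returns was either not applied or was set to a different value.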