Proxmox "random" stuck/halt on 12900HK Erying Similar board

bocatadejamon

New Member
Apr 25, 2024
Hi,

I will try to explain the problem as best I can:

I have a board with a 12900HK; it is similar to the Erying one.


Only the following is connected to the board:

PSU Corsair RM850 850W

2 x 16 GB Corsair DDR4-3200 RAM

RJ45 cable

1 NVMe CT1000P2 in the bottom NVMe slot, close to the PCI slot

1 USB stick with Unraid that I'm currently not using


Tests I ran:

memtest completed on the RAM with no errors

The board didn't halt or get stuck under Unraid or Windows 10

I have virtualization and IOMMU enabled in the BIOS.

I tried both the 1 Gbps interface and the 2.5 Gbps one; both show the same behavior


I was using the board with Unraid, but I don't like its VM management at all, so I switched the OS to Proxmox 7.4.

It was a mess: roughly every 20 minutes the whole system would crash and halt.

I was also seeing a lot of errors related to PCIe ASPM (8 GB of log errors in 5-6 minutes), and I found a "fix" here, adding pcie_aspm=off:

The complete all-round host, the ultimate big/little-core solution - Part 3: Usage - Zhihu

But that didn't solve the problem.

So I thought it might be because it was an older version and moved to Proxmox 8.1, but I'm seeing exactly the same behavior.

When the server halts there are no logs in journalctl and no messages in dmesg, and the only way to recover is a forced shutdown by holding the power button.

I also had a ping running, plus a keyboard and screen connected directly to the server; when it halts, the ping stops, the CLI doesn't respond at all and the screen doesn't do anything.
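
Since nothing shows up in the journal after a hang, one thing worth ruling out is that the journal is only kept in memory. A minimal sketch, assuming the default journald setting (Storage=auto), under which logs survive a reboot only if /var/log/journal exists. The helper name journal_is_persistent is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the persistent journal directory exists.
# $1 = directory to check (defaults to /var/log/journal).
journal_is_persistent() {
    [ -d "${1:-/var/log/journal}" ]
}
```

If the directory is there, `journalctl --list-boots` lists the earlier boots and `journalctl -b -1 -e` jumps to the end of the previous one. A hard freeze can still lose the last lines that never reached the disk, so an empty tail does not prove that nothing was logged.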

1º I installed a fresh Proxmox 8.1 with ext4

2º I manually copied the VMs' .raw drives to the server, configured them and started them

2.1º The VMs are actually light: Home Assistant, Klipper and 2 Ubuntu servers, each with 1 core and 2 GB of RAM

3º I left the server running and it stayed "alive" for around 2 days

4º I enabled IOMMU following this guide https://www.servethehome.com/how-to-pass-through-pcie-nics-with-proxmox-ve-on-intel-and-amd/ (this was on Wednesday)

I also added pcie_aspm=off, because of the errors I saw on Proxmox 7.4, and pcie_port_pm=off

5º The server worked fine until today at 2 am, when it got stuck again (attached file)

6º I tried to change the driver for the network interfaces, since they use the rtl8169 driver and I previously had problems with it on other boards, but I couldn't make it work following this guide https://www.reddit.com/r/Proxmox/co...8169_nic_dell_micro_formfactors_in/?rdt=51878

The new driver didn't work and I had to manually revert to rtl8169, as Proxmox wasn't seeing the network interfaces

7º Right now I'm testing with IOMMU disabled, in case that is the cause

Any ideas?

I want to throw the board out of the window
 

Attachments

  • proxmox 2am crash.txt
    25.9 KB · Views: 2
  • lspci.txt
    1.5 KB · Views: 0
So, I updated to Proxmox 8.2 to test the newer kernel in case it would fix my problems, but it's still happening.

Today I got a new crash around 10:55 am.

Right now I'm running tests with the C-states:
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on pcie_aspm=off pcie_port_pm=off intel_idle.max_cstate=3"
GRUB_CMDLINE_LINUX=""
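
Changes to /etc/default/grub only take effect after running update-grub and rebooting. A small sketch for checking that the running kernel actually received a parameter. cmdline_param is a hypothetical helper name, and it reads /proc/cmdline unless a command-line string is passed as the second argument:

```shell
#!/bin/sh
# Hypothetical helper: print the value of a kernel command-line parameter.
# $1 = parameter name, $2 = command line (defaults to the running kernel's).
# Returns non-zero if the parameter is not present.
cmdline_param() {
    name=$1
    line=${2:-$(cat /proc/cmdline)}
    # Intentionally unquoted: split the command line on whitespace.
    for word in $line; do
        case $word in
            "$name"=*) printf '%s\n' "${word#*=}"; return 0 ;;
            "$name")   printf '%s\n' "(set, no value)"; return 0 ;;
        esac
    done
    return 1
}
```

For example, `cmdline_param intel_idle.max_cstate` should print 3 once the line above is active; if it prints nothing, the boot entry was not regenerated.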
 

Attachments

  • proxmox new crash.txt
    788.6 KB · Views: 0
I changed the OS to Unraid (I was planning to run Proxmox and have a VM with Unraid), and under Unraid everything runs perfectly, with no problems even with some VMs, so it's very weird.
On Unraid I'm running the same microcode, the same max C-state (9) and the same BIOS; I didn't change any option.

Any idea?
I would really like to run Proxmox, as I don't like the VM options on Unraid.
 
So I came back to Proxmox after putting another board in my Unraid NAS server.

I installed Windows 10 on the board and ran Prime95 for an hour and a half with no problems
Brand new 750W PSU

After that:

Fresh Proxmox 8.2 install on a 1 TB NVMe

After finishing the install:
Configured syslog forwarding to the NAS
apt update && apt install rsyslog


Loaded the module for the temperature sensors:
modprobe nct6775
Disabled the enterprise repos and enabled the no-subscription ones
Installed lm-sensors
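
Installing rsyslog alone does not forward anything; a forwarding rule is also needed. A sketch of generating one in rsyslog's legacy syntax, where a single @ means UDP and @@ means TCP. The helper name, the address 192.0.2.10 and the file name 90-remote.conf are placeholders, not values from this setup:

```shell
#!/bin/sh
# Hypothetical helper: print an rsyslog rule that forwards all messages.
# $1 = remote host, $2 = port (default 514), $3 = udp (default) or tcp.
rsyslog_forward_rule() {
    host=$1
    port=${2:-514}
    proto=${3:-udp}
    if [ "$proto" = "tcp" ]; then
        prefix='@@'
    else
        prefix='@'
    fi
    printf '*.* %s%s:%s\n' "$prefix" "$host" "$port"
}
```

For example, `rsyslog_forward_rule 192.0.2.10 > /etc/rsyslog.d/90-remote.conf` followed by `systemctl restart rsyslog`. The point of forwarding is that messages sent just before a hard freeze may reach the NAS even if they never get written to the local disk.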

Up to this point everything is normal and works fine.

I enabled IOMMU by editing:


nano /etc/default/grub

GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"

And added modules:
nano /etc/modules
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
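
After editing /etc/modules, the Proxmox passthrough documentation recommends refreshing the initramfs with update-initramfs -u -k all and rebooting. Whether the IOMMU is really active can then be checked by counting the groups the kernel exposes in sysfs. A sketch, with iommu_group_count as a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: count IOMMU groups exposed by the kernel.
# $1 = sysfs directory (defaults to /sys/kernel/iommu_groups).
iommu_group_count() {
    root=${1:-/sys/kernel/iommu_groups}
    count=0
    for group in "$root"/*/; do
        # With no groups the glob stays literal and fails the -d test.
        if [ -d "$group" ]; then
            count=$((count + 1))
        fi
    done
    printf '%s\n' "$count"
}
```

A count of 0 suggests the IOMMU is not active, and `dmesg | grep -e DMAR -e IOMMU` shows the related kernel messages. Comparing that output between a stable boot without intel_iommu=on and a boot with it may help narrow down which device is involved in the hang.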

I rebooted, and around 10 minutes later the Proxmox server started to crash. On every boot it now crashes after around 5-20 minutes, with no logs in the journal and none on the syslog server.

Now I'm trying the same with another NVMe drive.
 
Okay, so I installed Proxmox 8.2 on another NVMe, a 250 GB one

It installed perfectly using ext4

Booted as normal
Removed the enterprise repos
Enabled syslog and configured it to forward to the NAS
Installed lm-sensors
Loaded the module for the temperature sensors:
modprobe nct6775
Ran apt-get update and upgrade just in case
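
modprobe only loads nct6775 until the next reboot, so on a box that keeps getting hard-reset the sensor module has to be loaded again each time. A sketch that adds the module to /etc/modules once, without creating duplicate entries. ensure_module_listed is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: make sure a module name is listed in a modules file.
# $1 = module name, $2 = file (defaults to /etc/modules).
ensure_module_listed() {
    module=$1
    file=${2:-/etc/modules}
    # -x matches the whole line, so "nct6775" does not match "nct6775_extra".
    if grep -q -x "$module" "$file" 2>/dev/null; then
        return 0
    fi
    printf '%s\n' "$module" >> "$file"
}
```

Running `ensure_module_listed nct6775` as root should be enough for the `sensors` readings to come back after each reboot, which makes it easier to see whether temperatures look normal shortly before a hang.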

After that I created a new VM and installed an OS in it

So far it has been 1 hour and 30 minutes and the system is stable (I didn't enable IOMMU)
 
