[SOLVED] Proxmox System Crashing Every 3-4 Days - Need Help with Debugging

randomuser1990

New Member
Feb 20, 2023
Hello everyone,

I recently set up a new system with Proxmox and encountered some issues.
The system memory was tested for 72 hours using memtest86 without any errors. So I don't expect the issue to be memory.

I have two virtual machines running on Proxmox, one for Ubuntu 22.04 and the other for TrueNAS. The Ubuntu VM has no PCIe passthrough, while the TrueNAS VM has the HDD controller passed through. However, the system has been crashing every 3-4 days, and after a crash I am unable to access the Proxmox UI or any of the VMs.

After a crash has occurred, I plugged the server into a display and saw the following messages:

[348271.123124] INFO: Task kworker/2:3:633181 blocked for more than 604 seconds.
[number] Tainted P W O 5.15.74-1-pve #1
[number] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disable this message.

I am not sure what the right procedure is for finding the root cause of the issue. I would appreciate any advice or guidance on how to debug this issue. Please let me know if you need more information about my setup to help me find a solution.

Thank you in advance for your help.
 
You could try a different, more recent kernel version and see if that changes anything:
apt install pve-kernel-6.1

Also check if there are any BIOS updates available.

Hope this helps!
 
I tried this and had a crash today while actually using the server.
I managed to connect it to a screen and see what was happening, and it seems like it's the network card that goes down. That fits the previous crashes: each time I restarted, the CPU and memory graphs looked normal, but the network had not been utilized.
I'm not sure how to fix this.
This was on the 6.1 kernel.


lspci | egrep -i --color 'network|ethernet'
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 05)

And it seems like it's using the r8169 driver:

find /sys | grep drivers.*02:00
/sys/bus/pci/drivers/r8169/0000:02:00.0
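The same check can be done without searching all of /sys, because each PCI device has a `driver` symlink in sysfs pointing at whatever driver is bound to it. A minimal sketch, assuming a POSIX shell; `pci_driver` and the `SYSFS_ROOT` override are hypothetical names introduced here for illustration, not part of any Proxmox tooling:

```shell
# Print the kernel driver bound to a PCI device by resolving its sysfs
# "driver" symlink. Prints "none" if no driver is bound.
# SYSFS_ROOT is a hypothetical override (handy for testing); default is /sys.
pci_driver() {
    addr="$1"   # full PCI address, e.g. 0000:02:00.0
    link="${SYSFS_ROOT:-/sys}/bus/pci/devices/$addr/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name, e.g. .../drivers/r8169
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}
```

On the system above, `pci_driver 0000:02:00.0` should print `r8169`, matching the `find` output. Running it again after a driver change is a quick way to confirm which module actually claimed the card.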

The correct driver seems to be r8125: https://www.realtek.com/en/componen...0-1000m-gigabit-ethernet-pci-express-software

How do I install this in Proxmox?
 

Blocked tasks are waiting on an unacknowledged I/O request, usually disk but not always. A blocked task also isn't necessarily the cause of the crash. My money would be on the TrueNAS VM: turn it off and see if the problem persists. If it does, you have a hardware issue; if it doesn't, the simplest solution is to have Proxmox act as your file server and not bother with TrueNAS at all.
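To see which tasks were reported as blocked, and for how long, the kernel log can be filtered for the hung-task message. This is only a sketch: `hung_tasks` is a hypothetical helper name, and the pattern is based on the message quoted in this thread (it accepts both "Task" and "task").

```shell
# Read kernel log text on stdin and print "<task> <seconds>" for every
# hung-task report of the form
#   INFO: task NAME blocked for more than N seconds.
hung_tasks() {
    sed -n 's/.*INFO: [Tt]ask \([^ ]*\) blocked for more than \([0-9]*\) seconds.*/\1 \2/p'
}
```

For example, `dmesg | hung_tasks` filters the current boot, and `journalctl -k -b -1 | hung_tasks` filters the previous boot. The latter only works if the journal is stored persistently, and a hard hang may not get its last messages written to disk at all.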
 
I'm pretty sure it's the network card now. I will try to get the correct network driver installed, but I'm getting a few errors that are new to me when installing the driver from Realtek.
 
I get the following error when trying to install it:

$ ./autorun.sh
Check older driver and unload it.
rmmod r8169
make[2]: *** /lib/modules/6.1.10-1-pve/build: No such file or directory. Stop.
make[1]: *** [Makefile:196: clean] Error 2
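The `No such file or directory` message for `/lib/modules/6.1.10-1-pve/build` usually means the kernel headers for the running kernel are not installed, so out-of-tree modules cannot be built. A minimal sketch of a pre-flight check; `check_headers` and `MODULES_ROOT` are hypothetical names, and the suggested `pve-headers-<release>` package name is an assumption about the Proxmox repository naming that should be verified with `apt search pve-headers`:

```shell
# Report whether kernel build files exist for a kernel release.
# Defaults to the running kernel. MODULES_ROOT is a hypothetical
# override for testing; the default is /lib/modules.
check_headers() {
    release="${1:-$(uname -r)}"
    build="${MODULES_ROOT:-/lib/modules}/$release/build"
    if [ -e "$build" ]; then
        echo "ok: $build"
    else
        # Package name is an assumption; verify before installing.
        echo "missing: $build (try: apt install pve-headers-$release)"
        return 1
    fi
}
```

Running `check_headers` before `./autorun.sh` shows whether the build can find the headers. Headers only address this particular `make` error; they do not make a driver compile against a kernel it does not support.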
 
The driver does not work on kernels newer than 5.19, so I reverted to 5.15 and have now installed it. I will report back in a few days on whether it works.
 
Installing the correct network driver fixed the issue.
Could you please mark the thread as solved? You can do so by editing the original post and selecting the 'Solved' prefix before the subject line.
In that way, it will be easier for other users who run into the same problem.

Thanks in advance!
 