Why did my server crawl to a halt and die?

Adakis

New Member
May 6, 2024
Hi,

Something weird happened on my Proxmox server (8.2.7) hosted with OVH. I noticed that connections to both the Proxmox server itself and any VMs on virtual IPs behind my pfSense acted like they had network congestion. I eventually lost the ability to even SSH or HTTPS into Proxmox, so I went to the provider's virtual KVM and saw that Proxmox was at the regular boot screen. When I typed in "root" and hit enter, I was never prompted for a password. The cursor would just blink, and eventually I'd get a timeout message as if I hadn't typed the password quickly enough. I tried a few more times, but eventually the server acted like it was fully locked up.

I issued a reset through the IPMI interface. The server came up with several pages of this:

1734474227903.png

But then it did show the regular "Welcome to proxmox" message. I'm now back in the GUI and starting up VMs and things all seem to be coming up fine.

My question: do you have any suggestions on things I might do or check to figure out why this happened in the first place? All services were fully up and operational without issue all day today, and I didn't make any Proxmox, VM or firewall upgrades. It was just "business as usual" until...well, it wasn't :-(

Thanks!
 
ENOSPC means that there is no space left on the PVE OS drive, maybe "/" or "/boot".
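
A quick way to check this from the shell, as a rough Python sketch: the kernel can report ENOSPC when a filesystem runs out of inodes as well as when it runs out of blocks, so it looks at both. The function name fs_usage is just made up for this example, and the mount points are the two mentioned above.

```python
import os


def fs_usage(path):
    """Return the used fraction of blocks and inodes for the filesystem holding path."""
    st = os.statvfs(path)
    # f_bavail / f_favail are what an unprivileged process can still use.
    blocks_used = 1.0 - st.f_bavail / st.f_blocks if st.f_blocks else 0.0
    inodes_used = 1.0 - st.f_favail / st.f_files if st.f_files else 0.0
    return {"blocks_used": blocks_used, "inodes_used": inodes_used}


for mount in ("/", "/boot"):
    if os.path.exists(mount):
        usage = fs_usage(mount)
        print(f"{mount}: {usage['blocks_used']:.0%} of space, "
              f"{usage['inodes_used']:.0%} of inodes in use")
```

If either number is close to 100% on the root filesystem, that would be the first thing to clean up.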
 
I was worried about that too, but it seems to be OK?

1734475245109.png

I was also reading other posts where people were trying things like journalctl --since "2024-12-17 16:00" --no-pager. Unfortunately, it looks like the journal only has entries from when I went into the virtual KVM and rebooted. The last entry before that was in September, when I first set the box up.
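
For anyone hitting the same thing: journald only keeps logs across reboots when it uses persistent storage, which with the Storage=auto setting means the /var/log/journal directory has to exist. Below is a minimal Python sketch that checks for that directory and builds the journalctl call for the previous boot. The helper names are made up for illustration, and whether this explains the gap on this particular box is only a guess.

```python
import os
import subprocess

PERSISTENT_DIR = "/var/log/journal"  # where journald stores persistent logs


def journal_is_persistent(path=PERSISTENT_DIR):
    """True if the directory journald needs for persistent storage exists."""
    return os.path.isdir(path)


def previous_boot_cmd(boot_offset=-1, since=None):
    """Build a journalctl command for an earlier boot (-1 = the boot before this one)."""
    cmd = ["journalctl", "--no-pager", f"--boot={boot_offset}"]
    if since:
        cmd += ["--since", since]
    return cmd


def list_boots():
    """Return the boots journald knows about, one per line."""
    result = subprocess.run(
        ["journalctl", "--list-boots", "--no-pager"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


print("persistent journal directory present:", journal_is_persistent())
print("command for the previous boot:", " ".join(previous_boot_cmd()))
```

If list_boots() only shows the current boot, the logs from before the reset were most likely never written to disk.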
 
In reviewing CPU usage I see there was a big climb in CPU for several days! Any chance there are more detailed logs that might show what was chewing up all that CPU at that time?

Over the past week...

1734494725017.png

Over the past day...
1734494652259.png
 
The first error is from your Intel network card. Have a look at this explanation, which may help you fix it. It is about disabling hardware offloading for all VLAN-related parts: https://wcgw.ghost.io/journalctl-and-obscure-proxmox-errors/

For the second question:
By default, there is no log tracing which process used how much CPU. The statistics only cover total CPU usage. So either you use an extra monitoring tool for that or, just for troubleshooting, help yourself with a script running in the background, e.g. as described here: https://unix.stackexchange.com/ques...-when-it-is-high-or-touching-certain-treshold
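
As a rough idea of what such a background script could look like, here is a minimal Python sketch. It samples the aggregate "cpu" line of /proc/stat and, when total usage crosses a threshold, appends the top of the ps output to a log file. The threshold, interval, log path and function names are all assumptions to adjust.

```python
import subprocess
import time
from datetime import datetime

THRESHOLD = 0.80                    # assumed trigger: 80% total CPU busy
INTERVAL_S = 60                     # assumed sampling interval in seconds
LOG_PATH = "/var/log/cpu-hogs.log"  # hypothetical log file


def parse_cpu_line(line):
    """Return (idle, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = [int(x) for x in line.split()[1:9]]  # user .. steal
    idle = fields[3] + fields[4]                  # idle + iowait
    return idle, sum(fields)


def busy_fraction(before, after):
    """Fraction of CPU time spent non-idle between two 'cpu' lines."""
    idle0, total0 = parse_cpu_line(before)
    idle1, total1 = parse_cpu_line(after)
    total = total1 - total0
    return 0.0 if total <= 0 else 1.0 - (idle1 - idle0) / total


def read_cpu_line():
    with open("/proc/stat") as f:
        return f.readline()


def top_processes(n=15):
    """Header plus the n processes with the highest %CPU according to ps."""
    out = subprocess.run(
        ["ps", "-eo", "pid,user,pcpu,pmem,comm", "--sort=-pcpu"],
        capture_output=True, text=True, check=True,
    ).stdout
    return "\n".join(out.splitlines()[: n + 1])


def watch():
    """Loop forever, logging the top processes whenever usage is high."""
    prev = read_cpu_line()
    while True:
        time.sleep(INTERVAL_S)
        cur = read_cpu_line()
        busy = busy_fraction(prev, cur)
        prev = cur
        if busy >= THRESHOLD:
            with open(LOG_PATH, "a") as log:
                log.write(f"--- {datetime.now().isoformat()} busy={busy:.0%}\n")
                log.write(top_processes() + "\n")
```

Call watch() at the end of the file and run it in the background to start logging. Note that the %CPU column of ps is averaged over each process's lifetime, so a process that only recently started burning CPU may rank lower than expected.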
 
Thanks much, @fba! I'll definitely try adding this to my LAN interface:

offload-rxvlan off
offload-txvlan off
offload-tso off
offload-rx-vlan-filter off
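
To try the same settings at runtime before making them permanent, something like the Python sketch below could build the matching ethtool call. This assumes the four options above correspond to the ethtool features rxvlan, txvlan, tso and rx-vlan-filter, and the interface name eno1 is only a placeholder.

```python
import subprocess

# Assumed ethtool feature names for the four interface options above.
FEATURES = ("rxvlan", "txvlan", "tso", "rx-vlan-filter")


def ethtool_offload_cmd(iface, state="off"):
    """Build an 'ethtool -K' command that switches the listed features on or off."""
    cmd = ["ethtool", "-K", iface]
    for feature in FEATURES:
        cmd += [feature, state]
    return cmd


def apply_offload(iface, state="off"):
    """Run the command; needs root and is not persistent across reboots."""
    subprocess.run(ethtool_offload_cmd(iface, state), check=True)


# "eno1" is a placeholder; substitute the real LAN interface name.
print(" ".join(ethtool_offload_cmd("eno1")))
```

Changes made this way are lost on reboot, so the interface options are still needed for a permanent fix.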

I've got some work today where my environment can't be down, but I will schedule downtime and get this done. As far as I can see, this was a long, SLOW trend towards maxing out the processor, but I'll try to report back either way once I know whether the change seemed to help. I set up a little script to write out process info to a text file when CPU usage gets too high. Thanks again!
 
