Why did my server crawl to a halt and die?

Adakis · Dec 17, 2024

Hi,

Something weird happened on my Proxmox server (8.2.7) hosted with OVH. I noticed that connections to both the Proxmox server itself and any VMs on virtual IPs behind my pfSense acted like they had network congestion. I eventually lost the ability to even SSH or HTTPS into Proxmox, so I went to the virtual KVM with the provider and saw that Proxmox was at the regular boot screen. When I typed in "root" and hit enter, I was never prompted for a password. The cursor would just blink and then eventually I'd get a timeout message as if I hadn't typed the password in quickly enough. I tried a few more times but eventually the server just acted like it was fully locked up.

I issued a reset through the IPMI interface. The server came up with several pages of this:

But then it did show the regular "Welcome to proxmox" message. I'm now back in the GUI and starting up VMs and things all seem to be coming up fine.

My question: do you have any suggestions on things I might do/check to figure out why this happened in the first place? All services were fully up and operational without issue all day today, and I didn't make any Proxmox, VM or firewall upgrades. It was just "business as usual" until...well, it wasn't :-(

Thanks!

waltar · Dec 17, 2024

ENOSPC means that there is no space left on the PVE OS drive, maybe "/" or "/boot".

Adakis · Dec 17, 2024

I was worried about that too but seems to be ok?
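For anyone who wants to script the same disk-space check, here is a minimal sketch. It is not from the thread. It assumes a Linux host, and the helper name `usage_report` and the choice of mount points are illustrative. It reports inode usage as well as bytes, because a filesystem can return ENOSPC when it runs out of inodes even if free space remains.

```python
import os
import shutil


def usage_report(path):
    """Return (percent of bytes used, percent of inodes used) for the
    filesystem that holds `path`."""
    du = shutil.disk_usage(path)   # total / used / free, in bytes
    st = os.statvfs(path)          # includes inode counters
    bytes_pct = 100.0 * du.used / du.total if du.total else 0.0
    # Some filesystems report 0 total inodes; treat that as 0% used.
    if st.f_files:
        inodes_pct = 100.0 * (st.f_files - st.f_ffree) / st.f_files
    else:
        inodes_pct = 0.0
    return bytes_pct, inodes_pct


if __name__ == "__main__":
    # "/" and "/boot" are the two locations suggested above.
    for mount in ("/", "/boot"):
        if os.path.exists(mount):
            b, i = usage_report(mount)
            print(f"{mount}: {b:.1f}% of space used, {i:.1f}% of inodes used")
```

If both percentages are well below 100, a full OS drive is unlikely to be the cause.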

I was also reading other posts where people tried things like journalctl --since "2024-12-17 16:00" --no-pager, but unfortunately it looks like that only captured info from when I went into the virtual KVM and rebooted. The last entry before that was from September, when I first set the box up.

Adakis · Dec 18, 2024

In reviewing CPU usage I see there was a big climb in CPU for several days! Any chance there are more detailed logs that might show what was chewing up all that CPU at that time?

Over the past week...

Over the past day...

fba · Dec 18, 2024

The first error is from your Intel network card. Have a look at this explanation, which may help you fix it. It is about disabling hardware offloading for all VLAN-related parts: https://wcgw.ghost.io/journalctl-and-obscure-proxmox-errors/

For the second question:
By default there is no log that traces how much CPU each process used. The statistics only cover total CPU usage. So either you use an extra monitoring tool for that, or, just for troubleshooting, you help yourself with a script running in the background, e.g. as described here: https://unix.stackexchange.com/ques...-when-it-is-high-or-touching-certain-treshold
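As a rough illustration of such a background script (a sketch, not the one from the linked answer): it samples total CPU usage from /proc/stat and, when usage crosses a threshold, appends a `ps` snapshot to a log file. The threshold, interval, log path and all function names are assumptions chosen for the example. Note that the `pcpu` column from `ps` is averaged over each process's lifetime, so it only approximates who is busy right now.

```python
import subprocess
import time
from datetime import datetime

THRESHOLD = 80.0                   # percent busy; hypothetical value
INTERVAL = 60                      # seconds between samples; hypothetical
LOG_PATH = "/var/log/cpu-hogs.log"  # hypothetical log location


def read_cpu_times(stat_text):
    """Parse the aggregate 'cpu' line of /proc/stat into (idle, total) jiffies."""
    fields = stat_text.splitlines()[0].split()
    # user, nice, system, idle, iowait, irq, softirq, steal. The guest
    # fields that follow are already included in user/nice, so skip them.
    values = [int(v) for v in fields[1:9]]
    idle = values[3] + (values[4] if len(values) > 4 else 0)
    return idle, sum(values)


def busy_percent(before, after):
    """Percent of non-idle CPU time between two (idle, total) samples."""
    idle_delta = after[0] - before[0]
    total_delta = after[1] - before[1]
    if total_delta <= 0:
        return 0.0
    return 100.0 * (total_delta - idle_delta) / total_delta


def top_processes(count=10):
    """Header line plus the `count` processes with the highest CPU share."""
    out = subprocess.run(
        ["ps", "-eo", "pid,pcpu,pmem,comm", "--sort=-pcpu"],
        capture_output=True, text=True, check=True,
    ).stdout
    return "\n".join(out.splitlines()[: count + 1])


def sample():
    with open("/proc/stat") as f:
        return read_cpu_times(f.read())


def monitor():
    """Loop forever, logging a process snapshot whenever CPU is high."""
    before = sample()
    while True:
        time.sleep(INTERVAL)
        after = sample()
        busy = busy_percent(before, after)
        before = after
        if busy >= THRESHOLD:
            with open(LOG_PATH, "a") as log:
                stamp = datetime.now().isoformat(timespec="seconds")
                log.write(f"{stamp} total CPU {busy:.1f}%\n{top_processes()}\n\n")
```

Calling `monitor()` as root, for instance from a small systemd service, would leave a timestamped trail to check after the next slowdown.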

Adakis · Dec 18, 2024

Thanks much @fba ! I'll definitely try adding this to my LAN interface:

offload-rxvlan off
offload-txvlan off
offload-tso off
offload-rx-vlan-filter off

I've got some work today where my environment can't be down, but I will schedule downtime and get this done. As far as I can see this was a long, SLOW trend towards maxing out the processor, but I'll try to report back either way when I know if the change seemed to help. I set up a little script to write out process info to a text file when CPU usage gets too high. Thanks again!
