Proxmox host dies randomly

Taledo

Member
Nov 20, 2020
Hello all,

A few weeks ago, I finished the second version of my home lab, mainly out of concern for power efficiency.

I now have 2 nodes in a cluster. One is a NUC-like Gigabyte BRIX, and the other is custom-built. The custom-built one is giving me trouble.

The symptoms are as follows:

- PVE crashes silently and randomly. The screen goes black, the keyboard is unresponsive, the shutdown button doesn't work, and there is no networking at all.
- The last logs before a crash don't show any sign of a failure:

1663693764461.png

- After the reboot, there are no log entries for the period between the crash and the restart.
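
To put a number on that silent period, something like the following could be used. It is only a sketch: it assumes the systemd journal is persistent (i.e. /var/log/journal exists, otherwise nothing from the previous boot survives), and the function names epoch_of and silent_gap are made up for illustration.

```shell
#!/bin/sh
# Extract whole epoch seconds from a "short-unix" journal line,
# which starts with a timestamp such as 1663693764.461000.
epoch_of() {
    ts=${1%% *}        # keep only the first field
    echo "${ts%%.*}"   # drop the fractional part
}

# Print how long the logs were silent between the last message of the
# previous boot and the first message of the current boot.
silent_gap() {
    last=$(journalctl -q -b -1 -n 1 -o short-unix --no-pager)
    first=$(journalctl -q -b 0 -o short-unix --no-pager | head -n 1)
    echo "silent for $(( $(epoch_of "$first") - $(epoch_of "$last") )) seconds"
}
```

Run silent_gap as root after a crash. `journalctl -b -1 -e` then shows the tail of the previous boot for a closer look.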

Memtest is good.


Config-wise:
MB: H470M-HDV/M.2 (BIOS 1.40, latest available)
CPU: Intel(R) Pentium(R) Gold G6405 CPU @ 4.10GHz
Memory: 2 x 16 GB G.SKILL, 2666 MHz
Storage: 1 x 500 GB NVMe SSD for OS and VMs
         1 x 2 TB SATA SSD for VMs

Load-wise, here are the graphs from my LibreNMS:

1663694275693.png

pveversion output:

Code:
root@Balthasar-2:~# pveversion 
proxmox-ve: 7.2-1 (running kernel: 5.15.53-1-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)

I'm at a loss here. Help is appreciated.

Cheers,

Taledo
 

Attachments

  • 1663693909082.png (123.1 KB)
  • 1663694263200.png (190 KB)

Updating this :
I've tried installing kdump, but I'm not seeing any dumps after a crash.
Plugging a monitor into it just shows a black screen.
C-states have been disabled.

I'm tempted to try a Debian live system from a USB drive. Still, any assistance would be appreciated.
 
This looks like a hardware issue or an incompatibility between the kernel and your hardware.

On the hardware side, try monitoring the temperature and overall status of the CPU, RAM, disks, chipset and video card. Maybe something is overheating.
On the kernel side, you may try v5.19:

https://forum.proxmox.com/threads/opt-in-linux-5-19-kernel-for-proxmox-ve-7-x-available.115090/

Trying the same hardware with another distro is a good option too.

Edit: just saw those temp readings in the log you posted above: 60+ degrees for a drive is "way too much".
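
For the temperature monitoring suggested above, a rough sketch like this could log anything that runs hot. It reads the kernel's hwmon sysfs files, which report millidegrees Celsius. The function names and the 60 degree default are only illustrative, and which sensors appear depends on the loaded drivers (NVMe drives usually show up on their own, SATA drives typically only if the drivetemp module is loaded).

```shell
#!/bin/sh
# Convert a sysfs reading in millidegrees Celsius (e.g. 61000) to whole degrees.
milli_to_c() {
    echo $(( $1 / 1000 ))
}

# Print every hwmon temperature sensor at or above a limit (default 60 C).
# The hwmon root is a parameter so the function can be tried on a fake tree.
report_hot_sensors() {
    root=${1:-/sys/class/hwmon}
    limit=${2:-60}
    for f in "$root"/hwmon*/temp*_input; do
        [ -r "$f" ] || continue
        raw=$(cat "$f") || continue
        c=$(milli_to_c "$raw")
        if [ "$c" -ge "$limit" ]; then
            name=$(cat "$(dirname "$f")/name" 2>/dev/null || echo unknown)
            echo "$name $(basename "$f" _input): ${c}C"
        fi
    done
}
```

Calling report_hot_sensors from cron every minute and appending the output to a file on the other node would leave a trail of the last readings before a lockup.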
 
Taledo said: "I've tried installing kdump, but I'm not seeing any dumps after a crash."
Have you also configured it correctly and tested by crashing your machine deliberately?
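
A deliberate test crash could look roughly like the sketch below. It assumes the Debian kdump-tools package and a crashkernel= reservation on the kernel command line. The function names are made up for illustration, and trigger_test_crash really does panic the host, so stop or migrate the VMs first.

```shell
#!/bin/sh
# Succeed if a kernel command line string reserves memory for a crash kernel.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Pre-flight check only: report whether kdump has memory to work with.
preflight() {
    if has_crashkernel "$(cat /proc/cmdline)"; then
        echo "crashkernel= is set; a capture kernel can be loaded"
    else
        echo "no crashkernel= on the kernel command line; kdump cannot capture"
    fi
}

# Deliberately panic the running kernel. Everything on the host stops at once.
trigger_test_crash() {
    echo 1 > /proc/sys/kernel/sysrq   # enable all SysRq functions
    sync                              # flush pending writes first
    echo c > /proc/sysrq-trigger      # 'c' forces a crash
}
```

Nothing runs until preflight or trigger_test_crash is called explicitly as root. With kdump-tools, `kdump-config status` should report whether the capture kernel is loaded, and a successful test normally leaves a dump under /var/crash.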

I can also recommend setting up netconsole and sending the output to your other host. Even a "real" serial console will work.
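
As a sketch of the netconsole idea: the module takes a single parameter describing the source and target of the UDP log stream. The helper below only builds that string. The IP addresses, interface name and MAC address in the comments are placeholders, and the ports 6665/6666 are the module's usual defaults.

```shell
#!/bin/sh
# Build the netconsole module parameter in the form
#   <src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
netconsole_param() {
    src_ip=$1 dev=$2 tgt_ip=$3 tgt_mac=$4
    echo "6665@${src_ip}/${dev},6666@${tgt_ip}/${tgt_mac}"
}

# On the crashing host (as root), with placeholder values:
#   modprobe netconsole netconsole="$(netconsole_param 192.168.1.10 eno1 192.168.1.20 aa:bb:cc:dd:ee:ff)"
#   dmesg -n 8    # raise the console log level so every message is sent
#
# On the receiving host:
#   nc -u -l -p 6666    # or "nc -u -l 6666", depending on the netcat variant
```

If nothing arrives even for ordinary kernel messages, the setup itself is at fault. If ordinary messages arrive but the crash leaves no trace, the kernel most likely never got the chance to print anything.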
 
Thanks, all, for your input.

Firstly, I've had two crashes in just 2 days with C-states disabled. It might be a coincidence.

I'm already running 5.19. Sadly, it didn't change anything.

While 60°C might not be optimal for a drive, I don't think it would cause a crash, especially as it isn't the system drive. Still, I've moved it to the front of the case, into a more open space.

For kdump, it does create a log when I crash it manually. I'll try setting up netconsole when I have time in the coming days.

If all else fails, I'll try Debian 11 first, and then something else, to see if I can pinpoint anything.


Again, thanks everyone for helping!
 
Hey all, it's me again.

Something must have happened with the updates, and I feel I'm chasing ghosts at this point.

A bit after my last message, I stopped getting crashes. Nothing else changed on the host. In fact, I started a Minecraft server back in January, and everything worked perfectly until the end of March, when 7.4 rolled out and I updated my cluster. Since then, I've been getting crashes again. It could be a coincidence, but I find it hard to believe that after nearly three months of peace, crashes would reappear randomly hours after a system update.

A tl;dr of the issue:

The host crashes, hard. No logs, local or remote, no kdump, no graphic output, no keyboard numlock. Completely dead until I hard reset it.

Config :

Code:
         .://:`              `://:.            root@Balthasar-2
       `hMMMMMMd/          /dMMMMMMh`          ----------------
        `sMMMMMMMd:      :mMMMMMMMs`           OS: Proxmox VE 7.4-3 x86_64
`-/+oo+/:`.yMMMMMMMh-  -hMMMMMMMy.`:/+oo+/-`   Kernel: 5.19.17-2-pve
`:oooooooo/`-hMMMMMMMyyMMMMMMMh-`/oooooooo:`   Uptime: 21 mins
  `/oooooooo:`:mMMMMMMMMMMMMm:`:oooooooo/`     Packages: 737 (dpkg)
    ./ooooooo+- +NMMMMMMMMN+ -+ooooooo/.       Shell: bash 5.1.4
      .+ooooooo+-`oNMMMMNo`-+ooooooo+.         Resolution: 1024x768
        -+ooooooo/.`sMMs`./ooooooo+-           Terminal: /dev/pts/0
          :oooooooo/`..`/oooooooo:             CPU: Intel Pentium Gold G6405 (4) @ 4.100GHz
          :oooooooo/`..`/oooooooo:             GPU: Intel Device 9ba8
        -+ooooooo/.`sMMs`./ooooooo+-           Memory: 4842MiB / 30996MiB
      .+ooooooo+-`oNMMMMNo`-+ooooooo+.
    ./ooooooo+- +NMMMMMMMMN+ -+ooooooo/.
  `/oooooooo:`:mMMMMMMMMMMMMm:`:oooooooo/`
`:oooooooo/`-hMMMMMMMyyMMMMMMMh-`/oooooooo:`
`-/+oo+/:`.yMMMMMMMh-  -hMMMMMMMy.`:/+oo+/-`
        `sMMMMMMMm:      :dMMMMMMMs`
       `hMMMMMMd/          /dMMMMMMh`
         `://:`              `://:`

If anyone has an idea, I'll take it.

Cheers
 
Hey,

Yeah, after posting that, I saw the new 6.2 kernel and switched to it. I will see if it makes a difference.
 
Hello,

I appreciate your patience with this.

I have remote logs via rsyslog. I had tried running netconsole, but didn't get anything.

It crashed again this morning.

No logs, no nothing.


I've just ordered USB thumb drives so I can write a Debian ISO and see if I get the same behaviour.
 
I know this will sound really silly, and it is a shot in the dark, but try running your RAM at a clock speed of about 2400 MHz (it's not too much of a step down), and see what happens.
Given it is a host lockup, it would be interesting to see if that helps and it wouldn't hurt to try.
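
To confirm that a lower setting actually took effect after changing it in the BIOS, the configured speed can be read back from the DMI tables. This is a sketch that assumes dmidecode is installed. The helper name is made up, and the field is called "Configured Memory Speed" in recent dmidecode versions and "Configured Clock Speed" in older ones.

```shell
#!/bin/sh
# Filter dmidecode memory output (read from stdin) down to the speed
# each DIMM is actually configured to run at.
configured_speeds() {
    grep -E 'Configured (Memory|Clock) Speed:' | sed 's/^[[:space:]]*//'
}

# Typical use, as root:
#   dmidecode -t memory | configured_speeds
```

Empty slots usually report "Unknown" for this field, so only the populated DIMMs show a number.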
 
You know, at this point, I wouldn't put a memory speed issue in the silly category.

I've set it to 2400 MT/s. It's an Intel Pentium Gold G6405 anyway, so I don't think that's going to make much of a difference. :D


Will report back on the next crash or in a while.
 
Taledo said: "For kdump, it does create a log when I crash it manually. I'll try setting up netconsole when I have time in the coming days."
So still no dump after the crashes? That is very odd. Is your machine just rebooting or hanging and you have to reset it manually?
 
No dumps. Kdump works, I crashed the PVE on purpose a while back and it did create a dump.

It's crashing hard: no keyboard input, black screen. No network, though I didn't check whether there was network traffic on the switch after a crash. If it crashes again, I'll check.

AFAIK, the RAM is good, as it passed a memtest. The fact that it ran for 2 months without issues and crashed 5 days after a dist-upgrade boggles my mind. If it were hardware, surely it would have crashed by now, right?
 
*looks to his left at a custom Proxmox Cube* Yes and no. I have a box here which ran quite well, and then it started having ZFS kernel panics under load. All equipment passed testing, including memtest. So I took a few educated guesses and replaced the 32GB of RAM with 64GB of higher-quality sticks... and also added an enterprise SSD, partitioned for a read cache and ZIL. It's run rock-solid ever since.

So, with the hard lock, have you tried tapping Num Lock to see if the LED goes on and off? That'll tell you if the kernel is still talking.

With another machine, I also had hard-locking due to an issue with the USB ports of all things, but on a soft-reset it'd run rock-solid with the faulty ports offline...
 
