Proxmox host dies randomly

Taledo · Sep 20, 2022

Hello all,

A few weeks ago, I finished the second version of my home lab, mainly out of concern for power efficiency.

I now have 2 nodes in a cluster. One is a NUC-like Gigabyte BRIX, and the other is custom-built. The latter is giving me trouble.

The symptoms are as follows:

- PVE crashes silently and randomly. The screen goes black, the keyboard does not respond, the shutdown button doesn't work, and there is no networking at all.
- The last logs before a crash don't show any sign of a failure:

- After the reboot, no logs are present for the timeframe between the crash and the reboot.

Memtest is good.

Config-wise:
MB: H470M-HDV/M.2 (BIOS 1.40, latest available)
CPU: Intel(R) Pentium(R) Gold G6405 CPU @ 4.10GHz
Memory: 2 × 16 GB G.Skill 2666 MHz
Storage: 1 NVMe 500 GB SSD for OS and VMs
         1 SATA 2 TB SSD for VMs

Load-wise, here are the graphs from my LibreNMS:

pveversion:

Code:

root@Balthasar-2:~# pveversion 
proxmox-ve: 7.2-1 (running kernel: 5.15.53-1-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)

I'm at a loss here. Help is appreciated.

Cheers,

Taledo

Taledo · Nov 9, 2022

Updating this:
I've tried installing kdump, but I'm not seeing any dumps after a crash.
Plugging a screen into it just results in a black screen.
C-states have been disabled.

I'm tempted to try a Debian live system on a USB drive. Still, any assistance would be appreciated.

VictorSTS · Nov 9, 2022

This looks like a hardware issue or an incompatibility between the kernel and your hardware.

On the hardware side, try to monitor the temperature and overall status of the CPU, RAM, disks, chipset and video card; maybe something is overheating.
On the kernel side, you may try v5.19:

https://forum.proxmox.com/threads/opt-in-linux-5-19-kernel-for-proxmox-ve-7-x-available.115090/

Trying the same hardware with another distro is a good option too.

Edit: just saw those temp readings in the log you posted above: 60+ degrees for a drive is "way too much".
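
As a rough illustration of the temperature monitoring suggested above, here is a minimal sketch, assuming the kernel exposes its sensors under /sys/class/thermal (which zones show up depends on the board and the loaded drivers). The function name print_thermal_zones and its optional directory argument are made up for this example. Drive temperatures are not covered by these zones; they usually come from SMART data, for example via smartctl from smartmontools.

```shell
#!/bin/sh
# Print every thermal zone the kernel exposes, one per line.
# The sysfs "temp" files hold millidegrees Celsius.
# The optional directory argument (an assumption of this sketch) only
# exists so the function can be pointed at a test tree; by default it
# reads the real sysfs location.
print_thermal_zones() {
    base="${1:-/sys/class/thermal}"
    for zone in "$base"/thermal_zone*; do
        # Skip when the glob matched nothing or the file is unreadable.
        [ -r "$zone/temp" ] || continue
        kind=$(cat "$zone/type" 2>/dev/null || echo unknown)
        # Some zones refuse reads while their device is idle; skip those.
        millideg=$(cat "$zone/temp" 2>/dev/null) || continue
        echo "$(basename "$zone") $kind $((millideg / 1000))C"
    done
}

print_thermal_zones
```

Running this from cron or a loop and appending the output to a file on another machine would leave a temperature trail leading up to a crash.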

LnxBil · Nov 9, 2022

Taledo said:
I've tried installing kdump, but I'm not seeing any dumps after a crash.

Have you also configured it correctly and tested by crashing your machine deliberately?

I can also recommend setting up netconsole and sending the output to your other host; even a "real" serial console will work.
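
A minimal sketch of the netconsole idea, assuming the parameter format described in the kernel's netconsole documentation (source port @ source IP / device, then target port @ target IP / target MAC). All addresses, the interface name eno1 and the MAC address below are placeholders, and netconsole_param is a hypothetical helper. The script only prints the modprobe command instead of running it, so it can be checked before use.

```shell
#!/bin/sh
# Build the netconsole module parameter in the form
#   [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
# 6665 and 6666 are the default source and target ports.
# Arguments: $1 source IP, $2 network device, $3 target IP, $4 target MAC
netconsole_param() {
    echo "6665@$1/$2,6666@$3/$4"
}

# Placeholder values: replace them with the crashing host's address and
# NIC, and the address and MAC of the machine that collects the logs.
param=$(netconsole_param 192.168.1.10 eno1 192.168.1.20 aa:bb:cc:dd:ee:ff)

# Dry run: show the command that would load the module on the crashing host.
echo "modprobe netconsole netconsole=$param"
```

On the receiving host, something has to listen on the UDP target port, for example nc -u -l 6666 (netcat variants differ in their flags). Raising the console log level with dmesg -n 8 on the sending host makes more kernel messages go out over the wire.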

Taledo · Nov 10, 2022

Thanks, all, for your input.

Firstly, I've had two crashes in just 2 days with C-states disabled. It might be a coincidence.

I'm already running 5.19; sadly, it didn't change anything.

While 60°C for a drive might not be optimal, I don't think it would cause a crash, especially as it isn't the system drive. Still, I've moved it to the front of the case, where there is more open space.

For kdump, it does create a dump when I crash the host manually. I'll try setting up netconsole when I have time in the coming days.

If all else fails, I'll try Debian 11 first, and then something else to see if I can pinpoint anything.

Again, thanks everyone for helping!

Taledo · Apr 10, 2023

Hey all, it's me again.

Something must have happened with the updates, and I feel I'm chasing ghosts at this point.

A bit after my last message, I stopped getting crashes. Nothing else changed on the host. In fact, I started a Minecraft server back in January, and everything had worked perfectly until the end of March when 7.4 rolled out, and I updated my cluster. Since then, I've been getting crashes again. It could be a coincidence, but I find it hard to believe that after nearly three months of peace, crashes would reappear randomly hours after a system update.

A tl;dr of the issue:

The host crashes, hard. No logs local or remote, no kdump, no graphic output, no keyboard numlock. Completely dead until I hard reset it.

Config:

Code:

         .://:`              `://:.            root@Balthasar-2
       `hMMMMMMd/          /dMMMMMMh`          ----------------
        `sMMMMMMMd:      :mMMMMMMMs`           OS: Proxmox VE 7.4-3 x86_64
`-/+oo+/:`.yMMMMMMMh-  -hMMMMMMMy.`:/+oo+/-`   Kernel: 5.19.17-2-pve
`:oooooooo/`-hMMMMMMMyyMMMMMMMh-`/oooooooo:`   Uptime: 21 mins
  `/oooooooo:`:mMMMMMMMMMMMMm:`:oooooooo/`     Packages: 737 (dpkg)
    ./ooooooo+- +NMMMMMMMMN+ -+ooooooo/.       Shell: bash 5.1.4
      .+ooooooo+-`oNMMMMNo`-+ooooooo+.         Resolution: 1024x768
        -+ooooooo/.`sMMs`./ooooooo+-           Terminal: /dev/pts/0
          :oooooooo/`..`/oooooooo:             CPU: Intel Pentium Gold G6405 (4) @ 4.100GHz
          :oooooooo/`..`/oooooooo:             GPU: Intel Device 9ba8
        -+ooooooo/.`sMMs`./ooooooo+-           Memory: 4842MiB / 30996MiB
      .+ooooooo+-`oNMMMMNo`-+ooooooo+.
    ./ooooooo+- +NMMMMMMMMN+ -+ooooooo/.
  `/oooooooo:`:mMMMMMMMMMMMMm:`:oooooooo/`
`:oooooooo/`-hMMMMMMMyyMMMMMMMh-`/oooooooo:`
`-/+oo+/:`.yMMMMMMMh-  -hMMMMMMMy.`:/+oo+/-`
        `sMMMMMMMm:      :dMMMMMMMs`
       `hMMMMMMd/          /dMMMMMMh`
         `://:`              `://:`

If anyone has an idea, I'll take it.

Cheers

leesteken · Apr 10, 2023

The optional kernel 5.19 (which did not come automatically with Proxmox 7.4) should no longer be used because it gets no updates. Update to the latest optional kernel and maybe it also fixes the crashes.

Taledo · Apr 10, 2023

Hey,

Yeah, after posting that, I saw the new 6.2 kernel and switched to it. I'll see whether it makes a difference.

LnxBil · Apr 12, 2023

Taledo said:
The host crashes, hard. No logs local or remote, no kdump, no graphic output, no keyboard numlock. Completely dead until I hard reset it.

Remote logs via netconsole?

Taledo · Apr 16, 2023

Hello,

I appreciate your patience with this.

I have remote logs via rsyslog. I had tried running netconsole, but didn't get anything.

It crashed again this morning.

No logs, no nothing.

I've just ordered USB thumb drives so I can write a Debian ISO to one and see if I get the same behaviour.

Nuke Bloodaxe · Apr 16, 2023

I know this will sound really silly, and it is a shot in the dark, but try running your RAM at a clock speed of about 2400 MHz [it's not too much of a step down], and see what happens.
Given it is a host lockup, it would be interesting to see if that helps and it wouldn't hurt to try.

Taledo · Apr 16, 2023

You know, at this point, I wouldn't put a memory speed issue in the silly category.

I've set it to 2400 MT/s. It's an Intel Pentium Gold G6405 anyway, so I don't think that's going to make much of a difference.

Will report back on the next crash or in a while.

LnxBil · Apr 17, 2023

Taledo said:
For kdump, it does create a log when I crash it manually. I'll try setting up netconsole when I have time in the coming days.

So still no dump after the crashes? That is very odd. Is your machine just rebooting or hanging and you have to reset it manually?

Taledo · Apr 17, 2023

No dumps. Kdump works, I crashed the PVE on purpose a while back and it did create a dump.

It's crashing hard: no keyboard input, black screen. No network, though I didn't check if there was network traffic on the switch after a crash. If it does crash again, I'll check it out.

Afaik, the RAM is good as it passed a memtest. The fact that it ran for 2 months without issues and crashed 5 days after a dist upgrade boggles my mind. If it were hardware, surely it would have crashed by now, right?

Nuke Bloodaxe · Apr 18, 2023

*looks to his left at a custom Proxmox Cube* Yes and no. I have a box here which ran quite well, and then it started having ZFS kernel panics under load. All equipment passed testing, including memtest. So, I took a few educated guesses, and replaced the 32GB RAM with 64GB higher-quality sticks... and also inserted an enterprise SSD, partitioned for a read and ZIL cache. It's run rock-solid ever since.

So, with the hard-lock, have you tried tapping the num-lock key to see if it goes on and off? That'll tell you if the kernel is still talking.

With another machine, I also had hard-locking due to an issue with the USB ports of all things, but on a soft-reset it'd run rock-solid with the faulty ports offline...
