Proxmox host strange behavior (need help troubleshooting)

hotelrwanda

Member
Apr 20, 2022
Hello everyone.

I have a small cluster: 2 HP mini PCs running v7.3.
Both are running the latest Debian updates and the latest BIOS firmware.

One of the devices, a Prodesk with an i5 9500T CPU, has been behaving strangely, and this only started recently.
It's difficult for me to explain the problem...

If I leave it without video or keyboard, just power and LAN, after a random period of time it loses network (the LEDs on the LAN card turn off) and it gets really hot (from the CPU, I assume). After a while it reboots, I get access via SSH for a few seconds, and then the cycle repeats.

I took it out of the "rack" and brought it to my desk, with a monitor and keyboard attached, and this doesn't happen. It worked with no interruption for 24h. If I unplug the monitor and keyboard, it does the same thing it did in the rack.
I also tried to just turn off the monitor, and it worked fine, until I turned the monitor back on. When I turned it on, there was no output, pressed enter on the keyboard, and the system rebooted.

I can't find anything wrong in any log file, everything looks okay, no error message, nothing.
The logs just stop at the moment of the reboot...

This machine worked with no issue for almost 2 years, and besides the regular updates, nothing changed.

Also strange, and I'm not sure if it's related: the web UI is just a white screen on this machine. I can control it from the other node in the cluster, though, so I guess this is a different issue.

Can you help me troubleshoot the situation?
Where do I start?
 
Do you run a third host as a qdevice? A PVE cluster needs at least 3 nodes, or 2 nodes + a qdevice. Otherwise there is no quorum when you shut down one node or when there are network problems, and the whole cluster will fail. You will then see things like the web UI not being usable, nodes rebooting on their own, and so on.
 
I've been running a 2 node cluster for a year, I know the limitations, but I don't see how this is related to my problem...
 
Yes, limitations like a self-rebooting node, which is part of what you described:

And why did this start one year in? And why is it working just fine with a monitor and keyboard attached?

I have been running 2 other 2-node clusters in production at work for many years, this never happened to me.

Do you have some documentation/reference for the random reboots when running a 2 node cluster?
 
Sorry @Dunuin, I came across a bit harsh.
It's just that the situation is annoying af, that's all. I'm sorry.
I will try to set up an rPi as a qdevice, as it's just gathering dust.
 
No problem. Not sure if that will fix it. It's just an idea, as fencing can reboot a node when quorum is lost, and you shouldn't run a cluster with just two voters. 3 voters are the bare minimum if you don't want the whole cluster to go down once a single node has problems (even too much traffic on the network that corosync communicates on can cause reboots, which is why you should usually use a dedicated NIC/network just for corosync). So I wanted to point that out.

"I have been running 2 other 2-node clusters in production at work for many years, this never happened to me."
Such setups really shouldn't be used in production. At least not without a qdevice.

Have a look at fencing and split brain situations:
https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#ha_manager_fencing
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_quorum
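
To make the quorum arithmetic concrete, here is a minimal Python sketch. It assumes one vote per voter and a plain majority rule (more than half of all expected votes), which is how I understand the default behaviour; the function names are made up for illustration and are not part of any Proxmox or corosync tooling.

```python
def votes_needed(total_votes: int) -> int:
    """Simple majority: more than half of all expected votes (assumed rule)."""
    return total_votes // 2 + 1


def has_quorum(total_votes: int, votes_online: int) -> bool:
    """True if the voters still online form a majority."""
    return votes_online >= votes_needed(total_votes)


if __name__ == "__main__":
    # (label, total voters, voters still online after one node drops out)
    scenarios = [
        ("2 nodes, one fails", 2, 1),
        ("2 nodes + qdevice, one node fails", 3, 2),
        ("3 nodes, one fails", 3, 2),
    ]
    for label, total, online in scenarios:
        state = "quorate" if has_quorum(total, online) else "NOT quorate"
        print(f"{label}: need {votes_needed(total)}, have {online} -> {state}")
```

With two voters, losing either one leaves the survivor below the majority. With three voters (a third node or a qdevice), one of them can drop out and the rest still have quorum.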
 
I didn't set up a qdevice yet, but as a note, I was running the host on my desk for almost a full day, no issues. The moment I turned off the monitor connected to it (just turned it off, didn't unplug it), the computer restarted. Nothing in the logs, just a "clean cut"...
 
I did a few more tests, every time the same result. If the machine has a monitor connected, everything works just fine. The moment I turn off OR disconnect the monitor, the machine halts, and goes into a reboot loop, with the fan blasting and getting really hot.
I had a DisplayPort "Headless Ghost Display Emulator Dummy Plug" (don't know the exact name) dongle lying around, and with that one connected, it works fine. I can connect and reconnect a monitor on the second DisplayPort output just fine.

This is extremely strange, since this machine has been working perfectly fine for almost 2 years, in a cluster with another machine for one year, getting regular updates every few days.

So for now I can consider this thread closed, but if anyone has any idea why this happened, or knows of a fix I can try, so I don't have to use the dongle, I would very much appreciate it.

EDIT: I forgot to mention, with the fake monitor connected (the ghost dongle), the web interface started working again.
 