Random crashing/freezing

BigErchie

New Member
Oct 10, 2020
11
0
1
43
Hi all,

I'm looking for ideas on where to take my troubleshooting of a new issue I've started seeing in Proxmox, shown in the attached screenshot.

Every now and then, my Proxmox server freezes completely, and when I look at the screen I see the attached. I had been running the community edition on 2 micro PCs (an HP Elitedesk 800 G1 Mini and a Lenovo M93p Tiny, both with 16GB RAM) and these ran perfectly fine. I decided to consolidate them onto an HP Prodesk 400 G4 Mini with an i5-8500T and 32GB RAM, and it has not been stable since I installed it. I'm now realising it's not the most ideal choice because of the Realtek network card.

To give some insight, I am running a number of containers, and one of them is a Plex container running Ubuntu LTS with Quicksync passed through. I'm not sure if this is causing the issue, but it's literally the only thing that differs from how things ran across the 2 previous machines.

I'm totally stumped and would appreciate if anyone can direct me on some next steps to troubleshoot or what logs I should be looking at to get some further information as I have stretched out the little knowledge I have so far.
 

Attachments

  • 20201006_141647.jpg (378.5 KB)

ertanerbek

Member
Mar 29, 2019
91
6
13
41
Your display output was lost and the ext4 filesystem experienced a problem. Going by your screenshot, it looks like a memory hardware problem. Heat can also cause this.
 

Denny

Active Member
Jul 28, 2016
86
20
28
58
Logs should give you some clue as to what has happened.

Is the machine at all responsive?
Can you get Num Lock or Caps Lock to change state?

I sort of doubt it, but if by some miracle you can interact with it on the command line in this condition, try "dmesg -T" to see what happened last.
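If the box only comes back after a reboot, the kernel messages from the previous boot are the next best thing. Here is a minimal sketch, assuming you have saved them to a file first (for example with `journalctl -k -b -1`, which only works if the journal is persistent). The function name and the pattern list are my own illustration and are not exhaustive:

```shell
#!/bin/sh
# scan_log FILE: print lines that match a few common kernel error signatures.
# The pattern list is illustrative only; extend it for your hardware.
scan_log() {
    grep -E -i 'EXT4-fs error|I/O error|Hardware Error|Out of memory|Machine check|thermal' "$1"
}

# Example usage (assumes a persistent journal so the previous boot is kept):
#   journalctl -k -b -1 > /tmp/prev-boot.log
#   scan_log /tmp/prev-boot.log
```

An empty result does not prove the hardware is fine. A hard freeze can stop the kernel before anything reaches the disk.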
 

BigErchie

Unfortunately, the machine is completely unresponsive and I'm struggling to pinpoint what happened just prior to the crash. The logs that I know of don't really show much.

I guess it could be a memory issue, but would Memtest not show that in its tests? Or is it possible for all tests to pass and still have memory issues?
 

Denny

Perhaps adding some monitoring tools could help you identify the issue. You can try adding netdata.
Code:
apt install netdata
You will have to edit the config file so it listens on the needed IP address (or just 0.0.0.0 for all).
It is then reachable at http://your-ip-address:19999/
I would recommend changing the retention time to something longer than the default 2 hours; see the Netdata documentation.
Physical sensors can be added via the machine sensor data plugin.
This can also be scraped by Prometheus; see the Netdata Prometheus documentation.
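For the listen address, here is a rough sketch of the edit. It assumes the config file already contains a `bind socket to IP` line, as Debian's packaged /etc/netdata/netdata.conf does; check your own file first. The function name is just for illustration:

```shell
#!/bin/sh
# set_netdata_bind FILE ADDR: rewrite the "bind socket to IP" line in a
# netdata.conf-style file. Assumes the line already exists in FILE.
set_netdata_bind() {
    tmp=$(mktemp) || return 1
    # Keep everything up to and including "=", then replace the value.
    sed "s|^\([[:space:]]*bind socket to IP[[:space:]]*=\).*|\1 $2|" "$1" > "$tmp" \
        && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Example usage on the host (back up the file first, then restart the service):
#   cp /etc/netdata/netdata.conf /etc/netdata/netdata.conf.bak
#   set_netdata_bind /etc/netdata/netdata.conf 0.0.0.0
#   systemctl restart netdata
```

Binding to the host's own LAN address instead of 0.0.0.0 exposes the dashboard on fewer interfaces, which is probably the safer choice on a hypervisor.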
 

Denny

Also, I might suggest setting up a syslog-ng server on another computer and sending your logs to it via rsyslog. This might allow the server to gasp out a final message in a situation where the disk subsystem has already expired.
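On the Proxmox side, the forwarding part is a single rsyslog rule. Here is a small sketch that builds it. The helper name, the 192.0.2.10 address, and the drop-in file name are placeholders of mine; the `@@` (TCP) and `@` (UDP) prefixes are standard rsyslog forwarding syntax:

```shell
#!/bin/sh
# forward_rule HOST [PORT] [PROTO]: print an rsyslog rule that forwards all
# messages to a remote collector. "@@" means TCP, a single "@" means UDP.
forward_rule() {
    host=$1
    port=${2:-514}
    proto=${3:-tcp}
    case $proto in
        tcp) prefix='@@' ;;
        udp) prefix='@' ;;
        *) echo "unknown protocol: $proto" >&2; return 1 ;;
    esac
    printf '*.* %s%s:%s\n' "$prefix" "$host" "$port"
}

# Example usage (placeholder address and hypothetical file name):
#   forward_rule 192.0.2.10 514 udp > /etc/rsyslog.d/90-remote.conf
#   systemctl restart rsyslog
```

The receiving machine still has to be configured to accept remote logs on the same port and protocol.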
 

BigErchie

Thanks @Denny, I'll certainly give these a shot. I was always wary of installing additional software on the Proxmox host itself but I guess something like Netdata will be OK?

Setting up a syslog server may be beyond my knowledge, but I'll certainly give it a shot. If nothing else, I'll learn some extra skills at least :D
 

Denny

I use Netdata extensively where I work. I also have it on all three of my Proxmox nodes at home. I haven't had any problems to speak of.
 

SilverNodashi

Active Member
Jul 30, 2017
128
4
38
43
I found that one of my "servers", which doesn't have ECC RAM, also sometimes does (or used to do) this, so I put some extra fans on the memory modules and now it doesn't crash like this anymore.

Your RAM might pass memtest, but it's probably too little RAM for the processes on the cluster.
 

BigErchie

Thanks @SilverNodashi, I never really go past 50% RAM usage at the moment, so I'm not sure that's it. Now that you mention it, though, it does seem to run hotter with the 2x16GB sticks than with the single 8GB stick I have in it now, so I wonder if the RAM is actually struggling (not that I imagine it matters, but it's Integral branded with Nanya chips). I think I'll return it anyway, as it's still within their 100 day return window, and see if I can source other RAM.

In the meantime I'll run with the 8GB just now and perhaps swap the slots to determine if it's actually the mobo that's at fault. I'll also chuck in a single 16GB at some point to see if that makes it crash again and see if I can't narrow it down. My gut instinct is it's the RAM but I have no proof.

Cheers again!
 

SilverNodashi

Is the RAM ECC or not? Like I said, only my one machine, which doesn't have ECC RAM, had this behaviour, until I added extra cooling to the RAM. memtest didn't show any errors either.
 

BigErchie

Nah, just standard RAM, but I have never had any issues like this in my other 2 hosts...ever.
 
