Sporadic node "crashes", no SSH, no logs

Metatrone

New Member
Nov 26, 2022
I run Proxmox on a mini PC (Intel N5105, 16 GB RAM, wired Ethernet, 2x 2 TB disks in a ZFS pool) as a home server with nothing fancy: a Home Assistant VM and a couple of LXC containers running Samba, a media server, and some web services.
At seemingly random times the node becomes unresponsive: the web UI does not load and SSH is not possible, but the host still responds to ping.
There are no errors in the syslog; in fact, it seems nothing is logged at all from the moment such an event happens until I hard-restart the machine.
These events are not associated with peak usage of any of the services and are not regular in any way. Sometimes it goes a couple of weeks before happening, sometimes just a few hours.
There is never a resource shortage; CPU and RAM load never peak above 50%.
Weirdly enough, I just found out that Samba keeps working and the services on that container can still be reached, while the other services are inaccessible. I connected a monitor to the server and the console is spammed with a PCIe bus error: severity uncorrected (non-fatal) ACSViol (First).

The lack of logs has me stumped. I'm not a particularly confident admin user, but I feel like I've read through every relevant thread I could find here and am completely out of ideas on what to try. Any thoughts are welcome.
 
In which case it is unlikely that your fs turns read only and you are just running from memory for a long time before the host crashes.
Is there a specific uptime that you pass before the issue occurs?
 
I'd say the uptime is usually at least an hour. I think it was 20 minutes or so one time, but that seemed to be an exception. I don't have any external logging set up, so my data is not reliable. I've just been testing this machine out as a solution for a few weeks and nothing critical is running on it at the moment, so it is usually at least a few hours before someone notices.
 
An hour - that is nothing. I had expected days...
I think an external log server would ease things a lot.
What happens right before the crash? Anything?
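To illustrate the external log server idea, here is a minimal sketch of a UDP syslog receiver that could run on any other machine in the network and append whatever the Proxmox host sends to a file. The port 5514, the file name `remote-syslog.log` and the helper name `format_record` are assumptions picked for the example, not anything Proxmox prescribes; a real rsyslog or syslog-ng instance on the second machine would be the more usual choice.

```python
import socketserver
from datetime import datetime, timezone

LOG_PATH = "remote-syslog.log"  # hypothetical output file on the receiving machine
LISTEN_PORT = 5514              # assumed unprivileged port; classic syslog uses 514


def format_record(sender_ip: str, payload: bytes) -> str:
    """Turn one raw syslog datagram into a single timestamped log line."""
    text = payload.decode("utf-8", errors="replace").rstrip("\r\n\x00")
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return f"{stamp} {sender_ip} {text}"


class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        payload = self.request[0]
        with open(LOG_PATH, "a", encoding="utf-8") as log:
            log.write(format_record(self.client_address[0], payload) + "\n")


if __name__ == "__main__":
    # Listen on all interfaces and write every received message to LOG_PATH.
    with socketserver.UDPServer(("0.0.0.0", LISTEN_PORT), SyslogHandler) as server:
        server.serve_forever()
```

If the host runs rsyslog, a forwarding rule such as `*.* @<receiver-ip>:5514` in a file under `/etc/rsyslog.d/` would send everything to this receiver over UDP. The last lines received before a hang then survive on the other machine even if the host can no longer write to its own disk.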
 
It usually is days, sometimes a week or more. The logs always end the same way: the last entries before they cut off are the hourly cron session check, and 'session closed for user root' is always the last line. (I don't have any cron jobs set up other than a mount @reboot for an external disk, which I tried removing.)
 
If the behaviour is so unpredictable then it is really hard to get to the bottom of it.
You can only rule things out one by one.

From the hardware side it could be a bad power supply, which reacts to environmental changes (e.g. heat) and/or to what comes from the socket.
It can also be an issue with your memory, but typically this leads to a kernel panic, which you should be able to see on the console.
Then there might be the disk/SSD. I would expect more consistent behaviour in that case, but again, you never know.
I'd expect the file system to go read-only...

Theoretically it can also be an out-of-memory condition. Have you limited the ZFS memory usage?
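For reference, the ZFS ARC limit is set through the `zfs_arc_max` module parameter, which takes a value in bytes. A small sketch that computes the line for `/etc/modprobe.d/zfs.conf`; the 4 GiB figure is only an assumed example for a 16 GB host, not a recommendation, and `arc_max_line` is a hypothetical helper name:

```python
GIB = 1024 ** 3  # bytes per GiB


def arc_max_line(limit_gib: float) -> str:
    """Build the modprobe option line that caps the ZFS ARC at limit_gib GiB."""
    limit_bytes = int(limit_gib * GIB)
    return f"options zfs zfs_arc_max={limit_bytes}"


if __name__ == "__main__":
    # Assumed example value: cap the ARC at 4 GiB.
    print(arc_max_line(4))
```

The printed line goes into `/etc/modprobe.d/zfs.conf`. If the root file system is on ZFS, the initramfs has to be refreshed (`update-initramfs -u -k all`) and the host rebooted before the limit applies at boot.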

Start investigating one by one. Take your time. Be patient.
 
