Server crashed / not accessible at random times

Sep 9, 2021
Hello,
I have a problem with my new machine.
I replaced the server a few weeks ago and it has crashed a few times "out of the blue".
By "crashed" I mean it is not accessible from outside and, as I don't have a monitor attached to see the console, I just restart it.

Small history:
1. Initially I thought I had "fixed" the issue by running ethtool -K eno1 tso off gso off and updating the interface setup, based on the errors I had and a solution I found on the internet.
2. Later there was another crash, this time from Plex. It seems hardware transcoding on Intel Gen12 CPUs has a defect that was fixed in kernel 5.17-rc1, so I disabled HW transcoding until PVE picks that up.
3. Today it happened again, somewhere between 17:00 and 18:00. Judging by the mc server, it was around 17:35.
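
For reference, a minimal sketch of the offload change from step 1, assuming the interface is eno1 as above. The function name disable_offload and the DRY_RUN switch are only illustrative (they are not part of ethtool); the wrapper just makes it easy to see the exact command before applying it.

```shell
#!/bin/sh
# Illustrative helper (hypothetical name): disable TSO/GSO on one interface.
# With DRY_RUN=1 it only prints the command instead of running it.
disable_offload() {
    iface="$1"
    cmd="ethtool -K $iface tso off gso off"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$cmd"
    else
        # Needs root and the ethtool package installed
        $cmd
    fi
}

# Show what would be executed for eno1 without changing anything
DRY_RUN=1
disable_offload eno1
```

A change made with ethtool -K does not survive a reboot on its own. One common way to keep it is a post-up line with the same command under the eno1 stanza in /etc/network/interfaces, which I assume is what "updating the interface setup" amounts to here.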

The only warning I see is the one shown in the last screenshot.

What should I look for?

PS:
I need to know if I have to return it within the 30 days return policy :(

.........

Syslog shows only this before my forced restart (restart at 18:24)
1653775559155.png
.........


From my HA dashboard I see that at 17:33 the sensor stopped reporting the temperature (same temperature for almost 1h until the restart), so this seems to be the moment of the crash.
I see the same 'stop' on all the sensors.
1653775664999.png
.........


All PVE graphs and all VM/CT graphs have a gap between 17:30 and 19:00.
1653775801856.png

.........


I see this after the restart and from time to time:

1653776183890.png

I also saw this in the morning at ~10:00 AM, but I am pretty sure the server was up at that time (I see activity on my kid's mc server):
pve kernel: port 1 entering blocking state
 