Fresh node apparently random reboot

wilcomir

New Member
Oct 16, 2022
Hello,

I have made a fresh install of PVE 7.4 on a mini PC, and I am experiencing random reboots. Here is an example from last night:

Code:
> last -xF reboot shutdown | head
reboot   system boot  5.15.102-1-pve   Wed Mar 29 06:25:50 2023   still running
reboot   system boot  5.15.102-1-pve   Wed Mar 29 06:15:02 2023   still running
reboot   system boot  5.15.102-1-pve   Wed Mar 29 04:00:29 2023   still running
reboot   system boot  5.15.102-1-pve   Tue Mar 28 23:13:16 2023   still running
shutdown system down  5.15.102-1-pve   Tue Mar 28 23:13:00 2023 - Tue Mar 28 23:13:16 2023  (00:00)
[more lines cut]

I manually rebooted the machine yesterday at 23:13, but then it rebooted itself three times during the night.
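For anyone wanting to separate deliberate reboots from unexpected ones in that listing, here is a minimal sketch. It assumes the newest-first output format of last -xF reboot shutdown shown above; the function name classify_boots and the clean/unclean/unknown labels are made up for illustration. A boot is counted as clean only when the next-older record is a shutdown.

```shell
# Classify each boot listed by `last -xF reboot shutdown` (newest entry first).
# A "reboot" record whose next-older record is a "shutdown" followed a clean
# shutdown; one whose next-older record is another "reboot" has no shutdown
# record before it. (classify_boots is a hypothetical helper name.)
classify_boots() {
    awk '
        $1 == "reboot" {
            # The previously seen (newer) boot had no shutdown record before it.
            if (pending != "") print "unclean: " pending
            # Fields 5-9 hold the boot time, e.g. "Wed Mar 29 06:25:50 2023".
            pending = $5 " " $6 " " $7 " " $8 " " $9
            next
        }
        $1 == "shutdown" {
            if (pending != "") print "clean: " pending
            pending = ""
            next
        }
        # The oldest boot in a truncated listing cannot be classified.
        END { if (pending != "") print "unknown: " pending }
    '
}
```

On the node itself this would be run as last -xF reboot shutdown | classify_boots. Fed the five lines above, it labels the three overnight boots as unclean and the 23:13 boot as clean.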

Here is an excerpt of /var/log/syslog:

Code:
Mar 29 06:15:08 mars systemd[1]: Startup finished in 4.159s (firmware) + 6.842s (loader) + 3.684s (kernel) + 6.008s (userspace) = 20.695s.
Mar 29 06:15:14 mars chronyd[888]: Selected source 37.247.53.178 (2.debian.pool.ntp.org)
Mar 29 06:15:14 mars chronyd[888]: System clock TAI offset set to 37 seconds
Mar 29 06:15:33 mars systemd[1]: systemd-fsckd.service: Succeeded.
Mar 29 06:17:01 mars CRON[1387]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Mar 29 06:25:51 mars kernel: [    0.000000] Linux version 5.15.102-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.102-1 (2023-03-14T13:48Z) ()

I do not see anything odd.
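To repeat that check for every reboot in the file rather than one at a time, a rough sketch follows. It assumes each boot starts with a kernel line at timestamp [    0.000000] containing "Linux version", as in the excerpt above; last_before_boot is a made-up helper name.

```shell
# Print the last syslog line written before each boot marker, i.e. the final
# thing the previous boot managed to log. Reads syslog text on stdin.
# (last_before_boot is a hypothetical helper name.)
last_before_boot() {
    awk '
        # A new boot begins with the kernel banner at timestamp 0.000000.
        /kernel: \[ *0\.000000\] Linux version/ { if (prev != "") print prev }
        { prev = $0 }
    '
}
```

Usage would be last_before_boot < /var/log/syslog. For the excerpt above it prints only the 06:17:01 CRON line, which matches the observation that nothing unusual was logged before the 06:25 boot.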

Things I have tried/noticed so far:

I have a Samsung SSD which apparently might not play nice with Linux, spewing errors like failed command: READ FPDMA QUEUED. I have found a "fix", which is echo 1 > /sys/block/sda/device/queue_depth, effectively turning off command queueing (NCQ) for the SSD. This is applied at each boot via cron. Since I enabled this fix the SSD errors have disappeared, but the reboots have stayed.
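A small sketch of how that workaround could be applied with a read-back check, so a failed write does not go unnoticed at boot. The default path is the one from the post; the function name set_queue_depth and its arguments are made up for illustration, and it needs to run as root.

```shell
# Set a disk queue depth through sysfs and confirm the new value stuck.
# $1: sysfs attribute file (default: the path from the post)
# $2: queue depth to set (default: 1)
# (set_queue_depth is a hypothetical helper name.)
set_queue_depth() {
    attr=${1:-/sys/block/sda/device/queue_depth}
    depth=${2:-1}
    if [ ! -w "$attr" ]; then
        echo "cannot write to $attr" >&2
        return 1
    fi
    echo "$depth" > "$attr" || return 1
    # Read the value back; succeed only if it matches what was requested.
    [ "$(cat "$attr")" = "$depth" ]
}
```

Saved in a script, this could be called from a root crontab @reboot entry; that is only an assumption about how the cron job is set up, since the post does not show it.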

I suspected a PSU/CPU-related hw failure, so I ran a few stress tests with stress-ng -a 0 --class cpu --metrics --timeout 60, but the thing barely gets up to 60 °C and does not reboot.

One thing I am leaving out at the moment is that I get ACPI errors at boot, such as
Code:
ACPI BIOS Error (bug): Could not resolve symbol [\_SB.UBTC.RUCC], AE_NOT_FOUND (20210730/psargs-330)
As I understand it, this can happen, and since all the hw I need is working I do not think this is worth pursuing.

I guess my next step is launching a memtest - I will do that and report back, but is there anything else anyone has to suggest?

Thanks!
V
 
I ran the memtest, and everything is fine with the memory. I examined the journalctl logs, and there is nothing that suggests why a reboot should have happened. As an example, from this morning:

Code:
> journalctl -k -b -2 | tail
Mar 29 06:15:05 mars kernel: vmbr0: port 1(enp2s0) entered disabled state
Mar 29 06:15:06 mars kernel: bpfilter: Loaded bpfilter_umh pid 1053
Mar 29 06:15:06 mars unknown: Started bpfilter
Mar 29 06:15:07 mars kernel: r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control rx/tx
Mar 29 06:15:07 mars kernel: vmbr0: port 1(enp2s0) entered blocking state
Mar 29 06:15:07 mars kernel: vmbr0: port 1(enp2s0) entered forwarding state
Mar 29 06:15:07 mars kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vmbr0: link becomes ready

and a couple of minutes later it restarted:

Code:
> journalctl --list-boots
...
 -2 bfe8381a3d4e441ca76c807dc48bcf2c Wed 2023-03-29 06:15:02 CEST—Wed 2023-03-29 06:17:01 CEST
 -1 b1a069824e4445a6ad07ab97f2f82ab7 Wed 2023-03-29 06:25:50 CEST—Wed 2023-03-29 09:00:15 CEST
...
 
Update motherboard BIOS? Load BIOS defaults? Replace PSU? Check power from UPS or power grid? Replace CPU? It does not look like a configuration mistake or software issue.
 
Hello leesteken,
Many thanks for your inputs, I appreciate your help.

The platform is an SBC (think of a NUC knock-off), so I have few options for debugging the hardware. I do think that this is not PSU related, as I ran a 2-hour stress test at 100% CPU and it did not flinch.

I left it alone for the day; it rebooted a few times, then just froze, and I had to reboot it manually.

I am writing this off as a hw issue and will avoid this brand in the future.

Cheers,
V
 
I have not made any progress; I am sending the machine back and getting a NUC. I saw your other thread and I must say that the symptoms look just like what I was seeing, although I never saw the crash live as you did. I do have three other nodes that I updated and they are running just fine… I’ll keep you posted in case the NUC has any issue, but I think you are onto something: if it was running before the update, most probably it’s not a hw issue.
 
I am suddenly experiencing the same issue on a 3-node cluster, where so far one node has rebooted randomly 3 times and another one once. Any help would be appreciated. I am running on NUC hardware.
 
@wilcomir, are you using Thunderbolt by any chance? I was experiencing many reboots since the update, and once I removed the Thunderbolt connections, the reboots stopped. I am not sure whether this is a kernel issue or an overheating issue with the NUC, though, as summer is coming and the rooms are warmer.
 
I'm experiencing the same on my single node since my last update to 7.4 and kernel 5.15. Upgraded to 6.2 and still getting reboots.

Did you find any solutions?
 