Fresh node apparently random reboot

wilcomir

Hello,

I did a fresh install of PVE 7.4 on a mini PC, and I am experiencing random reboots. Here is an example from last night:

Code:
> last -xF reboot shutdown | head
reboot   system boot  5.15.102-1-pve   Wed Mar 29 06:25:50 2023   still running
reboot   system boot  5.15.102-1-pve   Wed Mar 29 06:15:02 2023   still running
reboot   system boot  5.15.102-1-pve   Wed Mar 29 04:00:29 2023   still running
reboot   system boot  5.15.102-1-pve   Tue Mar 28 23:13:16 2023   still running
shutdown system down  5.15.102-1-pve   Tue Mar 28 23:13:00 2023 - Tue Mar 28 23:13:16 2023  (00:00)
[more lines cut]

I manually rebooted the machine yesterday at 23:13, but then it rebooted itself three times during the night.
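To pick the unexpected reboots out of a longer listing, something like the sketch below could help. It assumes that a clean restart always leaves a shutdown record directly below its reboot record in the newest-first output of last, as in the 23:13 pair above; find_unclean_boots is a made-up helper name, not an existing tool.

```shell
# Hypothetical helper: reads `last -xF reboot shutdown` output (newest first)
# on stdin and prints every boot whose next-older record is another boot,
# meaning no shutdown was logged before the machine came back up.
find_unclean_boots() {
    awk '
        $1 == "reboot" || $1 == "shutdown" {
            # prev_* holds the newer record, the current line is the older one
            if (prev_type == "reboot" && $1 == "reboot") print prev_line
            prev_type = $1
            prev_line = $0
        }
    '
}

# Usage:
#   last -xF reboot shutdown | find_unclean_boots
```

On the listing above this prints the 06:25:50, 06:15:02 and 04:00:29 boots and skips the manual reboot at 23:13:16.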

Here is an excerpt of /var/log/syslog:

Code:
Mar 29 06:15:08 mars systemd[1]: Startup finished in 4.159s (firmware) + 6.842s (loader) + 3.684s (kernel) + 6.008s (userspace) = 20.695s.
Mar 29 06:15:14 mars chronyd[888]: Selected source 37.247.53.178 (2.debian.pool.ntp.org)
Mar 29 06:15:14 mars chronyd[888]: System clock TAI offset set to 37 seconds
Mar 29 06:15:33 mars systemd[1]: systemd-fsckd.service: Succeeded.
Mar 29 06:17:01 mars CRON[1387]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Mar 29 06:25:51 mars kernel: [    0.000000] Linux version 5.15.102-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.102-1 (2023-03-14T13:48Z) ()

I do not see anything odd.

Things I have tried/noticed so far:

I have a Samsung SSD, which apparently might not play nicely with Linux; it was spewing errors like failed command: READ FPDMA QUEUED. I found a "fix", echo 1 > /sys/block/sda/device/queue_depth, which sets the queue depth to 1 and so effectively turns off command queueing (NCQ) for the SSD. It is applied at each boot via cron; since I enabled it the SSD errors have disappeared, but the reboots have stayed.
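For reference, a minimal sketch of how the queue depth workaround can be applied at boot. The @reboot crontab line is an assumption about how "via cron" is set up, and set_queue_depth and SYSFS_ROOT are hypothetical names used only for illustration.

```shell
# Hypothetical helper: write a queue depth for one disk.
# SYSFS_ROOT exists only so the function can be tried against a scratch
# directory; it defaults to the real /sys.
set_queue_depth() {
    dev=$1
    depth=$2
    f="${SYSFS_ROOT:-/sys}/block/$dev/device/queue_depth"
    # Refuse to run if the disk or the attribute does not exist
    [ -w "$f" ] || { echo "cannot write $f" >&2; return 1; }
    echo "$depth" > "$f"
}

# Assumed entry in root's crontab (crontab -e), for a disk named sda:
#   @reboot echo 1 > /sys/block/sda/device/queue_depth
```

The value can be checked after a reboot with cat /sys/block/sda/device/queue_depth, which should then print 1.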

I suspected a PSU/CPU-related hardware failure, so I ran a few stress tests with stress-ng -a 0 --class cpu --metrics --timeout 60, but the thing barely gets up to 60 °C and does not reboot.
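To keep a record of temperatures during such a test (or overnight), a small logger along these lines might be useful. It is only a sketch: it assumes the board exposes its sensors under /sys/class/thermal, which reports millidegrees Celsius, and print_temps, THERMAL_ROOT and the log path are hypothetical names.

```shell
# Hypothetical helper: print a timestamp, the zone name and the temperature
# in degrees Celsius (one decimal) for every thermal zone found.
# THERMAL_ROOT exists only so the function can be tried on a scratch directory.
print_temps() {
    for z in "${THERMAL_ROOT:-/sys/class/thermal}"/thermal_zone*; do
        # Skips zones without a readable temp file (and an unmatched glob)
        [ -r "$z/temp" ] || continue
        milli=$(cat "$z/temp")
        printf '%s %s %d.%d\n' "$(date +%T)" "$(basename "$z")" \
            $((milli / 1000)) $((milli % 1000 / 100))
    done
}

# Example (assumed log path), started before the stress test:
#   while sleep 5; do print_temps; done >> /root/temps.log &
#   stress-ng -a 0 --class cpu --metrics --timeout 60
```

The last lines of the log written before an unexpected reboot would then show whether the temperature was climbing at the time, although the final few samples may be lost if the machine dies before they reach the disk.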

One thing I am leaving out at the moment is that I get ACPI errors at boot, such as
Code:
ACPI BIOS Error (bug): Could not resolve symbol [\_SB.UBTC.RUCC], AE_NOT_FOUND (20210730/psargs-330)
As I understand it, this can happen, and since all the hardware I need is working I do not think it is worth pursuing.

I guess my next step is launching a memtest - I will do that and report back, but is there anything else anyone has to suggest?

Thanks!
V
 
I ran the memtest, and everything is fine with the memory. I examined the journalctl logs, and there is nothing that suggests why a reboot should have happened. As an example, from this morning:

Code:
> journalctl -k -b -2 | tail
Mar 29 06:15:05 mars kernel: vmbr0: port 1(enp2s0) entered disabled state
Mar 29 06:15:06 mars kernel: bpfilter: Loaded bpfilter_umh pid 1053
Mar 29 06:15:06 mars unknown: Started bpfilter
Mar 29 06:15:07 mars kernel: r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control rx/tx
Mar 29 06:15:07 mars kernel: vmbr0: port 1(enp2s0) entered blocking state
Mar 29 06:15:07 mars kernel: vmbr0: port 1(enp2s0) entered forwarding state
Mar 29 06:15:07 mars kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vmbr0: link becomes ready

and the journal for that boot ends about two minutes later, at 06:17:01, with the next boot starting at 06:25:50:

Code:
> journalctl --list-boots
...
 -2 bfe8381a3d4e441ca76c807dc48bcf2c Wed 2023-03-29 06:15:02 CEST—Wed 2023-03-29 06:17:01 CEST
 -1 b1a069824e4445a6ad07ab97f2f82ab7 Wed 2023-03-29 06:25:50 CEST—Wed 2023-03-29 09:00:15 CEST
...
 
Update motherboard BIOS? Load BIOS defaults? Replace PSU? Check power from UPS or power grid? Replace CPU? It does not look like a configuration mistake or software issue.
 
Hello leesteken,
Many thanks for your inputs, I appreciate your help.

The platform is an SBC, think of a NUC knock-off; I have few options for debugging the hardware. I do not think this is PSU related, as I ran a 2-hour stress test at 100% CPU and it did not flinch.

I have left it alone for the day and it rebooted a few times and then just froze and I had to manually reboot it.

I am writing this off as a hardware issue and will avoid this brand in the future.

Cheers,
V
 
I have not made any progress; I am sending the machine back and getting a NUC. I saw your other thread and I must say that the symptoms look just like what I was seeing, although I never saw the crash live as you did. I do have three other nodes that I updated and they are running just fine… I’ll keep you posted in case the NUC has any issues, but I think you are onto something: if it was running fine before the update, it is most probably not a hardware issue.
 
I am suddenly experiencing the same issue on a 3-node cluster, where so far one of the nodes has rebooted randomly 3 times and another one once. Any help would be appreciated. I am running on NUC.
 
@wilcomir, are you using Thunderbolt by any chance? I have been experiencing many reboots since the update, and once I removed the Thunderbolt connections, the reboots stopped. I am not sure whether this is a kernel issue, though, or an overheating issue with the NUC, as summer is coming and the rooms are warmer.
 
I'm experiencing the same on my single node since my last update to 7.4 and kernel 5.15. Upgraded to 6.2 and still getting reboots.

Did you find any solutions?