Suddenly reboot/restart

AdryMS

New Member
Mar 11, 2021
I have 3 nodes in my cluster, running Proxmox VE 6.3.6.
One of them sometimes restarts randomly. I have already checked syslog and /var/log/messages.
Here is my syslog after booting (the syslog from just before the reboot doesn't show anything):
Code:
Mar 17 06:34:35 SVR-22 kernel: Linux version 5.4.103-1-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.103-1 (Sun, 07 Mar 2021 15:55:09 +0100) ()
Mar 17 06:34:35 SVR-22 kernel: Command line: initrd=\EFI\proxmox\5.4.103-1-pve\initrd.img-5.4.103-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs
Mar 17 06:34:35 SVR-22 kernel: KERNEL supported cpus:
Mar 17 06:34:35 SVR-22 kernel:   Intel GenuineIntel
Mar 17 06:34:35 SVR-22 kernel:   AMD AuthenticAMD
Mar 17 06:34:35 SVR-22 kernel:   Hygon HygonGenuine
Mar 17 06:34:35 SVR-22 kernel:   Centaur CentaurHauls
Mar 17 06:34:35 SVR-22 kernel:   zhaoxin   Shanghai
Mar 17 06:34:35 SVR-22 kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Mar 17 06:34:35 SVR-22 kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Mar 17 06:34:35 SVR-22 kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Mar 17 06:34:35 SVR-22 kernel: x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
Mar 17 06:34:35 SVR-22 kernel: x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
Mar 17 06:34:35 SVR-22 kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Mar 17 06:34:35 SVR-22 kernel: x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
Mar 17 06:34:35 SVR-22 kernel: x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
Mar 17 06:34:35 SVR-22 kernel: x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 'compacted' format.
Mar 17 06:34:35 SVR-22 kernel: BIOS-provided physical RAM map:
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x0000000000000000-0x0000000000057fff] usable
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x0000000000058000-0x0000000000058fff] reserved
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x0000000000059000-0x000000000009efff] usable
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x000000000009f000-0x00000000000fffff] reserved
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x0000000000100000-0x000000003fffffff] usable
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x0000000040000000-0x00000000403fffff] reserved
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x0000000040400000-0x00000000c09e1fff] usable
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x00000000c09e2000-0x00000000c09e2fff] ACPI NVS
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x00000000c09e3000-0x00000000c09e3fff] reserved
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x00000000c09e4000-0x00000000c8b64fff] usable
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x00000000c8b65000-0x00000000c905ffff] reserved
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x00000000c9060000-0x00000000c9153fff] usable
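
Side note: logs from before the crash can also be pulled from the previous boot's journal, assuming persistent journaling is enabled (Storage=persistent in /etc/systemd/journald.conf):
Code:
# list the boots recorded in the journal
journalctl --list-boots
# jump to the end of the journal from the previous boot
journalctl -b -1 -e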

I already checked the watchdog via the CLI using:
Code:
ipmitool mc watchdog get

but I don't think the watchdog is even enabled; it returns:
Code:
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
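
Since ipmitool needs the IPMI kernel drivers, that error alone doesn't prove the watchdog is off. As a rough sketch (assuming a standard PVE install), these checks should show whether a watchdog module or PVE's watchdog-mux is active:
Code:
# check whether any common watchdog kernel module is loaded
lsmod | grep -Ei 'softdog|ipmi_watchdog|iTCO_wdt'
# check PVE's watchdog multiplexer service
systemctl status watchdog-mux
# check whether a watchdog device node exists
ls -l /dev/watchdog*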

I also found a suspicious setting:
the node that keeps rebooting has the file /etc/modprobe.d/zfs.conf, but the other (normal) nodes don't have a zfs.conf file.
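
For context, a /etc/modprobe.d/zfs.conf on PVE usually just sets ZFS kernel module options such as an ARC size cap. The contents below are only an example of what such a file typically looks like, not necessarily what is on my node:
Code:
# /etc/modprobe.d/zfs.conf -- hypothetical example contents
# limit the ZFS ARC to 8 GiB (value is in bytes)
options zfs zfs_arc_max=8589934592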

The node is a regular PC; current specs:
GIGABYTE H310M
i7-8700
32 GB DDR4
2 x 500 GB SSD (RAID 1)

Any ideas?
 
Do you use the PVE HA stack?

If you do, do you see anything in /var/log/syslog regarding corosync or knet? For example, messages that a node left the cluster?

Do you have a dedicated physical network just for the PVE cluster traffic (corosync)? If not, other services may use up the available bandwidth, which increases the latency of corosync packets. If the latency gets too high and no other corosync link is available, the node loses its connection to the cluster. When HA is in use (the node has or had HA guests since its last reboot), a node that cannot reestablish the connection to the cluster fast enough will fence itself (hard reset).
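
For example, a quick grep like this (log path per the default PVE setup) should surface corosync/knet membership and link events around the time of a reboot:
Code:
# look for corosync/knet link and membership events in the syslog
grep -iE 'corosync|knet' /var/log/syslog | grep -iE 'link|down|left|fence'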
 
Thanks for your reply.

Currently I'm not using HA; the node just reboots randomly. Latency shouldn't be a problem either, since all my nodes are on a single switch and the network traffic is stable. The node keeps rebooting at least once per day.
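
For reference, per-link status and latency can also be checked on each node with the standard corosync tooling:
Code:
# show the status of all corosync links on this node
corosync-cfgtool -s
# show overall cluster and quorum state
pvecm status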
 
Are the random reboots still happening? If so, I would check the hardware: run a RAM/memory check, test the power supply, and so on. These issues can be hard to track down if the cause is a hardware problem.
 
Yep, I already changed the PSU. Could the RAM be the issue? Attached is a summary of RAM usage before and after the node went down.
 

Attachments

  • CYMERA_20210319_221942.jpg (screenshot of the RAM usage summary)
It is possible that some RAM DIMMs are faulty. Running a memtest should show you whether there is faulty memory.
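
If taking the node offline for a full memtest86+ run from boot media is difficult, the Debian memtester package can at least do a rough in-place check (it only tests memory the running OS can hand it, so it is not exhaustive):
Code:
apt install memtester
# test 4 GiB of RAM, 3 passes -- adjust the size to the node's free memory
memtester 4096M 3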
 
