Suddenly reboot/restart

AdryMS

New Member
Mar 11, 2021
I have 3 nodes in my cluster, running Proxmox VE 6.3.6.
One of them sometimes restarts randomly. I have already checked syslog and /var/log/messages.
Here is my syslog after booting (the syslog from just before the reboot doesn't show anything):
Code:
Mar 17 06:34:35 SVR-22 kernel: Linux version 5.4.103-1-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.103-1 (Sun, 07 Mar 2021 15:55:09 +0100) ()
Mar 17 06:34:35 SVR-22 kernel: Command line: initrd=\EFI\proxmox\5.4.103-1-pve\initrd.img-5.4.103-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs
Mar 17 06:34:35 SVR-22 kernel: KERNEL supported cpus:
Mar 17 06:34:35 SVR-22 kernel:   Intel GenuineIntel
Mar 17 06:34:35 SVR-22 kernel:   AMD AuthenticAMD
Mar 17 06:34:35 SVR-22 kernel:   Hygon HygonGenuine
Mar 17 06:34:35 SVR-22 kernel:   Centaur CentaurHauls
Mar 17 06:34:35 SVR-22 kernel:   zhaoxin   Shanghai
Mar 17 06:34:35 SVR-22 kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Mar 17 06:34:35 SVR-22 kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Mar 17 06:34:35 SVR-22 kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Mar 17 06:34:35 SVR-22 kernel: x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
Mar 17 06:34:35 SVR-22 kernel: x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
Mar 17 06:34:35 SVR-22 kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Mar 17 06:34:35 SVR-22 kernel: x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
Mar 17 06:34:35 SVR-22 kernel: x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
Mar 17 06:34:35 SVR-22 kernel: x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 'compacted' format.
Mar 17 06:34:35 SVR-22 kernel: BIOS-provided physical RAM map:
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x0000000000000000-0x0000000000057fff] usable
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x0000000000058000-0x0000000000058fff] reserved
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x0000000000059000-0x000000000009efff] usable
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x000000000009f000-0x00000000000fffff] reserved
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x0000000000100000-0x000000003fffffff] usable
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x0000000040000000-0x00000000403fffff] reserved
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x0000000040400000-0x00000000c09e1fff] usable
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x00000000c09e2000-0x00000000c09e2fff] ACPI NVS
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x00000000c09e3000-0x00000000c09e3fff] reserved
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x00000000c09e4000-0x00000000c8b64fff] usable
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x00000000c8b65000-0x00000000c905ffff] reserved
Mar 17 06:34:35 SVR-22 kernel: BIOS-e820: [mem 0x00000000c9060000-0x00000000c9153fff] usable
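
Side note: logs from before the crash can also be pulled from the previous boot's journal, assuming persistent journaling is enabled (Storage=persistent in /etc/systemd/journald.conf):
Code:
# list the boots recorded in the journal
journalctl --list-boots
# jump to the end of the journal from the previous boot
journalctl -b -1 -e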

I already checked the watchdog via the CLI using:
Code:
ipmitool mc watchdog get

but I don't think the watchdog is even enabled; it returns:
Code:
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
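
Since ipmitool needs the IPMI kernel drivers, that error alone doesn't prove the watchdog is off. As a rough sketch (assuming a standard PVE install), these checks should show whether a watchdog module or PVE's watchdog-mux is active:
Code:
# check whether any common watchdog kernel module is loaded
lsmod | grep -Ei 'softdog|ipmi_watchdog|iTCO_wdt'
# check PVE's watchdog multiplexer service
systemctl status watchdog-mux
# check whether a watchdog device node exists
ls -l /dev/watchdog*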

I also found a suspicious setting:
the node that keeps rebooting has the file /etc/modprobe.d/zfs.conf, but the other (normal) nodes don't have a zfs.conf file.
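
For context, a /etc/modprobe.d/zfs.conf on PVE usually just sets ZFS kernel module options such as an ARC size cap. The contents below are only an example of what such a file typically looks like, not necessarily what is on my node:
Code:
# /etc/modprobe.d/zfs.conf -- hypothetical example contents
# limit the ZFS ARC to 8 GiB (value is in bytes)
options zfs zfs_arc_max=8589934592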

The node is a regular PC; current specs:
GIGABYTE H310M
i7-8700
32 GB DDR4
2 x 500 GB SSD (RAID 1)

Any ideas?
 
Do you use the PVE HA stack?

If you do, do you see anything in /var/log/syslog regarding corosync or knet? For example, messages that a node left the cluster?

Do you have a dedicated physical network just for the PVE cluster traffic (corosync)? If not, other services may use up the available bandwidth, which increases the latency of corosync packets. If the latency gets too high and no other corosync link is available, the node loses its connection to the cluster. When HA is in use (the node has or had HA guests since its last reboot), a node that cannot reestablish the connection to the cluster fast enough will fence itself (hard reset).
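
For example, a quick grep like this (log path per the default PVE setup) should surface corosync/knet membership and link events around the time of a reboot:
Code:
# look for corosync/knet link and membership events in the syslog
grep -iE 'corosync|knet' /var/log/syslog | grep -iE 'link|down|left|fence'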
 
Thanks for your reply.

Currently I'm not using HA; the node just reboots randomly. Latency shouldn't be a problem either, since all my nodes are on a single switch and the network traffic is stable. The node keeps rebooting at least once per day.
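
For reference, per-link status and latency can also be checked on each node with the standard corosync tooling:
Code:
# show the status of all corosync links on this node
corosync-cfgtool -s
# show overall cluster and quorum state
pvecm status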
 
Are the random reboots still happening? If so, I would check the hardware: run a RAM/memory check, test the power supply, and so on. These issues can be hard to track down if the cause is a hardware problem.
 
Yep, I already changed the PSU. Could the RAM be the issue? Attached is a summary of RAM usage before and after the node went down.
 

Attachments

  • CYMERA_20210319_221942.jpg (screenshot of the RAM usage summary)
It is possible that some RAM DIMMs are faulty. Running a memtest should show you whether there is faulty memory.
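
If taking the node offline for a full memtest86+ run from boot media is difficult, the Debian memtester package can at least do a rough in-place check (it only tests memory the running OS can hand it, so it is not exhaustive):
Code:
apt install memtester
# test 4 GiB of RAM, 3 passes -- adjust the size to the node's free memory
memtester 4096M 3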
 
