[SOLVED] Proxmox Server random reboot

Edison

New Member
Feb 24, 2023
I used version 5.x before, and the server restarted randomly without any reason I could find. Later I upgraded to 7.x, and at first there was no problem. Recently it has been restarting at random at least once a day, and I still can't find the cause. Please help.



Code:
proxmox-ve: 7.3-1 (running kernel: 5.15.74-1-pve)
pve-manager: 7.3-3 (running version: 7.3-3/c3928077)
pve-kernel-5.15: 7.2-14
pve-kernel-helper: 7.2-14
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-8
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.2-12
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.7-1
proxmox-backup-file-restore: 2.2.7-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.3
pve-cluster: 7.3-1
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.5-6
pve-ha-manager: 3.5.1
pve-i18n: 2.8-1
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-1
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1
 

Attachments

Hello,

Thank you for the output of syslog.

There isn't a clear indication of the cause of the reboot in it. I would instead check:
- dmesg
- Try updating the BIOS firmware (sometimes the issue is hardware-related), and check the hardware temperature
- If you monitor the PVE host, do you see high I/O before the restart?
- Lastly, check for power issues; maybe the power supply is the problem?
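
As a rough aid for the dmesg check above, the sketch below scans saved kernel log text for a few hardware-related messages. The function name find_hardware_hints and the pattern list are illustrative assumptions, not an official or complete set; it only flags lines for a human to read and cannot prove what caused a reboot.

```python
import re

# Illustrative patterns for kernel messages that often accompany hardware
# trouble. This list is an assumption for the sketch, not an exhaustive set.
HARDWARE_HINTS = {
    "storage link": re.compile(
        r"\bata\d+(\.\d+)?: .*(exception|hard resetting link|failed command)",
        re.IGNORECASE,
    ),
    "I/O error": re.compile(r"I/O error", re.IGNORECASE),
    "machine check": re.compile(r"mce:|machine check", re.IGNORECASE),
    "thermal": re.compile(r"thermal|temperature above threshold", re.IGNORECASE),
}


def find_hardware_hints(log_text):
    """Return (category, line) pairs for every line matching a hint pattern."""
    hits = []
    for line in log_text.splitlines():
        for category, pattern in HARDWARE_HINTS.items():
            if pattern.search(line):
                hits.append((category, line.strip()))
                break  # report each line once, under the first matching category
    return hits
```

One way to feed it is to save the previous boot's kernel messages with `journalctl -k -b -1 > prev-boot.log` (this assumes the journal is persistent on the node) and pass the file's contents to the function.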
 
Thank you for your suggestions. I will try them for a while and report back.
 
Hello,

I think I'm facing the same issue on only one node.

- It's a new node (added 7 days ago) in a production cluster (3 members plus this one), so there are no VMs running on it yet
- No high I/O before the reboot
- No related information in dmesg
- The node is using the latest packages available (Community Edition)

The last crash occurred yesterday, April 3rd, at 19:30 (Europe/Paris):
ocr_reboot.png
 
Do you have HA Proxy enabled?
At the time the node rebooted, did you see anything related to corosync in the Syslog/journalctl output?
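
To make that corosync check a little more systematic, here is a minimal sketch that pulls cluster-related lines out of saved journal text. It assumes the default journalctl "short" line layout and uses the service names corosync, pve-ha-lrm, pve-ha-crm and watchdog-mux as the identifiers of interest; the function name cluster_lines is made up for this example.

```python
import re

# Services involved in Proxmox clustering and HA fencing.
CLUSTER_SERVICES = {"corosync", "pve-ha-lrm", "pve-ha-crm", "watchdog-mux"}

# Assumed layout (journalctl default "short" output):
#   "Apr 03 19:29:58 host ident[pid]: message"
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+\S+\s+"
    r"(?P<ident>[^\s\[:]+)(?:\[\d+\])?:\s(?P<msg>.*)$"
)


def cluster_lines(journal_text):
    """Return (service, message) pairs for lines logged by cluster services."""
    selected = []
    for line in journal_text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group("ident") in CLUSTER_SERVICES:
            selected.append((match.group("ident"), match.group("msg")))
    return selected
```

If this returns nothing for the minutes before a reboot, that fits the "no corosync errors" observation and points away from a quorum or watchdog problem, although it does not rule one out.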
 
No HA Proxy enabled on my side.
Sorry, I've read the logs twice and found no corosync errors in them.

The only setting that differed was the time zone: three nodes were set to Europe/Paris and the last one (the one that crashes) to UTC.
I changed the setting this morning, but I don't think it could lead to a crash.

I've just migrated a VM to the crashing node to check whether the reboot occurs again in a few hours.
 
Sorry, I meant HA (High Availability), not HA Proxy.

Well, the next step is to check for a hardware issue such as the power supply, and I would also check whether a BIOS update is available.
 
Update:
* No issue with RAM / CPU.
* Still checking whether it's a PSU issue

Another difference I see is that the node is listed with its public address instead of its private one:

nodes.png
cluster_nodes.png

But that shouldn't affect anything, should it?
 
Of course :
Code:
logging {
  debug: off
  to_syslog: yes
}


nodelist {
  node {
    name: oc0
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.0.10
  }
  node {
    name: oc1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.0.1
  }
  node {
    name: oc2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.0.2
  }
  node {
    name: ocr
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.0.11
  }
}


quorum {
  provider: corosync_votequorum
}


totem {
  cluster_name: OC
  config_version: 6
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
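
For the question about the public versus private address, one quick sanity check is whether every ring0_addr in the config above sits on the expected cluster network. The sketch below is a hedged example: the helper names parse_nodes and nodes_outside are invented here, the /24 mask is an assumption (the thread only shows 192.168.0.x addresses), and the parser handles only the simple `node { key: value }` layout shown above.

```python
import ipaddress


def parse_nodes(conf_text):
    """Collect the key/value pairs of every `node { ... }` block."""
    nodes, current = [], None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.split() == ["node", "{"]:
            current = {}  # start of a new node block
        elif line == "}" and current is not None:
            nodes.append(current)  # end of the node block
            current = None
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    return nodes


def nodes_outside(conf_text, network):
    """Return names of nodes whose ring0_addr is not inside `network`.

    Assumes ring0_addr is a literal IP address; hostnames are reported too,
    because they cannot be checked without a DNS lookup.
    """
    net = ipaddress.ip_network(network)
    outside = []
    for node in parse_nodes(conf_text):
        try:
            inside = ipaddress.ip_address(node.get("ring0_addr", "")) in net
        except ValueError:
            inside = False
        if not inside:
            outside.append(node.get("name", "?"))
    return outside
```

Calling `nodes_outside(conf_text, "192.168.0.0/24")` on the config posted above should return an empty list, which would suggest that corosync itself uses the private addresses and that the public address shown in the node list comes from somewhere else.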
 
After trying all of your suggestions without success, I shifted my focus to the hardware and eventually discovered that the problem was the SATA data cable. After I replaced it, the problem never occurred again. Finally, I want to thank you again for your help.
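
The thread does not say how the cable was identified. One common hint for a bad SATA cable is a rising interface CRC error counter in the drive's SMART data, so here is a small sketch that reads it from the text output of `smartctl -A`. It assumes the usual ATA attribute table with attribute ID 199 and the raw value in the tenth column; not every drive reports this attribute, and the function name crc_error_count is made up for the example.

```python
def crc_error_count(smartctl_output):
    """Return the raw value of SMART attribute 199, or None if it is absent.

    Assumes the `smartctl -A` table layout:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None
```

A single reading says little; what matters is whether the number keeps growing between two readings taken some hours apart. A counter that stops increasing after the cable is replaced would be consistent with the fix described above.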
 
Glad to read that you fixed the issue yourself!

I will mark your thread as [SOLVED] to help other people who have a similar issue.
 
