Node randomly reboots

Jan 4, 2022
Hi everyone.

First, some information about the setup we are running:

• 4 x Proxmox VE nodes (version 8.3.2) with Ceph installed – cluster without HA

• Separate networks for Ceph (2 x 10 Gbit), Corosync (1 Gbit), and backup (1 Gbit) – 2 switches (10 Gbit & 1 Gbit)

• 1 x Proxmox Backup Server



Each server is backed up by a separate job.

We have the issue that several nodes randomly reboot. Here is the log from the node "node4", which rebooted; I can’t find anything useful in it.

Jan 22 21:17:01 node04 CRON[497988]: pam_unix(cron:session): session closed for user root
Jan 22 21:19:14 node04 smartd[1264]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 62
Jan 22 21:19:14 node04 smartd[1264]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 62 to 59
Jan 22 21:19:14 node04 smartd[1264]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 59
Jan 22 21:19:14 node04 smartd[1264]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 61
Jan 22 21:40:10 node04 pmxcfs[1662]: [dcdb] notice: data verification successful
Jan 22 21:49:14 node04 smartd[1264]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 62 to 64
Jan 22 21:49:14 node04 smartd[1264]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 59 to 63
Jan 22 21:49:14 node04 smartd[1264]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 59 to 62
Jan 22 21:49:14 node04 smartd[1264]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 63
Jan 22 22:00:04 node04 pmxcfs[1662]: [status] notice: received log
Jan 22 22:17:01 node04 CRON[541085]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 22 22:17:01 node04 CRON[541086]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 22 22:17:01 node04 CRON[541085]: pam_unix(cron:session): session closed for user root
Jan 22 22:19:14 node04 smartd[1264]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 63
Jan 22 22:19:14 node04 smartd[1264]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 62
Jan 22 22:19:14 node04 smartd[1264]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 62 to 61
Jan 22 22:30:02 node04 pmxcfs[1662]: [status] notice: received log
Jan 22 22:40:10 node04 pmxcfs[1662]: [dcdb] notice: data verification successful
Jan 22 22:49:14 node04 smartd[1264]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 60
Jan 22 22:49:14 node04 smartd[1264]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 62 to 59
Jan 22 22:49:14 node04 smartd[1264]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 58
Jan 22 22:49:14 node04 smartd[1264]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 60
Jan 22 23:00:05 node04 pmxcfs[1662]: [status] notice: received log
-- Reboot --
Jan 22 23:05:28 node04 kernel: Linux version 6.8.12-5-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-5 (2024-12-03T10:26Z) ()
Jan 22 23:05:28 node04 kernel: Command line: initrd=\EFI\proxmox\6.8.12-5-pve\initrd.img-6.8.12-5-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs
Jan 22 23:05:28 node04 kernel: KERNEL supported cpus:
Jan 22 23:05:28 node04 kernel: Intel GenuineIntel
Jan 22 23:05:28 node04 kernel: AMD AuthenticAMD
Jan 22 23:05:28 node04 kernel: Hygon HygonGenuine
Jan 22 23:05:28 node04 kernel: Centaur CentaurHauls
Jan 22 23:05:28 node04 kernel: zhaoxin Shanghai
Jan 22 23:05:28 node04 kernel: BIOS-provided physical RAM map

There was no backup job running on this server at the time; node3 was being backed up when node4 rebooted. Very strange.

Where can I look for further information, and what can I do about this?

Thanks in advance.
Holger
 
Hello LOGINTechBlog! Just to get an overview of the situation:
  1. Is this a new server? If not, did you change anything before the issues started?
  2. Random restarts are often caused by faulty hardware. You can run memtest to check for faulty RAM.
  3. Updating the BIOS might also help in some situations.
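To gather more evidence, commands along these lines are a reasonable starting point (assuming the board has a reachable BMC for `ipmitool`, and Debian package names; adjust as needed):

```shell
# Look at the end of the journal from the boot *before* the reboot:
# a clean shutdown logs "Stopping ..." messages, a hard reset logs nothing.
journalctl -b -1 -e --no-pager | tail -n 100

# Errors and warnings from that same boot (MCE, ECC, PCIe AER, OOM, etc.)
journalctl -b -1 -p warning --no-pager

# If the board has a BMC, its event log often records power events and
# hardware faults that never reach the OS journal.
ipmitool sel list

# To capture kernel panics across reboots, install kdump so a crash
# dump is written before the machine resets.
apt install kdump-tools
```

If the previous boot's journal simply stops with no shutdown messages (as in your log), that points to a hard reset from hardware, power, or a watchdog rather than a kernel-initiated reboot.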
 
Thank you for your reply!

The server hardware is:

Supermicro AS1015-A MT – 192 GB ECC RAM – 4 x Samsung 4 TB SSD – AMD Ryzen 9 7900X
2 x onboard LAN – dual 10 Gbit Intel X550-T2 – 2 x USB-C network adapter

Three VMs are running on this machine:

Windows SQL Server
agent: 1
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 8
cpu: x86-64-v2-AES
efidisk0: ceph01:vm-105-disk-0,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: none,media=cdrom
machine: pc-q35-9.0
memory: 32768
meta: creation-qemu=9.0.2,ctime=1732791155
name: SQL01
net0: virtio=BC:24:11:6F:24:C5,bridge=V100,firewall=1
numa: 0
onboot: 1
ostype: win11
scsi0: ceph01:vm-105-disk-1,discard=on,iothread=1,size=100G,ssd=1
scsi1: ceph01:vm-105-disk-3,discard=on,iothread=1,size=500G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=0904e55b-5936-4eac-8405-62db0f283c76
sockets: 1
tpmstate0: ceph01:vm-105-disk-2,size=4M,version=v2.0
vga: virtio
vmgenid: 091a65ea-1ec6-40de-8a93-353402086da3


SEL Oracle Server
agent: 1
boot: order=scsi0;ide2;net0
cores: 24
cpu: x86-64-v2-AES
ide2: none,media=cdrom
memory: 98308
meta: creation-qemu=9.0.2,ctime=1733906905
name: WAWI
net0: virtio=BC:24:11:E4:11:2F,bridge=V100,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: ceph01:vm-107-disk-0,discard=on,iothread=1,size=150G,ssd=1
scsi1: ceph01:vm-107-disk-1,discard=on,iothread=1,size=500G,ssd=1
scsi2: ceph01:vm-107-disk-2,discard=on,iothread=1,size=500G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=5373641e-519a-4af3-8391-013af28538d2
sockets: 1
vmgenid: a9ca2e74-5a65-4cc0-970b-be331c8e424a

Windows Server
agent: 1
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 2
cpu: x86-64-v2-AES
efidisk0: ceph01:vm-108-disk-0,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: none,media=cdrom
machine: pc-q35-9.0
memory: 8192
meta: creation-qemu=9.0.2,ctime=1732791155
name: DMS
net0: virtio=BC:24:11:B4:FA:95,bridge=V100,firewall=1
numa: 0
onboot: 1
ostype: win11
scsi0: ceph01:vm-108-disk-1,discard=on,iothread=1,size=100G,ssd=1
scsi1: ceph01:vm-108-disk-3,discard=on,iothread=1,size=1000G,ssd=1
scsi2: ceph01:vm-108-disk-4,discard=on,iothread=1,size=50G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=f3ba5161-4aa6-43ea-b14c-878ac3f5a96e
sockets: 1
tpmstate0: ceph01:vm-108-disk-2,size=4M,version=v2.0
vga: virtio
vmgenid: 9dad127e-a59a-4a33-8c2b-8d042696fafa


zpool status is probably not too helpful because all VMs are on the Ceph storage:

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:12 with 0 errors on Sun Jan 12 00:24:13 2025
config:

        NAME                                 STATE     READ WRITE CKSUM
        rpool                                ONLINE       0     0     0
          mirror-0                           ONLINE       0     0     0
            nvme-eui.002538b141b4c2f4-part3  ONLINE       0     0     0
            nvme-eui.002538b141b4c2e8-part3  ONLINE       0     0     0

errors: No known data errors
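Since all VM disks are on Ceph rather than the local rpool, the equivalent health check on the Ceph side would be (standard Ceph CLI commands):

```shell
# Overall cluster state: monitors, OSDs, PGs, and any health warnings
ceph -s
ceph health detail

# Per-OSD utilisation -- a nearly full or flapping OSD can cause
# cluster-wide stalls that look like node-level problems
ceph osd df
```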

