Hi everyone.
First, some information about the setup we are running:
• 4 x Proxmox nodes (version 8.3.2) with Ceph installed – cluster without HA
• Separate networks for Ceph (2 x 10 Gbit/s), Corosync (1 Gbit/s), and Backup (1 Gbit/s), on 2 switches (10 Gbit/s and 1 Gbit/s)
• 1 x Proxmox Backup Server
Each server is backed up using separate jobs.
Our problem is that several of the nodes reboot at random. Here is the log from one node that rebooted, "node4" (hostname node04 in the log). I can't find anything useful in it.
Jan 22 21:17:01 node04 CRON[497988]: pam_unix(cron:session): session closed for user root
Jan 22 21:19:14 node04 smartd[1264]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 62
Jan 22 21:19:14 node04 smartd[1264]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 62 to 59
Jan 22 21:19:14 node04 smartd[1264]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 59
Jan 22 21:19:14 node04 smartd[1264]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 61
Jan 22 21:40:10 node04 pmxcfs[1662]: [dcdb] notice: data verification successful
Jan 22 21:49:14 node04 smartd[1264]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 62 to 64
Jan 22 21:49:14 node04 smartd[1264]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 59 to 63
Jan 22 21:49:14 node04 smartd[1264]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 59 to 62
Jan 22 21:49:14 node04 smartd[1264]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 63
Jan 22 22:00:04 node04 pmxcfs[1662]: [status] notice: received log
Jan 22 22:17:01 node04 CRON[541085]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 22 22:17:01 node04 CRON[541086]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 22 22:17:01 node04 CRON[541085]: pam_unix(cron:session): session closed for user root
Jan 22 22:19:14 node04 smartd[1264]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 63
Jan 22 22:19:14 node04 smartd[1264]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 62
Jan 22 22:19:14 node04 smartd[1264]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 62 to 61
Jan 22 22:30:02 node04 pmxcfs[1662]: [status] notice: received log
Jan 22 22:40:10 node04 pmxcfs[1662]: [dcdb] notice: data verification successful
Jan 22 22:49:14 node04 smartd[1264]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 60
Jan 22 22:49:14 node04 smartd[1264]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 62 to 59
Jan 22 22:49:14 node04 smartd[1264]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 58
Jan 22 22:49:14 node04 smartd[1264]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 60
Jan 22 23:00:05 node04 pmxcfs[1662]: [status] notice: received log
-- Reboot --
Jan 22 23:05:28 node04 kernel: Linux version 6.8.12-5-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-5 (2024-12-03T10:26Z) ()
Jan 22 23:05:28 node04 kernel: Command line: initrd=\EFI\proxmox\6.8.12-5-pve\initrd.img-6.8.12-5-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs
Jan 22 23:05:28 node04 kernel: KERNEL supported cpus:
Jan 22 23:05:28 node04 kernel: Intel GenuineIntel
Jan 22 23:05:28 node04 kernel: AMD AuthenticAMD
Jan 22 23:05:28 node04 kernel: Hygon HygonGenuine
Jan 22 23:05:28 node04 kernel: Centaur CentaurHauls
Jan 22 23:05:28 node04 kernel: zhaoxin Shanghai
Jan 22 23:05:28 node04 kernel: BIOS-provided physical RAM map
There was no backup job running on this server at that time. Node3 was being backed up when node4 rebooted. Very strange.
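To at least quantify what the excerpt shows, here is a minimal sketch that scans journal text for the "-- Reboot --" marker and reports the silent gap between the last line before it and the first line after it. The function names (parse_ts, reboot_gaps) are made up for illustration, and the year is an assumption, because these timestamps do not carry one. It only reads text in the format pasted above; it does not query the journal itself.

```python
from datetime import datetime

REBOOT_MARKER = "-- Reboot --"  # marker exactly as it appears in the pasted excerpt


def parse_ts(line, year=2025):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a log line.

    The year is NOT in the log; 2025 is an assumption for illustration.
    """
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def reboot_gaps(lines, year=2025):
    """Return (last_before, first_after, gap_seconds) for every reboot marker."""
    gaps = []
    last_ts = None
    pending = False  # True once a reboot marker has been seen
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == REBOOT_MARKER:
            pending = True
            continue
        try:
            ts = parse_ts(line, year)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if pending and last_ts is not None:
            gaps.append((last_ts, ts, (ts - last_ts).total_seconds()))
        pending = False
        last_ts = ts
    return gaps


if __name__ == "__main__":
    sample = [
        "Jan 22 23:00:05 node04 pmxcfs[1662]: [status] notice: received log",
        "-- Reboot --",
        "Jan 22 23:05:28 node04 kernel: BIOS-provided physical RAM map",
    ]
    for before, after, seconds in reboot_gaps(sample):
        print(f"last entry {before:%H:%M:%S}, first boot entry {after:%H:%M:%S}, gap {seconds:.0f} s")
```

Run against the excerpt above, this reports a gap of 323 seconds (23:00:05 to 23:05:28), with no shutdown or error messages logged in between.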
Where can I look for further information, and what can I do about this?
Thanks in advance.
Holger