Node randomly reboots

daros

Hello,

We have a 6-node cluster. One node has started rebooting randomly (twice this week) after working correctly for months/years.
I could not find anything in the logs that points to a reason for this.

Specs:
48 x Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz (2 Sockets)
288 GB RAM

Proxmox
Linux 5.4.114-1-pve #1 SMP PVE 5.4.114-1 (Sun, 09 May 2021 17:13:05 +0200)
pve-manager/6.4-8/185e14db

Logs are attached, syslog and messages log.

Can someone help me, or point me in the right direction?
I have already disabled the PBS that the node could not find.
 

Attachments

  • syslog-proxs-06-210715.txt
    939.9 KB
  • messages-prox-s06-210715.txt
    212 KB
Please provide the syslog from one of the other cluster nodes as well.
 
Hello Mira, see attached.
 

Attachments

  • syslog-prox-s01-210715.txt
    29.9 KB
So the node just didn't respond anymore and ~4 minutes later was back up.
Looks like a hard reset, rather than a simple reboot. Do you have additional logging/IPMI logs?
 
Hello,

I just looked in the IPMI event log and found this:
Code:
 35 | 2021/07/13 08:30:23 | Memory
    | Assertion:Uncorrectable ECC / other uncorrectable memory error @DIMME1(CPU1)
 36 | 2021/07/13 08:31:04 | BIOS OEM (Memory Error)
    | Assertion:(runtime) Failing DIMM: DIMM location (P1-DIMME1)
 37 | 2021/07/13 08:31:21 | BIOS OEM (Memory Error)
    | Assertion:(runtime) Failing DIMM: DIMM location (P1-DIMME1)
 38 | 2021/07/13 08:31:32 | BIOS OEM (Memory Error)
    | Assertion:(runtime) Failing DIMM: DIMM location (P1-DIMME1)
 39 | 2021/07/13 08:33:36 | Processor
    | Assertion:CATERR has occurred
 40 | 2021/07/13 08:36:34 | BIOS OEM (Memory Error)
    | Assertion:Failing DIMM: DIMM location (Correctable memory component found) (P1-DIMME1)
 41 | 2021/07/15 05:55:49 | Memory
    | Assertion:Uncorrectable ECC / other uncorrectable memory error @DIMME1(CPU1)
 42 | 2021/07/15 05:56:29 | BIOS OEM (Memory Error)
    | Assertion:(runtime) Failing DIMM: DIMM location (P1-DIMME1)
 43 | 2021/07/15 05:56:46 | BIOS OEM (Memory Error)
    | Assertion:(runtime) Failing DIMM: DIMM location (P1-DIMME1)
 44 | 2021/07/15 05:56:57 | BIOS OEM (Memory Error)
    | Assertion:(runtime) Failing DIMM: DIMM location (P1-DIMME1)

Also, when I look in Proxmox, I see that the total RAM has changed.
Schermafbeelding 2021-07-16 om 20.30.52.png
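
To put a number on the change in total RAM, one option is to compare `MemTotal` from `/proc/meminfo` with the installed amount. The following is only a rough sketch: the function names and the sample value are made up for illustration, and `MemTotal` is always somewhat below the installed size because the firmware and kernel reserve some memory, so only a gap roughly the size of a whole DIMM is meaningful.

```python
def mem_total_gib(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, converted from kB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB
            return kib / (1024 ** 2)
    raise ValueError("MemTotal not found")


def missing_memory_gib(expected_gib, meminfo_text):
    """Difference between the installed RAM and what the kernel currently sees."""
    return expected_gib - mem_total_gib(meminfo_text)


def read_meminfo(path="/proc/meminfo"):
    """Read the live values on a Linux host."""
    with open(path) as f:
        return f.read()


# Hypothetical sample, NOT taken from the affected node.
SAMPLE_MEMINFO = "MemTotal:       268435456 kB\nMemFree:        1000000 kB\n"
```

On the node itself, `missing_memory_gib(288, read_meminfo())` would give the gap against the 288 GB listed in the specs.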
 
Code:
35 | 2021/07/13 08:30:23 | Memory
    | Assertion:Uncorrectable ECC / other uncorrectable memory error @DIMME1(CPU1)
 36 | 2021/07/13 08:31:04 | BIOS OEM (Memory Error)
    | Assertion:(runtime) Failing DIMM: DIMM location (P1-DIMME1)

So it seems to be an issue with your memory: the IPMI log reports uncorrectable ECC errors on the DIMM in slot P1-DIMME1.
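
If the event log is longer than the excerpt quoted here, a small script can summarize which DIMM the uncorrectable errors point to. This is a minimal sketch that assumes the two-line entry layout shown in this thread (an `id | timestamp | sensor` line followed by a `| Assertion:...` line). Other BMC vendors format their logs differently, and the function names are illustrative only.

```python
import re
from collections import Counter

# A few entries copied from the IPMI event log posted in this thread.
SAMPLE = """\
 35 | 2021/07/13 08:30:23 | Memory
    | Assertion:Uncorrectable ECC / other uncorrectable memory error @DIMME1(CPU1)
 36 | 2021/07/13 08:31:04 | BIOS OEM (Memory Error)
    | Assertion:(runtime) Failing DIMM: DIMM location (P1-DIMME1)
 39 | 2021/07/13 08:33:36 | Processor
    | Assertion:CATERR has occurred
"""

# First line of an entry: "<id> | <date> <time> | <sensor>"
ENTRY_RE = re.compile(r"^\s*(\d+)\s*\|\s*(\S+ \S+)\s*\|\s*(.+?)\s*$")
# Slot names as they appear in this log, e.g. DIMME1
DIMM_RE = re.compile(r"DIMM[A-Z]\d+")


def parse_events(text):
    """Group each header line with the '| Assertion:...' line that follows it."""
    events = []
    for line in text.splitlines():
        m = ENTRY_RE.match(line)
        if m:
            events.append({"id": int(m.group(1)), "time": m.group(2),
                           "sensor": m.group(3), "detail": ""})
        elif events and line.strip().startswith("|"):
            events[-1]["detail"] += line.strip().lstrip("|").strip()
    return events


def uncorrectable_by_dimm(events):
    """Count events whose detail mentions 'Uncorrectable', keyed by DIMM slot."""
    counts = Counter()
    for ev in events:
        if "Uncorrectable" in ev["detail"]:
            m = DIMM_RE.search(ev["detail"])
            counts[m.group(0) if m else "unknown"] += 1
    return counts


if __name__ == "__main__":
    print(uncorrectable_by_dimm(parse_events(SAMPLE)))
```

If every uncorrectable error lands on the same slot, as in the excerpt above, replacing that single DIMM is the obvious first step.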
 
