Node randomly freezes

Peppe2201

Member
Aug 28, 2018
Hi guys,
For a few weeks now I have been having strange problems on two nodes of a production cluster (all nodes run the same kernel/Proxmox version). Randomly, but about once a week, a node suddenly freezes and appears offline in the cluster. Connecting via IPMI, I see the Proxmox login screen completely frozen, and I have to force a restart of the server.
In the syslog from the first boot after the freeze I see the error that I attach at the bottom of this post.

Can you help me troubleshoot this problem?

CPU: Xeon E-2288G
RAM: 128GB DDR4
BOOT: 120GB SSD
CEPH_OSD1: 1TB NVME
CEPH_OSD2: 1TB NVME

Linux 5.4.73-1-pve #1 SMP PVE 5.4.73-1 (Mon, 16 Nov 2020 10:52:16 +0100)

pve-manager/6.2-15/48bd51b6

Dec 02 06:47:31 nodo4 kernel: BERT: Error records from previous boot:
Dec 02 06:47:31 nodo4 kernel: [Hardware Error]: event severity: fatal
Dec 02 06:47:31 nodo4 kernel: [Hardware Error]: Error 0, type: fatal
Dec 02 06:47:31 nodo4 kernel: [Hardware Error]: section type: unknown, 81212a96-09ed-4996-9471-8d729c8e69ed
Dec 02 06:47:31 nodo4 kernel: [Hardware Error]: section length: 0xc20
Dec 02 06:47:31 nodo4 kernel: [Hardware Error]: 00000000: 00000001 00000000 00000000 01003001 .............0..
Dec 02 06:47:31 nodo4 kernel: [Hardware Error]: 00000010: 00000000 00000000 00000000 00000000 ................
Dec 02 06:47:31 nodo4 kernel: [Hardware Error]: 00000020: 01003001 00000006 9bcec5a3 00000eab .0..............
Dec 02 06:47:31 nodo4 kernel: [Hardware Error]: 00000030: 00000002 00000032 09160029 800007ff ....2...).......
Dec 02 06:47:31 nodo4 kernel: [Hardware Error]: 00000040: 02002080 7f05fe03 00000000 00000000 . ..............
Dec 02 06:47:31 nodo4 kernel: [Hardware Error]: 00000050: 22000000 01200000 80002033 00004000 ...".. .3 ...@..
Dec 02 06:47:31 nodo4 kernel: [Hardware Error]: 00000060: 05000400 c7000400 ff2fff04 ef01f08b ........../.....
Dec 02 06:47:31 nodo4 kernel: [Hardware Error]: 00000070: 750e6073 9a022000 00110000 00002000 s`.u. ....... ..
Dec 02 06:47:31 nodo4 kernel: [Hardware Error]: 00000080: 04000000 00000000 00502080 000ecf02 ......... P.....
Dec 02 06:47:31 nodo4 kernel: [Hardware Error]: 00000090: 00035d00 00000001 53f3f170 1020003c .]......p..S<. .
Dec 02 06:47:31 nodo4 kernel: [Hardware Error]: 000000a0: 00000086 f7f3ff00 d63717bc fffffe87 ..........7.....
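
For anyone comparing these records between nodes, here is a minimal Python sketch that pulls the severity, the section GUID and the hex-dump rows out of log lines in the format shown above. The names `parse_bert` and `SAMPLE` are made up for this example, and the sketch only extracts fields; it does not decode the vendor-specific section, which the kernel itself reports as "unknown".

```python
import re

# A "[Hardware Error]:" line as it appears in the syslog excerpt above.
HW_LINE = re.compile(r"kernel: \[Hardware Error\]:\s*(.*)$")
# A hex-dump row: 8-digit offset, then four 32-bit words.
DUMP_ROW = re.compile(r"^([0-9a-f]{8}): ((?:[0-9a-f]{8} ){3}[0-9a-f]{8})")
SECTION = re.compile(r"^section type: (\S+), ([0-9a-f-]{36})")
SEVERITY = re.compile(r"^event severity: (\S+)")

# A few lines copied from the log above.
SAMPLE = """\
Dec 02 06:47:31 nodo4 kernel: BERT: Error records from previous boot:
Dec 02 06:47:31 nodo4 kernel: [Hardware Error]: event severity: fatal
Dec 02 06:47:31 nodo4 kernel: [Hardware Error]: section type: unknown, 81212a96-09ed-4996-9471-8d729c8e69ed
Dec 02 06:47:31 nodo4 kernel: [Hardware Error]: 00000000: 00000001 00000000 00000000 01003001 .............0..
Dec 02 06:47:31 nodo4 kernel: [Hardware Error]: 00000010: 00000000 00000000 00000000 00000000 ................
"""

def parse_bert(log_text):
    """Collect severity, section GUID and dump rows (offset -> words)."""
    record = {"severity": None, "section_guid": None, "rows": {}}
    for line in log_text.splitlines():
        m = HW_LINE.search(line)
        if not m:
            continue  # not a hardware-error line
        payload = m.group(1)
        if (s := SEVERITY.match(payload)):
            record["severity"] = s.group(1)
        elif (s := SECTION.match(payload)):
            record["section_guid"] = s.group(2)
        elif (s := DUMP_ROW.match(payload)):
            record["rows"][int(s.group(1), 16)] = s.group(2).split()
    return record

record = parse_bert(SAMPLE)
```

Feeding it the full output of the boot log instead of `SAMPLE` should give one dictionary per boot that can be diffed between nodes.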
 

ebgp

New Member
Dec 7, 2020
Hi Peppe2201,

I've also been having similar problems with nodes on a production cluster which had been perfectly stable with months of uptime until 2020-10-26, and after that has been freezing / hanging at various intervals between as little as four hours and as much as nearly two weeks. Has your experience been similar? Was there a particular date the problem started happening?

I found your post by searching for "750e6073 9a022000 00110000 00002000", which is a common line in all of my BERT errors, and your post was the only Google match. Can I ask what hardware you're using? If we can share detailed hardware information, perhaps we could narrow things down. I'm getting this with Xeon-E 2288G systems, all identically configured Supermicro SYS-5039MC-H8TRF/X11SCD-F, BIOS 1.5 (but the problem was also occurring with BIOS 1.4 and BIOS 1.3), with 128GiB RAM and 2x NVMe drives (1x Samsung 3.2TB PM1735 HHHL AIC Fw:EPK98B5Q + 1x Samsung 3.84TB PM983 M.2 Fw:EDB7502Q).
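
In case it helps with spotting such common lines, here is a rough Python sketch of the comparison: collect the 16-byte rows of each BERT dump and intersect them. The function names are invented for this example, and `DUMP_A` and `DUMP_B` are simply two overlapping excerpts of the dump posted above standing in for dumps from separate crashes, not additional data.

```python
import re

# One hex-dump row after "[Hardware Error]:"; the offset is skipped and
# the four 32-bit words are captured as a single string.
ROW = re.compile(
    r"\[Hardware Error\]:\s*[0-9a-f]{8}: ((?:[0-9a-f]{8} ){3}[0-9a-f]{8})"
)

def dump_rows(log_text):
    """Set of hex rows (as strings) found in one BERT dump."""
    return {m.group(1) for m in ROW.finditer(log_text)}

def common_rows(dumps):
    """Rows present in every dump, regardless of their offset."""
    sets = [dump_rows(d) for d in dumps]
    return set.intersection(*sets) if sets else set()

# Two overlapping excerpts of the dump from the first post (stand-ins only).
DUMP_A = """\
kernel: [Hardware Error]: 00000060: 05000400 c7000400 ff2fff04 ef01f08b ........../.....
kernel: [Hardware Error]: 00000070: 750e6073 9a022000 00110000 00002000 s`.u. ....... ..
"""
DUMP_B = """\
kernel: [Hardware Error]: 00000070: 750e6073 9a022000 00110000 00002000 s`.u. ....... ..
kernel: [Hardware Error]: 00000080: 04000000 00000000 00502080 000ecf02 ......... P.....
"""

shared = common_rows([DUMP_A, DUMP_B])
```

All-zero rows will usually show up as shared too, so they are probably best ignored when comparing real dumps.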

Having moved virtual machines around between nodes, the virtual machine that seems to be triggering this problem is a speedtest.net server that we run. So it may be network IO related. We're using Supermicro AOC-CTG-i2S MicroLP 2x10GE, which lspci reports as:
01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)

Thanks
James
 

Peppe2201

Hi ebgp,
Same problem here. Since August we had not encountered any kind of problem until these freezes started.
My servers are SYS-5039MC-H8TRF blade servers with the X11SCD-F motherboard. All nodes in my cluster use this hardware, but only two are suffering from this problem.
We are using this NIC on the servers:
02:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
02:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
Other servers with the same NIC don't have this strange problem.
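
To compare adapters between nodes without being distracted by the differing PCI slots, lspci lines like the ones above can be reduced to class, model and revision. This is only a sketch that assumes the plain `lspci` line format shown in this thread; the names `parse_lspci`, `models`, `NODE_A` and `NODE_B` are made up for the example.

```python
import re

# slot, device class, model text and an optional "(rev xx)" suffix
LSPCI = re.compile(
    r"^(?P<slot>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"(?P<cls>[^:]+): (?P<model>.+?)(?: \(rev (?P<rev>[0-9a-f]+)\))?$"
)

def parse_lspci(text):
    """Return one dict per lspci line that matches the expected format."""
    devices = []
    for line in text.splitlines():
        m = LSPCI.match(line.strip())
        if m:
            devices.append(m.groupdict())
    return devices

def models(text):
    """Set of (class, model, revision), ignoring the PCI slot."""
    return {(d["cls"], d["model"], d["rev"]) for d in parse_lspci(text)}

# lspci lines as posted in this thread by the two of us.
NODE_A = """\
01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
"""
NODE_B = """\
02:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
02:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
"""

same_nic = models(NODE_A) == models(NODE_B)
```

Here the two sets are equal even though the slots differ (01:00.x versus 02:00.x), which matches what we see by eye.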

Giuseppe
 

ebgp

Hi Giuseppe,

Thanks, so we have the same Base System (Chassis/Motherboard), CPUs, and NICs.

Can you share what NVMe SSDs you're using in case there's a commonality there too? (details on mine are in the message above)

Are you booting in EFI or Legacy BIOS mode? (EFI here)

Is there much network traffic when the system hangs occur? (We're often >1Gbps of speedtest traffic)

Is it possible that the reason you're only experiencing the problem on two of your nodes is that the problem is triggered by specific virtual machines? (Like our speedtest server)

Have you engaged Supermicro support yet? (We're trying to get a BERT log decode from them)
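
On the EFI question: on Linux, the directory `/sys/firmware/efi` is normally present only when the system was booted in EFI mode, so a quick check can look like the sketch below. The function name `boot_mode` and its `efi_dir` parameter are just for illustration; the parameter exists only so the check can be pointed at another path when trying it out.

```python
import os

def boot_mode(efi_dir="/sys/firmware/efi"):
    """Return "EFI" if the EFI sysfs directory exists, else "Legacy BIOS"."""
    return "EFI" if os.path.isdir(efi_dir) else "Legacy BIOS"
```

Running `print(boot_mode())` on each node should be enough to answer this one.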

Thanks
James
 

Peppe2201

Each node has two Crucial ct1000p1ssd8 SSDs. I have to check whether I'm booting in EFI mode; I don't remember right now.
The problem occurs both under traffic (an average of 1 Gbit/s) and with no traffic or load at all.
It seems unlikely to me that a virtual machine is the cause, since the cluster has several hundred machines managed by HA with the same priority on all nodes.
We opened a request with Supermicro today; as soon as I have an answer I will post an update here on the forum.

Thanks
 

Peppe2201

Hi ebgp,
Supermicro responded to our ticket: it seems to be a motherboard problem. We will replace the blade in the next few days, hoping this will solve the problem.
 

JoseLuisTorres

New Member
Jan 18, 2021
Hi Peppe2201,

I have exactly the same problem.
My servers are also SYS-5039MC-H8TRF blade servers with the X11SCD-F motherboard.

Has replacing the motherboard solved the problem?
Is the replacement the same motherboard model?

Thanks
 
