Daily Host Crash after Upgrade to PVE 8.1.4

Hello everyone,

There are some other reports in the forum, but I don't know if they are directly related. We have started to upgrade our 10-node cluster to PVE 8.1. Currently two nodes are on 8.1, a SuperMicro and a DELL PowerEdge R740xd. The DELL system crashes once a day, and we then receive the FENCING message by e-mail. There is no reason for this in the logs: shortly before the server reboots, there are no signs of trouble. The time also varies (10:47 / 10:56 / 04:26).

PVE 7 Nodes: Version 7.4.16
PVE 8 Nodes: Version 8.1.4
Ceph: 17.2.5 / 17.2.6
Kernel version on PVE 8 nodes: Linux 6.5.11-8-pve (2024-01-30T12:27Z)

Before the upgrade to version 8, there were no problems of this kind.
Does anyone have similar problems and/or an idea how to troubleshoot?

Best regards
Tan
 
Have you tried to update the BIOS and firmware on the Dell system and its NICs? With the newer kernel, there might be some issues if the firmware is too old. Definitely worth a try.
 
Hi, I already had the idea but haven't done it yet as the problem only became known this morning. Thanks for the tip. I will tackle it straight away.
I must also correct myself: there are indeed entries in the log that refer to the network. Since we did not have these problems before, I am assuming a kernel / driver / BIOS problem. The NICs are Broadcom BCM57414 10/25Gb dual-port cards.

[Screenshot: network-related log entries]
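To see which driver and firmware the cards are currently running, something like this should work (the interface name here is just an example, not our actual one):

Code:
# driver, driver version and NIC firmware version
ethtool -i enp59s0f0

# PCI device and the kernel driver bound to it
lspci -nnk | grep -iA3 BCM57414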
 
Is the screenshot from the node that reboots itself? If so, then it seems to be a network issue and HA is doing what it is supposed to do: fence the node once it cannot reestablish the Corosync connection to the quorate part of the cluster.

Since both Corosync links go down, how is the network set up? Maybe there is a way to improve something on that side as well.
Maybe post the contents of /etc/network/interfaces and /etc/pve/corosync.conf. Ideally within [CODE][/CODE] tags.
 
Yes, I understand that, but the hosts last had an uptime of 174 days with 7.4 without any problems.

That HA evacuates / restarts the host here to avoid a split-brain is understandable. What interests me is why the host has problems with the network and no longer sees the other cluster nodes. Our setup looks like this (probably not best practice in all places, but this is how we planned and set it up 4 years ago):

[Diagram: network setup]
 
Ok, apparently something has changed in the OVS LACP implementation. Our switches regularly (only with the PVE 8.1 nodes) disconnect the bonds for a short time, probably due to missing LACP heartbeats. In this case, FAST (1s timeout) is configured on both sides.

[Screenshot: switch log]

Switch Interface Configuration:
Code:
...
channel-group XX mode active
flowcontrol receive off
mtu 9216 
lacp rate fast
...
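For comparison, the OVS bond on the PVE side is defined in /etc/network/interfaces roughly like this (interface, bond and bridge names here are placeholders, not our exact config):

Code:
auto bond0
iface bond0 inet manual
    ovs_bridge vmbr0
    ovs_type OVSBond
    ovs_bonds eno1 eno2
    ovs_options bond_mode=balance-tcp lacp=active other_config:lacp-time=fast

auto vmbr0
iface vmbr0 inet manual
    ovs_type OVSBridge
    ovs_ports bond0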

As we have had problems with OVS during upgrades in the past (OVS services restarting while updating, host fencing, etc.), it might be worth considering switching to Linux bridges. We do not use any other OVS features.

What do you think? OVS to Linux Bridge?
I might need some help for a "stable" configuration / migration :)
 
Does anyone have experience with OVS and LACP?

On the switch you can clearly see that the LACP heartbeats no longer come in every second (fast) from the PVE host, but only every 30s (slow), although the OVS bond is configured with "other_config:lacp-time=fast".

PVE 8.1 with OVS and LACP fast: [screenshot]

PVE 7.4 with OVS and LACP fast: [screenshot]

This also explains why the bonds are sporadically terminated/ungrouped.

Has anything changed in the current OVS config syntax?
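For anyone who wants to check this on their own system, the negotiated LACP state can be inspected on the OVS side roughly like this (the bond name is just a placeholder):

Code:
# negotiated LACP state of the bond, including the configured rate
ovs-appctl lacp/show bond0

# the other_config values actually applied to the port
ovs-vsctl list port bond0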
 
We do not use any other OVS features.

What do you think? OVS to Linux Bridge?
I, personally, think this is a great idea. ;)

It shouldn't be too hard. I don't know the network config details, but create a regular bond on the physical interfaces, choose LACP with a hash policy your switches support, and then a vmbr using that bond as bridge-port.
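A minimal sketch of such a config in /etc/network/interfaces, assuming two NICs eno1/eno2, LACP with layer3+4 hashing, and placeholder addresses (adjust names, hash policy and addresses to your environment):

Code:
auto eno1
iface eno1 inet manual

auto eno2
iface eno2 inet manual

auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-mode 802.3ad
    bond-miimon 100
    bond-xmit-hash-policy layer3+4
    bond-lacp-rate 1

auto vmbr0
iface vmbr0 inet static
    address 192.0.2.10/24
    gateway 192.0.2.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0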
 
Unfortunately, the problem still exists:

Nodes restart unexpectedly without any entries in the log. This morning a host (pve09) restarted without any load (only 4 VMs active).

Hardware: DELL PowerEdge R740xd
Firmware: fully updated (Bios, NIC, iDRAC, etc.)
Proxmox: PVE 8.1.4 with Kernel 6.5.11-8-pve
Network: Linux Bridges / Bonds

I have completely removed OVS in the last few days and migrated to native Linux Bridge.

Now I don't know what to do! All updates have been installed. It is reasonable enterprise hardware. I now completely rule out network problems, among other things because no other node in the cluster complained beforehand; they "only" notice that host 9 is down.

How can such an abrupt host restart be debugged further? Nothing is logged in the iDRAC, apart from the restart.

Btw: the host restarts only occur with the combination of PVE 8.1 and DELL PowerEdge.
 

Attachments

  • other_node.png
  • crashed_host.png
To rule out any Corosync networking issues, and if you have at least one free NIC, consider adding another Corosync link using different hardware.

For example, use a simple, cheap switch that connects the nodes only on that additional NIC. Then configure IP addresses in a new subnet and add it as an additional Corosync network: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy

If the "old" Corosync Links are down, you can ssh into the node from another one in the cluster via the "new" Corosync IPs and investigate.
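Roughly, the additional link in /etc/pve/corosync.conf would look like this: each node entry gets a ring1_addr on the new subnet and the totem section gets a second interface entry. Names and addresses below are placeholders, and remember to increase config_version when editing:

Code:
nodelist {
  node {
    name: pve01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
    ring1_addr: 192.168.100.1
  }
  ...
}

totem {
  ...
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}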

If it still reboots out of the blue, then it most likely is not Corosync related and it becomes trickier. Memtests, booting older kernels, updating to more current kernels if possible, ...
 
Unfortunately I don't have a free NIC, but I rule out network problems as there are no signs of them. Corosync also has two separate links running over physically separate NICs and switches. No Corosync entries in the logs shortly before reboot.

3 DELL R740 nodes are currently on PVE 8.1 with the current BIOS. These restart randomly, although no VMs (only Ceph) are running on these systems and they are not in HA (state: idle). So no host fencing from HA!

[Screenshot]

I can also completely rule out hardware defects (RAM, etc.), as the behavior occurs directly after upgrading from PVE 7 to 8.1 on 3 separate (identical) servers.

I found this in the release notes under Known Issues:
[Screenshot: entry from the Known Issues section of the release notes]
Could this be the reason for the server reboots?

Is there already a newer testing version where the problem may have been fixed?

I have now pinned the affected servers to kernel 6.2.16-20-pve as a test and am monitoring the situation.
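For reference, the pinning can be done with proxmox-boot-tool, roughly like this (assuming the older kernel is still installed):

Code:
# list the kernels known to proxmox-boot-tool
proxmox-boot-tool kernel list

# pin the 6.2 kernel and update the boot entries
proxmox-boot-tool kernel pin 6.2.16-20-pve
proxmox-boot-tool refresh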
 
