Daily Host Crash after Upgrade to PVE 8.1.4

Hello everyone,

There are some other reports in the forum, but I don't know if they are directly related. We have started to upgrade our 10-node cluster to PVE 8.1. Currently two nodes are on 8.1, a SuperMicro and a DELL PowerEdge R740xd. The DELL system crashes once a day, and we then receive the FENCING message by e-mail. There is no reason for this in the logs: shortly before the server reboots, there are no signs of trouble. The time also varies (10:47 / 10:56 / 04:26).

PVE 7 Nodes: Version 7.4.16
PVE 8 Nodes: Version 8.1.4
Ceph: 17.2.5 / 17.2.6
Kernel version on PVE 8 nodes: Linux 6.5.11-8-pve (2024-01-30T12:27Z)

Before the upgrade to version 8, there were no problems of this kind.
Does anyone have similar problems and/or an idea how to troubleshoot?

Best regards
Tan
 
Have you tried to update the BIOS and firmware on the Dell system and its NICs? With the newer kernel, there might be some issues if the firmware is too old. Definitely worth a try.
 
Hi, I already had the idea but haven't done it yet as the problem only became known this morning. Thanks for the tip. I will tackle it straight away.
I must also correct myself: there are indeed entries in the log that refer to the network. Since we did not have these problems before, I am assuming a kernel / driver / BIOS problem. The NICs are Broadcom BCM57414 10/25Gb dual-port cards.

[Screenshot: network-related log entries]
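To see which driver and firmware the cards are currently running, something like this should work (the interface name here is just an example, not our actual one):

Code:
# driver, driver version and NIC firmware version
ethtool -i enp59s0f0

# PCI device and the kernel driver bound to it
lspci -nnk | grep -iA3 BCM57414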
 
Is the screenshot from the node that reboots itself? If so, then it seems to be a network issue and HA is doing what it is supposed to do: fence the node once it cannot reestablish the Corosync connection to the quorate part of the cluster.

Since both Corosync links go down, how is the network set up? Maybe there is a way to improve something on that side as well.
Maybe post the contents of /etc/network/interfaces and /etc/pve/corosync.conf. Ideally within [CODE][/CODE] tags.
 
Yes, I understand that, but the hosts last had an uptime of 174 days with 7.4 without any problems.

That HA evacuates / restarts the host here to avoid a split-brain is understandable. What interests me is why the host has problems with the network and no longer sees the other cluster nodes. Our setup looks like this (probably not best practice in all places, but this is how we planned and set it up 4 years ago):

[Diagram: network setup]
 
Ok, apparently something has changed in the OVS LACP implementation. Our switches regularly (only with the PVE 8.1 nodes) disconnect the bonds for a short time, probably due to missing LACP heartbeats. In this case, FAST (1s timeout) is configured on both sides.

[Screenshot: switch log]

Switch Interface Configuration:
Code:
...
channel-group XX mode active
flowcontrol receive off
mtu 9216 
lacp rate fast
...
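For comparison, the OVS bond on the PVE side is defined in /etc/network/interfaces roughly like this (interface, bond and bridge names here are placeholders, not our exact config):

Code:
auto bond0
iface bond0 inet manual
    ovs_bridge vmbr0
    ovs_type OVSBond
    ovs_bonds eno1 eno2
    ovs_options bond_mode=balance-tcp lacp=active other_config:lacp-time=fast

auto vmbr0
iface vmbr0 inet manual
    ovs_type OVSBridge
    ovs_ports bond0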

As we have had problems with OVS during upgrades in the past (OVS services restarting while updating, host fencing, etc.), it might be worth considering switching to Linux bridges. We do not use any other OVS features.

What do you think? OVS to Linux Bridge?
I might need some help for a "stable" configuration / migration :)
 
Does anyone have experience with OVS and LACP?

On the switch you can clearly see that the LACP heartbeats no longer come in every second (fast) from the PVE host, but only every 30s (slow), although the OVS bond is configured with "other_config:lacp-time=fast".

PVE 8.1 with OVS and LACP fast: [screenshot]

PVE 7.4 with OVS and LACP fast: [screenshot]

This also explains why the bonds are sporadically terminated/ungrouped.

Has anything changed in the current OVS config syntax?
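For anyone who wants to check this on their own system, the negotiated LACP state can be inspected on the OVS side roughly like this (the bond name is just a placeholder):

Code:
# negotiated LACP state of the bond, including the configured rate
ovs-appctl lacp/show bond0

# the other_config values actually applied to the port
ovs-vsctl list port bond0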
 
We do not use any other OVS features.

What do you think? OVS to Linux Bridge?
I, personally, think this is a great idea. ;)

It shouldn't be too hard. I don't know the network config details, but create a regular bond on the physical interfaces, choose LACP with a hash policy your switches support, and then a vmbr using that bond as bridge-port.
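A minimal sketch of such a config in /etc/network/interfaces, assuming two NICs eno1/eno2, LACP with layer3+4 hashing, and placeholder addresses (adjust names, hash policy and addresses to your environment):

Code:
auto eno1
iface eno1 inet manual

auto eno2
iface eno2 inet manual

auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-mode 802.3ad
    bond-miimon 100
    bond-xmit-hash-policy layer3+4
    bond-lacp-rate 1

auto vmbr0
iface vmbr0 inet static
    address 192.0.2.10/24
    gateway 192.0.2.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0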
 
Unfortunately, the problem still exists:

Nodes restart unexpectedly without any entries in the log. This morning a host (pve09) restarted without any load (only 4 VMs active).

Hardware: DELL PowerEdge R740xd
Firmware: fully updated (Bios, NIC, iDRAC, etc.)
Proxmox: PVE 8.1.4 with Kernel 6.5.11-8-pve
Network: Linux Bridges / Bonds

I have completely removed OVS in the last few days and migrated to native Linux Bridge.

Now I don't know what to do! All updates have been installed. It is reasonable enterprise hardware. I now completely rule out network problems, among other things because no other node in the cluster complained beforehand; they "only" notice that host 9 is down.

How can such an abrupt host restart be debugged further? Nothing is logged in the iDRAC, apart from the restart.

Btw: the host restarts only occur with the combination of PVE 8.1 and DELL PowerEdge.
 

Attachments

  • other_node.png
  • crashed_host.png
To rule out any Corosync networking issues, and if you have at least one free NIC, consider adding another Corosync link using different hardware.

For example, use a simple, cheap switch that connects the nodes only on that additional NIC. Then configure IP addresses in a new subnet and add it as an additional Corosync network: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy

If the "old" Corosync Links are down, you can ssh into the node from another one in the cluster via the "new" Corosync IPs and investigate.
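Roughly, the additional link in /etc/pve/corosync.conf would look like this: each node entry gets a ring1_addr on the new subnet and the totem section gets a second interface entry. Names and addresses below are placeholders, and remember to increase config_version when editing:

Code:
nodelist {
  node {
    name: pve01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
    ring1_addr: 192.168.100.1
  }
  ...
}

totem {
  ...
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}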

If it still reboots out of the blue, then it most likely is not Corosync related and it becomes trickier. Memtests, booting older kernels, updating to more current kernels if possible, ...
 
Unfortunately I don't have a free NIC, but I rule out network problems as there are no signs of them. Corosync also has two separate links running over physically separate NICs and switches. No Corosync entries in the logs shortly before reboot.

3 DELL R740 nodes are currently on PVE 8.1 with the current BIOS. These restart randomly, although no VMs (only Ceph) are running on these systems and they are not in HA (state: idle). So no host fencing from HA!

[Screenshot]

I can also completely rule out hardware defects (RAM, etc.), as the behavior occurs directly after upgrading from PVE 7 to 8.1 on 3 separate (identical) servers.

I found this in the release notes under Known Issues:
[Screenshot: entry from the Known Issues section of the release notes]
Could this be the reason for the server reboots?

Is there already a newer testing version where the problem may have been fixed?

I have now pinned the affected servers to kernel 6.2.16-20-pve as a test and am monitoring the situation.
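For reference, the pinning can be done with proxmox-boot-tool, roughly like this (assuming the older kernel is still installed):

Code:
# list the kernels known to proxmox-boot-tool
proxmox-boot-tool kernel list

# pin the 6.2 kernel and update the boot entries
proxmox-boot-tool kernel pin 6.2.16-20-pve
proxmox-boot-tool refresh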
 
