HA Cluster crashes when upgrading nodes from PVE 7 to 8

aPollO

Hi Guys,

I'm running a four-node cluster on the latest PVE 7 release with Ceph 17.

Code:
pveversion:
pve-manager/7.4-19/f98bf8d4 (running kernel: 5.15.158-2-pve)

ceph --version:
ceph version 17.2.7 (29dffbfe59476a6bb5363cf5cc629089b25654e3) quincy (stable)

I want to upgrade this cluster to the latest PVE 8 release. I followed the instructions here: https://pve.proxmox.com/wiki/Upgrade_from_7_to_8

To upgrade the first node, I performed the following steps:

  1. Stopped or migrated all VMs running on Node A to Node B, C or D
  2. Set noout flag on Ceph storage
  3. Replaced bullseye with bookworm in the sources.list files
  4. Ran apt update && apt dist-upgrade (roughly as sketched below)
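
Roughly, the commands behind steps 2-4 looked like this (a sketch, not the exact session; pve7to8 is the pre-flight check recommended in the upgrade guide):

Code:
# Step 2: keep Ceph from rebalancing while the node is down
ceph osd set noout

# Step 3: switch the Debian release in all APT source lists
sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list /etc/apt/sources.list.d/*.list

# Step 4: pre-flight check, then the actual upgrade
pve7to8 --full
apt update && apt dist-upgrade
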
Everything went as planned until suddenly the SSH connection to Node A was lost. Shortly after, I noticed that not only Node A was unreachable, but all four nodes were. They had all restarted except Node A, which had lost its network instead. So I continued the upgrade directly on the server's console. The other nodes booted up again.
OK whatever. I noticed that the openvswitch package got a new version. Could it be that this caused the network connection to be lost?

Unfortunately, the network configuration is not perfect... I guess.

I have only one link for the cluster:
[screenshot: cluster network with a single Corosync link]

Here's the configuration:
Code:
auto enp67s0f0
iface enp67s0f0 inet manual

auto enp67s0f1
iface enp67s0f1 inet manual


auto bond302
iface bond302 inet manual
    ovs_bonds enp67s0f0 enp67s0f1
    ovs_type OVSBond
    ovs_bridge vmbr302
    ovs_mtu 9000
    ovs_options lacp=active other_config:lacp-time=fast bond_mode=balance-tcp
    pre-up ( ip link set enp67s0f0 mtu 9000 && ip link set enp67s0f1 mtu 9000 )


auto vmbr302
iface vmbr302 inet static
    address 192.168.100.133/26
    ovs_type OVSBridge
    ovs_ports bond302
    ovs_mtu 9000


For Ceph it is very similar:
Code:
auto enp1s0f0
iface enp1s0f0 inet manual

auto enp1s0f1
iface enp1s0f1 inet manual


auto bond192
iface bond192 inet manual
        ovs_bonds enp1s0f0 enp1s0f1
        ovs_type OVSBond
        ovs_bridge vmbr192
        ovs_mtu 9000
        ovs_options lacp=active other_config:lacp-time=fast bond_mode=balance-tcp
        pre-up ( ip link set enp1s0f0 mtu 9000 && ip link set enp1s0f1 mtu 9000 )

auto vmbr192
iface vmbr192 inet static
        address 192.168.100.103/26
        ovs_type OVSBridge
        ovs_ports bond192
        ovs_mtu 9000

The LAN network is also configured in a very similar way.

I suspect that the openvswitch package upgrade triggered a restart of the service, causing the network connection to be lost. Could this really have happened? I also suspect that this is what triggered the reboots. But why did all the other nodes restart as well?

Should I remove LACP from the cluster network, assign IP addresses directly to the interfaces, and add a second link to the cluster configuration?
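
For reference, adding a second Corosync link would roughly look like this in /etc/pve/corosync.conf (just a sketch: the 10.10.10.x addresses are made up, and config_version has to be increased whenever the file is edited):

Code:
nodelist {
  node {
    name: node-a
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.100.133
    # made-up second network for link 1
    ring1_addr: 10.10.10.133
  }
  # ... add a ring1_addr to node-b, node-c and node-d as well ...
}

totem {
  # keep the existing settings, bump config_version by 1,
  # and add an interface section for the new link:
  interface {
    linknumber: 1
  }
}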

What logs would be helpful to determine why the cluster crashed? The only thing I found is that nodes left the cluster and quorum was lost. See the log from Node D in the attachment.
I assumed that something strange had happened, so I proceeded to upgrade Node B. But the exact same thing happened again: everything was forcefully restarted.

I would really appreciate some help before upgrading any more nodes. I'm afraid the entire cluster will crash again.
 

Attachments

Hi,

Is the 192.168.100.x network only used for Corosync, or is it also shared with the VMs and Ceph? Could you provide some syslog entries? This can help us identify what exactly happened.
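
For example, something like this from each node, covering the time of the crash (timestamps are placeholders):

Code:
journalctl -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux \
    --since "2025-03-03 12:00" --until "2025-03-03 13:00" > syslog-node-X.txt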
 
Thanks for your reply.

Okay, let me try to explain the network to answer your questions.

The VM/Management Network

  • OVS Bond (enp49s0f0 + enp49s0f1)
  • Network 192.168.1.0/24
  • Range 192.168.1.1 - 254
  • Host has e.g. 192.168.1.100
The Corosync Network
  • OVS Bond (enp2s0f0 + enp2s0f1)
  • Network 192.168.100.128/26
  • Range 192.168.100.129 - 190
  • Host has e.g. 192.168.100.130
The Ceph Network
  • OVS Bond (enp19s0f0 + enp19s0f1)
  • Network 192.168.100.64/26
  • Range 192.168.100.65 - 126
  • Host has e.g. 192.168.100.100

That also means:
  • all networks have an OVS Bridge on top of an OVS Bond
  • all networks use LACP with balance-tcp
  • all physical interfaces are 10 GbE
  • the VM/Management network is on a separate switch
  • Ceph and Corosync use the same switches, but with their own ports, cables and physical interfaces
  • Ceph and Corosync are in different port-based VLANs, which means no VLAN on the PVE host, only on the switch side

I attached two syslog files. This is just a short excerpt from right when the problem happened. If you need more logs, let me know which ones.
node-a is the one where the upgrade was in progress. node-d was up and running and unexpectedly got killed during the upgrade of node-a (just like nodes b and c).
I see that there was no quorum, but I don't understand how this happened. I didn't touch the switches or the network interfaces on node-b, node-c or node-d.

As I mentioned in the first post, node-a was not rebooted. However, it lost network connectivity. On the local console I could see that the interfaces were still shown by tools like 'ip a', but a connection was no longer possible, e.g. via ping. I restarted this node manually to get a working connection back.
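
For reference, a quick way to check in this situation whether it is Open vSwitch itself that died (interfaces still listed by 'ip a', but no traffic passes) before rebooting:

Code:
# Are the OVS daemons still running?
systemctl status ovs-vswitchd ovsdb-server openvswitch-switch

# Does OVS still know about its bridges, bonds and ports?
ovs-vsctl show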
 

Attachments

What are the NICs? If you have Intel NICs and you're doing LACP, there is a bug in some of the hardware (I believe it is the 700-series) that offloads the LACP frames in the NIC and never passes them to the OS. This isn't apparent until you lose a connection (network restart, one side of the datacenter goes down, etc.) and the bond stops running properly.
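
A quick way to check which Intel family/driver is actually in use (interface name taken from the config above):

Code:
lspci | grep -i ethernet

# The "driver:" line tells the family: i40e = 700-series, ixgbe = X520/X540/X550
ethtool -i enp67s0f0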

Also make sure your switch can handle your "fast" timing and bonding mode (and that the config is the same on both sides, and that the switch follows the standard when it comes to timings). Some don't, and it works well until one NIC goes down and the traffic needs to be rerouted; your Corosync then loses its connection for too long, triggering the dead man's switch. What I've also seen is all ports being in a single LACP group on the switch side, which works until one node goes down, and then the other ports on the other machines get a notification from the switch that their bond is 'broken' and they all reconfigure at the same time.

I would make sure your switches are working properly and do some testing by taking out one of the switches or unplugging a cable and seeing whether there is an issue reconverging. The other issue I've seen is when embedded SD cards are used as boot devices: they are simply too slow, so when there is a spike (when you do apt upgrade, or one node goes down and a bunch of disk activity happens due to logs), Corosync can't write to disk fast enough and times out. This can also happen with a cheap SSD as boot device, or when one drive of a RAID 1 pair isn't functioning properly.
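
For those failover tests, something along these lines should show whether the bond and Corosync actually notice the failure and recover (bond302 taken from the cluster config above):

Code:
# LACP/bond state as seen by Open vSwitch
ovs-appctl bond/show bond302
ovs-appctl lacp/show bond302

# Corosync link and quorum state
corosync-cfgtool -s
pvecm status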
 
Code:
node-a: ~/ $ lspci | grep Ether
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)
01:00.1 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)
02:00.0 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)
02:00.1 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)
13:00.0 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)
13:00.1 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)
31:00.0 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)
31:00.1 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)

On the switch side, the ports are configured like this:
[screenshot: switch port / LAG configuration across both stacked switches]

1/0 means the first switch and 2/0 means the second switch. They are stacked and work like one switch. The model is a Netgear M4300-X12F12. I will do some tests with disabling ports/disconnecting cables and see what happens.
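
While a port is disabled or a cable is pulled, something like this running on another node should show how quickly the link recovers (addresses from the description above):

Code:
# Corosync link state, refreshed every second
watch -n1 corosync-cfgtool -s

# Continuous ping across the cluster network to spot the outage window
ping -i 0.2 192.168.100.130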

The PVE system is installed on a RAID 1 of Samsung MZ-ILT9600 SSDs. That should be fast enough, I think, especially since no VMs were running on the node while the upgrade was performed.
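
For what it's worth, the fsync rate of the boot disks can be checked quickly with pveperf (values depend on filesystem and controller):

Code:
# Reports CPU, buffered reads and FSYNCS/SECOND for the given path
pveperf /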


I found something strange. Looks like some misconfiguration to me.
Code:
Mar 03 12:07:37 node-b corosync[20798]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 8988 bytes for host 1 link 0 but the other node is not acknowledging packets of this size.
Mar 03 12:07:37 node-b corosync[20798]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run b>
Mar 03 12:42:31 node-b corosync[20798]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 8988 bytes for host 1 link 0 but the other node is not acknowledging packets of this size.
Mar 03 12:42:31 node-b corosync[20798]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run b>
Mar 03 13:14:17 node-b corosync[20798]:   [KNET  ] link: host: 4 link: 0 is down
Mar 03 13:14:17 node-b corosync[20798]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Mar 03 13:14:17 node-b corosync[20798]:   [KNET  ] host: host: 4 has no active links
Mar 03 13:14:18 node-b corosync[20798]:   [KNET  ] link: Resetting MTU for link 0 because host 4 joined
Mar 03 13:14:18 node-b corosync[20798]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Mar 03 13:18:54 node-b corosync[20798]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 8988 bytes for host 1 link 0 but the other node is not acknowledging packets of this size.
Mar 03 13:18:54 node-b corosync[20798]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run b>
Mar 03 13:48:43 node-b corosync[20798]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Mar 03 13:53:48 node-b corosync[20798]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 8988 bytes for host 1 link 0 but the other node is not acknowledging packets of this size.
Mar 03 13:53:48 node-b corosync[20798]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run b>
Mar 03 14:28:42 node-b corosync[20798]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 8988 bytes for host 1 link 0 but the other node is not acknowledging packets of this size.
Mar 03 14:28:42 node-b corosync[20798]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run b>
Mar 03 15:03:37 node-b corosync[20798]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 8988 bytes for host 1 link 0 but the other node is not acknowledging packets of this size.
Mar 03 15:03:37 node-b corosync[20798]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run b>
Mar 03 15:38:31 node-b corosync[20798]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 8988 bytes for host 1 link 0 but the other node is not acknowledging packets of this size.
Mar 03 15:38:31 node-b corosync[20798]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run b>
Mar 03 16:13:25 node-b corosync[20798]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 8988 bytes for host 1 link 0 but the other node is not acknowledging packets of this size.
Mar 03 16:13:25 node-b corosync[20798]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run b>
Mar 03 16:48:19 node-b corosync[20798]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 8988 bytes for host 1 link 0 but the other node is not acknowledging packets of this size.
Mar 03 16:48:19 node-b corosync[20798]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run b>
Mar 03 17:22:01 node-b corosync[20798]:   [KNET  ] link: host: 4 link: 0 is down
 
Yes, it supports frames of up to 9216 bytes, and this is configured on the ports. The entries from KNET/Corosync regarding the MTU misconfiguration were only due to me temporarily reducing the MTU to 1500 for testing. However, this was after my issue with the crash, so it has nothing to do with the problem from the first post.
Sorry for the confusion I caused.
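
A quick way to verify that jumbo frames really pass end-to-end is a non-fragmenting ping with a jumbo-sized payload (the target here is just another host's Ceph address from the description above):

Code:
# 9000 bytes MTU minus 20 bytes IP header and 8 bytes ICMP header = 8972
ping -M do -s 8972 -c 3 192.168.100.100

# For comparison, a standard-MTU payload that must always get through
ping -M do -s 1472 -c 3 192.168.100.100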

Yesterday I was able to upgrade the last node in the cluster to Proxmox 8. I stopped the pve-ha-lrm and pve-ha-crm services on all nodes before starting the upgrade of the last node.
Once again, the network connection was lost during the upgrade. I was unable to connect via SSH and had to use the local console again. I really have no idea why this is happening or how to prevent it.
But this time the cluster didn't crash, because I had stopped all VMs and HA services beforehand as a precaution.

I have another cluster with three nodes that I need to upgrade. I would prefer to do this upgrade without downtime. However, I first need to find out what is causing the network loss. Any ideas?
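
For the next cluster, precautions along these lines should at least keep the nodes from being fenced, even if the network still drops during the upgrade (a sketch; stopping the HA services is what already saved the cluster on the last node here):

Code:
# On ALL nodes, before touching the first one: stop HA so a lost
# Corosync link cannot trigger a watchdog reboot
systemctl stop pve-ha-lrm pve-ha-crm

# Keep Ceph from rebalancing while a node is down
ceph osd set noout

# Run the upgrade from the local console or inside tmux/screen,
# so a dropped SSH session cannot interrupt dpkg mid-upgrade
tmux
apt update && apt dist-upgrade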

I found this: ovs-vswitchd stops during the upgrade. Is this normal behaviour?

Code:
Mar 03 20:49:06 node-d.local systemd[1]: Stopping openvswitch-switch.service - Open vSwitch...
Mar 03 20:49:06 node-d.local systemd[1]: openvswitch-switch.service: Deactivated successfully.
Mar 03 20:49:06 node-d.local systemd[1]: Stopped openvswitch-switch.service - Open vSwitch.
Mar 03 20:49:06 node-d.local systemd[1]: Stopping ovs-vswitchd.service - Open vSwitch Forwarding Unit...
Mar 03 20:49:06 node-d.local (ovs-ctl)[1789805]: ovs-vswitchd.service: Failed to locate executable /usr/share/openvswitch/scripts/ovs-ctl: No such file or directory
Mar 03 20:49:06 node-d.local (ovs-ctl)[1789805]: ovs-vswitchd.service: Failed at step EXEC spawning /usr/share/openvswitch/scripts/ovs-ctl: No such file or directory
Mar 03 20:49:06 node-d.local systemd[1]: ovs-vswitchd.service: Control process exited, code=exited, status=203/EXEC
Mar 03 20:49:07 node-d.local systemd[1]: ovs-vswitchd.service: Failed with result 'exit-code'.
Mar 03 20:49:07 node-d.local systemd[1]: Stopped ovs-vswitchd.service - Open vSwitch Forwarding Unit.
Mar 03 20:49:07 node-d.local systemd[1]: ovs-vswitchd.service: Consumed 28.562s CPU time.
Mar 03 20:49:07 node-d.local systemd[1]: Stopping ovsdb-server.service - Open vSwitch Database Unit...
Mar 03 20:49:07 node-d.local (ovs-ctl)[1789808]: ovsdb-server.service: Failed to locate executable /usr/share/openvswitch/scripts/ovs-ctl: No such file or directory
Mar 03 20:49:07 node-d.local (ovs-ctl)[1789808]: ovsdb-server.service: Failed at step EXEC spawning /usr/share/openvswitch/scripts/ovs-ctl: No such file or directory
Mar 03 20:49:07 node-d.local systemd[1]: ovsdb-server.service: Control process exited, code=exited, status=203/EXEC
Mar 03 20:49:07 node-d.local systemd[1]: ovsdb-server.service: Failed with result 'exit-code'.
Mar 03 20:49:07 node-d.local systemd[1]: Stopped ovsdb-server.service - Open vSwitch Database Unit.
Mar 03 20:49:08 node-d.local systemd[1]: Stopping slapd.service - LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol)...
Mar 03 20:49:08 node-d.local slapd[1778757]: daemon: shutdown requested and initiated.
Mar 03 20:49:08 node-d.local slapd[1778757]: slapd shutdown: waiting for 0 operations/tasks to finish
Mar 03 20:49:08 node-d.local slapd[1778757]: DIGEST-MD5 common mech free
Mar 03 20:49:08 node-d.local slapd[1778757]: DIGEST-MD5 common mech free
Mar 03 20:49:08 node-d.local slapd[1778757]: slapd stopped.
Mar 03 20:49:08 node-d.local slapd[1789953]: Stopping OpenLDAP: slapd.
Mar 03 20:49:08 node-d.local systemd[1]: slapd.service: Deactivated successfully.
Mar 03 20:49:08 node-d.local systemd[1]: Stopped slapd.service - LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol).
Mar 03 20:49:08 node-d.local slapcat[1789965]: DIGEST-MD5 common mech free
Mar 03 20:49:08 node-d.local slapcat[1789988]: DIGEST-MD5 common mech free
Mar 03 20:49:09 node-d.local systemd[1]: plymouth-quit-wait.service: Deactivated successfully.
Mar 03 20:49:09 node-d.local systemd[1]: Stopped plymouth-quit-wait.service - Hold until boot process finishes up.
Mar 03 20:49:09 node-d.local systemd[1]: plymouth-quit.service: Deactivated successfully.
Mar 03 20:49:09 node-d.local systemd[1]: Stopped plymouth-quit.service - Terminate Plymouth Boot Screen.
Mar 03 20:49:09 node-d.local systemd[1]: plymouth-read-write.service: Deactivated successfully.
Mar 03 20:49:09 node-d.local systemd[1]: Stopped plymouth-read-write.service - Tell Plymouth To Write Out Runtime Data.
Mar 03 20:49:09 node-d.local systemd[1]: plymouth-start.service: Deactivated successfully.
Mar 03 20:49:09 node-d.local systemd[1]: Stopped plymouth-start.service - Show Plymouth Boot Screen.
Mar 03 20:49:09 node-d.local systemd[1]: systemd-ask-password-plymouth.path: Deactivated successfully.
Mar 03 20:49:09 node-d.local systemd[1]: Stopped systemd-ask-password-plymouth.path - Forward Password Requests to Plymouth Directory Watch.
Mar 03 20:49:12 node-d.local corosync[2850]:   [KNET  ] link: host: 2 link: 0 is down
Mar 03 20:49:12 node-d.local corosync[2850]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Mar 03 20:49:12 node-d.local corosync[2850]:   [KNET  ] host: host: 2 has no active links
Mar 03 20:49:16 node-d.local corosync[2850]:   [KNET  ] link: host: 3 link: 0 is down
Mar 03 20:49:16 node-d.local corosync[2850]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Mar 03 20:49:16 node-d.local corosync[2850]:   [KNET  ] host: host: 3 has no active links
Mar 03 20:49:18 node-d.local corosync[2850]:   [KNET  ] link: host: 1 link: 0 is down
Mar 03 20:49:18 node-d.local corosync[2850]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Mar 03 20:49:18 node-d.local corosync[2850]:   [KNET  ] host: host: 1 has no active links
Mar 03 20:49:19 node-d.local corosync[2850]:   [TOTEM ] Token has not been received in 3225 ms
Mar 03 20:49:20 node-d.local corosync[2850]:   [TOTEM ] A processor failed, forming new configuration: token timed out (4300ms), waiting 5160ms for consensus.
Mar 03 20:49:25 node-d.local corosync[2850]:   [QUORUM] Sync members[1]: 4
Mar 03 20:49:25 node-d.local corosync[2850]:   [QUORUM] Sync left[3]: 1 2 3
Mar 03 20:49:25 node-d.local corosync[2850]:   [TOTEM ] A new membership (4.5bfa) was formed. Members left: 1 2 3
Mar 03 20:49:25 node-d.local corosync[2850]:   [TOTEM ] Failed to receive the leave message. failed: 1 2 3
Mar 03 20:49:25 node-d.local pmxcfs[2742]: [dcdb] notice: members: 4/2742
Mar 03 20:49:25 node-d.local corosync[2850]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Mar 03 20:49:25 node-d.local corosync[2850]:   [QUORUM] Members[1]: 4
Mar 03 20:49:25 node-d.local corosync[2850]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 03 20:49:25 node-d.local pmxcfs[2742]: [status] notice: node lost quorum
Mar 03 20:49:25 node-d.local pmxcfs[2742]: [status] notice: members: 4/2742
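
The "Failed to locate executable /usr/share/openvswitch/scripts/ovs-ctl" lines look like the old unit files were still pointing at scripts that the package upgrade had already removed or replaced, so the stop jobs failed. Once the dist-upgrade has finished, something like this should bring OVS and the bridges back up without a reboot (a sketch, not verified in this exact situation):

Code:
# Check whether the OVS daemons came back after the upgrade
systemctl status openvswitch-switch ovs-vswitchd ovsdb-server

# If not, restart OVS and re-apply /etc/network/interfaces
systemctl restart openvswitch-switch
systemctl restart networking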
 

Attachments