Hi Guys,
I'm running a four-node cluster on the latest PVE 7 release with Ceph 17.
Code:
pveversion:
pve-manager/7.4-19/f98bf8d4 (running kernel: 5.15.158-2-pve)
ceph --version:
ceph version 17.2.7 (29dffbfe59476a6bb5363cf5cc629089b25654e3) quincy (stable)
I want to upgrade this cluster to the latest PVE 8 release. I followed the instructions here: https://pve.proxmox.com/wiki/Upgrade_from_7_to_8
To upgrade the first node, I performed the following steps:
- Stopped or migrated all VMs running on Node A to Node B, C, or D
- Set the noout flag on the Ceph storage
- Replaced bullseye with bookworm in the sources.list files
- Ran apt update && apt dist-upgrade
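In shell terms, the steps above amount to roughly the following (a sketch, not verbatim what I ran; the sources.list.d glob assumes the extra PVE/Ceph repo files live there):

```shell
# Prevent Ceph from marking this node's OSDs out while they are down:
ceph osd set noout

# Switch the APT sources from bullseye to bookworm
# (also covers extra repo files under /etc/apt/sources.list.d/):
sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list /etc/apt/sources.list.d/*.list

# Perform the distribution upgrade:
apt update && apt dist-upgrade

# After ALL nodes are upgraded, clear the flag again:
ceph osd unset noout
```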
Anyway: I noticed that the openvswitch package got a new version. Could this be what caused the network connection to be lost?
Unfortunately, the network configuration is probably not ideal.
I have only one link for the cluster:

Here's the configuration:
Code:
auto enp67s0f0
iface enp67s0f0 inet manual

auto enp67s0f1
iface enp67s0f1 inet manual

auto bond302
iface bond302 inet manual
    ovs_bonds enp67s0f0 enp67s0f1
    ovs_type OVSBond
    ovs_bridge vmbr302
    ovs_mtu 9000
    ovs_options lacp=active other_config:lacp-time=fast bond_mode=balance-tcp
    pre-up ( ip link set enp67s0f0 mtu 9000 && ip link set enp67s0f1 mtu 9000 )

auto vmbr302
iface vmbr302 inet static
    address 192.168.100.133/26
    ovs_type OVSBridge
    ovs_ports bond302
    ovs_mtu 9000
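For what it's worth, the bond and link state can be inspected before and after the upgrade (a sketch; bond302 is the bond name from the config above):

```shell
# Show LACP/bond state on the OVS side. "lacp_status: negotiated" and
# enabled members indicate a healthy LAG; anything else points at a
# negotiation problem after the openvswitch restart.
ovs-appctl bond/show bond302 | grep -E 'lacp_status|may_enable'

# Corosync's own view of its links (only one ring in this setup):
corosync-cfgtool -s
```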
For Ceph it is very similar:
Code:
auto enp1s0f0
iface enp1s0f0 inet manual

auto enp1s0f1
iface enp1s0f1 inet manual

auto bond192
iface bond192 inet manual
    ovs_bonds enp1s0f0 enp1s0f1
    ovs_type OVSBond
    ovs_bridge vmbr192
    ovs_mtu 9000
    ovs_options lacp=active other_config:lacp-time=fast bond_mode=balance-tcp
    pre-up ( ip link set enp1s0f0 mtu 9000 && ip link set enp1s0f1 mtu 9000 )

auto vmbr192
iface vmbr192 inet static
    address 192.168.100.103/26
    ovs_type OVSBridge
    ovs_ports bond192
    ovs_mtu 9000
The LAN Network is also configured in a very similar way.
I suspect that the openvswitch upgrade triggered a restart of the switch daemon, causing the network connection to be lost. Could this really have happened? I also suspect that this is what caused Node A to restart. But why did all the other nodes restart as well?
Should I remove the LACP from the cluster network, assign IP addresses directly to the interfaces, and add a second link to the cluster configuration?
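On the second-link question: as far as I understand, a redundant link does not require dropping LACP; corosync supports multiple knet links, added by editing /etc/pve/corosync.conf. A sketch (the node name and the 10.10.10.x addresses are placeholders for a physically separate network, and config_version in the totem section must be incremented on every edit):

```
nodelist {
  node {
    name: node-a                 # placeholder name
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.100.133
    ring1_addr: 10.10.10.1       # second, physically separate link
  }
  # ... the other three nodes get a ring1_addr as well ...
}

totem {
  # ...
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}
```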
What logs would be helpful to determine why the cluster crashed? The only thing I found is that nodes left the cluster and quorum was lost; see the log from Node D in the attachment.
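In case it helps others reproduce what I looked at: the corosync, pve-cluster, and watchdog unit logs around the upgrade window are probably the relevant ones (a sketch; the time range is a placeholder):

```shell
# Collect the relevant units' logs around the incident into one file
# (adjust --since/--until to the actual upgrade window):
journalctl -u corosync -u pve-cluster -u watchdog-mux \
    --since "2024-01-01 10:00" --until "2024-01-01 11:00" > /tmp/cluster-crash.log

# Lines like these would point at fencing after quorum loss:
grep -Ei 'quorum|link.*down|watchdog|fenc' /tmp/cluster-crash.log
```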
I assumed that something strange had happened, so I proceeded to upgrade Node B. But the exact same thing happened again: everything was forcefully restarted.
I would really appreciate some help before upgrading any more nodes. I'm afraid the entire cluster will crash again.