Proxmox cluster dies after changing a switch port

proxorh
New Member
Feb 21, 2025
Hi,

I have a 2-node Proxmox cluster with a QDevice for quorum, all connected to the same switch and working well.
Code:
Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 192.168.10.11 (local)
0x00000002          1    A,V,NMW 192.168.10.12
0x00000000          1            Qdevice

I shut down one node (192.168.10.12) and the cluster still works fine:
Code:
Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 192.168.10.11 (local)
0x00000002          1         NR 192.168.10.12
0x00000000          1            Qdevice


I then move 192.168.10.12 to another switch and power it back on. 192.168.10.11 and 192.168.10.12 can ping each other, the 192.168.10.12 node shows a question mark in the Proxmox UI, and the cluster looks fine:
Code:
Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 192.168.10.11 (local)
0x00000002          1    A,V,NMW 192.168.10.12
0x00000000          1            Qdevice


After about a minute the cluster dies. The active node 192.168.10.11 drops the SSH connection, becomes unreachable, and the GUI stops responding. pvecm status still reports the cluster as up (as shown above), but the node is dead.

After a few minutes I am able to SSH into the node again, but the GUI is still down and I have to power off both nodes using systemctl --force --force poweroff. When I turn 192.168.10.11 back on, the GUI comes back to life and I see 192.168.10.12 as down (red X), but the cluster status no longer shows it:
Code:
Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 192.168.10.11 (local)
0x00000000          1            Qdevice

When I connect 192.168.10.12 back on the same switch, everything works well! Both nodes are recognized and all is good.

Attached is the syslog of 192.168.10.11. I didn't find anything meaningful in it, but I don't know what to look for.

I would appreciate any hint or further debugging steps, because I am quite stuck.
Thank you!
 


Do you have a network loop? Maybe try enabling STP and see if one of your links gets flagged.
 
@tcabernoch that was a good one :D

No network loop; ping is stable between the nodes with no packet loss.
I also only have vmbr0 on both nodes; it is configured as VLAN-aware, and the switch ports allow VLAN-tagged packets and are both on the same VLAN.
 
I'm using Ubiquiti switches; the ports are on Auto, default VLAN, and allow VLAN-tagged traffic. There is no physical loop in the connection. Still, with Auto and the same VLAN all around, is there any special traffic that Proxmox uses? I saw that in PVE 5 it used to be multicast, but since then it became unicast, and as I said before, I have stable pings between the nodes.

The main node 192.168.10.11 and the qdevice are connected to the same switch and they didn't move. As soon as I connect the 2nd node to a remote switch, the main node dies! I don't understand why, considering it has the qdevice right next to it.
 
What uplink type are you using? Is the port configured with STP or something? Hop into the CLI and post the uplink's interface configuration on both sides.
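For a VLAN-aware setup, the relevant stanza in /etc/network/interfaces usually looks something like this (eno1 and the addresses are just placeholders, yours will differ):

```text
auto lo
iface lo inet loopback

iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
    address 192.168.10.11/24
    gateway 192.168.10.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
```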

AFAIK multicast is not used for cluster communication anymore (but I may be wrong).
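Since corosync 3 (PVE 6 and later) the cluster traffic is unicast UDP via kronosnet, by default on port 5405, and ICMP ping doesn't exercise that path at all. If you want to rule the inter-switch path out, you could run a quick UDP echo test between the nodes; this is just a sketch (run the server half on one node and the probe on the other, pointed at its real IP; port 15405 is an arbitrary test port so it doesn't collide with the live corosync port):

```python
import socket
import threading
import time

def udp_echo_server(host: str, port: int, stop: threading.Event) -> None:
    """Run on one node: echo every UDP datagram back to its sender."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind((host, port))
        s.settimeout(0.2)
        while not stop.is_set():
            try:
                data, addr = s.recvfrom(2048)
                s.sendto(data, addr)
            except socket.timeout:
                continue

def udp_probe(host: str, port: int, count: int = 10) -> int:
    """Run on the other node: send datagrams and count the echoes."""
    received = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(0.5)
        for _ in range(count):
            s.sendto(b"probe", (host, port))
            try:
                s.recvfrom(2048)
                received += 1
            except socket.timeout:
                pass
    return received

if __name__ == "__main__":
    # Loopback self-test; across the switches, start the server on one node
    # and run the probe from the other against that node's IP.
    stop = threading.Event()
    t = threading.Thread(target=udp_echo_server, args=("127.0.0.1", 15405, stop))
    t.start()
    time.sleep(0.1)
    print("echoed:", udp_probe("127.0.0.1", 15405), "of 10")
    stop.set()
    t.join()
```

Anything less than a full echo count across the uplink would point at the trunk dropping UDP that plain pings don't show.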
 
On Ubiquiti UniFi switches there is trunking: you allow VLAN tagging on the port and it becomes a trunk.
But why do you think it is a network issue? Aren't the stable pings between the nodes enough to dismiss it as a network issue? If there were any loops I would have seen packet loss, but I haven't.
 
Stable pings are not everything; cluster communication is more than a ping. And since the network (especially the uplink) is the only thing that changed, this is (IMO) your issue.
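To make that concrete: a default ping carries only about 56 bytes of payload, while cluster datagrams can run near the MTU, so an inter-switch trunk with an MTU or fragmentation problem can pass pings while eating corosync traffic. A rough sketch (assumed ports, needs the same kind of echo server on the far node) that checks UDP round trips at several payload sizes:

```python
import socket
import threading
import time

def echo_server(host: str, port: int, stop: threading.Event) -> None:
    """Echo UDP datagrams back; stand-in for the far node."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind((host, port))
        s.settimeout(0.2)
        while not stop.is_set():
            try:
                data, addr = s.recvfrom(65535)
                s.sendto(data, addr)
            except socket.timeout:
                continue

def size_sweep(host: str, port: int,
               sizes=(64, 512, 1400), tries: int = 5) -> dict:
    """For each payload size, count how many round trips succeed.

    A path that passes small packets but drops near-MTU ones shows up
    here as a low count for the 1400-byte size.
    """
    results = {}
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(0.5)
        for size in sizes:
            ok = 0
            for _ in range(tries):
                s.sendto(b"x" * size, (host, port))
                try:
                    s.recvfrom(65535)
                    ok += 1
                except socket.timeout:
                    pass
            results[size] = ok
    return results

if __name__ == "__main__":
    # Loopback demo; on real hardware, bind the server to the far node's IP.
    stop = threading.Event()
    t = threading.Thread(target=echo_server, args=("127.0.0.1", 15406, stop))
    t.start()
    time.sleep(0.1)
    print(size_sweep("127.0.0.1", 15406))
    stop.set()
    t.join()
```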
 
Thanks @MasterTH, but I'm still stuck, since all the network debugging I could do ended with perfect communication. Even iperf works well both ways.