Proxmox cluster dies after changing a switch port

proxorh
New Member
Feb 21, 2025
Hi,

I have a 2-node Proxmox cluster with a QDevice for quorum, all connected to the same switch and working well.
Code:
Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 192.168.10.11 (local)
0x00000002          1    A,V,NMW 192.168.10.12
0x00000000          1            Qdevice

I shut down one node (192.168.10.12) and the cluster still works fine:
Code:
Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 192.168.10.11 (local)
0x00000002          1         NR 192.168.10.12
0x00000000          1            Qdevice


I then move 192.168.10.12 to another switch and power it back on. 192.168.10.11 and 192.168.10.12 can ping each other, the 192.168.10.12 node shows a question mark in the Proxmox UI, and the cluster looks fine:
Code:
Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 192.168.10.11 (local)
0x00000002          1    A,V,NMW 192.168.10.12
0x00000000          1            Qdevice


After about a minute the cluster dies. The active node 192.168.10.11 drops the SSH connection, becomes unreachable, and the GUI stops responding. pvecm status still reports the cluster as up (as shown above), but the node is dead.

After a few minutes I am able to SSH into the node again, but the GUI is still down and I have to power off both nodes using systemctl --force --force poweroff. When I turn 192.168.10.11 back on, the GUI comes back to life and I see 192.168.10.12 as down (red X), but the cluster status no longer shows it:
Code:
Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 192.168.10.11 (local)
0x00000000          1            Qdevice

When I connect 192.168.10.12 back on the same switch, everything works well! Both nodes are recognized and all is good.

Attached is the syslog of 192.168.10.11. I didn't find anything meaningful in it, but I don't know what to look for.

I would appreciate any hint or further debugging steps, because I am quite stuck.
Thank you!
 


Do you have a network loop? Maybe try enabling STP and see if one of your links gets flagged.
 
@tcabernoch that was a good one :D

No network loop; ping is stable between the nodes with no packet loss.
I also only have vmbr0 on both nodes; it is configured as VLAN-aware, and the switch ports allow VLAN-tagged packets and are both on the same VLAN.
 
I'm using Ubiquiti switches; the ports are on Auto, default VLAN, and allow VLAN-tagged traffic. There is no physical loop in the connection. Still, with Auto and the same VLAN all around, is there any special traffic that Proxmox uses? I saw that in PVE 5 it used to be multicast, but since then it became unicast, and as I said before, I have stable pings between the nodes.

The main node 192.168.10.11 and the qdevice are connected to the same switch and they didn't move. As soon as I connect the 2nd node to a remote switch, the main node dies! I don't understand why, considering it has the qdevice right next to it.
 
What uplink type are you using? Is the port configured with STP or something? Hop into the CLI and post the uplink's interface configuration on both sides.
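For a VLAN-aware setup, the relevant stanza in /etc/network/interfaces usually looks something like this (eno1 and the addresses are just placeholders, yours will differ):

```text
auto lo
iface lo inet loopback

iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
    address 192.168.10.11/24
    gateway 192.168.10.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
```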

AFAIK multicast is not used for cluster communication anymore (but I may be wrong).
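Since corosync 3 (PVE 6 and later) the cluster traffic is unicast UDP via kronosnet, by default on port 5405, and ICMP ping doesn't exercise that path at all. If you want to rule the inter-switch path out, you could run a quick UDP echo test between the nodes; this is just a sketch (run the server half on one node and the probe on the other, pointed at its real IP; port 15405 is an arbitrary test port so it doesn't collide with the live corosync port):

```python
import socket
import threading
import time

def udp_echo_server(host: str, port: int, stop: threading.Event) -> None:
    """Run on one node: echo every UDP datagram back to its sender."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind((host, port))
        s.settimeout(0.2)
        while not stop.is_set():
            try:
                data, addr = s.recvfrom(2048)
                s.sendto(data, addr)
            except socket.timeout:
                continue

def udp_probe(host: str, port: int, count: int = 10) -> int:
    """Run on the other node: send datagrams and count the echoes."""
    received = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(0.5)
        for _ in range(count):
            s.sendto(b"probe", (host, port))
            try:
                s.recvfrom(2048)
                received += 1
            except socket.timeout:
                pass
    return received

if __name__ == "__main__":
    # Loopback self-test; across the switches, start the server on one node
    # and run the probe from the other against that node's IP.
    stop = threading.Event()
    t = threading.Thread(target=udp_echo_server, args=("127.0.0.1", 15405, stop))
    t.start()
    time.sleep(0.1)
    print("echoed:", udp_probe("127.0.0.1", 15405), "of 10")
    stop.set()
    t.join()
```

Anything less than a full echo count across the uplink would point at the trunk dropping UDP that plain pings don't show.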
 
On Ubiquiti UniFi switches there is trunking: you allow VLAN tagging on the port and it becomes a trunk.
But why do you think it is a network issue? Aren't the stable pings between the nodes enough to dismiss it as a network issue? If there were any loops I would have seen packet loss, but I haven't.
 
Stable pings are not everything; cluster communication is more than a ping. And since the network (especially the uplink) is the only thing that changed, this is (IMO) your issue.
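To make that concrete: a default ping carries only about 56 bytes of payload, while cluster datagrams can run near the MTU, so an inter-switch trunk with an MTU or fragmentation problem can pass pings while eating corosync traffic. A rough sketch (assumed ports, needs the same kind of echo server on the far node) that checks UDP round trips at several payload sizes:

```python
import socket
import threading
import time

def echo_server(host: str, port: int, stop: threading.Event) -> None:
    """Echo UDP datagrams back; stand-in for the far node."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind((host, port))
        s.settimeout(0.2)
        while not stop.is_set():
            try:
                data, addr = s.recvfrom(65535)
                s.sendto(data, addr)
            except socket.timeout:
                continue

def size_sweep(host: str, port: int,
               sizes=(64, 512, 1400), tries: int = 5) -> dict:
    """For each payload size, count how many round trips succeed.

    A path that passes small packets but drops near-MTU ones shows up
    here as a low count for the 1400-byte size.
    """
    results = {}
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(0.5)
        for size in sizes:
            ok = 0
            for _ in range(tries):
                s.sendto(b"x" * size, (host, port))
                try:
                    s.recvfrom(65535)
                    ok += 1
                except socket.timeout:
                    pass
            results[size] = ok
    return results

if __name__ == "__main__":
    # Loopback demo; on real hardware, bind the server to the far node's IP.
    stop = threading.Event()
    t = threading.Thread(target=echo_server, args=("127.0.0.1", 15406, stop))
    t.start()
    time.sleep(0.1)
    print(size_sweep("127.0.0.1", 15406))
    stop.set()
    t.join()
```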
 
Thanks @MasterTH, but I'm still stuck, since all the network debugging I could do ended with perfect communication. Even iperf works well both ways.