Proxmox reboot after spanning tree issue

max.nolent · Aug 14, 2020

Hello Everyone,

I have a cluster of 5 proxmox nodes, all nodes are in Datacenter, we have an issue when we have a spanning recalculation. This morning, we lost a 10gbs on the switch connected to the proxmox. After link down, the switch calculate spanning tree then we lost 4 of our 5 nodes.

Everytime that we have a spanning tree, somes of our node restart. Do you any idea ?

Proxmox 5.4-3

Stoiko Ivanov · Aug 14, 2020

On a hunch - do you have HA resources configured in that cluster? - If yes and if your nodes lost network connectivity to each other (as I could imagine happens during a STP recalculation) - then I guess they fenced themselves since they lost quorum - check the reference documenatation:
https://pve.proxmox.com/pve-docs/chapter-ha-manager.html

max.nolent said:
Proxmox 5.4-3

PVE 5.X is EOL - please consider upgrading

max.nolent · Aug 14, 2020

Yes, i have HA ressource Between my first and third node but my first four nodes has restart, is there any log when node restart because of lost connection ?

max.nolent · Aug 14, 2020

From my logs :

Aug 14 11:14:56 proxmox3 corosync[3011]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Aug 14 11:14:56 proxmox3 corosync[3011]: [QUORUM] Members[1]: 1
Aug 14 11:14:56 proxmox3 corosync[3011]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 14 11:14:56 proxmox3 pmxcfs[2961]: [status] notice: node lost quorum
Aug 14 11:14:56 proxmox3 pmxcfs[2961]: [dcdb] crit: received write while not quorate - trigger resync
Aug 14 11:14:56 proxmox3 pmxcfs[2961]: [dcdb] crit: leaving CPG group
Aug 14 11:14:57 proxmox3 pmxcfs[2961]: [dcdb] notice: start cluster connection
Aug 14 11:14:57 proxmox3 pmxcfs[2961]: [dcdb] notice: members: 1/2961
Aug 14 11:14:57 proxmox3 pmxcfs[2961]: [dcdb] notice: all data is up to date

From here, i lost my quorum and my node is Alone.

STP recalculation : then

Few minutes laters :

Aug 14 11:15:24 proxmox3 corosync[3011]: notice [QUORUM] Members[5]: 1 2 3 5 4
Aug 14 11:15:24 proxmox3 corosync[3011]: notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 14 11:15:24 proxmox3 pmxcfs[2961]: [dcdb] notice: starting data syncronisation
Aug 14 11:15:24 proxmox3 corosync[3011]: [QUORUM] Members[5]: 1 2 3 5 4
Aug 14 11:15:24 proxmox3 corosync[3011]: [MAIN ] Completed service synchronization, ready to provide service.

My quorum is now up to date.
After that, i lost the quorm a second time, my 10gbs link comes up

Aug 14 11:15:43 proxmox3 corosync[3011]: notice [TOTEM ] A processor failed, forming new configuration.
Aug 14 11:15:43 proxmox3 corosync[3011]: [TOTEM ] A processor failed, forming new configuration.
Aug 14 11:15:46 proxmox3 corosync[3011]: notice [TOTEM ] A new membership (100.101.20.2:9696) was formed. Members left: 2 3 5 4
Aug 14 11:15:46 proxmox3 corosync[3011]: notice [TOTEM ] Failed to receive the leave message. failed: 2 3 5 4
Aug 14 11:15:46 proxmox3 corosync[3011]: warning [CPG ] downlist left_list: 4 received
Aug 14 11:15:46 proxmox3 corosync[3011]: notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
Aug 14 11:15:46 proxmox3 corosync[3011]: notice [QUORUM] Members[1]: 1

New STP recalculation :
Aug 14 11:16:29 proxmox3 corosync[3011]: notice [QUORUM] This node is within the primary component and will provide service.
Aug 14 11:16:29 proxmox3 corosync[3011]: notice [QUORUM] Members[5]: 1 2 3 5 4
Aug 14 11:16:29 proxmox3 corosync[3011]: notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 14 11:16:29 proxmox3 pmxcfs[2961]: [dcdb] notice: starting data syncronisation
Aug 14 11:16:29 proxmox3 corosync[3011]: [QUORUM] This node is within the primary component and will provide service.
Aug 14 11:16:29 proxmox3 corosync[3011]: [QUORUM] Members[5]: 1 2 3 5 4
Aug 14 11:16:29 proxmox3 corosync[3011]: [MAIN ] Completed service synchronization, ready to provide service.

But the node restart at 14:17:00
Aug 14 11:16:29 proxmox3 pmxcfs[2961]: [status] notice: received all states
Aug 14 11:16:29 proxmox3 pmxcfs[2961]: [status] notice: all data is up to date
Aug 14 11:16:29 proxmox3 pmxcfs[2961]: [status] notice: dfsm_deliver_queue: queue length 50
Aug 14 11:16:32 proxmox3 pve-ha-lrm[3150]: successfully acquired lock 'ha_agent_proxmox3_lock'
Aug 14 11:16:32 proxmox3 pve-ha-lrm[3150]: status change lost_agent_lock => active
Aug 14 11:16:37 proxmox3 pve-ha-crm[3107]: status change wait_for_quorum => slave
Aug 14 11:17:00 proxmox3 systemd[1]: Starting Proxmox VE replication runner...

Aug 14 11:20:30 proxmox3 systemd[1]: Starting Flush Journal to Persistent Storage...

Search

Search

Proxmox reboot after spanning tree issue

max.nolent

Member

Stoiko Ivanov

Proxmox Staff Member

max.nolent

Member

max.nolent

Member