Configuration:
OS: Proxmox VE 6.2-4
Ethernet: Each server has its own static IP on the same VLAN
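For reference, each node's /etc/network/interfaces follows the usual Proxmox layout, roughly like this (the NIC name, addresses and gateway below are placeholders, not my real values):
Code:
auto lo
iface lo inet loopback

# physical NIC, no address of its own
iface eno1 inet manual

# bridge carrying the node's static management IP; the VMs attach to it as well
auto vmbr0
iface vmbr0 inet static
        address 192.168.10.11/24
        gateway 192.168.10.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0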
I have set up three servers with Proxmox VE and linked them all into a Proxmox cluster.
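The cluster was created with the standard pvecm tooling, along these lines (the cluster name and IP are placeholders):
Bash:
# on the first node
pvecm create mycluster

# on each of the two other nodes
pvecm add 192.168.10.11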
I see frequent disconnections of one server. It becomes unreachable over SSH and the web GUI, and I have to wait an unpredictable amount of time before I can access it again. Because of that, the servers are unhealthy for a while and I cannot guarantee the high availability of my infrastructure.
As you can see in the syslog below, this is the trace from when the problem occurred.
Bash:
Feb 23 21:13:00 walle systemd[1]: Started Proxmox VE replication runner.
Feb 23 21:13:17 walle corosync[30374]: [KNET ] link: host: 1 link: 0 is down
Feb 23 21:13:17 walle corosync[30374]: [KNET ] link: host: 2 link: 0 is down
Feb 23 21:13:17 walle corosync[30374]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 23 21:13:17 walle corosync[30374]: [KNET ] host: host: 1 has no active links
Feb 23 21:13:17 walle corosync[30374]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 23 21:13:17 walle corosync[30374]: [KNET ] host: host: 2 has no active links
Feb 23 21:13:17 walle corosync[30374]: [TOTEM ] Token has not been received in 1237 ms
Feb 23 21:13:18 walle corosync[30374]: [TOTEM ] A processor failed, forming new configuration.
Feb 23 21:13:19 walle corosync[30374]: [TOTEM ] A new membership (3.2c5) was formed. Members left: 1 2
Feb 23 21:13:19 walle corosync[30374]: [TOTEM ] Failed to receive the leave message. failed: 1 2
Feb 23 21:13:19 walle corosync[30374]: [CPG ] downlist left_list: 2 received
Feb 23 21:13:20 walle pmxcfs[876]: [dcdb] notice: members: 3/876
Feb 23 21:13:20 walle corosync[30374]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Feb 23 21:13:20 walle pmxcfs[876]: [status] notice: members: 3/876
Feb 23 21:13:20 walle corosync[30374]: [QUORUM] Members[1]: 3
Feb 23 21:13:20 walle corosync[30374]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 23 21:13:20 walle pmxcfs[876]: [status] notice: node lost quorum
Feb 23 21:13:20 walle pmxcfs[876]: [dcdb] crit: received write while not quorate - trigger resync
Feb 23 21:13:20 walle pmxcfs[876]: [dcdb] crit: leaving CPG group
Feb 23 21:13:20 walle pve-ha-lrm[1050]: unable to write lrm status file - unable to open file '/etc/pve/nodes/walle/lrm_status.tmp.1050' - Permission denied
Feb 23 21:13:20 walle pmxcfs[876]: [dcdb] notice: start cluster connection
Feb 23 21:13:20 walle pmxcfs[876]: [dcdb] crit: cpg_join failed: 14
Feb 23 21:13:20 walle pmxcfs[876]: [dcdb] crit: can't initialize service
Feb 23 21:13:26 walle pmxcfs[876]: [dcdb] notice: members: 3/876
Feb 23 21:13:26 walle pmxcfs[876]: [dcdb] notice: all data is up to date
Feb 23 21:13:41 walle corosync[30374]: [KNET ] rx: host: 1 link: 0 is up
Feb 23 21:13:41 walle corosync[30374]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 23 21:13:46 walle corosync[30374]: [TOTEM ] A new membership (3.2d1) was formed. Members
Feb 23 21:13:46 walle corosync[30374]: [CPG ] downlist left_list: 0 received
Feb 23 21:13:46 walle corosync[30374]: [QUORUM] Members[1]: 3
Feb 23 21:13:46 walle corosync[30374]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 23 21:13:50 walle corosync[30374]: [TOTEM ] A new membership (3.2dd) was formed. Members
Feb 23 21:13:50 walle corosync[30374]: [CPG ] downlist left_list: 0 received
Feb 23 21:13:50 walle corosync[30374]: [QUORUM] Members[1]: 3
Feb 23 21:13:50 walle corosync[30374]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 23 21:13:54 walle corosync[30374]: [TOTEM ] A new membership (3.2e9) was formed. Members
Feb 23 21:13:54 walle corosync[30374]: [CPG ] downlist left_list: 0 received
Feb 23 21:13:54 walle corosync[30374]: [QUORUM] Members[1]: 3
Feb 23 21:13:54 walle corosync[30374]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 23 21:13:58 walle corosync[30374]: [TOTEM ] A new membership (3.2f5) was formed. Members
Feb 23 21:13:58 walle corosync[30374]: [CPG ] downlist left_list: 0 received
Feb 23 21:13:58 walle corosync[30374]: [QUORUM] Members[1]: 3
Feb 23 21:13:58 walle corosync[30374]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 23 21:14:00 walle systemd[1]: Starting Proxmox VE replication runner...
Feb 23 21:14:00 walle pvesr[3072]: trying to acquire cfs lock 'file-replication_cfg' ...
Feb 23 21:14:00 walle corosync[30374]: [KNET ] rx: host: 2 link: 0 is up
Feb 23 21:14:00 walle corosync[30374]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 23 21:14:01 walle pvesr[3072]: trying to acquire cfs lock 'file-replication_cfg' ...
Feb 23 21:14:01 walle corosync[30374]: [TOTEM ] A new membership (3.301) was formed. Members
Feb 23 21:14:01 walle corosync[30374]: [CPG ] downlist left_list: 0 received
Feb 23 21:14:01 walle corosync[30374]: [QUORUM] Members[1]: 3
Feb 23 21:14:01 walle corosync[30374]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 23 21:14:01 walle corosync[30374]: [TOTEM ] A new membership (1.305) was formed. Members joined: 1 2
Feb 23 21:14:01 walle corosync[30374]: [CPG ] downlist left_list: 0 received
Feb 23 21:14:01 walle corosync[30374]: [CPG ] downlist left_list: 0 received
Feb 23 21:14:01 walle corosync[30374]: [CPG ] downlist left_list: 0 received
Feb 23 21:14:01 walle pmxcfs[876]: [dcdb] notice: members: 1/880, 2/874, 3/876
Feb 23 21:14:01 walle pmxcfs[876]: [dcdb] notice: starting data syncronisation
Feb 23 21:14:01 walle pmxcfs[876]: [status] notice: members: 1/880, 2/874, 3/876
Feb 23 21:14:01 walle pmxcfs[876]: [status] notice: starting data syncronisation
Feb 23 21:14:01 walle corosync[30374]: [QUORUM] This node is within the primary component and will provide service.
Feb 23 21:14:01 walle corosync[30374]: [QUORUM] Members[3]: 1 2 3
Feb 23 21:14:01 walle corosync[30374]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 23 21:14:01 walle pmxcfs[876]: [status] notice: node has quorum
Feb 23 21:14:01 walle pmxcfs[876]: [dcdb] notice: received sync request (epoch 1/880/00000033)
Feb 23 21:14:01 walle pmxcfs[876]: [status] notice: received sync request (epoch 1/880/00000033)
Feb 23 21:14:01 walle pmxcfs[876]: [dcdb] notice: received all states
Feb 23 21:14:01 walle pmxcfs[876]: [dcdb] notice: leader is 1/880
Feb 23 21:14:01 walle pmxcfs[876]: [dcdb] notice: synced members: 1/880, 2/874
Feb 23 21:14:01 walle pmxcfs[876]: [dcdb] notice: waiting for updates from leader
Feb 23 21:14:01 walle pmxcfs[876]: [dcdb] notice: dfsm_deliver_queue: queue length 2
Feb 23 21:14:01 walle pmxcfs[876]: [status] notice: received all states
Feb 23 21:14:01 walle pmxcfs[876]: [status] notice: all data is up to date
Feb 23 21:14:01 walle pmxcfs[876]: [status] notice: dfsm_deliver_queue: queue length 22
Feb 23 21:14:01 walle pmxcfs[876]: [dcdb] notice: update complete - trying to commit (got 3 inode updates)
Feb 23 21:14:01 walle pmxcfs[876]: [dcdb] notice: all data is up to date
Feb 23 21:14:01 walle pmxcfs[876]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 2
Feb 23 21:14:02 walle systemd[1]: pvesr.service: Succeeded.
Feb 23 21:14:02 walle systemd[1]: Started Proxmox VE replication runner.
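Once the node is reachable again, I can confirm quorum and the corosync link state with the standard tools (output omitted here, it varies from one incident to the next):
Bash:
# quorum and membership as seen by Proxmox
pvecm status

# per-link status of the kronosnet transport
corosync-cfgtool -s

# corosync messages around the incident
journalctl -u corosync --since "21:12" --until "21:15"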
Moreover, the following logs come from the Linux kernel. I notice that ports on my vmbr0 bridge are blocked for a short time, and the same thing happens on my two other servers.
Bash:
[10446.819147] device tap200i0 entered promiscuous mode
[10446.845782] fwbr200i0: port 1(fwln200i0) entered blocking state
[10446.845783] fwbr200i0: port 1(fwln200i0) entered disabled state
[10446.845999] device fwln200i0 entered promiscuous mode
[10446.846039] fwbr200i0: port 1(fwln200i0) entered blocking state
[10446.846040] fwbr200i0: port 1(fwln200i0) entered forwarding state
[10446.848652] vmbr0: port 2(fwpr200p0) entered blocking state
[10446.848653] vmbr0: port 2(fwpr200p0) entered disabled state
[10446.848708] device fwpr200p0 entered promiscuous mode
[10446.848742] vmbr0: port 2(fwpr200p0) entered blocking state
[10446.848743] vmbr0: port 2(fwpr200p0) entered forwarding state
[10446.851348] fwbr200i0: port 2(tap200i0) entered blocking state
[10446.851349] fwbr200i0: port 2(tap200i0) entered disabled state
[10446.851402] fwbr200i0: port 2(tap200i0) entered blocking state
[10446.851402] fwbr200i0: port 2(tap200i0) entered forwarding state
[11920.774482] fwbr200i0: port 2(tap200i0) entered disabled state
[11920.790334] fwbr200i0: port 1(fwln200i0) entered disabled state
[11920.790362] vmbr0: port 2(fwpr200p0) entered disabled state
[11920.790504] device fwln200i0 left promiscuous mode
[11920.790505] fwbr200i0: port 1(fwln200i0) entered disabled state
[11920.808731] device fwpr200p0 left promiscuous mode
[11920.808732] vmbr0: port 2(fwpr200p0) entered disabled state
[12084.953756] device tap200i0 entered promiscuous mode
[12084.974905] fwbr200i0: port 1(fwln200i0) entered blocking state
[12084.974907] fwbr200i0: port 1(fwln200i0) entered disabled state
[12084.974968] device fwln200i0 entered promiscuous mode
[12084.974991] fwbr200i0: port 1(fwln200i0) entered blocking state
[12084.974991] fwbr200i0: port 1(fwln200i0) entered forwarding state
[12084.977765] vmbr0: port 2(fwpr200p0) entered blocking state
[12084.977766] vmbr0: port 2(fwpr200p0) entered disabled state
[12084.977823] device fwpr200p0 entered promiscuous mode
[12084.977844] vmbr0: port 2(fwpr200p0) entered blocking state
[12084.977844] vmbr0: port 2(fwpr200p0) entered forwarding state
[12084.980497] fwbr200i0: port 2(tap200i0) entered blocking state
[12084.980498] fwbr200i0: port 2(tap200i0) entered disabled state
[12084.980551] fwbr200i0: port 2(tap200i0) entered blocking state
[12084.980552] fwbr200i0: port 2(tap200i0) entered forwarding state
[15289.303686] device tap100i0 entered promiscuous mode
[15289.332817] fwbr100i0: port 1(fwln100i0) entered blocking state
[15289.332818] fwbr100i0: port 1(fwln100i0) entered disabled state
[15289.332861] device fwln100i0 entered promiscuous mode
[15289.332898] fwbr100i0: port 1(fwln100i0) entered blocking state
[15289.332899] fwbr100i0: port 1(fwln100i0) entered forwarding state
[15289.336183] vmbr0: port 3(fwpr100p0) entered blocking state
[15289.336184] vmbr0: port 3(fwpr100p0) entered disabled state
[15289.336227] device fwpr100p0 entered promiscuous mode
[15289.336241] vmbr0: port 3(fwpr100p0) entered blocking state
[15289.336242] vmbr0: port 3(fwpr100p0) entered forwarding state
[15289.339306] fwbr100i0: port 2(tap100i0) entered blocking state
[15289.339307] fwbr100i0: port 2(tap100i0) entered disabled state
[15289.339360] fwbr100i0: port 2(tap100i0) entered blocking state
[15289.339361] fwbr100i0: port 2(tap100i0) entered forwarding state
[34044.270329] perf: interrupt took too long (2504 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
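To help rule out a physical NIC or driver problem, I can also collect the link state and error counters on each node (the NIC name eno1 is a placeholder, mine may differ):
Bash:
# negotiated speed/duplex and link detection on the physical port
ethtool eno1

# driver-level error and drop counters
ethtool -S eno1 | grep -iE 'err|drop'

# kernel-level statistics for the bridge and the physical port
ip -s link show vmbr0
ip -s link show eno1

# NIC/driver messages in the kernel log
dmesg | grep -i eno1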
What are the possible solutions to this problem?
If you need more information or explanation, I can share more details.