Hi,
I have a two-node cluster (proxmox01 and proxmox02). proxmox01 is my primary node and proxmox02 is my second.
What I noticed is that every time proxmox02 gets a bit loaded (system load reaching around 2.0) because of the "kvm" process, it automatically gets disconnected from the cluster. Luckily, all guest VMs running on this node stay online (and reachable); the node just disconnects itself from the cluster. When I run "pvecm status" there is only one member of the cluster, and in the PVE management web UI proxmox02 is shown with a red mark.
I dug through the corosync and pve-cluster logs, and so far these are the only error messages that I can correlate with the issue:
--------------------------------
Oct 05 09:59:11 proxmox02 corosync[65732]: [TOTEM ] A processor failed, forming new configuration.
Oct 05 09:59:12 proxmox02 corosync[65732]: [TOTEM ] A new membership (192.168.0.1:43544) was formed. Members joined: 1 left: 1
Oct 05 09:59:12 proxmox02 corosync[65732]: [TOTEM ] Failed to receive the leave message. failed: 1
Oct 05 09:59:15 proxmox02 corosync[65732]: [TOTEM ] FAILED TO RECEIVE
Oct 05 09:59:16 proxmox02 corosync[65732]: [TOTEM ] A new membership (192.168.0.2:43548) was formed. Members left: 1
Oct 05 09:59:16 proxmox02 corosync[65732]: [TOTEM ] Failed to receive the leave message. failed: 1
Oct 05 09:59:16 proxmox02 corosync[65732]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 05 09:59:16 proxmox02 corosync[65732]: [QUORUM] Members[1]: 2
Oct 05 09:59:16 proxmox02 corosync[65732]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 5 15:21:27 proxmox02 corosync[100488]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 5 15:21:28 proxmox02 pmxcfs[100284]: [main] notice: teardown filesystem
Oct 5 15:21:29 proxmox02 corosync[100488]: [TOTEM ] A new membership (192.168.0.2:51724) was formed. Members
Oct 5 15:21:29 proxmox02 corosync[100488]: [QUORUM] Members[1]: 2
Oct 5 15:21:29 proxmox02 corosync[100488]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 5 15:21:30 proxmox02 pve-ha-lrm[3103]: ipcc_send_rec failed: Transport endpoint is not connected
Oct 5 15:21:30 proxmox02 pve-ha-lrm[3103]: ipcc_send_rec failed: Connection refused
Oct 5 15:21:30 proxmox02 pve-ha-lrm[3103]: ipcc_send_rec failed: Connection refused
--------------------------------
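In case it helps anyone compare against their own logs, here is a rough sketch of how the membership changes could be pulled out of a log like the one above so they can be lined up against load spikes. It assumes the syslog line shape shown in the excerpt; the function name `membership_changes` and the `sample` list are only illustrative, not part of any Proxmox or corosync tooling.

```python
import re

# Assumed line shape, taken from the excerpt above:
# "<Mon> <day> <time> <host> corosync[<pid>]: [TOTEM ] A new membership (<ip>:<ring>) was formed. ..."
MEMBERSHIP = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+corosync\[\d+\]:\s+"
    r"\[TOTEM\s*\]\s+A new membership \((?P<ip>[\d.]+):(?P<ring>\d+)\) was formed\."
    r"(?P<rest>.*)$"
)

def membership_changes(lines):
    """Return (timestamp, ip, ring_id, remainder) for each membership-change line."""
    events = []
    for line in lines:
        m = MEMBERSHIP.match(line.strip())
        if m:
            events.append((m["ts"], m["ip"], int(m["ring"]), m["rest"].strip()))
    return events

# Three lines copied from the log above; only two are membership changes.
sample = [
    "Oct 05 09:59:11 proxmox02 corosync[65732]: [TOTEM ] A processor failed, forming new configuration.",
    "Oct 05 09:59:12 proxmox02 corosync[65732]: [TOTEM ] A new membership (192.168.0.1:43544) was formed. Members joined: 1 left: 1",
    "Oct 05 09:59:16 proxmox02 corosync[65732]: [TOTEM ] A new membership (192.168.0.2:43548) was formed. Members left: 1",
]

for event in membership_changes(sample):
    print(event)
```

The timestamps it returns can then be compared by hand with whatever load monitoring is available for the same period.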
It also comes back up and automatically rejoins the cluster as soon as the server load returns to normal.
Is there any underlying issue that I should look into further? Has anyone had this issue before?
TIA!