Hello, everybody.
There was a crash on our cluster today.
First, one node lost its connection to the cluster, and then all the other nodes rebooted.
The attachment contains logs from the node that failed first and from one of the other nodes.
Please help me figure out what caused the cluster failure. I can provide any additional information that is needed.
Code:
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [TOTEM ] A new membership (8.17f8) was formed. Members
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [QUORUM] Members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [QUORUM] Sync members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [TOTEM ] A new membership (8.17fc) was formed. Members
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [QUORUM] Members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [QUORUM] Sync members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [TOTEM ] A new membership (8.1800) was formed. Members
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [QUORUM] Members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [QUORUM] Sync members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [TOTEM ] A new membership (8.1804) was formed. Members
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [QUORUM] Members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [QUORUM] Sync members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [TOTEM ] A new membership (8.1808) was formed. Members
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [QUORUM] Members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [QUORUM] Sync members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [TOTEM ] A new membership (8.180c) was formed. Members
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [QUORUM] Members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [QUORUM] Sync members[10]: 1 2 3 4 6 7 8 9 10 11
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [QUORUM] Sync joined[9]: 1 2 3 4 6 7 9 10 11
Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [TOTEM ] A new membership (1.1810) was formed. Members joined: 1 2 3 4 6 7 9 10 11
Jan 28 03:56:34 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 10
Jan 28 03:56:35 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 20
Jan 28 03:56:36 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 30
Jan 28 03:56:37 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 40
Jan 28 03:56:38 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 50
Jan 28 03:56:39 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 60
Jan 28 03:56:40 tvr-pve-09 pve-ha-lrm[3526]: loop take too long (33 seconds)
Jan 28 03:56:40 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 70
Jan 28 03:56:41 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 80
Jan 28 03:56:42 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 90
Jan 28 03:56:43 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 100
Jan 28 03:56:43 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retried 100 times
Jan 28 03:56:43 tvr-pve-09 pmxcfs[3208]: [status] crit: cpg_send_message failed: 6
Jan 28 03:56:44 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 10
Jan 28 03:56:45 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 20
Jan 28 03:56:46 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 30
Jan 28 03:56:47 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 40
Jan 28 03:56:48 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 50
Jan 28 03:56:49 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 60
Jan 28 03:56:50 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 70
Jan 28 03:56:51 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 80
Jan 28 03:56:52 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 90
Jan 28 03:56:53 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 100
Jan 28 03:56:53 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retried 100 times
Jan 28 03:56:53 tvr-pve-09 pmxcfs[3208]: [status] crit: cpg_send_message failed: 6
Jan 28 03:56:53 tvr-pve-09 pve-ha-lrm[3526]: lost lock 'ha_agent_tvr-pve-09_lock - cfs lock update failed - Permission denied
Jan 28 03:56:54 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 10
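In case it helps with going through the full logs, here is a rough Python sketch that counts the events visible in the excerpt above: new corosync memberships, quorum reports where the node sees only itself, failed `cpg_send_message` calls, and lost HA locks. The function name `summarize` and the counter keys are my own choices for illustration, and the regular expressions only assume the syslog line format shown above. It is not a Proxmox or corosync tool.

```python
import re
from collections import Counter

# Matches syslog-style lines such as:
# "Jan 28 03:56:33 tvr-pve-09 corosync[3318]: [QUORUM] Members[1]: 8"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\s"
    r"(?P<proc>[\w.-]+)\[\d+\]:\s(?P<msg>.*)$"
)
# Captures the member count from "[QUORUM] Members[N]: ..."
MEMBERS_RE = re.compile(r"\[QUORUM\] Members\[(\d+)\]")


def summarize(lines):
    """Count triage-relevant events in an iterable of log lines."""
    stats = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a syslog-style line
        msg = match.group("msg")
        members = MEMBERS_RE.search(msg)
        if "A new membership" in msg:
            stats["membership_changes"] += 1
        elif members and int(members.group(1)) == 1:
            # The node reports a quorum view containing only itself.
            stats["isolated_quorum_reports"] += 1
        elif "cpg_send_message failed" in msg:
            stats["cpg_send_failures"] += 1
        elif "lost lock" in msg:
            stats["lost_ha_locks"] += 1
    return stats
```

It can be run over a saved log with something like `summarize(open("node.log"))`, where the file name `node.log` is only a placeholder. A high count of membership changes within a single second, as in the excerpt, is what I would look for first.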