Our system was stable for the last few months, but after upgrading from 7 to 7.1-8 we get 3-4 random crashes every day.
(We had an issue with corosync stability two months ago, but after replacing the switch the cluster worked very well under high load, without any issues.)
This Monday I took advantage of a power outage (building maintenance) and upgraded the cluster (11 servers, 4 of them with Ceph).
After the upgrade and dist-upgrade I rebooted all servers. Everything seems to be working and stable, except for random crashes that trigger a full cluster reboot.
I have attached the corosync logs, and I can attach any other logs needed to understand the issue.
I have checked the switch logs: no issues at all, it is running fine.
Before the crash the whole cluster and all services work well: Ceph works (100% of OSDs are up, 100% of monitors are up), the connection to PBS works, and all LXC containers and VMs are up and running. We are still trying to figure out what scenario causes it, but the problem is that a few of the crashes happened with no load at all, while everything was idle.
The errors started at around 15:43:59 (second log line).
During this cluster failure the node pve-ws2 did not crash, so I have attached its logs:
corosync syslog:
Dec 22 15:01:06 pve-ws2 corosync[1156]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 22 15:43:59 pve-ws2 corosync[1156]: [KNET ] link: host: 1 link: 0 is down
Dec 22 15:43:59 pve-ws2 corosync[1156]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 22 15:43:59 pve-ws2 corosync[1156]: [KNET ] host: host: 1 has no active links
Dec 22 15:44:02 pve-ws2 corosync[1156]: [TOTEM ] Token has not been received in 6637 ms
Dec 22 15:44:04 pve-ws2 corosync[1156]: [TOTEM ] A processor failed, forming new configuration: token timed out (8850ms), waiting 10620ms for consensus.
Dec 22 15:44:15 pve-ws2 corosync[1156]: [QUORUM] Sync members[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:15 pve-ws2 corosync[1156]: [QUORUM] Sync joined[3]: 3 4 6
Dec 22 15:44:15 pve-ws2 corosync[1156]: [QUORUM] Sync left[5]: 1 3 4 5 6
Dec 22 15:44:15 pve-ws2 corosync[1156]: [TOTEM ] A new membership (2.46ec) was formed. Members joined: 3 4 6 left: 1 3 4 5 6
Dec 22 15:44:15 pve-ws2 corosync[1156]: [TOTEM ] Failed to receive the leave message. failed: 1 3 4 5 6
Dec 22 15:44:15 pve-ws2 corosync[1156]: [TOTEM ] Retransmit List: 1
Dec 22 15:44:18 pve-ws2 corosync[1156]: [KNET ] rx: host: 1 link: 0 is up
Dec 22 15:44:18 pve-ws2 corosync[1156]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 22 15:44:26 pve-ws2 corosync[1156]: [QUORUM] Sync members[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:26 pve-ws2 corosync[1156]: [QUORUM] Sync joined[2]: 3 4
Dec 22 15:44:26 pve-ws2 corosync[1156]: [QUORUM] Sync left[4]: 1 3 4 5
Dec 22 15:44:26 pve-ws2 corosync[1156]: [TOTEM ] A new membership (2.46f4) was formed. Members joined: 3 4 left: 3 4
Dec 22 15:44:26 pve-ws2 corosync[1156]: [TOTEM ] Failed to receive the leave message. failed: 3 4
Dec 22 15:44:26 pve-ws2 corosync[1156]: [QUORUM] Sync members[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:26 pve-ws2 corosync[1156]: [QUORUM] Sync joined[8]: 3 4 6 7 8 9 10 11
Dec 22 15:44:26 pve-ws2 corosync[1156]: [QUORUM] Sync left[10]: 1 3 4 5 6 7 8 9 10 11
Dec 22 15:44:26 pve-ws2 corosync[1156]: [TOTEM ] A new membership (2.46f8) was formed. Members joined: 3 4 6 7 8 9 10 11 left: 3 4 6 7 8 9 10 11
Dec 22 15:44:26 pve-ws2 corosync[1156]: [TOTEM ] Failed to receive the leave message. failed: 3 4 6 7 8 9 10 11
Dec 22 15:44:37 pve-ws2 corosync[1156]: [QUORUM] Sync members[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:37 pve-ws2 corosync[1156]: [TOTEM ] A new membership (2.46fc) was formed. Members
Dec 22 15:44:47 pve-ws2 corosync[1156]: [QUORUM] Sync members[9]: 2 3 4 6 7 8 9 10 11
Dec 22 15:44:47 pve-ws2 corosync[1156]: [TOTEM ] A new membership (2.4700) was formed. Members
Dec 22 15:45:01 pve-ws2 corosync[1156]: [KNET ] link: host: 10 link: 0 is down
Dec 22 15:45:01 pve-ws2 corosync[1156]: [KNET ] link: host: 9 link: 0 is down
Dec 22 15:45:01 pve-ws2 corosync[1156]: [KNET ] link: host: 7 link: 0 is down
Dec 22 15:45:01 pve-ws2 corosync[1156]: [KNET ] link: host: 1 link: 0 is down
Dec 22 15:45:01 pve-ws2 corosync[1156]: [KNET ] link: host: 5 link: 0 is down
Dec 22 15:45:01 pve-ws2 corosync[1156]: [KNET ] host: host: 10 (passive) best link: 0 (pri: 1)
Dec 22 15:45:01 pve-ws2 corosync[1156]: [KNET ] host: host: 10 has no active links
Dec 22 15:45:01 pve-ws2 corosync[1156]: [KNET ] host: host: 9 (passive) best link: 0 (pri: 1)
Dec 22 15:45:01 pve-ws2 corosync[1156]: [KNET ] host: host: 9 has no active links
Dec 22 15:45:01 pve-ws2 corosync[1156]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Dec 22 15:45:01 pve-ws2 corosync[1156]: [KNET ] host: host: 7 has no active links
Dec 22 15:45:01 pve-ws2 corosync[1156]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 22 15:45:01 pve-ws2 corosync[1156]: [KNET ] host: host: 1 has no active links
Dec 22 15:45:01 pve-ws2 corosync[1156]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Dec 22 15:45:01 pve-ws2 corosync[1156]: [KNET ] host: host: 5 has no active links
Dec 22 15:45:04 pve-ws2 corosync[1156]: [KNET ] link: host: 11 link: 0 is down
Dec 22 15:45:04 pve-ws2 corosync[1156]: [KNET ] link: host: 8 link: 0 is down
Dec 22 15:45:04 pve-ws2 corosync[1156]: [KNET ] link: host: 4 link: 0 is down
Dec 22 15:45:04 pve-ws2 corosync[1156]: [KNET ] host: host: 11 (passive) best link: 0 (pri: 1)
Dec 22 15:45:04 pve-ws2 corosync[1156]: [KNET ] host: host: 11 has no active links
Dec 22 15:45:04 pve-ws2 corosync[1156]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
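
As a sanity check on the timeouts in the log above (token timed out at 8850ms, consensus wait of 10620ms, warning at 6637 ms), here is a minimal sketch of the arithmetic. It assumes corosync defaults that I have not confirmed against our corosync.conf: a base token of 3000 ms, a token_coefficient of 650 ms per node beyond the second, consensus at 1.2 times the token timeout, and the "Token has not been received" warning at 75% of it. The function names are made up for illustration.

```python
# Sketch only: the defaults below are assumptions about corosync 3.x,
# not values read from this cluster's corosync.conf.

def effective_token_ms(nodes, token=3000, coefficient=650):
    """Token timeout scaled by cluster size (assumed corosync formula)."""
    if nodes <= 2:
        return token
    return token + (nodes - 2) * coefficient


def consensus_ms(token_ms):
    """Consensus timeout, assumed to default to 1.2 x the token timeout."""
    return token_ms * 12 // 10  # integer math avoids float rounding


def warning_ms(token_ms, percent=75):
    """Point at which the 'Token has not been received' warning is assumed to fire."""
    return token_ms * percent // 100


if __name__ == "__main__":
    token = effective_token_ms(11)  # 11 servers in the cluster
    print(token, consensus_ms(token), warning_ms(token))
```

With 11 nodes this gives the same three numbers that appear in the log, which suggests the cluster is running with default totem timings rather than custom ones.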