Hello,
I have a strange problem that happens sometimes, but not always.
We reboot a node for whatever reason (hardware maintenance, for example).
When the node comes back online, all the nodes that run VMs with HA enabled get rebooted.
This has also happened before when we added a node to the cluster.
It is a cluster of 9 nodes.
3 nodes only have VMs on local storage, so HA is not enabled there; those servers are not rebooted.
I have tried searching the logs of the rebooted node, but I cannot find anything specific.
Things I found or think might be the issue:
- Spanning Tree (maybe a short disconnect when the host comes online, hello time)
- LACP, we use LACP on all the nodes
Maybe this tells us something: this is the part of the corosync log where I first get all the members and then it reports the links going down.
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 10 link: 0 from 469 to 1397
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 5 link: 0 from 469 to 1397
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 7 link: 0 from 469 to 1397
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: Global data MTU changed to: 1397
Jan 22 17:15:25 IDC-PVE002 corosync[1437]: [KNET ] rx: host: 9 link: 0 is up
Jan 22 17:15:25 IDC-PVE002 corosync[1437]: [KNET ] host: host: 9 (passive) best link: 0 (pri: 1)
Jan 22 17:15:25 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 9 link: 0 from 469 to 1397
Jan 22 17:15:25 IDC-PVE002 corosync[1437]: [QUORUM] Sync members[9]: 1 2 3 4 5 6 7 9 10
Jan 22 17:15:25 IDC-PVE002 corosync[1437]: [QUORUM] Sync joined[8]: 1 2 3 4 5 7 9 10
Jan 22 17:15:25 IDC-PVE002 corosync[1437]: [TOTEM ] A new membership (1.25b4) was formed. Members joined: 1 2 3 4 5 7 9 10
Jan 22 17:15:52 IDC-PVE002 corosync[1437]: [TOTEM ] Token has not been received in 5662 ms
Jan 22 17:15:59 IDC-PVE002 corosync[1437]: [KNET ] link: host: 1 link: 0 is down
Jan 22 17:15:59 IDC-PVE002 corosync[1437]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 22 17:15:59 IDC-PVE002 corosync[1437]: [KNET ] host: host: 1 has no active links
Jan 22 17:16:11 IDC-PVE002 corosync[1437]: [QUORUM] Sync members[3]: 3 4 6
Jan 22 17:16:11 IDC-PVE002 corosync[1437]: [QUORUM] Sync joined[2]: 3 4
Jan 22 17:16:11 IDC-PVE002 corosync[1437]: [TOTEM ] A new membership (3.25b8) was formed. Members left: 1 2 5 7 9 10
Jan 22 17:16:11 IDC-PVE002 corosync[1437]: [TOTEM ] Failed to receive the leave message. failed: 1 2 5 7 9 10
Jan 22 17:16:11 IDC-PVE002 corosync[1437]: [QUORUM] Members[3]: 3 4 6
Jan 22 17:16:11 IDC-PVE002 corosync[1437]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 22 17:16:27 IDC-PVE002 corosync[1437]: [KNET ] link: host: 9 link: 0 is down
Jan 22 17:16:27 IDC-PVE002 corosync[1437]: [KNET ] host: host: 9 (passive) best link: 0 (pri: 1)
Jan 22 17:16:27 IDC-PVE002 corosync[1437]: [KNET ] host: host: 9 has no active links
Jan 22 17:16:27 IDC-PVE002 corosync[1437]: [KNET ] link: host: 10 link: 0 is down
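To make the membership change in the log above easier to follow, here is a minimal Python sketch that pulls the [QUORUM] membership lines out of a corosync log and checks each one against a simple majority. It assumes one vote per node and 9 expected votes (taken from the cluster size mentioned above, not verified against the actual corosync configuration); the function name `membership_events` is only illustrative.

```python
import re

# Matches "[QUORUM] Sync members[N]: ..." and "[QUORUM] Members[N]: ...",
# but not "Sync joined[N]: ..." lines.
MEMBERS_RE = re.compile(
    r"^(\w{3}\s+\d+\s+[\d:]+).*\[QUORUM\s*\]\s+(?:Sync members|Members)\[\d+\]:\s*([\d ]+)"
)


def membership_events(lines, expected_votes=9):
    """Return (timestamp, member_ids, quorate) for each membership line.

    Assumption: every node has exactly one vote, so quorum is a simple
    majority of expected_votes.
    """
    needed = expected_votes // 2 + 1
    events = []
    for line in lines:
        match = MEMBERS_RE.search(line)
        if not match:
            continue
        members = [int(node_id) for node_id in match.group(2).split()]
        events.append((match.group(1), members, len(members) >= needed))
    return events


if __name__ == "__main__":
    sample = [
        "Jan 22 17:15:25 IDC-PVE002 corosync[1437]: [QUORUM] Sync members[9]: 1 2 3 4 5 6 7 9 10",
        "Jan 22 17:15:25 IDC-PVE002 corosync[1437]: [QUORUM] Sync joined[8]: 1 2 3 4 5 7 9 10",
        "Jan 22 17:16:11 IDC-PVE002 corosync[1437]: [QUORUM] Members[3]: 3 4 6",
    ]
    for timestamp, members, quorate in membership_events(sample):
        print(timestamp, members, "quorate" if quorate else "NOT quorate")
```

Under these assumptions, the membership of 3 nodes (3, 4, 6) at 17:16:11 is below the 5 votes a 9-node cluster would need, while the membership of 9 at 17:15:25 is above it.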