Little update on a second ring.
I now have:
eno2 - VLAN for ring0 (10 Gbit interface to the OVH vRack)
eno1 - bridge via tinc for ring1
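For reference, a minimal sketch of what this two-link setup looks like in corosync.conf (corosync 3 / knet syntax; addresses and the node block are made up for illustration, only h3 matches the log below; link 0 on the vRack VLAN, link 1 on the tinc bridge):
Code:
totem {
    cluster_name: mycluster
    version: 2
    link_mode: passive          # matches the "(passive)" in the knet log lines
    interface {
        linknumber: 0           # ring0 - VLAN on eno2 (OVH vRack)
    }
    interface {
        linknumber: 1           # ring1 - tinc bridge on eno1
    }
}

nodelist {
    node {
        name: h3
        nodeid: 3
        quorum_votes: 1
        ring0_addr: 10.0.0.3        # hypothetical vRack VLAN address
        ring1_addr: 192.168.100.3   # hypothetical tinc bridge address
    }
    # ... one node {} block per cluster member ...
}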
While most nodes report no errors, one particular problem node does report link downs, sometimes even on both links, at least towards some hosts:
Code:
Sep 12 14:06:49 h3 corosync[19797]: [KNET ] link: host: 4 link: 0 is down
Sep 12 14:06:49 h3 corosync[19797]: [KNET ] link: host: 1 link: 0 is down
Sep 12 14:06:49 h3 corosync[19797]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Sep 12 14:06:49 h3 corosync[19797]: [KNET ] host: host: 4 has no active links
Sep 12 14:06:49 h3 corosync[19797]: [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Sep 12 14:06:50 h3 corosync[19797]: [KNET ] rx: host: 4 link: 0 is up
Sep 12 14:06:50 h3 corosync[19797]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
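A quick way to see the local per-link state on the affected node while this is happening is corosync-cfgtool, which ships with corosync:
Code:
# run on the node logging the errors (h3 here); with corosync 3 / knet
# this prints the status of every configured link from the local view
corosync-cfgtool -s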
Now for the strange part:
Granted, the cause could very well be OVH and their much-praised virtual network, but the way those errors are handled is weird.
With only one ring, a link down would freeze the whole cluster and it would not recover (manual restart required).
So you would expect that if both rings go down, the result should be the same.
Well, it isn't. It recovers immediately and no freezes were noticed (maybe there are some, but then only for a second, since everything recovers immediately anyway).
So to summarize, there are 3 basic issues to investigate:
- the cause of the problem (maybe driver, settings or network related, possibly different per node, maybe a corosync bug)
- corosync's handling of the issue (the symptom) and its ability to recover or not
- even if corosync can recover via the second link, the question remains - especially for HA users - whether those short outages have consequences aside from spamming the log (see the totem sketch below)
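On that last point, the knob that decides how long a flap may last before the cluster reacts is the totem token timeout. Purely as an illustration (the values below are made up, not a recommendation - raising them trades flap tolerance against slower detection of real failures, which matters a lot for HA and fencing):
Code:
# corosync.conf, totem section - hypothetical values for illustration only
totem {
    token: 10000                              # ms before token loss is declared
    token_retransmits_before_loss_const: 10   # retransmits before giving up
}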
Edit: I will test the behaviour over the weekend by replacing tinc with a simple 2nd VLAN on eno2 (sketch below). In theory this should give the same result as with only one ring, because by all logic, if there is any fault in the network chain (driver - interface - cabling - switches - whatever), it should affect both rings at the very same time and act as if there were only one ring.
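Roughly what I have in mind for the test (hypothetical VLAN tag and address, plain Debian/Proxmox /etc/network/interfaces syntax):
Code:
# second VLAN on the same physical NIC, to be used as ring1 instead of tinc
auto eno2.4001
iface eno2.4001 inet static
    address 192.168.101.3/24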
I'm starting to think that tinc may be uniquely able to mitigate the errors because of its meshed nature,
or that corosync error-corrects differently with a second ring.
Of course this would only be a workaround for the root cause, but the root cause seems to be very different for most people.