Ceph Routed Setup (with Fallback) - time outs

amarsalek

Oct 3, 2022
Hi,

I tried to configure a Ceph routed setup with fallback according to this post: Routed Setup (with Fallback).

Everything seems to work and the status is OK, but `journalctl -u frr` shows a lot of timeouts:
Code:
Oct 03 13:52:02 host3 fabricd[2563]: [NT6J7-1RYRF] OpenFabric: Initial synchronization on enp1s0f1np1 timed out!
Oct 03 13:52:02 host3 fabricd[2563]: [R18GA-MS9R7] OpenFabric: Started initial synchronization with 1111.1111.1111 on enp1s0f0np0
Oct 03 13:52:06 host3 fabricd[2563]: [NT6J7-1RYRF] OpenFabric: Initial synchronization on enp1s0f0np0 timed out!
Oct 03 13:52:06 host3 fabricd[2563]: [R18GA-MS9R7] OpenFabric: Started initial synchronization with 2222.2222.2222 on enp1s0f1np1
Oct 03 13:52:10 host3 fabricd[2563]: [NT6J7-1RYRF] OpenFabric: Initial synchronization on enp1s0f1np1 timed out!
Oct 03 13:52:11 host3 fabricd[2563]: [R18GA-MS9R7] OpenFabric: Started initial synchronization with 1111.1111.1111 on enp1s0f0np0
Oct 03 13:52:15 host3 fabricd[2563]: [NT6J7-1RYRF] OpenFabric: Initial synchronization on enp1s0f0np0 timed out!
Oct 03 13:52:15 host3 fabricd[2563]: [R18GA-MS9R7] OpenFabric: Started initial synchronization with 2222.2222.2222 on enp1s0f1np1
Oct 03 13:52:19 host3 fabricd[2563]: [NT6J7-1RYRF] OpenFabric: Initial synchronization on enp1s0f1np1 timed out!
Oct 03 13:52:19 host3 fabricd[2563]: [R18GA-MS9R7] OpenFabric: Started initial synchronization with 1111.1111.1111 on enp1s0f0np0
Oct 03 13:52:23 host3 fabricd[2563]: [NT6J7-1RYRF] OpenFabric: Initial synchronization on enp1s0f0np0 timed out!
Oct 03 13:52:23 host3 fabricd[2563]: [R18GA-MS9R7] OpenFabric: Started initial synchronization with 2222.2222.2222 on enp1s0f1np1
Oct 03 13:52:27 host3 fabricd[2563]: [NT6J7-1RYRF] OpenFabric: Initial synchronization on enp1s0f1np1 timed out!
Oct 03 13:52:28 host3 fabricd[2563]: [R18GA-MS9R7] OpenFabric: Started initial synchronization with 1111.1111.1111 on enp1s0f0np0
Oct 03 13:52:32 host3 fabricd[2563]: [NT6J7-1RYRF] OpenFabric: Initial synchronization on enp1s0f0np0 timed out!
Oct 03 13:52:33 host3 fabricd[2563]: [R18GA-MS9R7] OpenFabric: Started initial synchronization with 2222.2222.2222 on enp1s0f1np1

Why does the initial synchronization time out?

Thanks!
 
Hi @amarsalek,

We are experiencing the same issue on a completely new install of Proxmox Virtual Environment 7.2-3.

Did you manage to find a solution to the problem?

Code:
Oct 14 17:09:13 pve39 fabricd[1579]: OpenFabric: Started initial synchronization with 3333.3333.3333 on enp130s0f1
Oct 14 17:09:17 pve39 fabricd[1579]: OpenFabric: Initial synchronization on enp130s0f1 timed out!
Oct 14 17:09:18 pve39 fabricd[1579]: OpenFabric: Started initial synchronization with 2222.2222.2222 on enp130s0f0
Oct 14 17:09:22 pve39 fabricd[1579]: OpenFabric: Initial synchronization on enp130s0f0 timed out!
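
To quantify how long each attempt runs before it gives up, the "Started" and "timed out" lines can be paired per interface. The following is only a rough sketch in Python, not part of FRR or Proxmox: the function name `sync_durations`, the regular expression, and the `year` parameter (the short journal format omits the year) are assumptions made for illustration, and month-name parsing assumes an English/C locale.

```python
import re
from datetime import datetime

# Matches both log variants seen in this thread
# (with or without the bracketed message ID before "OpenFabric:").
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ fabricd\[\d+\]: "
    r"(?:\[[A-Z0-9-]+\] )?OpenFabric: "
    r"(?:Started initial synchronization with (?P<peer>\S+) on (?P<start_if>\S+)"
    r"|Initial synchronization on (?P<timeout_if>\S+) timed out!)$"
)


def sync_durations(lines, year=2022):
    """Return (interface, peer, seconds) for each start followed by a timeout."""
    started = {}  # interface -> (start time, peer system ID)
    results = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # ignore unrelated journal lines
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        if m["start_if"]:
            started[m["start_if"]] = (ts, m["peer"])
        elif m["timeout_if"] in started:
            begin, peer = started.pop(m["timeout_if"])
            results.append((m["timeout_if"], peer, (ts - begin).total_seconds()))
    return results


# Lines copied from the two log excerpts posted in this thread.
SAMPLE = """\
Oct 03 13:52:02 host3 fabricd[2563]: [R18GA-MS9R7] OpenFabric: Started initial synchronization with 1111.1111.1111 on enp1s0f0np0
Oct 03 13:52:06 host3 fabricd[2563]: [NT6J7-1RYRF] OpenFabric: Initial synchronization on enp1s0f0np0 timed out!
Oct 14 17:09:13 pve39 fabricd[1579]: OpenFabric: Started initial synchronization with 3333.3333.3333 on enp130s0f1
Oct 14 17:09:17 pve39 fabricd[1579]: OpenFabric: Initial synchronization on enp130s0f1 timed out!
""".splitlines()

for iface, peer, secs in sync_durations(SAMPLE):
    print(f"{iface} -> {peer}: timed out after {secs:.0f}s")
```

On the excerpts posted here, every attempt is reported as timed out about 4 seconds after it started, on both hosts. Feeding the script the output of `journalctl -u frr` saved to a file should show whether that gap changes after the timers are adjusted.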

We are able to ping the different interfaces by IP over the direct fiber network:
Code:
root@hostname:~# ping 10.15.15.50
PING 10.15.15.50 (10.15.15.50) 56(84) bytes of data.
64 bytes from 10.15.15.50: icmp_seq=1 ttl=64 time=0.258 ms
64 bytes from 10.15.15.50: icmp_seq=2 ttl=64 time=0.162 ms
64 bytes from 10.15.15.50: icmp_seq=3 ttl=64 time=0.128 ms
64 bytes from 10.15.15.50: icmp_seq=4 ttl=64 time=0.155 ms
--- 10.15.15.50 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3067ms
rtt min/avg/max/mdev = 0.128/0.175/0.258/0.049 ms

root@hostname:~# ping 10.15.15.52
PING 10.15.15.52 (10.15.15.52) 56(84) bytes of data.
64 bytes from 10.15.15.52: icmp_seq=1 ttl=64 time=0.283 ms
64 bytes from 10.15.15.52: icmp_seq=2 ttl=64 time=0.189 ms
64 bytes from 10.15.15.52: icmp_seq=3 ttl=64 time=0.162 ms
64 bytes from 10.15.15.52: icmp_seq=4 ttl=64 time=0.164 ms
--- 10.15.15.52 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3062ms
rtt min/avg/max/mdev = 0.162/0.199/0.283/0.049 ms


Thanks!
 
I have never used the OpenFabric daemon, but "frr defaults traditional" could be replaced by "frr defaults datacenter" to lower the default timeout values (and adjust other internal tuning knobs).

I don't think it will help here, but in case of a link disconnect it should improve the failover speed a lot.
 
I'm seeing the same problem on 7.3.3. I haven't been able to find any related information online.

Changing frr defaults to datacenter did not get rid of the messages.

I'm using Mellanox ConnectX-3 cards in 56GbE mode in case it's relevant.

I looked at the relevant code in https://github.com/FRRouting/frr/blob/master/isisd/fabricd.c#L259 and saw that the log messages are for debugging purposes. As a workaround, I changed the log verbosity in /etc/frr/frr.conf to:
Code:
log syslog warning
 
Have you tried increasing the timer settings a little?

These are set per interface:
- csnp-interval
- hello-interval
- hello-multiplier

In the router section:
- lsp-gen-interval

They are currently (if you copied it from the guide) set to the lowest possible value to get very short switch-over times, but maybe they are a bit too aggressive in some situations.

Details about the parameters and possible values can be found in the FRR documentation.

I cannot reproduce those logs in my test lab setups.
 
I agree with the `log syslog warning` workaround above, and would go as far as to say that the FRR log level suggested in the Proxmox wiki's /etc/frr/frr.conf should be changed to warning rather than informational.
 
@aaron Thank you very much for your response! I can confirm that increasing the values in `/etc/frr/frr.conf` fixes the timeouts. In my case it was sufficient to add just one second to each of them (I didn't bother troubleshooting each setting individually), so I now have:

```
interface XXX
...
openfabric csnp-interval 3
openfabric hello-interval 2
openfabric hello-multiplier 2

router YYY
...
lsp-gen-interval 2
...
```
 
Thanks for the feedback, @gmpreussner. I'll add a hint to the guide.