Ceph Routed Setup (with Fallback) - timeouts

amarsalek

Oct 3, 2022
Hi,

I tried to configure a Ceph routed setup with fallback according to this post: Routed Setup (with Fallback).

Everything seems to work and the status is OK, but `journalctl -u frr` shows a lot of timeouts:
Code:
Oct 03 13:52:02 host3 fabricd[2563]: [NT6J7-1RYRF] OpenFabric: Initial synchronization on enp1s0f1np1 timed out!
Oct 03 13:52:02 host3 fabricd[2563]: [R18GA-MS9R7] OpenFabric: Started initial synchronization with 1111.1111.1111 on enp1s0f0np0
Oct 03 13:52:06 host3 fabricd[2563]: [NT6J7-1RYRF] OpenFabric: Initial synchronization on enp1s0f0np0 timed out!
Oct 03 13:52:06 host3 fabricd[2563]: [R18GA-MS9R7] OpenFabric: Started initial synchronization with 2222.2222.2222 on enp1s0f1np1
Oct 03 13:52:10 host3 fabricd[2563]: [NT6J7-1RYRF] OpenFabric: Initial synchronization on enp1s0f1np1 timed out!
Oct 03 13:52:11 host3 fabricd[2563]: [R18GA-MS9R7] OpenFabric: Started initial synchronization with 1111.1111.1111 on enp1s0f0np0
Oct 03 13:52:15 host3 fabricd[2563]: [NT6J7-1RYRF] OpenFabric: Initial synchronization on enp1s0f0np0 timed out!
Oct 03 13:52:15 host3 fabricd[2563]: [R18GA-MS9R7] OpenFabric: Started initial synchronization with 2222.2222.2222 on enp1s0f1np1
Oct 03 13:52:19 host3 fabricd[2563]: [NT6J7-1RYRF] OpenFabric: Initial synchronization on enp1s0f1np1 timed out!
Oct 03 13:52:19 host3 fabricd[2563]: [R18GA-MS9R7] OpenFabric: Started initial synchronization with 1111.1111.1111 on enp1s0f0np0
Oct 03 13:52:23 host3 fabricd[2563]: [NT6J7-1RYRF] OpenFabric: Initial synchronization on enp1s0f0np0 timed out!
Oct 03 13:52:23 host3 fabricd[2563]: [R18GA-MS9R7] OpenFabric: Started initial synchronization with 2222.2222.2222 on enp1s0f1np1
Oct 03 13:52:27 host3 fabricd[2563]: [NT6J7-1RYRF] OpenFabric: Initial synchronization on enp1s0f1np1 timed out!
Oct 03 13:52:28 host3 fabricd[2563]: [R18GA-MS9R7] OpenFabric: Started initial synchronization with 1111.1111.1111 on enp1s0f0np0
Oct 03 13:52:32 host3 fabricd[2563]: [NT6J7-1RYRF] OpenFabric: Initial synchronization on enp1s0f0np0 timed out!
Oct 03 13:52:33 host3 fabricd[2563]: [R18GA-MS9R7] OpenFabric: Started initial synchronization with 2222.2222.2222 on enp1s0f1np1

Why does the initial synchronization time out?

Thanks!
 
Hi @amarsalek,

We are experiencing the same issue on a completely new install of Proxmox Virtual Environment 7.2-3.

Did you manage to find a solution to the problem?

Code:
Oct 14 17:09:13 pve39 fabricd[1579]: OpenFabric: Started initial synchronization with 3333.3333.3333 on enp130s0f1
Oct 14 17:09:17 pve39 fabricd[1579]: OpenFabric: Initial synchronization on enp130s0f1 timed out!
Oct 14 17:09:18 pve39 fabricd[1579]: OpenFabric: Started initial synchronization with 2222.2222.2222 on enp130s0f0
Oct 14 17:09:22 pve39 fabricd[1579]: OpenFabric: Initial synchronization on enp130s0f0 timed out!

We are able to ping the different interfaces using the IPs that go over the direct fiber network:
Code:
root@hostname:~# ping 10.15.15.50
PING 10.15.15.50 (10.15.15.50) 56(84) bytes of data.
64 bytes from 10.15.15.50: icmp_seq=1 ttl=64 time=0.258 ms
64 bytes from 10.15.15.50: icmp_seq=2 ttl=64 time=0.162 ms
64 bytes from 10.15.15.50: icmp_seq=3 ttl=64 time=0.128 ms
64 bytes from 10.15.15.50: icmp_seq=4 ttl=64 time=0.155 ms
--- 10.15.15.50 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3067ms
rtt min/avg/max/mdev = 0.128/0.175/0.258/0.049 ms

root@hostname:~# ping 10.15.15.52
PING 10.15.15.52 (10.15.15.52) 56(84) bytes of data.
64 bytes from 10.15.15.52: icmp_seq=1 ttl=64 time=0.283 ms
64 bytes from 10.15.15.52: icmp_seq=2 ttl=64 time=0.189 ms
64 bytes from 10.15.15.52: icmp_seq=3 ttl=64 time=0.162 ms
64 bytes from 10.15.15.52: icmp_seq=4 ttl=64 time=0.164 ms
--- 10.15.15.52 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3062ms
rtt min/avg/max/mdev = 0.162/0.199/0.283/0.049 ms


Thanks!
 
I have never used the OpenFabric daemon,

but "frr defaults traditional" could be replaced by "frr defaults datacenter" to lower the default timeout values (and adjust other internal tuning knobs).

I don't think it will help here, but in case of a link disconnect it should improve the failover speed a lot.
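
For anyone wanting to try this, here is a minimal sketch of the change. It assumes /etc/frr/frr.conf contains the "frr defaults traditional" line mentioned above; lines starting with `!` are comments in FRR configuration files.

```
! /etc/frr/frr.conf (excerpt)
! previously: frr defaults traditional
frr defaults datacenter
```

Restart the frr service afterwards so the new defaults profile is picked up.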
 
I'm seeing the same problem on 7.3.3. I haven't been able to find any related information online.

Changing frr defaults to datacenter did not get rid of the messages.

I'm using Mellanox ConnectX-3 cards in 56GbE mode in case it's relevant.

I looked at the relevant code in https://github.com/FRRouting/frr/blob/master/isisd/fabricd.c#L259 and saw that the log messages are for debugging purposes. As a workaround, I changed the log verbosity in /etc/frr/frr.conf to:
Code:
log syslog warning
 
Have you tried increasing the timer settings a little bit?

These are set for each interface:
- csnp-interval
- hello-interval
- hello-multiplier

In the router section:
- lsp-gen-interval

They are currently (if you copied it from the guide) set to the lowest possible value to get very short switch-over times, but maybe they are a bit too aggressive in some situations.

Details about the parameters and possible values can be found in the FRR documentation.

I cannot reproduce those logs in my test lab setups.
 
I agree with the `log syslog warning` workaround above, and would go as far as to say that the FRR log level suggested for /etc/frr/frr.conf in the Proxmox wiki should be changed to warning instead of informational.
 
@aaron Thank you very much for your response! I can confirm that increasing the values in `/etc/frr/frr.conf` fixes the timeouts. In my case it was sufficient to add just one second to each of them (I didn't bother troubleshooting each setting individually), so I have:

```
interface XXX
...
openfabric csnp-interval 3
openfabric hello-interval 2
openfabric hello-multiplier 2

router YYY
...
lsp-gen-interval 2
...
```
 
Thanks for the feedback, @gmpreussner. I'll add a hint in the guide.
 
