PVE Full Mesh reconfiguration - Ceph (got timeout 500)

bbx1_

Hello everyone,

I have a 3-node PVE cluster that I use for testing and learning. On this cluster, I recently created a Ceph pool and had 4 VMs residing on it.

My PVE cluster was connected to my Mellanox 40GbE switch, but I wanted to explore reconfiguring it as a full mesh, which I did by cabling each server directly to the other two with 40GbE.

I followed the guide below and used the manual method with FRR, since I am still on PVE 8.4.
https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#Routed_Setup_(with_fallback)

Since doing so, I cannot access my Ceph pool or any Ceph settings. Ceph was configured with a public network of 192.168.161.0/24, which is the same network as my 40GbE connection.

[screenshot: Ceph GUI failing with "got timeout (500)"]


I have tried turning off the firewall on the nodes and disabling ebtables in the datacenter options, but that didn't help.


I don't want to undo my cabling, and I'd like to see if I can possibly get it to work as it is. Losing the VMs won't impact me since I have copies of them on PBS, but I'd like to see if I can salvage my Ceph configuration.

I know the Full Mesh documentation suggests setting up Ceph afterwards, but there was no mention of any impact on Ceph if it is already configured before changing to a full mesh configuration.

I have gone through the documentation again and, by the looks of it, my configuration in /etc/frr/frr.conf and /etc/network/interfaces is correct.
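
For anyone comparing along, the frr.conf from the guide has roughly this shape. The interface names are the ones from my nodes, but the hostname, loopback address, and NET below are placeholders rather than my literal values:

Code:
# sketch of the guide's OpenFabric frr.conf -- hostname, loopback
# address and NET are placeholders, not literal values
frr defaults traditional
hostname MPD-MDF-PVE01-5079
log syslog warning
ip forwarding
no ipv6 forwarding
service integrated-vtysh-config
!
interface lo
 ip address 192.168.161.1/32
 ip router openfabric 1
 openfabric passive
!
interface enp1s0
 ip router openfabric 1
!
interface enp1s0d1
 ip router openfabric 1
!
line vty
!
router openfabric 1
 net 49.0001.1111.1111.1111.00
 lsp-gen-interval 1
 max-lsp-lifetime 600
 lsp-refresh-interval 180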

When I run show openfabric route, I do see my servers listed here:
[screenshot: openfabric route table listing all three nodes]
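
(For reference, that's run through vtysh, the same way as the other FRR commands later in this thread:)

Code:
vtysh -c 'show openfabric route'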



Has anybody run into this issue before, or know what I can look at to fix it?
 

Can you please post the full FRR / ifupdown2 configuration from all 3 nodes, as well as the output of the following commands:

Code:
ip a
ip r
 
The IP address on the loopback interface (in the FRR config) should be /32, not /24.
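
The reason: with a /24 on lo, the kernel installs a connected route for the entire subnet via the loopback, so packets for the other nodes never leave the box. A quick way to see it (address assumed):

Code:
# with 192.168.161.1/24 on lo you would see something like:
#   192.168.161.0/24 dev lo proto kernel scope link src 192.168.161.1
# with /32 only the host route exists, and FRR's openfabric routes win
ip -4 route show dev lo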

If that still doesn't work could you also post the output of the following commands:

Code:
vtysh -c 'show openfabric neighbor'
vtysh -c 'show openfabric interface'
 
Hello @Franky779,

@beisser is pointing in the right direction. The error "Failed to prepare disks for restore" during a HotAdd operation strongly suggests that the Veeam transport components on your Proxmox hosts are not compatible with PVE 9. You will need to install the latest Veeam plugin that officially supports Proxmox VE 9, as mentioned in the link.
also wrong thread ;)
 
The IP address on the loopback interface (in the FRR config) should be /32, not /24.

If that still doesn't work could you also post the output of the following commands:

Code:
vtysh -c 'show openfabric neighbor'
vtysh -c 'show openfabric interface'


I will try that and report back. I did see the /32, but I assumed it was a config preference within the documentation.

Edit: Adjusting to /32 didn't help.
On my first PVE host, if I run the command as you mentioned, I get this:

Code:
root@MPD-MDF-PVE01-5079:~# vtysh -c 'show openfabric neighbor'
Exiting: failed to connect to any daemons.

Checking the status of frr.service shows:

Code:
root@MPD-MDF-PVE01-5079:~# systemctl status frr.service
○ frr.service - FRRouting
     Loaded: loaded (/lib/systemd/system/frr.service; disabled; preset: enabled)
     Active: inactive (dead)
       Docs: https://frrouting.readthedocs.io/en/latest/setup.html
root@MPD-MDF-PVE01-5079:~#

I started the service manually with:

Code:
systemctl start frr.service


PVE01:

Code:
root@MPD-MDF-PVE01-5079:~# vtysh -c 'show openfabric neighbor'
Area 1:
 System Id            Interface  L  State  Holdtime  SNPA
 MPD-MDF-PVE02-5080   enp1s0d1   2  Up     2         2020.2020.2020
 MPD-MDF-PVE03-5081   enp1s0     2  Up     2         2020.2020.2020
root@MPD-MDF-PVE01-5079:~# vtysh -c 'show openfabric interface'
Area 1:
 Interface  CircId  State  Type      Level
 lo         0x0     Up     loopback  L2
 enp1s0d1   0x0     Up     p2p       L2
 enp1s0     0x0     Up     p2p       L2


PVE02:

Code:
root@MPD-MDF-PVE02-5080:~# vtysh -c 'show openfabric neighbor'
Area 1:
 System Id            Interface  L  State  Holdtime  SNPA
 MPD-MDF-PVE03-5081   enp1s0d1   2  Up     2         2020.2020.2020
 MPD-MDF-PVE01-5079   enp1s0     2  Up     2         2020.2020.2020
root@MPD-MDF-PVE02-5080:~# vtysh -c 'show openfabric interface'
Area 1:
 Interface  CircId  State  Type      Level
 lo         0x0     Up     loopback  L2
 enp1s0d1   0x0     Up     p2p       L2
 enp1s0     0x0     Up     p2p       L2


PVE03:

Code:
root@MPD-MDF-PVE03-5081:~# vtysh -c 'show openfabric neighbor'
Area 1:
 System Id            Interface  L  State  Holdtime  SNPA
 MPD-MDF-PVE01-5079   enp1s0d1   2  Up     2         2020.2020.2020
 MPD-MDF-PVE02-5080   enp1s0     2  Up     2         2020.2020.2020
root@MPD-MDF-PVE03-5081:~# vtysh -c 'show openfabric interface'
Area 1:
 Interface  CircId  State  Type      Level
 lo         0x0     Up     loopback  L2
 enp1s0d1   0x0     Up     p2p       L2
 enp1s0     0x0     Up     p2p       L2



After modifying the frr config to change the IP from /24 to /32, I rebooted each PVE node, which is why I had to start the frr service manually afterwards.
 
Hi @bbx1_,

Your output shows the core problem: the frr.service was inactive.

The status output also shows the service is disabled (Loaded: ...; disabled; ...), which is why it did not start automatically after you rebooted the nodes.
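
You can confirm this directly:

Code:
systemctl is-enabled frr.service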

You should enable the service to start on boot on all nodes:

Code:
systemctl enable frr.service

Now that the service is running, can you please check the output of ip r again to confirm that the routes are present? If they are, your Ceph GUI should also be accessible now.
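
The openfabric routes to the other nodes' loopbacks should look roughly like this (addresses assumed, exact format varies with the FRR version):

Code:
# illustrative -- the key part is a "proto openfabric" route per peer
192.168.161.2 dev enp1s0d1 proto openfabric metric 20 onlink
192.168.161.3 dev enp1s0 proto openfabric metric 20 onlink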
 


Thank you,
I was starting the service manually after each reboot, but that must have been the issue. Once I set it to start on boot as you suggested, everything came up fine.

I believe I was previously able to select the 40GbE network as the migration network, but it no longer appears as an option under Datacenter view --> Migration Settings.
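
(For reference, the migration network can also be set directly in /etc/pve/datacenter.cfg -- a sketch, with the CIDR assumed from the 40GbE mesh above:)

Code:
# /etc/pve/datacenter.cfg
# "secure" keeps migration traffic tunneled; network selects the mesh
migration: secure,network=192.168.161.0/24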

I'll take a stab at it and say that using my 40GbE mesh network for migrations would impact performance/latency, since migration traffic would share the links with Ceph.

Full mesh is a great option for some of the remote sites where I plan to deploy PVE sometime next year.

Next thing I will handle is migrating from PVE8 to 9.

I appreciate all of your support, thank you!