PVE Full Mesh reconfiguration - Ceph (got timeout 500)

bbx1_

Hello everyone,

I have a 3-node PVE cluster that I use for testing and learning. On this cluster, I recently created a Ceph pool and had 4 VMs residing on it.

My PVE cluster was connected to my Mellanox 40GbE switch, but I wanted to explore reconfiguring it as a full mesh, which I did by cabling each server directly to the other two with 40GbE.

I followed the guide below and used the manual method with FRR, since I am still on PVE 8.4.
https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#Routed_Setup_(with_fallback)

Since doing so, I cannot access my Ceph pool or any Ceph settings. Ceph was configured with a public network of 192.168.161.0/24, which is the same network as my 40GbE connection.

[screenshot: Ceph GUI failing with "got timeout (500)"]


I have tried turning off the firewall on the nodes and disabling ebtables in the datacenter options, but that didn't help.


I don't want to undo my cabling, and I'd like to see if I can possibly get it to work as it is. Losing the VMs won't impact me since I have copies of them on PBS, but I'd like to see if I can salvage my Ceph configuration.

I know the Full Mesh documentation suggests setting up Ceph afterwards, but there was no mention of any impact on Ceph if it is already configured before changing to a full mesh configuration.

I have gone through the documentation again and, by the looks of it, my configuration in /etc/frr/frr.conf and /etc/network/interfaces is correct.
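
For anyone comparing along, the frr.conf from the guide has roughly this shape. The interface names are the ones from my nodes, but the hostname, loopback address, and NET below are placeholders rather than my literal values:

Code:
# sketch of the guide's OpenFabric frr.conf -- hostname, loopback
# address and NET are placeholders, not literal values
frr defaults traditional
hostname MPD-MDF-PVE01-5079
log syslog warning
ip forwarding
no ipv6 forwarding
service integrated-vtysh-config
!
interface lo
 ip address 192.168.161.1/32
 ip router openfabric 1
 openfabric passive
!
interface enp1s0
 ip router openfabric 1
!
interface enp1s0d1
 ip router openfabric 1
!
line vty
!
router openfabric 1
 net 49.0001.1111.1111.1111.00
 lsp-gen-interval 1
 max-lsp-lifetime 600
 lsp-refresh-interval 180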

When I run show openfabric route, I do see my servers listed here:
[screenshot: openfabric route table listing all three nodes]
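
(For reference, that's run through vtysh, the same way as the other FRR commands later in this thread:)

Code:
vtysh -c 'show openfabric route'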



Has anybody run into this issue before, or know what I can look at to fix it?
 

Can you please post the full FRR / ifupdown2 configuration from all 3 nodes, as well as the output of the following commands:

Code:
ip a
ip r
 
The IP address on the loopback interface (in the FRR config) should be /32, not /24.
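
The reason: with a /24 on lo, the kernel installs a connected route for the entire subnet via the loopback, so packets for the other nodes never leave the box. A quick way to see it (address assumed):

Code:
# with 192.168.161.1/24 on lo you would see something like:
#   192.168.161.0/24 dev lo proto kernel scope link src 192.168.161.1
# with /32 only the host route exists, and FRR's openfabric routes win
ip -4 route show dev lo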

If that still doesn't work could you also post the output of the following commands:

Code:
vtysh -c 'show openfabric neighbor'
vtysh -c 'show openfabric interface'
 
Hello @Franky779,

@beisser is pointing in the right direction. The error "Failed to prepare disks for restore" during a HotAdd operation strongly suggests that the Veeam transport components on your Proxmox hosts are not compatible with PVE 9. You will need to install the latest Veeam plugin that officially supports Proxmox VE 9, as mentioned in the link.
also wrong thread ;)
 
The IP address on the loopback interface (in the FRR config) should be /32, not /24.

If that still doesn't work could you also post the output of the following commands:

Code:
vtysh -c 'show openfabric neighbor'
vtysh -c 'show openfabric interface'


I will try that and report back. I did see the /32, but I assumed it was a config preference within the documentation.

Edit: Adjusting to /32 didn't help.
On my first PVE host, if I run the command as you mentioned, I get this:

Code:
root@MPD-MDF-PVE01-5079:~# vtysh -c 'show openfabric neighbor'
Exiting: failed to connect to any daemons.

Checking the status of frr.service shows:

Code:
root@MPD-MDF-PVE01-5079:~# systemctl status frr.service
○ frr.service - FRRouting
     Loaded: loaded (/lib/systemd/system/frr.service; disabled; preset: enabled)
     Active: inactive (dead)
       Docs: https://frrouting.readthedocs.io/en/latest/setup.html
root@MPD-MDF-PVE01-5079:~#

I started the service manually with:

Code:
systemctl start frr.service


PVE01:

Code:
root@MPD-MDF-PVE01-5079:~# vtysh -c 'show openfabric neighbor'
Area 1:
 System Id            Interface  L  State  Holdtime  SNPA
 MPD-MDF-PVE02-5080   enp1s0d1   2  Up     2         2020.2020.2020
 MPD-MDF-PVE03-5081   enp1s0     2  Up     2         2020.2020.2020
root@MPD-MDF-PVE01-5079:~# vtysh -c 'show openfabric interface'
Area 1:
 Interface  CircId  State  Type      Level
 lo         0x0     Up     loopback  L2
 enp1s0d1   0x0     Up     p2p       L2
 enp1s0     0x0     Up     p2p       L2


PVE02:

Code:
root@MPD-MDF-PVE02-5080:~# vtysh -c 'show openfabric neighbor'
Area 1:
 System Id            Interface  L  State  Holdtime  SNPA
 MPD-MDF-PVE03-5081   enp1s0d1   2  Up     2         2020.2020.2020
 MPD-MDF-PVE01-5079   enp1s0     2  Up     2         2020.2020.2020
root@MPD-MDF-PVE02-5080:~# vtysh -c 'show openfabric interface'
Area 1:
 Interface  CircId  State  Type      Level
 lo         0x0     Up     loopback  L2
 enp1s0d1   0x0     Up     p2p       L2
 enp1s0     0x0     Up     p2p       L2


PVE03:

Code:
root@MPD-MDF-PVE03-5081:~# vtysh -c 'show openfabric neighbor'
Area 1:
 System Id            Interface  L  State  Holdtime  SNPA
 MPD-MDF-PVE01-5079   enp1s0d1   2  Up     2         2020.2020.2020
 MPD-MDF-PVE02-5080   enp1s0     2  Up     2         2020.2020.2020
root@MPD-MDF-PVE03-5081:~# vtysh -c 'show openfabric interface'
Area 1:
 Interface  CircId  State  Type      Level
 lo         0x0     Up     loopback  L2
 enp1s0d1   0x0     Up     p2p       L2
 enp1s0     0x0     Up     p2p       L2



After modifying the frr config to change the IP from /24 to /32, I rebooted each PVE node, which is why I had to start the frr service manually afterwards.
 
Hi @bbx1_,

Your output shows the core problem: the frr.service was inactive.

The status output also shows the service is disabled (Loaded: ...; disabled; ...), which is why it did not start automatically after you rebooted the nodes.
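
You can confirm this directly:

Code:
systemctl is-enabled frr.service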

You should enable the service to start on boot on all nodes:

Code:
systemctl enable frr.service

Now that the service is running, can you please check the output of ip r again to confirm that the routes are present? If they are, your Ceph GUI should also be accessible now.
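
The openfabric routes to the other nodes' loopbacks should look roughly like this (addresses assumed, exact format varies with the FRR version):

Code:
# illustrative -- the key part is a "proto openfabric" route per peer
192.168.161.2 dev enp1s0d1 proto openfabric metric 20 onlink
192.168.161.3 dev enp1s0 proto openfabric metric 20 onlink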
 


Thank you,
I was starting the service manually after each reboot, but that must have been the issue. Once I set it to start on boot as you suggested, everything came up fine.

I believe I was previously able to select the 40GbE network as the migration network, but it no longer appears as an option under Datacenter view --> Migration Settings.
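
(For reference, the migration network can also be set directly in /etc/pve/datacenter.cfg -- a sketch, with the CIDR assumed from the 40GbE mesh above:)

Code:
# /etc/pve/datacenter.cfg
# "secure" keeps migration traffic tunneled; network selects the mesh
migration: secure,network=192.168.161.0/24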

I'll take a stab at it and say that using my 40GbE mesh network for migrations would impact performance/latency, since migration traffic would share the links with Ceph.

Full mesh is a great option for some of the remote sites where I plan to deploy PVE sometime next year.

Next thing I will handle is migrating from PVE8 to 9.

I appreciate all of your support, thank you!