ceph status "authenticate timed out after 300" after attempt to migrate to new network

Mr.Lux

New Member
Nov 2, 2024
Hey dear Proxmox Pros,

I have an issue with my Ceph cluster after attempting to migrate it to a dedicated network.
I have a 3-node Proxmox cluster with Ceph enabled. Since I only had one network connection on each of the nodes, I wanted to create a dedicated, separate network just for the Ceph cluster.
But my attempt was ill-fated, because now my whole cluster isn't working anymore.

What I have tried:
  • I gave every node a separate IP address on the new network: 192.168.100.1, 192.168.100.2, 192.168.100.3.
  • Since they are the only nodes on this network, there is no gateway or anything else.
  • After that, I changed the cluster_network and the public_network from 192.168.200.4/24 (the old IP address) to 192.168.100.1/24 in my /etc/pve/ceph.conf.
  • Afterwards, I restarted the nodes.
  • Then I changed the IP address of the monitor from 192.168.200.18 to 192.168.100.2 (see the snippet below).
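The edit looked roughly like this (I don't have the exact file anymore, but it was along these lines):
Code:
[global]
        cluster_network = 192.168.100.1/24
        public_network = 192.168.100.1/24

[mon.pve-02]
        public_addr = 192.168.100.2
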
And then I wasn't able to access the Ceph cluster anymore. When I try to run ceph --status, ceph -s or ceph health, it just hangs.
So I tried to revert my changes and go back to where I was at the beginning.
But even after reverting everything, I still get the same issues.
I'm pretty sure that I didn't have the option mon_host = 192.168.200.18 in my /etc/pve/ceph.conf before I switched, but when this option isn't present, I get the following error message when I run ceph --status:
Bash:
root@pve-02:~# ceph --status
failed to get an address for mon.pve-02: error -2
unable to get monitor info from DNS SRV with service name: ceph-mon
2024-11-02T22:49:22.369+0100 72d5402006c0 -1 failed for service _ceph-mon._tcp
2024-11-02T22:49:22.369+0100 72d5402006c0 -1 monclient: get_monmap_and_config cannot identify monitors to contact
[errno 2] RADOS object not found (error connecting to the cluster)

And if this option is present, I get this error message:
Bash:
root@pve-02:~# ceph status
2024-11-02T22:42:07.468+0100 7c96a50006c0  0 monclient(hunting): authenticate timed out after 300
[errno 110] RADOS timed out (error connecting to the cluster)

I've been working on this issue for 12 hours now, and because I'm a noob when it comes to Ceph, I don't know what else I can do.
I have played around with the ceph.conf here and there, mostly on the monitor side, so much that I don't know what the original one looked like anymore. But here it is anyway:
Code:
root@pve-02:~# cat /etc/pve/ceph.conf
[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 192.168.200.4/24
        fsid = cf5fdd35-0db9-4936-b163-a36e1457fb1e
        mon_allow_pool_delete = true
        mon_host = 192.168.200.18 192.168.200.15
        ms_bind_ipv4 = true
        ms_bind_ipv6 = false
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 192.168.200.4/24

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
        keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
        keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.pve-01]
        host = pve-01
        mds_standby_for_name = pve

[mds.pve-03]
        host = pve-03
        mds_standby_for_name = pve

[mon.pve-02]
        public_addr = 192.168.200.18

[mon.pve-03]
        public_addr = 192.168.200.15

I'm really out of ideas here and in need of help.
Can someone help me?
If you need more information, please ask. I have tried so many things that I don't know what is important anymore.

Cheers and thanks for reading this.
 
Ceph MONs themselves don't take their IPs just from the ceph.conf file! They have their internal monmap where they keep track of which other MONs should be there to form the quorum.

Is the old network config still available? Or can you add it temporarily? Then the procedure is rather simple: switch the MON IPs (dedicated section & mon_host line) back to the old IPs.

Then restart the Ceph MONs. They should be able to communicate and form a quorum. You can then reload/restart the other Ceph services if needed.
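For example, on each node (the MON ID is usually the node name, e.g. pve-02):
Bash:
# restart the MON on this node
systemctl restart ceph-mon@pve-02.service
# then check whether the MONs form a quorum again
ceph -s
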
To switch the Ceph MONs over to the new IPs, destroy and recreate them one at a time. They will then use the IPs from the new network configured in the public_network line.

If that is not the case, you could stop one MON, export its monmap, and change (or delete & re-add) the MON entries so that it lists all the MONs with IPs in the new subnet. Then copy it to the other nodes and inject it into all (stopped) MONs before you start them again.
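Very roughly, on one node that could look like this (adjust the MON ID and address to your setup; on newer releases you may want to set the v2/v1 addresses explicitly with --addv):
Bash:
# stop the MON on this node (the ID is usually the node name)
systemctl stop ceph-mon@pve-02.service
# extract its current monmap and inspect it
ceph-mon -i pve-02 --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap
# remove the entry with the wrong address and re-add it with the correct one
monmaptool --rm pve-02 /tmp/monmap
monmaptool --add pve-02 192.168.100.2:6789 /tmp/monmap
# inject the fixed monmap and start the MON again
ceph-mon -i pve-02 --inject-monmap /tmp/monmap
systemctl start ceph-mon@pve-02.service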

The Ceph docs have a section that relates closely and explains the process on how to extract and modify the monmap: https://docs.ceph.com/en/reef/rados.../#removing-monitors-from-an-unhealthy-cluster

But in your case you want them all to stay as they are, just with IPs in the new network.
 
Hey Aaron,
thanks for the fast reply. Especially at this time!
I just wanted to say thank you. I will try this tomorrow, since I'm pretty exhausted at the moment and need a bit of sleep :)
But it makes me a bit more hopeful that you didn't say "you're f***ed" :-D
I will report back tomorrow and will surely have a few more questions.
 
Ceph can handle and recover from a lot of situations. There are only a few situations where the answer is to nuke it and restore from backup ;)

Situations like that which I have encountered over the last few years were:
  • deleted all MONs (and even then, the cluster can usually be recovered to a point where the data is accessible again, so you can create backups, nuke it and set it up cleanly again)
  • losing more disks/OSDs than the redundancy settings can tolerate. min_size=1 is one big contributor to running into such a situation.
 
Hey @aaron,
again thanks for your help.
I was able to get my Ceph cluster running again thanks to your link. The "funny" thing is that I have seen this link all over the forum and I also tried it once, but it seems that I did something wrong.
I found out that in the monmap there was one IP address in the old range and one in the new range.
I deleted the one from the new range (because I was already trying to get back to the "before" state), and after injecting the repaired monmap and restarting the monitor on that host, I was able to run ceph status again!
Afterwards I needed to restart every node, one by one, and my cluster was available and healthy again.

Just to do it right this time: how do I switch my Ceph cluster to a new network?
  1. I have new IP addresses for the new network cards.
  2. I then change the cluster_network and the public_network from 192.168.200.4/24 (the old IP address) to 192.168.100.1/24 (the new IP address range) in my /etc/pve/ceph.conf.
  3. I destroy one of the old monitors (but not all at once).
  4. I create a new monitor on the host where I just destroyed the old one. It should get a new IP address.
  5. If this goes well, I continue with the next monitor.
After doing this on every monitor, I should be running on the new network?
Is this correct? And do I have to destroy the managers as well?

Cheers and thanks again for the help!
 
Sounds good overall. Both the old and the new network addresses are available during the procedure, right?

And yes, destroy and recreate one MON at a time. This way the PVE & Ceph tooling will take care of everything, changing that MON's IP in the ceph.conf and the monmap.
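
On each node, that could look roughly like this (example for pve-02; wait until Ceph is HEALTH_OK before moving on):
Bash:
# destroy the MON on this node and recreate it; the new one picks up an IP from the new public_network
pveceph mon destroy pve-02
pveceph mon create
# check that the cluster is healthy again before the next node
ceph -s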

MGRs don't necessarily need to be destroyed, but if you do it one at a time and, again, wait until Ceph is HEALTH_OK before you continue to the next one, nothing speaks against it.

You can run ss -tulpn | grep -i ceph to check which IP addresses the Ceph services are listening on after recreating/restarting them.
As long as both networks are available throughout, this can be done on the running cluster. Just make sure that you only restart the OSD services on one node at a time, so you always have 2 replicas available.
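
Per node, that could look roughly like this (just a sketch using the standard systemd targets):
Bash:
# restart all OSDs on this node only, then wait before touching the next node
systemctl restart ceph-osd.target
# wait for HEALTH_OK
ceph -s
# verify the services now listen on the new addresses
ss -tulpn | grep -i ceph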
 
