PVECEPH Purge - Recovery Advice Needed

Azur

New Member
Dec 22, 2018
Hi to all.

I need urgent advice about PVECEPH Purge. Yeah, I did that stupidity on a perfectly working CEPH: 4 PVE nodes, 10 OSDs...
Of course I lost the connection to the CEPH storage... It is now marked with "?".

I have "rados_connect failed - No such file or directory (500" when click on CEPH in GUI

pveceph status returns

unable to get monitor info from DNS SRV with service name: ceph-mon
no monitors specified to connect to.
rados_connect failed - No such file or directory
rados_connect failed - No such file or directory

Any advice on how to recover from this? It seems everything is still in place in /var/lib/ceph...; I didn't erase anything after the incident.

Thank You in advance
 
Hello,

could you do an "ls -la" on "/var/lib/ceph" and on the folder "/etc/pve/priv"?
It seems your Mons are not running anymore, which is a big problem, because without the Mons you cannot access your CEPH cluster. If you lost your Mons completely, there might still be a chance with "reverse engineering", because all of your OSDs know what they are storing, and you can extract and combine that information to recreate a Mon.
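
Just to sketch what that "reverse engineering" means: the upstream Ceph documentation describes how to rebuild the monitor store from the OSDs with ceph-objectstore-tool and ceph-monstore-tool. Very roughly (paths and the keyring are only placeholders, and the exact options depend on your Ceph version, so do not run this blindly):

Code:
# on each host, with the OSDs stopped: collect the cluster maps from every OSD
# into a temporary mon store (merge the results from all hosts into one directory)
ms=/root/mon-store
mkdir -p $ms
for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path $osd --op update-mon-db --mon-store-path $ms
done

# then rebuild the mon store from the collected maps; the keyring must contain
# the mon. and client.admin keys
ceph-monstore-tool $ms rebuild -- --keyring /path/to/mon-and-admin.keyring

# finally back up the old store.db of one monitor and replace it with the rebuilt one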

Do you have any information about your cluster, like its last status (was the cluster in a healthy state?), the admin keyring, the config, pools with replication, the crush map / rules, versions, etc.? All such information can help to rebuild the CEPH storage again.
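
For a start, the output of something like this from one of the nodes would already tell a lot (standard PVE paths assumed):

Code:
ls -la /var/lib/ceph /etc/ceph /etc/pve/priv
systemctl list-units 'ceph*'
cat /etc/pve/ceph.conf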

You can check if this will help you: https://subscription.packtpub.com/b...73/recovering-from-a-complete-monitor-failure
You can create a 10-day free account to see the content. Otherwise you can use Google to get more information about this.

Of course you can also hire a consultant to recover your CEPH, for example the guys from Croit (https://croit.io/consulting). They have experience recovering from multi-Mon failures.
 
Thanks for the reply.
First, I manually entered my monitors in the RBD section of storage.cfg (roughly the shape shown further down in this post), and that restored block access to the ceph pool and the VM images/disks. That helped a lot, as you can imagine. So obviously the mons are working?

root@ahbab:/etc/pve/priv# systemctl status ceph-mon@ahbab.service
ceph-mon@ahbab.service - Ceph cluster monitor daemon
Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; ve
Drop-In: /lib/systemd/system/ceph-mon@.service.d
└─ceph-after-pve-cluster.conf
Active: active (running) since Sat 2018-12-22 21:28:33 CET; 11h ag
Main PID: 13309 (ceph-mon)
Tasks: 23
CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@ahbab.serv
└─13309 /usr/bin/ceph-mon -f --cluster ceph --id ahbab --s

Dec 22 21:28:33 ahbab systemd[1]: Started Ceph cluster monitor daemon
Dec 23 06:25:02 ahbab ceph-mon[13309]: 2018-12-23 06:25:02.259270 7fd


But I get this when trying to check the status:


root@ahbab:/etc/pve/priv# pveceph status
unable to get monitor info from DNS SRV with service name: ceph-mon
no monitors specified to connect to.
rados_connect failed - No such file or directory
rados_connect failed - No such file or directory
root@ahbab:/etc/pve/priv#


It seems that I only lack the "link" between the mon and the client(s), but I don't know how to fix it...
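
For reference, the manual entry I mentioned is roughly of this shape (storage ID, pool name and IPs are placeholders, not my real ones):

Code:
rbd: ceph-vm
        content images
        krbd 0
        monhost 10.0.0.1 10.0.0.2 10.0.0.3
        pool rbd
        username admin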

Any suggestions?
 
Adding some outputs while looking for a solution:

root@ahbab:~# ceph health
unable to get monitor info from DNS SRV with service name: ceph-mon
no monitors specified to connect to.
2018-12-23 09:33:35.199545 7fd813f18700 -1 failed for service _ceph-mon._tcp
[errno 2] error connecting to the cluster
root@ahbab:~#



root@ahbab:~# ceph service status
unable to get monitor info from DNS SRV with service name: ceph-mon
no monitors specified to connect to.
2018-12-23 09:35:49.982111 7fd7c1810700 -1 failed for service _ceph-mon._tcp
[errno 2] error connecting to the cluster
root@ahbab:~#
 
unable to get monitor info from DNS SRV with service name: ceph-mon
This looks like a problem with DNS names. Could you check your /etc/hosts and see if all PVE nodes have an entry there?
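
A quick way to check that is to try resolving every node name on every node (the node names here are just examples):

Code:
for n in node1 node2 node3 node4; do getent hosts $n; done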

How many Mons do you have running? Did you try to restart a node / Mon to see if the services are able to start again? Your monitor service has been up and running for the last 11h, so at first glance it looks like the monitor does not have any problems starting.
 
Hello and thanks again.

I checked /etc/hosts on each node. I only have entries for that node: two lines, loopback and FQDN. Looks normal to me?
I spent the whole day moving VMs from the CEPH storage to local ones. I found a monitor dead once, but I restarted it without problems.
My plan is to reinitialize everything about CEPH after debugging this...

Will try to restart a node safely late at night to see what will happen with the monitors...


rgds
 
Could you paste the output here or make a screenshot?
With CEPH you normally need an entry for every node in your cluster. If I look at my nodes, I have an entry for every cluster node on every node - this might be the problem with your cluster.
 
Sure, I speak only IPv4.
Here it is for one node, slightly edited :)

root@amiga:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
xxx.xxx.xxx.xxx amiga.team.ba amiga

# The following lines are desirable for IPv6 capable hosts

::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
root@amidza:~#
 
Please add every node to the /etc/hosts on every node. It should look like this:

Code:
root@alpha:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost

192.168.0.1 alpha.team.ba alpha pvelocalhost
192.168.0.2 bravo.team.ba bravo
192.168.0.3 charlie.team.ba charlie
192.168.0.4 delta.team.ba delta

You have to move the name "pvelocalhost" to the correct line: if you copy the file to server charlie, you have to remove "pvelocalhost" from the alpha line and add it to the charlie line.
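
So on charlie, for example, the relevant lines would become:

Code:
192.168.0.1 alpha.team.ba alpha
192.168.0.3 charlie.team.ba charlie pvelocalhost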
 
Those are the IPs ceph uses, right?
If you have configured them for CEPH, then yes; otherwise no, it depends on your configuration. It's just an example from my side, you have to adapt this file so you can use it in your cluster.
 
I have separate nets/interfaces for both corosync and ceph... will change and try.
 
Just tried it, without success, but it pointed me in the right direction, I think...

In the storage definition I directly entered the IPs with ports for the monitors...
But I don't have that in /etc/ceph/ceph.conf.

I have a global section and an osd section, but no MON section.

Could you send me an example from your system to try?
 
Never mind, bro... I found an example on the net...
I lack almost everything :)

Mon defs, osd defs, etc... For academic purposes I will try to reconnect one monitor and check what will happen.
 
Nice to hear :)
But I will put in some configurations from my standalone PVE CEPH backup server, so you can check against those instead of hunting for configs on Google.

Under "/etc/ceph" i have an Admin Keyring and a linked ceph.conf (to /etc/pve/ceph.conf)

ceph.client.admin.keyring (Key is replaced with "a" only):
Code:
[client.admin]
    key = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa==

/etc/pve/ceph.conf
Code:
[global]
     auth client required = cephx
     auth cluster required = cephx
     auth service required = cephx
     bluestore cache size hdd = 536870912
     filestore xattr use omap = true
     fsid = aaaaaaa-1234-56bb-c123-07de0f9g5c56
     keyring = /etc/pve/priv/$cluster.$name.keyring
     mon allow pool delete = true
     osd journal size = 5120
     osd pool default min size = 1

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.ceph-storage1]
     host = ceph-storage1
     mds standby for name = pve

[osd]
     keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.0]
     host = ceph-storage1
     mon addr = 192.168.0.1:6789

The MDS keyring (/var/lib/ceph/mds/ceph-$id/keyring) and the OSD keyrings follow the same scheme as the admin keyring, but with different keys.

/etc/pve/priv/ceph.mon.keyring (the mon. key is different from the admin key):
Code:
[mon.]
    key = bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb==
    caps mon = "allow *"
[client.admin]
    key = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa==
    auid = 0
    caps mds = "allow"
    caps mon = "allow *"
    caps osd = "allow *"
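
One more note: once your Mons answer again and cephx is still intact, you do not have to reconstruct the admin keyring by hand; it can be re-exported from the cluster, roughly like this (target path as PVE expects it, per the keyring line in the config above; you need some still-working key to authenticate with):

Code:
# re-export the admin keyring to the path PVE uses (authenticate with whatever
# working key you still have, e.g. the mon. keyring of a running monitor)
ceph auth get client.admin -o /etc/pve/priv/ceph.client.admin.keyring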
 
