PVECEPH Purge - Recovery Advice Needed

Azur

New Member
Dec 22, 2018
Hi to all.

I need urgent advice about PVECEPH Purge. Yeah, I did that stupidity on a perfectly working CEPH: 4 PVE nodes, 10 OSDs...
Of course I lost the connection to the CEPH storage... It is now marked with "?".

I have "rados_connect failed - No such file or directory (500" when click on CEPH in GUI

pveceph status returns

unable to get monitor info from DNS SRV with service name: ceph-mon
no monitors specified to connect to.
rados_connect failed - No such file or directory
rados_connect failed - No such file or directory

Any advice on how to recover from this? It seems everything is still in place in /var/lib/ceph...; I didn't erase anything after the incident.

Thank You in advance
 
Hello,

could you do an "ls -la" on "/var/lib/ceph" and on the folder "/etc/pve/priv"?
It seems your Mons are not running anymore, which is a big problem, because without the Mons you cannot access your CEPH cluster. If you lost your Mons completely, there might still be a chance with "reverse engineering", because all of your OSDs know what they are storing, and you can extract and combine that information to recreate a Mon.
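
Just to sketch what that "reverse engineering" means: the upstream Ceph documentation describes how to rebuild the monitor store from the OSDs with ceph-objectstore-tool and ceph-monstore-tool. Very roughly (paths and the keyring are only placeholders, and the exact options depend on your Ceph version, so do not run this blindly):

Code:
# on each host, with the OSDs stopped: collect the cluster maps from every OSD
# into a temporary mon store (merge the results from all hosts into one directory)
ms=/root/mon-store
mkdir -p $ms
for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path $osd --op update-mon-db --mon-store-path $ms
done

# then rebuild the mon store from the collected maps; the keyring must contain
# the mon. and client.admin keys
ceph-monstore-tool $ms rebuild -- --keyring /path/to/mon-and-admin.keyring

# finally back up the old store.db of one monitor and replace it with the rebuilt one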

Do you have any information about your cluster, like its last status (was the cluster in a healthy state?), the admin keyring, the config, pools with replication, the crush map / rules, versions, etc.? All such information can help to rebuild the CEPH storage again.
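
For a start, the output of something like this from one of the nodes would already tell a lot (standard PVE paths assumed):

Code:
ls -la /var/lib/ceph /etc/ceph /etc/pve/priv
systemctl list-units 'ceph*'
cat /etc/pve/ceph.conf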

You can check if this will help you: https://subscription.packtpub.com/b...73/recovering-from-a-complete-monitor-failure
You can create a 10-day free account to see the content. Otherwise you can use Google to get more information about this.

Of course you can also hire a consultant to recover your CEPH, for example the guys from Croit (https://croit.io/consulting). They have experience recovering from multi-Mon failures.
 
Thanks for the reply.
First, I manually entered my monitors in the RBD section of storage.cfg (roughly the shape shown further down in this post), and that restored block access to the ceph pool and the VM images/disks. That helped a lot, as you can imagine. So obviously the mons are working?

root@ahbab:/etc/pve/priv# systemctl status ceph-mon@ahbab.service
ceph-mon@ahbab.service - Ceph cluster monitor daemon
Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; ve
Drop-In: /lib/systemd/system/ceph-mon@.service.d
└─ceph-after-pve-cluster.conf
Active: active (running) since Sat 2018-12-22 21:28:33 CET; 11h ag
Main PID: 13309 (ceph-mon)
Tasks: 23
CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@ahbab.serv
└─13309 /usr/bin/ceph-mon -f --cluster ceph --id ahbab --s

Dec 22 21:28:33 ahbab systemd[1]: Started Ceph cluster monitor daemon
Dec 23 06:25:02 ahbab ceph-mon[13309]: 2018-12-23 06:25:02.259270 7fd


But I get this when trying to check the status:


root@ahbab:/etc/pve/priv# pveceph status
unable to get monitor info from DNS SRV with service name: ceph-mon
no monitors specified to connect to.
rados_connect failed - No such file or directory
rados_connect failed - No such file or directory
root@ahbab:/etc/pve/priv#


It seems that I only lack the "link" between the mon and the client(s), but I don't know how to fix it...
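
For reference, the manual entry I mentioned is roughly of this shape (storage ID, pool name and IPs are placeholders, not my real ones):

Code:
rbd: ceph-vm
        content images
        krbd 0
        monhost 10.0.0.1 10.0.0.2 10.0.0.3
        pool rbd
        username admin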

Any suggestions?
 
Adding some outputs while looking for a solution:

root@ahbab:~# ceph health
unable to get monitor info from DNS SRV with service name: ceph-mon
no monitors specified to connect to.
2018-12-23 09:33:35.199545 7fd813f18700 -1 failed for service _ceph-mon._tcp
[errno 2] error connecting to the cluster
root@ahbab:~#



root@ahbab:~# ceph service status
unable to get monitor info from DNS SRV with service name: ceph-mon
no monitors specified to connect to.
2018-12-23 09:35:49.982111 7fd7c1810700 -1 failed for service _ceph-mon._tcp
[errno 2] error connecting to the cluster
root@ahbab:~#
 
unable to get monitor info from DNS SRV with service name: ceph-mon
This looks like a problem with DNS names. Could you check your /etc/hosts and see if all PVE nodes have an entry there?
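
A quick way to check that is to try resolving every node name on every node (the node names here are just examples):

Code:
for n in node1 node2 node3 node4; do getent hosts $n; done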

How many Mons do you have running? Did you try to restart a node / Mon to see if the services are able to start again? Your monitor service has been up and running for the last 11h, so at first glance it looks like the monitor does not have any problems starting.
 
Hello and thanks again.

I checked /etc/hosts on each node. I only have entries for that node: two lines, loopback and FQDN. Looks normal to me?
I spent the whole day moving VMs from the CEPH storage to local ones. I found a monitor dead once, but I restarted it without problems.
My plan is to reinitialize everything about CEPH after debugging this...

Will try to restart a node safely late at night to see what will happen with the monitors...


rgds
 
Could you paste the output here or make a screenshot?
With CEPH you normally need an entry for every node in your cluster. If I look at my nodes, I have an entry for every cluster node on every node - this might be the problem with your cluster.
 
Sure, I speak only IPv4.
Here it is for one node, slightly edited :)

root@amiga:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
xxx.xxx.xxx.xxx amiga.team.ba amiga

# The following lines are desirable for IPv6 capable hosts

::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
root@amidza:~#
 
Please add every node to the /etc/hosts on every node. It should look like this:

Code:
root@alpha:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost

192.168.0.1 alpha.team.ba alpha pvelocalhost
192.168.0.2 bravo.team.ba bravo
192.168.0.3 charlie.team.ba charlie
192.168.0.4 delta.team.ba delta

You have to move the name "pvelocalhost" to the correct line: if you copy the file to server charlie, you have to remove "pvelocalhost" from the alpha line and add it to the charlie line.
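
So on charlie, for example, the relevant lines would become:

Code:
192.168.0.1 alpha.team.ba alpha
192.168.0.3 charlie.team.ba charlie pvelocalhost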
 
Those are the IPs ceph uses, right?
If you have configured them for CEPH, then yes; otherwise no, it depends on your configuration. It's just an example from my side, you have to adapt this file so you can use it in your cluster.
 
I have separate nets/interfaces for both corosync and ceph... will change and try.
 
Just tried it, without success, but it pointed me in the right direction, I think...

In the storage definition I directly entered the IPs with ports for the monitors...
But I don't have that in /etc/ceph/ceph.conf.

I have a global section and an osd section, but no MON section.

Could you send me an example from your system to try?
 
Never mind, bro... I found an example on the net...
I lack almost everything :)

Mon defs, osd defs, etc... For academic purposes I will try to reconnect one monitor and check what will happen.
 
Nice to hear :)
But I will put in some configurations from my standalone PVE CEPH backup server, so you can check against those instead of hunting for configs on Google.

Under "/etc/ceph" i have an Admin Keyring and a linked ceph.conf (to /etc/pve/ceph.conf)

ceph.client.admin.keyring (Key is replaced with "a" only):
Code:
[client.admin]
    key = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa==

/etc/pve/ceph.conf
Code:
[global]
     auth client required = cephx
     auth cluster required = cephx
     auth service required = cephx
     bluestore cache size hdd = 536870912
     filestore xattr use omap = true
     fsid = aaaaaaa-1234-56bb-c123-07de0f9g5c56
     keyring = /etc/pve/priv/$cluster.$name.keyring
     mon allow pool delete = true
     osd journal size = 5120
     osd pool default min size = 1

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.ceph-storage1]
     host = ceph-storage1
     mds standby for name = pve

[osd]
     keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.0]
     host = ceph-storage1
     mon addr = 192.168.0.1:6789

The MDS keyring (/var/lib/ceph/mds/ceph-$id/keyring) and the OSD keyrings follow the same scheme as the admin keyring, but with different keys.

/etc/pve/priv/ceph.mon.keyring (the mon. key is different from the admin key):
Code:
[mon.]
    key = bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb==
    caps mon = "allow *"
[client.admin]
    key = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa==
    auid = 0
    caps mds = "allow"
    caps mon = "allow *"
    caps osd = "allow *"
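
One more note: once your Mons answer again and cephx is still intact, you do not have to reconstruct the admin keyring by hand; it can be re-exported from the cluster, roughly like this (target path as PVE expects it, per the keyring line in the config above; you need some still-working key to authenticate with):

Code:
# re-export the admin keyring to the path PVE uses (authenticate with whatever
# working key you still have, e.g. the mon. keyring of a running monitor)
ceph auth get client.admin -o /etc/pve/priv/ceph.client.admin.keyring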
 
