Cephfs - MDS all up:standby, not becoming up:active

jw6677

Like a dummy I accidentally upgraded to the ceph dev branch (quincy?), and have been having nothing but trouble since.

This wasn't actually intentional. I was trying to apply a PR that was expected to bring my cluster back online after the upgrade to v7 (and Ceph Pacific).

--> It did bring my cluster back online, so that's good, but I failed to recognize that by building from the master branch I also wouldn't be able to revert later. Whoops.

Lastly, while my MDS daemons are online (MONs, OSDs and MGRs too), they are never marked up:active, so my CephFS data is inaccessible.

Hoping someone can help me determine the best way to bring an MDS up:active, as I have the bulk of my Proxmox backups in CephFS along with a handful of VM disks.
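For reference, this is roughly how I have been checking the state of things (standard ceph CLI, nothing specific to my setup):
Code:
# Overall cluster health and which daemon versions are actually running
ceph -s
ceph versions

# Filesystem and MDS state: ranks, and which daemons are up:active vs up:standby
ceph fs status
ceph mds stat

# Full FSMap/MDSMap dump, including max_mds, the joinable flag and the compat sets
ceph fs dump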


MDS log immediately after restarting the MDS:
Code:
 2021-08-08T09:51:09.417-0600 7fb015e05700  1 mds.server Updating MDS map to version 1095533 from mon.1
 2021-08-08T09:51:09.417-0600 7fb013e01700  5 mds.beacon.server Sending beacon up:boot seq 1
 2021-08-08T09:51:09.417-0600 7fb015e05700 10 mds.server      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
 2021-08-08T09:51:09.417-0600 7fb015e05700 10 mds.server  mdsmap compat compat={},rocompat={},incompat={}
 2021-08-08T09:51:09.417-0600 7fb015e05700 10 mds.server my gid is 138479365
 2021-08-08T09:51:09.417-0600 7fb015e05700 10 mds.server map says I am mds.-1.-1 state null
 2021-08-08T09:51:09.417-0600 7fb015e05700 10 mds.server msgr says I am [v2:192.168.2.2:6808/1262779536,v1:192.168.2.2:6809/1262779536]
 2021-08-08T09:51:09.417-0600 7fb015e05700 10 mds.server handle_mds_map: handling map in rankless mode
 2021-08-08T09:51:09.441-0600 7fb013e01700 20 mds.beacon.server sender thread waiting interval 4s
 2021-08-08T09:51:09.441-0600 7fb015e05700 10 mds.server not in map yet
 2021-08-08T09:51:09.765-0600 7fb015e05700  1 mds.server Updating MDS map to version 1095534 from mon.1
 2021-08-08T09:51:09.765-0600 7fb015e05700 10 mds.server      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
 2021-08-08T09:51:09.765-0600 7fb015e05700 10 mds.server  mdsmap compat compat={},rocompat={},incompat={}
 2021-08-08T09:51:09.765-0600 7fb015e05700 10 mds.server my gid is 138479365
 2021-08-08T09:51:09.765-0600 7fb015e05700 10 mds.server map says I am mds.-1.0 state up:standby
 2021-08-08T09:51:09.765-0600 7fb015e05700 10 mds.server msgr says I am [v2:192.168.2.2:6808/1262779536,v1:192.168.2.2:6809/1262779536]
 2021-08-08T09:51:09.765-0600 7fb015e05700 10 mds.server handle_mds_map: handling map in rankless mode
 2021-08-08T09:51:09.765-0600 7fb015e05700  1 mds.server Monitors have assigned me to become a standby.
 2021-08-08T09:51:09.765-0600 7fb015e05700  5 mds.beacon.server set_want_state: up:boot -> up:standby
 2021-08-08T09:51:09.777-0600 7fb018e0b700  5 mds.beacon.server received beacon reply up:boot seq 1 rtt 0.360009
 2021-08-08T09:51:13.442-0600 7fb013e01700  5 mds.beacon.server Sending beacon up:standby seq 2
 2021-08-08T09:51:13.442-0600 7fb013e01700 20 mds.beacon.server sender thread waiting interval 4s
 2021-08-08T09:51:13.442-0600 7fb018e0b700  5 mds.beacon.server received beacon reply up:standby seq 2 rtt 0
 2021-08-08T09:51:17.442-0600 7fb013e01700  5 mds.beacon.server Sending beacon up:standby seq 3
 2021-08-08T09:51:17.442-0600 7fb013e01700 20 mds.beacon.server sender thread waiting interval 4s
It cycles on this forever and is never marked up:active.
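(For what it's worth, the verbose beacon/map output above is with the MDS debug levels turned up, roughly like this; the systemd unit name is just assumed from the "mds.server" name in the log:)
Code:
# Raise MDS and messenger debug levels so the beacon/map handling shows up in the log
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1

# Restart the daemon (unit name assumed from "mds.server" above)
systemctl restart ceph-mds@server.service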


This in particular looks weird to me in the logs, as 192.168.2.20 is a different node:
Code:
[v2:192.168.2.6:3300/0,v1:192.168.2.6:6789/0] >> conn(0x561e0593f800 0x561e07c57000 :6789 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner accept peer addr is really - (socket is v1:192.168.2.20:60454/0)



As far as I can tell, no CephFS or MDS settings help: number of ranks, standby-replay or not, cephx or not, different networks, recreating MGRs, MONs or MDSs, etc.
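The sort of commands I was poking at look roughly like this ("cephfs" standing in for my actual filesystem name):
Code:
# Change the number of active ranks
ceph fs set cephfs max_mds 1

# Toggle standby-replay
ceph fs set cephfs allow_standby_replay false

# Make sure the filesystem is actually accepting MDS daemons
ceph fs set cephfs joinable true

# Fail rank 0 in the hope that a standby gets promoted
ceph mds fail 0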



I did, however, notice the following, and am hoping someone can confirm whether it is normal or whether I am about to go on a wild goose chase.

Out of this block of the MDS log:
Code:
2021-08-10T09:31:09.484-0600 7ffa894fc700  1 mds.rog Updating MDS map to version 1095550 from mon.2
2021-08-10T09:31:09.484-0600 7ffa894fc700 10 mds.rog      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
2021-08-10T09:31:09.484-0600 7ffa894fc700 10 mds.rog  mdsmap compat compat={},rocompat={},incompat={}
2021-08-10T09:31:09.484-0600 7ffa894fc700 10 mds.rog my gid is 139597028
2021-08-10T09:31:09.484-0600 7ffa894fc700 10 mds.rog map says I am mds.-1.-1 state null
2021-08-10T09:31:09.484-0600 7ffa894fc700 10 mds.rog msgr says I am [v2:192.168.10.50:6800/1353942242,v1:192.168.10.50:6801/1353942242]
2021-08-10T09:31:09.484-0600 7ffa894fc700 10 mds.rog handle_mds_map: handling map in rankless mode
2021-08-10T09:31:09.484-0600 7ffa894fc700 10 mds.rog not in map yet
2021-08-10T09:31:10.000-0600 7ffa894fc700  1 mds.rog Updating MDS map to version 1095551 from mon.2
2021-08-10T09:31:10.000-0600 7ffa894fc700 10 mds.rog      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
2021-08-10T09:31:10.000-0600 7ffa894fc700 10 mds.rog  mdsmap compat compat={},rocompat={},incompat={}
2021-08-10T09:31:10.000-0600 7ffa894fc700 10 mds.rog my gid is 139597028
2021-08-10T09:31:10.000-0600 7ffa894fc700 10 mds.rog map says I am mds.-1.0 state up:standby
2021-08-10T09:31:10.000-0600 7ffa894fc700 10 mds.rog msgr says I am [v2:192.168.10.50:6800/1353942242,v1:192.168.10.50:6801/1353942242]
2021-08-10T09:31:10.000-0600 7ffa894fc700 10 mds.rog handle_mds_map: handling map in rankless mode
2021-08-10T09:31:10.000-0600 7ffa894fc700  1 mds.rog Monitors have assigned me to become a standby.
2021-08-10T09:31:10.000-0600 7ffa894fc700  5 mds.beacon.rog set_want_state: up:boot -> up:standby



These Lines:
Code:
2021-08-10T09:31:09.484-0600 7ffa894fc700 10 mds.rog      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
2021-08-10T09:31:09.484-0600 7ffa894fc700 10 mds.rog  mdsmap compat compat={},rocompat={},incompat={}
---> There looks to be a difference between the two incompat sets: the MDS advertises a full incompat list, while the mdsmap shows incompat={}. Is one of them meant to be empty like that?
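In case anyone wants to compare against their own cluster, this is the one-liner I used to pull the compat sets out of the FSMap (the MDS side is just from the log above):
Code:
# Compat/incompat sets as recorded in the FSMap on the monitors
ceph fs dump | grep -i compat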

This Line:
Code:
2021-08-10T09:31:09.484-0600 7ffa894fc700 10 mds.rog map says I am mds.-1.-1 state null
---> Is mds.-1.-1 normal?

This Line:
Code:
2021-08-10T09:31:10.000-0600 7ffa894fc700 10 mds.rog handle_mds_map: handling map in rankless mode
---> Is "handling map in rankless mode" normal?




I am hoping to recover my CephFS data by whatever means makes the most sense.

Eventually I intend to create a separate temporary Ceph cluster and migrate my data back to Pacific, but I really don't want to abandon this data if I can avoid it.
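For what it's worth, the route I keep coming back to in the docs (but have not dared to run yet) is the CephFS disaster-recovery sequence, roughly the following. "cephfs" is again a stand-in for my filesystem name, and I am not claiming this is the right fix for my situation:
Code:
# Export the rank 0 journal first, so there is something to go back to
cephfs-journal-tool --rank=cephfs:0 journal export backup.bin

# Salvage what dentries can be recovered from the journal into the metadata pool
cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary

# Reset the journal and the session table, then reset the filesystem map
cephfs-journal-tool --rank=cephfs:0 journal reset
cephfs-table-tool cephfs:all reset session
ceph fs reset cephfs --yes-i-really-mean-it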



Help!


~Josh
 
It's a sad story, unfortunately. I mucked about trying to get it working for months before finally giving up and accepting the data loss. :(