TL;DR - Upgrade from 16.2.5 to 16.2.6 - CephFS fails to start after upgrade, all MDS in "standby" - requires
Code:
ceph fs compat <fs name> add_incompat 7 "mds uses inline data"
to work again.
Longer version:
pve-manager/7.0-11/63d82f4e (running kernel: 5.11.22-5-pve)
apt dist-upgraded; Ceph was upgraded from 16.2.5 to 16.2.6, which then made the GUI complain about differing versions. Following my usual method, which lines up with online advice:
* restart MGR
* restart MDS
* restart OSD
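For anyone who wants to script that restart order, here is a minimal sketch. It assumes the stock systemd targets shipped with the Debian/Proxmox Ceph packages (`ceph-mgr.target`, `ceph-mds.target`, `ceph-osd.target`) and has to be run on each node, since `systemctl` only touches local daemons. The `run` helper, the `DRY_RUN` variable and the function name are my own illustration, not anything from Ceph or PVE. Also note the upstream Ceph upgrade notes generally put monitors first and MDS last, so treat this as "the order I used", not a recommendation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical function: restart local Ceph daemons in the order used above
# (MGR, then MDS, then OSD). Run it on every node.
restart_ceph_daemons() {
  local target
  for target in ceph-mgr.target ceph-mds.target ceph-osd.target; do
    run systemctl restart "$target"
  done
  # Afterwards every daemon should report the same version (16.2.6 here).
  run ceph versions
}
```

`DRY_RUN=1 restart_ceph_daemons` just prints the four commands so you can eyeball them first.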
Followed that advice, and the MDSs all went to standby and wouldn't start. Restarted everything, including the entire cluster(s) themselves: ranks 0 and 1 both showed "failed", all MDSs were in standby, and none were active.
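If you want to confirm you are in the same state (ranks failed, every MDS sitting in standby), these standard status commands show it. The function name, the `run`/`DRY_RUN` wrapper and the default filesystem name `cephfs` are assumptions for illustration; substitute your own fs name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical function: show the MDS/rank state for one filesystem.
show_mds_state() {
  local fs="${1:-cephfs}"   # assumption: filesystem is named "cephfs"
  run ceph fs status "$fs"  # ranks, their state, and the standby daemons
  run ceph mds stat         # one-line MDS map summary
  run ceph fs get "$fs"     # full MDS map for this fs, incl. max_mds and compat flags
}
```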
Tried a lot of steps with "cephfs-journal-tool" (good education), but nothing brought them back online. Lots of Google results about "damaged" MDSs; none of that applied or worked.
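If you do end up poking at the journals, start with the read-only `journal inspect` and leave the destructive subcommands alone unless you are sure you need them (in my case the journals were never the problem). A minimal sketch, assuming the fs is named `cephfs` and has ranks 0 and 1 as above; the function name and `run`/`DRY_RUN` wrapper are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical function: read-only integrity check of each rank's MDS journal.
inspect_mds_journals() {
  local fs="${1:-cephfs}"   # assumption: filesystem is named "cephfs"
  local rank
  for rank in 0 1; do       # the two ranks that showed "failed" above
    run cephfs-journal-tool --rank="$fs:$rank" journal inspect
  done
}
```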
Finally stumbled across this thread: https://www.mail-archive.com/ceph-users@ceph.io/msg12463.html
* stop all MDS
* ceph fs set cephfs max_mds 1
* ceph fs set cephfs allow_standby_replay false
* ceph fs compat <fs name> add_incompat 7 "mds uses inline data"
* start MDS
* magic ensues!
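The same steps wrapped into one function, as a sketch only. The `ceph fs` commands are exactly the ones from the list; the function name, the `run`/`DRY_RUN` wrapper, the default fs name `cephfs` and the use of `ceph-mds.target` to stop/start the daemons are my assumptions. `systemctl` only acts on the local node, so the stop step has to happen on every MDS node before the `ceph fs` commands, and the start step on every node afterwards.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical function: the recovery steps from the mailing-list thread.
recover_cephfs_mds() {
  local fs="${1:-cephfs}"   # assumption: filesystem is named "cephfs"
  # 1. Stop the MDS daemons (local node only; repeat on each MDS node).
  run systemctl stop ceph-mds.target
  # 2. Single active rank, no standby-replay.
  run ceph fs set "$fs" max_mds 1
  run ceph fs set "$fs" allow_standby_replay false
  # 3. Add incompat feature 7 back to the filesystem's compat set.
  run ceph fs compat "$fs" add_incompat 7 "mds uses inline data"
  # 4. Start the MDS daemons again (repeat on each MDS node).
  run systemctl start ceph-mds.target
}
```

Afterwards `ceph fs status <fs name>` should show rank 0 active again. The thread doesn't cover it, but if you ran two active ranks before, `ceph fs set <fs name> max_mds 2` is how you would raise it again once things are stable.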
Posting here for other PMX users: this is a nasty stone to trip over. It easily cost me 20 hours this weekend, plus a mild amount of panic.
(more details here : https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/KQ5A5OWRIUEOJBC7VILBGDIKPQGJQIWN/)