My Ceph mon on one node failed and won't start

abeperspiration

I have a 3-node Ceph cluster that has been working for a year.

I am getting a HEALTH_WARN about:

2 OSDs have spurious read errors
1/3 mons down, quorum ceph01,ceph03
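
For completeness, this is how I pulled that status (standard commands, run from a node that is still in quorum):

Code:
    ceph -s
    ceph health detail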

I tried to start the mon on ceph02, but it is not working.


Code:
xxxxxxx@ceph02:~# systemctl status ceph-mon@ceph02
        ● ceph-mon@ceph02.service - Ceph cluster monitor daemon
             Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
            Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
                     └─ceph-after-pve-cluster.conf
             Active: active (running) since Sat 2024-02-03 12:27:49 CST; 5 months 12 days ago
           Main PID: 1450 (ceph-mon)
              Tasks: 24
             Memory: 3.4G
                CPU: 2w 4d 14h 10min 5.925s
             CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@ceph02.service
                     └─1450 /usr/bin/ceph-mon -f --cluster ceph --id ceph02 --setuser ceph --setgroup ceph
       
        Jul 17 12:17:16 ceph02 ceph-mon[1450]: 2024-07-17T12:17:16.574+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
        Jul 17 12:17:31 ceph02 ceph-mon[1450]: 2024-07-17T12:17:31.590+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
        Jul 17 12:17:46 ceph02 ceph-mon[1450]: 2024-07-17T12:17:46.603+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
        Jul 17 12:18:01 ceph02 ceph-mon[1450]: 2024-07-17T12:18:01.615+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
        Jul 17 12:18:16 ceph02 ceph-mon[1450]: 2024-07-17T12:18:16.627+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
        Jul 17 12:18:31 ceph02 ceph-mon[1450]: 2024-07-17T12:18:31.644+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
        Jul 17 12:18:46 ceph02 ceph-mon[1450]: 2024-07-17T12:18:46.660+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
        Jul 17 12:19:01 ceph02 ceph-mon[1450]: 2024-07-17T12:19:01.672+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
        Jul 17 12:19:16 ceph02 ceph-mon[1450]: 2024-07-17T12:19:16.685+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
        Jul 17 12:19:31 ceph02 ceph-mon[1450]: 2024-07-17T12:19:31.697+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
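
From what I found while googling, handle_auth_bad_method with "Permission denied" can point at a mon keyring mismatch. I assume the two keys can be compared like this (default paths, not yet verified on my setup):

Code:
    # keyring the failing mon starts with
    cat /var/lib/ceph/mon/ceph-ceph02/keyring
    # mon. key the cluster expects (run on a healthy node)
    ceph auth get mon.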


And I did some googling about how to debug it.

Code:
xxxxxx@ceph02:~# ceph tell mon.1 mon_status
    Error ENXIO: problem getting command descriptions from mon.1
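
I later realized the mon ID here is the host name, not a rank, so maybe it should be addressed as mon.ceph02 instead (though I am not sure this helps while it cannot authenticate):

Code:
    ceph tell mon.ceph02 mon_status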

And tried:

Code:
    # query the mon via its local admin socket
    sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.ceph02.asok mon_status
    # run the mon in the foreground with mon debug logging
    ceph-mon -i ceph02 --debug_mon 10
    # inspect the mon's data directory
    ls /var/lib/ceph/mon/ceph-ceph02/



None of them produced any output or response.
My system disk still has free space, and the disk health check is OK with no errors.
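
To rule out a missing admin socket or an empty mon store, I was also thinking of checking these directly (default paths assumed):

Code:
    ls -l /var/run/ceph/
    du -sh /var/lib/ceph/mon/ceph-ceph02/store.db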

It looks like the mon store directory on this node has some issue.

Should I rm it, or just reboot the node?
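
If removing and recreating it is the right way, I assume the general sequence would be something like this (untested on my cluster; since this looks like a Proxmox setup given the pve drop-in, pveceph may be the cleaner tool):

Code:
    # on ceph02: stop the mon and move its store aside rather than deleting it
    systemctl stop ceph-mon@ceph02
    mv /var/lib/ceph/mon/ceph-ceph02 /var/lib/ceph/mon/ceph-ceph02.bak
    # from a node in quorum: drop ceph02 from the monmap
    ceph mon remove ceph02
    # then recreate the mon on ceph02 (Proxmox tooling)
    pveceph mon create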
 