My Ceph mon on one node failed and won't start

abeperspiration

I have a 3-node Ceph cluster that has been working for a year.

I am getting a HEALTH_WARN about:

2 OSDs have spurious read errors
1/3 mons down, quorum ceph01,ceph03
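
For completeness, this is how I pulled that status (standard commands, run from a node that is still in quorum):

Code:
    ceph -s
    ceph health detail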

I tried to start the mon on ceph02, but it is not working.


Code:
xxxxxxx@ceph02:~# systemctl status ceph-mon@ceph02
        ● ceph-mon@ceph02.service - Ceph cluster monitor daemon
             Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
            Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
                     └─ceph-after-pve-cluster.conf
             Active: active (running) since Sat 2024-02-03 12:27:49 CST; 5 months 12 days ago
           Main PID: 1450 (ceph-mon)
              Tasks: 24
             Memory: 3.4G
                CPU: 2w 4d 14h 10min 5.925s
             CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@ceph02.service
                     └─1450 /usr/bin/ceph-mon -f --cluster ceph --id ceph02 --setuser ceph --setgroup ceph
       
        Jul 17 12:17:16 ceph02 ceph-mon[1450]: 2024-07-17T12:17:16.574+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
        Jul 17 12:17:31 ceph02 ceph-mon[1450]: 2024-07-17T12:17:31.590+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
        Jul 17 12:17:46 ceph02 ceph-mon[1450]: 2024-07-17T12:17:46.603+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
        Jul 17 12:18:01 ceph02 ceph-mon[1450]: 2024-07-17T12:18:01.615+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
        Jul 17 12:18:16 ceph02 ceph-mon[1450]: 2024-07-17T12:18:16.627+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
        Jul 17 12:18:31 ceph02 ceph-mon[1450]: 2024-07-17T12:18:31.644+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
        Jul 17 12:18:46 ceph02 ceph-mon[1450]: 2024-07-17T12:18:46.660+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
        Jul 17 12:19:01 ceph02 ceph-mon[1450]: 2024-07-17T12:19:01.672+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
        Jul 17 12:19:16 ceph02 ceph-mon[1450]: 2024-07-17T12:19:16.685+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
        Jul 17 12:19:31 ceph02 ceph-mon[1450]: 2024-07-17T12:19:31.697+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
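
From what I found while googling, handle_auth_bad_method with "Permission denied" can point at a mon keyring mismatch. I assume the two keys can be compared like this (default paths, not yet verified on my setup):

Code:
    # keyring the failing mon starts with
    cat /var/lib/ceph/mon/ceph-ceph02/keyring
    # mon. key the cluster expects (run on a healthy node)
    ceph auth get mon.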


And I did some googling about how to debug it.

Code:
xxxxxx@ceph02:~# ceph tell mon.1 mon_status
    Error ENXIO: problem getting command descriptions from mon.1
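
I later realized the mon ID here is the host name, not a rank, so maybe it should be addressed as mon.ceph02 instead (though I am not sure this helps while it cannot authenticate):

Code:
    ceph tell mon.ceph02 mon_status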

And tried:

Code:
    # query the mon via its local admin socket
    sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.ceph02.asok mon_status
    # run the mon in the foreground with mon debug logging
    ceph-mon -i ceph02 --debug_mon 10
    # inspect the mon's data directory
    ls /var/lib/ceph/mon/ceph-ceph02/



None of them produced any output or response.
My system disk still has free space, and the disk health check is OK with no errors.
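
To rule out a missing admin socket or an empty mon store, I was also thinking of checking these directly (default paths assumed):

Code:
    ls -l /var/run/ceph/
    du -sh /var/lib/ceph/mon/ceph-ceph02/store.db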

It looks like the mon store directory on this node has some issue.

Should I rm it, or just reboot the node?
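
If removing and recreating it is the right way, I assume the general sequence would be something like this (untested on my cluster; since this looks like a Proxmox setup given the pve drop-in, pveceph may be the cleaner tool):

Code:
    # on ceph02: stop the mon and move its store aside rather than deleting it
    systemctl stop ceph-mon@ceph02
    mv /var/lib/ceph/mon/ceph-ceph02 /var/lib/ceph/mon/ceph-ceph02.bak
    # from a node in quorum: drop ceph02 from the monmap
    ceph mon remove ceph02
    # then recreate the mon on ceph02 (Proxmox tooling)
    pveceph mon create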
 