My Ceph mon on one node failed and won't start

abeperspiration

New Member
Aug 27, 2022
I have a Ceph cluster on 3 nodes that has been working for a year.

I got a HEALTH_WARN about:

2 OSDs have spurious read errors
1/3 mons down, quorum ceph01,ceph03
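
To confirm it's the mon on ceph02 that dropped out, I figured I could check the quorum from one of the healthy nodes (just the standard status command, as far as I know):

Code:
# run from ceph01 or ceph03, which still hold quorum
ceph quorum_status --format json-pretty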

I tried to start the mon on ceph02, but it's not working.


Code:
xxxxxxx@ceph02:~# systemctl status ceph-mon@ceph02
● ceph-mon@ceph02.service - Ceph cluster monitor daemon
     Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
             └─ceph-after-pve-cluster.conf
     Active: active (running) since Sat 2024-02-03 12:27:49 CST; 5 months 12 days ago
   Main PID: 1450 (ceph-mon)
      Tasks: 24
     Memory: 3.4G
        CPU: 2w 4d 14h 10min 5.925s
     CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@ceph02.service
             └─1450 /usr/bin/ceph-mon -f --cluster ceph --id ceph02 --setuser ceph --setgroup ceph

Jul 17 12:17:16 ceph02 ceph-mon[1450]: 2024-07-17T12:17:16.574+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 17 12:17:31 ceph02 ceph-mon[1450]: 2024-07-17T12:17:31.590+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 17 12:17:46 ceph02 ceph-mon[1450]: 2024-07-17T12:17:46.603+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 17 12:18:01 ceph02 ceph-mon[1450]: 2024-07-17T12:18:01.615+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 17 12:18:16 ceph02 ceph-mon[1450]: 2024-07-17T12:18:16.627+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 17 12:18:31 ceph02 ceph-mon[1450]: 2024-07-17T12:18:31.644+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 17 12:18:46 ceph02 ceph-mon[1450]: 2024-07-17T12:18:46.660+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 17 12:19:01 ceph02 ceph-mon[1450]: 2024-07-17T12:19:01.672+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 17 12:19:16 ceph02 ceph-mon[1450]: 2024-07-17T12:19:16.685+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 17 12:19:31 ceph02 ceph-mon[1450]: 2024-07-17T12:19:31.697+0800 7f1ccdd33700 -1 mon.ceph02@1(peon) e3 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
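
The handle_auth_bad_method / Permission denied spam makes me think this mon's keyring no longer matches what the cluster expects, though I'm not sure. If that's the case, I suppose comparing the keys would show it (these paths are the defaults on my install):

Code:
# on a healthy node: the mon. key the cluster expects
ceph auth get mon.
# on ceph02: the key this mon is actually using
cat /var/lib/ceph/mon/ceph-ceph02/keyring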


And I did some googling about how to debug it.

Code:
xxxxxx@ceph02:~# ceph tell mon.1 mon_status
    Error ENXIO: problem getting command descriptions from mon.1
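
I'm not sure mon.1 is even the right target there, since the mons here are named after the hosts, so I also meant to try it by name:

Code:
# address the mon by name instead of by rank
ceph tell mon.ceph02 mon_status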

And I tried:

Code:
sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.ceph02.asok mon_status
ceph-mon -i ceph02 --debug_mon 10
ls /var/lib/ceph/mon/ceph-ceph02/



None of them gave any output or response.
My system disk still has free space, and its health is OK with no errors.
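
Since nothing is coming back, my next idea was to stop the unit and run the mon in the foreground so any startup errors land on stderr (same flags as in the unit file above plus debug; I haven't verified this is the right approach):

Code:
# stop the unit first so the mon's ports and store are free
systemctl stop ceph-mon@ceph02
# -d keeps it in the foreground and logs to stderr
ceph-mon -d --cluster ceph --id ceph02 --setuser ceph --setgroup ceph --debug_mon 10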

It looks like the mon's data store folder on this node has some kind of issue.

Should I rm it, or just reboot the node?
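
If the store really is broken, my understanding is that the cleaner path on Proxmox would be to destroy and recreate the mon rather than rm the folder by hand, something like this (untested, just from skimming the pveceph docs):

Code:
# on ceph02, while ceph01 and ceph03 still hold quorum
pveceph mon destroy ceph02
pveceph mon create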
 