CEPH MON issue on one node

mojsiuk

Active Member
Hi, I have an issue with the Ceph mon on one node that started a few weeks after the update to version 17.2.7. Below are the mon logs from syslog.
Once every few days the service stops and I have to start it manually. It then works fine for a few days until these errors come back.
The system disk has plenty of free space. It is on hardware RAID 1 and SMART on the disks is fine.
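For reference, restarting it manually looks roughly like this (standard systemd commands; the reset-failed step clears the "Start request repeated too quickly" state seen in the logs below):

# see why the unit died and its restart history
systemctl status ceph-mon@pve13
journalctl -u ceph-mon@pve13 --since '24 hours ago'

# clear the failed state and start the monitor again
systemctl reset-failed ceph-mon@pve13
systemctl start ceph-mon@pve13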

Logs
Aug 20 01:09:03 pve13.xxx systemd[1]: Started Ceph cluster monitor daemon.
Aug 20 01:09:03 pve13.xxx ceph-mon[594532]: problem writing to /var/log/ceph/ceph-mon.pve13.log: (28) No space left on device
Aug 20 01:09:03 pve13.xxx ceph-mon[594532]: 2024-08-20T01:09:03.966+0200 7f1c89e2da00 -1 error: monitor data filesystem reached concerning levels of available storage space (available: 0% 0 B)
Aug 20 01:09:03 pve13.xxx ceph-mon[594532]: you may adjust 'mon data avail crit' to a lower value to make this go away (default: 5%)
Aug 20 01:09:03 pve13.xxx systemd[1]: ceph-mon@pve13.service: Main process exited, code=exited, status=28/n/a
Aug 20 01:09:03 pve13.xxx systemd[1]: ceph-mon@pve13.service: Failed with result 'exit-code'.
Aug 20 01:09:14 pve13.xxx systemd[1]: ceph-mon@pve13.service: Scheduled restart job, restart counter is at 6.
Aug 20 01:09:14 pve13.xxx systemd[1]: Stopped Ceph cluster monitor daemon.
Aug 20 01:09:14 pve13.xxx systemd[1]: ceph-mon@pve13.service: Start request repeated too quickly.
Aug 20 01:09:14 pve13.xxx systemd[1]: ceph-mon@pve13.service: Failed with result 'exit-code'.
Aug 20 01:09:14 pve13.xxx systemd[1]: Failed to start Ceph cluster monitor daemon.
Aug 25 19:47:32 pve13.xxx systemd[1]: Started Ceph cluster monitor daemon.
Aug 25 19:47:32 pve13.xxx ceph-mon[2859316]: 2024-08-25T19:47:32.543+0200 7f5b17e67700 -1 mon.pve13@-1(???) e4 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Aug 25 19:47:32 pve13.xxx ceph-mon[2859316]: 2024-08-25T19:47:32.743+0200 7f5b17e67700 -1 mon.pve13@2(probing) e4 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Aug 25 19:47:33 pve13.xxx ceph-mon[2859316]: 2024-08-25T19:47:33.147+0200 7f5b17e67700 -1 mon.pve13@2(probing) e4 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Aug 25 19:47:33 pve13.xxx ceph-mon[2859316]: 2024-08-25T19:47:33.947+0200 7f5b17e67700 -1 mon.pve13@2(synchronizing) e4 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
 
root@pve13:~# df -hT /var/log/ceph/
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/pve-root ext4 94G 9.7G 80G 11% /
root@pve13:~#
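Since the root filesystem clearly has free blocks, the ENOSPC at mon start may be coming from exhausted inodes or from a different mount holding the mon data; a quick way to check (default Proxmox/Ceph paths assumed):

# inode usage can hit 100% while blocks are still free
df -i /var/log/ceph /var/lib/ceph/mon

# confirm which filesystems those paths actually live on
findmnt -T /var/log/ceph
findmnt -T /var/lib/ceph/mon

# size of the monitor store itself
du -sh /var/lib/ceph/mon/ceph-pve13/store.db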
 
PVE updated and restarted, still the same issue with the mon on the one node, pve13. The disk is almost empty, with only 11% used. Any ideas?
Update: attached the logs from journalctl --since '1 week ago' -u ceph-mon@pve13 >> ceph-mon_log.txt
 

The "out of space" message is confusing.

But before that, the log shows that it cannot authenticate with the other MONs.
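A quick way to verify that is to compare the key the cluster has for the monitors with the one this daemon uses locally (default keyring path for a monitor named pve13 assumed); a large clock skew between the nodes can also break mon authentication:

# key the cluster expects the monitors to use
ceph auth get mon.

# key the local daemon actually presents
cat /var/lib/ceph/mon/ceph-pve13/keyring

# rule out clock skew between the nodes
timedatectl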

Is the cluster healthy except for this one MON?

I would then just throw it away and create a new MON instance.
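If you go that route, the Proxmox tooling does most of it; roughly, on pve13 (assuming the default data dir /var/lib/ceph/mon/ceph-pve13):

# remove the broken monitor from the cluster
pveceph mon destroy pve13

# only if the destroy leaves anything behind
rm -rf /var/lib/ceph/mon/ceph-pve13

# create a fresh monitor on this node
pveceph mon create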
When the mon service is active, the cluster is healthy. The problem started after the Ceph update to Quincy (done via the Proxmox upgrade procedure, before the PVE update from version 7 to 8).
And the ceph -s output:
cluster:
id: bf79845c-f78b-4b28-8bf9-85fb8d320a38
health: HEALTH_OK

services:
mon: 3 daemons, quorum pve11,pve12,pve13 (age 10h)
mgr: pve11(active, since 5w), standbys: pve12, pve13
osd: 18 osds: 18 up (since 4d), 18 in (since 13M)

data:
pools: 3 pools, 545 pgs
objects: 598.68k objects, 2.3 TiB
usage: 6.7 TiB used, 17 TiB / 24 TiB avail
pgs: 545 active+clean

io:
client: 1.2 MiB/s rd, 2.6 MiB/s wr, 32 op/s rd, 121 op/s wr
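Since the cluster is healthy right now, it may be worth watching what actually eats the space on pve13 before the next failure; a rough check of the thresholds the mon complains about and of the usual growth candidates (option names from the log, paths assumed from the default layout):

# thresholds behind the 'available storage space' warning
ceph config get mon mon_data_avail_warn
ceph config get mon mon_data_avail_crit

# growth of the monitor store and the ceph logs on pve13
du -sh /var/lib/ceph/mon/ceph-pve13/store.db /var/log/ceph

# compact the monitor store if it has grown large
ceph tell mon.pve13 compact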