CEPH MON issue on one node

mojsiuk

Active Member
Hi, I have an issue with the Ceph mon on one node that started a few weeks after the update to version 17.2.7. Below are the mon logs from syslog.
Once every few days the service stops and I have to start it manually. It then works fine for a few days until these errors come back.
The system disk has plenty of free space. It is on hardware RAID 1 and SMART on the disks is fine.
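For reference, restarting it manually looks roughly like this (standard systemd commands; the reset-failed step clears the "Start request repeated too quickly" state seen in the logs below):

# see why the unit died and its restart history
systemctl status ceph-mon@pve13
journalctl -u ceph-mon@pve13 --since '24 hours ago'

# clear the failed state and start the monitor again
systemctl reset-failed ceph-mon@pve13
systemctl start ceph-mon@pve13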

Logs
Aug 20 01:09:03 pve13.xxx systemd[1]: Started Ceph cluster monitor daemon.
Aug 20 01:09:03 pve13.xxx ceph-mon[594532]: problem writing to /var/log/ceph/ceph-mon.pve13.log: (28) No space left on device
Aug 20 01:09:03 pve13.xxx ceph-mon[594532]: 2024-08-20T01:09:03.966+0200 7f1c89e2da00 -1 error: monitor data filesystem reached concerning levels of available storage space (available: 0% 0 B)
Aug 20 01:09:03 pve13.xxx ceph-mon[594532]: you may adjust 'mon data avail crit' to a lower value to make this go away (default: 5%)
Aug 20 01:09:03 pve13.xxx systemd[1]: ceph-mon@pve13.service: Main process exited, code=exited, status=28/n/a
Aug 20 01:09:03 pve13.xxx systemd[1]: ceph-mon@pve13.service: Failed with result 'exit-code'.
Aug 20 01:09:14 pve13.xxx systemd[1]: ceph-mon@pve13.service: Scheduled restart job, restart counter is at 6.
Aug 20 01:09:14 pve13.xxx systemd[1]: Stopped Ceph cluster monitor daemon.
Aug 20 01:09:14 pve13.xxx systemd[1]: ceph-mon@pve13.service: Start request repeated too quickly.
Aug 20 01:09:14 pve13.xxx systemd[1]: ceph-mon@pve13.service: Failed with result 'exit-code'.
Aug 20 01:09:14 pve13.xxx systemd[1]: Failed to start Ceph cluster monitor daemon.
Aug 25 19:47:32 pve13.xxx systemd[1]: Started Ceph cluster monitor daemon.
Aug 25 19:47:32 pve13.xxx ceph-mon[2859316]: 2024-08-25T19:47:32.543+0200 7f5b17e67700 -1 mon.pve13@-1(???) e4 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Aug 25 19:47:32 pve13.xxx ceph-mon[2859316]: 2024-08-25T19:47:32.743+0200 7f5b17e67700 -1 mon.pve13@2(probing) e4 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Aug 25 19:47:33 pve13.xxx ceph-mon[2859316]: 2024-08-25T19:47:33.147+0200 7f5b17e67700 -1 mon.pve13@2(probing) e4 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Aug 25 19:47:33 pve13.xxx ceph-mon[2859316]: 2024-08-25T19:47:33.947+0200 7f5b17e67700 -1 mon.pve13@2(synchronizing) e4 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
 
root@pve13:~# df -hT /var/log/ceph/
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/pve-root ext4 94G 9.7G 80G 11% /
root@pve13:~#
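Since the root filesystem clearly has free blocks, the ENOSPC at mon start may be coming from exhausted inodes or from a different mount holding the mon data; a quick way to check (default Proxmox/Ceph paths assumed):

# inode usage can hit 100% while blocks are still free
df -i /var/log/ceph /var/lib/ceph/mon

# confirm which filesystems those paths actually live on
findmnt -T /var/log/ceph
findmnt -T /var/lib/ceph/mon

# size of the monitor store itself
du -sh /var/lib/ceph/mon/ceph-pve13/store.db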
 
PVE updated and restarted, still the same issue with the mon on the one node, pve13. The disk is almost empty, with only 11% used. Any ideas?
Update: attached the logs from journalctl --since '1 week ago' -u ceph-mon@pve13 >> ceph-mon_log.txt
 

The "out of space" message is confusing.

But before that, the log shows that it cannot authenticate with the other MONs.
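A quick way to verify that is to compare the key the cluster has for the monitors with the one this daemon uses locally (default keyring path for a monitor named pve13 assumed); a large clock skew between the nodes can also break mon authentication:

# key the cluster expects the monitors to use
ceph auth get mon.

# key the local daemon actually presents
cat /var/lib/ceph/mon/ceph-pve13/keyring

# rule out clock skew between the nodes
timedatectl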

Is the cluster healthy except for this one MON?

I would then just throw it away and create a new MON instance.
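If you go that route, the Proxmox tooling does most of it; roughly, on pve13 (assuming the default data dir /var/lib/ceph/mon/ceph-pve13):

# remove the broken monitor from the cluster
pveceph mon destroy pve13

# only if the destroy leaves anything behind
rm -rf /var/lib/ceph/mon/ceph-pve13

# create a fresh monitor on this node
pveceph mon create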
When the mon service is active, the cluster is healthy. The problem started after the Ceph update to Quincy (done via the Proxmox upgrade procedure, before the PVE update from version 7 to 8).
And the ceph -s output:
cluster:
id: bf79845c-f78b-4b28-8bf9-85fb8d320a38
health: HEALTH_OK

services:
mon: 3 daemons, quorum pve11,pve12,pve13 (age 10h)
mgr: pve11(active, since 5w), standbys: pve12, pve13
osd: 18 osds: 18 up (since 4d), 18 in (since 13M)

data:
pools: 3 pools, 545 pgs
objects: 598.68k objects, 2.3 TiB
usage: 6.7 TiB used, 17 TiB / 24 TiB avail
pgs: 545 active+clean

io:
client: 1.2 MiB/s rd, 2.6 MiB/s wr, 32 op/s rd, 121 op/s wr
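Since the cluster is healthy right now, it may be worth watching what actually eats the space on pve13 before the next failure; a rough check of the thresholds the mon complains about and of the usual growth candidates (option names from the log, paths assumed from the default layout):

# thresholds behind the 'available storage space' warning
ceph config get mon mon_data_avail_warn
ceph config get mon mon_data_avail_crit

# growth of the monitor store and the ceph logs on pve13
du -sh /var/lib/ceph/mon/ceph-pve13/store.db /var/log/ceph

# compact the monitor store if it has grown large
ceph tell mon.pve13 compact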