[SOLVED] Ceph HEALTH_WARN,1 mons down

samontetro

Hi,
I've been running a Proxmox VE 4.4-12 cluster with 3 nodes for a while, each node having 2 OSDs and running a monitor. For a few days now I have had one monitor down on one node and I don't understand how to track down the problem. Nothing obvious to me in /var/log/ceph: there is no recent information in ceph-mon.log (the file's timestamp is from 2017), and my ceph.log is also old (last written 2 days ago).
Rebooting the node does not solve the problem.

Code:
# ceph -s
    cluster 602c5599-4cb8-4f19-8c46-44bea575d6e0
     health HEALTH_WARN
            1 mons down, quorum 0,1 0,1
     monmap e3: 3 mons at {0=192.168.20.5:6789/0,1=192.168.20.6:6789/0,2=192.168.20.7:6789/0}
            election epoch 332, quorum 0,1 0,1
     osdmap e1385: 6 osds: 6 up, 6 in
            flags sortbitwise,require_jewel_osds
      pgmap v55074580: 64 pgs, 1 pools, 476 GB data, 119 kobjects
            1429 GB used, 43240 GB / 44670 GB avail
                  64 active+clean
  client io 9974 B/s wr, 0 op/s rd, 0 op/s wr

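For what it's worth, the quorum line above only lists mons 0 and 1, so the missing one is mon.2 (192.168.20.7). Standard Ceph commands to confirm which monitor is out of quorum (nothing specific to my setup):

Code:
# name the monitor(s) that are down
ceph health detail

# show the monmap and the current quorum members
ceph mon stat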

On this node I have a ceph-mon@2.service (2 is the id of the failing monitor)
Code:
# systemctl status ceph-mon@2.service
● ceph-mon@2.service - Ceph cluster monitor daemon
   Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled)
  Drop-In: /lib/systemd/system/ceph-mon@.service.d
           └─ceph-after-pve-cluster.conf
   Active: failed (Result: start-limit) since Mon 2019-11-18 09:47:59 CET; 1min 6s ago
  Process: 3415 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=28)
 Main PID: 3415 (code=exited, status=28)

Nov 18 09:47:49 proxmost3 systemd[1]: Unit ceph-mon@2.service entered failed state.
Nov 18 09:47:59 proxmost3 systemd[1]: ceph-mon@2.service holdoff time over, scheduling restart.
Nov 18 09:47:59 proxmost3 systemd[1]: Stopping Ceph cluster monitor daemon...
Nov 18 09:47:59 proxmost3 systemd[1]: Starting Ceph cluster monitor daemon...
Nov 18 09:47:59 proxmost3 systemd[1]: ceph-mon@2.service start request repeated too quickly, refusing to start.
Nov 18 09:47:59 proxmost3 systemd[1]: Failed to start Ceph cluster monitor daemon.
Nov 18 09:47:59 proxmost3 systemd[1]: Unit ceph-mon@2.service entered failed state.

Thanks for your advice.
 
Thanks Alwin for this wise advice, it solved my problem.

  1. On one of the 3 cluster nodes, / was nearly full because of some local VM dumps in /var/lib/vz/dump (same physical partition). I had already spotted this (it was the main difference between the 3 nodes), but the partition was not completely full (96% used, nearly 2 GB still available). Removing backup files to get back to 40% free space (16 GB) and rebooting was not, on its own, enough to bring the monitor back.
  2. Looking in syslog on my node proxmost3 was a good idea; there was a message:
    Code:
    Nov 18 09:47:49 proxmost3 ceph-mon[3415]: error: monitor data filesystem reached concerning levels of available storage space (available: 4% 1830 MB
    Nov 18 09:47:49 proxmost3 ceph-mon[3415]: you may adjust 'mon data avail crit' to a lower value to make this go away (default: 5%)
    Nov 18 09:47:49 proxmost3 systemd[1]: ceph-mon@2.service: main process exited, code=exited, status=28/n/a
    Nov 18 09:47:49 proxmost3 systemd[1]: Unit ceph-mon@2.service entered failed state.
    This confirmed that my problem was indeed a lack of available storage on the monitor's data filesystem.
  3. Running systemctl reset-failed ceph-mon@2.service, after freeing space in /var, is necessary to clear the failed/start-limit state and allow the monitor to start again. I was not aware of this command (see the sketch after this list).
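For anyone landing here, the full recovery sequence on the affected node looks roughly like this (a sketch of what worked for me, assuming the old vzdump backups under /var/lib/vz/dump are what fills / and that mon.2 is the failed monitor, as in my case):

Code:
# see how full / is and what the dumps take
df -h /
du -sh /var/lib/vz/dump

# remove or move old vzdump backups, then check free space again
df -h /

# clear the systemd failed/start-limit state and start the monitor
systemctl reset-failed ceph-mon@2.service
systemctl start ceph-mon@2.service

# verify all 3 mons are back in quorum
ceph -s

The log message also suggests lowering 'mon data avail crit' (default 5%), which can apparently be set in the [mon] section of ceph.conf, but that only hides the warning; actually freeing space on the monitor's filesystem is the real fix.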
 