Ceph OSD monitoring Problem

danto107

New Member
Dec 27, 2020
Hello,

I am trying to monitor Ceph OSDs. The check works fine when run as root, but it fails when run as the nagios user:

# /usr/lib/nagios/plugins/check_ceph_osd -H 172.X.X.X -C 1
OSD OK
Up OSDs: osd.19 osd.20 osd.21 osd.22 osd.24
Down+In OSDs:
Down+Out OSDs:
| 'osd_up'=5 'osd_down_in'=0;;1 'osd_down_out'=0;;1

# sudo -u nagios /usr/lib/nagios/plugins/check_ceph_osd -H 172.X.X.X -C 1 --id nagios --keyring /etc/ceph/client.nagios.keyring
OSD ERROR: 2021-01-17T16:35:13.205+0200 7fd9947d8700 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.nagios.keyring: (13) Permission denied


The problem is that the file is created inside /etc/pve/priv, and permissions inside this folder cannot be modified, so the nagios user cannot read the file:

# ls -lia /etc/pve/priv/
total 4
11 drwx------ 2 root www-data 0 Dec 31 14:28 .
1 drwxr-xr-x 2 root www-data 0 Jan 1 1970 ..
12 drwx------ 2 root www-data 0 Dec 31 14:28 acme
87395 -rw------- 1 root www-data 1675 Jan 17 14:29 authkey.key
67077 -rw------- 1 root www-data 1605 Jan 15 14:12 authorized_keys
23 drwx------ 2 root www-data 0 Dec 31 17:04 ceph
25 -rw------- 1 root www-data 151 Dec 31 15:31 ceph.client.admin.keyring
56861 -rw------- 1 root www-data 64 Jan 14 13:58 ceph.client.nagios.keyring
40943 -rw------- 1 root www-data 228 Dec 31 15:31 ceph.mon.keyring
67078 -rw------- 1 root www-data 3164 Jan 15 14:12 known_hosts
19 drwx------ 2 root www-data 0 Dec 31 14:28 lock
40941 -rw------- 1 root www-data 3243 Dec 31 14:28 pve-root-ca.key
40944 -rw------- 1 root www-data 3 Jan 8 22:44 pve-root-ca.srl
9476 drwx------ 2 root www-data 0 Jan 1 21:55 realm
29 drwx------ 2 root www-data 0 Jan 4 17:06 storage
40942 -rw------- 1 root www-data 0 Jan 2 20:27 tfa.cfg
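
To confirm which keyring path the monitoring user can actually open, a small check like the one below may help. This is only a sketch: `report_readable` is a hypothetical helper name, not part of the Nagios plugins, and it tests readability for whichever user runs it.

```shell
#!/bin/sh
# Hypothetical helper: report which of the given paths the current user
# can read. Returns non-zero if none of them is readable.
report_readable() {
    found=1
    for path in "$@"; do
        if [ -r "$path" ]; then
            echo "readable:   $path"
            found=0
        else
            # Also reached when a parent directory (e.g. mode 0700)
            # blocks access, as with /etc/pve/priv in the listing above.
            echo "unreadable: $path"
        fi
    done
    return "$found"
}
```

Run it as the nagios user (for example from a shell started with `sudo -u nagios sh`) and pass both paths from this thread: /etc/ceph/client.nagios.keyring and /etc/pve/priv/ceph.client.nagios.keyring.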


The problem affects only the OSD check; the health script runs fine:

# sudo -u nagios /usr/lib/nagios/plugins/check_ceph_health --id nagios --keyring /etc/ceph/client.nagios.keyring
HEALTH OK


Ceph is Octopus, PVE 6.3-3
 
Hello,

I already monitor the overall Ceph status. I also want to monitor the state of individual OSDs, so that I know when there is a problem with a disk.

Ceph might recover after a disk fails, so a health-only check might miss the disk failure.
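
One way to alert on individual OSDs regardless of the overall health state is to read the down counters from the performance data that check_ceph_osd prints. The following is a minimal sketch, assuming the perfdata format shown in the first post; `perf_value` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: pull one counter out of a plugin perfdata line.
# Usage: perf_value LABEL < plugin_output
perf_value() {
    # Perfdata entries look like 'label'=value;warn;crit
    sed -n "s/.*'$1'=\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# Sample line copied from the check_ceph_osd output in the first post.
sample="| 'osd_up'=5 'osd_down_in'=0;;1 'osd_down_out'=0;;1"

down_in=$(printf '%s\n' "$sample" | perf_value osd_down_in)
down_out=$(printf '%s\n' "$sample" | perf_value osd_down_out)
echo "down+in=$down_in down+out=$down_out"
```

A non-zero osd_down_in or osd_down_out value would then point to a failed OSD even if the cluster has already rebalanced.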
 
Are you sure the plugin "check_ceph_health" is working correctly? You could try setting the "noout" flag, which should change the health status to "HEALTH_WARN".

# sudo -u nagios /usr/lib/nagios/plugins/check_ceph_osd -H 172.X.X.X -C 1 --id nagios --keyring /etc/ceph/client.nagios.keyring
OSD ERROR: 2021-01-17T16:35:13.205+0200 7fd9947d8700 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.nagios.keyring: (13) Permission denied
Why do the two locations differ? You passed "/etc/ceph/client.nagios.keyring", but the error message refers to "/etc/pve/priv/ceph.client.nagios.keyring". If the former is a symlink, there is no reason not to copy the keyring to the default location; that should solve your problem.
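
Copying the keyring could look roughly like the sketch below. It assumes a "nagios" group exists and that /etc/ceph/ceph.client.nagios.keyring (Ceph's default `/etc/ceph/$cluster.$name.keyring` pattern) is where the client should look; whether Ceph actually picks it up also depends on any `keyring` setting in ceph.conf. `install_keyring` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: copy a keyring to a destination with restrictive
# permissions, optionally handing group ownership to the monitoring user.
# Usage: install_keyring SRC DEST [GROUP]
install_keyring() {
    src=$1
    dest=$2
    group=${3:-}
    [ -r "$src" ] || { echo "cannot read $src" >&2; return 1; }
    # 0640: owner read/write, group read, nothing for others.
    install -m 0640 "$src" "$dest" || return 1
    if [ -n "$group" ]; then
        chgrp "$group" "$dest" || return 1
    fi
}

# Example call, run as root (source path from this thread; the
# destination and the "nagios" group are assumptions):
#   install_keyring /etc/pve/priv/ceph.client.nagios.keyring \
#       /etc/ceph/ceph.client.nagios.keyring nagios
```

Because the copy is a regular file outside /etc/pve, its mode and group can be set normally, which is not possible for files under /etc/pve/priv.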