Ceph 12.2.1 update - Weird syslog

TwiX

Hi,

I've just updated a 3-node PVE 5.0 cluster to the latest Luminous packages.

Everything seems fine after the upgrade and reboot, but on one node I'm getting weird syslog entries about an "osd.12" service.

Code:
Oct 12 20:51:32 dc-prox-13 systemd[1]: ceph-osd@12.service: Service hold-off time over, scheduling restart.
Oct 12 20:51:32 dc-prox-13 systemd[1]: Stopped Ceph object storage daemon osd.12.
Oct 12 20:51:32 dc-prox-13 systemd[1]: Starting Ceph object storage daemon osd.12...
Oct 12 20:51:32 dc-prox-13 systemd[1]: Started Ceph object storage daemon osd.12.
Oct 12 20:51:32 dc-prox-13 ceph-osd[7157]: 2017-10-12 20:51:32.820011 7fc26f9e4e00 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-12: (2) No such file or directory
Oct 12 20:51:32 dc-prox-13 systemd[1]: ceph-osd@12.service: Main process exited, code=exited, status=1/FAILURE
Oct 12 20:51:32 dc-prox-13 systemd[1]: ceph-osd@12.service: Unit entered failed state.
Oct 12 20:51:32 dc-prox-13 systemd[1]: ceph-osd@12.service: Failed with result 'exit-code'.

But I only have 12 OSDs, from osd.0 to osd.11.
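
For reference, this is how I'm comparing the ceph-osd units systemd knows about with the OSD data directories actually present on the node (standard systemctl pattern matching and the default Ceph data path, nothing cluster-specific):
Code:
# any ceph-osd instances systemd currently has in a failed state
systemctl list-units --state=failed 'ceph-osd@*'
# OSD data directories actually present on this node
ls /var/lib/ceph/osd/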

My CRUSH map:
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host dc-prox-06 {
    id -3        # do not change unnecessarily
    id -2 class hdd        # do not change unnecessarily
    # weight 1.089
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 0.272
    item osd.1 weight 0.272
    item osd.2 weight 0.272
    item osd.3 weight 0.272
}
host dc-prox-07 {
    id -5        # do not change unnecessarily
    id -4 class hdd        # do not change unnecessarily
    # weight 1.089
    alg straw2
    hash 0    # rjenkins1
    item osd.4 weight 0.272
    item osd.5 weight 0.272
    item osd.6 weight 0.272
    item osd.7 weight 0.272
}
host dc-prox-13 {
    id -7        # do not change unnecessarily
    id -6 class hdd        # do not change unnecessarily
    # weight 1.089
    alg straw2
    hash 0    # rjenkins1
    item osd.8 weight 0.272
    item osd.9 weight 0.272
    item osd.10 weight 0.272
    item osd.11 weight 0.272
}
root default {
    id -1        # do not change unnecessarily
    id -8 class hdd        # do not change unnecessarily
    # weight 3.266
    alg straw2
    hash 0    # rjenkins1
    item dc-prox-06 weight 1.089
    item dc-prox-07 weight 1.089
    item dc-prox-13 weight 1.089
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
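
For completeness, the cluster's own OSD list can be cross-checked with the standard Ceph CLI (plain commands, no special flags):
Code:
# tree view of hosts and OSDs as the cluster sees them
ceph osd tree
# bare list of existing OSD IDs
ceph osd ls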

Any idea what's going on? :)

Thanks in advance, guys
 
Hi,
What is the output of the following commands on the node with "osd.12"?
Code:
ls -l /var/lib/ceph/osd/ceph-12
cat /var/lib/ceph/osd/ceph-12/whoami
mount | grep ceph-12
df -h /var/lib/ceph/osd/ceph-12
Udo
 
Thanks for your help :)

Code:
root@dc-prox-13:~# ls -l /var/lib/ceph/osd/ceph-12/
total 0
root@dc-prox-13:~# df -h /var/lib/ceph/osd/ceph-12/
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/pve-root   68G  3.1G   62G   5% /

lsblk
Code:
root@dc-prox-13:~# lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                    8:0    0 278.9G  0 disk
├─sda1                 8:1    0     1M  0 part
├─sda2                 8:2    0   256M  0 part
└─sda3                 8:3    0 278.6G  0 part
  ├─pve-root         253:0    0  69.5G  0 lvm  /
  ├─pve-swap         253:1    0     8G  0 lvm  [SWAP]
  ├─pve-data_tmeta   253:2    0    96M  0 lvm 
  │ └─pve-data-tpool 253:4    0 185.1G  0 lvm 
  │   └─pve-data     253:5    0 185.1G  0 lvm 
  └─pve-data_tdata   253:3    0 185.1G  0 lvm 
    └─pve-data-tpool 253:4    0 185.1G  0 lvm 
      └─pve-data     253:5    0 185.1G  0 lvm 
sdb                    8:16   0 278.9G  0 disk
├─sdb1                 8:17   0   100M  0 part /var/lib/ceph/osd/ceph-8
└─sdb2                 8:18   0 278.8G  0 part
sdc                    8:32   0 278.9G  0 disk
├─sdc1                 8:33   0   100M  0 part /var/lib/ceph/osd/ceph-9
└─sdc2                 8:34   0 278.8G  0 part
sdd                    8:48   0 278.9G  0 disk
├─sdd1                 8:49   0   100M  0 part /var/lib/ceph/osd/ceph-10
└─sdd2                 8:50   0 278.8G  0 part
sde                    8:64   0 278.9G  0 disk
├─sde1                 8:65   0   100M  0 part /var/lib/ceph/osd/ceph-11
└─sde2                 8:66   0 278.8G  0 part
sr0                   11:0    1  1024M  0 rom

LVM settings:
Code:
root@dc-prox-13:~# lvs
  LV   VG  Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  data pve twi-aotz-- 185.12g             0.00   0.42                           
  root pve -wi-ao----  69.50g                                                   
  swap pve -wi-ao----   8.00g                                                   
root@dc-prox-13:~# pvs
  PV         VG  Fmt  Attr PSize   PFree
  /dev/sda3  pve lvm2 a--  278.62g 15.81g

Thanks again ;)
 
Did you at some point have an OSD with ID 12? There was/is an issue in Ceph Luminous where OSD units are enabled persistently instead of for one boot only. After destroying an OSD, its unit is still there but of course fails to start. If you are sure there is not supposed to be an OSD with ID 12 on that node, you can simply disable the unit ("systemctl disable ceph-osd@ID") and the log spam stops ;)

The current Ceph packages in the test repo (12.2.1-pve3) already contain the fix, so this should no longer happen (and it will be cleared up for actually existing OSDs on the next reboot), but old leftover units for OSDs that no longer exist need to be cleaned up manually.
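
Applied to your case, the cleanup would look something like this (only if you are sure osd.12 was never supposed to exist on that node; the rmdir is only safe because your ls showed the directory is empty):
Code:
# stop the restart loop and remove the leftover unit instance
systemctl stop ceph-osd@12.service
systemctl disable ceph-osd@12.service
# clear the failed state so it no longer shows up in systemctl
systemctl reset-failed ceph-osd@12.service
# optional: remove the empty leftover mount point directory
rmdir /var/lib/ceph/osd/ceph-12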
 
Hi,

I don't think so. I only have 4 disks dedicated to Ceph.
But you're right, I've recreated the OSDs on this node twice (by destroying and recreating them from the GUI).
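
I'll check which ceph-osd instances are still enabled on that node. If I understand the stock unit files correctly (ceph-osd@.service is WantedBy=ceph-osd.target), any leftover instance should show up as a symlink here:
Code:
# enabled ceph-osd instances are symlinked into the target's wants directory
ls -l /etc/systemd/system/ceph-osd.target.wants/
# or check the specific instance directly
systemctl is-enabled ceph-osd@12.service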
 
