Ceph 12.2.1 update - Weird syslog

TwiX

Hi,

I've just updated a 3-node PVE 5.0 cluster to the latest Luminous packages.

Everything seems fine after the upgrade and reboot, but on one node I'm getting weird syslog entries about an "osd.12" service.

Code:
Oct 12 20:51:32 dc-prox-13 systemd[1]: ceph-osd@12.service: Service hold-off time over, scheduling restart.
Oct 12 20:51:32 dc-prox-13 systemd[1]: Stopped Ceph object storage daemon osd.12.
Oct 12 20:51:32 dc-prox-13 systemd[1]: Starting Ceph object storage daemon osd.12...
Oct 12 20:51:32 dc-prox-13 systemd[1]: Started Ceph object storage daemon osd.12.
Oct 12 20:51:32 dc-prox-13 ceph-osd[7157]: 2017-10-12 20:51:32.820011 7fc26f9e4e00 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-12: (2) No such file or directory
Oct 12 20:51:32 dc-prox-13 systemd[1]: ceph-osd@12.service: Main process exited, code=exited, status=1/FAILURE
Oct 12 20:51:32 dc-prox-13 systemd[1]: ceph-osd@12.service: Unit entered failed state.
Oct 12 20:51:32 dc-prox-13 systemd[1]: ceph-osd@12.service: Failed with result 'exit-code'.

But I only have 12 OSDs, from osd.0 to osd.11.
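
For reference, this is how I'm comparing the ceph-osd units systemd knows about with the OSD data directories actually present on the node (standard systemctl pattern matching and the default Ceph data path, nothing cluster-specific):
Code:
# any ceph-osd instances systemd currently has in a failed state
systemctl list-units --state=failed 'ceph-osd@*'
# OSD data directories actually present on this node
ls /var/lib/ceph/osd/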

My CRUSH map:
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host dc-prox-06 {
    id -3        # do not change unnecessarily
    id -2 class hdd        # do not change unnecessarily
    # weight 1.089
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 0.272
    item osd.1 weight 0.272
    item osd.2 weight 0.272
    item osd.3 weight 0.272
}
host dc-prox-07 {
    id -5        # do not change unnecessarily
    id -4 class hdd        # do not change unnecessarily
    # weight 1.089
    alg straw2
    hash 0    # rjenkins1
    item osd.4 weight 0.272
    item osd.5 weight 0.272
    item osd.6 weight 0.272
    item osd.7 weight 0.272
}
host dc-prox-13 {
    id -7        # do not change unnecessarily
    id -6 class hdd        # do not change unnecessarily
    # weight 1.089
    alg straw2
    hash 0    # rjenkins1
    item osd.8 weight 0.272
    item osd.9 weight 0.272
    item osd.10 weight 0.272
    item osd.11 weight 0.272
}
root default {
    id -1        # do not change unnecessarily
    id -8 class hdd        # do not change unnecessarily
    # weight 3.266
    alg straw2
    hash 0    # rjenkins1
    item dc-prox-06 weight 1.089
    item dc-prox-07 weight 1.089
    item dc-prox-13 weight 1.089
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
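
For completeness, the cluster's own OSD list can be cross-checked with the standard Ceph CLI (plain commands, no special flags):
Code:
# tree view of hosts and OSDs as the cluster sees them
ceph osd tree
# bare list of existing OSD IDs
ceph osd ls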

Any idea what's going on? :)

Thanks in advance, guys
 
Hi,
What is the output of the following commands on the node with "osd.12"?
Code:
ls -l /var/lib/ceph/osd/ceph-12
cat /var/lib/ceph/osd/ceph-12/whoami
mount | grep ceph-12
df -h /var/lib/ceph/osd/ceph-12
Udo
 
Thanks for your help :)

Code:
root@dc-prox-13:~# ls -l /var/lib/ceph/osd/ceph-12/
total 0
root@dc-prox-13:~# df -h /var/lib/ceph/osd/ceph-12/
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/pve-root   68G  3.1G   62G   5% /

lsblk
Code:
root@dc-prox-13:~# lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                    8:0    0 278.9G  0 disk
├─sda1                 8:1    0     1M  0 part
├─sda2                 8:2    0   256M  0 part
└─sda3                 8:3    0 278.6G  0 part
  ├─pve-root         253:0    0  69.5G  0 lvm  /
  ├─pve-swap         253:1    0     8G  0 lvm  [SWAP]
  ├─pve-data_tmeta   253:2    0    96M  0 lvm 
  │ └─pve-data-tpool 253:4    0 185.1G  0 lvm 
  │   └─pve-data     253:5    0 185.1G  0 lvm 
  └─pve-data_tdata   253:3    0 185.1G  0 lvm 
    └─pve-data-tpool 253:4    0 185.1G  0 lvm 
      └─pve-data     253:5    0 185.1G  0 lvm 
sdb                    8:16   0 278.9G  0 disk
├─sdb1                 8:17   0   100M  0 part /var/lib/ceph/osd/ceph-8
└─sdb2                 8:18   0 278.8G  0 part
sdc                    8:32   0 278.9G  0 disk
├─sdc1                 8:33   0   100M  0 part /var/lib/ceph/osd/ceph-9
└─sdc2                 8:34   0 278.8G  0 part
sdd                    8:48   0 278.9G  0 disk
├─sdd1                 8:49   0   100M  0 part /var/lib/ceph/osd/ceph-10
└─sdd2                 8:50   0 278.8G  0 part
sde                    8:64   0 278.9G  0 disk
├─sde1                 8:65   0   100M  0 part /var/lib/ceph/osd/ceph-11
└─sde2                 8:66   0 278.8G  0 part
sr0                   11:0    1  1024M  0 rom

LVM settings:
Code:
root@dc-prox-13:~# lvs
  LV   VG  Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  data pve twi-aotz-- 185.12g             0.00   0.42                           
  root pve -wi-ao----  69.50g                                                   
  swap pve -wi-ao----   8.00g                                                   
root@dc-prox-13:~# pvs
  PV         VG  Fmt  Attr PSize   PFree
  /dev/sda3  pve lvm2 a--  278.62g 15.81g

Thanks again ;)
 
Did you at some point have an OSD with ID 12? There was/is an issue in Ceph Luminous where OSD units are enabled persistently instead of for one boot only. After destroying an OSD, its unit is still there but of course fails to start. If you are sure there is not supposed to be an OSD with ID 12 on that node, you can simply disable the unit ("systemctl disable ceph-osd@ID") and the log spam stops ;)

The current Ceph packages in the test repo (12.2.1-pve3) already contain the fix, so this should no longer happen (and it will be cleared up for actually existing OSDs on the next reboot), but old leftover units for OSDs that no longer exist need to be cleaned up manually.
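
Applied to your case, the cleanup would look something like this (only if you are sure osd.12 was never supposed to exist on that node; the rmdir is only safe because your ls showed the directory is empty):
Code:
# stop the restart loop and remove the leftover unit instance
systemctl stop ceph-osd@12.service
systemctl disable ceph-osd@12.service
# clear the failed state so it no longer shows up in systemctl
systemctl reset-failed ceph-osd@12.service
# optional: remove the empty leftover mount point directory
rmdir /var/lib/ceph/osd/ceph-12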
 
Hi,

I don't think so. I only have 4 disks dedicated to Ceph.
But you're right, I've recreated the OSDs on this node twice (by destroying and recreating them from the GUI).
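
I'll check which ceph-osd instances are still enabled on that node. If I understand the stock unit files correctly (ceph-osd@.service is WantedBy=ceph-osd.target), any leftover instance should show up as a symlink here:
Code:
# enabled ceph-osd instances are symlinked into the target's wants directory
ls -l /etc/systemd/system/ceph-osd.target.wants/
# or check the specific instance directly
systemctl is-enabled ceph-osd@12.service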
 
