Ceph Pacific (16.2.6) - Some OSDs fail to activate at boot

We have uncovered a problem with Ceph Pacific OSDs not always starting automatically after a node is restarted. It is relatively prevalent: roughly 70% of the time a restarted node comes back with a single OSD in this state, and while handling a rolling upgrade on our clusters this weekend we had one node with two OSDs affected.

The problem appears to be that the Linux kernel doesn't identify the partition as a Ceph data volume and consequently doesn't set the correct ownership on the device node. Manually fixing the permissions and then restarting the OSD service results in the device being identified properly thereafter.

i.e. blkid /dev/sdb2 returns nothing while the device is in the problem state; once the OSD is up again, the same command returns the expected information.
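
As background, the ceph:ceph ownership on these device nodes is normally applied by Ceph's udev rules, which only fire when udev's blkid probe reports the expected partition metadata. A quick way to inspect what udev currently believes about the partition, and to replay the rules instead of chown'ing by hand, is shown below (illustrative diagnostic commands, not part of the original report):
Code:
# Show the properties udev recorded for the partition; in the failure state the
# ID_FS_TYPE / ID_PART_ENTRY_NAME entries for the "ceph block" partition will
# typically be missing:
udevadm info --query=property --name=/dev/sdb2 | grep -E 'ID_FS_TYPE|ID_PART_ENTRY_(NAME|TYPE)'
# Ask udev to re-probe the partition and re-run its rules:
udevadm trigger --name-match=/dev/sdb2 --action=add
udevadm settle

The transcript below shows one affected node (kvm5c) with osd.32 down: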

Code:
[admin@kvm5c ~]# ceph osd df
ID    CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
1010   nvme  2.91089   1.00000  2.9 TiB  1.5 GiB  854 MiB  424 MiB  237 MiB  2.9 TiB   0.05  0.00   73      up
  10    ssd  5.79149   1.00000  5.8 TiB  2.5 TiB  2.5 TiB  314 MiB  6.9 GiB  3.3 TiB  42.43  1.01   66      up
  11    ssd  5.79149   1.00000  5.8 TiB  2.7 TiB  2.7 TiB  297 MiB  7.5 GiB  3.0 TiB  47.44  1.13   74      up
  12    ssd  5.82179   1.00000  5.8 TiB  2.3 TiB  2.3 TiB  111 MiB  7.3 GiB  3.5 TiB  40.09  0.95   65      up
1020   nvme  2.91089   1.00000  2.9 TiB  1.3 GiB  676 MiB  353 MiB  284 MiB  2.9 TiB   0.04  0.00   64      up
  20    ssd  5.79149   1.00000  5.8 TiB  2.6 TiB  2.6 TiB  204 MiB  6.6 GiB  3.2 TiB  44.79  1.06   72      up
  21    ssd  5.79149   1.00000  5.8 TiB  2.6 TiB  2.6 TiB  134 MiB  6.7 GiB  3.2 TiB  45.19  1.07   70      up
  22    ssd  5.82179   1.00000  5.8 TiB  2.7 TiB  2.7 TiB  119 MiB  8.1 GiB  3.1 TiB  45.93  1.09   71      up
1030   nvme  2.91089   1.00000  2.9 TiB  1.4 GiB  860 MiB  388 MiB  151 MiB  2.9 TiB   0.05  0.00   70      up
  30    ssd  5.79149   1.00000  5.8 TiB  2.7 TiB  2.6 TiB  231 MiB  7.7 GiB  3.1 TiB  45.83  1.09   59      up
  31    ssd  5.79149   1.00000  5.8 TiB  2.7 TiB  2.6 TiB  220 MiB  7.5 GiB  3.1 TiB  45.87  1.09   46      up
  32    ssd  5.82179   1.00000  5.8 TiB  2.7 TiB  2.7 TiB  118 MiB  8.7 GiB  3.1 TiB  46.63  1.11    0    down
1040   nvme  2.91089   1.00000  2.9 TiB  1.5 GiB  828 MiB  375 MiB  317 MiB  2.9 TiB   0.05  0.00   73      up
  40    ssd  5.79149   1.00000  5.8 TiB  3.2 TiB  3.2 TiB  214 MiB  8.5 GiB  2.6 TiB  55.68  1.32   78      up
  41    ssd  5.79149   1.00000  5.8 TiB  2.7 TiB  2.7 TiB  116 MiB  8.1 GiB  3.1 TiB  45.96  1.09   66      up
  42    ssd  5.82179   1.00000  5.8 TiB  2.6 TiB  2.6 TiB  120 MiB  8.2 GiB  3.2 TiB  45.35  1.08   70      up
1050   nvme  2.91089   1.00000  2.9 TiB  1.5 GiB  870 MiB  403 MiB  309 MiB  2.9 TiB   0.05  0.00   82      up
  50    ssd  5.79149   1.00000  5.8 TiB  2.8 TiB  2.8 TiB  221 MiB  7.6 GiB  3.0 TiB  48.12  1.14   75      up
  51    ssd  5.79149   1.00000  5.8 TiB  2.9 TiB  2.9 TiB  225 MiB  8.2 GiB  2.9 TiB  50.41  1.20   76      up
  52    ssd  5.82179   1.00000  5.8 TiB  2.8 TiB  2.8 TiB  126 MiB  8.9 GiB  3.0 TiB  48.15  1.14   74      up
1060   nvme  2.91089   1.00000  2.9 TiB  1.2 GiB  768 MiB  409 MiB   98 MiB  2.9 TiB   0.04     0   73      up
  60    ssd  5.79149   1.00000  5.8 TiB  2.8 TiB  2.8 TiB  236 MiB  8.0 GiB  3.0 TiB  48.70  1.16   72      up
  61    ssd  5.79149   1.00000  5.8 TiB  2.7 TiB  2.7 TiB  230 MiB  7.5 GiB  3.1 TiB  46.72  1.11   70      up
  62    ssd  5.82179   1.00000  5.8 TiB  2.7 TiB  2.7 TiB  122 MiB  8.5 GiB  3.1 TiB  45.99  1.09   71      up
  70    ssd  5.79149   1.00000  5.8 TiB  2.7 TiB  2.7 TiB   95 MiB  8.0 GiB  3.1 TiB  46.09  1.10   68      up
  71    ssd  5.79149   1.00000  5.8 TiB  2.7 TiB  2.7 TiB  114 MiB  7.9 GiB  3.1 TiB  46.94  1.12   69      up
  72    ssd  5.82179   1.00000  5.8 TiB  2.5 TiB  2.5 TiB  116 MiB  8.4 GiB  3.3 TiB  43.61  1.04   68      up
  80    ssd  5.79149   1.00000  5.8 TiB  2.6 TiB  2.6 TiB  105 MiB  7.2 GiB  3.2 TiB  45.54  1.08   67      up
  81    ssd  5.79149   1.00000  5.8 TiB  2.6 TiB  2.6 TiB  100 MiB  7.3 GiB  3.2 TiB  44.53  1.06   67      up
  82    ssd  5.82179   1.00000  5.8 TiB  2.4 TiB  2.4 TiB  119 MiB  7.8 GiB  3.4 TiB  40.93  0.97   63      up
  90    ssd  5.79149   1.00000  5.8 TiB  2.7 TiB  2.7 TiB   75 MiB  7.2 GiB  3.1 TiB  46.76  1.11   71      up
  91    ssd  5.79149   1.00000  5.8 TiB  2.6 TiB  2.6 TiB   90 MiB  7.2 GiB  3.2 TiB  45.01  1.07   67      up
  92    ssd  5.82179   1.00000  5.8 TiB  2.7 TiB  2.7 TiB   99 MiB  8.4 GiB  3.1 TiB  46.23  1.10   71      up
 100    ssd  5.79149   1.00000  5.8 TiB  2.5 TiB  2.5 TiB  119 MiB  8.2 GiB  3.3 TiB  43.48  1.03   70      up
 101    ssd  5.79149   1.00000  5.8 TiB  2.5 TiB  2.5 TiB  109 MiB  7.9 GiB  3.3 TiB  43.20  1.03   70      up
 102    ssd  5.82179   1.00000  5.8 TiB  2.8 TiB  2.8 TiB   95 MiB  8.7 GiB  3.0 TiB  48.90  1.16   71      up
 110    ssd  5.79149   1.00000  5.8 TiB  2.5 TiB  2.5 TiB  112 MiB  7.8 GiB  3.3 TiB  43.18  1.03   65      up
 111    ssd  5.79149   1.00000  5.8 TiB  2.5 TiB  2.5 TiB  115 MiB  7.6 GiB  3.3 TiB  43.21  1.03   68      up
 112    ssd  5.82179   1.00000  5.8 TiB  2.8 TiB  2.8 TiB   82 MiB  8.4 GiB  3.0 TiB  48.19  1.15   75      up
                         TOTAL  209 TiB   88 TiB   88 TiB  7.1 GiB  260 GiB  121 TiB  42.08
MIN/MAX VAR: 0/1.32  STDDEV: 17.06
[admin@kvm5c ~]# df -h
Filesystem                          Size  Used Avail Use% Mounted on
udev                                315G     0  315G   0% /dev
tmpfs                                63G  3.4M   63G   1% /run
/dev/md0                             30G  7.0G   22G  25% /
tmpfs                               315G   57M  315G   1% /dev/shm
tmpfs                               5.0M     0  5.0M   0% /run/lock
/dev/sda4                            94M  5.5M   89M   6% /var/lib/ceph/osd/ceph-30
/dev/sdb1                            94M  5.5M   89M   6% /var/lib/ceph/osd/ceph-32
/dev/nvme0n1p1                       94M  5.5M   89M   6% /var/lib/ceph/osd/ceph-1030
/dev/sdc4                            94M  5.5M   89M   6% /var/lib/ceph/osd/ceph-31
/dev/fuse                           128M  356K  128M   1% /etc/pve
10.254.1.3,10.254.1.4,10.254.1.5:/   26T   41G   26T   1% /mnt/pve/cephfs
[admin@kvm5c ~]# tail -f /var/log/ceph/ceph-osd.32.log
2021-11-27T08:57:08.883+0200 7fc9afd2ff00 -1 bdev(0x55f155ce4400 /var/lib/ceph/osd/ceph-32/block) open open got: (13) Permission denied
2021-11-27T08:57:08.883+0200 7fc9afd2ff00  0 osd.32:4.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2021-11-27T08:57:08.883+0200 7fc9afd2ff00 -1 bluestore(/var/lib/ceph/osd/ceph-32/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-32/block: (13) Permission denied
2021-11-27T08:57:08.883+0200 7fc9afd2ff00  1 bluestore(/var/lib/ceph/osd/ceph-32) _mount path /var/lib/ceph/osd/ceph-32
2021-11-27T08:57:08.883+0200 7fc9afd2ff00  0 bluestore(/var/lib/ceph/osd/ceph-32) _open_db_and_around read-only:0 repair:0
2021-11-27T08:57:08.883+0200 7fc9afd2ff00 -1 bluestore(/var/lib/ceph/osd/ceph-32/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-32/block: (13) Permission denied
2021-11-27T08:57:08.883+0200 7fc9afd2ff00  1 bdev(0x55f155ce4400 /var/lib/ceph/osd/ceph-32/block) open path /var/lib/ceph/osd/ceph-32/block
2021-11-27T08:57:08.883+0200 7fc9afd2ff00 -1 bdev(0x55f155ce4400 /var/lib/ceph/osd/ceph-32/block) open open got: (13) Permission denied
2021-11-27T08:57:08.883+0200 7fc9afd2ff00 -1 osd.32 0 OSD:init: unable to mount object store
2021-11-27T08:57:08.883+0200 7fc9afd2ff00 -1  ** ERROR: osd init failed: (13) Permission denied
^C

The permission denied error occurs because 'blkid /dev/sdb2' returns nothing, so the Ceph block partition is not identified and its ownership is never adjusted:
Code:
[admin@kvm5c ~]# dir /dev/sdb2
brw-rw---- 1 admin disk 8, 18 Nov 27 08:56 /dev/sdb2
[admin@kvm5c ~]# blkid /dev/sdb1
/dev/sdb1: UUID="26dfb0f4-d670-4a39-a1db-6ecb66fdc025" BLOCK_SIZE="4096" TYPE="xfs" PARTLABEL="ceph data" PARTUUID="dbe970e0-c8d9-4365-b189-c60fa206e4a0"
[admin@kvm5c ~]# blkid /dev/sdb2
[admin@kvm5c ~]# chown ceph.ceph /dev/sdb2
[admin@kvm5c ~]# systemctl reset-failed; systemctl restart ceph-osd@32
# The following only updates once the OSD is back up again:
#  [admin@kvm5c ~]# blkid /dev/sdb2
#  /dev/sdb2: TYPE="ceph_bluestore" PARTLABEL="ceph block" PARTUUID="fa72ae1e-168a-4acc-934a-5edef24c9b06"
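
To confirm the fix took, one can check the resolved block device's ownership and the OSD service state (illustrative commands, using osd.32 from above):
Code:
# Verify the block symlink now resolves to a device node owned by ceph:ceph:
stat -c '%U:%G %n' "$(readlink -f /var/lib/ceph/osd/ceph-32/block)"
# Confirm the OSD service came back up and no OSDs remain down cluster-wide:
systemctl status ceph-osd@32 --no-pager
ceph osd tree down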


The full OSD log (ceph_log.txt), should it be relevant, is attached to this post.

At heart, this issue is caused by `blkid` returning an ambiguous result for the block device, so udev does not apply the ceph:ceph ownership to the device node when it initialises it. The root cause is that `blkid` finds other filesystem signatures inside the block device, which is to be expected, since the block device contains data from guest images.

A very similar issue is discussed at length here:
https://bugs.launchpad.net/ubuntu/+source/util-linux/+bug/1858802
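
If you want to see the ambiguity for yourself, the read-only signature listing below may reveal several competing superblocks on the partition (an illustrative diagnostic, not part of the workaround):
Code:
# List every filesystem/partition signature libblkid can detect; stale signatures
# from guest images stored inside the bluestore device can make the result
# ambiguous. Plain wipefs only prints signatures, it erases nothing (never use -a here).
wipefs /dev/sdb2
# Low-level probe that bypasses the blkid cache:
blkid -p /dev/sdb2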


The workaround we put in place was to add the following to the system initialisation script (/etc/rc.local):
Code:
# Fix block device ownership due to ambiguous block device identification:
for SYM in /var/lib/ceph/osd/ceph-*/block; do
  DEV=$(readlink -f "$SYM")
  # Only touch real block devices whose node is not already owned by ceph:ceph
  if [ -b "$DEV" ]; then
    if [ "$(stat -c '%U:%G' "$DEV")" != "ceph:ceph" ]; then
      chown ceph:ceph "$DEV"
      OSD=$(echo "$SYM" | perl -pe 's/\/var\/lib\/ceph\/osd\/ceph-(\d+).*/$1/')
      systemctl reset-failed
      systemctl restart "ceph-osd@$OSD"
    fi
  fi
done
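
One caveat (an assumption about Debian-based systems such as Proxmox VE): /etc/rc.local is only executed at boot by the systemd rc-local compatibility unit if the file starts with a shebang (e.g. #!/bin/sh) and is executable:
Code:
# Make the script executable so the rc-local unit picks it up at boot:
chmod +x /etc/rc.local
systemctl status rc-local.service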
 
Thank you for the update and the workaround!

Since it mentions /dev/sdb2, it seems the OSDs are older bluestore ones? From Ceph Nautilus (14) onwards, ceph-volume is used to create OSDs, and it uses LVM instead of those two partitions. It may be helpful to recreate those OSDs one after the other (waiting for all data to be backfilled again in between) so the new layout is used.
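
For anyone following along, a rough sketch of that rolling recreation on Proxmox might look like the following (command names and flags assume current pveceph tooling and are from memory; check the pveceph man page, and always wait for HEALTH_OK between OSDs):
Code:
# Rough sketch for one OSD (here osd.32); repeat per OSD, strictly one at a time.
ceph osd out 32                      # drain the OSD
# ...wait until `ceph -s` reports all PGs active+clean...
systemctl stop ceph-osd@32
pveceph osd destroy 32 --cleanup     # remove the OSD and clean up the disk
pveceph osd create /dev/sdb          # recreate it as an LVM-based (ceph-volume) OSD
# ...wait for backfill to finish before touching the next OSD...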
 
