CEPH 17.2.7 - "ceph device ls" is wrong

dlasher · Feb 21, 2024

Just ran into this in the lab, haven't gone digging in prod yet.

Code:

pve-manager/8.1.3/b46aac3b42da5d15 (running kernel: 6.2.16-20-pve)

Cluster is alive, working, zero issues, everything in GUI is happy, 100% alive -- however... the "ceph device" table appears to have NOT updated itself for a *very* long time, probably 2+ years? Not only are 30+ devices missing from the list, but it also still lists devices that are no longer present.

Code:

root@pmx1:~# ceph device ls
<snip>
INTEL_SSDSC2BA800G3_BTTV504000EU800JGN     pmx2:sdd  osd.19     0%
INTEL_SSDSC2BA800G3_BTTV5040012Z800JGN     pmx5:sdb  osd.39     1%
INTEL_SSDSC2BA800G3_BTTV5040019C800JGN                          1%
INTEL_SSDSC2BA800G3_BTTV504001ML800JGN                          1%
INTEL_SSDSC2BA800G3_BTTV504002QC800JGN                          0%
INTEL_SSDSC2BA800G3_BTTV504002UM800JGN                          1%
INTEL_SSDSC2BA800G4_BTHV505202SE800OGN     pmx2:sda  osd.3      0%
INTEL_SSDSC2BA800G4_BTHV513605MW800OGN     pmx1:sda  osd.4      0%
INTEL_SSDSC2BA800G4_BTHV535200VM800OGN     pmx5:sdj  osd.0      1%
</snip>

root@pmx1:~# ceph device ls-by-host pmx1
DEVICE                                  DEV  DAEMONS  EXPECTED FAILURE
INTEL_SSDSC2BA800G3_BTTV451404HT800JGN  sdh  osd.25
INTEL_SSDSC2BA800G4_BTHV513605MW800OGN  sda  osd.4

"ceph osd df" for that host looks *much* different (and is correct)

Code:

ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
101    hdd  7.27739   1.00000  7.3 TiB  4.0 TiB  4.0 TiB   31 KiB   11 GiB  3.2 TiB  55.43  1.06   79      up
102    hdd  7.27699   1.00000  7.3 TiB  4.0 TiB  4.0 TiB   32 KiB   10 GiB  3.3 TiB  54.69  1.05   79      up
103    hdd  7.27739   1.00000  7.3 TiB  4.0 TiB  4.0 TiB   34 KiB  9.4 GiB  3.3 TiB  55.24  1.06   82      up
104    hdd  7.27699   1.00000  7.3 TiB  4.7 TiB  4.7 TiB   29 KiB   12 GiB  2.6 TiB  64.29  1.23   89      up
105    hdd  7.27739   1.00000  7.3 TiB  4.1 TiB  4.1 TiB   28 KiB   10 GiB  3.1 TiB  56.93  1.09   83      up
  4    ssd  0.72769   1.00000  745 GiB  388 GiB  383 GiB  1.1 GiB  4.0 GiB  357 GiB  52.13  1.00  116      up
 25    ssd  0.72800   1.00000  745 GiB  462 GiB  457 GiB  1.4 GiB  3.3 GiB  283 GiB  62.00  1.19  145      up

No idea how to get ceph to rebuild the devices table -- an hour on google, and I can't find any answers. How do you make it match?
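
One way to quantify the mismatch before looking for a fix is to list which OSD IDs never appear in the device table. The sketch below assumes the output of `ceph device ls` and `ceph osd ls` has been saved to two files. `missing_osds` and the file names are made up for illustration, and the parsing assumes daemons appear as `osd.N` tokens, as in the listing above.

```shell
# missing_osds DEVICE_LS_FILE OSD_LS_FILE
# Prints every OSD id from OSD_LS_FILE (one id per line) that is not
# referenced as osd.N anywhere in DEVICE_LS_FILE.
# Hypothetical helper, not part of Ceph.
missing_osds() {
    awk '
        FILENAME == ARGV[1] {
            # Device table: collect every osd.N token, allowing
            # comma-separated daemon lists in a single column.
            for (i = 1; i <= NF; i++) {
                n = split($i, parts, ",")
                for (j = 1; j <= n; j++)
                    if (parts[j] ~ /^osd\.[0-9]+$/)
                        seen[substr(parts[j], 5)] = 1
            }
            next
        }
        # OSD list: report ids that never showed up in the device table.
        NF > 0 && !($1 in seen) { print $1 }
    ' "$1" "$2"
}

# Example (run on a node with the admin keyring):
#   ceph device ls > /tmp/device_ls.txt
#   ceph osd ls    > /tmp/osd_ls.txt
#   missing_osds /tmp/device_ls.txt /tmp/osd_ls.txt
```

This only reports the gap; it does not change anything in the cluster.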

OFF TOPIC - found a wonderful cheatsheet worth sharing: https://github.com/TheJJ/ceph-cheatsheet

jsterr · Feb 21, 2024

Do ceph osd tree and the PVE web UI show all the expected disks correctly? What does your crushmap look like?

dlasher · Feb 21, 2024

jsterr said:
Do ceph osd tree and the PVE web UI show all the expected disks correctly? What does your crushmap look like?

yes to both "ceph osd tree" and "pve webui"

crushmap is also correct - pmx1 for example.

Code:

host pmx1 {
    id -3        # do not change unnecessarily
    id -13 class ssd        # do not change unnecessarily
    id -2 class hdd        # do not change unnecessarily
    # weight 37.84184
    alg straw2
    hash 0    # rjenkins1
    item osd.25 weight 0.72800
    item osd.102 weight 7.27699
    item osd.104 weight 7.27699
    item osd.4 weight 0.72769
    item osd.105 weight 7.27739
    item osd.103 weight 7.27739
    item osd.101 weight 7.27739
}

jsterr · Feb 21, 2024

Please post the complete crushmap.

dlasher · Feb 21, 2024

jsterr said:
Please post the complete crushmap.

ok then.....

Not a lot to see there - each device is correctly in the list (at the top) and each node has the right drives.

jsterr · Feb 21, 2024

dlasher said:
ok then.....

Not a lot to see there - each device is correctly in the list (at the top) and each node has the right drives.

Do your disks have values in:

Code:

root@PMX4:~# cat /sys/block/nvme0n1/device/
address            dev                kato               nvme0n1/           reset_controller   subsysnqn
cntlid             device/            model              power/             serial             subsystem/
cntrltype          firmware_rev       ng0n1/             queue_count        sqsize             transport
dctype             hwmon12/           numa_node          rescan_controller  state              uevent
root@PMX4:~# cat /sys/block/nvme0n1/device/model
KIOXIA KCD81RUG960G
root@PMX4:~# cat /sys/block/nvme0n1/device/serial
9240A01ZTLW9

They might need a serial and model to be recognized. Is there any difference when you compare a disk that is shown with one that is not, using: ceph device info
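
To run that check across several disks at once, something like the following could help. It only reads the `model` and `serial` attributes under `/sys/block/<dev>/device/`, as in the session above. `sysfs_ident` is a hypothetical helper name, and the optional second argument exists only so the function can be pointed at a different sysfs root.

```shell
# sysfs_ident DEV [SYSFS_ROOT]
# Prints the model and serial that sysfs exposes for a block device,
# or "(missing)" when the attribute file is absent.
# Hypothetical helper for illustration only.
sysfs_ident() {
    dev=$1
    root=${2:-/sys}
    for attr in model serial; do
        f="$root/block/$dev/device/$attr"
        if [ -r "$f" ]; then
            # sysfs values may carry trailing padding; strip it
            printf '%s: %s\n' "$attr" "$(sed 's/[[:space:]]*$//' "$f")"
        else
            printf '%s: (missing)\n' "$attr"
        fi
    done
}

# Example: compare a disk that is listed with one that is not
#   sysfs_ident sda
#   sysfs_ident nvme0n1
```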

jsterr · Feb 21, 2024

I also asked in ceph slack, will reply if I have an answer to your question.

dlasher · Feb 21, 2024

jsterr said:
I also asked in ceph slack, will reply if I have an answer to your question.

Thank you, much appreciated. I couldn't find anything obvious wrong - I suspect something got toggled over time when upgrading from 15 to 16 to 17, and just never got turned back on.

dlasher · Feb 21, 2024

Ahh - something just occurred to me. Two problems here really:

1. Old devices are still present - some sort of cleanup will solve that.
2. NEW devices are missing :lightbulb: -- my lab pool is running CEPH on bcache - so I'm guessing CEPH can't reach down to the actual drive serial numbers if that matters. I can't see why it would, since 99.99% of CEPH is happy with bcache0-5 on each box.

First drive is SSD, second is HDD/bcache

Code:

root@pmx1:~# ceph-volume lvm list | grep -A17 "osd.4"
====== osd.4 =======

  [block]       /dev/ceph-ae17d8cc-dcc6-4d69-b5d7-36ec2e0f1db8/osd-block-cc405bb9-4603-4661-9e4b-f9a183083197

      block device              /dev/ceph-ae17d8cc-dcc6-4d69-b5d7-36ec2e0f1db8/osd-block-cc405bb9-4603-4661-9e4b-f9a183083197
      block uuid                cVycP7-oLe1-kijy-2pE4-i8gY-Gt3k-qDOvpd
      cephx lockbox secret
      cluster fsid              d22fbcbd-24c8-4433-8a63-1c92bd81da83
      cluster name              ceph
      crush device class        ssd
      encrypted                 0
      osd fsid                  cc405bb9-4603-4661-9e4b-f9a183083197
      osd id                    4
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/sda

root@pmx1:~# cat /sys/block/sda/device/model
INTEL SSDSC2BA80 (weirdly truncated - actual model is INTEL SSDSC2BA800G4)

(weirdly "serial" does not exist, in spite of being able to clearly read that data from smartctl)


root@pmx1:~# ceph-volume lvm list | grep -A17 "osd.105"
===== osd.105 ======

  [block]       /dev/ceph-9d621a59-8029-46e5-9e54-d1c92a74aaa7/osd-block-fc9b1f7d-f235-4a1b-92f6-e027577dd66d

      block device              /dev/ceph-9d621a59-8029-46e5-9e54-d1c92a74aaa7/osd-block-fc9b1f7d-f235-4a1b-92f6-e027577dd66d
      block uuid                b6tn0X-2D3f-CJwy-l9GG-3PMz-4g3e-10VT6k
      cephx lockbox secret
      cluster fsid              d22fbcbd-24c8-4433-8a63-1c92bd81da83
      cluster name              ceph
      crush device class        hdd
      encrypted                 0
      osd fsid                  fc9b1f7d-f235-4a1b-92f6-e027577dd66d
      osd id                    105
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/bcache2

/sys/block/bcache2/device doesn't exist - so there's no model/serial info there - in theory you could trace back to the underlying disk, but I'm pretty sure CEPH doesn't understand that.
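
Tracing a bcache device back to the disks underneath it can be sketched as below. This assumes the kernel exposes the member devices as entries under `/sys/block/<bcacheN>/slaves/`, which is worth verifying on the host in question. `bcache_members` is a hypothetical helper name, and the sketch says nothing about whether Ceph itself follows this link.

```shell
# bcache_members DEV [SYSFS_ROOT]
# Lists the block devices that sysfs reports underneath a bcache device.
# Assumption: members show up as entries in /sys/block/DEV/slaves/.
bcache_members() {
    dev=$1
    root=${2:-/sys}
    dir="$root/block/$dev/slaves"
    if [ ! -d "$dir" ]; then
        echo "no slaves directory for $dev" >&2
        return 1
    fi
    # One name per line, in a stable order
    ls "$dir" | LC_ALL=C sort
}

# Example:
#   bcache_members bcache2
```

Each name printed could then be checked with smartctl, which the post above notes does return the serial.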

So that might be why none of the bcache/HDD devices show up in "ceph device ls" - however, given that it hasn't been updated in years, I'm not sure that's the whole problem, especially since devices that haven't been plugged in for at least 12 months are still in the list.
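
The stale half of the problem can at least be enumerated. A rough sketch, assuming `ceph device ls` output is piped in and that a row without a `host:dev` column (like the INTEL rows that show only a percentage in the first post) means no daemon currently reports that device. `orphaned_devices` is a hypothetical name.

```shell
# orphaned_devices
# Reads `ceph device ls` output on stdin and prints the device IDs
# of rows that have no host:dev column.
# Hypothetical helper; the column layout is an assumption based on
# the listing in this thread.
orphaned_devices() {
    awk '
        NF == 0 { next }            # skip blank lines
        $1 == "DEVICE" { next }     # skip the header row
        {
            has_host = 0
            for (i = 2; i <= NF; i++)
                if ($i ~ /:/) has_host = 1
            if (!has_host) print $1
        }
    '
}

# Example:
#   ceph device ls | orphaned_devices
```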

(as an aside, ceph+bcache as an HDD/NVMe mixed platform performs MUCH better than HDD-data/NVMe-DB/WAL - I've spent the last 3 years trying lots of iterations of HDD+SSD/NVMe in my lab. Running P3700s as cache devices for HDDs is pretty slick and handles IO contention *much* better, which doesn't take much with 7200rpm drives. Only cache tiers perform better overall, but last I heard, Red Hat was obsoleting them)

* https://docs.ceph.com/en/latest/rados/operations/cache-tiering/

jsterr · Feb 21, 2024

I only heard bad things about using bcache. For example: "Cache tiering will degrade performance for most workloads. Users should use extreme caution before using this feature."

Also

"Cache tiering has been deprecated in the Reef release as it has lacked a maintainer for a very long time. This does not mean it will be certainly removed, but we may choose to remove it without much further notice."

dlasher · Feb 22, 2024

jsterr said:
"Cache tiering has been deprecated in the Reef release as it has lacked a maintainer for a very long time. This does not mean it will be certainly removed, but we may choose to remove it without much further notice."

Yep - otherwise it would be my first choice. An NVMe pool with an HDD pool behind it is the best of both worlds.

Cache Tiering, in CEPH, is for specific workloads, and if the first tier isn't big enough, not tuned right, etc, it can be worse than no tiers at all. I've experimented with it, really liked it, but the fact it's going away kept me from making it my first choice.

jsterr said:
I only heard bad things about using bcache. For example: "Cache tiering will degrade performance for most workloads. Users should use extreme caution before using this feature."

Those two things aren't really related. BCACHE is a block-layer cache in the Linux kernel (it's been there since 3.10, in 2013) that lets you use faster devices (SSD/NVMe) as a cache for slower ones. CEPH purrs happily along on top of bcache with no complaints, and I get the advantages of NVMe latency for writes/acks/etc., which is great for backfills/remaps/sync, plus NVMe-level read caching for recent blocks.

SEE: https://en.wikipedia.org/wiki/Bcache
ALSO : https://forum.proxmox.com/threads/pve-gui-doesnt-recognize-kernel-bcache-device.109761/post-559672
ALSO: https://forum.proxmox.com/search/6686006/?q=bcache&o=date

jsterr · Feb 22, 2024

Ah I see, thanks for clarifying
