Ceph OSD backfill - some OSDs don't appear to be balanced?

Greetings!

We have a 3-node Proxmox cluster with 7 disks each.

All disks are 1.92 TB Intel SSD DC S4510 drives, all running the same firmware.

Each node is connected to a 25G MLAG pair via a Mellanox ConnectX-4.

Recently Ceph triggered the following alert:

Code:
# ceph health
HEALTH_WARN 1 backfillfull osd(s); 4 pool(s) backfillfull

For whatever reason the OSDs don't appear to be particularly well balanced, even though it looks like we have some room to spare in the cluster. I did not set up the cluster originally, so I'm not sure whether everything is at defaults or some settings have been adjusted.

Here is some output that should be helpful:

Code:
# pveversion  -v
proxmox-ve: 7.4-1 (running kernel: 5.15.126-1-pve)
pve-manager: 7.4-17 (running version: 7.4-17/513c62be)
pve-kernel-5.15: 7.4-7
pve-kernel-5.15.126-1-pve: 5.15.126-1
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph: 17.2.5-pve1
ceph-fuse: 17.2.5-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 10.2-ubuntu1~focal1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.3-1
proxmox-backup-file-restore: 2.4.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-5
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

Code:
# ceph version
ceph version 17.2.5 (e04241aa9b639588fa6c864845287d2824cb6b55) quincy (stable)

Code:
# ceph status
  cluster:
    id:     b7565e52-6907-49f9-85b9-526c3ce94676
    health: HEALTH_WARN
            1 backfillfull osd(s)
            4 pool(s) backfillfull

  services:
    mon: 3 daemons, quorum hv01,hv02,hv03 (age 3M)
    mgr: hv01(active, since 3M), standbys: hv02, hv03
    mds: 1/1 daemons up, 2 standby
    osd: 20 osds: 20 up (since 3M), 20 in (since 8M)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 577 pgs
    objects: 2.16M objects, 8.2 TiB
    usage:   24 TiB used, 11 TiB / 35 TiB avail
    pgs:     577 active+clean

  io:
    client:   450 MiB/s rd, 12 MiB/s wr, 2.02k op/s rd, 321 op/s wr

Code:
# ceph health detail
HEALTH_WARN 1 backfillfull osd(s); 4 pool(s) backfillfull
[WRN] OSD_BACKFILLFULL: 1 backfillfull osd(s)
    osd.5 is backfill full
[WRN] POOL_BACKFILLFULL: 4 pool(s) backfillfull
    pool '.mgr' is backfillfull
    pool 'cephfs_data' is backfillfull
    pool 'cephfs_metadata' is backfillfull
    pool 'ceph-vm' is backfillfull


Code:
# ceph df detail
--- RAW STORAGE ---
CLASS    SIZE   AVAIL    USED  RAW USED  %RAW USED
ssd    35 TiB  11 TiB  24 TiB    24 TiB      69.81
TOTAL  35 TiB  11 TiB  24 TiB    24 TiB      69.81

--- POOLS ---
POOL             ID  PGS   STORED   (DATA)   (OMAP)  OBJECTS     USED   (DATA)   (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
.mgr              1    1  264 MiB  264 MiB      0 B       67  792 MiB  792 MiB      0 B   0.05    518 GiB            N/A          N/A    N/A         0 B          0 B
cephfs_data       3   32   75 GiB   75 GiB      0 B   19.33k  226 GiB  226 GiB      0 B  12.73    518 GiB            N/A          N/A    N/A         0 B          0 B
cephfs_metadata   4   32   36 MiB   36 MiB  5.2 KiB       31  108 MiB  108 MiB   15 KiB      0    518 GiB            N/A          N/A    N/A         0 B          0 B
ceph-vm           6  512  8.0 TiB  8.0 TiB  109 KiB    2.14M   24 TiB   24 TiB  326 KiB  94.08    518 GiB            N/A          N/A    N/A         0 B          0 B

Code:
# ceph osd df tree
ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
-1         34.93195         -   35 TiB   24 TiB   24 TiB  4.3 MiB   70 GiB   11 TiB  69.81  1.00    -          root default
-3         12.22618         -   12 TiB  8.1 TiB  8.1 TiB  1.5 MiB   24 GiB  4.1 TiB  66.49  0.95    -              host hv01
 0    ssd   1.74660   1.00000  1.7 TiB  1.1 TiB  1.1 TiB  214 KiB  3.2 GiB  662 GiB  62.98  0.90   77      up          osd.0
 3    ssd   1.74660   1.00000  1.7 TiB  1.2 TiB  1.2 TiB  213 KiB  3.3 GiB  570 GiB  68.11  0.98   85      up          osd.3
 6    ssd   1.74660   1.00000  1.7 TiB  1.1 TiB  1.1 TiB  201 KiB  3.5 GiB  615 GiB  65.59  0.94   81      up          osd.6
 9    ssd   1.74660   1.00000  1.7 TiB  1.3 TiB  1.3 TiB  230 KiB  3.6 GiB  474 GiB  73.49  1.05   94      up          osd.9
15    ssd   1.74660   1.00000  1.7 TiB  1.2 TiB  1.2 TiB  232 KiB  3.6 GiB  551 GiB  69.19  0.99   89      up          osd.15
16    ssd   1.74660   1.00000  1.7 TiB  1.1 TiB  1.1 TiB  204 KiB  3.5 GiB  633 GiB  64.59  0.93   76      up          osd.16
17    ssd   1.74660   1.00000  1.7 TiB  1.1 TiB  1.1 TiB  201 KiB  3.3 GiB  688 GiB  61.51  0.88   75      up          osd.17
-5         12.22618         -   12 TiB  8.1 TiB  8.1 TiB  1.4 MiB   24 GiB  4.1 TiB  66.49  0.95    -              host hv02
 1    ssd   1.74660   1.00000  1.7 TiB  1.2 TiB  1.2 TiB  221 KiB  3.5 GiB  557 GiB  68.84  0.99   84      up          osd.1
 4    ssd   1.74660   1.00000  1.7 TiB  1.1 TiB  1.1 TiB  204 KiB  3.4 GiB  642 GiB  64.09  0.92   83      up          osd.4
 7    ssd   1.74660   1.00000  1.7 TiB  1.2 TiB  1.2 TiB  223 KiB  3.4 GiB  569 GiB  68.17  0.98   86      up          osd.7
10    ssd   1.74660   1.00000  1.7 TiB  1.3 TiB  1.3 TiB  224 KiB  3.7 GiB  490 GiB  72.60  1.04   86      up          osd.10
14    ssd   1.74660   1.00000  1.7 TiB  1.2 TiB  1.2 TiB  211 KiB  3.1 GiB  592 GiB  66.91  0.96   79      up          osd.14
18    ssd   1.74660   1.00000  1.7 TiB  1.1 TiB  1.1 TiB  201 KiB  3.3 GiB  647 GiB  63.83  0.91   81      up          osd.18
19    ssd   1.74660   1.00000  1.7 TiB  1.1 TiB  1.1 TiB  199 KiB  3.3 GiB  697 GiB  61.00  0.87   78      up          osd.19
-7         10.47958         -   10 TiB  8.1 TiB  8.1 TiB  1.4 MiB   22 GiB  2.4 TiB  77.56  1.11    -              host hv03
 2    ssd   1.74660   1.00000  1.7 TiB  1.3 TiB  1.3 TiB  243 KiB  3.5 GiB  433 GiB  75.82  1.09   93      up          osd.2
 5    ssd   1.74660   1.00000  1.7 TiB  1.6 TiB  1.6 TiB  285 KiB  4.1 GiB  167 GiB  90.66  1.30  109      up          osd.5
 8    ssd   1.74660   1.00000  1.7 TiB  1.4 TiB  1.4 TiB  231 KiB  3.7 GiB  393 GiB  78.04  1.12   95      up          osd.8
11    ssd   1.74660   1.00000  1.7 TiB  1.3 TiB  1.3 TiB  245 KiB  3.7 GiB  413 GiB  76.91  1.10   96      up          osd.11
12    ssd   1.74660   1.00000  1.7 TiB  1.3 TiB  1.3 TiB  234 KiB  3.5 GiB  484 GiB  72.92  1.04   93      up          osd.12
13    ssd   1.74660   1.00000  1.7 TiB  1.2 TiB  1.2 TiB  238 KiB  3.5 GiB  519 GiB  71.01  1.02   91      up          osd.13
                        TOTAL   35 TiB   24 TiB   24 TiB  4.4 MiB   70 GiB   11 TiB  69.81

Code:
# ceph osd dump | grep ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85



There are quite a few knobs to turn, but it isn't apparent which is best (if a single best option even exists). Any suggestions?

Thanks!
 
Your pool is too full.

But wait, you say, I'm only at 70% utilization! Well, true, but that's not the limiting factor. Node hv03 has less capacity than the other two, limiting what you can actually use to roughly 30 TB raw / 10 TB usable, making your actual usage ratio closer to 80%.

Add at least 2 TB of OSD space to hv03, and you'll need more than that pretty quickly - even 70% is pretty full.
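
Rough back-of-the-envelope math (approximate; raw sizes taken from the ceph osd df tree output above):

Code:
# hv01 and hv02 have ~12.2 TiB raw each, hv03 only ~10.5 TiB (one OSD short).
# With replica 3 and one copy per host, usable data is capped by the smallest host.
echo "scale=1; 8.2 * 100 / 10.5" | bc    # -> 78.0, i.e. ~80% of the effective capacity is used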
 

Thanks for helping point something out. Looks like a disk might be missing, warranting some additional investigation. Thank you for that!
 
Now I'm a bit more puzzled. It looks like there are 7 block devices for Ceph but only 6 OSDs? Doing a little homework, I think it's been 6 for at least a year.

Code:
# lsblk|grep ceph
└─ceph--aaa72d5f--9b40--4e0e--b667--01a8a24ef730-osd--block--e91e0791--8a8d--42f2--8ac7--993499aa2c3a 253:2    0   1.7T  0 lvm
└─ceph--b76e1f6c--8cd8--4f19--808c--439dc408cb42-osd--block--419063ef--aed7--4d1c--8643--8c316b667694 253:0    0   1.7T  0 lvm
└─ceph--3bf5362e--016c--4ee9--9846--c438bd1830f2-osd--block--04281dd1--37ff--47eb--83c3--8b23b7934b98 253:1    0   1.7T  0 lvm
└─ceph--381df793--da0e--4cc4--bf83--1694f9260462-osd--block--dbcba1f6--e1d0--4e4d--81fa--44b5135ec81e 253:7    0   1.7T  0 lvm
└─ceph--cc78fb4f--900d--41e2--af9a--212bcd6a55c8-osd--block--10559d81--9c05--43ea--87cd--f7fdf236e72c 253:3    0   1.7T  0 lvm
└─ceph--af40abea--e7d7--4b8b--b86a--059928525d11-osd--block--a61a87f2--31e7--485b--bd06--16ba93fc3a46 253:4    0   1.7T  0 lvm
└─ceph--f2a93c73--36ea--471f--ab59--1b2779ac9b7c-osd--block--f321566f--3b68--404a--b1b0--957e43d716e4 253:8    0   1.7T  0 lvm

Code:
# lsblk|grep ceph |wc -l
7

Keep in mind the output below shows 8 drives - the 8th one is where the OS is installed. Solidigm tooling says everything is just dandy.

Code:
# sst show -ssd|grep Status
DeviceStatus : Healthy
DeviceStatus : Healthy
DeviceStatus : Healthy
DeviceStatus : Healthy
DeviceStatus : Healthy
DeviceStatus : Healthy
DeviceStatus : Healthy
DeviceStatus : Healthy
 
You have a previously created OSD that is no longer part of the cluster. You'll need to identify the VG that isn't participating, blow it away, and recreate the OSD.
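
One quick way to spot it (just a sketch - LVs backing active OSDs show up as open in lvs, a leftover one won't):

Code:
# the "o" in the lv_attr field means the LV is open/in use; the stray osd-block LV will lack it
lvs -o vg_name,lv_name,lv_attr,devices | grep osd-block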

Well - this gets a bit more interesting now!

Code:
# lvs
  LV                                             VG                                        Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  osd-block-dbcba1f6-e1d0-4e4d-81fa-44b5135ec81e ceph-381df793-da0e-4cc4-bf83-1694f9260462 -wi-ao----  <1.75t
  osd-block-04281dd1-37ff-47eb-83c3-8b23b7934b98 ceph-3bf5362e-016c-4ee9-9846-c438bd1830f2 -wi-ao----  <1.75t
  osd-block-e91e0791-8a8d-42f2-8ac7-993499aa2c3a ceph-aaa72d5f-9b40-4e0e-b667-01a8a24ef730 -wi-ao----  <1.75t
  osd-block-a61a87f2-31e7-485b-bd06-16ba93fc3a46 ceph-af40abea-e7d7-4b8b-b86a-059928525d11 -wi-a-----  <1.75t  <-- device is not open
  osd-block-419063ef-aed7-4d1c-8643-8c316b667694 ceph-b76e1f6c-8cd8-4f19-808c-439dc408cb42 -wi-ao----  <1.75t
  osd-block-10559d81-9c05-43ea-87cd-f7fdf236e72c ceph-cc78fb4f-900d-41e2-af9a-212bcd6a55c8 -wi-ao----  <1.75t
  osd-block-f321566f-3b68-404a-b1b0-957e43d716e4 ceph-f2a93c73-36ea-471f-ab59-1b2779ac9b7c -wi-ao----  <1.75t
  data                                           pve                                       twi-a-tz-- 320.09g             0.00   0.52
  root                                           pve                                       -wi-ao----  96.00g
  swap                                           pve                                       -wi-ao----   8.00g


Dumping the full output of ceph-volume lvm list

Code:
# ceph-volume lvm list


====== osd.11 ======

  [block]       /dev/ceph-f2a93c73-36ea-471f-ab59-1b2779ac9b7c/osd-block-f321566f-3b68-404a-b1b0-957e43d716e4

      block device              /dev/ceph-f2a93c73-36ea-471f-ab59-1b2779ac9b7c/osd-block-f321566f-3b68-404a-b1b0-957e43d716e4
      block uuid                CKULq6-2m13-UdbC-qj2U-I7oX-4tPn-5hpZUi
      cephx lockbox secret
      cluster fsid              b7565e52-6907-49f9-85b9-526c3ce94676
      cluster name              ceph
      crush device class
      encrypted                 0
      osd fsid                  f321566f-3b68-404a-b1b0-957e43d716e4
      osd id                    11
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/sdh

====== osd.12 ======

  [block]       /dev/ceph-cc78fb4f-900d-41e2-af9a-212bcd6a55c8/osd-block-10559d81-9c05-43ea-87cd-f7fdf236e72c

      block device              /dev/ceph-cc78fb4f-900d-41e2-af9a-212bcd6a55c8/osd-block-10559d81-9c05-43ea-87cd-f7fdf236e72c
      block uuid                dV5Njm-2fNM-JE6f-E0cl-etVT-Vquw-Jr9EOP
      cephx lockbox secret
      cluster fsid              b7565e52-6907-49f9-85b9-526c3ce94676
      cluster name              ceph
      crush device class
      encrypted                 0
      osd fsid                  10559d81-9c05-43ea-87cd-f7fdf236e72c
      osd id                    12
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/sdf

====== osd.13 ======

  [block]       /dev/ceph-381df793-da0e-4cc4-bf83-1694f9260462/osd-block-dbcba1f6-e1d0-4e4d-81fa-44b5135ec81e

      block device              /dev/ceph-381df793-da0e-4cc4-bf83-1694f9260462/osd-block-dbcba1f6-e1d0-4e4d-81fa-44b5135ec81e
      block uuid                sEIgQx-le7B-O2Jd-tGgb-UmXz-K3G4-hNIw0v
      cephx lockbox secret
      cluster fsid              b7565e52-6907-49f9-85b9-526c3ce94676
      cluster name              ceph
      crush device class
      encrypted                 0
      osd fsid                  dbcba1f6-e1d0-4e4d-81fa-44b5135ec81e
      osd id                    13
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/sde

====== osd.2 =======

  [block]       /dev/ceph-aaa72d5f-9b40-4e0e-b667-01a8a24ef730/osd-block-e91e0791-8a8d-42f2-8ac7-993499aa2c3a

      block device              /dev/ceph-aaa72d5f-9b40-4e0e-b667-01a8a24ef730/osd-block-e91e0791-8a8d-42f2-8ac7-993499aa2c3a
      block uuid                FfP7zH-ePJq-0Jg0-XgdS-yloK-FpnX-UZquFc
      cephx lockbox secret
      cluster fsid              b7565e52-6907-49f9-85b9-526c3ce94676
      cluster name              ceph
      crush device class
      encrypted                 0
      osd fsid                  e91e0791-8a8d-42f2-8ac7-993499aa2c3a
      osd id                    2
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/sdb

====== osd.5 =======

  [block]       /dev/ceph-b76e1f6c-8cd8-4f19-808c-439dc408cb42/osd-block-419063ef-aed7-4d1c-8643-8c316b667694

      block device              /dev/ceph-b76e1f6c-8cd8-4f19-808c-439dc408cb42/osd-block-419063ef-aed7-4d1c-8643-8c316b667694
      block uuid                YxcIfB-aeDf-KHzt-TJv5-drDm-viDI-c0pYoy
      cephx lockbox secret
      cluster fsid              b7565e52-6907-49f9-85b9-526c3ce94676
      cluster name              ceph
      crush device class
      encrypted                 0
      osd fsid                  419063ef-aed7-4d1c-8643-8c316b667694
      osd id                    5
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/sdc

====== osd.8 =======

  [block]       /dev/ceph-3bf5362e-016c-4ee9-9846-c438bd1830f2/osd-block-04281dd1-37ff-47eb-83c3-8b23b7934b98

      block device              /dev/ceph-3bf5362e-016c-4ee9-9846-c438bd1830f2/osd-block-04281dd1-37ff-47eb-83c3-8b23b7934b98
      block uuid                jR2tN5-srcI-EOik-WUvJ-zJE6-Ofde-d0inyb
      cephx lockbox secret
      cluster fsid              b7565e52-6907-49f9-85b9-526c3ce94676
      cluster name              ceph
      crush device class
      encrypted                 0
      osd fsid                  04281dd1-37ff-47eb-83c3-8b23b7934b98
      osd id                    8
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/sdd

  [block]       /dev/ceph-af40abea-e7d7-4b8b-b86a-059928525d11/osd-block-a61a87f2-31e7-485b-bd06-16ba93fc3a46

      block device              /dev/ceph-af40abea-e7d7-4b8b-b86a-059928525d11/osd-block-a61a87f2-31e7-485b-bd06-16ba93fc3a46
      block uuid                Umsb2B-KrcG-BPln-U22S-Obkw-ipEE-HZHmqg
      cephx lockbox secret
      cluster fsid              fa019436-5a9a-4535-ade6-2f6dfc1e2b29
      cluster name              ceph
      crush device class        None
      encrypted                 0
      osd fsid                  a61a87f2-31e7-485b-bd06-16ba93fc3a46
      osd id                    8
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/sdg

The osd.8 entry is certainly different from all the rest - it's the only one listing two block devices, and the second one (on /dev/sdg) shows a different cluster fsid.

I haven't the foggiest idea how it got this way (probably before my time).

How would something like THIS get fixed? Is this an issue? I'm quite familiar with LVM but still learning when it comes to Ceph.

Thanks!
 
It's definitely not supposed to be like that, and it's certainly not a good thing. But it doesn't seem to have been a problem so far?

I would first recommend that you deal with OSD 5 so that Ceph is healthy again. Then make sure that you free up storage space or add SSDs. After that, I would recommend removing and wiping the affected physical disk.
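
One way to nudge data off the full OSD in the meantime, for example (the 0.85 is just an illustrative value, and with one replica per host the data can only shift to the other hv03 OSDs):

Code:
ceph osd reweight osd.5 0.85     # temporarily lower the override weight of the backfillfull OSD
watch -n 30 ceph osd df tree     # keep an eye on how PGs and usage shift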
 

Hasn't been an issue (yet), but like you mentioned it should be corrected. If I juggle some space around, probably by re-weighting some of the disks... hopefully that would be enough to balance things out a bit?

There are no more slots for additional disks.
 
You cannot reduce the fill level here with balancing. If an OSD fails on hv03 you are immediately at the near full ratio. With replica 3, the data that is currently on hv03 has to stay on hv03 - Ceph will not redistribute it to the other nodes because it cannot. The current fill level is spread across the 6 OSDs on that host; if one fails, exactly the same amount of data is spread across 5. Your Ceph will inevitably run near full unless you can reduce the amount of data or add another disk. And if you don't get OSD 5 down from its current level beforehand, it is very likely that Ceph will hit the full ratio and then switch to read-only.

You are currently at a critical level where you have to think very carefully about what you are doing in order not to drive the cluster into failure.
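
Rough numbers for that failure scenario (approximate, from the df tree above):

Code:
# hv03's 6 OSDs average ~77.5% used; losing one spreads its data over the remaining 5
echo "scale=1; 77.5 * 6 / 5" | bc    # -> 93.0, past backfillfull (90%) and close to full (95%)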
 

I went and took a look at what I had running on the cluster. I found a couple of unreferenced disks I could safely clean up and remove, and also cleaned up a test VM I wasn't really using (I can rebuild it later). I'm still trying to see what else I can clean up, and fstrim'ed a few things... Looks a tiny bit better:


Code:
# ceph osd df tree
ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
-1         34.93195         -   35 TiB   25 TiB   25 TiB  4.1 MiB   68 GiB  9.9 TiB  71.58  1.00    -          root default
-3         12.22618         -   12 TiB  8.3 TiB  8.3 TiB  1.4 MiB   23 GiB  3.9 TiB  68.17  0.95    -              host hv01
 0    ssd   1.74660   1.00000  1.7 TiB  1.1 TiB  1.1 TiB  252 KiB  3.2 GiB  634 GiB  64.56  0.90   77      up          osd.0
 3    ssd   1.74660   1.00000  1.7 TiB  1.2 TiB  1.2 TiB  207 KiB  3.4 GiB  540 GiB  69.82  0.98   85      up          osd.3
 6    ssd   1.74660   1.00000  1.7 TiB  1.2 TiB  1.2 TiB  241 KiB  3.3 GiB  585 GiB  67.27  0.94   81      up          osd.6
 9    ssd   1.74660   1.00000  1.7 TiB  1.3 TiB  1.3 TiB  117 KiB  3.2 GiB  441 GiB  75.35  1.05   94      up          osd.9
15    ssd   1.74660   1.00000  1.7 TiB  1.2 TiB  1.2 TiB  285 KiB  3.3 GiB  520 GiB  70.90  0.99   89      up          osd.15
16    ssd   1.74660   1.00000  1.7 TiB  1.2 TiB  1.2 TiB  100 KiB  3.3 GiB  604 GiB  66.21  0.93   76      up          osd.16
17    ssd   1.74660   1.00000  1.7 TiB  1.1 TiB  1.1 TiB  201 KiB  3.4 GiB  660 GiB  63.09  0.88   75      up          osd.17
-5         12.22618         -   12 TiB  8.3 TiB  8.3 TiB  1.4 MiB   24 GiB  3.9 TiB  68.18  0.95    -              host hv02
 1    ssd   1.74660   1.00000  1.7 TiB  1.2 TiB  1.2 TiB  218 KiB  3.6 GiB  525 GiB  70.62  0.99   84      up          osd.1
 4    ssd   1.74660   1.00000  1.7 TiB  1.1 TiB  1.1 TiB  105 KiB  3.1 GiB  614 GiB  65.66  0.92   83      up          osd.4
 7    ssd   1.74660   1.00000  1.7 TiB  1.2 TiB  1.2 TiB  219 KiB  3.4 GiB  538 GiB  69.90  0.98   86      up          osd.7
10    ssd   1.74660   1.00000  1.7 TiB  1.3 TiB  1.3 TiB  267 KiB  3.6 GiB  457 GiB  74.45  1.04   86      up          osd.10
14    ssd   1.74660   1.00000  1.7 TiB  1.2 TiB  1.2 TiB  210 KiB  3.4 GiB  560 GiB  68.66  0.96   79      up          osd.14
18    ssd   1.74660   1.00000  1.7 TiB  1.1 TiB  1.1 TiB  200 KiB  3.3 GiB  618 GiB  65.42  0.91   81      up          osd.18
19    ssd   1.74660   1.00000  1.7 TiB  1.1 TiB  1.1 TiB  198 KiB  3.3 GiB  670 GiB  62.53  0.87   78      up          osd.19
-7         10.47958         -   10 TiB  8.3 TiB  8.3 TiB  1.3 MiB   21 GiB  2.1 TiB  79.52  1.11    -              host hv03
 2    ssd   1.74660   1.00000  1.7 TiB  1.4 TiB  1.4 TiB  304 KiB  3.6 GiB  315 GiB  82.39  1.15   98      up          osd.2
 5    ssd   1.74660   0.85004  1.7 TiB  1.4 TiB  1.4 TiB  285 KiB  4.1 GiB  342 GiB  80.87  1.13   95      up          osd.5
 8    ssd   1.74660   1.00000  1.7 TiB  1.4 TiB  1.4 TiB  120 KiB  3.7 GiB  309 GiB  82.75  1.16   98      up          osd.8
11    ssd   1.74660   1.00000  1.7 TiB  1.4 TiB  1.4 TiB  297 KiB  3.6 GiB  345 GiB  80.70  1.13   98      up          osd.11
12    ssd   1.74660   1.00000  1.7 TiB  1.3 TiB  1.3 TiB  126 KiB  3.2 GiB  435 GiB  75.66  1.06   94      up          osd.12
13    ssd   1.74660   1.00000  1.7 TiB  1.3 TiB  1.3 TiB  238 KiB  2.9 GiB  452 GiB  74.74  1.04   94      up          osd.13
                        TOTAL   35 TiB   25 TiB   25 TiB  4.1 MiB   68 GiB  9.9 TiB  71.58
MIN/MAX VAR: 0.87/1.16  STDDEV: 6.30

I still don't believe it's enough to safely blow away and recreate osd.5.

Assuming ~82% in use: if that OSD is blown away, roughly 82/5 = 16.4 percentage points of data would need to go to each of the other disks on hv03. That would put some disks at about 98%, which is bad. I think this math is correct?
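
A quick sanity check of that math against the current %USE values (just arithmetic):

Code:
echo "scale=2; 80.87 / 5" | bc           # -> 16.17 extra points for each remaining hv03 OSD
echo "scale=2; 82.75 + 80.87 / 5" | bc   # -> 98.92, so the fullest one (osd.8) would land near 99%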

I'll see what else I can hopefully trim down and remove...
 
You have two osd.8s. Not sure how you managed it, but here is how to recover:

vgremove ceph-af40abea-e7d7-4b8b-b86a-059928525d11
ceph-volume lvm zap /dev/sdg
pveceph createosd /dev/sdg

It may be necessary to destroy osd.8 to do this, in which case you'd need to perform the above for /dev/sdd as well. Make sure you set norebalance if you do, until the OSDs are back and available.
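
For reference, the flag is set and cleared with:

Code:
ceph osd set norebalance      # keep PGs from being shuffled around while you work
# ... destroy / recreate the OSD(s) ...
ceph osd unset norebalance    # let things settle again afterwards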
 

This is what I was thinking I would have to do - but wouldn't smoking osd.8 cause the rest of the volumes on hv03 to go RO since they would be above 96%?

I don't know how we got here, and that's a question I've been asking myself all morning. :D

Currently backing up some stuff so I can delete, fix, and then restore. Thankfully this is a lab environment.

Really appreciate the feedback!
 
This is what I was thinking I would have to do - but wouldn't smoking osd.8 cause the rest of the volumes on hv03 to go RO since they would be above 96%?
Maybe. You have two options if this happens:
1. Schedule downtime.
2. Reset the minimum replica count to 1 temporarily. This is not really recommended, but depending on your risk tolerance it could be an option. And before you ask:
ceph osd pool set data min_size 1

Edit: obviously this needs norebalance set as well, or you'd still have the issue with full OSDs.
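
Note that 'data' there is a generic pool name - on this cluster the VM pool is called ceph-vm, so it would presumably be something like:

Code:
ceph osd pool set ceph-vm min_size 1    # temporarily, only while doing the OSD surgery
ceph osd pool set ceph-vm min_size 2    # restore afterwards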
 

Good point about the "norebalance". To set it back afterwards, I assume I just flip the 1 back to a 3 in that same ceph osd pool set command?

It's a lab system, so it shouldn't take long to set the min_size, clean up the OSDs (I will blow away osd.8), then format BOTH underlying disks, re-add the two OSDs, and get the other one back in service.

I feel like the level of risk is acceptable.

Just waiting on backups, which might be a while.
 
To set it back afterwards, I assume I just flip the 1 back to a 3 in that same ceph osd pool set command?
min_size is 2 by default on a 3-replica crush rule. If it were 3, you wouldn't be able to reboot a node without the pool going read-only.

In your case, I still assume you'd be able to wipe the "phantom" osd.8 without having to lose the OSD itself, so you shouldn't be at any risk - but I have never actually seen this, so I'd be curious what happens when you try.
 
Ah, that's right - not sure where I got three stuck in my head. Probably thinking of "Size".

Code:
#  pveceph pool ls
┌─────────────────┬──────┬──────────┬────────┬─────────────┬────────────────┬───────────────────┬──────────────────────────┬───────────────────────────┬─────────────────┬──────────────────────┬────────────────┐
│ Name            │ Size │ Min Size │ PG Num │ min. PG Num │ Optimal PG Num │ PG Autoscale Mode │ PG Autoscale Target Size │ PG Autoscale Target Ratio │ Crush Rule Name │               %-Used │           Used │
╞═════════════════╪══════╪══════════╪════════╪═════════════╪════════════════╪═══════════════════╪══════════════════════════╪═══════════════════════════╪═════════════════╪══════════════════════╪════════════════╡
│ .mgr            │    3 │        2 │      1 │           1 │              1 │ on                │                          │                           │ replicated_rule │ 0.000175372682861052 │      830877696 │
├─────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────┼─────────────────┼──────────────────────┼────────────────┤
│ ceph-vm         │    3 │        2 │    512 │             │            512 │ on                │                          │                           │ replicated_rule │    0.851437747478485 │ 27148353149766 │
├─────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────┼─────────────────┼──────────────────────┼────────────────┤
│ cephfs_data     │    3 │        2 │     32 │             │             32 │ on                │                          │                           │ replicated_rule │   0.0488220863044262 │   243138416640 │
├─────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────┼─────────────────┼──────────────────────┼────────────────┤
│ cephfs_metadata │    3 │        2 │     32 │          16 │             16 │ on                │                          │                           │ replicated_rule │ 2.38319698837586e-05 │      112893583 │
└─────────────────┴──────┴──────────┴────────┴─────────────┴────────────────┴───────────────────┴──────────────────────────┴───────────────────────────┴─────────────────┴──────────────────────┴────────────────┘


OSD.8 is a legit OSD, I just don't know how the heck it ended up with two block devices. I'm assuming I'll need to remove the LV before actually removing the VG - even though the LV isn't online (the one ending in a46), vgremove will complain about there being an LV in the VG before blowing it up.

I wonder if I'll need to restart that OSD after I've cleaned up the "rogue" LV/VG, before I re-add the missing OSD for /dev/sdg.

Kinda thinking something like this (when backups are done) - rough command sketch after the list:

1. Set norebalance
2. Set min_size = 1
3. lvremove osd-block-a61a87f2-31e7-485b-bd06-16ba93fc3a46 (I know the command isn't exact)
4. vgremove ceph-af40abea-e7d7-4b8b-b86a-059928525d11
5. ceph-volume lvm zap /dev/sdg
6. pveceph createosd /dev/sdg
7. Set min_size = 2
8. Unset norebalance
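
Roughly, as commands (untested; the VG/LV names are taken from the lvs output above, and min_size is only touched for the ceph-vm pool here):

Code:
ceph osd set norebalance
ceph osd pool set ceph-vm min_size 1
lvremove ceph-af40abea-e7d7-4b8b-b86a-059928525d11/osd-block-a61a87f2-31e7-485b-bd06-16ba93fc3a46
vgremove ceph-af40abea-e7d7-4b8b-b86a-059928525d11
ceph-volume lvm zap /dev/sdg
pveceph createosd /dev/sdg
ceph osd pool set ceph-vm min_size 2
ceph osd unset norebalance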
 
After zapping, you may want to make sure that the LVs, VGs, etc. have actually disappeared, the disk is blank and there are no leftovers in the system. I also absolutely cannot understand how something like this can even happen.
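
Something along these lines should be enough to confirm nothing is left on the device (read-only checks only):

Code:
lvs | grep af40abea    # the stray LV/VG should be gone
vgs | grep af40abea
pvs | grep sdg         # sdg should no longer show up as an LVM PV
wipefs /dev/sdg        # without -a this only lists remaining signatures, it does not erase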
 