Ceph unable to sustain 1 OSD out with size=4

alyarb

Well-Known Member
We have 15 OSD hosts and 22 OSDs. The servers physically have 2 drive bays. Of course the OSDs are not distributed perfectly evenly: some servers have 1 OSD and some have 2, but we are always adding drives to the system as time and availability allow.

OSD utilization according to the ceph dashboard is between 45-65% depending on whether the OSD is alone on a host or colocated with another.

Last week, a server with 2 OSDs had a problem, resulting in both of its OSDs dropping at once. With size=3 and min_size=2, a number of VMs were essentially frozen. After moving the physical drives to other OSD hosts and running ceph-volume lvm activate --all, things got back to normal after a few minutes, but the drive-to-host distribution remains somewhat uneven.

That evening, we increased our replication to size=4, min_size=2. We are also in the process of upgrading from PVE 7.2 to 7.3 and Ceph 17.2.4 to 17.2.5.
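For anyone reading along, the CLI equivalent of that change is roughly the following (the pool name is a placeholder; in practice it was applied to each of our pools):

Code:
# raise replication on an existing pool
ceph osd pool set <pool> size 4
ceph osd pool set <pool> min_size 2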

Even with the unbalanced OSD distribution, we now have size=4, which should be overkill for what is not a terribly large cluster. The expectation is that we can update and reboot any host without affecting RBD clients in any way. If we can get there, I'm happy with size=4.

We chose a host with a single OSD to reboot first. This was also a MON host, but we have 7 monitors, all of which were in at the time. Our average IOPS during the day are 3k-10k, settling around 2k at night. The noout flag was set prior to rebooting the host.
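Concretely, that was just the usual flag toggle around the reboot (the GUI has equivalent buttons):

Code:
# before the planned reboot: prevent CRUSH from marking the OSD out
ceph osd set noout
# ...reboot the host, wait for the OSD to come back up...
ceph osd unset noout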

During the reboot, 1 OSD went down, causing roughly 5% of objects to become degraded, as expected. Yet I/O across the cluster still slowed to distressingly low values, with IOPS showing under 100. A number of Windows VMs blue-screened and required a reset, even after the reboot completed and the downed OSD was brought back.

Again, this is a cluster with size=4, min_size=2 and only 1 OSD down, behaving as if it were at size=2.

Things should remain perfectly stable and functional with only 1 OSD down, and my aim is to achieve the same tolerance for 2 OSDs down.

Someone please tell me what I am missing and doing wrong.
 
This does sound like something is wrong. Can you please post the output of the following commands?
  • pveversion -v
  • ceph -s
  • ceph osd df tree
  • ceph device ls
  • pveceph pool ls --noborder (make sure the window is wide enough, as it will not print anything beyond the width of the terminal)
 
Hello,

I'm not a Ceph expert yet, but I can tell you what we have found out over the last 2 months with our Ceph 17.2.5 cluster (7 servers, 30 OSDs).
When we reboot a node, we don't just set the "noout" flag, but also the "norebalance" flag.

In Ceph 17.2.x, the mclock scheduler is the new default scheduler for rebalancing/recovery.
When Ceph 17.2.5 starts a rebalancing/recovery process, client reads/writes drop a lot,
and Windows VMs start getting bluescreens.
This is somewhat of a known problem, and it has already hit us hard twice.
There is also a patch in the pipeline to address that problem: https://github.com/ceph/ceph/pull/48226
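You can check which scheduler and profile an OSD is actually using with something like this:

Code:
# shows the active op queue scheduler (wpq or mclock_scheduler) and the mclock profile
ceph config show-with-defaults osd.0 | grep -E 'osd_op_queue|osd_mclock_profile'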

Maybe your problem with the 2 downed OSDs was actually a recovery/rebalancing problem as well?

For more info/ideas, someone else will have to answer.

best regards
Benedikt
 
Code:
root@virtual41:~# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.53-1-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-helper: 7.2-12
pve-kernel-5.15: 7.2-10
pve-kernel-5.4: 6.4-18
pve-kernel-5.15.53-1-pve: 5.15.53-1
pve-kernel-5.15.39-1-pve: 5.15.39-1
pve-kernel-5.4.189-2-pve: 5.4.189-2
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph: 17.2.4-pve1
ceph-fuse: 17.2.4-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-network-perl: 0.7.1
libpve-storage-perl: 7.2-8
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.6-1
proxmox-backup-file-restore: 2.2.6-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-1
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1



Code:
root@virtual41:~# ceph -s
  cluster:
    id:     37999294-04ee-4428-a4e3-62f4288b3c3d
    health: HEALTH_WARN
            1/7 mons down, quorum virtual45,virtual41,virtual43,virtual44,virtual47,virtual46

  services:
    mon: 7 daemons, quorum virtual45,virtual41,virtual43,virtual44,virtual47,virtual46 (age 14s), out of quorum: virtual42
    mgr: virtual40(active, since 27m), standbys: virtual36, virtual39, virtual47, virtual37, virtual42, virtual45, virtual46, virtual38, virtual44
    mds: 1/1 daemons up, 6 standby
    osd: 22 osds: 22 up (since 13h), 22 in (since 3d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 561 pgs
    objects: 9.78M objects, 24 TiB
    usage:   91 TiB used, 69 TiB / 160 TiB avail
    pgs:     561 active+clean

  io:
    client:   9.8 MiB/s rd, 14 MiB/s wr, 457 op/s rd, 1.11k op/s wr



Code:
root@virtual41:~# ceph osd df tree

ID   CLASS  WEIGHT     REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
 -1         160.10257         -  160 TiB   91 TiB   91 TiB  8.5 GiB  387 GiB   69 TiB  56.84  1.00    -          root default
 -3                 0         -      0 B      0 B      0 B      0 B      0 B      0 B      0     0    -              host virtual36
 -5          14.55478         -   15 TiB  8.4 TiB  8.4 TiB  844 MiB   36 GiB  6.1 TiB  58.03  1.02    -              host virtual37
  1    ssd    7.27739   1.00000  7.3 TiB  4.2 TiB  4.2 TiB  339 MiB   17 GiB  3.1 TiB  58.03  1.02  106      up          osd.1
 16    ssd    7.27739   1.00000  7.3 TiB  4.2 TiB  4.2 TiB  505 MiB   19 GiB  3.1 TiB  58.03  1.02  100      up          osd.16
 -7          14.55478         -   15 TiB  8.3 TiB  8.3 TiB  831 MiB   37 GiB  6.2 TiB  57.29  1.01    -              host virtual38
  2    ssd    7.27739   1.00000  7.3 TiB  3.7 TiB  3.7 TiB  333 MiB   16 GiB  3.6 TiB  50.63  0.89   91      up          osd.2
 17    ssd    7.27739   1.00000  7.3 TiB  4.7 TiB  4.6 TiB  498 MiB   21 GiB  2.6 TiB  63.96  1.13  119      up          osd.17
 -9          14.55478         -   15 TiB  7.8 TiB  7.8 TiB  719 MiB   35 GiB  6.7 TiB  53.71  0.95    -              host virtual39
  7    ssd    7.27739   1.00000  7.3 TiB  3.8 TiB  3.8 TiB  329 MiB   16 GiB  3.5 TiB  52.49  0.92   95      up          osd.7
 18    ssd    7.27739   1.00000  7.3 TiB  4.0 TiB  4.0 TiB  390 MiB   19 GiB  3.3 TiB  54.94  0.97   99      up          osd.18
-11          14.55478         -   15 TiB  7.5 TiB  7.5 TiB  804 MiB   33 GiB  7.0 TiB  51.87  0.91    -              host virtual40
  4    ssd    7.27739   1.00000  7.3 TiB  3.6 TiB  3.5 TiB  345 MiB   15 GiB  3.7 TiB  48.84  0.86   90      up          osd.4
 19    ssd    7.27739   1.00000  7.3 TiB  4.0 TiB  4.0 TiB  459 MiB   19 GiB  3.3 TiB  54.90  0.97   99      up          osd.19
-13          14.55478         -   15 TiB  7.3 TiB  7.3 TiB  821 MiB   34 GiB  7.3 TiB  50.08  0.88    -              host virtual41
 15    ssd    7.27739   1.00000  7.3 TiB  3.3 TiB  3.3 TiB  358 MiB   15 GiB  3.9 TiB  45.77  0.81   85      up          osd.15
 20    ssd    7.27739   1.00000  7.3 TiB  4.0 TiB  3.9 TiB  463 MiB   18 GiB  3.3 TiB  54.39  0.96   96      up          osd.20
-15          14.55478         -   15 TiB  8.7 TiB  8.7 TiB  802 MiB   38 GiB  5.8 TiB  59.88  1.05    -              host virtual42
  5    ssd    7.27739   1.00000  7.3 TiB  4.3 TiB  4.3 TiB  340 MiB   17 GiB  3.0 TiB  59.22  1.04  105      up          osd.5
 21    ssd    7.27739   1.00000  7.3 TiB  4.4 TiB  4.4 TiB  462 MiB   21 GiB  2.9 TiB  60.54  1.07  112      up          osd.21
-17          14.55478         -   15 TiB  8.2 TiB  8.2 TiB  732 MiB   34 GiB  6.3 TiB  56.40  0.99    -              host virtual43
  0    ssd    7.27739   1.00000  7.3 TiB  3.8 TiB  3.8 TiB  361 MiB   17 GiB  3.5 TiB  51.80  0.91   97      up          osd.0
  3    ssd    7.27739   1.00000  7.3 TiB  4.4 TiB  4.4 TiB  370 MiB   17 GiB  2.8 TiB  60.99  1.07  107      up          osd.3
-19           7.27739         -  7.3 TiB  4.7 TiB  4.6 TiB  396 MiB   18 GiB  2.6 TiB  63.96  1.13    -              host virtual44
  8    ssd    7.27739   1.00000  7.3 TiB  4.7 TiB  4.6 TiB  396 MiB   18 GiB  2.6 TiB  63.96  1.13  110      up          osd.8
-21           7.27739         -  7.3 TiB  3.9 TiB  3.8 TiB  307 MiB   15 GiB  3.4 TiB  53.06  0.93    -              host virtual45
  9    ssd    7.27739   1.00000  7.3 TiB  3.9 TiB  3.8 TiB  307 MiB   15 GiB  3.4 TiB  53.06  0.93   99      up          osd.9
-23           7.27739         -  7.3 TiB  4.6 TiB  4.6 TiB  383 MiB   18 GiB  2.7 TiB  62.88  1.11    -              host virtual46
 10    ssd    7.27739   1.00000  7.3 TiB  4.6 TiB  4.6 TiB  383 MiB   18 GiB  2.7 TiB  62.88  1.11  107      up          osd.10
-25           7.27739         -  7.3 TiB  3.7 TiB  3.7 TiB  349 MiB   15 GiB  3.5 TiB  51.29  0.90    -              host virtual47
 11    ssd    7.27739   1.00000  7.3 TiB  3.7 TiB  3.7 TiB  349 MiB   15 GiB  3.5 TiB  51.29  0.90   95      up          osd.11
-27           7.27739         -  7.3 TiB  4.4 TiB  4.4 TiB  353 MiB   17 GiB  2.9 TiB  60.77  1.07    -              host virtual48
 12    ssd    7.27739   1.00000  7.3 TiB  4.4 TiB  4.4 TiB  353 MiB   17 GiB  2.9 TiB  60.77  1.07  107      up          osd.12
-29           7.27739         -  7.3 TiB  4.3 TiB  4.3 TiB  483 MiB   20 GiB  2.9 TiB  59.77  1.05    -              host virtual49
 13    ssd    7.27739   1.00000  7.3 TiB  4.3 TiB  4.3 TiB  483 MiB   20 GiB  2.9 TiB  59.77  1.05  105      up          osd.13
-31           7.27739         -  7.3 TiB  4.7 TiB  4.7 TiB  491 MiB   22 GiB  2.5 TiB  65.26  1.15    -              host virtual50
 14    ssd    7.27739   1.00000  7.3 TiB  4.7 TiB  4.7 TiB  491 MiB   22 GiB  2.5 TiB  65.26  1.15  111      up          osd.14
-33           7.27739         -  7.3 TiB  4.3 TiB  4.3 TiB  408 MiB   14 GiB  3.0 TiB  58.91  1.04    -              host virtual51
  6    ssd    7.27739   1.00000  7.3 TiB  4.3 TiB  4.3 TiB  408 MiB   14 GiB  3.0 TiB  58.91  1.04  109      up          osd.6
                          TOTAL  160 TiB   91 TiB   91 TiB  8.5 GiB  387 GiB   69 TiB  56.84
MIN/MAX VAR: 0.81/1.15  STDDEV: 5.24



Code:
root@virtual41:~# ceph device ls
DEVICE                                   HOST:DEV           DAEMONS  WEAR  LIFE EXPECTANCY
INTEL_SSDPE2KX080T8K_PHLJ951401KP8P0HGN  virtual48:nvme0n1  osd.12     1%
INTEL_SSDPE2KX080T8_BTLJ0493012H8P0HGN   virtual51:nvme0n1  osd.6      2%
INTEL_SSDPE2KX080T8_BTLJ0493012L8P0HGN   virtual42:nvme0n1  osd.5      2%
INTEL_SSDPE2KX080T8_BTLJ0493015E8P0HGN   virtual43:nvme0n1  osd.3      2%
INTEL_SSDPE2KX080T8_BTLJ0493015F8P0HGN   virtual40:nvme1n1  osd.4      2%
INTEL_SSDPE2KX080T8_BTLJ1271083R8P0HGN   virtual40:nvme0n1  osd.19     0%
INTEL_SSDPE2KX080T8_BTLJ8414021F8P0HGN   virtual39:nvme0n1  osd.7      2%
INTEL_SSDPE2KX080T8_PHLJ011000RZ8P0HGN   virtual43:nvme1n1  osd.0      2%
INTEL_SSDPE2KX080T8_PHLJ011000W18P0HGN   virtual37:nvme0n1  osd.1      2%
INTEL_SSDPE2KX080T8_PHLJ0393007V8P0HGN   virtual46:nvme0n1  osd.10     0%
INTEL_SSDPE2KX080T8_PHLJ040101XG8P0HGN   virtual47:nvme0n1  osd.11     0%
INTEL_SSDPE2KX080T8_PHLJ0430037Y8P0HGN   virtual41:nvme1n1  osd.15     0%
INTEL_SSDPE2KX080T8_PHLJ043100EN8P0HGN   virtual37:nvme1n1  osd.16     0%
INTEL_SSDPE2KX080T8_PHLJ043100M68P0HGN   virtual38:nvme1n1  osd.17     0%
INTEL_SSDPE2KX080T8_PHLJ043101H78P0HGN   virtual50:nvme0n1  osd.14     0%
INTEL_SSDPE2KX080T8_PHLJ043101LN8P0HGN   virtual45:nvme0n1  osd.9      0%
INTEL_SSDPE2KX080T8_PHLJ0432007N8P0HGN   virtual44:nvme0n1  osd.8      0%
INTEL_SSDPE2KX080T8_PHLJ043201C08P0HGN   virtual49:nvme0n1  osd.13     0%
INTEL_SSDPE2KX080T8_PHLJ043300CW8P0HGN   virtual41:nvme0n1  osd.20     0%
INTEL_SSDPE2KX080T8_PHLJ131500N48P0HGN   virtual42:nvme1n1  osd.21     0%
INTEL_SSDPE2KX080T8_PHLJ131500NB8P0HGN   virtual39:nvme1n1  osd.18     0%
INTEL_SSDPE2KX080T8_PHLJ946401A78P0HGN   virtual38:nvme0n1  osd.2      2%



Code:
root@virtual41:~# pveceph pool ls --noborder
Name            Size Min Size PG Num min. PG Num Optimal PG Num PG Autoscale Mode PG Autoscale Target Size PG Autoscale Target Ratio Crush Rule Name               %-Used Used
.mgr               4        2      1           1              1 on                                                                   replicated_rule 1.41735881697969e-06 74203136
CephFS_data        4        2     32                         32 on                                                                   replicated_rule  0.00131959957070649 69176320147
CephFS_metadata    4        2     16          16             16 on                                                                   replicated_rule 1.04289083537878e-05 545990662
CephRBD_NVMe       4        2    512                        512 on                                                                   replicated_rule    0.655374765396118 99559882709778
 
With regards to the Ceph status, don't worry about the 1 mon being down. They are on comparatively slower storage and spend a lot of time in get_health_metrics. That is one reason we have 7 mons: they are all active and running, but they come and go when they get bogged down with stats, and we have never had quorum threatened by it.


Thank you both. I can also add that under PVE 6.x and Ceph 16.x, albeit with fewer OSDs, we had a much higher tolerance for failures, and performance remained consistent with 1 OSD down and size=3/min_size=2.

I have also noticed that on Ceph 17, lowering the recovery/rebalance priority has no effect.

For last night's test, only the noout flag was set, but I will try it again with noout and norecover.
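For the record, the flag handling I plan to use around the next reboot looks roughly like this (with norebalance included as well, as Benedikt suggested):

Code:
# before the maintenance window
ceph osd set noout
ceph osd set norebalance
ceph osd set norecover
# ...reboot the host, wait for the OSD to rejoin...
ceph osd unset norecover
ceph osd unset norebalance
ceph osd unset noout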
 
BenediktS said:
In Ceph 17.2.x, the mclock scheduler is the new default scheduler for rebalancing/recovery. When Ceph 17.2.5 starts a rebalancing/recovery process, client reads/writes drop a lot, and Windows VMs start getting bluescreens. [...] Maybe your problem with the 2 downed OSDs was actually a recovery/rebalancing problem as well?

This sounds closest to what we are experiencing. What else do you know about it, and what else can be done besides using the norebalance flag?

To me, the 17.2.4 changelog suggests that these things have been fixed, when they have not.
 
Tonight, rebooting a single-OSD host with the OSD flags set went better. There was a brief appearance of slow OSD ops, but they completed within a few seconds.

I fear how Quincy would respond to an unplanned failure. One would need to react quickly to disable recovery, rebalancing, and backfill in order to maintain service.

I have also noticed that snapshots on Quincy are way slower than on Pacific and will also freeze VMs.
 
Best practice for Ceph is a maximum of 5 MONs; 3 MONs are OK for this cluster size.
Please fix the out-of-quorum MON first.
 
alyarb said:
This sounds closest to what we are experiencing. What else do you know about it, and what else can be done besides using the norebalance flag?

As far as I know, to get the pre-17.2.x behavior you have to activate the old wpq scheduler on the OSDs.

https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_op_queue

We haven't done it yet. I hope that the pull request will end up in the 17.2.6 release.
We are still discussing in our company what to do.
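If we go that route, the switch itself would look roughly like this; as far as I understand, osd_op_queue is only read at OSD startup, so the OSDs have to be restarted afterwards:

Code:
# set the old weighted-priority-queue scheduler for all OSDs
ceph config set osd osd_op_queue wpq
# then restart the OSDs one at a time, e.g. on each host:
systemctl restart ceph-osd@<id>.service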
 
Hmm, I cannot see anything terribly wrong with that cluster from the outputs. Should you run into similar problems again, please get the current Ceph status and OSD df tree, as that might give some more indication of what is going wrong.
Like @BenediktS, I am considering that it might have been a rebalance/recovery issue causing too much load.

Regarding the new scheduler:

The new mclock_scheduler works by having predefined profiles.

You can check which profile is currently active with:
Code:
ceph config show-with-defaults osd.<OSD ID> | grep mclock_profile

If the default profile "high_client_ops" does not provide good enough performance for the clients, you can edit the actual scheduler parameters by first switching to the custom profile:
Code:
ceph tell osd.* injectargs "--osd_mclock_profile=custom"

Then you can adjust, for example, the weight for client operations:
Code:
ceph tell osd.* injectargs "--osd_mclock_scheduler_client_wgt=4"

Lowering the osd_mclock_scheduler_background_recovery_res might also be a good idea in such a situation, so that the guaranteed IOPS for recovery are lower.
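For example, something along these lines; the value here is purely illustrative and needs to be matched to the measured IOPS capacity of your OSDs:

Code:
# only takes effect with the custom profile set as shown above
ceph tell osd.* injectargs "--osd_mclock_scheduler_background_recovery_res=10"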

What values best balance recovery speed and VM performance is something you will need to test if you run into a similar situation again, just like with the old parameters.

More details about the concepts behind the mclock_scheduler can be found in the Ceph docs.


So yeah, hopefully, once better defaults for the high_client_ops profile are decided on and released, customizing these settings will not be needed in most situations :)
 
Thanks Aaron. I am considering going back to wpq until the next release. I don't want to get into an open-ended situation of adjusting tunables all the time.
 
I don't want to get into an open-ended situation of adjusting tunables all the time.
Well, that can still happen with the wpq scheduler; the difference is that it has been around for a while, and therefore there are enough guides around on what to tune ;)
 
I hear you, but I already have the default profile that prioritizes client ops, yet during a recovery, a few slow OSD ops are all it takes to freeze VMs.

It seems like, regardless of my choice, I have to set the nobackfill and norecover flags and so on, and wait until the evening to start rebalancing and recovery.
 
We did our first tests with the 17.2.6 release from the "ceph quincy test" repository.
So far it looks like the new version no longer freezes the VMs when one disk fails.
(And the status "active+remapped+backfill_wait" returned to the status page :) )

We will run more tests with two and three disks failing, but it looks promising so far.
 
I'm going to try installing 17.2.6-pve1 now. Any other feedback?
 
With 3 OSDs down, the VMs have been very slow, but no freezing and no bluescreens.
It is definitely way better than with the default 17.2.5 parameters.

There is room for more improvement, but I unfortunately don't have time in the next few weeks to do more tweaking and testing.
 
Thank you. The 17.2.6 dashboard definitely had some unfortunate stylesheet changes. I haven't been able to test OSD failures yet. Are you using NVMe?

I'm going to attach a screenshot of the configuration database from the PVE GUI. There are some parameters I do not remember setting:

target_max_misplaced_ratio
osd_max_backfills
osd_recovery_max_active
osd_recovery_max_single_start
osd_recovery_sleep
osd_mclock_max_capacity_iops_ssd


Do you think I should put them back to defaults (and if so, how)?
 

Attachments

  • screenshot.png
We are using SSDs and NVMe drives.

Apart from the enabled telemetry data, the only parameter set in my configuration is osd_max_backfills, at 16.
I cannot remember setting this manually, but maybe I set it when our cluster was going down because it was backfilling 500 PGs at the same time on the old version.

But it looks like "1" is the default value for osd_max_backfills.

Code:
root@prox1:~# ceph config help osd_max_backfills
osd_max_backfills - Maximum number of concurrent local and remote backfills or recoveries per OSD
  (uint, advanced)
  Default: 1
  Can update at runtime: true
  Services: [osd]

If you want to remove the configuration line, so you can be sure you are using the default value, use this command:

Code:
ceph config rm osd osd_max_backfills
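The same pattern should work for the other entries from your screenshot; I would first check with ceph config dump which section each one is set in, since "osd" below is only a guess. As far as I know, osd_mclock_max_capacity_iops_ssd is written by the OSDs themselves at startup, so that one will probably reappear after removal.

Code:
# list everything stored in the cluster configuration database
ceph config dump
# drop individual overrides (match the section to the dump output)
ceph config rm osd osd_recovery_max_active
ceph config rm osd osd_recovery_sleep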
 
