Experiencing slow OSDs after upgrading Ceph version 18 to 19 in Proxmox v8.4

statrtsmall

New Member
Jan 14, 2026
Since we upgraded our production Proxmox VE from 8.2.x to 8.4.14 and Ceph from 18.2.1 to 19.2.3, we have been seeing slow OSD warnings, starting on day 2 after the upgrade.
We run daily backups of our production VMs from PVE to PBS between 21:00 and ~04:00. The slow OSD warnings show up during the database VM's backup (around 02:00), which takes ~40 minutes.

What have we tried so far to resolve the issue?
- Increased the warning thresholds from the defaults to the values below. source
Code:
ceph config set osd.0 bluestore_slow_ops_warn_threshold 10
ceph config set osd.0 bluestore_slow_ops_warn_lifetime 60
- Reshuffled the VM backup timing to check whether the backups were what was slowing the OSDs. That is when we confirmed the backups were the trigger.

- Tuned RocksDB: increased write_buffer_size from 16 MB to 64 MB, reduced max_write_buffer_number from 64 to 32, and we are planning to decrease `max_background_jobs` from 4 to 2 (see the sketch just after this list).

- Disabled scrubbing during the database VM's backup window.
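
A minimal sketch of how these settings could be applied cluster-wide rather than per OSD (our threshold commands above only touched osd.0). Note that bluestore_rocksdb_options is only read when an OSD starts, so the OSDs would need to be restarted one at a time, and the scrub scheduling below is only an illustration matching our 02:00 backup window.
Code:
# Raise the slow-op warning thresholds for the whole osd section, not just osd.0
ceph config set osd bluestore_slow_ops_warn_threshold 10
ceph config set osd bluestore_slow_ops_warn_lifetime 60

# Push the tuned RocksDB options cluster-wide (takes effect only after an OSD restart)
ceph config set osd bluestore_rocksdb_options "compression=kLZ4Compression,max_write_buffer_number=32,min_write_buffer_number_to_merge=6,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,max_background_jobs=4,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,max_total_wal_size=1073741824,writable_file_max_buffer_size=0"

# Example /etc/cron.d entries to pause scrubbing around the database VM backup
# 55 1 * * * root ceph osd set noscrub && ceph osd set nodeep-scrub
# 15 3 * * * root ceph osd unset noscrub && ceph osd unset nodeep-scrub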

Other Information
- We are using 4 x 2235 GB HDDs (SAS, 10K RPM) with replication factor 3 across 3 nodes; no SSDs are used for Ceph. Proxmox itself is installed on an SSD, though.
- We have been monitoring I/O wait for the last 15 days and have not seen any abnormalities (a quick per-disk cross-check is sketched after this list).
- Our current bluestore_rocksdb_options:
Code:
ceph tell osd.0 config get bluestore_rocksdb_options

{ "bluestore_rocksdb_options": "compression=kLZ4Compression,max_write_buffer_number=32,min_write_buffer_number_to_merge=6,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,max_background_jobs=4,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,max_total_wal_size=1073741824,writable_file_max_buffer_size=0" }

Below is the relevant log output.


Code:
cat ceph.log | grep slow

2026-01-14T02:18:10.838723+0545 mon.server-node1 (mon.0) 1067191 : cluster [WRN] Health check failed: 1 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)

2026-01-14T02:25:00.000080+0545 mon.server-node1 (mon.0) 1067375 : cluster [WRN] Health detail: HEALTH_WARN 1 OSD(s) experiencing slow operations in BlueStore

2026-01-14T02:25:00.000091+0545 mon.server-node1 (mon.0) 1067376 : cluster [WRN] [WRN] BLUESTORE_SLOW_OP_ALERT: 1 OSD(s) experiencing slow operations in BlueStore

2026-01-14T02:25:00.000096+0545 mon.server-node1 (mon.0) 1067377 : cluster [WRN]      osd.11 observed slow operation indications in BlueStore

2026-01-14T02:25:43.219686+0545 mon.server-node1 (mon.0) 1067400 : cluster [WRN] Health check update: 2 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)

2026-01-14T02:28:00.864262+0545 mon.server-node1 (mon.0) 1067459 : cluster [WRN] Health check update: 3 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)

2026-01-14T02:29:13.093506+0545 mon.server-node1 (mon.0) 1067499 : cluster [WRN] Health check update: 4 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)

2026-01-14T02:31:27.079643+0545 mon.server-node1 (mon.0) 1067556 : cluster [WRN] Health check update: 5 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)

2026-01-14T02:32:25.235918+0545 mon.server-node1 (mon.0) 1067584 : cluster [WRN] Health check update: 6 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)

2026-01-14T02:35:00.000076+0545 mon.server-node1 (mon.0) 1067653 : cluster [WRN] Health detail: HEALTH_WARN 6 OSD(s) experiencing slow operations in BlueStore

2026-01-14T02:35:00.000086+0545 mon.server-node1 (mon.0) 1067654 : cluster [WRN] [WRN] BLUESTORE_SLOW_OP_ALERT: 6 OSD(s) experiencing slow operations in BlueStore

2026-01-14T02:35:00.000090+0545 mon.server-node1 (mon.0) 1067655 : cluster [WRN]      osd.0 observed slow operation indications in BlueStore

2026-01-14T02:35:00.000101+0545 mon.server-node1 (mon.0) 1067656 : cluster [WRN]      osd.2 observed slow operation indications in BlueStore

2026-01-14T02:35:00.000107+0545 mon.server-node1 (mon.0) 1067657 : cluster [WRN]      osd.5 observed slow operation indications in BlueStore

2026-01-14T02:35:00.000119+0545 mon.server-node1 (mon.0) 1067658 : cluster [WRN]      osd.7 observed slow operation indications in BlueStore

2026-01-14T02:35:00.000124+0545 mon.server-node1 (mon.0) 1067659 : cluster [WRN]      osd.8 observed slow operation indications in BlueStore

2026-01-14T02:35:00.000128+0545 mon.server-node1 (mon.0) 1067660 : cluster [WRN]      osd.11 observed slow operation indications in BlueStore

2026-01-14T02:37:08.727768+0545 mon.server-node1 (mon.0) 1067721 : cluster [WRN] Health check update: 7 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)

2026-01-14T02:45:00.000092+0545 mon.server-node1 (mon.0) 1067927 : cluster [WRN] Health detail: HEALTH_WARN 7 OSD(s) experiencing slow operations in BlueStore

2026-01-14T02:45:00.000107+0545 mon.server-node1 (mon.0) 1067928 : cluster [WRN] [WRN] BLUESTORE_SLOW_OP_ALERT: 7 OSD(s) experiencing slow operations in BlueStore

2026-01-14T02:45:00.000112+0545 mon.server-node1 (mon.0) 1067929 : cluster [WRN]      osd.0 observed slow operation indications in BlueStore

2026-01-14T02:45:00.000124+0545 mon.server-node1 (mon.0) 1067930 : cluster [WRN]      osd.2 observed slow operation indications in BlueStore

2026-01-14T02:45:00.000131+0545 mon.server-node1 (mon.0) 1067931 : cluster [WRN]      osd.5 observed slow operation indications in BlueStore

2026-01-14T02:45:00.000136+0545 mon.server-node1 (mon.0) 1067932 : cluster [WRN]      osd.7 observed slow operation indications in BlueStore

2026-01-14T02:45:00.000143+0545 mon.server-node1 (mon.0) 1067933 : cluster [WRN]      osd.8 observed slow operation indications in BlueStore

2026-01-14T02:45:00.000147+0545 mon.server-node1 (mon.0) 1067934 : cluster [WRN]      osd.10 observed slow operation indications in BlueStore

2026-01-14T02:45:00.000152+0545 mon.server-node1 (mon.0) 1067935 : cluster [WRN]      osd.11 observed slow operation indications in BlueStore

2026-01-14T02:55:00.000090+0545 mon.server-node1 (mon.0) 1068205 : cluster [WRN] overall HEALTH_WARN 7 OSD(s) experiencing slow operations in BlueStore

2026-01-14T03:05:00.000099+0545 mon.server-node1 (mon.0) 1068479 : cluster [WRN] overall HEALTH_WARN 7 OSD(s) experiencing slow operations in BlueStore

2026-01-14T03:11:55.325619+0545 mon.server-node1 (mon.0) 1068671 : cluster [WRN] Health check update: 6 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)

2026-01-14T03:15:00.000078+0545 mon.server-node1 (mon.0) 1068756 : cluster [WRN] Health detail: HEALTH_WARN 6 OSD(s) experiencing slow operations in BlueStore

2026-01-14T03:15:00.000092+0545 mon.server-node1 (mon.0) 1068757 : cluster [WRN] [WRN] BLUESTORE_SLOW_OP_ALERT: 6 OSD(s) experiencing slow operations in BlueStore

2026-01-14T03:15:00.000107+0545 mon.server-node1 (mon.0) 1068758 : cluster [WRN]      osd.0 observed slow operation indications in BlueStore

2026-01-14T03:15:00.000110+0545 mon.server-node1 (mon.0) 1068759 : cluster [WRN]      osd.2 observed slow operation indications in BlueStore

2026-01-14T03:15:00.000126+0545 mon.server-node1 (mon.0) 1068760 : cluster [WRN]      osd.5 observed slow operation indications in BlueStore

2026-01-14T03:15:00.000136+0545 mon.server-node1 (mon.0) 1068761 : cluster [WRN]      osd.7 observed slow operation indications in BlueStore

2026-01-14T03:15:00.000141+0545 mon.server-node1 (mon.0) 1068762 : cluster [WRN]      osd.8 observed slow operation indications in BlueStore

2026-01-14T03:15:00.000146+0545 mon.server-node1 (mon.0) 1068763 : cluster [WRN]      osd.11 observed slow operation indications in BlueStore

2026-01-14T03:25:00.000081+0545 mon.server-node1 (mon.0) 1069036 : cluster [WRN] overall HEALTH_WARN 6 OSD(s) experiencing slow operations in BlueStore

2026-01-14T03:25:37.501635+0545 mon.server-node1 (mon.0) 1069057 : cluster [WRN] Health check update: 5 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)

2026-01-14T03:25:44.469737+0545 mon.server-node1 (mon.0) 1069061 : cluster [WRN] Health check update: 4 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)

2026-01-14T03:27:41.631527+0545 mon.server-node1 (mon.0) 1069117 : cluster [WRN] Health check update: 3 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)

2026-01-14T03:29:13.547711+0545 mon.server-node1 (mon.0) 1069162 : cluster [WRN] Health check update: 2 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)

2026-01-14T03:29:19.612722+0545 mon.server-node1 (mon.0) 1069165 : cluster [WRN] Health check update: 1 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)

2026-01-14T03:32:17.772964+0545 mon.server-node1 (mon.0) 1069245 : cluster [INF] Health check cleared: BLUESTORE_SLOW_OP_ALERT (was: 1 OSD(s) experiencing slow operations in BlueStore)

Code:
cat ceph-osd.0.log | grep "slow operation" -A 5 -B 5
** File Read Latency Histogram By Level [P] **

2026-01-14T01:59:33.518+0545 72e4315b16c0  0 bluestore.MempoolThread fragmentation_score=0.465069 took=0.193281s
2026-01-14T02:00:25.079+0545 72e42d5a96c0  4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723021. Immutable memtables: 1.

2026-01-14T02:01:36.890+0545 72e4235956c0  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _do_readv, latency = 5.554507256s, num_ios = 693
2026-01-14T02:01:36.892+0545 72e4235956c0  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for readv, latency = 5.556301117s
2026-01-14T02:01:37.980+0545 72e4225936c0  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _do_readv, latency = 6.644386292s, num_ios = 632
2026-01-14T02:01:37.982+0545 72e4225936c0  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for readv, latency = 6.646862030s
2026-01-14T02:01:38.278+0545 72e421d926c0  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _do_readv, latency = 6.942356586s, num_ios = 696
2026-01-14T02:01:38.280+0545 72e421d926c0  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for readv, latency = 6.944276810s
2026-01-14T02:01:38.309+0545 72e4215916c0  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _do_readv, latency = 5.030138016s, num_ios = 604
2026-01-14T02:01:38.311+0545 72e4215916c0  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for readv, latency = 5.031661510s
2026-01-14T02:03:12.895+0545 72e42d5a96c0  4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723022. Immutable memtables: 2.

2026-01-14T02:05:24.275+0545 72e42d5a96c0  4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723023. Immutable memtables: 3.

2026-01-14T02:07:38.721+0545 72e42d5a96c0  4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723024. Immutable memtables: 4.
--

2026-01-14T02:21:09.910+0545 72e42d5a96c0  4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723030. Immutable memtables: 10.

2026-01-14T02:23:53.037+0545 72e42d5a96c0  4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723031. Immutable memtables: 11.

2026-01-14T02:25:34.610+0545 72e421d926c0  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _do_readv, latency = 5.095891953s, num_ios = 641
2026-01-14T02:25:34.612+0545 72e421d926c0  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for readv, latency = 5.098573208s
2026-01-14T02:26:46.179+0545 72e42d5a96c0  4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723032. Immutable memtables: 12.

2026-01-14T02:28:14.566+0545 72e42d5a96c0  4 rocksdb: [db/db_impl/db_impl_write.cc:1601] Flushing all column families with data in WAL number 2723010. Total log size is 1073791804 while max_total_wal_size is 1073741824
2026-01-14T02:28:14.566+0545 72e42d5a96c0  4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [default] New memtable created with log file: #2723033. Immutable memtables: 0.

--

2026-01-14T02:33:05.626+0545 72e42d5a96c0  4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723044. Immutable memtables: 1.

2026-01-14T02:35:05.234+0545 72e42d5a96c0  4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723045. Immutable memtables: 2.

2026-01-14T02:35:56.923+0545 72e4215916c0  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _do_readv, latency = 5.056079865s, num_ios = 331
2026-01-14T02:35:56.925+0545 72e4215916c0  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for readv, latency = 5.057349682s
2026-01-14T02:35:56.992+0545 72e421d926c0  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _do_readv, latency = 5.140451431s, num_ios = 526
2026-01-14T02:35:56.992+0545 72e421d926c0  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for readv, latency = 5.141887665s
2026-01-14T02:35:57.446+0545 72e422d946c0  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _do_readv, latency = 5.598603249s, num_ios = 672
2026-01-14T02:35:57.446+0545 72e422d946c0  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for readv, latency = 5.600515366s
2026-01-14T02:35:57.587+0545 72e4225936c0  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _do_readv, latency = 5.673437119s, num_ios = 677
2026-01-14T02:35:57.588+0545 72e4225936c0  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for readv, latency = 5.675236225s
2026-01-14T02:37:11.632+0545 72e42d5a96c0  4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723046. Immutable memtables: 3.

2026-01-14T02:38:34.710+0545 72e42d5a96c0  4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723047. Immutable memtables: 4.

2026-01-14T02:39:33.386+0545 72e42c5a76c0  4 rocksdb: [db/db_impl/db_impl.cc:1109] ------- DUMPING STATS -------
 
This is a new warning in the newer version, per the thread you linked. So the question is, is your I/O actually slower than before or are you just seeing a warning that didn't exist before? "lifetime" is seconds and "threshold" is a count, IIRC, so 60/10 is 10 alerts in 60 seconds and then the counter resets. You can just adjust them until the warnings go away...or start moving to SSDs.
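
One way to answer that could be to compare the BlueStore latency counters on a flagged OSD during and outside the backup window; a rough sketch (osd.0 is just an example, run on the node that hosts it):
Code:
# Dump only the bluestore perf counters from the OSD's admin socket
ceph daemon osd.0 perf dump bluestore

# Reset the counters between samples for clean before/after numbers
ceph daemon osd.0 perf reset bluestore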
 
Thanks for replying so fast. We are not facing any I/O or speed issues; we are just concerned about the warnings. I have seen in other threads that Squid is more aggressive ("power hungry"). We suspect a RocksDB issue, and we are wondering what we can tune to make these warnings go away effectively, not just suppress them (a few checks we are looking at are sketched after the iozone output below).
Code:
iozone speed test
Children see throughput for  1 initial writers  = 1014515.50 kB/sec
        Parent sees throughput for  1 initial writers   =  173502.03 kB/sec
        Min throughput per process                      = 1014515.50 kB/sec
        Max throughput per process                      = 1014515.50 kB/sec
        Avg throughput per process                      = 1014515.50 kB/sec
        Min xfer                                        = 1048576.00 kB

        Children see throughput for  1 rewriters        = 2583475.75 kB/sec
        Parent sees throughput for  1 rewriters         =  175236.77 kB/sec
        Min throughput per process                      = 2583475.75 kB/sec
        Max throughput per process                      = 2583475.75 kB/sec
        Avg throughput per process                      = 2583475.75 kB/sec
        Min xfer                                        = 1048576.00 kB

        Children see throughput for 1 random readers    = 1551088.12 kB/sec
        Parent sees throughput for 1 random readers     = 1548399.22 kB/sec
        Min throughput per process                      = 1551088.12 kB/sec
        Max throughput per process                      = 1551088.12 kB/sec
        Avg throughput per process                      = 1551088.12 kB/sec
        Min xfer                                        = 1048576.00 kB

        Children see throughput for 1 random writers    = 1290579.00 kB/sec
        Parent sees throughput for 1 random writers     =  165026.83 kB/sec
        Min throughput per process                      = 1290579.00 kB/sec
        Max throughput per process                      = 1290579.00 kB/sec
        Avg throughput per process                      = 1290579.00 kB/sec
        Min xfer                                        = 1048576.00 kB
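
For our own digging on the RocksDB side, a sketch of the checks we are looking at (osd.0 is just the OSD from the log above; run on the node that hosts it):
Code:
# RocksDB perf counters for one OSD
ceph daemon osd.0 perf dump rocksdb

# BlueStore allocator fragmentation score (the log above showed ~0.47)
ceph daemon osd.0 bluestore allocator score block

# Manual compaction of the OSD's RocksDB, off-peak only
ceph tell osd.0 compact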
 
We started with v19, so I'm not that familiar with older versions.

For a while we did keep a few old HDDs, but we used SSDs for DB/WAL. We had rare alerts, as I posted in that other thread, but I imagine it depends on several things... total I/O, CPU, etc. Note that using one SSD for all four HDDs will work (if it is big enough) but becomes a single point of failure for all four OSDs. FWIW.
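
If you do get SSDs for DB/WAL later, the OSDs have to be recreated with a separate DB device; a rough sketch (device names are placeholders, and Proxmox's pveceph osd create / GUI can do the equivalent):
Code:
# One shared SSD carrying the RocksDB/WAL for four HDD OSDs
ceph-volume lvm batch /dev/sdb /dev/sdc /dev/sdd /dev/sde --db-devices /dev/nvme0n1

# Or per OSD
ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1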
 
Upgrading to SSDs for DB/WAL is not currently possible in our infrastructure. Interestingly, we didn't encounter slow-ops warnings while running on Reef, and despite applying the recommended tunings the alerts persist. Are there any version-specific configurations or optimizations we might have overlooked to restore the behaviour we had on Reef? Given that the alerts pop up during PBS backups, we suspect this is more a sensitivity issue with the warning thresholds than a hard performance drop.

Any guidance would be greatly appreciated.
Thanks!