Since upgrading our production Proxmox from 8.2.x to 8.4.14 and Ceph from 18.2.1 to 19.2.3, we have been observing slow OSDs, starting on day 2 after the upgrade.
We run daily backups of our production VMs from PVE to PBS, starting at 21:00 and finishing around 04:00. The slow OSDs show up during the backup of our database VM (at 02:00), which takes about 40 minutes.
What have we done since then to try to resolve the issue?
- Reshuffled the VM backup schedule to check whether the backups were slowing the OSDs; that is when we confirmed the backups were the trigger.
- Increased the slow-op warning thresholds from the defaults to the values below (source; see also the cluster-wide note after this list):
Code:
ceph config set osd.0 bluestore_slow_ops_warn_threshold 10
ceph config set osd.0 bluestore_slow_ops_warn_lifetime 60
- Tuned RocksDB: increased write_buffer_size from 16 MB to 64 MB and lowered max_write_buffer_number from 64 to 32; we are also planning to decrease max_background_jobs from 4 to 2 (a sketch of applying that follows the option dump further down).
- Disabled scrubbing during the database VM's backup window (see the cron sketch after this list).
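Two notes on the list above. First, `ceph config set osd.0 ...` only changes osd.0; since the warnings later show up on several OSDs, the cluster-wide variant (targeting the `osd` section, which is standard ceph config behaviour) would be:
Code:
ceph config set osd bluestore_slow_ops_warn_threshold 10
ceph config set osd bluestore_slow_ops_warn_lifetime 60
Second, a minimal sketch of how the scrub pause can be automated around the 02:00 database backup. The noscrub/nodeep-scrub flags are standard Ceph; the cron file and exact times here are just an illustration of our window, not our actual setup:
Code:
# /etc/cron.d/ceph-scrub-pause (illustrative) -- pause scrubbing around the DB backup
# set the flags shortly before the 02:00 backup starts ...
55 1 * * * root ceph osd set noscrub && ceph osd set nodeep-scrub
# ... and clear them once the ~40-minute backup has finished
0 3 * * * root ceph osd unset noscrub && ceph osd unset nodeep-scrub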
Other Information
- We are using 4 x 2235 GB HDDs (SAS, 10K RPM) per node with replication factor 3 across 3 nodes, with no SSDs for the OSDs, though Proxmox itself is installed on an SSD.
- We have been monitoring I/O waits for the last 15 days and have not observed any abnormalities there either.
- Our current bluestore_rocksdb_options:
Code:
ceph tell osd.0 config get bluestore_rocksdb_options
{ "bluestore_rocksdb_options": "compression=kLZ4Compression,max_write_buffer_number=32,min_write_buffer_number_to_merge=6,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,max_background_jobs=4,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,max_total_wal_size=1073741824,writable_file_max_buffer_size=0" }
Below is the log output.
Code:
cat ceph.log | grep slow
2026-01-14T02:18:10.838723+0545 mon.server-node1 (mon.0) 1067191 : cluster [WRN] Health check failed: 1 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)
2026-01-14T02:25:00.000080+0545 mon.server-node1 (mon.0) 1067375 : cluster [WRN] Health detail: HEALTH_WARN 1 OSD(s) experiencing slow operations in BlueStore
2026-01-14T02:25:00.000091+0545 mon.server-node1 (mon.0) 1067376 : cluster [WRN] [WRN] BLUESTORE_SLOW_OP_ALERT: 1 OSD(s) experiencing slow operations in BlueStore
2026-01-14T02:25:00.000096+0545 mon.server-node1 (mon.0) 1067377 : cluster [WRN] osd.11 observed slow operation indications in BlueStore
2026-01-14T02:25:43.219686+0545 mon.server-node1 (mon.0) 1067400 : cluster [WRN] Health check update: 2 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)
2026-01-14T02:28:00.864262+0545 mon.server-node1 (mon.0) 1067459 : cluster [WRN] Health check update: 3 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)
2026-01-14T02:29:13.093506+0545 mon.server-node1 (mon.0) 1067499 : cluster [WRN] Health check update: 4 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)
2026-01-14T02:31:27.079643+0545 mon.server-node1 (mon.0) 1067556 : cluster [WRN] Health check update: 5 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)
2026-01-14T02:32:25.235918+0545 mon.server-node1 (mon.0) 1067584 : cluster [WRN] Health check update: 6 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)
2026-01-14T02:35:00.000076+0545 mon.server-node1 (mon.0) 1067653 : cluster [WRN] Health detail: HEALTH_WARN 6 OSD(s) experiencing slow operations in BlueStore
2026-01-14T02:35:00.000086+0545 mon.server-node1 (mon.0) 1067654 : cluster [WRN] [WRN] BLUESTORE_SLOW_OP_ALERT: 6 OSD(s) experiencing slow operations in BlueStore
2026-01-14T02:35:00.000090+0545 mon.server-node1 (mon.0) 1067655 : cluster [WRN] osd.0 observed slow operation indications in BlueStore
2026-01-14T02:35:00.000101+0545 mon.server-node1 (mon.0) 1067656 : cluster [WRN] osd.2 observed slow operation indications in BlueStore
2026-01-14T02:35:00.000107+0545 mon.server-node1 (mon.0) 1067657 : cluster [WRN] osd.5 observed slow operation indications in BlueStore
2026-01-14T02:35:00.000119+0545 mon.server-node1 (mon.0) 1067658 : cluster [WRN] osd.7 observed slow operation indications in BlueStore
2026-01-14T02:35:00.000124+0545 mon.server-node1 (mon.0) 1067659 : cluster [WRN] osd.8 observed slow operation indications in BlueStore
2026-01-14T02:35:00.000128+0545 mon.server-node1 (mon.0) 1067660 : cluster [WRN] osd.11 observed slow operation indications in BlueStore
2026-01-14T02:37:08.727768+0545 mon.server-node1 (mon.0) 1067721 : cluster [WRN] Health check update: 7 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)
2026-01-14T02:45:00.000092+0545 mon.server-node1 (mon.0) 1067927 : cluster [WRN] Health detail: HEALTH_WARN 7 OSD(s) experiencing slow operations in BlueStore
2026-01-14T02:45:00.000107+0545 mon.server-node1 (mon.0) 1067928 : cluster [WRN] [WRN] BLUESTORE_SLOW_OP_ALERT: 7 OSD(s) experiencing slow operations in BlueStore
2026-01-14T02:45:00.000112+0545 mon.server-node1 (mon.0) 1067929 : cluster [WRN] osd.0 observed slow operation indications in BlueStore
2026-01-14T02:45:00.000124+0545 mon.server-node1 (mon.0) 1067930 : cluster [WRN] osd.2 observed slow operation indications in BlueStore
2026-01-14T02:45:00.000131+0545 mon.server-node1 (mon.0) 1067931 : cluster [WRN] osd.5 observed slow operation indications in BlueStore
2026-01-14T02:45:00.000136+0545 mon.server-node1 (mon.0) 1067932 : cluster [WRN] osd.7 observed slow operation indications in BlueStore
2026-01-14T02:45:00.000143+0545 mon.server-node1 (mon.0) 1067933 : cluster [WRN] osd.8 observed slow operation indications in BlueStore
2026-01-14T02:45:00.000147+0545 mon.server-node1 (mon.0) 1067934 : cluster [WRN] osd.10 observed slow operation indications in BlueStore
2026-01-14T02:45:00.000152+0545 mon.server-node1 (mon.0) 1067935 : cluster [WRN] osd.11 observed slow operation indications in BlueStore
2026-01-14T02:55:00.000090+0545 mon.server-node1 (mon.0) 1068205 : cluster [WRN] overall HEALTH_WARN 7 OSD(s) experiencing slow operations in BlueStore
2026-01-14T03:05:00.000099+0545 mon.server-node1 (mon.0) 1068479 : cluster [WRN] overall HEALTH_WARN 7 OSD(s) experiencing slow operations in BlueStore
2026-01-14T03:11:55.325619+0545 mon.server-node1 (mon.0) 1068671 : cluster [WRN] Health check update: 6 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)
2026-01-14T03:15:00.000078+0545 mon.server-node1 (mon.0) 1068756 : cluster [WRN] Health detail: HEALTH_WARN 6 OSD(s) experiencing slow operations in BlueStore
2026-01-14T03:15:00.000092+0545 mon.server-node1 (mon.0) 1068757 : cluster [WRN] [WRN] BLUESTORE_SLOW_OP_ALERT: 6 OSD(s) experiencing slow operations in BlueStore
2026-01-14T03:15:00.000107+0545 mon.server-node1 (mon.0) 1068758 : cluster [WRN] osd.0 observed slow operation indications in BlueStore
2026-01-14T03:15:00.000110+0545 mon.server-node1 (mon.0) 1068759 : cluster [WRN] osd.2 observed slow operation indications in BlueStore
2026-01-14T03:15:00.000126+0545 mon.server-node1 (mon.0) 1068760 : cluster [WRN] osd.5 observed slow operation indications in BlueStore
2026-01-14T03:15:00.000136+0545 mon.server-node1 (mon.0) 1068761 : cluster [WRN] osd.7 observed slow operation indications in BlueStore
2026-01-14T03:15:00.000141+0545 mon.server-node1 (mon.0) 1068762 : cluster [WRN] osd.8 observed slow operation indications in BlueStore
2026-01-14T03:15:00.000146+0545 mon.server-node1 (mon.0) 1068763 : cluster [WRN] osd.11 observed slow operation indications in BlueStore
2026-01-14T03:25:00.000081+0545 mon.server-node1 (mon.0) 1069036 : cluster [WRN] overall HEALTH_WARN 6 OSD(s) experiencing slow operations in BlueStore
2026-01-14T03:25:37.501635+0545 mon.server-node1 (mon.0) 1069057 : cluster [WRN] Health check update: 5 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)
2026-01-14T03:25:44.469737+0545 mon.server-node1 (mon.0) 1069061 : cluster [WRN] Health check update: 4 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)
2026-01-14T03:27:41.631527+0545 mon.server-node1 (mon.0) 1069117 : cluster [WRN] Health check update: 3 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)
2026-01-14T03:29:13.547711+0545 mon.server-node1 (mon.0) 1069162 : cluster [WRN] Health check update: 2 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)
2026-01-14T03:29:19.612722+0545 mon.server-node1 (mon.0) 1069165 : cluster [WRN] Health check update: 1 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)
2026-01-14T03:32:17.772964+0545 mon.server-node1 (mon.0) 1069245 : cluster [INF] Health check cleared: BLUESTORE_SLOW_OP_ALERT (was: 1 OSD(s) experiencing slow operations in BlueStore)
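To see how the warnings distribute across disks, a quick tally over the same log can help spot whether the slow ops concentrate on particular OSDs (assuming the field layout shown above, where field 8 is the osd.N token):
Code:
grep 'observed slow operation' ceph.log | awk '{print $8}' | sort | uniq -c | sort -rn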
Code:
cat ceph-osd.0.log | grep "slow operation" -A 5 -B 5
** File Read Latency Histogram By Level [P] **
2026-01-14T01:59:33.518+0545 72e4315b16c0 0 bluestore.MempoolThread fragmentation_score=0.465069 took=0.193281s
2026-01-14T02:00:25.079+0545 72e42d5a96c0 4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723021. Immutable memtables: 1.
2026-01-14T02:01:36.890+0545 72e4235956c0 0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _do_readv, latency = 5.554507256s, num_ios = 693
2026-01-14T02:01:36.892+0545 72e4235956c0 0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for readv, latency = 5.556301117s
2026-01-14T02:01:37.980+0545 72e4225936c0 0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _do_readv, latency = 6.644386292s, num_ios = 632
2026-01-14T02:01:37.982+0545 72e4225936c0 0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for readv, latency = 6.646862030s
2026-01-14T02:01:38.278+0545 72e421d926c0 0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _do_readv, latency = 6.942356586s, num_ios = 696
2026-01-14T02:01:38.280+0545 72e421d926c0 0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for readv, latency = 6.944276810s
2026-01-14T02:01:38.309+0545 72e4215916c0 0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _do_readv, latency = 5.030138016s, num_ios = 604
2026-01-14T02:01:38.311+0545 72e4215916c0 0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for readv, latency = 5.031661510s
2026-01-14T02:03:12.895+0545 72e42d5a96c0 4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723022. Immutable memtables: 2.
2026-01-14T02:05:24.275+0545 72e42d5a96c0 4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723023. Immutable memtables: 3.
2026-01-14T02:07:38.721+0545 72e42d5a96c0 4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723024. Immutable memtables: 4.
--
2026-01-14T02:21:09.910+0545 72e42d5a96c0 4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723030. Immutable memtables: 10.
2026-01-14T02:23:53.037+0545 72e42d5a96c0 4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723031. Immutable memtables: 11.
2026-01-14T02:25:34.610+0545 72e421d926c0 0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _do_readv, latency = 5.095891953s, num_ios = 641
2026-01-14T02:25:34.612+0545 72e421d926c0 0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for readv, latency = 5.098573208s
2026-01-14T02:26:46.179+0545 72e42d5a96c0 4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723032. Immutable memtables: 12.
2026-01-14T02:28:14.566+0545 72e42d5a96c0 4 rocksdb: [db/db_impl/db_impl_write.cc:1601] Flushing all column families with data in WAL number 2723010. Total log size is 1073791804 while max_total_wal_size is 1073741824
2026-01-14T02:28:14.566+0545 72e42d5a96c0 4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [default] New memtable created with log file: #2723033. Immutable memtables: 0.
--
2026-01-14T02:33:05.626+0545 72e42d5a96c0 4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723044. Immutable memtables: 1.
2026-01-14T02:35:05.234+0545 72e42d5a96c0 4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723045. Immutable memtables: 2.
2026-01-14T02:35:56.923+0545 72e4215916c0 0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _do_readv, latency = 5.056079865s, num_ios = 331
2026-01-14T02:35:56.925+0545 72e4215916c0 0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for readv, latency = 5.057349682s
2026-01-14T02:35:56.992+0545 72e421d926c0 0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _do_readv, latency = 5.140451431s, num_ios = 526
2026-01-14T02:35:56.992+0545 72e421d926c0 0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for readv, latency = 5.141887665s
2026-01-14T02:35:57.446+0545 72e422d946c0 0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _do_readv, latency = 5.598603249s, num_ios = 672
2026-01-14T02:35:57.446+0545 72e422d946c0 0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for readv, latency = 5.600515366s
2026-01-14T02:35:57.587+0545 72e4225936c0 0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _do_readv, latency = 5.673437119s, num_ios = 677
2026-01-14T02:35:57.588+0545 72e4225936c0 0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency slow operation observed for readv, latency = 5.675236225s
2026-01-14T02:37:11.632+0545 72e42d5a96c0 4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723046. Immutable memtables: 3.
2026-01-14T02:38:34.710+0545 72e42d5a96c0 4 rocksdb: [db/db_impl/db_impl_write.cc:2098] [L] New memtable created with log file: #2723047. Immutable memtables: 4.
2026-01-14T02:39:33.386+0545 72e42c5a76c0 4 rocksdb: [db/db_impl/db_impl.cc:1109] ------- DUMPING STATS -------
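One thing that stands out to us in the osd.0 log: the immutable-memtable count for the [L] column family climbs steadily (1 -> 12) between 02:00 and ~02:26, until the max_total_wal_size flush kicks in at 02:28, which suggests memtable flushes are not keeping up with incoming writes during the backup window. A quick way to check the backlog peak (assuming the log format shown above):
Code:
grep -o 'Immutable memtables: [0-9]*' ceph-osd.0.log | awk '{print $3}' | sort -n | tail -1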