Hi! I have a cluster running PVE 7.2.7 and Ceph 16.2.9.
About a month ago, some virtual machines in my cluster that have the CIS hardening applied (https://github.com/ansible-lockdown/UBUNTU20-CIS) started to hang for no apparent reason. Only resetting the machine helps. There is nothing in the VMs' own logs that would suggest a cause. In the Ceph logs there are errors about osd.54 like:
...
** File Read Latency Histogram By Level [default] **
2023-06-11T02:00:57.664+0600 7f69beb2e700 0 bluestore(/var/lib/ceph/osd/ceph-54) log_latency slow operation observed for kv_final, latency = 5.295015335s
2023-06-11T02:00:57.668+0600 7f69beb2e700 0 bluestore(/var/lib/ceph/osd/ceph-54) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.264364243s, txc = 0x55d756e7c000
2023-06-11T02:00:57.668+0600 7f69beb2e700 0 bluestore(/var/lib/ceph/osd/ceph-54) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.263929367s, txc = 0x55d6f2c1fc00
2023-06-11T02:00:57.668+0600 7f69beb2e700 0 bluestore(/var/lib/ceph/osd/ceph-54) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.148843288s, txc = 0x55d773da0000
2023-06-11T02:01:12.929+0600 7f69c8dda700 0 bad crc in data 4237898677 != exp 706411430 from v1:192.168.160.5:0/171261133
2023-06-11T02:01:20.709+0600 7f69ae31c700 0 bluestore(/var/lib/ceph/osd/ceph-54) log_latency slow operation observed for submit_transact, latency = 5.904376507s
2023-06-11T02:01:20.709+0600 7f69beb2e700 0 bluestore(/var/lib/ceph/osd/ceph-54) log_latency slow operation observed for kv_final, latency = 5.099318504s
2023-06-11T02:01:20.709+0600 7f69beb2e700 0 bluestore(/var/lib/ceph/osd/ceph-54) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.101370811s, txc = 0x55d7d2caa380
2023-06-11T02:01:20.709+0600 7f69beb2e700 0 bluestore(/var/lib/ceph/osd/ceph-54) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.905891418s, txc = 0x55d761e26000
...
many lines like:
2023-06-11T02:01:52.926+0600 7f69c8dda700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f69af31e700' had timed out after 15.000000954s
and after that:
2023-06-11T02:02:23.654+0600 7f69af31e700 0 bluestore(/var/lib/ceph/osd/ceph-54) log_latency slow operation observed for submit_transact, latency = 45.771850586s
2023-06-11T02:02:23.654+0600 7f69af31e700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f69af31e700' had timed out after 15.000000954s
2023-06-11T02:02:23.654+0600 7f69beb2e700 0 bluestore(/var/lib/ceph/osd/ceph-54) log_latency_fn slow operation observed for _txc_committed_kv, latency = 45.773029327s, txc = 0x55d7aa9bdc00
2023-06-11T02:02:34.362+0600 7f69b9b24700 4 rocksdb: [db_impl/db_impl_write.cc:1665] [default] New memtable created with log file: #318321. Immutable memtables: 0.
2023-06-11T02:02:34.366+0600 7f69c0340700 4 rocksdb: (Original Log Time 2023/06/11-02:02:34.367705) [db_impl/db_impl_compaction_flush.cc:2611] Compaction nothing to do
2023-06-11T02:02:34.366+0600 7f69bfb3f700 4 rocksdb: (Original Log Time 2023/06/11-02:02:34.367851) [db_impl/db_impl_compaction_flush.cc:2190] Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots available 1, flush slots scheduled 1, compaction slots scheduled 0
2023-06-11T02:02:34.366+0600 7f69bfb3f700 4 rocksdb: [flush_job.cc:318] [default] [JOB 17566] Flushing memtable with next log file: 318321
However, the OSD does not drop out of the Ceph cluster and continues to work afterwards. Please help me figure out what the cause might be.
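To put some numbers on how often osd.54 stalls, the "slow operation observed" lines can be summarized with a small script. Below is a minimal Python sketch based only on the log format shown above; the log path /var/lib/ceph/... style default and the regex are assumptions, adjust them for your setup.

#!/usr/bin/env python3
# Minimal sketch: summarize BlueStore "slow operation observed" lines
# from a Ceph OSD log. Log path and regex are assumptions based on the
# output pasted above; adjust for your environment.
import re
import sys
from collections import defaultdict

log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/ceph/ceph-osd.54.log"

# Matches lines like:
# 2023-06-11T02:01:20.709+0600 ... slow operation observed for submit_transact, latency = 5.904376507s
pattern = re.compile(
    r"^(?P<ts>\S+)\s.*slow operation observed for (?P<op>\w+), latency = (?P<lat>[\d.]+)s"
)

slow_ops = defaultdict(list)  # operation name -> list of (latency, timestamp)
with open(log_path, errors="replace") as fh:
    for line in fh:
        m = pattern.match(line)
        if m:
            slow_ops[m.group("op")].append((float(m.group("lat")), m.group("ts")))

for op, entries in sorted(slow_ops.items()):
    worst_lat, worst_ts = max(entries)
    print(f"{op:20s} count={len(entries):5d} worst={worst_lat:8.2f}s at {worst_ts}")

Running it against the osd.54 log shows whether the stalls cluster around particular times (here they all pile up shortly after 02:00).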
UPDATE 1: I changed the default SCSI controller to VirtIO and the VMs have stopped hanging hard, at least for now.
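For anyone else hitting this: the setting involved is the SCSI controller type (scsihw) in /etc/pve/qemu-server/<vmid>.conf. The lines below are only an illustration with a made-up VM id, storage and disk name, not my exact config:

# before: the old default LSI controller
scsihw: lsi
scsi0: ceph-pool:vm-100-disk-0,size=32G

# after: VirtIO SCSI controller
scsihw: virtio-scsi-pci
scsi0: ceph-pool:vm-100-disk-0,size=32G

The same change can be made from the CLI with "qm set 100 --scsihw virtio-scsi-pci"; as far as I know the VM needs a full stop/start (not just a reboot from inside the guest) to pick up the new controller.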