Hi all,
We have recently upgraded our cluster from version 5.4 to version 6.3. The cluster is composed of 6 Ceph nodes and 5 hypervisors, and all servers run the same package versions. Here are the details:
Code:
pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.101-1-pve)
pve-manager: 6.3-4 (running version: 6.3-4/0a38c56f)
pve-kernel-5.4: 6.3-6
pve-kernel-helper: 6.3-6
pve-kernel-5.4.101-1-pve: 5.4.101-1
pve-kernel-4.15: 5.4-19
pve-kernel-4.15.18-30-pve: 4.15.18-58
pve-kernel-4.15.18-27-pve: 4.15.18-55
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.15.18-9-pve: 4.15.18-30
ceph: 14.2.16-pve1
ceph-fuse: 14.2.16-pve1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.8-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-5
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.2.0-2
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-5
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.3-pve2
We've followed the steps in the upgrade procedure guide and everything went fine, but after a few days we started to get crashes on the OSD side. We have already deep-scrubbed all the OSDs and reinstalled the OSDs on each node (one node at a time, destroying all the OSDs and recreating them), unfortunately without result.
Code:
sudo ceph crash ls
ID ENTITY NEW
2021-03-07_01:12:26.218909Z_208c3854-65b9-4eed-b49b-e2be77945470 mon.is104
2021-03-09_06:33:27.088249Z_297f620f-6cd0-4120-9076-ea82cd50c34f osd.7
2021-03-13_04:29:51.479099Z_856b779d-6077-4882-9c72-73f1ac21f754 osd.11
2021-03-13_04:30:04.931045Z_3f875be2-2cfe-4004-9c1e-7248cfb019e5 osd.11
2021-03-13_04:30:16.013286Z_ebfd1eeb-de25-40b2-88ad-f3c1c99bf6ff osd.11
2021-03-13_04:30:27.712840Z_59776330-87d3-4cf8-9c8d-909e767c0c00 osd.11
2021-03-13_10:15:45.143602Z_fd12bded-4b67-4e5d-9043-818fe2b0762e osd.11
2021-03-13_10:15:57.017921Z_24d29e3f-8fc9-450f-91ef-3602b104cefd osd.11
2021-03-13_10:16:08.347586Z_ae6fe305-9086-4518-8f1f-57446b7fa1d8 osd.11
2021-03-13_10:16:55.169232Z_cf73388a-990b-440f-9a81-88fa95079e17 osd.11
2021-03-13_10:17:07.046529Z_a71aa847-7037-4548-8eac-69bc678edf98 osd.11
2021-03-13_10:17:18.350953Z_f966f8af-ec25-4938-adaf-777fca12cdc5 osd.11
2021-03-17_21:17:07.029168Z_95c8bfb0-641b-4f7b-a3a9-6dd056569521 osd.31
2021-03-19_07:55:53.344451Z_cd76eea6-6054-4661-ad3f-00df36a155fb osd.29
2021-03-21_05:01:57.468501Z_063378aa-05c6-4c18-9b8d-fc114959b2bb osd.28
2021-03-22_10:27:31.074965Z_b1de5fda-99bf-4d8b-9d84-1e54db925d6d osd.4
2021-03-22_17:31:00.510777Z_1bedff2f-a39b-438b-84bc-324bde91c6fb osd.30
2021-03-22_23:28:01.827518Z_f8d7c19c-7d83-4eee-ab27-09c47b704419 osd.29
sudo ceph crash info 2021-03-22_23:28:01.827518Z_f8d7c19c-7d83-4eee-ab27-09c47b704419
{
"os_version_id": "10",
"utsname_machine": "x86_64",
"entity_name": "osd.29",
"backtrace": [
"(()+0x12730) [0x7f2640439730]",
"(gsignal()+0x10b) [0x7f263ff1c7bb]",
"(abort()+0x121) [0x7f263ff07535]",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x55c48a47ac6f]",
"(()+0x513df6) [0x55c48a47adf6]",
"(BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)+0xc7e) [0x55c48aaa238e]",
"(BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x20) [0x55c48aad0820]",
"(rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x956) [0x55c48b0ded56]",
"(rocksdb::BlockFetcher::ReadBlockContents()+0x589) [0x55c48b0a2f49]",
"(rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache(rocksdb::FilePrefetchBuffer*, rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::BlockBasedTable::CachableEntry<rocksdb::Block>*, bool, rocksdb::GetContext*)+0x475) [0x55c48b091975]",
"(rocksdb::DataBlockIter* rocksdb::BlockBasedTable::NewDataBlockIterator<rocksdb::DataBlockIter>(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::DataBlockIter*, bool, bool, bool, rocksdb::GetContext*, rocksdb::Status, rocksdb::FilePrefetchBuffer*)+0x390) [0x55c48b09edf0]",
"(rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform const*, bool)+0x492) [0x55c48b09a9d2]",
"(rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform const*, rocksdb::HistogramImpl*, bool, int)+0x194) [0x55c48b010914]",
"(rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, bool*, bool*, unsigned long*, rocksdb::ReadCallback*, bool*)+0x35f) [0x55c48b02b04f]",
"(rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, bool*, rocksdb::ReadCallback*, bool*)+0x9a0) [0x55c48af48930]",
"(rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*)+0x26) [0x55c48af48c46]",
"(rocksdb::DB::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0xae) [0x55c48af574ce]",
"(RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v14_2_0::list*)+0x2bf) [0x55c48af1892f]",
"(()+0xa33fa9) [0x55c48a99afa9]",
"(()+0xa1eb11) [0x55c48a985b11]",
"(BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x557) [0x55c48a9a0a67]",
"(BlueStore::_do_read(BlueStore::Collection*, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int, unsigned long)+0x35a) [0x55c48a9e983a]",
"(BlueStore::read(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int)+0x1b5) [0x55c48a9f7925]",
"(ReplicatedBackend::objects_read_sync(hobject_t const&, unsigned long, unsigned long, unsigned int, ceph::buffer::v14_2_0::list*)+0xa3) [0x55c48a85d833]",
"(PrimaryLogPG::do_read(PrimaryLogPG::OpContext*, OSDOp&)+0x21b) [0x55c48a702e7b]",
"(PrimaryLogPG::do_osd_ops(PrimaryLogPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&)+0x36ef) [0x55c48a71c58f]",
"(PrimaryLogPG::prepare_transaction(PrimaryLogPG::OpContext*)+0x14f) [0x55c48a72be6f]",
"(PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x385) [0x55c48a72c605]",
"(PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x3101) [0x55c48a730cc1]",
"(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xd77) [0x55c48a7330c7]",
"(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x392) [0x55c48a55eec2]",
"(PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x55c48a802fe2]",
"(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7d7) [0x55c48a57ac07]",
"(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) [0x55c48ab4b7f4]",
"(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55c48ab4e200]",
"(()+0x7fa3) [0x7f264042efa3]",
"(clone()+0x3f) [0x7f263ffde4cf]"
],
"assert_line": 1565,
"utsname_release": "5.4.101-1-pve",
"assert_file": "/build/ceph/ceph-14.2.16/src/os/bluestore/BlueFS.cc",
"utsname_sysname": "Linux",
"os_version": "10 (buster)",
"os_id": "10",
"assert_thread_name": "tp_osd_tp",
"assert_msg": "/build/ceph/ceph-14.2.16/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)' thread 7f26246e5700 time 2021-03-23 00:28:01.804118\n/build/ceph/ceph-14.2.16/src/os/bluestore/BlueFS.cc: 1565: FAILED ceph_assert(r == 0)\n",
"assert_func": "int BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)",
"ceph_version": "14.2.16",
"os_name": "Debian GNU/Linux 10 (buster)",
"timestamp": "2021-03-22 23:28:01.827518Z",
"process_name": "ceph-osd",
"archived": "2021-03-23 00:49:14.832600",
"utsname_hostname": "is104",
"crash_id": "2021-03-22_23:28:01.827518Z_f8d7c19c-7d83-4eee-ab27-09c47b704419",
"assert_condition": "r == 0",
"utsname_version": "#1 SMP PVE 5.4.101-1 (Fri, 26 Feb 2021 13:13:09 +0100)"
}
sudo ceph --version
ceph version 14.2.16 (5d5ae817209e503a412040d46b3374855b7efe04) nautilus (stable)
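For reference, the per-node rebuild mentioned above followed roughly this sequence (OSD 7 is just a placeholder ID, not necessarily one of the affected OSDs); the same steps were repeated for every OSD on a node before moving on to the next node:
Code:
# deep scrub (run for every OSD in turn)
ceph osd deep-scrub 7
# take the OSD out of the cluster and stop its daemon
ceph osd out 7
systemctl stop ceph-osd@7
# remove the OSD and wipe its disks
pveceph osd destroy 7 --cleanup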
All servers have 6 OSDs, with the DB on an NVMe disk; each DB partition is 30 GB.
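Each OSD was then recreated with its DB on the NVMe, roughly like this (/dev/sdX and /dev/nvme0n1 are placeholders for the actual devices):
Code:
# create the OSD with a 30 GiB block.db on the shared NVMe
pveceph osd create /dev/sdX --db_dev /dev/nvme0n1 --db_size 30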
Reading the information from the crash dump, I can't work out what happened or what the cause is. Does anyone have a clue?