Hi,
In a setup with Ceph we have a problem:
An OSD goes down immediately and the pool goes read-only.
In the log I have these entries:
Code:
root@ph-pve006:~# grep 'osd.8' /var/log/ceph/ceph.audit.log
2020-09-17 13:05:11.773212 mon.ph-pve004 (mon.0) 3575117 : audit [INF] from='osd.8 ' entity='osd.8' cmd=[{"prefix": "osd crush set-device-class", "class": "ssd", "ids": ["8"]}]: dispatch
2020-09-17 13:05:11.775337 mon.ph-pve004 (mon.0) 3575118 : audit [INF] from='osd.8 ' entity='osd.8' cmd=[{"prefix": "osd crush create-or-move", "id": 8, "weight":1.7461, "args": ["host=ph-pve006", "root=default"]}]: dispatch
2020-09-17 13:05:11.781545 mon.ph-pve006 (mon.1) 3338019 : audit [INF] from='osd.8 [v2:10.0.45.43:6816/509242,v1:10.0.45.43:6817/509242]' entity='osd.8' cmd=[{"prefix": "osd crush set-device-class", "class": "ssd", "ids": ["8"]}]: dispatch
2020-09-17 13:05:11.783787 mon.ph-pve006 (mon.1) 3338020 : audit [INF] from='osd.8 [v2:10.0.45.43:6816/509242,v1:10.0.45.43:6817/509242]' entity='osd.8' cmd=[{"prefix": "osd crush create-or-move", "id": 8, "weight":1.7461, "args": ["host=ph-pve006", "root=default"]}]: dispatch
2020-09-17 13:05:19.608085 mon.ph-pve004 (mon.0) 3575178 : audit [INF] from='osd.8 ' entity='osd.8' cmd=[{"prefix": "osd crush set-device-class", "class": "ssd", "ids": ["8"]}]: dispatch
2020-09-17 13:05:19.611308 mon.ph-pve004 (mon.0) 3575179 : audit [INF] from='osd.8 ' entity='osd.8' cmd=[{"prefix": "osd crush create-or-move", "id": 8, "weight":1.7461, "args": ["host=ph-pve006", "root=default"]}]: dispatch
2020-09-17 13:05:19.616324 mon.ph-pve005 (mon.2) 3410338 : audit [INF] from='osd.8 [v2:10.0.45.43:6816/509449,v1:10.0.45.43:6817/509449]' entity='osd.8' cmd=[{"prefix": "osd crush set-device-class", "class": "ssd", "ids": ["8"]}]: dispatch
2020-09-17 13:05:19.619604 mon.ph-pve005 (mon.2) 3410339 : audit [INF] from='osd.8 [v2:10.0.45.43:6816/509449,v1:10.0.45.43:6817/509449]' entity='osd.8' cmd=[{"prefix": "osd crush create-or-move", "id": 8, "weight":1.7461, "args": ["host=ph-pve006", "root=default"]}]: dispatch
2020-09-17 13:05:28.323236 mon.ph-pve004 (mon.0) 3575231 : audit [INF] from='osd.8 [v2:10.0.45.43:6816/509654,v1:10.0.45.43:6817/509654]' entity='osd.8' cmd=[{"prefix": "osd crush set-device-class", "class": "ssd", "ids": ["8"]}]: dispatch
2020-09-17 13:05:28.325814 mon.ph-pve004 (mon.0) 3575232 : audit [INF] from='osd.8 [v2:10.0.45.43:6816/509654,v1:10.0.45.43:6817/509654]' entity='osd.8' cmd=[{"prefix": "osd crush create-or-move", "id": 8, "weight":1.7461, "args": ["host=ph-pve006", "root=default"]}]: dispatch
2020-09-17 13:05:44.364742 mon.ph-pve004 (mon.0) 3575288 : audit [INF] from='mgr.204125 10.0.45.41:0/1975' entity='mgr.ph-pve004' cmd=[{"prefix":"config-key set","key":"mgr/crash/crash/2020-09-17_11:05:21.190075Z_1ab3986c-6cd8-4976-ac2e-0751e967f4e9","val":"{\"os_version_id\": \"10\", \"assert_condition\": \"r == 0\", \"utsname_release\": \"5.3.18-1-pve\", \"os_name\": \"Debian GNU/Linux 10 (buster)\", \"entity_name\": \"osd.8\", \"assert_file\": \"/mnt/npool/tlamprecht/pve-ceph/ceph-14.2.6/src/os/bluestore/BlueStore.cc\", \"timestamp\": \"2020-09-17 11:05:21.190075Z\", \"process_name\": \"ceph-osd\", \"utsname_machine\": \"x86_64\", \"assert_line\": 9152, \"utsname_sysname\": \"Linux\", \"os_version\": \"10 (buster)\", \"os_id\": \"10\", \"assert_thread_name\": \"tp_osd_tp\", \"utsname_version\": \"#1 SMP PVE 5.3.18-1 (Wed, 05 Feb 2020 11:49:10 +0100)\", \"backtrace\": [\"(()+0x12730) [0x7f3b108a6730]\", \"(gsignal()+0x10b) [0x7f3b103897bb]\", \"(abort()+0x121) [0x7f3b10374535]\", \"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x560ece4e7ad7]\", \"(()+0x518c5e) [0x560ece4e7c5e]\", \"(BlueStore::_do_read(BlueStore::Collection*, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int, unsigned long)+0x39e7) [0x560ecea527f7]\", \"(BlueStore::read(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int)+0x1b5) [0x560ecea56595]\", \"(ReplicatedBackend::objects_read_sync(hobject_t const&, unsigned long, unsigned long, unsigned int, ceph::buffer::v14_2_0::list*)+0xa3) [0x560ece8c59a3]\", \"(PrimaryLogPG::do_sparse_read(PrimaryLogPG::OpContext*, OSDOp&)+0x5bd) [0x560ece77048d]\", \"(PrimaryLogPG::do_osd_ops(PrimaryLogPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&)+0x7c77) [0x560ece787127]\", \"(PrimaryLogPG::prepare_transaction(PrimaryLogPG::OpContext*)+0x14f) [0x560ece79251f]\", 
\"(PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x385) [0x560ece792cb5]\", \"(PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x3101) [0x560ece797371]\", \"(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xd77) [0x560ece799777]\", \"(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x397) [0x560ece5c8b77]\", \"(PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x560ece86aa52]\", \"(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7d7) [0x560ece5e3d17]\", \"(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) [0x560eceba2864]\", \"(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x560eceba5270]\", \"(()+0x7fa3) [0x7f3b1089bfa3]\", \"(clone()+0x3f) [0x7f3b1044b4cf]\"], \"utsname_hostname\": \"ph-pve006.peiker-holding.de\", \"assert_msg\": \"/mnt/npool/tlamprecht/pve-ceph/ceph-14.2.6/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::_do_read(BlueStore::Collection*, BlueStore::OnodeRef, uint64_t, size_t, ceph::bufferlist&, uint32_t, uint64_t)' thread 7f3af390f700 time 2020-09-17 13:05:21.166856\\n/mnt/npool/tlamprecht/pve-ceph/ceph-14.2.6/src/os/bluestore/BlueStore.cc: 9152: FAILED ceph_assert(r == 0)\\n\", \"crash_id\": \"2020-09-17_11:05:21.190075Z_1ab3986c-6cd8-4976-ac2e-0751e967f4e9\", \"assert_func\": \"int BlueStore::_do_read(BlueStore::Collection*, BlueStore::OnodeRef, uint64_t, size_t, ceph::bufferlist&, uint32_t, uint64_t)\", \"ceph_version\": \"14.2.6\"}"}]: dispatch
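The relevant part of that dump is the BlueStore read assert. It can be isolated with a one-liner like this (sketch; the here-document stands in for the real `/var/log/ceph/ceph.audit.log`, which the same pattern works on directly):

```shell
# Extract the failed assert from a crash-dump log excerpt.
# Replace the here-document with the real audit log file if you have it.
grep -o 'FAILED ceph_assert([^)]*)' <<'EOF'
BlueStore.cc: 9152: FAILED ceph_assert(r == 0)
EOF
# prints: FAILED ceph_assert(r == 0)
```

A `FAILED ceph_assert(r == 0)` in `BlueStore::_do_read` usually means a read from the underlying device returned an error, which often points at failing hardware under that OSD.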
This has occurred for the second time this week; on Monday night two OSDs on another node went down.
After killing all VMs and waiting some time we could reactivate the OSDs.
Now, two days later, an OSD on another node is down.
Any ideas about the cause or a solution?
Bye
Gregor