OSD crashes after reboot of cluster node (Ceph)

sebastian_k

Member
Sep 10, 2020
Hello,

After updating the Ceph cluster and rebooting the node, we cannot get one of the OSDs to start; after a manual start it crashes again right away. It is always the same OSD.
The Ceph crash info follows below:

JSON:
{
    "os_version_id": "10",
    "assert_condition": "p != spanning_blob_map.end()",
    "utsname_release": "5.4.55-1-pve",
    "os_name": "Debian GNU/Linux 10 (buster)",
    "entity_name": "osd.7",
    "assert_file": "/build/ceph-JY24tx/ceph-14.2.11/src/os/bluestore/BlueStore.h",
    "timestamp": "2020-09-10 09:10:43.062623Z",
    "process_name": "ceph-osd",
    "utsname_machine": "x86_64",
    "assert_line": 828,
    "utsname_sysname": "Linux",
    "os_version": "10 (buster)",
    "os_id": "10",
    "assert_thread_name": "tp_osd_tp",
    "utsname_version": "#1 SMP PVE 5.4.55-1 (Mon, 10 Aug 2020 10:26:27 +0200)",
    "backtrace": [
        "(()+0x12730) [0x7f4d56c1a730]",
        "(gsignal()+0x10b) [0x7f4d566fd7bb]",
        "(abort()+0x121) [0x7f4d566e8535]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x559ae05cc419]",
        "(()+0x5115a0) [0x559ae05cc5a0]",
        "(BlueStore::ExtentMap::decode_some(ceph::buffer::v14_2_0::list&)+0x28c) [0x559ae0aedd0c]",
        "(BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x581) [0x559ae0aefe91]",
        "(BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>, std::allocator<BlueStore::SharedBlob*> >*)+0x33f) [0x559ae0b4621f]",
        "(BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>)+0xd1) [0x559ae0b46a41]",
        "(BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)+0x62) [0x559ae0b48402]",
        "(BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x153d) [0x559ae0b4d34d]",
        "(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x3c8) [0x559ae0b4eeb8]",
        "(ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x80) [0x559ae0732500]",
        "(non-virtual thunk to PrimaryLogPG::queue_transaction(ObjectStore::Transaction&&, boost::intrusive_ptr<OpRequest>)+0x4f) [0x559ae08be9bf]",
        "(ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x461) [0x559ae09bbb11]",
        "(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2a8) [0x559ae09c3af8]",
        "(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x57) [0x559ae08d4e17]",
        "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x61f) [0x559ae088384f]",
        "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x392) [0x559ae06aff02]",
        "(PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x559ae0953e92]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7d7) [0x559ae06cbba7]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) [0x559ae0c980c4]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x559ae0c9aad0]",
        "(()+0x7fa3) [0x7f4d56c0ffa3]",
        "(clone()+0x3f) [0x7f4d567bf4cf]"
    ],
    "utsname_hostname": "pveXX",
    "assert_msg": "/build/ceph-JY24tx/ceph-14.2.11/src/os/bluestore/BlueStore.h: In function 'BlueStore::BlobRef BlueStore::ExtentMap::get_spanning_blob(int)' thread 7f4d35c92700 time 2020-09-10 11:10:43.051455\n/build/ceph-JY24tx/ceph-14.2.11/src/os/bluestore/BlueStore.h: 828: FAILED ceph_assert(p != spanning_blob_map.end())\n",
    "crash_id": "2020-09-10_09:10:43.062623Z_c86c5da6-7508-47ec-aac2-c4147d3b1cbc",
    "assert_func": "BlueStore::BlobRef BlueStore::ExtentMap::get_spanning_blob(int)",
    "ceph_version": "14.2.11"
}
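
For reference, this kind of dump can be retrieved on any node with the ceph crash tooling; the crash ID below is the one from the output above:

Bash:
# List all recorded crashes, then show the full report for one of them:
ceph crash ls
ceph crash info 2020-09-10_09:10:43.062623Z_c86c5da6-7508-47ec-aac2-c4147d3b1cbc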

Another reboot of the machine did not help either. According to the S.M.A.R.T. values, all SSDs are fine and show no problems. So far we have not been able to find any other clues regarding this error/crash.

Unfortunately, we are at a loss at this point. Does anyone have an idea how we can repair this and get the OSD running again?
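
One thing we could still try is a BlueStore consistency check with ceph-bluestore-tool while the OSD is stopped. A minimal sketch, assuming the failing OSD is osd.7 under the default data path (adjust the ID and path to your setup):

Bash:
# Stop the OSD daemon first; fsck needs exclusive access to the store.
systemctl stop ceph-osd@7.service
# Check the BlueStore metadata for inconsistencies:
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-7
# A repair subcommand exists as well, but use it with caution on a
# degraded cluster:
# ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-7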
 
Since it is always the same disk (and this apparently only happens on this one), a hardware problem is the most obvious suspicion. Try stopping this disk, removing it as an OSD, and replacing it with a different one.
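
A minimal command sketch of that procedure, assuming the affected OSD is osd.7 and /dev/sdX is a placeholder for the replacement disk (adjust both to your setup):

Bash:
# Stop the crashing OSD and mark it "out" so its PGs recover elsewhere:
systemctl stop ceph-osd@7.service
ceph osd out 7
# Once the cluster reports HEALTH_OK again, remove the OSD completely:
ceph osd purge 7 --yes-i-really-mean-it
# After swapping in the replacement disk, create a new OSD on it,
# either via the Proxmox GUI or on the command line:
pveceph osd create /dev/sdX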
 
