OSD Crashing

gyverchang

Hi guys,

Would appreciate some assistance with this, as it is quite an urgent issue.

This morning I received a call from our client: it seems two NVMe OSDs on their site crashed for no apparent reason. The other, non-NVMe OSDs are running normally without any issues.

The issue is with osd.5 and osd.7, both showing similar errors. We have two other NVMe OSDs running on other servers, but those started normally.


pve-manager/7.3-6/723bb6ec
Ceph 16.2.11

Would really appreciate some help on this. Thank you!


root@abc007sr:~# journalctl --since "2 min ago"

-- Journal begins at Tue 2023-02-07 13:53:15 +07, ends at Sun 2023-08-20 17:15:54 +07. --

Aug 20 17:13:59 abc007sr ceph-mon[2614901]: 2023-08-20T17:13:59.713+0700 7fa35bd10700 -1 mon.abc007sr@1(peon) e11 get_health_metrics reporting 4 slow ops, oldest is mon_command({"format":"json","prefix":"osd pool autoscale-status"} v 0)

Aug 20 17:14:01 abc007sr systemd[1]: ceph-osd@5.service: Scheduled restart job, restart counter is at 2.

Aug 20 17:14:01 abc007sr systemd[1]: Stopped Ceph object storage daemon osd.5.

Aug 20 17:14:01 abc007sr systemd[1]: Starting Ceph object storage daemon osd.5...

Aug 20 17:14:01 abc007sr ceph-osd-prestart.sh[4023182]: OSD data directory /var/lib/ceph/osd/ceph-5 does not exist; bailing out.

Aug 20 17:14:01 abc007sr systemd[1]: ceph-osd@5.service: Control process exited, code=exited, status=1/FAILURE

Aug 20 17:14:01 abc007sr systemd[1]: ceph-osd@5.service: Failed with result 'exit-code'.

Aug 20 17:14:01 abc007sr systemd[1]: Failed to start Ceph object storage daemon osd.5.

Aug 20 17:14:04 abc007sr ceph-mon[2614901]: 2023-08-20T17:14:04.713+0700 7fa35bd10700 -1 mon.abc007sr@1(peon) e11 get_health_metrics reporting 4 slow ops, oldest is mon_command({"format":"json","prefix":"osd pool autoscale-status"} v 0)

Aug 20 17:14:09 abc007sr ceph-mon[2614901]: 2023-08-20T17:14:09.717+0700 7fa35bd10700 -1 mon.abc007sr@1(peon) e11 get_health_metrics reporting 4 slow ops, oldest is mon_command({"format":"json","prefix":"osd pool autoscale-status"} v 0)

Aug 20 17:14:11 abc007sr systemd[1]: ceph-osd@5.service: Scheduled restart job, restart counter is at 3.

Aug 20 17:14:11 abc007sr systemd[1]: Stopped Ceph object storage daemon osd.5.

Aug 20 17:14:11 abc007sr systemd[1]: Starting Ceph object storage daemon osd.5...

Aug 20 17:14:11 abc007sr ceph-osd-prestart.sh[4049702]: OSD data directory /var/lib/ceph/osd/ceph-5 does not exist; bailing out.

Aug 20 17:14:11 abc007sr systemd[1]: ceph-osd@5.service: Control process exited, code=exited, status=1/FAILURE

Aug 20 17:14:11 abc007sr systemd[1]: ceph-osd@5.service: Failed with result 'exit-code'.

Aug 20 17:14:11 abc007sr systemd[1]: Failed to start Ceph object storage daemon osd.5.

Aug 20 17:14:14 abc007sr ceph-mon[2614901]: 2023-08-20T17:14:14.717+0700 7fa35bd10700 -1 mon.abc007sr@1(peon) e11 get_health_metrics reporting 4 slow ops, oldest is mon_command({"format":"json","prefix":"osd pool autoscale-status"} v 0)

Aug 20 17:14:19 abc007sr ceph-mon[2614901]: 2023-08-20T17:14:19.717+0700 7fa35bd10700 -1 mon.abc007sr@1(peon) e11 get_health_metrics reporting 4 slow ops, oldest is mon_command({"format":"json","prefix":"osd pool autoscale-status"} v 0)

Aug 20 17:14:21 abc007sr systemd[1]: ceph-osd@5.service: Scheduled restart job, restart counter is at 4.

Aug 20 17:14:21 abc007sr systemd[1]: Stopped Ceph object storage daemon osd.5.

Aug 20 17:14:21 abc007sr systemd[1]: ceph-osd@5.service: Start request repeated too quickly.

Aug 20 17:14:21 abc007sr systemd[1]: ceph-osd@5.service: Failed with result 'exit-code'.

Aug 20 17:14:21 abc007sr systemd[1]: Failed to start Ceph object storage daemon osd.5.

Aug 20 17:14:24 abc007sr ceph-mon[2614901]: 2023-08-20T17:14:24.717+0700 7fa35bd10700 -1 mon.abc007sr@1(peon) e11 get_health_metrics reporting 4 slow ops, oldest is mon_command({"format":"json","prefix":"osd pool autoscale-status"} v 0)

Aug 20 17:14:24 abc007sr ceph-crash[2620]: WARNING:ceph-crash:post /var/lib/ceph/crash/2023-08-20T08:40:56.609031Z_a918c08e-c5e6-4427-834e-49a4d1a8d58f as client.admin failed: (None, b'')

Aug 20 17:14:25 abc007sr ceph-crash[2620]: WARNING:ceph-crash:post /var/lib/ceph/crash/2023-08-20T07:25:02.324887Z_7d314e4c-48f3-4bc5-9081-0c1eb2ee30b8 as client.crash.abc007sr failed: (None, b'2023-08-20T17:14:25.065+0700 7f528ce21700 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.crash.abc007sr.keyring: (2) No such file or directory\n2023-08-20T17:14:25.065+0700 7f528ce21700 -1 AuthRegistry(0x7f528805b7a8>

Aug 20 17:14:25 abc007sr ceph-crash[2620]: WARNING:ceph-crash:post /var/lib/ceph/crash/2023-08-20T07:25:02.324887Z_7d314e4c-48f3-4bc5-9081-0c1eb2ee30b8 as client.crash failed: (None, b'2023-08-20T17:14:25.145+0700 7f91b2e4a700 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.crash.keyring: (2) No such file or directory\n2023-08-20T17:14:25.145+0700 7f91b2e4a700 -1 AuthRegistry(0x7f91ac05b748) no keyring found>

Aug 20 17:14:29 abc007sr ceph-mon[2614901]: 2023-08-20T17:14:29.717+0700 7fa35bd10700 -1 mon.abc007sr@1(peon) e11 get_health_metrics reporting 4 slow ops, oldest is mon_command({"format":"json","prefix":"osd pool autoscale-status"} v 0)

Aug 20 17:14:34 abc007sr ceph-mon[2614901]: 2023-08-20T17:14:34.717+0700 7fa35bd10700 -1 mon.abc007sr@1(peon) e11 get_health_metrics reporting 4 slow ops, oldest is mon_command({"format":"json","prefix":"osd pool autoscale-status"} v 0)

Aug 20 17:14:39 abc007sr ceph-mon[2614901]: 2023-08-20T17:14:39.725+0700 7fa35bd10700 -1 mon.abc007sr@1(peon) e11 get_health_metrics reporting 4 slow ops, oldest is mon_command({"format":"json","prefix":"osd pool autoscale-status"} v 0)

Aug 20 17:14:44 abc007sr ceph-mon[2614901]: 2023-08-20T17:14:44.725+0700 7fa35bd10700 -1 mon.abc007sr@1(peon) e11 get_health_metrics reporting 4 slow ops, oldest is mon_command({"format":"json","prefix":"osd pool autoscale-status"} v 0)

Aug 20 17:14:46 abc007sr pmxcfs[2895]: [status] notice: received log

Aug 20 17:14:49 abc007sr ceph-mon[2614901]: 2023-08-20T17:14:49.729+0700 7fa35bd10700 -1 mon.abc007sr@1(peon) e11 get_health_metrics reporting 4 slow ops, oldest is mon_command({"format":"json","prefix":"osd pool autoscale-status"} v 0)

Aug 20 17:14:54 abc007sr ceph-mon[2614901]: 2023-08-20T17:14:54.729+0700 7fa35bd10700 -1 mon.abc007sr@1(peon) e11 get_health_metrics reporting 4 slow ops, oldest is mon_command({"format":"json","prefix":"osd pool autoscale-status"} v 0)

Aug 20 17:14:55 abc007sr ceph-crash[2620]: WARNING:ceph-crash:post /var/lib/ceph/crash/2023-08-20T07:25:02.324887Z_7d314e4c-48f3-4bc5-9081-0c1eb2ee30b8 as client.admin failed: (None, b'')

Aug 20 17:14:55 abc007sr ceph-crash[2620]: WARNING:ceph-crash:post /var/lib/ceph/crash/2023-08-20T08:52:04.401938Z_76f49d14-b6f8-4a03-919b-2e75edbc5312 as client.crash.abc007sr failed: (None, b'2023-08-20T17:14:55.225+0700 7f227d282700 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.crash.abc007sr.keyring: (2) No such file or directory\n2023-08-20T17:14:55.225+0700 7f227d282700 -1 AuthRegistry(0x7f227805b7a8>

Aug 20 17:14:55 abc007sr ceph-crash[2620]: WARNING:ceph-crash:post /var/lib/ceph/crash/2023-08-20T08:52:04.401938Z_76f49d14-b6f8-4a03-919b-2e75edbc5312 as client.crash failed: (None, b'2023-08-20T17:14:55.305+0700 7fade9a12700 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.crash.keyring: (2) No such file or directory\n2023-08-20T17:14:55.305+0700 7fade9a12700 -1 AuthRegistry(0x7fade405b748) no keyring found>

Aug 20 17:14:59 abc007sr ceph-mon[2614901]: 2023-08-20T17:14:59.729+0700 7fa35bd10700 -1 mon.abc007sr@1(peon) e11 get_health_metrics reporting 4 slow ops, oldest is mon_command({"format":"json","prefix":"osd pool autoscale-status"} v 0)

Aug 20 17:15:04 abc007sr ceph-mon[2614901]: 2023-08-20T17:15:04.729+0700 7fa35bd10700 -1 mon.abc007sr@1(peon) e11 get_health_metrics reporting 4 slow ops, oldest is mon_command({"format":"json","prefix":"osd pool autoscale-status"} v 0)

Aug 20 17:15:09 abc007sr ceph-mon[2614901]: 2023-08-20T17:15:09.729+0700 7fa35bd10700 -1 mon.abc007sr@1(peon) e11 get_health_metrics reporting 4 slow ops, oldest is mon_command({"format":"json","prefix":"osd pool autoscale-status"} v 0)

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 2023-08-20T17:15:11.645+0700 7f2165f20080 -1 bluefs _allocate allocation failed, needed 0x714a7

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 2023-08-20T17:15:11.645+0700 7f2165f20080 -1 bluefs _flush_range allocated: 0x110000 offset: 0x100e47 length: 0x80660

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: ./src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f2165f20080 time 2023-08-20T17:15:11.646724+0700

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: ./src/os/bluestore/BlueFS.cc: 2810: ceph_abort_msg("bluefs enospc")

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: ceph version 16.2.11 (578f8e68e41b0a98523d0045ef6db90ce6f2e5ab) pacific (stable)

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd3) [0x563fc598b8e3]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x9bd) [0x563fc6083b2d]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 3: (BlueFS::_flush(BlueFS::FileWriter*, bool, bool*)+0x9a) [0x563fc608413a]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 4: (BlueFS::_flush(BlueFS::FileWriter*, bool, std::unique_lock<std::mutex>&)+0x2f) [0x563fc609edcf]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 5: (BlueRocksWritableFile::Append(rocksdb::Slice const&)+0x100) [0x563fc60aea80]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 6: (rocksdb::LegacyWritableFileWrapper::Append(rocksdb::Slice const&, rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x48) [0x563fc657b84e]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 7: (rocksdb::WritableFileWriter::WriteBuffered(char const*, unsigned long)+0x338) [0x563fc6756318]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 8: (rocksdb::WritableFileWriter::Append(rocksdb::Slice const&)+0x5d7) [0x563fc675489b]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 9: (rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const&, rocksdb::CompressionType, rocksdb::BlockHandle*, bool)+0x11d) [0x563fc691e8d7]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 10: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::Slice const&, rocksdb::BlockHandle*, bool)+0x7d0) [0x563fc691e6be]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 11: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::BlockBuilder*, rocksdb::BlockHandle*, bool)+0x48) [0x563fc691deda]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 12: (rocksdb::BlockBasedTableBuilder::Flush()+0x9a) [0x563fc691de8a]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 13: (rocksdb::BlockBasedTableBuilder::Add(rocksdb::Slice const&, rocksdb::Slice const&)+0x197) [0x563fc691d9bf]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 14: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, rocksdb::FileSystem*, rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::vector<std::unique_ptr<rocksdb::FragmentedRangeTombstoneI>

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 15: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0x5ea) [0x563fc6619826]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 16: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool, bool*)+0x1ad1) [0x563fc661849d]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 17: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*)+0x159e) [0x563fc66159d4]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 18: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool)+0x677) [0x563fc661>

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 19: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0x52) [0x563fc661a0a4]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 20: (RocksDBStore::do_open(std::ostream&, bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1096) [0x563fc6529846]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 21: (BlueStore::_open_db(bool, bool, bool)+0xa19) [0x563fc5f91ca9]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 22: (BlueStore::_open_db_and_around(bool, bool)+0x332) [0x563fc5fe5642]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 23: (BlueStore::_mount()+0x191) [0x563fc5fe7fe1]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 24: (OSD::init()+0x58d) [0x563fc5a8397d]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 25: main()

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 26: __libc_start_main()

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 27: _start()

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: *** Caught signal (Aborted) **

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: in thread 7f2165f20080 thread_name:ceph-osd

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 2023-08-20T17:15:11.677+0700 7f2165f20080 -1 ./src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f2165f20080 time 2023-08-20T17:15:11.646724+0700

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: ./src/os/bluestore/BlueFS.cc: 2810: ceph_abort_msg("bluefs enospc")

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: ceph version 16.2.11 (578f8e68e41b0a98523d0045ef6db90ce6f2e5ab) pacific (stable)

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd3) [0x563fc598b8e3]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x9bd) [0x563fc6083b2d]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 3: (BlueFS::_flush(BlueFS::FileWriter*, bool, bool*)+0x9a) [0x563fc608413a]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 4: (BlueFS::_flush(BlueFS::FileWriter*, bool, std::unique_lock<std::mutex>&)+0x2f) [0x563fc609edcf]

Aug 20 17:15:11 abc007sr ceph-osd[4021257]: 5: (BlueRocksWritableFile::Append(rocksdb::Slice const&)+0x100) [0x563fc60aea80]

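From the stack trace it looks like both OSDs abort with `ceph_abort_msg("bluefs enospc")` while RocksDB replays its log during startup, i.e. BlueFS can no longer allocate space on the DB device. Would something along these lines be a sensible way to check the BlueFS device sizes for osd.5 once its volumes are activated again? (I have not run this yet; the `--no-systemd` flag and the path are my assumptions.)

```
# re-create /var/lib/ceph/osd/ceph-5 without starting the OSD service
ceph-volume lvm activate --all --no-systemd

# print the device sizes as understood by BlueFS (run while the OSD is stopped)
ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-5
```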
 
Hello,

Could you please share the relevant sections of the logs of the affected OSDs (`journalctl -u ceph-osd@NUMERICAL_ID.service`)? What do the logs say when you try to start them again (Datacenter->{node}->Ceph->OSD->{osd}->Start in the web UI)?
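For example, for the two OSDs affected here, something along these lines should capture the relevant window (the time range is just an assumption based on the crash timestamps above, adjust as needed):

```
journalctl -u ceph-osd@5.service --since "2023-08-20 07:00" > osd5.log
journalctl -u ceph-osd@7.service --since "2023-08-20 07:00" > osd7.log
```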

Did both NVMe drives fail at about the same time? Are they both the same model?
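To check the model and health of the NVMe drives, something like the following could help (the device names are placeholders, adjust them to the drives that back osd.5 and osd.7):

```
# list all NVMe drives with model, serial and firmware (nvme-cli)
nvme list

# SMART/health data for one drive, e.g. "Percentage Used" and media errors (smartmontools)
smartctl -a /dev/nvme0n1
```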
 
