Ceph OSD keeps failing!

Deleted member 33567
Hi,

Since the recent updates, Ceph has started to act very weird:

We keep losing one OSD, with the following in syslog:

Bash:
2020-10-17 04:28:21.922478 mon.n02-sxb-pve01 (mon.0) 912 : cluster [INF] osd.6 [v2:172.17.1.2:6814/308596,v1:172.17.1.2:6817/308596] boot
2020-10-17 04:28:23.919914 mon.n02-sxb-pve01 (mon.0) 918 : cluster [WRN] Health check update: Degraded data redundancy: 15240/781113 objects degraded (1.951%), 61 pgs degraded (PG_DEGRADED)
2020-10-17 04:28:22.935866 osd.8 (osd.8) 2005 : cluster [INF] 3.18 continuing backfill to osd.6 from (1412'1563816,2102'1566816] MIN to 2102'1566816
2020-10-17 04:28:28.917032 mon.n02-sxb-pve01 (mon.0) 920 : cluster [INF] osd.6 failed (root=default,host=n02-sxb-pve01) (connection refused reported by osd.0)
2020-10-17 04:28:28.934114 mon.n02-sxb-pve01 (mon.0) 928 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2020-10-17 04:28:29.937247 mon.n02-sxb-pve01 (mon.0) 930 : cluster [WRN] Health check update: Degraded data redundancy: 15233/781113 objects degraded (1.950%), 55 pgs degraded (PG_DEGRADED)
2020-10-17 04:28:34.726454 mon.n02-sxb-pve01 (mon.0) 934 : cluster [INF] osd.8 failed (root=default,host=n03-sxb-pve01) (connection refused reported by osd.3)
2020-10-17 04:28:34.942949 mon.n02-sxb-pve01 (mon.0) 954 : cluster [WRN] Health check update: 2 osds down (OSD_DOWN)
2020-10-17 04:28:37.293519 mon.n02-sxb-pve01 (mon.0) 958 : cluster [WRN] Health check update: Degraded data redundancy: 41466/781113 objects degraded (5.309%), 142 pgs degraded (PG_DEGRADED)
2020-10-17 04:28:39.953418 mon.n02-sxb-pve01 (mon.0) 959 : cluster [WRN] Health check failed: Reduced data availability: 6 pgs inactive (PG_AVAILABILITY)
2020-10-17 04:28:42.007881 mon.n02-sxb-pve01 (mon.0) 962 : cluster [WRN] Health check update: 1 osds down (OSD_DOWN)
2020-10-17 04:28:42.015487 mon.n02-sxb-pve01 (mon.0) 963 : cluster [INF] osd.6 [v2:172.17.1.2:6814/308850,v1:172.17.1.2:6817/308850] boot
2020-10-17 04:28:42.293935 mon.n02-sxb-pve01 (mon.0) 966 : cluster [WRN] Health check update: Degraded data redundancy: 82616/781113 objects degraded (10.577%), 214 pgs degraded (PG_DEGRADED)
2020-10-17 04:28:47.302802 mon.n02-sxb-pve01 (mon.0) 971 : cluster [WRN] Health check update: Degraded data redundancy: 60544/781113 objects degraded (7.751%), 177 pgs degraded (PG_DEGRADED)
2020-10-17 04:28:47.925637 mon.n02-sxb-pve01 (mon.0) 974 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 6 pgs inactive)
2020-10-17 04:28:48.929895 mon.n02-sxb-pve01 (mon.0) 975 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2020-10-17 04:28:48.931768 mon.n02-sxb-pve01 (mon.0) 976 : cluster [INF] osd.8 [v2:172.17.1.3:6818/1402458,v1:172.17.1.3:6819/1402458] boot
2020-10-17 04:28:51.334835 mon.n02-sxb-pve01 (mon.0) 981 : cluster [INF] osd.6 failed (root=default,host=n02-sxb-pve01) (connection refused reported by osd.7)
2020-10-17 04:28:51.934397 mon.n02-sxb-pve01 (mon.0) 1009 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)



There seems to be an issue with communication between the Ceph OSDs.
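For reference, a minimal set of commands for checking which OSDs are affected and whether they have recorded crashes (standard Ceph status commands, run on any node with a monitor):

Bash:
# Overall health plus the detail of each failing check
ceph health detail

# Which OSDs are up/down and on which host
ceph osd tree

# Daemon crashes recorded by the crash module (available since Nautilus)
ceph crash ls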

Whenever I try to start osd.6, it starts once, but when it fails again I get:

command '/bin/systemctl start ceph-osd@6' failed: exit code 1

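To see why a start attempt fails, the full unit log can be pulled on the node itself (a minimal sketch, assuming osd.6 runs on n02-sxb-pve01 as in the logs above):

Bash:
# Current state of the OSD unit and the result of the last start attempt
systemctl status ceph-osd@6.service

# Daemon log for the last hour; the assert/abort from the crash shows up here
journalctl -u ceph-osd@6.service --since "1 hour ago"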

ceph -s:
[screenshot of ceph -s output attached]

[additional screenshot attached]

ceph version:
[screenshot of ceph version output attached]
 
Please note that this setup worked properly for months with no issues; the problems only started with the latest upgrades.

Our ceph.conf:

Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 172.17.1.2/24
     fsid = 4a5d5cdc-fc64-4ac2-8e14-6cae6ade627a
     mon_allow_pool_delete = true
     mon_host = 172.17.1.2 172.17.1.3 172.17.1.1
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 172.17.1.2/24
     mon_cluster_log_file_level = info

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.n01-sxb-pve01]
     host = n01-sxb-pve01
     mds_standby_for_name = pve

[mds.n02-sxb-pve01]
     host = n02-sxb-pve01
     mds_standby_for_name = pve

[mds.n03-sxb-pve01]
     host = n03-sxb-pve01
     mds_standby_for_name = pve

[mon.n01-sxb-pve01]
     public_addr = 172.17.1.1
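As a side note, I understand the public/cluster network options are usually written as the subnet itself rather than a host address; for this cluster that would look like the sketch below (the cluster currently runs with the values shown above):

Code:
cluster_network = 172.17.1.0/24
public_network = 172.17.1.0/24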


The Proxmox HA cluster nodes have the following IPs:

[screenshot of the cluster node IPs attached]

We use the same local network for Ceph, since this configuration has been working for us since Proxmox 4.
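To rule out a plain connectivity problem on that network, a basic check between the nodes looks like this (a sketch using the 172.17.1.1-3 addresses from mon_host above, run from each node toward the other two):

Bash:
# Reachability of the other Ceph nodes on the shared public/cluster network
ping -c 3 172.17.1.1
ping -c 3 172.17.1.2
ping -c 3 172.17.1.3

# Check that the local OSD daemons are listening on their 68xx ports
ss -tlnp | grep ceph-osd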


Syslog:

Code:
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  -3443> 2020-10-17 06:38:26.176 7fb59f01ac80 -1 osd.6 2298 log_to_monitors {default=true}
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  -3322> 2020-10-17 06:38:26.180 7fb598339700 -1 osd.6 2298 set_numa_affinity unable to identify public interface 'vmbr1' numa node: (2) No such file or directory
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:     -1> 2020-10-17 06:38:32.460 7fb582b0e700 -1 /build/ceph-JY24tx/ceph-14.2.11/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, uint64_t, size_t, ceph::bufferlist*, char*)' thread 7fb582b0e700 time 2020-10-17 06:38:32.456695
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]: /build/ceph-JY24tx/ceph-14.2.11/src/os/bluestore/BlueFS.cc: 1662: FAILED ceph_assert(r == 0)
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  ceph version 14.2.11 (21626754f4563baadc6ba5d50b9cbc48a5730a94) nautilus (stable)
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x564f3f4623c8]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  2: (()+0x5115a0) [0x564f3f4625a0]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  3: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::v14_2_0::list*, char*)+0xf10) [0x564f3fa87e80]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  4: (BlueRocksRandomAccessFile::Prefetch(unsigned long, unsigned long)+0x2a) [0x564f3fab320a]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  5: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::InitDataBlock()+0x29f) [0x564f40080e3f]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  6: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::FindKeyForward()+0x1c8) [0x564f40081078]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  7: (()+0x10b2a8a) [0x564f40003a8a]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  8: (rocksdb::MergingIterator::Next()+0xb1) [0x564f40096691]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  9: (rocksdb::DBIter::Next()+0x266) [0x564f3ff93696]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  10: (RocksDBStore::RocksDBWholeSpaceIteratorImpl::next()+0x2d) [0x564f3fef69bd]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  11: (BlueStore::_collection_list(BlueStore::Collection*, ghobject_t const&, ghobject_t const&, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0xd4a) [0x564f3f9a389a]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  12: (BlueStore::collection_list(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, ghobject_t const&, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x29b) [0x564f3f9a4f3b]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  13: (PGBackend::objects_list_partial(hobject_t const&, int, int, std::vector<hobject_t, std::allocator<hobject_t> >*, hobject_t*)+0x309) [0x564f3f765159]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  14: (PrimaryLogPG::scan_range(int, int, PG::BackfillInterval*, ThreadPool::TPHandle&)+0xfa) [0x564f3f6d02ca]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  15: (PrimaryLogPG::do_scan(boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x7bf) [0x564f3f6d148f]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  16: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x7ee) [0x564f3f719a1e]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  17: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x392) [0x564f3f545f02]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  18: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x564f3f7e9e92]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  19: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7d7) [0x564f3f561ba7]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  20: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) [0x564f3fb2e0c4]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  21: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x564f3fb30ad0]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  22: (()+0x7fa3) [0x7fb59fa5efa3]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  23: (clone()+0x3f) [0x7fb59f60e4cf]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:      0> 2020-10-17 06:38:32.464 7fb582b0e700 -1 *** Caught signal (Aborted) **
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  in thread 7fb582b0e700 thread_name:tp_osd_tp
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  ceph version 14.2.11 (21626754f4563baadc6ba5d50b9cbc48a5730a94) nautilus (stable)
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  1: (()+0x12730) [0x7fb59fa69730]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  2: (gsignal()+0x10b) [0x7fb59f54c7bb]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  3: (abort()+0x121) [0x7fb59f537535]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x564f3f462419]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  5: (()+0x5115a0) [0x564f3f4625a0]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  6: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::v14_2_0::list*, char*)+0xf10) [0x564f3fa87e80]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  7: (BlueRocksRandomAccessFile::Prefetch(unsigned long, unsigned long)+0x2a) [0x564f3fab320a]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  8: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::InitDataBlock()+0x29f) [0x564f40080e3f]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  9: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::FindKeyForward()+0x1c8) [0x564f40081078]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  10: (()+0x10b2a8a) [0x564f40003a8a]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  11: (rocksdb::MergingIterator::Next()+0xb1) [0x564f40096691]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  12: (rocksdb::DBIter::Next()+0x266) [0x564f3ff93696]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  13: (RocksDBStore::RocksDBWholeSpaceIteratorImpl::next()+0x2d) [0x564f3fef69bd]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  14: (BlueStore::_collection_list(BlueStore::Collection*, ghobject_t const&, ghobject_t const&, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0xd4a) [0x564f3f9a389a]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  15: (BlueStore::collection_list(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, ghobject_t const&, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x29b) [0x564f3f9a4f3b]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  16: (PGBackend::objects_list_partial(hobject_t const&, int, int, std::vector<hobject_t, std::allocator<hobject_t> >*, hobject_t*)+0x309) [0x564f3f765159]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  17: (PrimaryLogPG::scan_range(int, int, PG::BackfillInterval*, ThreadPool::TPHandle&)+0xfa) [0x564f3f6d02ca]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  18: (PrimaryLogPG::do_scan(boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x7bf) [0x564f3f6d148f]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  19: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x7ee) [0x564f3f719a1e]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  20: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x392) [0x564f3f545f02]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  21: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x564f3f7e9e92]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  22: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7d7) [0x564f3f561ba7]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  23: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) [0x564f3fb2e0c4]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  24: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x564f3fb30ad0]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  25: (()+0x7fa3) [0x7fb59fa5efa3]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  26: (clone()+0x3f) [0x7fb59f60e4cf]
Oct 17 06:38:32 n02-sxb-pve01 ceph-osd[368771]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Oct 17 06:38:32 n02-sxb-pve01 systemd[1]: ceph-osd@6.service: Main process exited, code=killed, status=6/ABRT
Oct 17 06:38:32 n02-sxb-pve01 systemd[1]: ceph-osd@6.service: Failed with result 'signal'.
Oct 17 06:38:32 n02-sxb-pve01 systemd[1]: ceph-osd@6.service: Service RestartSec=100ms expired, scheduling restart.
Oct 17 06:38:32 n02-sxb-pve01 systemd[1]: ceph-osd@6.service: Scheduled restart job, restart counter is at 3.
Oct 17 06:38:32 n02-sxb-pve01 systemd[1]: Stopped Ceph object storage daemon osd.6.
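Since the assert is in BlueFS::_read during a backfill scan, I understand this can also point at the OSD's own disk or RocksDB rather than the network. A minimal sketch for checking that on the node (assuming the default data path /var/lib/ceph/osd/ceph-6, with the OSD stopped; /dev/sdX stands in for whatever device actually backs osd.6):

Bash:
# Stop the crashing OSD before touching its store
systemctl stop ceph-osd@6.service

# Consistency check of the BlueStore/BlueFS data for this OSD
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-6

# SMART health of the underlying device (replace /dev/sdX with the real disk)
smartctl -a /dev/sdX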
 