Ceph OSDs going "out"

exagun

New Member
Jul 26, 2022
Hello,

several OSDs have crashed over the last two days, and in the process two OSDs were also permanently taken "out".
There is no obvious hardware fault.

The network and the SSDs (NVMe) have been checked; a typical health check is sketched below.
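
For reference, a minimal NVMe health check using the smartmontools package that is installed anyway; the device names are assumptions and will differ per host:

Code:
# controller health, error log and media wear counters
smartctl -a /dev/nvme0
# repeat per controller, e.g. /dev/nvme1, /dev/nvme2, ...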

Excerpt from the ceph-osd.log of osd.42:

Ceph version: 16.2.9
Proxmox VE: 7.2-5

Code:
7 daemons have recently crashed
osd.44 crashed on host core05 at 2022-07-24T12:09:34.883425Z
osd.13 crashed on host core02 at 2022-07-25T06:36:33.079214Z
osd.42 crashed on host core05 at 2022-07-25T12:58:31.269559Z
osd.42 crashed on host core05 at 2022-07-25T12:59:10.529714Z
osd.42 crashed on host core05 at 2022-07-25T12:59:50.377356Z
osd.34 crashed on host core04 at 2022-07-25T18:56:03.107681Z
osd.9 crashed on host core01 at 2022-07-26T05:46:28.760537Z

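The list above is Ceph's RECENT_CRASH health warning. Assuming the standard crash module is enabled, the stored metadata of each crash (including the full backtrace) can also be pulled directly:

Code:
ceph crash ls
ceph crash info <crash-id>

The log excerpt itself: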
Code:
b=10.147726059s) [10,42,55,34,28,8]/[10,NONE,55,34,28,8]p10(0) r=-1 lpr=4287 pi=[3523,4287)/1 crt=4285'7391451 lcod 0'0 mlcod 0'0 remapped NOTIFY pruub 20.191219330s@ m=29 mbc={}] enter Started/ReplicaActive
    -3> 2022-07-25T14:59:50.359+0200 7ff167311700  5 osd.42 pg_epoch: 4288 pg[2.124s1( v 4285'7391451 lc 4273'7391067 (4272'7385969,4285'7391451] local-lis/les=3523/3524 n=21853 ec=430/408 lis/c=4284/3523 les/c/f=4285/3524/0 sis=4287 pruub=10.147726059s) [10,42,55,34,28,8]/[10,NONE,55,34,28,8]p10(0) r=-1 lpr=4287 pi=[3523,4287)/1 crt=4285'7391451 lcod 0'0 mlcod 0'0 remapped NOTIFY pruub 20.191219330s@ m=29 mbc={}] enter Started/ReplicaActive/RepNotRecovering
    -2> 2022-07-25T14:59:50.363+0200 7ff163309700 -1 ./src/os/bluestore/BlueStore.h: In function 'BlueStore::BlobRef BlueStore::ExtentMap::get_spanning_blob(int)' thread 7ff163309700 time 2022-07-25T14:59:50.354555+0200
./src/os/bluestore/BlueStore.h: 858: FAILED ceph_assert(p != spanning_blob_map.end())

 ceph version 16.2.9 (a569859f5e07da0c4c39da81d5fb5675cd95da49) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x55958bb6de2e]
 2: /usr/bin/ceph-osd(+0xac0fb9) [0x55958bb6dfb9]
 3: (BlueStore::ExtentMap::decode_some(ceph::buffer::v15_2_0::list&)+0x2e4) [0x55958c16ab34]
 4: (BlueStore::Onode::decode(boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list const&)+0x49f) [0x55958c16bedf]
 5: (BlueStore::Collection::get_onode(ghobject_t const&, bool, bool)+0x352) [0x55958c16c2c2]
 6: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x154d) [0x55958c1b34dd]
 7: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x2e0) [0x55958c1b4430]
 8: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x83) [0x55958bc95fc3]
 9: (OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0x101) [0x55958bc27fc1]
 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x20c) [0x55958bc445dc]
 11: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x55) [0x55958bea93e5]
 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa27) [0x55958bc56367]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x55958c2ff3da]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55958c3019b0]
 15: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7ff18448eea7]
 16: clone()

    -1> 2022-07-25T14:59:50.363+0200 7ff172327700  3 osd.42 4288 handle_osd_map epochs [4288,4288], i have 4288, src has [3556,4288]
     0> 2022-07-25T14:59:50.375+0200 7ff163309700 -1 *** Caught signal (Aborted) **
 in thread 7ff163309700 thread_name:tp_osd_tp

 ceph version 16.2.9 (a569859f5e07da0c4c39da81d5fb5675cd95da49) pacific (stable)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7ff18449a140]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x16e) [0x55958bb6de78]
 5: /usr/bin/ceph-osd(+0xac0fb9) [0x55958bb6dfb9]
 6: (BlueStore::ExtentMap::decode_some(ceph::buffer::v15_2_0::list&)+0x2e4) [0x55958c16ab34]
 7: (BlueStore::Onode::decode(boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list const&)+0x49f) [0x55958c16bedf]
 8: (BlueStore::Collection::get_onode(ghobject_t const&, bool, bool)+0x352) [0x55958c16c2c2]
 9: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x154d) [0x55958c1b34dd]
 10: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x2e0) [0x55958c1b4430]
 11: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x83) [0x55958bc95fc3]
 12: (OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0x101) [0x55958bc27fc1]
 13: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x20c) [0x55958bc445dc]
 14: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x55) [0x55958bea93e5]
 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa27) [0x55958bc56367]
 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x55958c2ff3da]
 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55958c3019b0]
 18: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7ff18448eea7]
 19: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_pwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 fuse
   2/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  140674694403840 / osd_srv_heartbt
  140674702796544 / tp_osd_tp
  140674711189248 / tp_osd_tp
  140674719581952 / tp_osd_tp
  140674727974656 / tp_osd_tp
  140674736367360 / tp_osd_tp
  140674744760064 / tp_osd_tp
  140674753152768 / tp_osd_tp
  140674761545472 / tp_osd_tp
  140674769938176 / tp_osd_tp
  140674778330880 / tp_osd_tp
  140674786723584 / tp_osd_tp
  140674795116288 / tp_osd_tp
  140674803508992 / tp_osd_tp
  140674811901696 / tp_osd_tp
  140674820294400 / tp_osd_tp
  140674828687104 / tp_osd_tp
  140674979755776 / ms_dispatch
  140674988148480 / rocksdb:dump_st
  140675004933888 / cfin
  140675013326592 / bstore_kv_sync
  140675049797376 / ms_dispatch
  140675063682816 / bstore_mempool
  140675105707776 / rocksdb:low0
  140675147671296 / fn_anonymous
  140675172849408 / safe_timer
  140675200616192 / io_context_pool
  140675225933568 / io_context_pool
  140675234326272 / admin_socket
  140675242718976 / msgr-worker-2
  140675251111680 / msgr-worker-1
  140675259504384 / msgr-worker-0
  140675276615808 / ceph-osd
  max_recent     10000
  max_new        10000
  log_file /var/log/ceph/ceph-osd.42.log
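
The failed assert in BlueStore::ExtentMap::get_spanning_blob() points at inconsistent BlueStore onode/extent-map metadata rather than at the network. One way to verify a suspect OSD offline is ceph-bluestore-tool; a sketch, assuming the default OSD data path, run one OSD at a time on an otherwise healthy cluster:

Code:
# the OSD must be stopped for an offline check
systemctl stop ceph-osd@42
# consistency check of the BlueStore metadata; add --deep to also read and checksum object data
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-42
# only if fsck reports repairable errors:
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-42
systemctl start ceph-osd@42

Package versions (pveversion -v):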

Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.35-3-pve)
pve-manager: 7.2-5 (running version: 7.2-5/12f1e639)
pve-kernel-5.15: 7.2-5
pve-kernel-helper: 7.2-5
pve-kernel-5.15.35-3-pve: 5.15.35-6
pve-kernel-5.15.35-2-pve: 5.15.35-5
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-network-perl: 0.7.1
libpve-storage-perl: 7.2-5
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.3-1
proxmox-backup-file-restore: 2.2.3-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-10
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1


It is also strange that an OSD that is "out" still shows I/O values; see osd.43 in the output below.

Code:
root@core05 [kvm]: /var/log/ceph # ceph osd status
ID  HOST     USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE          
 0  core01  1519G  5634G     61      540k     10      277k  exists,up      
 1  core01  1570G  5583G      5     1247k      0        0   exists,up      
 2  core01  1595G  5558G     33      372k      3      203k  exists,up      
 3  core01  1530G  5623G     13      105k      2     65.6k  exists,up      
 4  core01  1353G  5800G     17      125k      7      204k  exists,up      
 5  core01  1394G  5759G     11     74.5k      3      177k  exists,up      
 6  core01  1362G  5791G      9      143k      3      216k  exists,up      
 7  core01  1543G  5610G     19      219k      6     1638   exists,up      
 8  core01  1434G  5719G     78      322k      9      674k  exists,up      
 9  core01  1551G  5602G     25      245k      6      409k  exists,up      
10  core02  1546G  5607G      5      182k      0     3276   exists,up      
11  core02  1424G  5729G      6     80.9k      0        0   exists,up      
12  core02  1342G  5811G     31     1042k      0     51.2k  exists,up      
13  core02  1585G  5568G      4     50.5k      7      126k  exists,up      
14  core02  1354G  5799G     18      306k      0      102   exists,up      
15  core02  1372G  5781G      7      374k      0        0   exists,up      
16  core02  1553G  5600G     27      378k      0        0   exists,up      
17  core02  1552G  5601G     26      821k      0        0   exists,up      
18  core02  1562G  5591G     11      111k     15        0   exists,up      
19  core02  1562G  5591G     13      480k      0        0   exists,up      
20  core03  1471G  5682G     30      333k      0      204   exists,up      
21  core03  1550G  5603G     20      181k      2      123k  exists,up      
22  core03  1582G  5571G      9      268k      0     1638   exists,up      
23  core03  1420G  5733G     13     96.7k      0        0   exists,up      
24  core03  1442G  5711G     19      151k      0     6553   exists,up      
25  core03  1391G  5762G     32      383k      5     8703   exists,up      
26  core03  1437G  5716G      4     27.3k      4        0   exists,up      
27  core03  1552G  5601G     59      397k      0      819   exists,up      
28  core03  1433G  5720G     36      277k      2     48.7k  exists,up      
29  core03  1573G  5580G     25      155k      0        0   exists,up      
30  core04  1542G  5611G     62      638k      3      196k  exists,up      
31  core04  1485G  5668G     11     67.3k      0     57.5k  exists,up      
32  core04  1451G  5702G     45      400k      0     9419   exists,up      
33  core04  1565G  5588G     29      207k      0     9829   exists,up      
34  core04  1541G  5611G     10     90.8k      0     3276   exists,up      
35  core04  1421G  5732G      8      159k      0        0   exists,up      
36  core04  1402G  5751G     23      180k      0     2457   exists,up      
37  core04  1488G  5665G     22      126k      0        0   exists,up      
38  core04  1552G  5601G     15      273k      4     13.6k  exists,up      
39  core04  1405G  5748G     37      533k      0     11.1k  exists,up      
40  core05  1846G  5307G     17      145k      0        0   exists,up      
41  core05  1872G  5281G     11      191k      7     73.9k  exists,up      
42  core05     0      0       0        0       0        0   autoout,exists 
43  core05     0      0      20      178k      0      844k  autoout,exists 
44  core05  2087G  5066G     43      310k      0     3378   exists,up      
45  core05  1753G  5400G     10      144k      0        0   exists,up      
46  core05  1696G  5457G     37      407k      0        0   exists,up      
47  core05  2073G  5080G     30     1217k      0      819   exists,up      
48  core05  1749G  5404G     14      369k      3      334k  exists,up      
49  core05  1779G  5374G     14      107k      0     1638   exists,up      
50  core06  1460G  5693G     19     1165k      0        0   exists,up      
51  core06  1468G  5685G     17      157k      4     3276   exists,up      
52  core06  1525G  5628G     19      158k      1     4914   exists,up      
53  core06  1411G  5742G     10      106k      7     8293   exists,up      
54  core06  1428G  5725G     10      112k      0        0   exists,up      
55  core06  1549G  5604G     18      266k      5      340k  exists,up      
56  core06  1564G  5589G     15      285k      0     4095   exists,up      
57  core06  1431G  5722G     29      447k      8      259k  exists,up      
58  core06  1453G  5700G     27      664k      0     12.7k  exists,up      
59  core06  1564G  5589G     21      187k      6     55.2k  exists,up      
root@core05 [kvm]: /var/log/ceph #
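
An OSD that is "out" is only removed from data placement; its daemon may still be running and can then still report I/O counters, or the numbers may simply be stale values from before the crash. Whether the process behind osd.43 is actually alive can be checked directly (assuming systemd-managed OSDs, the Proxmox default):

Code:
systemctl status ceph-osd@43
# up/down state and weights of all OSDs:
ceph osd tree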
 
Hi,
the problem is ongoing, but so far no further OSDs have gone permanently offline...


Code:
10 daemons have recently crashed
osd.44 crashed on host core05 at 2022-07-24T12:09:34.883425Z
osd.13 crashed on host core02 at 2022-07-25T06:36:33.079214Z
osd.42 crashed on host core05 at 2022-07-25T12:58:31.269559Z
osd.42 crashed on host core05 at 2022-07-25T12:59:10.529714Z
osd.42 crashed on host core05 at 2022-07-25T12:59:50.377356Z
osd.34 crashed on host core04 at 2022-07-25T18:56:03.107681Z
osd.9 crashed on host core01 at 2022-07-26T05:46:28.760537Z
osd.14 crashed on host core02 at 2022-07-26T19:54:15.127832Z
osd.29 crashed on host core03 at 2022-07-27T06:00:26.658249Z
osd.22 crashed on host core03 at 2022-07-27T07:52:16.673682Z

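To check whether all of these crashes hit the same BlueStore assert, the stored crash metadata can be dumped in a loop; a sketch, assuming the standard crash module and its default table output:

Code:
# print the assert condition of every recorded crash (skipping the header row)
for id in $(ceph crash ls | awk 'NR>1 {print $1}'); do
    echo "== $id"; ceph crash info "$id" | grep -m1 assert_condition
done
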
Any ideas what is happening here?
 
One more OSD crashed temporarily, but this one was not taken out permanently:
Code:
osd.24 crashed on host core03 at 2022-07-27T12:36:49.020455Z
 
Still ongoing. Does anybody have an idea?

Code:
osd.0 crashed on host core01 at 2022-07-27T17:58:46.453045Z
osd.31 crashed on host core04 at 2022-07-28T17:05:23.813339Z
osd.35 crashed on host core04 at 2022-07-29T08:01:46.805904Z
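
As a stop-gap while the root cause is unclear, the automatic "out"-marking of down OSDs can be suppressed cluster-wide with the standard noout flag; it should be unset again afterwards, since it also prevents rebalancing away from genuinely dead disks:

Code:
ceph osd set noout
# ... and once the cluster is stable again:
ceph osd unset noout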