Ceph OSDs going "out"

exagun

New Member
Jul 26, 2022
Hello,

over the last two days several OSDs have crashed, and in the process two OSDs were also permanently taken "out".
There is no obvious hardware fault.

The network and the SSDs (NVMe) have been checked.
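As a rough sketch of the kind of SMART check meant here (the device name is only an example; smartmontools is in the package list further down):

Code:
# list the devices smartctl can see
smartctl --scan
# full SMART/health report for one NVMe drive (example device name)
smartctl -a /dev/nvme0n1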

Versions and the recent-crash summary:

ceph version 16.2.9
proxmox 7.2-5

Code:
7 daemons have recently crashed
osd.44 crashed on host core05 at 2022-07-24T12:09:34.883425Z
osd.13 crashed on host core02 at 2022-07-25T06:36:33.079214Z
osd.42 crashed on host core05 at 2022-07-25T12:58:31.269559Z
osd.42 crashed on host core05 at 2022-07-25T12:59:10.529714Z
osd.42 crashed on host core05 at 2022-07-25T12:59:50.377356Z
osd.34 crashed on host core04 at 2022-07-25T18:56:03.107681Z
osd.9 crashed on host core01 at 2022-07-26T05:46:28.760537Z
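Each of those entries has a full crash report stored on the cluster; a minimal sketch for pulling one up (the crash ID below is a placeholder):

Code:
# list the recent crash reports with their IDs
ceph crash ls
# show the full report (backtrace, ceph_version, utsname, ...) for one crash
ceph crash info <crash-id>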

Excerpt from the ceph-osd.log of OSD 42:

Code:
b=10.147726059s) [10,42,55,34,28,8]/[10,NONE,55,34,28,8]p10(0) r=-1 lpr=4287 pi=[3523,4287)/1 crt=4285'7391451 lcod 0'0 mlcod 0'0 remapped NOTIFY pruub 20.191219330s@ m=29 mbc={}] enter Started/ReplicaActive
    -3> 2022-07-25T14:59:50.359+0200 7ff167311700  5 osd.42 pg_epoch: 4288 pg[2.124s1( v 4285'7391451 lc 4273'7391067 (4272'7385969,4285'7391451] local-lis/les=3523/3524 n=21853 ec=430/408 lis/c=4284/3523 les/c/f=4285/3524/0 sis=4287 pruub=10.147726059s) [10,42,55,34,28,8]/[10,NONE,55,34,28,8]p10(0) r=-1 lpr=4287 pi=[3523,4287)/1 crt=4285'7391451 lcod 0'0 mlcod 0'0 remapped NOTIFY pruub 20.191219330s@ m=29 mbc={}] enter Started/ReplicaActive/RepNotRecovering
    -2> 2022-07-25T14:59:50.363+0200 7ff163309700 -1 ./src/os/bluestore/BlueStore.h: In function 'BlueStore::BlobRef BlueStore::ExtentMap::get_spanning_blob(int)' thread 7ff163309700 time 2022-07-25T14:59:50.354555+0200
./src/os/bluestore/BlueStore.h: 858: FAILED ceph_assert(p != spanning_blob_map.end())

 ceph version 16.2.9 (a569859f5e07da0c4c39da81d5fb5675cd95da49) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x55958bb6de2e]
 2: /usr/bin/ceph-osd(+0xac0fb9) [0x55958bb6dfb9]
 3: (BlueStore::ExtentMap::decode_some(ceph::buffer::v15_2_0::list&)+0x2e4) [0x55958c16ab34]
 4: (BlueStore::Onode::decode(boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list const&)+0x49f) [0x55958c16bedf]
 5: (BlueStore::Collection::get_onode(ghobject_t const&, bool, bool)+0x352) [0x55958c16c2c2]
 6: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x154d) [0x55958c1b34dd]
 7: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x2e0) [0x55958c1b4430]
 8: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x83) [0x55958bc95fc3]
 9: (OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0x101) [0x55958bc27fc1]
 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x20c) [0x55958bc445dc]
 11: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x55) [0x55958bea93e5]
 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa27) [0x55958bc56367]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x55958c2ff3da]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55958c3019b0]
 15: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7ff18448eea7]
 16: clone()

    -1> 2022-07-25T14:59:50.363+0200 7ff172327700  3 osd.42 4288 handle_osd_map epochs [4288,4288], i have 4288, src has [3556,4288]
     0> 2022-07-25T14:59:50.375+0200 7ff163309700 -1 *** Caught signal (Aborted) **
 in thread 7ff163309700 thread_name:tp_osd_tp

 ceph version 16.2.9 (a569859f5e07da0c4c39da81d5fb5675cd95da49) pacific (stable)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7ff18449a140]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x16e) [0x55958bb6de78]
 5: /usr/bin/ceph-osd(+0xac0fb9) [0x55958bb6dfb9]
 6: (BlueStore::ExtentMap::decode_some(ceph::buffer::v15_2_0::list&)+0x2e4) [0x55958c16ab34]
 7: (BlueStore::Onode::decode(boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list const&)+0x49f) [0x55958c16bedf]
 8: (BlueStore::Collection::get_onode(ghobject_t const&, bool, bool)+0x352) [0x55958c16c2c2]
 9: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x154d) [0x55958c1b34dd]
 10: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x2e0) [0x55958c1b4430]
 11: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x83) [0x55958bc95fc3]
 12: (OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0x101) [0x55958bc27fc1]
 13: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x20c) [0x55958bc445dc]
 14: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x55) [0x55958bea93e5]
 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa27) [0x55958bc56367]
 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x55958c2ff3da]
 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55958c3019b0]
 18: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7ff18448eea7]
 19: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_pwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 fuse
   2/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  140674694403840 / osd_srv_heartbt
  140674702796544 / tp_osd_tp
  140674711189248 / tp_osd_tp
  140674719581952 / tp_osd_tp
  140674727974656 / tp_osd_tp
  140674736367360 / tp_osd_tp
  140674744760064 / tp_osd_tp
  140674753152768 / tp_osd_tp
  140674761545472 / tp_osd_tp
  140674769938176 / tp_osd_tp
  140674778330880 / tp_osd_tp
  140674786723584 / tp_osd_tp
  140674795116288 / tp_osd_tp
  140674803508992 / tp_osd_tp
  140674811901696 / tp_osd_tp
  140674820294400 / tp_osd_tp
  140674828687104 / tp_osd_tp
  140674979755776 / ms_dispatch
  140674988148480 / rocksdb:dump_st
  140675004933888 / cfin
  140675013326592 / bstore_kv_sync
  140675049797376 / ms_dispatch
  140675063682816 / bstore_mempool
  140675105707776 / rocksdb:low0
  140675147671296 / fn_anonymous
  140675172849408 / safe_timer
  140675200616192 / io_context_pool
  140675225933568 / io_context_pool
  140675234326272 / admin_socket
  140675242718976 / msgr-worker-2
  140675251111680 / msgr-worker-1
  140675259504384 / msgr-worker-0
  140675276615808 / ceph-osd
  max_recent     10000
  max_new        10000
  log_file /var/log/ceph/ceph-osd.42.log
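The assert FAILED ceph_assert(p != spanning_blob_map.end()) in BlueStore::ExtentMap::decode_some points at the on-disk onode/extent metadata of this OSD, so a BlueStore consistency check looks like a sensible next step. A sketch only, assuming the default OSD data path and a stopped daemon; the repair step is meant only for the case where fsck actually reports errors:

Code:
systemctl stop ceph-osd@42
# metadata-only consistency check
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-42
# deep check additionally reads all object data and verifies checksums (much slower)
ceph-bluestore-tool fsck --deep 1 --path /var/lib/ceph/osd/ceph-42
# only if fsck reported repairable errors
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-42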

Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.35-3-pve)
pve-manager: 7.2-5 (running version: 7.2-5/12f1e639)
pve-kernel-5.15: 7.2-5
pve-kernel-helper: 7.2-5
pve-kernel-5.15.35-3-pve: 5.15.35-6
pve-kernel-5.15.35-2-pve: 5.15.35-5
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-network-perl: 0.7.1
libpve-storage-perl: 7.2-5
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.3-1
proxmox-backup-file-restore: 2.2.3-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-10
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1


It is also strange that an OSD that is "out" still shows I/O values - see osd.43 below.

Code:
root@core05 [kvm]: /var/log/ceph # ceph osd status
ID  HOST     USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE          
 0  core01  1519G  5634G     61      540k     10      277k  exists,up      
 1  core01  1570G  5583G      5     1247k      0        0   exists,up      
 2  core01  1595G  5558G     33      372k      3      203k  exists,up      
 3  core01  1530G  5623G     13      105k      2     65.6k  exists,up      
 4  core01  1353G  5800G     17      125k      7      204k  exists,up      
 5  core01  1394G  5759G     11     74.5k      3      177k  exists,up      
 6  core01  1362G  5791G      9      143k      3      216k  exists,up      
 7  core01  1543G  5610G     19      219k      6     1638   exists,up      
 8  core01  1434G  5719G     78      322k      9      674k  exists,up      
 9  core01  1551G  5602G     25      245k      6      409k  exists,up      
10  core02  1546G  5607G      5      182k      0     3276   exists,up      
11  core02  1424G  5729G      6     80.9k      0        0   exists,up      
12  core02  1342G  5811G     31     1042k      0     51.2k  exists,up      
13  core02  1585G  5568G      4     50.5k      7      126k  exists,up      
14  core02  1354G  5799G     18      306k      0      102   exists,up      
15  core02  1372G  5781G      7      374k      0        0   exists,up      
16  core02  1553G  5600G     27      378k      0        0   exists,up      
17  core02  1552G  5601G     26      821k      0        0   exists,up      
18  core02  1562G  5591G     11      111k     15        0   exists,up      
19  core02  1562G  5591G     13      480k      0        0   exists,up      
20  core03  1471G  5682G     30      333k      0      204   exists,up      
21  core03  1550G  5603G     20      181k      2      123k  exists,up      
22  core03  1582G  5571G      9      268k      0     1638   exists,up      
23  core03  1420G  5733G     13     96.7k      0        0   exists,up      
24  core03  1442G  5711G     19      151k      0     6553   exists,up      
25  core03  1391G  5762G     32      383k      5     8703   exists,up      
26  core03  1437G  5716G      4     27.3k      4        0   exists,up      
27  core03  1552G  5601G     59      397k      0      819   exists,up      
28  core03  1433G  5720G     36      277k      2     48.7k  exists,up      
29  core03  1573G  5580G     25      155k      0        0   exists,up      
30  core04  1542G  5611G     62      638k      3      196k  exists,up      
31  core04  1485G  5668G     11     67.3k      0     57.5k  exists,up      
32  core04  1451G  5702G     45      400k      0     9419   exists,up      
33  core04  1565G  5588G     29      207k      0     9829   exists,up      
34  core04  1541G  5611G     10     90.8k      0     3276   exists,up      
35  core04  1421G  5732G      8      159k      0        0   exists,up      
36  core04  1402G  5751G     23      180k      0     2457   exists,up      
37  core04  1488G  5665G     22      126k      0        0   exists,up      
38  core04  1552G  5601G     15      273k      4     13.6k  exists,up      
39  core04  1405G  5748G     37      533k      0     11.1k  exists,up      
40  core05  1846G  5307G     17      145k      0        0   exists,up      
41  core05  1872G  5281G     11      191k      7     73.9k  exists,up      
42  core05     0      0       0        0       0        0   autoout,exists 
43  core05     0      0      20      178k      0      844k  autoout,exists 
44  core05  2087G  5066G     43      310k      0     3378   exists,up      
45  core05  1753G  5400G     10      144k      0        0   exists,up      
46  core05  1696G  5457G     37      407k      0        0   exists,up      
47  core05  2073G  5080G     30     1217k      0      819   exists,up      
48  core05  1749G  5404G     14      369k      3      334k  exists,up      
49  core05  1779G  5374G     14      107k      0     1638   exists,up      
50  core06  1460G  5693G     19     1165k      0        0   exists,up      
51  core06  1468G  5685G     17      157k      4     3276   exists,up      
52  core06  1525G  5628G     19      158k      1     4914   exists,up      
53  core06  1411G  5742G     10      106k      7     8293   exists,up      
54  core06  1428G  5725G     10      112k      0        0   exists,up      
55  core06  1549G  5604G     18      266k      5      340k  exists,up      
56  core06  1564G  5589G     15      285k      0     4095   exists,up      
57  core06  1431G  5722G     29      447k      8      259k  exists,up      
58  core06  1453G  5700G     27      664k      0     12.7k  exists,up      
59  core06  1564G  5589G     21      187k      6     55.2k  exists,up      
root@core05 [kvm]: /var/log/ceph #
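On the "autoout" flag of osd.42/osd.43: an OSD that stays down longer than mon_osd_down_out_interval gets marked out automatically. Once the cause of the crashes is understood, it can be taken back in, roughly like this (a sketch only):

Code:
# current up/in state and weight of the affected OSDs
ceph osd dump | grep -E '^osd\.(42|43) '
# mark the OSD back in and restart the daemon once the underlying problem is fixed
ceph osd in osd.43
systemctl restart ceph-osd@43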
 
Hi,
the problem is ongoing, but so far no further OSDs have gone permanently offline...


Code:
10 daemons have recently crashed
osd.44 crashed on host core05 at 2022-07-24T12:09:34.883425Z
osd.13 crashed on host core02 at 2022-07-25T06:36:33.079214Z
osd.42 crashed on host core05 at 2022-07-25T12:58:31.269559Z
osd.42 crashed on host core05 at 2022-07-25T12:59:10.529714Z
osd.42 crashed on host core05 at 2022-07-25T12:59:50.377356Z
osd.34 crashed on host core04 at 2022-07-25T18:56:03.107681Z
osd.9 crashed on host core01 at 2022-07-26T05:46:28.760537Z
osd.14 crashed on host core02 at 2022-07-26T19:54:15.127832Z
osd.29 crashed on host core03 at 2022-07-27T06:00:26.658249Z
osd.22 crashed on host core03 at 2022-07-27T07:52:16.673682Z
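One way to see whether all of these crashes hit the same BlueStore assert is to grep the OSD logs on each host (a sketch, assuming the default log location):

Code:
grep -l 'FAILED ceph_assert(p != spanning_blob_map.end())' /var/log/ceph/ceph-osd.*.log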

Any ideas what is happening here?
 
One more OSD crashed temporarily, but was not taken out permanently:
Code:
osd.24 crashed on host core03 at 2022-07-27T12:36:49.020455Z
 
Still ongoing - does anybody have an idea?

Code:
osd.0 crashed on host core01 at 2022-07-27T17:58:46.453045Z
osd.31 crashed on host core04 at 2022-07-28T17:05:23.813339Z
osd.35 crashed on host core04 at 2022-07-29T08:01:46.805904Z
 
