Hello,
over the last two days several OSDs have crashed, and two of them were also taken "out" permanently.
No hardware fault is apparent.
The network and the SSDs (NVMe) have been checked.
An excerpt from the ceph-osd.log of OSD42:
ceph version 16.2.9
Proxmox 7.2-5
Code:
7 daemons have recently crashed
osd.44 crashed on host core05 at 2022-07-24T12:09:34.883425Z
osd.13 crashed on host core02 at 2022-07-25T06:36:33.079214Z
osd.42 crashed on host core05 at 2022-07-25T12:58:31.269559Z
osd.42 crashed on host core05 at 2022-07-25T12:59:10.529714Z
osd.42 crashed on host core05 at 2022-07-25T12:59:50.377356Z
osd.34 crashed on host core04 at 2022-07-25T18:56:03.107681Z
osd.9 crashed on host core01 at 2022-07-26T05:46:28.760537Z
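To see whether the crashes cluster on one node, the crash summary above can be grouped by host. A quick sketch with the sample lines pasted inline (on the cluster, `ceph crash ls` or `ceph health detail` would be the source):

```shell
# Crash lines copied from the health summary above
crashes='osd.44 crashed on host core05 at 2022-07-24T12:09:34.883425Z
osd.13 crashed on host core02 at 2022-07-25T06:36:33.079214Z
osd.42 crashed on host core05 at 2022-07-25T12:58:31.269559Z
osd.42 crashed on host core05 at 2022-07-25T12:59:10.529714Z
osd.42 crashed on host core05 at 2022-07-25T12:59:50.377356Z
osd.34 crashed on host core04 at 2022-07-25T18:56:03.107681Z
osd.9 crashed on host core01 at 2022-07-26T05:46:28.760537Z'

# Field 5 is the hostname; count crashes per host, most-affected first
per_host=$(echo "$crashes" | awk '{print $5}' | sort | uniq -c | sort -rn)
echo "$per_host"
```

Here core05 leads, but only because osd.42 crash-looped three times in two minutes; four different hosts are affected, which argues against a single-node hardware problem.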
Code:
b=10.147726059s) [10,42,55,34,28,8]/[10,NONE,55,34,28,8]p10(0) r=-1 lpr=4287 pi=[3523,4287)/1 crt=4285'7391451 lcod 0'0 mlcod 0'0 remapped NOTIFY pruub 20.191219330s@ m=29 mbc={}] enter Started/ReplicaActive
-3> 2022-07-25T14:59:50.359+0200 7ff167311700 5 osd.42 pg_epoch: 4288 pg[2.124s1( v 4285'7391451 lc 4273'7391067 (4272'7385969,4285'7391451] local-lis/les=3523/3524 n=21853 ec=430/408 lis/c=4284/3523 les/c/f=4285/3524/0 sis=4287 pruub=10.147726059s) [10,42,55,34,28,8]/[10,NONE,55,34,28,8]p10(0) r=-1 lpr=4287 pi=[3523,4287)/1 crt=4285'7391451 lcod 0'0 mlcod 0'0 remapped NOTIFY pruub 20.191219330s@ m=29 mbc={}] enter Started/ReplicaActive/RepNotRecovering
-2> 2022-07-25T14:59:50.363+0200 7ff163309700 -1 ./src/os/bluestore/BlueStore.h: In function 'BlueStore::BlobRef BlueStore::ExtentMap::get_spanning_blob(int)' thread 7ff163309700 time 2022-07-25T14:59:50.354555+0200
./src/os/bluestore/BlueStore.h: 858: FAILED ceph_assert(p != spanning_blob_map.end())
ceph version 16.2.9 (a569859f5e07da0c4c39da81d5fb5675cd95da49) pacific (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x55958bb6de2e]
2: /usr/bin/ceph-osd(+0xac0fb9) [0x55958bb6dfb9]
3: (BlueStore::ExtentMap::decode_some(ceph::buffer::v15_2_0::list&)+0x2e4) [0x55958c16ab34]
4: (BlueStore::Onode::decode(boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list const&)+0x49f) [0x55958c16bedf]
5: (BlueStore::Collection::get_onode(ghobject_t const&, bool, bool)+0x352) [0x55958c16c2c2]
6: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x154d) [0x55958c1b34dd]
7: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x2e0) [0x55958c1b4430]
8: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x83) [0x55958bc95fc3]
9: (OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0x101) [0x55958bc27fc1]
10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x20c) [0x55958bc445dc]
11: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x55) [0x55958bea93e5]
12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa27) [0x55958bc56367]
13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x55958c2ff3da]
14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55958c3019b0]
15: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7ff18448eea7]
16: clone()
-1> 2022-07-25T14:59:50.363+0200 7ff172327700 3 osd.42 4288 handle_osd_map epochs [4288,4288], i have 4288, src has [3556,4288]
0> 2022-07-25T14:59:50.375+0200 7ff163309700 -1 *** Caught signal (Aborted) **
in thread 7ff163309700 thread_name:tp_osd_tp
ceph version 16.2.9 (a569859f5e07da0c4c39da81d5fb5675cd95da49) pacific (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7ff18449a140]
2: gsignal()
3: abort()
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x16e) [0x55958bb6de78]
5: /usr/bin/ceph-osd(+0xac0fb9) [0x55958bb6dfb9]
6: (BlueStore::ExtentMap::decode_some(ceph::buffer::v15_2_0::list&)+0x2e4) [0x55958c16ab34]
7: (BlueStore::Onode::decode(boost::intrusive_ptr<BlueStore::Collection>, ghobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list const&)+0x49f) [0x55958c16bedf]
8: (BlueStore::Collection::get_onode(ghobject_t const&, bool, bool)+0x352) [0x55958c16c2c2]
9: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x154d) [0x55958c1b34dd]
10: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x2e0) [0x55958c1b4430]
11: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x83) [0x55958bc95fc3]
12: (OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0x101) [0x55958bc27fc1]
13: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x20c) [0x55958bc445dc]
14: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x55) [0x55958bea93e5]
15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa27) [0x55958bc56367]
16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x55958c2ff3da]
17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55958c3019b0]
18: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7ff18448eea7]
19: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_pwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 fuse
2/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
0/ 5 cephfs_mirror
0/ 5 cephsqlite
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
140674694403840 / osd_srv_heartbt
140674702796544 / tp_osd_tp
140674711189248 / tp_osd_tp
140674719581952 / tp_osd_tp
140674727974656 / tp_osd_tp
140674736367360 / tp_osd_tp
140674744760064 / tp_osd_tp
140674753152768 / tp_osd_tp
140674761545472 / tp_osd_tp
140674769938176 / tp_osd_tp
140674778330880 / tp_osd_tp
140674786723584 / tp_osd_tp
140674795116288 / tp_osd_tp
140674803508992 / tp_osd_tp
140674811901696 / tp_osd_tp
140674820294400 / tp_osd_tp
140674828687104 / tp_osd_tp
140674979755776 / ms_dispatch
140674988148480 / rocksdb:dump_st
140675004933888 / cfin
140675013326592 / bstore_kv_sync
140675049797376 / ms_dispatch
140675063682816 / bstore_mempool
140675105707776 / rocksdb:low0
140675147671296 / fn_anonymous
140675172849408 / safe_timer
140675200616192 / io_context_pool
140675225933568 / io_context_pool
140675234326272 / admin_socket
140675242718976 / msgr-worker-2
140675251111680 / msgr-worker-1
140675259504384 / msgr-worker-0
140675276615808 / ceph-osd
max_recent 10000
max_new 10000
log_file /var/log/ceph/ceph-osd.42.log
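Both abort traces above end in the same place: `FAILED ceph_assert(p != spanning_blob_map.end())` in `BlueStore::ExtentMap::get_spanning_blob()`, reached via `ExtentMap::decode_some()` while decoding an onode. That points at corrupt BlueStore metadata on that OSD's device rather than at peering or network logic. With the OSD stopped, a BlueStore fsck can check this; a sketch, assuming the default Proxmox/Ceph OSD data path:

```shell
# Keep the crashed OSD down while checking (it re-asserts on every restart)
systemctl stop ceph-osd@42

# Read-only BlueStore consistency check; the --deep option additionally
# reads and validates object data, not just metadata
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-42

# The recorded crashes can also be inspected cluster-wide
ceph crash ls
ceph crash info <crash-id>
```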
Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.35-3-pve)
pve-manager: 7.2-5 (running version: 7.2-5/12f1e639)
pve-kernel-5.15: 7.2-5
pve-kernel-helper: 7.2-5
pve-kernel-5.15.35-3-pve: 5.15.35-6
pve-kernel-5.15.35-2-pve: 5.15.35-5
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-network-perl: 0.7.1
libpve-storage-perl: 7.2-5
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.3-1
proxmox-backup-file-restore: 2.2.3-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-10
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
What is also strange: an OSD that is "out" still shows I/O values - see OSD.43 below.
Code:
root@core05 [kvm]: /var/log/ceph # ceph osd status
ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 core01 1519G 5634G 61 540k 10 277k exists,up
1 core01 1570G 5583G 5 1247k 0 0 exists,up
2 core01 1595G 5558G 33 372k 3 203k exists,up
3 core01 1530G 5623G 13 105k 2 65.6k exists,up
4 core01 1353G 5800G 17 125k 7 204k exists,up
5 core01 1394G 5759G 11 74.5k 3 177k exists,up
6 core01 1362G 5791G 9 143k 3 216k exists,up
7 core01 1543G 5610G 19 219k 6 1638 exists,up
8 core01 1434G 5719G 78 322k 9 674k exists,up
9 core01 1551G 5602G 25 245k 6 409k exists,up
10 core02 1546G 5607G 5 182k 0 3276 exists,up
11 core02 1424G 5729G 6 80.9k 0 0 exists,up
12 core02 1342G 5811G 31 1042k 0 51.2k exists,up
13 core02 1585G 5568G 4 50.5k 7 126k exists,up
14 core02 1354G 5799G 18 306k 0 102 exists,up
15 core02 1372G 5781G 7 374k 0 0 exists,up
16 core02 1553G 5600G 27 378k 0 0 exists,up
17 core02 1552G 5601G 26 821k 0 0 exists,up
18 core02 1562G 5591G 11 111k 15 0 exists,up
19 core02 1562G 5591G 13 480k 0 0 exists,up
20 core03 1471G 5682G 30 333k 0 204 exists,up
21 core03 1550G 5603G 20 181k 2 123k exists,up
22 core03 1582G 5571G 9 268k 0 1638 exists,up
23 core03 1420G 5733G 13 96.7k 0 0 exists,up
24 core03 1442G 5711G 19 151k 0 6553 exists,up
25 core03 1391G 5762G 32 383k 5 8703 exists,up
26 core03 1437G 5716G 4 27.3k 4 0 exists,up
27 core03 1552G 5601G 59 397k 0 819 exists,up
28 core03 1433G 5720G 36 277k 2 48.7k exists,up
29 core03 1573G 5580G 25 155k 0 0 exists,up
30 core04 1542G 5611G 62 638k 3 196k exists,up
31 core04 1485G 5668G 11 67.3k 0 57.5k exists,up
32 core04 1451G 5702G 45 400k 0 9419 exists,up
33 core04 1565G 5588G 29 207k 0 9829 exists,up
34 core04 1541G 5611G 10 90.8k 0 3276 exists,up
35 core04 1421G 5732G 8 159k 0 0 exists,up
36 core04 1402G 5751G 23 180k 0 2457 exists,up
37 core04 1488G 5665G 22 126k 0 0 exists,up
38 core04 1552G 5601G 15 273k 4 13.6k exists,up
39 core04 1405G 5748G 37 533k 0 11.1k exists,up
40 core05 1846G 5307G 17 145k 0 0 exists,up
41 core05 1872G 5281G 11 191k 7 73.9k exists,up
42 core05 0 0 0 0 0 0 autoout,exists
43 core05 0 0 20 178k 0 844k autoout,exists
44 core05 2087G 5066G 43 310k 0 3378 exists,up
45 core05 1753G 5400G 10 144k 0 0 exists,up
46 core05 1696G 5457G 37 407k 0 0 exists,up
47 core05 2073G 5080G 30 1217k 0 819 exists,up
48 core05 1749G 5404G 14 369k 3 334k exists,up
49 core05 1779G 5374G 14 107k 0 1638 exists,up
50 core06 1460G 5693G 19 1165k 0 0 exists,up
51 core06 1468G 5685G 17 157k 4 3276 exists,up
52 core06 1525G 5628G 19 158k 1 4914 exists,up
53 core06 1411G 5742G 10 106k 7 8293 exists,up
54 core06 1428G 5725G 10 112k 0 0 exists,up
55 core06 1549G 5604G 18 266k 5 340k exists,up
56 core06 1564G 5589G 15 285k 0 4095 exists,up
57 core06 1431G 5722G 29 447k 8 259k exists,up
58 core06 1453G 5700G 27 664k 0 12.7k exists,up
59 core06 1564G 5589G 21 187k 6 55.2k exists,up
root@core05 [kvm]: /var/log/ceph #
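On the I/O question: one possible explanation (an assumption, not verified here) is that the ceph-osd process for osd.43 is still running and still reporting perf counters to the mgr, so `ceph osd status` shows residual I/O values even though the OSD is marked `autoout`. A quick filter for exactly that pattern, with sample rows copied from the output above (field numbers assume the column layout shown: ID HOST USED AVAIL WR_OPS WR_DATA RD_OPS RD_DATA STATE):

```shell
# Sample rows from the `ceph osd status` output above
status='42  core05  0  0  0  0  0  0  autoout,exists
43  core05  0  0  20  178k  0  844k  autoout,exists
44  core05  2087G  5066G  43  310k  0  3378  exists,up'

# OSDs whose state does not contain "up" but which still show read/write ops
echo "$status" | awk '$9 !~ /up/ && ($5 + $7) > 0 {print "osd." $1, "on", $2, "->", $5, "wr ops,", $7, "rd ops"}'
```

With these rows, only osd.43 matches: it is not up, yet still reports write ops, which fits the observation above.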