Ceph OSD Down & Out - can't bring back up - *** Caught signal (Segmentation fault) **

breakaway9000

Hi,

I noticed that in my 3-node, 12-OSD cluster (3 OSDs per node), one node has all 3 of its OSDs marked "down" and "out". I tried to bring them back "in" and "up", but this is what the log shows:

My setup has the WAL and block.db on SSD, while the OSD data is on a SATA HDD. Each server has 2 SSDs, each with 3 partitions; the partitions are used for WAL and block.db, and each OSD's data is on its own SATA disk.
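For anyone wanting to double-check that layout, something along these lines should show where the symlinks for osd.8 (one of the affected OSDs, as seen in the log below) point; the paths and devices here are just examples:

Code:
# show where the OSD's block, block.db and block.wal symlinks point
ls -l /var/lib/ceph/osd/ceph-8/block*

# cross-check the BlueStore labels on the devices
ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-8/block
ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-8/block.db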

Any idea what this could be?

Code:
2019-03-11 15:43:43.831453 7f18b8892e00  0 set uid:gid to 64045:64045 (ceph:ceph)
2019-03-11 15:43:43.831468 7f18b8892e00  0 ceph version 12.2.10 (fc2b1783e3727b66315cc667af9d663d30fe7ed4) luminous (stable), process ceph-osd, pid 2988913
2019-03-11 15:43:43.836761 7f18b8892e00  0 pidfile_write: ignore empty --pid-file
2019-03-11 15:43:43.844687 7f18b8892e00  0 load: jerasure load: lrc load: isa
2019-03-11 15:43:43.844789 7f18b8892e00  1 bdev create path /var/lib/ceph/osd/ceph-8/block type kernel
2019-03-11 15:43:43.844798 7f18b8892e00  1 bdev(0x563466b4cb40 /var/lib/ceph/osd/ceph-8/block) open path /var/lib/ceph/osd/ceph-8/block
2019-03-11 15:43:43.845001 7f18b8892e00  1 bdev(0x563466b4cb40 /var/lib/ceph/osd/ceph-8/block) open size 6001170317312 (0x57541a00000, 5.46TiB) block_size 4096 (4KiB) rotational
2019-03-11 15:43:43.845283 7f18b8892e00  1 bluestore(/var/lib/ceph/osd/ceph-8) _set_cache_sizes cache_size 1073741824 meta 0.4 kv 0.4 data 0.2
2019-03-11 15:43:43.845299 7f18b8892e00  1 bdev(0x563466b4cb40 /var/lib/ceph/osd/ceph-8/block) close
2019-03-11 15:43:44.169681 7f18b8892e00  1 bluestore(/var/lib/ceph/osd/ceph-8) _mount path /var/lib/ceph/osd/ceph-8
2019-03-11 15:43:44.170038 7f18b8892e00  1 bdev create path /var/lib/ceph/osd/ceph-8/block type kernel
2019-03-11 15:43:44.170043 7f18b8892e00  1 bdev(0x563466b4cd80 /var/lib/ceph/osd/ceph-8/block) open path /var/lib/ceph/osd/ceph-8/block
2019-03-11 15:43:44.170205 7f18b8892e00  1 bdev(0x563466b4cd80 /var/lib/ceph/osd/ceph-8/block) open size 6001170317312 (0x57541a00000, 5.46TiB) block_size 4096 (4KiB) rotational
2019-03-11 15:43:44.170470 7f18b8892e00  1 bluestore(/var/lib/ceph/osd/ceph-8) _set_cache_sizes cache_size 1073741824 meta 0.4 kv 0.4 data 0.2
2019-03-11 15:43:44.170522 7f18b8892e00  1 bdev create path /var/lib/ceph/osd/ceph-8/block.db type kernel
2019-03-11 15:43:44.170526 7f18b8892e00  1 bdev(0x563466b4d200 /var/lib/ceph/osd/ceph-8/block.db) open path /var/lib/ceph/osd/ceph-8/block.db
2019-03-11 15:43:44.170647 7f18b8892e00  1 bdev(0x563466b4d200 /var/lib/ceph/osd/ceph-8/block.db) open size 5997854720 (0x165800000, 5.59GiB) block_size 4096 (4KiB) non-rotational
2019-03-11 15:43:44.170655 7f18b8892e00  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-8/block.db size 5.59GiB
2019-03-11 15:43:44.172927 7f18b8892e00  1 bdev create path /var/lib/ceph/osd/ceph-8/block type kernel
2019-03-11 15:43:44.172937 7f18b8892e00  1 bdev(0x563466b4d440 /var/lib/ceph/osd/ceph-8/block) open path /var/lib/ceph/osd/ceph-8/block
2019-03-11 15:43:44.173124 7f18b8892e00  1 bdev(0x563466b4d440 /var/lib/ceph/osd/ceph-8/block) open size 6001170317312 (0x57541a00000, 5.46TiB) block_size 4096 (4KiB) rotational
2019-03-11 15:43:44.173136 7f18b8892e00  1 bluefs add_block_device bdev 2 path /var/lib/ceph/osd/ceph-8/block size 5.46TiB
2019-03-11 15:43:44.173171 7f18b8892e00  1 bluefs mount
2019-03-11 15:43:44.178468 7f18b8892e00 -1 *** Caught signal (Segmentation fault) **
 in thread 7f18b8892e00 thread_name:ceph-osd

 ceph version 12.2.10 (fc2b1783e3727b66315cc667af9d663d30fe7ed4) luminous (stable)
 1: (()+0xa56bd4) [0x56345cdc8bd4]
 2: (()+0x110c0) [0x7f18b5e980c0]
 3: (BlueFS::_replay(bool)+0x1616) [0x56345cd7fb96]
 4: (BlueFS::mount()+0x1e1) [0x56345cd82aa1]
 5: (BlueStore::_open_db(bool)+0x1698) [0x56345cc8c6b8]
 6: (BlueStore::_mount(bool)+0x2b4) [0x56345ccc5cf4]
 7: (OSD::init()+0x3e2) [0x56345c813fe2]
 8: (main()+0x3092) [0x56345c71d3c2]
 9: (__libc_start_main()+0xf1) [0x7f18b4e4d2e1]
 10: (_start()+0x2a) [0x56345c7a9f9a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -75> 2019-03-11 15:43:43.826473 7f18b8892e00  5 asok(0x563466baf4a0) register_command perfcounters_dump hook 0x563466b4a1c0
   -74> 2019-03-11 15:43:43.826489 7f18b8892e00  5 asok(0x563466baf4a0) register_command 1 hook 0x563466b4a1c0
   -73> 2019-03-11 15:43:43.826491 7f18b8892e00  5 asok(0x563466baf4a0) register_command perf dump hook 0x563466b4a1c0
   -72> 2019-03-11 15:43:43.826494 7f18b8892e00  5 asok(0x563466baf4a0) register_command perfcounters_schema hook 0x563466b4a1c0
   -71> 2019-03-11 15:43:43.826496 7f18b8892e00  5 asok(0x563466baf4a0) register_command perf histogram dump hook 0x563466b4a1c0
   -70> 2019-03-11 15:43:43.826498 7f18b8892e00  5 asok(0x563466baf4a0) register_command 2 hook 0x563466b4a1c0
   -69> 2019-03-11 15:43:43.826499 7f18b8892e00  5 asok(0x563466baf4a0) register_command perf schema hook 0x563466b4a1c0
   -68> 2019-03-11 15:43:43.826501 7f18b8892e00  5 asok(0x563466baf4a0) register_command perf histogram schema hook 0x563466b4a1c0
   -67> 2019-03-11 15:43:43.826503 7f18b8892e00  5 asok(0x563466baf4a0) register_command perf reset hook 0x563466b4a1c0
   -66> 2019-03-11 15:43:43.826511 7f18b8892e00  5 asok(0x563466baf4a0) register_command config show hook 0x563466b4a1c0
   -65> 2019-03-11 15:43:43.826513 7f18b8892e00  5 asok(0x563466baf4a0) register_command config help hook 0x563466b4a1c0
   -64> 2019-03-11 15:43:43.826516 7f18b8892e00  5 asok(0x563466baf4a0) register_command config set hook 0x563466b4a1c0
   -63> 2019-03-11 15:43:43.826518 7f18b8892e00  5 asok(0x563466baf4a0) register_command config get hook 0x563466b4a1c0
   -62> 2019-03-11 15:43:43.826519 7f18b8892e00  5 asok(0x563466baf4a0) register_command config diff hook 0x563466b4a1c0
   -61> 2019-03-11 15:43:43.826522 7f18b8892e00  5 asok(0x563466baf4a0) register_command config diff get hook 0x563466b4a1c0
   -60> 2019-03-11 15:43:43.826524 7f18b8892e00  5 asok(0x563466baf4a0) register_command log flush hook 0x563466b4a1c0
   -59> 2019-03-11 15:43:43.826526 7f18b8892e00  5 asok(0x563466baf4a0) register_command log dump hook 0x563466b4a1c0
   -58> 2019-03-11 15:43:43.826528 7f18b8892e00  5 asok(0x563466baf4a0) register_command log reopen hook 0x563466b4a1c0
   -57> 2019-03-11 15:43:43.826538 7f18b8892e00  5 asok(0x563466baf4a0) register_command dump_mempools hook 0x563466e5ada8
   -56> 2019-03-11 15:43:43.831453 7f18b8892e00  0 set uid:gid to 64045:64045 (ceph:ceph)
   -55> 2019-03-11 15:43:43.831468 7f18b8892e00  0 ceph version 12.2.10 (fc2b1783e3727b66315cc667af9d663d30fe7ed4) luminous (stable), process ceph-osd, pid 2988913
   -54> 2019-03-11 15:43:43.831501 7f18b8892e00  5 object store type is bluestore
   -53> 2019-03-11 15:43:43.836104 7f18b2aee700  2 Event(0x563466b4c500 nevent=5000 time_id=1).set_owner idx=0 owner=139744053749504
   -52> 2019-03-11 15:43:43.836145 7f18b22ed700  2 Event(0x563466b4c740 nevent=5000 time_id=1).set_owner idx=1 owner=139744045356800
   -51> 2019-03-11 15:43:43.836152 7f18b1aec700  2 Event(0x563466b4c980 nevent=5000 time_id=1).set_owner idx=2 owner=139744036964096
   -50> 2019-03-11 15:43:43.836545 7f18b8892e00  1 -- 172.17.1.54:0/0 learned_addr learned my addr 172.17.1.54:0/0
   -49> 2019-03-11 15:43:43.836554 7f18b8892e00  1 -- 172.17.1.54:6802/2988913 _finish_bind bind my_inst.addr is 172.17.1.54:6802/2988913
   -48> 2019-03-11 15:43:43.836608 7f18b8892e00  1 -- 10.10.10.5:0/0 learned_addr learned my addr 10.10.10.5:0/0
   -47> 2019-03-11 15:43:43.836615 7f18b8892e00  1 -- 10.10.10.5:6802/2988913 _finish_bind bind my_inst.addr is 10.10.10.5:6802/2988913
   -46> 2019-03-11 15:43:43.836682 7f18b8892e00  1 -- 10.10.10.5:0/0 learned_addr learned my addr 10.10.10.5:0/0
   -45> 2019-03-11 15:43:43.836687 7f18b8892e00  1 -- 10.10.10.5:6803/2988913 _finish_bind bind my_inst.addr is 10.10.10.5:6803/2988913
   -44> 2019-03-11 15:43:43.836754 7f18b8892e00  1 -- 172.17.1.54:0/0 learned_addr learned my addr 172.17.1.54:0/0
   -43> 2019-03-11 15:43:43.836759 7f18b8892e00  1 -- 172.17.1.54:6803/2988913 _finish_bind bind my_inst.addr is 172.17.1.54:6803/2988913
   -42> 2019-03-11 15:43:43.836761 7f18b8892e00  0 pidfile_write: ignore empty --pid-file
   -41> 2019-03-11 15:43:43.838350 7f18b8892e00  5 asok(0x563466baf4a0) init /var/run/ceph/ceph-osd.8.asok
   -40> 2019-03-11 15:43:43.838362 7f18b8892e00  5 asok(0x563466baf4a0) bind_and_listen /var/run/ceph/ceph-osd.8.asok
   -39> 2019-03-11 15:43:43.838411 7f18b8892e00  5 asok(0x563466baf4a0) register_command 0 hook 0x563466b481a8
   -38> 2019-03-11 15:43:43.838419 7f18b8892e00  5 asok(0x563466baf4a0) register_command version hook 0x563466b481a8
   -37> 2019-03-11 15:43:43.838424 7f18b8892e00  5 asok(0x563466baf4a0) register_command git_version hook 0x563466b481a8
   -36> 2019-03-11 15:43:43.838429 7f18b8892e00  5 asok(0x563466baf4a0) register_command help hook 0x563466b4a620
   -35> 2019-03-11 15:43:43.838431 7f18b8892e00  5 asok(0x563466baf4a0) register_command get_command_descriptions hook 0x563466b4a630
   -34> 2019-03-11 15:43:43.838488 7f18b031b700  5 asok(0x563466baf4a0) entry start
   -33> 2019-03-11 15:43:43.838497 7f18b8892e00 10 monclient: build_initial_monmap
   -32> 2019-03-11 15:43:43.844687 7f18b8892e00  0 load: jerasure load: lrc load: isa
   -31> 2019-03-11 15:43:43.844745 7f18b8892e00  5 adding auth protocol: none
   -30> 2019-03-11 15:43:43.844750 7f18b8892e00  5 adding auth protocol: none
   -29> 2019-03-11 15:43:43.844789 7f18b8892e00  1 bdev create path /var/lib/ceph/osd/ceph-8/block type kernel
   -28> 2019-03-11 15:43:43.844798 7f18b8892e00  1 bdev(0x563466b4cb40 /var/lib/ceph/osd/ceph-8/block) open path /var/lib/ceph/osd/ceph-8/block
   -27> 2019-03-11 15:43:43.845001 7f18b8892e00  1 bdev(0x563466b4cb40 /var/lib/ceph/osd/ceph-8/block) open size 6001170317312 (0x57541a00000, 5.46TiB) block_size 4096 (4KiB) rotational
   -26> 2019-03-11 15:43:43.845283 7f18b8892e00  1 bluestore(/var/lib/ceph/osd/ceph-8) _set_cache_sizes cache_size 1073741824 meta 0.4 kv 0.4 data 0.2
   -25> 2019-03-11 15:43:43.845299 7f18b8892e00  1 bdev(0x563466b4cb40 /var/lib/ceph/osd/ceph-8/block) close
   -24> 2019-03-11 15:43:44.169462 7f18b8892e00  5 asok(0x563466baf4a0) register_command objecter_requests hook 0x563466b4a6b0
   -23> 2019-03-11 15:43:44.169528 7f18b8892e00  1 -- 172.17.1.54:6802/2988913 start start
   -22> 2019-03-11 15:43:44.169536 7f18b8892e00  1 -- - start start
   -21> 2019-03-11 15:43:44.169537 7f18b8892e00  1 -- - start start
   -20> 2019-03-11 15:43:44.169538 7f18b8892e00  1 -- 172.17.1.54:6803/2988913 start start
   -19> 2019-03-11 15:43:44.169542 7f18b8892e00  1 -- 10.10.10.5:6803/2988913 start start
   -18> 2019-03-11 15:43:44.169544 7f18b8892e00  1 -- 10.10.10.5:6802/2988913 start start
   -17> 2019-03-11 15:43:44.169547 7f18b8892e00  1 -- - start start
   -16> 2019-03-11 15:43:44.169667 7f18b8892e00  2 osd.8 0 init /var/lib/ceph/osd/ceph-8 (looks like hdd)
   -15> 2019-03-11 15:43:44.169673 7f18b8892e00  2 osd.8 0 journal /var/lib/ceph/osd/ceph-8/journal
   -14> 2019-03-11 15:43:44.169681 7f18b8892e00  1 bluestore(/var/lib/ceph/osd/ceph-8) _mount path /var/lib/ceph/osd/ceph-8
   -13> 2019-03-11 15:43:44.170038 7f18b8892e00  1 bdev create path /var/lib/ceph/osd/ceph-8/block type kernel
   -12> 2019-03-11 15:43:44.170043 7f18b8892e00  1 bdev(0x563466b4cd80 /var/lib/ceph/osd/ceph-8/block) open path /var/lib/ceph/osd/ceph-8/block
   -11> 2019-03-11 15:43:44.170205 7f18b8892e00  1 bdev(0x563466b4cd80 /var/lib/ceph/osd/ceph-8/block) open size 6001170317312 (0x57541a00000, 5.46TiB) block_size 4096 (4KiB) rotational
   -10> 2019-03-11 15:43:44.170470 7f18b8892e00  1 bluestore(/var/lib/ceph/osd/ceph-8) _set_cache_sizes cache_size 1073741824 meta 0.4 kv 0.4 data 0.2
    -9> 2019-03-11 15:43:44.170522 7f18b8892e00  1 bdev create path /var/lib/ceph/osd/ceph-8/block.db type kernel
    -8> 2019-03-11 15:43:44.170526 7f18b8892e00  1 bdev(0x563466b4d200 /var/lib/ceph/osd/ceph-8/block.db) open path /var/lib/ceph/osd/ceph-8/block.db
    -7> 2019-03-11 15:43:44.170647 7f18b8892e00  1 bdev(0x563466b4d200 /var/lib/ceph/osd/ceph-8/block.db) open size 5997854720 (0x165800000, 5.59GiB) block_size 4096 (4KiB) non-rotational
    -6> 2019-03-11 15:43:44.170655 7f18b8892e00  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-8/block.db size 5.59GiB
    -5> 2019-03-11 15:43:44.172927 7f18b8892e00  1 bdev create path /var/lib/ceph/osd/ceph-8/block type kernel
    -4> 2019-03-11 15:43:44.172937 7f18b8892e00  1 bdev(0x563466b4d440 /var/lib/ceph/osd/ceph-8/block) open path /var/lib/ceph/osd/ceph-8/block
    -3> 2019-03-11 15:43:44.173124 7f18b8892e00  1 bdev(0x563466b4d440 /var/lib/ceph/osd/ceph-8/block) open size 6001170317312 (0x57541a00000, 5.46TiB) block_size 4096 (4KiB) rotational
    -2> 2019-03-11 15:43:44.173136 7f18b8892e00  1 bluefs add_block_device bdev 2 path /var/lib/ceph/osd/ceph-8/block size 5.46TiB
    -1> 2019-03-11 15:43:44.173171 7f18b8892e00  1 bluefs mount
     0> 2019-03-11 15:43:44.178468 7f18b8892e00 -1 *** Caught signal (Segmentation fault) **
 in thread 7f18b8892e00 thread_name:ceph-osd

 ceph version 12.2.10 (fc2b1783e3727b66315cc667af9d663d30fe7ed4) luminous (stable)
 1: (()+0xa56bd4) [0x56345cdc8bd4]
 2: (()+0x110c0) [0x7f18b5e980c0]
 3: (BlueFS::_replay(bool)+0x1616) [0x56345cd7fb96]
 4: (BlueFS::mount()+0x1e1) [0x56345cd82aa1]
 5: (BlueStore::_open_db(bool)+0x1698) [0x56345cc8c6b8]
 6: (BlueStore::_mount(bool)+0x2b4) [0x56345ccc5cf4]
 7: (OSD::init()+0x3e2) [0x56345c813fe2]
 8: (main()+0x3092) [0x56345c71d3c2]
 9: (__libc_start_main()+0xf1) [0x7f18b4e4d2e1]
 10: (_start()+0x2a) [0x56345c7a9f9a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.8.log
--- end dump of recent events ---

PVE Versions:

Code:
# pveversion --verbose
proxmox-ve: 5.3-1 (running kernel: 4.15.18-10-pve)
pve-manager: 5.3-8 (running version: 5.3-8/2929af8e)
pve-kernel-4.15: 5.3-1
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.4-1-pve: 4.13.4-26
ceph: 12.2.10-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-44
libpve-guest-common-perl: 2.0-19
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-36
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-33
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 3.10.1-1
qemu-server: 5.0-45
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1
 
Are the block.db and WAL pointing to the right partitions? Are the disks OK (e.g. no SMART errors)?
 
Yes they are, and there are no SMART errors. I just tested them with dd and they look like they're definitely working OK!
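(In case it helps anyone, the dd check was roughly a raw sequential read of each device into /dev/null, something like the following; /dev/sdX is a placeholder for the actual device:)

Code:
# non-destructive sequential read test of the device (placeholder /dev/sdX)
dd if=/dev/sdX of=/dev/null bs=1M count=4096 status=progress

# and the usual SMART health/attribute output
smartctl -a /dev/sdX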
 
Anyone have any idea on this? I still can't get these OSDs up. I could destroy and re-create them, but without knowing what caused it, it could happen again, and if it happens on more hosts than my CRUSH map allows for, I could really be in trouble.
 
After struggling with this for a while, it appears it was caused by one of the SSDs failing. Since all of the node's OSDs use the one SSD for their WAL/DB, the failure caused the segfault on OSD start... I wish this was handled slightly better in Ceph (e.g. an error stating that there was an issue reading from or writing to the disk, rather than just segfaulting).

I only caught it because, during a routine maintenance window, the server was rebooted to apply a kernel update, after which the SSD in question simply disappeared and no longer showed up.
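For anyone else who ends up here: after replacing the failed SSD, the affected OSDs will most likely need to be removed and re-created, since their DB/WAL partitions are gone. A rough sketch (osd.8 as an example; adapt the IDs and devices to your setup, and make sure the cluster can tolerate the rebuild):

Code:
# mark the OSD out and stop it (it is already down in this case)
ceph osd out osd.8
systemctl stop ceph-osd@8

# remove it from the CRUSH map, auth and OSD map in one go (Luminous and later)
ceph osd purge 8 --yes-i-really-mean-it

# then re-create the OSD on the new SSD/HDD pair, e.g. via the Proxmox GUI
# or pveceph createosd, and let the cluster backfill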
 
