My apologies in advance for the length of this post!
During a new hardware install, our Ceph node/server is:
Dell PowerEdge R7415:
1x AMD EPYC 7251 8-Core Processor
128GB RAM
HBA330 disk controller (LSI/Broadcom SAS3008, running FW 15.17.09.06 in IT mode)
4x Toshiba THNSF8200CCS 200GB SSD
8x SEAGATE ST8000NM0195 HDD (for OSDs)
I once again followed the instructions for creating a Ceph cluster at https://pve.proxmox.com/wiki/Manage_Ceph_Services_on_Proxmox_VE_Nodes , but after running "pveceph createosd" on the 8 HDDs, only 3 of the OSDs started and came online. I then purged the Ceph OSD configuration; from memory, that went roughly like this for each OSD that had been created (a sketch rather than an exact transcript):
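Code:
# repeated for each OSD id that had come up (sketch from memory)
ceph osd out <id>
systemctl stop ceph-osd@<id>
ceph osd purge <id> --yes-i-really-mean-it   # drops the CRUSH entry, auth key and OSD id in one go
ceph-disk zap /dev/sd<X>                     # wipe the partition table on the backing disk
Since then I am unable to create and start ANY OSDs: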
Code:
root@ceph1m-2:/var/log# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0 root default
root@ceph1m-2:/var/log# dd if=/dev/zero of=/dev/sde bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 0.99159 s, 211 MB/s
root@ceph1m-2:/var/log# ceph-disk zap /dev/sde
Creating new GPT entries.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
Creating new GPT entries.
The operation has completed successfully.
root@ceph1m-2:/var/log# date
Mon Oct 22 16:48:44 BST 2018
root@ceph1m-2:/var/log# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0 root default
root@ceph1m-2:/var/log# pveceph createosd /dev/sde
create OSD on /dev/sde (bluestore)
Caution: invalid backup GPT header, but valid main header; regenerating
backup header from main header.
****************************************************************************
Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
verification and recovery are STRONGLY recommended.
****************************************************************************
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
Creating new GPT entries.
The operation has completed successfully.
Setting name!
partNum is 0
REALLY setting name!
The operation has completed successfully.
Setting name!
partNum is 1
REALLY setting name!
The operation has completed successfully.
The operation has completed successfully.
meta-data=/dev/sde1 isize=2048 agcount=4, agsize=6400 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=0, rmapbt=0, reflink=0
data = bsize=4096 blocks=25600, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=1608, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.
root@ceph1m-2:/var/log# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0 root default
0 0 osd.0 down 0 1.00000
root@ceph1m-2:/var/log# ps axf | grep osd
16443 pts/0 S+ 0:00 \_ grep osd
The syslog shows tracebacks related to Bluestore:
Code:
root@ceph1m-2:/var/log# tail -n 180 syslog
Oct 22 16:49:10 ceph1m-2 sh[15999]: subprocess.CalledProcessError: Command '['/usr/bin/ceph-osd', '--cluster', 'ceph', '--mkfs', '-i', u'0', '--monmap', '/var/lib/ceph/tmp/mnt.aEj6r_/activate.monmap', '--osd-data', '/var/lib/ceph/tmp/mnt.aEj6r_', '--osd-uuid', u'34bad71e-0cd5-48b7-be79-1e8d4a0cb81e', '--setuser', 'ceph', '--setgroup', 'ceph']' returned non-zero exit status -6
Oct 22 16:49:10 ceph1m-2 sh[15999]: Traceback (most recent call last):
Oct 22 16:49:10 ceph1m-2 sh[15999]: File "/usr/sbin/ceph-disk", line 11, in <module>
Oct 22 16:49:10 ceph1m-2 sh[15999]: load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
Oct 22 16:49:10 ceph1m-2 sh[15999]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5736, in run
Oct 22 16:49:10 ceph1m-2 sh[15999]: main(sys.argv[1:])
Oct 22 16:49:10 ceph1m-2 sh[15999]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5687, in main
Oct 22 16:49:10 ceph1m-2 sh[15999]: args.func(args)
Oct 22 16:49:10 ceph1m-2 sh[15999]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4890, in main_trigger
Oct 22 16:49:10 ceph1m-2 sh[15999]: raise Error('return code ' + str(ret))
Oct 22 16:49:10 ceph1m-2 sh[15999]: ceph_disk.main.Error: Error: return code 1
Oct 22 16:49:10 ceph1m-2 systemd[1]: Failed to start Ceph disk activation: /dev/sde2.
Oct 22 16:49:10 ceph1m-2 systemd[1]: ceph-disk@dev-sde2.service: Unit entered failed state.
Oct 22 16:49:10 ceph1m-2 systemd[1]: ceph-disk@dev-sde2.service: Failed with result 'exit-code'.
Oct 22 16:49:10 ceph1m-2 kernel: [ 2615.978863] XFS (sde1): Mounting V5 Filesystem
Oct 22 16:49:10 ceph1m-2 kernel: [ 2616.048749] XFS (sde1): Ending clean mount
Oct 22 16:49:11 ceph1m-2 kernel: [ 2616.349806] sd 1:0:4:0: [sde] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 22 16:49:11 ceph1m-2 kernel: [ 2616.349818] sd 1:0:4:0: [sde] tag#0 Sense Key : Aborted Command [current]
Oct 22 16:49:11 ceph1m-2 kernel: [ 2616.349821] sd 1:0:4:0: [sde] tag#0 Add. Sense: Logical block guard check failed
Oct 22 16:49:11 ceph1m-2 kernel: [ 2616.349824] sd 1:0:4:0: [sde] tag#0 CDB: Read(32)
Oct 22 16:49:11 ceph1m-2 kernel: [ 2616.349827] sd 1:0:4:0: [sde] tag#0 CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 00
Oct 22 16:49:11 ceph1m-2 kernel: [ 2616.349829] sd 1:0:4:0: [sde] tag#0 CDB[10]: 37 e4 1d 00 37 e4 1d 00 00 00 00 00 00 00 01 00
Oct 22 16:49:11 ceph1m-2 kernel: [ 2616.349831] print_req_error: protection error, dev sde, sector 7501572096
Oct 22 16:49:11 ceph1m-2 kernel: [ 2616.467638] XFS (sde1): Unmounting Filesystem
Oct 22 16:49:11 ceph1m-2 sh[16140]: main_trigger:
Oct 22 16:49:11 ceph1m-2 sh[16140]: main_trigger: main_activate: path = /dev/sde1
Oct 22 16:49:11 ceph1m-2 sh[16140]: get_dm_uuid: get_dm_uuid /dev/sde1 uuid path is /sys/dev/block/8:65/dm/uuid
Oct 22 16:49:11 ceph1m-2 sh[16140]: command: Running command: /sbin/blkid -o udev -p /dev/sde1
Oct 22 16:49:11 ceph1m-2 sh[16140]: command: Running command: /sbin/blkid -p -s TYPE -o value -- /dev/sde1
Oct 22 16:49:11 ceph1m-2 sh[16140]: command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
Oct 22 16:49:11 ceph1m-2 sh[16140]: command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
Oct 22 16:49:11 ceph1m-2 sh[16140]: mount: Mounting /dev/sde1 on /var/lib/ceph/tmp/mnt.3jaGYg with options noatime,inode64
Oct 22 16:49:11 ceph1m-2 sh[16140]: command_check_call: Running command: /bin/mount -t xfs -o noatime,inode64 -- /dev/sde1 /var/lib/ceph/tmp/mnt.3jaGYg
Oct 22 16:49:11 ceph1m-2 sh[16140]: activate: Cluster uuid is ecf4285f-7a04-4f97-b705-d0194254d317
Oct 22 16:49:11 ceph1m-2 sh[16140]: command: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
Oct 22 16:49:11 ceph1m-2 sh[16140]: activate: Cluster name is ceph
Oct 22 16:49:11 ceph1m-2 sh[16140]: activate: OSD uuid is 34bad71e-0cd5-48b7-be79-1e8d4a0cb81e
Oct 22 16:49:11 ceph1m-2 sh[16140]: activate: OSD id is 0
Oct 22 16:49:11 ceph1m-2 sh[16140]: activate: Initializing OSD...
Oct 22 16:49:11 ceph1m-2 sh[16140]: command_check_call: Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/tmp/mnt.3jaGYg/activate.monmap
Oct 22 16:49:11 ceph1m-2 sh[16140]: got monmap epoch 3
Oct 22 16:49:11 ceph1m-2 sh[16140]: command_check_call: Running command: /usr/bin/ceph-osd --cluster ceph --mkfs -i 0 --monmap /var/lib/ceph/tmp/mnt.3jaGYg/activate.monmap --osd-data /var/lib/ceph/tmp/mnt.3jaGYg --osd-uuid 34bad71e-0cd5-48b7-be79-1e8d4a0cb81e --setuser ceph --setgroup ceph
Oct 22 16:49:11 ceph1m-2 sh[16140]: /mnt/npool/a.antreich/ceph/ceph-12.2.8/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, uint64_t, size_t, ceph::bufferlist*, char*)' thread 7f40dfdb7e00 time 2018-10-22 16:49:11.131160
Oct 22 16:49:11 ceph1m-2 sh[16140]: /mnt/npool/a.antreich/ceph/ceph-12.2.8/src/os/bluestore/BlueFS.cc: 976: FAILED assert(r == 0)
Oct 22 16:49:11 ceph1m-2 sh[16140]: ceph version 12.2.8 (6f01265ca03a6b9d7f3b7f759d8894bb9dbb6840) luminous (stable)
Oct 22 16:49:11 ceph1m-2 sh[16140]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x5582379d2ab2]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 2: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0xf7a) [0x558237939aba]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 3: (BlueFS::_replay(bool)+0x22d) [0x55823794134d]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 4: (BlueFS::mount()+0x1e1) [0x558237945641]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 5: (BlueStore::_open_db(bool)+0x1698) [0x5582378535a8]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 6: (BlueStore::mkfs()+0xeb5) [0x55823788da55]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 7: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x346) [0x5582373bc796]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 8: (main()+0x127c) [0x5582372efe2c]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 9: (__libc_start_main()+0xf1) [0x7f40dc3732e1]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 10: (_start()+0x2a) [0x55823737c84a]
Oct 22 16:49:11 ceph1m-2 sh[16140]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Oct 22 16:49:11 ceph1m-2 sh[16140]: 2018-10-22 16:49:11.133961 7f40dfdb7e00 -1 /mnt/npool/a.antreich/ceph/ceph-12.2.8/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, uint64_t, size_t, ceph::bufferlist*, char*)' thread 7f40dfdb7e00 time 2018-10-22 16:49:11.131160
Oct 22 16:49:11 ceph1m-2 sh[16140]: /mnt/npool/a.antreich/ceph/ceph-12.2.8/src/os/bluestore/BlueFS.cc: 976: FAILED assert(r == 0)
Oct 22 16:49:11 ceph1m-2 sh[16140]: ceph version 12.2.8 (6f01265ca03a6b9d7f3b7f759d8894bb9dbb6840) luminous (stable)
Oct 22 16:49:11 ceph1m-2 sh[16140]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x5582379d2ab2]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 2: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0xf7a) [0x558237939aba]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 3: (BlueFS::_replay(bool)+0x22d) [0x55823794134d]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 4: (BlueFS::mount()+0x1e1) [0x558237945641]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 5: (BlueStore::_open_db(bool)+0x1698) [0x5582378535a8]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 6: (BlueStore::mkfs()+0xeb5) [0x55823788da55]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 7: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x346) [0x5582373bc796]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 8: (main()+0x127c) [0x5582372efe2c]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 9: (__libc_start_main()+0xf1) [0x7f40dc3732e1]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 10: (_start()+0x2a) [0x55823737c84a]
Oct 22 16:49:11 ceph1m-2 sh[16140]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Oct 22 16:49:11 ceph1m-2 sh[16140]: 0> 2018-10-22 16:49:11.133961 7f40dfdb7e00 -1 /mnt/npool/a.antreich/ceph/ceph-12.2.8/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, uint64_t, size_t, ceph::bufferlist*, char*)' thread 7f40dfdb7e00 time 2018-10-22 16:49:11.131160
Oct 22 16:49:11 ceph1m-2 sh[16140]: /mnt/npool/a.antreich/ceph/ceph-12.2.8/src/os/bluestore/BlueFS.cc: 976: FAILED assert(r == 0)
Oct 22 16:49:11 ceph1m-2 sh[16140]: ceph version 12.2.8 (6f01265ca03a6b9d7f3b7f759d8894bb9dbb6840) luminous (stable)
Oct 22 16:49:11 ceph1m-2 sh[16140]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x5582379d2ab2]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 2: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0xf7a) [0x558237939aba]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 3: (BlueFS::_replay(bool)+0x22d) [0x55823794134d]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 4: (BlueFS::mount()+0x1e1) [0x558237945641]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 5: (BlueStore::_open_db(bool)+0x1698) [0x5582378535a8]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 6: (BlueStore::mkfs()+0xeb5) [0x55823788da55]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 7: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x346) [0x5582373bc796]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 8: (main()+0x127c) [0x5582372efe2c]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 9: (__libc_start_main()+0xf1) [0x7f40dc3732e1]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 10: (_start()+0x2a) [0x55823737c84a]
Oct 22 16:49:11 ceph1m-2 sh[16140]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Oct 22 16:49:11 ceph1m-2 sh[16140]: *** Caught signal (Aborted) **
Oct 22 16:49:11 ceph1m-2 sh[16140]: in thread 7f40dfdb7e00 thread_name:ceph-osd
Oct 22 16:49:11 ceph1m-2 sh[16140]: ceph version 12.2.8 (6f01265ca03a6b9d7f3b7f759d8894bb9dbb6840) luminous (stable)
Oct 22 16:49:11 ceph1m-2 sh[16140]: 1: (()+0xa3bba4) [0x55823798aba4]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 2: (()+0x110c0) [0x7f40dd3be0c0]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 3: (gsignal()+0xcf) [0x7f40dc385fff]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 4: (abort()+0x16a) [0x7f40dc38742a]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x5582379d2c3e]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 6: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0xf7a) [0x558237939aba]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 7: (BlueFS::_replay(bool)+0x22d) [0x55823794134d]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 8: (BlueFS::mount()+0x1e1) [0x558237945641]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 9: (BlueStore::_open_db(bool)+0x1698) [0x5582378535a8]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 10: (BlueStore::mkfs()+0xeb5) [0x55823788da55]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 11: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x346) [0x5582373bc796]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 12: (main()+0x127c) [0x5582372efe2c]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 13: (__libc_start_main()+0xf1) [0x7f40dc3732e1]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 14: (_start()+0x2a) [0x55823737c84a]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 2018-10-22 16:49:11.136967 7f40dfdb7e00 -1 *** Caught signal (Aborted) **
Oct 22 16:49:11 ceph1m-2 sh[16140]: in thread 7f40dfdb7e00 thread_name:ceph-osd
Oct 22 16:49:11 ceph1m-2 sh[16140]: ceph version 12.2.8 (6f01265ca03a6b9d7f3b7f759d8894bb9dbb6840) luminous (stable)
Oct 22 16:49:11 ceph1m-2 sh[16140]: 1: (()+0xa3bba4) [0x55823798aba4]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 2: (()+0x110c0) [0x7f40dd3be0c0]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 3: (gsignal()+0xcf) [0x7f40dc385fff]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 4: (abort()+0x16a) [0x7f40dc38742a]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x5582379d2c3e]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 6: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0xf7a) [0x558237939aba]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 7: (BlueFS::_replay(bool)+0x22d) [0x55823794134d]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 8: (BlueFS::mount()+0x1e1) [0x558237945641]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 9: (BlueStore::_open_db(bool)+0x1698) [0x5582378535a8]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 10: (BlueStore::mkfs()+0xeb5) [0x55823788da55]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 11: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x346) [0x5582373bc796]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 12: (main()+0x127c) [0x5582372efe2c]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 13: (__libc_start_main()+0xf1) [0x7f40dc3732e1]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 14: (_start()+0x2a) [0x55823737c84a]
Oct 22 16:49:11 ceph1m-2 sh[16140]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Oct 22 16:49:11 ceph1m-2 sh[16140]: 0> 2018-10-22 16:49:11.136967 7f40dfdb7e00 -1 *** Caught signal (Aborted) **
Oct 22 16:49:11 ceph1m-2 sh[16140]: in thread 7f40dfdb7e00 thread_name:ceph-osd
Oct 22 16:49:11 ceph1m-2 sh[16140]: ceph version 12.2.8 (6f01265ca03a6b9d7f3b7f759d8894bb9dbb6840) luminous (stable)
Oct 22 16:49:11 ceph1m-2 sh[16140]: 1: (()+0xa3bba4) [0x55823798aba4]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 2: (()+0x110c0) [0x7f40dd3be0c0]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 3: (gsignal()+0xcf) [0x7f40dc385fff]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 4: (abort()+0x16a) [0x7f40dc38742a]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x5582379d2c3e]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 6: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0xf7a) [0x558237939aba]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 7: (BlueFS::_replay(bool)+0x22d) [0x55823794134d]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 8: (BlueFS::mount()+0x1e1) [0x558237945641]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 9: (BlueStore::_open_db(bool)+0x1698) [0x5582378535a8]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 10: (BlueStore::mkfs()+0xeb5) [0x55823788da55]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 11: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x346) [0x5582373bc796]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 12: (main()+0x127c) [0x5582372efe2c]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 13: (__libc_start_main()+0xf1) [0x7f40dc3732e1]
Oct 22 16:49:11 ceph1m-2 sh[16140]: 14: (_start()+0x2a) [0x55823737c84a]
Oct 22 16:49:11 ceph1m-2 sh[16140]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Oct 22 16:49:11 ceph1m-2 sh[16140]: mount_activate: Failed to activate
Oct 22 16:49:11 ceph1m-2 sh[16140]: unmount: Unmounting /var/lib/ceph/tmp/mnt.3jaGYg
Oct 22 16:49:11 ceph1m-2 sh[16140]: command_check_call: Running command: /bin/umount -- /var/lib/ceph/tmp/mnt.3jaGYg
Oct 22 16:49:11 ceph1m-2 sh[16140]: Traceback (most recent call last):
Oct 22 16:49:11 ceph1m-2 sh[16140]: File "/usr/sbin/ceph-disk", line 11, in <module>
Oct 22 16:49:11 ceph1m-2 sh[16140]: load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
Oct 22 16:49:11 ceph1m-2 sh[16140]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5736, in run
Oct 22 16:49:11 ceph1m-2 sh[16140]: main(sys.argv[1:])
Oct 22 16:49:11 ceph1m-2 sh[16140]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5687, in main
Oct 22 16:49:11 ceph1m-2 sh[16140]: args.func(args)
Oct 22 16:49:11 ceph1m-2 sh[16140]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3777, in main_activate
Oct 22 16:49:11 ceph1m-2 sh[16140]: reactivate=args.reactivate,
Oct 22 16:49:11 ceph1m-2 sh[16140]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3540, in mount_activate
Oct 22 16:49:11 ceph1m-2 sh[16140]: (osd_id, cluster) = activate(path, activate_key_template, init)
Oct 22 16:49:11 ceph1m-2 sh[16140]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3717, in activate
Oct 22 16:49:11 ceph1m-2 sh[16140]: keyring=keyring,
Oct 22 16:49:11 ceph1m-2 sh[16140]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3169, in mkfs
Oct 22 16:49:11 ceph1m-2 sh[16140]: '--setgroup', get_ceph_group(),
Oct 22 16:49:11 ceph1m-2 sh[16140]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 566, in command_check_call
Oct 22 16:49:11 ceph1m-2 sh[16140]: return subprocess.check_call(arguments)
Oct 22 16:49:11 ceph1m-2 sh[16140]: File "/usr/lib/python2.7/subprocess.py", line 186, in check_call
Oct 22 16:49:11 ceph1m-2 sh[16140]: raise CalledProcessError(retcode, cmd)
Oct 22 16:49:11 ceph1m-2 sh[16140]: subprocess.CalledProcessError: Command '['/usr/bin/ceph-osd', '--cluster', 'ceph', '--mkfs', '-i', u'0', '--monmap', '/var/lib/ceph/tmp/mnt.3jaGYg/activate.monmap', '--osd-data', '/var/lib/ceph/tmp/mnt.3jaGYg', '--osd-uuid', u'34bad71e-0cd5-48b7-be79-1e8d4a0cb81e', '--setuser', 'ceph', '--setgroup', 'ceph']' returned non-zero exit status -6
Oct 22 16:49:11 ceph1m-2 sh[16140]: Traceback (most recent call last):
Oct 22 16:49:11 ceph1m-2 sh[16140]: File "/usr/sbin/ceph-disk", line 11, in <module>
Oct 22 16:49:11 ceph1m-2 sh[16140]: load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
Oct 22 16:49:11 ceph1m-2 sh[16140]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5736, in run
Oct 22 16:49:11 ceph1m-2 sh[16140]: main(sys.argv[1:])
Oct 22 16:49:11 ceph1m-2 sh[16140]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5687, in main
Oct 22 16:49:11 ceph1m-2 sh[16140]: args.func(args)
Oct 22 16:49:11 ceph1m-2 sh[16140]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4890, in main_trigger
Oct 22 16:49:11 ceph1m-2 sh[16140]: raise Error('return code ' + str(ret))
Oct 22 16:49:11 ceph1m-2 sh[16140]: ceph_disk.main.Error: Error: return code 1
Oct 22 16:49:11 ceph1m-2 systemd[1]: ceph-disk@dev-sde1.service: Main process exited, code=exited, status=1/FAILURE
Oct 22 16:49:11 ceph1m-2 systemd[1]: Failed to start Ceph disk activation: /dev/sde1.
Oct 22 16:49:11 ceph1m-2 systemd[1]: ceph-disk@dev-sde1.service: Unit entered failed state.
Oct 22 16:49:11 ceph1m-2 systemd[1]: ceph-disk@dev-sde1.service: Failed with result 'exit-code'.
Oct 22 16:49:17 ceph1m-2 corosync[2699]: notice [TOTEM ] Retransmit List: 3623
Oct 22 16:49:17 ceph1m-2 corosync[2699]: [TOTEM ] Retransmit List: 3623
Oct 22 16:49:47 ceph1m-2 corosync[2699]: notice [TOTEM ] Retransmit List: 36b3
Oct 22 16:49:47 ceph1m-2 corosync[2699]: [TOTEM ] Retransmit List: 36b3
Oct 22 16:49:54 ceph1m-2 corosync[2699]: notice [TOTEM ] Retransmit List: 36d3
Oct 22 16:49:54 ceph1m-2 corosync[2699]: [TOTEM ] Retransmit List: 36d3
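One thing that stands out to me in the kernel lines above is the "Logical block guard check failed" / "print_req_error: protection error" messages, which (if I'm reading them right) suggest the drives may have been formatted with T10 Protection Information enabled. In case it's relevant, this is roughly how I would check that (assuming the sg3-utils package is installed; the second command is destructive and only a sketch, not something I've run):
Code:
sg_readcap --long /dev/sde                 # "prot_en=1" in the output would mean protection information is active
sg_format --format --fmtpinfo=0 /dev/sde   # would reformat the drive WITHOUT protection info (destroys data, slow)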
Stopping and starting the ceph-mon service produces the same results, with the addition of these lines in syslog:
Code:
Oct 22 17:14:46 ceph1m-2 kernel: [ 4152.140167] sd 1:0:4:0: [sde] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 22 17:14:46 ceph1m-2 kernel: [ 4152.140170] sd 1:0:4:0: [sde] tag#1 Sense Key : Aborted Command [current]
Oct 22 17:14:46 ceph1m-2 kernel: [ 4152.140172] sd 1:0:4:0: [sde] tag#1 Add. Sense: Logical block guard check failed
Oct 22 17:14:46 ceph1m-2 kernel: [ 4152.140174] sd 1:0:4:0: [sde] tag#1 CDB: Read(32)
Oct 22 17:14:46 ceph1m-2 kernel: [ 4152.140177] sd 1:0:4:0: [sde] tag#1 CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 00
Oct 22 17:14:46 ceph1m-2 kernel: [ 4152.140178] sd 1:0:4:0: [sde] tag#1 CDB[10]: 37 e4 1d 80 37 e4 1d 80 00 00 00 00 00 00 00 80
Oct 22 17:14:46 ceph1m-2 kernel: [ 4152.140180] print_req_error: protection error, dev sde, sector 7501573120
Oct 22 17:14:47 ceph1m-2 kernel: [ 4152.235159] sd 1:0:4:0: [sde] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 22 17:14:47 ceph1m-2 kernel: [ 4152.235165] sd 1:0:4:0: [sde] tag#0 Sense Key : Aborted Command [current]
Oct 22 17:14:47 ceph1m-2 kernel: [ 4152.235168] sd 1:0:4:0: [sde] tag#0 Add. Sense: Logical block guard check failed
Oct 22 17:14:47 ceph1m-2 kernel: [ 4152.235170] sd 1:0:4:0: [sde] tag#0 CDB: Read(32)
Oct 22 17:14:47 ceph1m-2 kernel: [ 4152.235173] sd 1:0:4:0: [sde] tag#0 CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 00
Oct 22 17:14:47 ceph1m-2 kernel: [ 4152.235175] sd 1:0:4:0: [sde] tag#0 CDB[10]: 37 e4 1d 00 37 e4 1d 00 00 00 00 00 00 00 00 80
Stopping and starting the overall ceph service just fails, with similar output.
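In case the exact commands matter, the stop/start cycles were done with systemd, roughly like this (the monitor instance name is assumed to be the node name):
Code:
systemctl stop ceph-mon@ceph1m-2.service
systemctl start ceph-mon@ceph1m-2.service
# and for the overall ceph service:
systemctl stop ceph.target
systemctl start ceph.target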
Software versions:
Code:
root@ceph1m-2:/var/log# pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.18-7-pve)
pve-manager: 5.2-9 (running version: 5.2-9/4b30e8f9)
pve-kernel-4.15: 5.2-10
pve-kernel-4.15.18-7-pve: 4.15.18-27
ceph: 12.2.8-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-40
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
openvswitch-switch: 2.7.0-3
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-28
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-36
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1
It's all a little confusing, to be honest: we've not had a total failure like this on any of the last 3 pveceph clusters we've assembled, and I'd have thought that the newest hardware and software would be... super reliable?
Could anyone please shed some light on what might be going wrong?
Edit: Creating the OSD in Filestore mode works immediately!
Code:
root@ceph1m-2:/var/log# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0 root default
root@ceph1m-2:/var/log# ceph-disk zap /dev/sde
Creating new GPT entries.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
Creating new GPT entries.
The operation has completed successfully.
root@ceph1m-2:/var/log# pveceph createosd /dev/sde -bluestore 0
create OSD on /dev/sde (xfs)
Caution: invalid backup GPT header, but valid main header; regenerating
backup header from main header.
****************************************************************************
Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
verification and recovery are STRONGLY recommended.
****************************************************************************
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
Creating new GPT entries.
The operation has completed successfully.
Setting name!
partNum is 1
REALLY setting name!
The operation has completed successfully.
The operation has completed successfully.
Setting name!
partNum is 0
REALLY setting name!
The operation has completed successfully.
meta-data=/dev/sde1 isize=2048 agcount=8, agsize=268435455 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=0, rmapbt=0, reflink=0
data = bsize=4096 blocks=1952195665, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.
root@ceph1m-2:/var/log# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 7.27060 root default
-3 7.27060 host ceph1m-2
0 hdd 7.27060 osd.0 up 1.00000 1.00000