CEPH problems. No boot

Javier Estellés Dasí · Sep 7, 2016

We have alredy configured Proxmox Ha Cluster with Ceph storage. The cluster was stopped without any problems and now (after summer), none of our proxmox servers boot the ceph storage. Seems to be a problem with OSD's but I cannot find a solution.

Thanks.

SRV Ceph start Messages

=== osd.0 ===
2016-09-07 14:11:20.721132 7f2a67869700 0 -- :/1138065289 >> 192.168.1.239:6789/0 pipe(0x7f2a6c061550 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f2a6c05a3f0).fault
2016-09-07 14:11:26.721138 7f2a67667700 0 -- 192.168.1.240:0/1138065289 >> 192.168.1.239:6789/0 pipe(0x7f2a5c006e20 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f2a5c00b0c0).fault
2016-09-07 14:11:29.653123 7f2a67869700 0 -- 192.168.1.240:0/1138065289 >> 192.168.1.238:6789/0 pipe(0x7f2a5c000c00 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f2a5c00cd90).fault
2016-09-07 14:11:32.721115 7f2a67667700 0 -- 192.168.1.240:0/1138065289 >> 192.168.1.239:6789/0 pipe(0x7f2a5c006e20 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f2a5c00e6f0).fault
2016-09-07 14:11:38.653117 7f2a67768700 0 -- 192.168.1.240:0/1138065289 >> 192.168.1.238:6789/0 pipe(0x7f2a5c006e20 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f2a5c00b570).fault
2016-09-07 14:11:44.653124 7f2a67869700 0 -- 192.168.1.240:0/1138065289 >> 192.168.1.238:6789/0 pipe(0x7f2a5c006e20 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f2a5c00c030).fault
2016-09-07 14:11:47.721053 7f2a67768700 0 -- 192.168.1.240:0/1138065289 >> 192.168.1.239:6789/0 pipe(0x7f2a5c000c00 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f2a5c010a50).fault
failed: 'timeout 30 /usr/bin/ceph -c /etc/pve/ceph.conf --name=osd.0 --keyring=/var/lib/ceph/osd/ceph-0/keyring osd crush create-or-move -- 0 1.82 host=46024953-HV1 root=default'
=== mon.0 ===
Starting Ceph mon.0 on 46024953-HV1...already running
=== osd.0 ===
2016-09-07 14:11:53.721126 7f06d429b700 0 -- 192.168.1.240:0/1105037462 >> 192.168.1.239:6789/0 pipe(0x7f06c0000c00 sd=8 :0 s=1 pgs=0 cs=0 l=1 c=0x7f06c0004ef0).fault
2016-09-07 14:11:59.653122 7f06d449d700 0 -- 192.168.1.240:0/1105037462 >> 192.168.1.238:6789/0 pipe(0x7f06c0000c00 sd=8 :0 s=1 pgs=0 cs=0 l=1 c=0x7f06c0006470).fault
2016-09-07 14:12:02.721151 7f06d429b700 0 -- 192.168.1.240:0/1105037462 >> 192.168.1.239:6789/0 pipe(0x7f06c00080e0 sd=8 :0 s=1 pgs=0 cs=0 l=1 c=0x7f06c00054e0).fault
2016-09-07 14:12:08.653111 7f06d439c700 0 -- 192.168.1.240:0/1105037462 >> 192.168.1.238:6789/0 pipe(0x7f06c00080e0 sd=8 :0 s=1 pgs=0 cs=0 l=1 c=0x7f06c0005750).fault
2016-09-07 14:12:14.653121 7f06d449d700 0 -- 192.168.1.240:0/1105037462 >> 192.168.1.238:6789/0 pipe(0x7f06c00080e0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f06c0011350).fault
2016-09-07 14:12:17.721113 7f06d439c700 0 -- 192.168.1.240:0/1105037462 >> 192.168.1.239:6789/0 pipe(0x7f06c0000c00 sd=8 :0 s=1 pgs=0 cs=0 l=1 c=0x7f06c0004ea0).fault
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.0 --keyring=/var/lib/ceph/osd/ceph-0/keyring osd crush create-or-move -- 0 1.82 host=46024953-HV1 root=default'
ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.0']' returned non-zero exit status 1
ceph-disk: Error: One or more partitions failed to activate
TASK ERROR: command 'setsid service ceph -c /etc/pve/ceph.conf start ''' failed: exit code 1

CTCcloud · Sep 7, 2016

Please post your setup

How many servers make up the Ceph cluster?

How many OSDs per node?

How many monitors do you have running?

At this point, the only thing that can be abstracted from your post is that you have a failed OSD, as the error says "ceph-disk: Error: One or more partitions failed to activate"

spirit · Sep 7, 2016

do you have enable proxmox firewall ? if yes, do you have open ceph ports ?

Javier Estellés Dasí · Sep 8, 2016

ctcknows said:
Please post your setup

How many servers make up the Ceph cluster?
3
How many OSDs per node?
1
How many monitors do you have running?
3
At this point, the only thing that can be abstracted from your post is that you have a failed OSD, as the error says "ceph-disk: Error: One or more partitions failed to activate"

failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.0 --keyring=/var/lib/ceph/osd/ceph-0/keyring osd crush create-or-move -- 0 1.82 host=46024953-HV1 root=default'
ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.0']' returned non-zero exit status 1
ceph-disk: Error: One or more partitions failed to activate

in any case, this error occurs on all hypervisors

udo · Sep 8, 2016

Javier Estellés Dasí said:
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.0 --keyring=/var/lib/ceph/osd/ceph-0/keyring osd crush create-or-move -- 0 1.82 host=46024953-HV1 root=default'
ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.0']' returned non-zero exit status 1
ceph-disk: Error: One or more partitions failed to activate

in any case, this error occurs on all hypervisors

Hi,
like spirit allready wrote, to help we need further infos - repeating the known osd.0 one don't help very much!

Go on the node with osd.0 an post the result of following commands

Code:

df -h
ceph osd tree
dmesg | grep sd
cat /etc/ceph/ceph.conf

Udo

Javier Estellés Dasí · Sep 8, 2016

udo said:

Hi,
like spirit allready wrote, to help we need further infos - repeating the known osd.0 one don't help very much!

Go on the node with osd.0 an post the result of following commands

Code:

df -h

[code]
root@46024953-HV1:~# df -h
Filesystem  Size  Used Avail Use% Mounted on
udev  10M  0  10M  0% /dev
tmpfs  6.3G  9.0M  6.3G  1% /run
/dev/dm-0  197G  12G  176G  6% /
tmpfs  16G  63M  16G  1% /dev/shm
tmpfs  5.0M  0  5.0M  0% /run/lock
tmpfs  16G  0  16G  0% /sys/fs/cgroup
tmpfs  100K  0  100K  0% /run/lxcfs/controllers
cgmfs  100K  0  100K  0% /run/cgmanager/fs
/dev/fuse  30M  32K  30M  1% /etc/pve
/dev/sdb1  1.9T  580G  1.3T  32% /var/lib/ceph/osd/ceph-0

Code:

ceph osd tree

root@46024953-HV1:~# ceph osd tree
2016-09-08 13:21:39.407334 7fd19c12e700  1 -- :/0 messenger.start
2016-09-08 13:21:39.407897 7fd19c12e700  1 -- :/1152328551 --> 192.168.1.238:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- ?+0 0x7fd19405db60 con 0x7fd19405ad80
2016-09-08 13:21:42.243337 7fd198199700  0 -- :/1152328551 >> 192.168.1.238:6789/0 pipe(0x7fd194061550 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fd19405ad80).fault
2016-09-08 13:21:42.408007 7fd193fff700  1 -- :/1152328551 mark_down 0x7fd19405ad80 -- 0x7fd194061550
2016-09-08 13:21:42.408078 7fd193fff700  1 -- :/1152328551 --> 192.168.1.240:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- ?+0 0x7fd1800063a0 con 0x7fd180004ef0
2016-09-08 13:21:42.408263 7fd192ffd700  1 -- 192.168.1.240:0/1152328551 learned my addr 192.168.1.240:0/1152328551
2016-09-08 13:21:45.408192 7fd193fff700  1 -- 192.168.1.240:0/1152328551 mark_down 0x7fd180004ef0 -- 0x7fd180000c00
2016-09-08 13:21:45.408284 7fd193fff700  1 -- 192.168.1.240:0/1152328551 --> 192.168.1.239:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- ?+0 0x7fd18000c390 con 0x7fd18000b0c0
2016-09-08 13:21:46.035307 7fd192efc700  0 -- 192.168.1.240:0/1152328551 >> 192.168.1.239:6789/0 pipe(0x7fd180006e20 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fd18000b0c0).fault
2016-09-08 13:21:48.408418 7fd193fff700  1 -- 192.168.1.240:0/1152328551 mark_down 0x7fd18000b0c0 -- 0x7fd180006e20
2016-09-08 13:21:48.408493 7fd193fff700  1 -- 192.168.1.240:0/1152328551 --> 192.168.1.238:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- ?+0 0x7fd18000e010 con 0x7fd18000cd90
2016-09-08 13:21:51.243350 7fd198199700  0 -- 192.168.1.240:0/1152328551 >> 192.168.1.238:6789/0 pipe(0x7fd180000c00 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fd18000cd90).fault
2016-09-08 13:21:51.408629 7fd193fff700  1 -- 192.168.1.240:0/1152328551 mark_down 0x7fd18000cd90 -- 0x7fd180000c00
....
2016-09-08 13:26:30.428172 7fd193fff700  1 -- 192.168.1.240:0/1152328551 --> 192.168.1.239:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- ?+0 0x7fd18000ecf0 con 0x7fd18000fe60
2016-09-08 13:26:31.043352 7fd192efc700  0 -- 192.168.1.240:0/1152328551 >> 192.168.1.239:6789/0 pipe(0x7fd180000990 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fd18000fe60).fault
2016-09-08 13:26:33.428308 7fd193fff700  1 -- 192.168.1.240:0/1152328551 mark_down 0x7fd18000fe60 -- 0x7fd180000990
2016-09-08 13:26:33.428393 7fd193fff700  1 -- 192.168.1.240:0/1152328551 --> 192.168.1.238:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- ?+0 0x7fd18001b480 con 0x7fd180013050
2016-09-08 13:26:36.243317 7fd192ffd700  0 -- 192.168.1.240:0/1152328551 >> 192.168.1.238:6789/0 pipe(0x7fd180006e20 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fd180013050).fault
2016-09-08 13:26:36.428534 7fd193fff700  1 -- 192.168.1.240:0/1152328551 mark_down 0x7fd180013050 -- 0x7fd180006e20
2016-09-08 13:26:36.428624 7fd193fff700  1 -- 192.168.1.240:0/1152328551 --> 192.168.1.240:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- ?+0 0x7fd18000ecf0 con 0x7fd18000fe60
2016-09-08 13:26:39.407998 7fd19c12e700  0 monclient(hunting): authenticate timed out after 300
2016-09-08 13:26:39.408022 7fd19c12e700  0 librados: client.admin authentication error (110) Connection timed out
2016-09-08 13:26:39.408254 7fd19c12e700  1 -- 192.168.1.240:0/1152328551 mark_down 0x7fd18000fe60 -- 0x7fd180000990
2016-09-08 13:26:39.408310 7fd19c12e700  1 -- 192.168.1.240:0/1152328551 mark_down_all
2016-09-08 13:26:39.408391 7fd19c12e700  1 -- 192.168.1.240:0/1152328551 shutdown complete.

Code:

dmesg | grep sd
root@46024953-HV1:~# dmesg | grep sd

[  0.000000] ACPI: SSDT 0x00000000DB840960 003110 (v01 SaSsdt SaSsdt  00003000 INTL 20091112)
[  1.994302] sd 0:0:0:0: [sda] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[  1.994305] sd 0:0:0:0: Attached scsi generic sg0 type 0
[  1.994455] sd 0:0:0:0: [sda] 4096-byte physical blocks
[  1.994615] sd 1:0:0:0: Attached scsi generic sg1 type 0
[  1.994617] sd 1:0:0:0: [sdb] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[  1.994618] sd 1:0:0:0: [sdb] 4096-byte physical blocks
[  1.994651] sd 0:0:0:0: [sda] Write Protect is off
[  1.994652] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[  1.994688] sd 1:0:0:0: [sdb] Write Protect is off
[  1.994689] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  1.994689] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[  1.994716] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  2.014451] sd 3:0:0:0: [sdc] 234441648 512-byte logical blocks: (120 GB/112 GiB)
[  2.014462] sd 3:0:0:0: Attached scsi generic sg3 type 0
[  2.014666] sd 3:0:0:0: [sdc] Write Protect is off
[  2.014700] sd 3:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[  2.014719] sd 3:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  2.016964]  sdc: sdc1
[  2.017382] sd 3:0:0:0: [sdc] Attached SCSI disk
[  2.032022]  sdb: sdb1
[  2.032435] sd 1:0:0:0: [sdb] Attached SCSI disk
[  2.034329]  sda: sda1 sda2 sda3
[  2.034620] sd 0:0:0:0: [sda] Attached SCSI disk
[  7.500078] sd 6:0:0:0: Attached scsi generic sg4 type 0
[  7.500227] sd 6:0:0:0: [sdd] 7864320 512-byte logical blocks: (4.03 GB/3.75 GiB)
[  7.500393] sd 6:0:0:0: [sdd] Write Protect is off
[  7.500428] sd 6:0:0:0: [sdd] Mode Sense: 0b 00 00 08
[  7.500557] sd 6:0:0:0: [sdd] No Caching mode page found
[  7.500592] sd 6:0:0:0: [sdd] Assuming drive cache: write through
[  7.551860]  sdd: sdd1 sdd2
[  7.552646] sd 6:0:0:0: [sdd] Attached SCSI removable disk
[  13.793045] device-mapper: thin: Data device (dm-3) discard unsupported: Disabling discard passdown.
[  20.053025] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
[  22.745099] XFS (sdb1): Mounting V4 Filesystem
[  22.849240] XFS (sdb1): Ending clean mount

Code:

cat /etc/ceph/ceph.conf

root@46024953-HV3:~# cat /etc/ceph/ceph.conf

[global]
  auth client required = cephx
  auth cluster required = cephx
  auth service required = cephx
  cluster network = 192.168.1.0/24
  filestore xattr use omap = true
  fsid = 9503a0d2-888b-482e-8b92-f8669a0172cd
  keyring = /etc/pve/priv/$cluster.$name.keyring
  osd journal size = 16384
  osd pool default min size = 1
  public network = 192.168.1.0/24
  debug ms = 1/5

[mon]
  keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.0]
  host = 46024953-HV1
  mon addr = 192.168.1.240:6789

[mon.2]
  host = 46024953-HV3
  mon addr = 192.168.1.238:6789

[mon.1]
  host = 46024953-HV2
  mon addr = 192.168.1.239:6789

Udo

udo · Sep 8, 2016

Hi,
try to start the ceph-mon first on all three mon-nodes.

if the mon is running, you see an ceph-mon process and "ceph osd tree" will produce an output.

Udo

Search

Search

CEPH problems. No boot

Javier Estellés Dasí

Member

CTCcloud

Renowned Member

spirit

Distinguished Member

Javier Estellés Dasí

Member

udo

Distinguished Member

Javier Estellés Dasí

Member

udo

Distinguished Member

We value your privacy