Ceph help - will pay

Jeremiah Connelly

New Member
Feb 14, 2018
Hi all. I have a cluster that had 4 nodes and one died, and now the Ceph OSD will not start. I am new to being the admin and the old admin is just as green as me on Ceph. I am looking for paid help to get this going quickly. Right now I have a $100 budget.

root@px1:~# ceph -v
ceph version 0.94.10

Proxmox Virtual Environment 4.4-22/2728f613

Code:
root@px1:~# systemctl status ceph
● ceph.service - LSB: Start Ceph distributed file system daemons at boot time
   Loaded: loaded (/etc/init.d/ceph)
   Active: active (exited) since Wed 2018-02-14 13:48:39 CST; 38min ago
  Process: 1376 ExecStart=/etc/init.d/ceph start (code=exited, status=0/SUCCESS)

Feb 14 13:48:09 px1 ceph[1376]: === mon.0 ===
Feb 14 13:48:09 px1 ceph[1376]: Starting Ceph mon.0 on px1...
Feb 14 13:48:09 px1 ceph[1376]: Running as unit ceph-mon.0.1518637689.639641050.service.
Feb 14 13:48:09 px1 ceph[1376]: Starting ceph-create-keys on px1...
Feb 14 13:48:10 px1 ceph[1376]: === osd.0 ===
Feb 14 13:48:14 px1 ceph[1376]: 2018-02-14 13:48:14.655908 7fe4984c5700  0 -- :/3597916411 >> 10.10.30.222:6789/0 pipe(0x7fe48c000c00 sd=6 :0 s=1 pgs=0 cs=0 l=1 c=0x7fe48c004ef0).fault
Feb 14 13:48:19 px1 ceph[1376]: 2018-02-14 13:48:19.739853 7fe4985c6700  0 -- :/3597916411 >> 10.10.30.221:6789/0 pipe(0x7fe48c0080e0 sd=6 :0 s=1 pgs=0 cs=0 l=1 c=0x7fe48c00c380).fault
Feb 14 13:48:25 px1 ceph[1376]: 2018-02-14 13:48:25.739876 7fe4983c4700  0 -- 10.10.30.220:0/3597916411 >> 10.10.30.221:6789/0 pipe(0x7fe48c0080e0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fe48c005300).fault
Feb 14 13:48:26 px1 ceph[1376]: 2018-02-14 13:48:26.655881 7fe4985c6700  0 -- 10.10.30.220:0/3597916411 >> 10.10.30.222:6789/0 pipe(0x7fe48c000c00 sd=6 :0 s=1 pgs=0 cs=0 l=1 c=0x7fe48c005d60).fault
Feb 14 13:48:32 px1 ceph[1376]: 2018-02-14 13:48:32.655887 7fe4984c5700  0 -- 10.10.30.220:0/3597916411 >> 10.10.30.222:6789/0 pipe(0x7fe48c000c00 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fe48c006fc0).fault
Feb 14 13:48:39 px1 ceph[1376]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.0 --keyring=/etc/pve/priv/ceph/ceph-rbd.keyring osd crush create-or-move -- 0 0.80 host=px1 root=default'
Feb 14 13:48:39 px1 ceph[1376]: ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.0']' returned non-zero exit status 1
Feb 14 13:48:39 px1 ceph[1376]: ceph-disk: Error: One or more partitions failed to activate
Feb 14 13:48:39 px1 systemd[1]: Started LSB: Start Ceph distributed file system daemons at boot time.
root@px1:~#
 
Are you sure the disk is still available in the server? Check with “lsblk”. If you see the disk with “lsblk” (in this example as /dev/sdb) but it is not mounted (not showing up with “df -h”), try to mount it manually:

# mount /dev/sdb2 /var/lib/ceph/osd/ceph-0
# start ceph-osd id=0

Or try using the PVE GUI.

Now check with “ceph -s” whether all OSDs are up and in. If they are, wait for HEALTH_OK and reboot the node to see if everything still works after a reboot. If not, remove the OSD using the PVE GUI, re-add it and try again. Rebalancing may take some time (wait for HEALTH_OK).
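For reference, the GUI removal roughly corresponds to the standard Ceph CLI sequence below. This is only a sketch, assuming the monitors are reachable and the OSD in question really is osd.0 (adjust the ID if not):

# ceph osd tree
# ceph osd out 0
# ceph osd crush remove osd.0
# ceph auth del osd.0
# ceph osd rm 0

“ceph osd tree” shows the up/in state per OSD, which is easier to read than “ceph -s” for this. After removal, re-create the OSD from the GUI and wait for HEALTH_OK again.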

If the problem is not solved, please post the output of:

# lsblk
# df -h
# ls -l /var/lib/ceph/osd/ceph-0
# ceph -s
# cat /etc/pve/ceph.conf

If this helped you solve your problem, don’t send the 100 dollars to me, but send them as a donation to the PVE project (PayPal: office@maurer-it.com), so we all benefit from it. Thanks!
 
Where did you get sdb2?
I am showing ├─cciss!c0d1p1

start ceph-osd id=0
-bash: start: command not found


Code:
root@px1:~# lsblk
NAME               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sr0                 11:0    1  1024M  0 rom
cciss!c0d0         104:0    0 136.7G  0 disk
├─cciss!c0d0p1     104:1    0     1M  0 part
├─cciss!c0d0p2     104:2    0   256M  0 part
└─cciss!c0d0p3     104:3    0 136.5G  0 part
  ├─pve-root       251:0    0    34G  0 lvm  /
  ├─pve-swap       251:1    0     8G  0 lvm  [SWAP]
  ├─pve-data_tmeta 251:2    0    80M  0 lvm
  │ └─pve-data     251:4    0  78.5G  0 lvm
  └─pve-data_tdata 251:3    0  78.5G  0 lvm
    └─pve-data     251:4    0  78.5G  0 lvm
cciss!c0d1         104:16   0 820.2G  0 disk
├─cciss!c0d1p1     104:17   0 815.2G  0 part /var/lib/ceph/osd/ceph-0
└─cciss!c0d1p2     104:18   0     5G  0 part

Code:
root@px1:~#  df -h
Filesystem         Size  Used Avail Use% Mounted on
udev                10M     0   10M   0% /dev
tmpfs              6.3G  8.8M  6.3G   1% /run
/dev/dm-0           34G   11G   21G  34% /
tmpfs               16G   48M   16G   1% /dev/shm
tmpfs              5.0M     0  5.0M   0% /run/lock
tmpfs               16G     0   16G   0% /sys/fs/cgroup
/dev/fuse           30M   36K   30M   1% /etc/pve
/dev/cciss/c0d1p1  815G  149G  666G  19% /var/lib/ceph/osd/ceph-0

Code:
root@px1:~# ls -l /var/lib/ceph/osd/ceph-0
total 44
-rw-r--r--   1 root root  469 Jan 13  2017 activate.monmap
-rw-r--r--   1 root root    3 Jan 13  2017 active
-rw-r--r--   1 root root   37 Jan 13  2017 ceph_fsid
drwxr-xr-x 103 root root 1621 Jan 22 00:53 current
-rw-r--r--   1 root root   37 Jan 13  2017 fsid
lrwxrwxrwx   1 root root   58 Jan 13  2017 journal -> /dev/disk/by-partuuid/4ec3c8fd-611d-4b86-8d97-b6035095f14a
-rw-r--r--   1 root root   37 Jan 13  2017 journal_uuid
-rw-------   1 root root   56 Jan 13  2017 keyring
-rw-r--r--   1 root root   21 Jan 13  2017 magic
-rw-r--r--   1 root root    6 Jan 13  2017 ready
-rw-r--r--   1 root root    4 Jan 13  2017 store_version
-rw-r--r--   1 root root   53 Jan 13  2017 superblock
-rw-r--r--   1 root root    0 Feb 14 13:48 sysvinit
-rw-r--r--   1 root root    2 Jan 13  2017 whoami

Code:
root@px1:~# ceph -s
2018-02-15 14:23:23.078094 7f615a5fa700  0 -- 10.10.30.220:0/2742571302 >> 10.10.30.220:6789/0 pipe(0x7f614c000da0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f614c005040).fault
2018-02-15 14:23:32.078769 7f615a6fb700  0 -- 10.10.30.220:0/2742571302 >> 10.10.30.220:6789/0 pipe(0x7f614c006e20 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f614c005bd0).fault
2018-02-15 14:23:38.079248 7f615a7fc700  0 -- 10.10.30.220:0/2742571302 >> 10.10.30.220:6789/0 pipe(0x7f614c006e20 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f614c00ebd0).fault
2018-02-15 14:23:50.080162 7f615a5fa700  0 -- 10.10.30.220:0/2742571302 >> 10.10.30.220:6789/0 pipe(0x7f614c006e20 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f614c011bf0).fault
…and it keeps going like that.

Code:
root@px1:~#  cat /etc/pve/ceph.conf
[global]
         auth client required = cephx
         auth cluster required = cephx
         auth service required = cephx
         cluster network = 10.10.30.0/24
         filestore xattr use omap = true
         fsid = 1945fcfb-35e0-46da-89b3-f7b0ff909786
         keyring = /etc/pve/priv/$cluster.$name.keyring
         osd journal size = 5120
         osd pool default min size = 1
         public network = 10.10.30.0/24

[osd]
#        keyring = /var/lib/ceph/osd/ceph-$id/keyring
         keyring = /etc/pve/priv/ceph/ceph-rbd.keyring

[mon.0]
         host = px1
         mon addr = 10.10.30.220:6789

[mon.1]
         host = px2
         mon addr = 10.10.30.221:6789

[mon.2]
         host = px3
         mon addr = 10.10.30.222:6789
 
/dev/sdb2 was just an example, in your case it’s /dev/cciss/c0d1p1.
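So the mount command from my earlier post would become the following. Note, though, that your “df -h” output already shows this partition mounted on /var/lib/ceph/osd/ceph-0, so the mount step itself is not the problem and can be skipped:

# mount /dev/cciss/c0d1p1 /var/lib/ceph/osd/ceph-0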

You mean it times out when you stop and remove the OSD using the GUI? The OSD is mounted (and visible to the server), so I don’t think it is a hardware issue, probably just a faulty filesystem. If you don’t have much know-how about this kind of issue, the simplest way is to stop and remove the OSD, and re-create it after that.
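If the GUI keeps timing out, you can try the CLI counterparts instead. I’m assuming here that the subcommand names on your PVE 4.4 install are as below and that the OSD data disk is /dev/cciss/c0d1 (check “pveceph help” on your node to confirm):

# pveceph destroyosd 0
# pveceph createosd /dev/cciss/c0d1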

Edit: you can also try to start the OSD manually; it may work a bit differently on your system. Check:

# systemctl status | grep osd
You should see a service name in the output, something like “ceph-osd@0.nodename.service”; then try:
# systemctl restart ceph-osd@0.nodename.service
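If there is no ceph-osd@ unit on your node (the “sysvinit” marker file in your /var/lib/ceph/osd/ceph-0 listing suggests the daemons are still managed by the old init script on this Hammer install), the sysvinit-style equivalent would be something like:

# service ceph start osd.0
or
# /etc/init.d/ceph start osd.0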