Ceph help - will pay

Jeremiah Connelly

New Member
Feb 14, 2018
Hi all. I have a cluster that had 4 nodes; one died, and now the Ceph OSD will not start. I am new to being the admin, and the old admin is just as green as me on Ceph. I am looking for paid help to get this going quickly. Right now I have a $100 budget.

root@px1:~# ceph -v
ceph version 0.94.10

Proxmox Virtual Environment 4.4-22/2728f613

Code:
root@px1:~# systemctl status ceph
● ceph.service - LSB: Start Ceph distributed file system daemons at boot time
   Loaded: loaded (/etc/init.d/ceph)
   Active: active (exited) since Wed 2018-02-14 13:48:39 CST; 38min ago
  Process: 1376 ExecStart=/etc/init.d/ceph start (code=exited, status=0/SUCCESS)

Feb 14 13:48:09 px1 ceph[1376]: === mon.0 ===
Feb 14 13:48:09 px1 ceph[1376]: Starting Ceph mon.0 on px1...
Feb 14 13:48:09 px1 ceph[1376]: Running as unit ceph-mon.0.1518637689.639641050.service.
Feb 14 13:48:09 px1 ceph[1376]: Starting ceph-create-keys on px1...
Feb 14 13:48:10 px1 ceph[1376]: === osd.0 ===
Feb 14 13:48:14 px1 ceph[1376]: 2018-02-14 13:48:14.655908 7fe4984c5700  0 -- :/3597916411 >> 10.10.30.222:6789/0 pipe(0x7fe48c000c00 sd=6 :0 s=1 pgs=0 cs=0 l=1 c=0x7fe48c004ef0).fault
Feb 14 13:48:19 px1 ceph[1376]: 2018-02-14 13:48:19.739853 7fe4985c6700  0 -- :/3597916411 >> 10.10.30.221:6789/0 pipe(0x7fe48c0080e0 sd=6 :0 s=1 pgs=0 cs=0 l=1 c=0x7fe48c00c380).fault
Feb 14 13:48:25 px1 ceph[1376]: 2018-02-14 13:48:25.739876 7fe4983c4700  0 -- 10.10.30.220:0/3597916411 >> 10.10.30.221:6789/0 pipe(0x7fe48c0080e0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fe48c005300).fault
Feb 14 13:48:26 px1 ceph[1376]: 2018-02-14 13:48:26.655881 7fe4985c6700  0 -- 10.10.30.220:0/3597916411 >> 10.10.30.222:6789/0 pipe(0x7fe48c000c00 sd=6 :0 s=1 pgs=0 cs=0 l=1 c=0x7fe48c005d60).fault
Feb 14 13:48:32 px1 ceph[1376]: 2018-02-14 13:48:32.655887 7fe4984c5700  0 -- 10.10.30.220:0/3597916411 >> 10.10.30.222:6789/0 pipe(0x7fe48c000c00 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fe48c006fc0).fault
Feb 14 13:48:39 px1 ceph[1376]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.0 --keyring=/etc/pve/priv/ceph/ceph-rbd.keyring osd crush create-or-move -- 0 0.80 host=px1 root=default'
Feb 14 13:48:39 px1 ceph[1376]: ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.0']' returned non-zero exit status 1
Feb 14 13:48:39 px1 ceph[1376]: ceph-disk: Error: One or more partitions failed to activate
Feb 14 13:48:39 px1 systemd[1]: Started LSB: Start Ceph distributed file system daemons at boot time.
root@px1:~#
 
Are you sure the disk is still available in the server? Check with “lsblk”. If the disk shows up in “lsblk” (in this example as /dev/sdb) but is not mounted (not showing up in “df -h”), try to mount it manually and start the OSD:

# mount /dev/sdb2 /var/lib/ceph/osd/ceph-0
# start ceph-osd id=0

Or try using PVE GUI.

Now check with “ceph -s” whether all OSDs are up and in. If they are, wait for HEALTH_OK and reboot the node to see if everything still works after the reboot. If not, remove the OSD using the PVE GUI, re-add it, and try again (see the sketch below for a command-line equivalent). Rebalancing may take some time (wait for HEALTH_OK).
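
If you prefer the command line over the GUI, the remove/re-create sequence would look roughly like this. This is only a sketch, assuming the OSD id is 0 and /dev/sdX stands in for the OSD's data disk; it is not exactly what the GUI does, the ceph commands only work while the monitors are reachable, and “pveceph createosd” wipes the given disk, so double-check the device name first:

# ceph osd out 0
# service ceph stop osd.0
# ceph osd crush remove osd.0
# ceph auth del osd.0
# ceph osd rm 0
# umount /var/lib/ceph/osd/ceph-0
# pveceph createosd /dev/sdX

After the new OSD comes up, Ceph will backfill onto it; wait for HEALTH_OK before touching anything else.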

If problem is not solved, please post output of:

# lsblk
# df -h
# ls -l /var/lib/ceph/osd/ceph-0
# ceph -s
# cat /etc/pve/ceph.conf

If this helped you solve your problem, don’t send the 100 dollars to me; send them as a donation to the PVE project (PayPal: office@maurer-it.com) so we all benefit from it. Thanks!
 
Where did you get sdb2 from? My disk shows up as ├─cciss!c0d1p1.

start ceph-osd id=0
-bash: start: command not found


Code:
root@px1:~# lsblk
NAME               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sr0                 11:0    1  1024M  0 rom
cciss!c0d0         104:0    0 136.7G  0 disk
├─cciss!c0d0p1     104:1    0     1M  0 part
├─cciss!c0d0p2     104:2    0   256M  0 part
└─cciss!c0d0p3     104:3    0 136.5G  0 part
  ├─pve-root       251:0    0    34G  0 lvm  /
  ├─pve-swap       251:1    0     8G  0 lvm  [SWAP]
  ├─pve-data_tmeta 251:2    0    80M  0 lvm
  │ └─pve-data     251:4    0  78.5G  0 lvm
  └─pve-data_tdata 251:3    0  78.5G  0 lvm
    └─pve-data     251:4    0  78.5G  0 lvm
cciss!c0d1         104:16   0 820.2G  0 disk
├─cciss!c0d1p1     104:17   0 815.2G  0 part /var/lib/ceph/osd/ceph-0
└─cciss!c0d1p2     104:18   0     5G  0 part

Code:
root@px1:~#  df -h
Filesystem         Size  Used Avail Use% Mounted on
udev                10M     0   10M   0% /dev
tmpfs              6.3G  8.8M  6.3G   1% /run
/dev/dm-0           34G   11G   21G  34% /
tmpfs               16G   48M   16G   1% /dev/shm
tmpfs              5.0M     0  5.0M   0% /run/lock
tmpfs               16G     0   16G   0% /sys/fs/cgroup
/dev/fuse           30M   36K   30M   1% /etc/pve
/dev/cciss/c0d1p1  815G  149G  666G  19% /var/lib/ceph/osd/ceph-0

Code:
root@px1:~# ls -l /var/lib/ceph/osd/ceph-0
total 44
-rw-r--r--   1 root root  469 Jan 13  2017 activate.monmap
-rw-r--r--   1 root root    3 Jan 13  2017 active
-rw-r--r--   1 root root   37 Jan 13  2017 ceph_fsid
drwxr-xr-x 103 root root 1621 Jan 22 00:53 current
-rw-r--r--   1 root root   37 Jan 13  2017 fsid
lrwxrwxrwx   1 root root   58 Jan 13  2017 journal -> /dev/disk/by-partuuid/4ec3c8fd-611d-4b86-8d97-b6035095f14a
-rw-r--r--   1 root root   37 Jan 13  2017 journal_uuid
-rw-------   1 root root   56 Jan 13  2017 keyring
-rw-r--r--   1 root root   21 Jan 13  2017 magic
-rw-r--r--   1 root root    6 Jan 13  2017 ready
-rw-r--r--   1 root root    4 Jan 13  2017 store_version
-rw-r--r--   1 root root   53 Jan 13  2017 superblock
-rw-r--r--   1 root root    0 Feb 14 13:48 sysvinit
-rw-r--r--   1 root root    2 Jan 13  2017 whoami

Code:
root@px1:~# ceph -s
2018-02-15 14:23:23.078094 7f615a5fa700  0 -- 10.10.30.220:0/2742571302 >> 10.10.30.220:6789/0 pipe(0x7f614c000da0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f614c005040).fault
2018-02-15 14:23:32.078769 7f615a6fb700  0 -- 10.10.30.220:0/2742571302 >> 10.10.30.220:6789/0 pipe(0x7f614c006e20 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f614c005bd0).fault
2018-02-15 14:23:38.079248 7f615a7fc700  0 -- 10.10.30.220:0/2742571302 >> 10.10.30.220:6789/0 pipe(0x7f614c006e20 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f614c00ebd0).fault
2018-02-15 14:23:50.080162 7f615a5fa700  0 -- 10.10.30.220:0/2742571302 >> 10.10.30.220:6789/0 pipe(0x7f614c006e20 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f614c011bf0).fault
(this output keeps repeating)

Code:
root@px1:~#  cat /etc/pve/ceph.conf
[global]
         auth client required = cephx
         auth cluster required = cephx
         auth service required = cephx
         cluster network = 10.10.30.0/24
         filestore xattr use omap = true
         fsid = 1945fcfb-35e0-46da-89b3-f7b0ff909786
         keyring = /etc/pve/priv/$cluster.$name.keyring
         osd journal size = 5120
         osd pool default min size = 1
         public network = 10.10.30.0/24

[osd]
#        keyring = /var/lib/ceph/osd/ceph-$id/keyring
         keyring = /etc/pve/priv/ceph/ceph-rbd.keyring

[mon.0]
         host = px1
         mon addr = 10.10.30.220:6789

[mon.1]
         host = px2
         mon addr = 10.10.30.221:6789

[mon.2]
         host = px3
         mon addr = 10.10.30.222:6789
 
/dev/sdb2 was just an example, in your case it’s /dev/cciss/c0d1p1.
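
So if the OSD directory were not mounted yet (your df -h shows it already is), the earlier suggestion adapted to your device naming would be something like:

# mount /dev/cciss/c0d1p1 /var/lib/ceph/osd/ceph-0
# service ceph start osd.0

The second line is the sysvinit form; the “start ceph-osd id=0” syntax from my first reply is Upstart, which Proxmox does not use, hence your “command not found”.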

You mean when you stop and remove the OSD using the GUI, it times out? The OSD is mounted (and visible to the server), so I don’t think it is a hardware issue; probably just a faulty filesystem. If you don’t have much know-how about this kind of issue, the simplest way is to stop and remove the OSD and re-create it after that.

Edit: you can also try to start the OSD manually; it may work a bit differently on your system. Check:

# systemctl status | grep osd
You should get a service name in the output, something like “ceph-osd@0.nodename.service”; then try:
# systemctl restart ceph-osd@0.nodename.service
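
That said, the “sysvinit” marker file in your ls output and the failing command in your log ('/usr/sbin/service', 'ceph', ..., 'start', 'osd.0') suggest this OSD is managed by the old sysvinit script rather than per-OSD systemd units, so the grep may come back empty. In that case (just a guess based on that marker file) the manual start would be:

# service ceph start osd.0
or
# /etc/init.d/ceph start osd.0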
 
