Failed to start Ceph disk activation: /dev/sd* and OSDs down after Proxmox upgrade to v6

Kaboom

Configuration: 6 nodes with Ceph and Proxmox 5.

I am currently upgrading Proxmox to version 6 (now running Corosync version 3). I have NOT upgraded Ceph yet. But after upgrading the first node I get this Ceph error; everything is still up except these OSDs:

systemctl status ceph-disk@dev-sdb1.service
ceph-disk@dev-sdb1.service - Ceph disk activation: /dev/sdb1
Loaded: loaded (/lib/systemd/system/ceph-disk@.service; static; vendor preset: enabled)
Drop-In: /lib/systemd/system/ceph-disk@.service.d
└─ceph-after-pve-cluster.conf
Active: inactive (dead)

=====

systemctl status ceph-disk@dev-sde1.service
ceph-disk@dev-sde1.service - Ceph disk activation: /dev/sde1
Loaded: loaded (/lib/systemd/system/ceph-disk@.service; static; vendor preset: enabled)
Drop-In: /lib/systemd/system/ceph-disk@.service.d
└─ceph-after-pve-cluster.conf
Active: failed (Result: exit-code) since Thu 2019-10-24 19:32:04 CEST; 34min ago
Main PID: 3600 (code=exited, status=1/FAILURE)

Oct 24 19:32:04 node002 sh[3600]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5736, in run
Oct 24 19:32:04 node002 sh[3600]: main(sys.argv[1:])
Oct 24 19:32:04 node002 sh[3600]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5687, in main
Oct 24 19:32:04 node002 sh[3600]: args.func(args)
Oct 24 19:32:04 node002 sh[3600]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4890, in main_trigger
Oct 24 19:32:04 node002 sh[3600]: raise Error('return code ' + str(ret))
Oct 24 19:32:04 node002 sh[3600]: ceph_disk.main.Error: Error: return code 1
Oct 24 19:32:04 node002 systemd[1]: ceph-disk@dev-sde1.service: Main process exited, code=exited, status=1/FAILURE
Oct 24 19:32:04 node002 systemd[1]: ceph-disk@dev-sde1.service: Failed with result 'exit-code'.
Oct 24 19:32:04 node002 systemd[1]: Failed to start Ceph disk activation: /dev/sde1.
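The difference between the two units above (one merely inactive, one failed) is the key detail when checking several of these services at once. A minimal sketch of pulling the Active: state out of captured `systemctl status` output; the `unit_state` helper and the abbreviated sample texts are illustrative, not part of any systemd or Ceph tooling:

```python
import re

def unit_state(status_text):
    """Return the word after 'Active:' in `systemctl status` output."""
    m = re.search(r"^\s*Active:\s*(\w+)", status_text, re.MULTILINE)
    return m.group(1) if m else None

# Abbreviated samples from the two units above
sdb1 = "Loaded: loaded (/lib/systemd/system/ceph-disk@.service; static)\nActive: inactive (dead)\n"
sde1 = "Loaded: loaded (/lib/systemd/system/ceph-disk@.service; static)\nActive: failed (Result: exit-code)\n"

print(unit_state(sdb1))  # inactive
print(unit_state(sde1))  # failed
```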


Thanks,
Tom
 
The OSDs are down on this node, but Ceph is still running. I also can't run any 'ceph' commands on this node.

On another node, 'ceph health':
HEALTH_WARN 1/3 mons down, quorum node003,node004

On another node, 'ceph -s':

cluster:
id: 09935430-cfe7-48d4-ac66-c02e0455d95de
health: HEALTH_WARN
1/3 mons down, quorum node003,node004

services:
mon: 3 daemons, quorum node003,node004, out of quorum: node002
mgr: node003(active), standbys: node004
osd: 36 osds: 30 up, 30 in

data:
pools: 1 pools, 1024 pgs
objects: 942.07k objects, 3.18TiB
usage: 9.49TiB used, 3.61TiB / 13.1TiB avail
pgs: 1024 active+clean

io:
client: 632KiB/s rd, 12.1MiB/s wr, 67op/s rd, 457op/s wr

=====

I thought everything had to be healthy before upgrading Ceph... or can (or must) I just continue with the upgrade to Ceph Nautilus?
 
Yes, I followed the upgrade guide. Here is my output:

proxmox-ve: 6.0-2 (running kernel: 5.0.21-3-pve)
pve-manager: 6.0-9 (running version: 6.0-9/508dcee0)
pve-kernel-5.0: 6.0-9
pve-kernel-helper: 6.0-9
pve-kernel-4.15: 5.4-9
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 12.2.12-pve1
ceph-fuse: 12.2.12-pve1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-5
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-65
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-8
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-7
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.0-7
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-9
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve1
 
The only thing that went wrong was that I had installed KernelCare. You have to remove it before you start the upgrade, so I had to restart the upgrade, but it finished.
 
Yes, I followed the upgrade guide. Here is my output:
Can you please post the output as a quote or code block? It is not readable. :/

I can't run 'ceph versions' on the node with the failed OSDs; it doesn't give any output.
Run it on a node that works.

The only thing that went wrong was that I had installed KernelCare. You have to remove it before you start the upgrade, so I had to restart the upgrade, but it finished.
Well.

HEALTH_WARN 1/3 mons down, quorum node003,node004
No Ceph service is running on that node. It seems the upgrade didn't install everything correctly. Check the apt logs; there may be some information about what didn't get updated properly.
 
ceph versions
{
"mon": {
"ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous (stable)": 2
},
"mgr": {
"ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous (stable)": 3
},
"osd": {
"ceph version 12.2.11 (c96e82ac735a75ae99d4847983711e1f2dbf12e5) luminous (stable)": 6,
"ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous (stable)": 24
},
"mds": {},
"overall": {
"ceph version 12.2.11 (c96e82ac735a75ae99d4847983711e1f2dbf12e5) luminous (stable)": 6,
"ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous (stable)": 29
}
}
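The mixed versions above (six OSDs still on 12.2.11) can also be spotted programmatically from the `ceph versions` JSON. A rough sketch, assuming the output has been captured as text; `lagging_daemons` is a hypothetical helper, and the version strings below are abbreviated with (...) in place of the full commit hashes:

```python
import json

# `ceph versions` output from above, abbreviated for the sketch
versions_json = """
{
  "osd": {
    "ceph version 12.2.11 (...) luminous (stable)": 6,
    "ceph version 12.2.12 (...) luminous (stable)": 24
  }
}
"""

def lagging_daemons(raw, daemon="osd"):
    """Return {version_string: count} for daemons not on the newest version seen."""
    counts = json.loads(raw).get(daemon, {})
    if not counts:
        return {}
    # Lexical max is enough for 12.2.11 vs 12.2.12; a real version
    # comparison would need a proper parser.
    newest = max(counts)
    return {v: n for v, n in counts.items() if v != newest}

print(lagging_daemons(versions_json))
# {'ceph version 12.2.11 (...) luminous (stable)': 6}
```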
 
ceph -s

cluster:
id: 09935360-cfe7-48d4-ac76-c02e0fdd95de
health: HEALTH_OK

services:
mon: 2 daemons, quorum node003,node004
mgr: node003(active), standbys: node004, node006
osd: 36 osds: 30 up, 30 in

data:
pools: 1 pools, 1024 pgs
objects: 941.03k objects, 3.15TiB
usage: 9.39TiB used, 3.70TiB / 13.1TiB avail
pgs: 1024 active+clean

io:
client: 670KiB/s rd, 10.8MiB/s wr, 91op/s rd, 494op/s wr
 
"ceph version 12.2.11 (c96e82ac735a75ae99d4847983711e1f2dbf12e5) luminous (stable)": 6,
It seems these are the OSDs in trouble. Try to start them and check the output of journalctl -u ceph-osd@<ID>.service.

mon: 2 daemons, quorum node003,node004
Did you remove the MON? Please re-create it to get proper quorum.

Run debsums -s to check the MD5 sums of the installed packages
 
No output with debsums -s.

I added another monitor on another node (node002 is the problem node):

ceph -s
cluster:
id: 09935360-cfe7-48d4-ac76-c02e0fdd95de
health: HEALTH_OK

services:
mon: 2 daemons, quorum node003,node004
mgr: node003(active), standbys: node004, node006
osd: 36 osds: 30 up, 30 in

data:
pools: 1 pools, 1024 pgs
objects: 943.17k objects, 3.17TiB
usage: 9.44TiB used, 3.66TiB / 13.1TiB avail
pgs: 1024 active+clean

io:
client: 1.20MiB/s rd, 11.0MiB/s wr, 75op/s rd, 522op/s wr
 
root@node002:~# sudo systemctl start ceph-osd@osd.0
Job for ceph-osd@osd.0.service failed because the control process exited with error code.
See "systemctl status ceph-osd@osd.0.service" and "journalctl -xe" for details.

root@node002:~# systemctl status ceph-osd@osd.0.service
● ceph-osd@osd.0.service - Ceph object storage daemon osd.osd.0
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; disabled; vendor preset: enabled)
Drop-In: /lib/systemd/system/ceph-osd@.service.d
└─ceph-after-pve-cluster.conf
Active: activating (auto-restart) (Result: exit-code) since Fri 2019-10-25 14:24:40 CEST; 7s ago
Process: 320883 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id osd.0 (code=exited, status=1/FAILURE)
 
ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 15.71759 root default
-3 2.61960 host node002
0 ssd 0.43660 osd.0 down 0 1.00000
1 ssd 0.43660 osd.1 down 0 1.00000
2 ssd 0.43660 osd.2 down 0 1.00000
3 ssd 0.43660 osd.3 down 0 1.00000
4 ssd 0.43660 osd.4 down 0 1.00000
5 ssd 0.43660 osd.5 down 0 1.00000
-5 2.61960 host node003
6 ssd 0.43660 osd.6 up 1.00000 1.00000
7 ssd 0.43660 osd.7 up 1.00000 1.00000
8 ssd 0.43660 osd.8 up 1.00000 1.00000
9 ssd 0.43660 osd.9 up 1.00000 1.00000
10 ssd 0.43660 osd.10 up 1.00000 1.00000
11 ssd 0.43660 osd.11 up 1.00000 1.00000
-7 2.61960 host node004
12 ssd 0.43660 osd.12 up 1.00000 1.00000
13 ssd 0.43660 osd.13 up 1.00000 1.00000
14 ssd 0.43660 osd.14 up 1.00000 1.00000
15 ssd 0.43660 osd.15 up 1.00000 1.00000
16 ssd 0.43660 osd.16 up 1.00000 1.00000
17 ssd 0.43660 osd.17 up 1.00000 1.00000
-9 2.61960 host node005
18 ssd 0.43660 osd.18 up 1.00000 1.00000
19 ssd 0.43660 osd.19 up 1.00000 1.00000
20 ssd 0.43660 osd.20 up 1.00000 1.00000
21 ssd 0.43660 osd.21 up 1.00000 1.00000
22 ssd 0.43660 osd.22 up 1.00000 1.00000
23 ssd 0.43660 osd.23 up 1.00000 1.00000
-11 2.61960 host node006
24 ssd 0.43660 osd.24 up 1.00000 1.00000
25 ssd 0.43660 osd.25 up 1.00000 1.00000
26 ssd 0.43660 osd.26 up 1.00000 1.00000
27 ssd 0.43660 osd.27 up 1.00000 1.00000
28 ssd 0.43660 osd.28 up 1.00000 1.00000
29 ssd 0.43660 osd.29 up 1.00000 1.00000
-13 2.61960 host node007
30 ssd 0.43660 osd.30 up 1.00000 1.00000
31 ssd 0.43660 osd.31 up 1.00000 1.00000
32 ssd 0.43660 osd.32 up 1.00000 1.00000
33 ssd 0.43660 osd.33 up 1.00000 1.00000
34 ssd 0.43660 osd.34 up 1.00000 1.00000
35 ssd 0.43660 osd.35 up 1.00000 1.00000
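To pull the down OSDs and their host out of a captured `ceph osd tree` dump, a quick sketch (`down_osds` is a hypothetical helper; the sample is an abbreviated excerpt of the output above):

```python
# Abbreviated excerpt of the `ceph osd tree` output above
tree = """\
-3 2.61960 host node002
0 ssd 0.43660 osd.0 down 0 1.00000
1 ssd 0.43660 osd.1 down 0 1.00000
-5 2.61960 host node003
6 ssd 0.43660 osd.6 up 1.00000 1.00000
"""

def down_osds(tree_text):
    """Map host -> list of OSD names whose STATUS column is 'down'."""
    result, host = {}, None
    for line in tree_text.splitlines():
        parts = line.split()
        if "host" in parts:
            host = parts[parts.index("host") + 1]
        elif "down" in parts:
            name = next(p for p in parts if p.startswith("osd."))
            result.setdefault(host, []).append(name)
    return result

print(down_osds(tree))  # {'node002': ['osd.0', 'osd.1']}
```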
 
ceph-disk list
mount: /var/lib/ceph/tmp/mnt.tsdm0D: /dev/sdc1 already mounted or mount point busy.
mount: /var/lib/ceph/tmp/mnt.Z4VfLh: /dev/sdd1 already mounted or mount point busy.
mount: /var/lib/ceph/tmp/mnt.Qk3ToO: /dev/sde1 already mounted or mount point busy.
mount: /var/lib/ceph/tmp/mnt.iwaDkL: /dev/sdf1 already mounted or mount point busy.
mount: /var/lib/ceph/tmp/mnt.DyMtmo: /dev/sdg1 already mounted or mount point busy.
mount: /var/lib/ceph/tmp/mnt.OViAlw: /dev/sdh1 already mounted or mount point busy.
mount: /var/lib/ceph/tmp/mnt.PrLjuK: /dev/sdc1 already mounted or mount point busy.
mount: /var/lib/ceph/tmp/mnt.QnZ3pH: /dev/sdd1 already mounted or mount point busy.
mount: /var/lib/ceph/tmp/mnt.QxgYrA: /dev/sde1 already mounted or mount point busy.
mount: /var/lib/ceph/tmp/mnt.oArdoP: /dev/sdf1 already mounted or mount point busy.
mount: /var/lib/ceph/tmp/mnt.5bN2dY: /dev/sdg1 already mounted or mount point busy.
mount: /var/lib/ceph/tmp/mnt.NnPp00: /dev/sdh1 already mounted or mount point busy.
/dev/dm-0 :
/dev/dm-2 ceph data, active, cluster ceph, osd.0, block /dev/sdc2
/dev/dm-3 ceph block, for /dev/dm-2
/dev/dm-1 :
/dev/dm-5 ceph data, active, cluster ceph, osd.1, block /dev/sdd2
/dev/dm-6 ceph block, for /dev/dm-5
/dev/dm-10 :
/dev/dm-14 ceph data, active, cluster ceph, osd.4, block /dev/sdg2
/dev/dm-15 ceph block, for /dev/dm-14
/dev/dm-11 ceph data, active, cluster ceph, osd.3, block /dev/sdf2
/dev/dm-12 ceph block, for /dev/dm-11
/dev/dm-13 :
/dev/dm-16 ceph data, active, cluster ceph, osd.5, block /dev/sdh2
/dev/dm-17 ceph block, for /dev/dm-16
/dev/dm-14 ceph data, active, cluster ceph, osd.4, block /dev/sdg2
/dev/dm-15 ceph block, for /dev/dm-14
/dev/dm-16 ceph data, active, cluster ceph, osd.5, block /dev/sdh2
/dev/dm-17 ceph block, for /dev/dm-16
/dev/dm-2 ceph data, active, cluster ceph, osd.0, block /dev/sdc2
/dev/dm-3 ceph block, for /dev/dm-2
/dev/dm-4 :
/dev/dm-8 ceph data, active, cluster ceph, osd.2, block /dev/sde2
/dev/dm-9 ceph block, for /dev/dm-8
/dev/dm-5 ceph data, active, cluster ceph, osd.1, block /dev/sdd2
/dev/dm-6 ceph block, for /dev/dm-5
/dev/dm-7 :
/dev/dm-11 ceph data, active, cluster ceph, osd.3, block /dev/sdf2
/dev/dm-12 ceph block, for /dev/dm-11
/dev/dm-8 ceph data, active, cluster ceph, osd.2, block /dev/sde2
/dev/dm-9 ceph block, for /dev/dm-8
/dev/loop0 other, unknown
/dev/loop1 other, unknown
/dev/loop2 other, unknown
/dev/loop3 other, unknown
/dev/loop4 other, unknown
/dev/loop5 other, unknown
/dev/loop6 other, unknown
/dev/loop7 other, unknown
/dev/sda :
/dev/sda1 other, 21686148-6449-6e6f-744e-656564454649
/dev/sda2 other, zfs_member
/dev/sda9 other, 6a945a3b-1dd2-11b2-99a6-080020736631
/dev/sdb :
/dev/sdb1 other, 21686148-6449-6e6f-744e-656564454649
/dev/sdb2 other, zfs_member
/dev/sdb9 other, 6a945a3b-1dd2-11b2-99a6-080020736631
/dev/sdc :
/dev/sdc1 ceph data, unprepared
/dev/sdc2 ceph block, for /dev/dm-2
/dev/sdd :
/dev/sdd1 ceph data, unprepared
/dev/sdd2 ceph block, for /dev/dm-5
/dev/sde :
/dev/sde1 ceph data, unprepared
/dev/sde2 ceph block, for /dev/dm-8
/dev/sdf :
/dev/sdf1 ceph data, unprepared
/dev/sdf2 ceph block, for /dev/dm-11
/dev/sdg :
/dev/sdg1 ceph data, unprepared
/dev/sdg2 ceph block, for /dev/dm-14
/dev/sdh :
/dev/sdh1 ceph data, unprepared
/dev/sdh2 ceph block, for /dev/dm-16
/dev/zd0 swap, swap
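The "ceph data, unprepared" lines above are the suspicious part: the OSD data partitions are not being recognized. A small sketch for extracting those partitions from a captured `ceph-disk list` dump (`unprepared_partitions` is a hypothetical helper; the sample is an abbreviated excerpt of the listing above):

```python
# Abbreviated excerpt of the `ceph-disk list` output above
listing = """\
/dev/sdc :
/dev/sdc1 ceph data, unprepared
/dev/sdc2 ceph block, for /dev/dm-2
/dev/sdd :
/dev/sdd1 ceph data, unprepared
"""

def unprepared_partitions(text):
    """Partitions that `ceph-disk list` reports as 'ceph data, unprepared'."""
    return [line.split()[0] for line in text.splitlines()
            if "ceph data, unprepared" in line]

print(unprepared_partitions(listing))  # ['/dev/sdc1', '/dev/sdd1']
```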
 
So what is the output of journalctl -u ceph-osd@0.service?

mon: 2 daemons, quorum node003,node004
This shows that only two MONs are set up; a third one is missing. It should at least be shown as down.
 
journalctl -u ceph-osd@0.service
-- Logs begin at Fri 2019-10-25 15:16:09 CEST, end at Fri 2019-10-25 15:59:42 CEST. --
Oct 25 15:16:31 node002 systemd[1]: Starting Ceph object storage daemon osd.0...
Oct 25 15:16:31 node002 systemd[1]: Started Ceph object storage daemon osd.0.
Oct 25 15:16:31 node002 ceph-osd[5305]: starting osd.0 at - osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal
Oct 25 15:16:37 node002 ceph-osd[5305]: 2019-10-25 15:16:37.166854 7f9063aa4e80 -1 osd.0 15510 log_to_monitors {default=true}
 
Is this particular OSD still shown as down?
 
Yes, look at this:

ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 15.71759 root default
-3 2.61960 host node002
0 ssd 0.43660 osd.0 down 0 1.00000
1 ssd 0.43660 osd.1 down 0 1.00000
2 ssd 0.43660 osd.2 down 0 1.00000
3 ssd 0.43660 osd.3 down 0 1.00000
4 ssd 0.43660 osd.4 down 0 1.00000
5 ssd 0.43660 osd.5 down 0 1.00000
-5 2.61960 host node003
6 ssd 0.43660 osd.6 up 1.00000 1.00000
7 ssd 0.43660 osd.7 up 1.00000 1.00000
8 ssd 0.43660 osd.8 up 1.00000 1.00000
9 ssd 0.43660 osd.9 up 1.00000 1.00000
10 ssd 0.43660 osd.10 up 1.00000 1.00000
11 ssd 0.43660 osd.11 up 1.00000 1.00000
-7 2.61960 host node004
12 ssd 0.43660 osd.12 up 1.00000 1.00000
13 ssd 0.43660 osd.13 up 1.00000 1.00000
14 ssd 0.43660 osd.14 up 1.00000 1.00000
15 ssd 0.43660 osd.15 up 1.00000 1.00000
16 ssd 0.43660 osd.16 up 1.00000 1.00000
17 ssd 0.43660 osd.17 up 1.00000 1.00000
-9 2.61960 host node005
18 ssd 0.43660 osd.18 up 1.00000 1.00000
19 ssd 0.43660 osd.19 up 1.00000 1.00000
20 ssd 0.43660 osd.20 up 1.00000 1.00000
21 ssd 0.43660 osd.21 up 1.00000 1.00000
22 ssd 0.43660 osd.22 up 1.00000 1.00000
23 ssd 0.43660 osd.23 up 1.00000 1.00000
-11 2.61960 host node006
24 ssd 0.43660 osd.24 up 1.00000 1.00000
25 ssd 0.43660 osd.25 up 1.00000 1.00000
26 ssd 0.43660 osd.26 up 1.00000 1.00000
27 ssd 0.43660 osd.27 up 1.00000 1.00000
28 ssd 0.43660 osd.28 up 1.00000 1.00000
29 ssd 0.43660 osd.29 up 1.00000 1.00000
-13 2.61960 host node007
30 ssd 0.43660 osd.30 up 1.00000 1.00000
31 ssd 0.43660 osd.31 up 1.00000 1.00000
32 ssd 0.43660 osd.32 up 1.00000 1.00000
33 ssd 0.43660 osd.33 up 1.00000 1.00000
34 ssd 0.43660 osd.34 up 1.00000 1.00000
35 ssd 0.43660 osd.35 up 1.00000 1.00000
 
Please check /var/log/ceph/ceph-osd.0.log for more clues. It should say why the OSDs still appear to be down.
 
