Ceph Luminous 12.1.0 -> 12.1.1 (upgrade issue)

devinacosta

Aug 3, 2017
I was running Ceph 12.1.0 and upgraded to 12.1.1 this morning on all of my nodes, then rebooted. All my services started, but the ceph-disk activation units won't come up:

ceph-disk@dev-sdb2.service loaded failed failed Ceph disk activation: /dev/sdb2
ceph-disk@dev-sdc2.service loaded failed failed Ceph disk activation: /dev/sdc2
ceph-mon@0.service loaded active running Ceph cluster monitor daemon
ceph-osd@0.service loaded active running Ceph object storage daemon osd.0
ceph-osd@1.service loaded active running Ceph object storage daemon osd.1

When I try to start the disk activation unit, I get:

Aug 03 10:09:38 pve sh[20832]: main_trigger: trigger /dev/sdb2 parttype cafecafe-9b03-4f30-b4c6-b4b80ceff106 uuid d61cdaae-e388-422e-bef5-96eadd763f95
Aug 03 10:09:38 pve sh[20832]: command: Running command: /usr/sbin/ceph-disk --verbose activate-block /dev/sdb2
Aug 03 10:09:38 pve sh[20832]: main_trigger:
Aug 03 10:09:38 pve sh[20832]: main_trigger: Traceback (most recent call last):
Aug 03 10:09:38 pve sh[20832]: File "/usr/sbin/ceph-disk", line 11, in <module>
Aug 03 10:09:38 pve sh[20832]: load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
Aug 03 10:09:38 pve sh[20832]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5731, in run
Aug 03 10:09:38 pve sh[20832]: main(sys.argv[1:])
Aug 03 10:09:38 pve sh[20832]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5682, in main
Aug 03 10:09:38 pve sh[20832]: args.func(args)
Aug 03 10:09:38 pve sh[20832]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5438, in <lambda>
Aug 03 10:09:38 pve sh[20832]: func=lambda args: main_activate_space(name, args),
Aug 03 10:09:38 pve sh[20832]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4160, in main_activate_space
Aug 03 10:09:38 pve sh[20832]: osd_uuid = get_space_osd_uuid(name, dev)
Aug 03 10:09:38 pve sh[20832]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4115, in get_space_osd_uuid
Aug 03 10:09:38 pve sh[20832]: raise Error('%s is not a block device' % path)
Aug 03 10:09:38 pve sh[20832]: ceph_disk.main.Error: Error: /dev/sdb2 is not a block device
Aug 03 10:09:38 pve sh[20832]: Traceback (most recent call last):
Aug 03 10:09:38 pve sh[20832]: File "/usr/sbin/ceph-disk", line 11, in <module>
Aug 03 10:09:38 pve sh[20832]: load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
Aug 03 10:09:38 pve sh[20832]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5731, in run
Aug 03 10:09:38 pve sh[20832]: main(sys.argv[1:])
Aug 03 10:09:38 pve sh[20832]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5682, in main
Aug 03 10:09:38 pve sh[20832]: args.func(args)
Aug 03 10:09:38 pve sh[20832]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4891, in main_trigger
Aug 03 10:09:38 pve sh[20832]: raise Error('return code ' + str(ret))
Aug 03 10:09:38 pve sh[20832]: ceph_disk.main.Error: Error: return code 1
Aug 03 10:09:38 pve systemd[1]: ceph-disk@dev-sdb2.service: Main process exited, code=exited, status=1/FAILURE
Aug 03 10:09:38 pve systemd[1]: Failed to start Ceph disk activation: /dev/sdb2.
-- Subject: Unit ceph-disk@dev-sdb2.service has failed
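
To rule out a systemd/udev ordering problem, the activation can be rerun by hand with verbose output (a sketch; activate-block is the command visible in the failing log above, and trigger --sync mirrors what the ceph-disk@ unit runs):

# Replay the udev trigger synchronously, then retry block activation directly
/usr/sbin/ceph-disk --verbose --log-stdout trigger --sync /dev/sdb2
/usr/sbin/ceph-disk --verbose activate-block /dev/sdb2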

However, the disks are block devices.

root@pve:/var/log/ceph# fdisk -l /dev/sdb
Disk /dev/sdb: 1.1 TiB, 1200210141184 bytes, 2344160432 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 262144 bytes / 262144 bytes
Disklabel type: gpt
Disk identifier: C62137AA-BE1F-4544-958F-370580155B2E

Device Start End Sectors Size Type
/dev/sdb1 2048 206847 204800 100M Ceph OSD
/dev/sdb2 206848 2344160398 2343953551 1.1T unknown
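
To cross-check the "not a block device" claim, the kernel can be asked directly (a sketch using standard coreutils against the device from the failing unit):

test -b /dev/sdb2 && echo "block device" || echo "NOT a block device"
stat -c '%F' /dev/sdb2   # expect: block special file
ls -l /dev/sdb2          # a leading 'b' in the mode string also confirms it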

This is the log from my monitor:

2017-08-03 10:05:56.793725 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
2017-08-03 10:05:59.733745 7fdae38ff700 0 mon.0@0(leader).data_health(26) update_stats avail 83% total 98252 MB, used 15838 MB, avail 82413 MB
2017-08-03 10:06:00.793037 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
2017-08-03 10:06:00.793249 7fdadc8f1700 0 log_channel(cluster) log [DBG] : mgrmap e1196: no daemons active
2017-08-03 10:06:01.795457 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
2017-08-03 10:06:05.793640 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
2017-08-03 10:06:05.793788 7fdadc8f1700 0 log_channel(cluster) log [DBG] : mgrmap e1197: no daemons active
2017-08-03 10:06:06.795670 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
2017-08-03 10:06:10.794535 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
2017-08-03 10:06:10.794708 7fdadc8f1700 0 log_channel(cluster) log [DBG] : mgrmap e1198: no daemons active
2017-08-03 10:06:11.796827 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
2017-08-03 10:06:15.795481 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
2017-08-03 10:06:15.795659 7fdadc8f1700 0 log_channel(cluster) log [DBG] : mgrmap e1199: no daemons active
2017-08-03 10:06:16.797294 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
2017-08-03 10:06:20.796496 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
2017-08-03 10:06:20.796675 7fdadc8f1700 0 log_channel(cluster) log [DBG] : mgrmap e1200: no daemons active
2017-08-03 10:06:21.798873 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
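
For what it's worth, the repeated "no daemons active" mgrmap lines mean no ceph-mgr daemon is registered with the monitors, which Luminous expects to have running; a quick check (a sketch, assuming the mgr id follows the hostname "pve" seen in the logs above):

ceph -s                                 # the services section lists active/standby mgrs
systemctl status ceph-mgr@pve.service   # "pve" is assumed from the hostname above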

Any ideas?
 
You have to make sure all the machines in your Ceph cluster are running the same, latest code. The previous build came from a development branch and did not support mixing different versions across nodes. The latest version is a release candidate (RC).
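
One quick way to verify this is to ask the cluster what each running daemon reports (a sketch; ceph versions exists from Luminous onward, so it should work on these builds):

ceph versions         # per-daemon version summary across the whole cluster
ceph --version        # run on each node to check the locally installed binaries
dpkg -l | grep ceph   # cross-check the installed package versions on a Debian/Proxmox node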

I was bitten by this yesterday too. I had to shut everything down and make sure I had the same Ceph version on all my nodes.

I also ran into the ixgbe driver's handling of unsupported SFP+ modules changing again; it took me most of the day to realize one of my Ethernet ports was down.
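
If it is the usual unsupported-SFP lockout, the ixgbe module parameter can be set persistently (a sketch; allow_unsupported_sfp is a standard ixgbe option, and the file path below is just one common convention):

# /etc/modprobe.d/ixgbe.conf
options ixgbe allow_unsupported_sfp=1
# then: update-initramfs -u && reboot (or reload the ixgbe module)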
 
That is a harmless error that is just spamming the logs - your OSDs are up, after all ;) It is fixed in 12.1.2, which will be available soon via the usual channels.
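
Once 12.1.2 lands, the usual apt-based upgrade should clear the log spam (a sketch, assuming the standard Proxmox repositories and the mon id 0 from the unit listing above):

apt update && apt dist-upgrade         # pull the 12.1.2 packages when published
systemctl restart ceph-mon@0.service   # restart the monitor so it stops logging the decode error
ceph versions                          # confirm every daemon now reports 12.1.2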
 
