Ceph Luminous 12.1.0 -> 12.1.1 (upgrade issue)

devinacosta

Aug 3, 2017
I was running Ceph 12.1.0 and upgraded to 12.1.1 this morning on all of my nodes, then rebooted. All my services started, but the ceph-disk activation units won't come up:

ceph-disk@dev-sdb2.service loaded failed failed Ceph disk activation: /dev/sdb2
ceph-disk@dev-sdc2.service loaded failed failed Ceph disk activation: /dev/sdc2
ceph-mon@0.service loaded active running Ceph cluster monitor daemon
ceph-osd@0.service loaded active running Ceph object storage daemon osd.0
ceph-osd@1.service loaded active running Ceph object storage daemon osd.1

When I try to start the disk activation unit, I get:

Aug 03 10:09:38 pve sh[20832]: main_trigger: trigger /dev/sdb2 parttype cafecafe-9b03-4f30-b4c6-b4b80ceff106 uuid d61cdaae-e388-422e-bef5-96eadd763f95
Aug 03 10:09:38 pve sh[20832]: command: Running command: /usr/sbin/ceph-disk --verbose activate-block /dev/sdb2
Aug 03 10:09:38 pve sh[20832]: main_trigger:
Aug 03 10:09:38 pve sh[20832]: main_trigger: Traceback (most recent call last):
Aug 03 10:09:38 pve sh[20832]: File "/usr/sbin/ceph-disk", line 11, in <module>
Aug 03 10:09:38 pve sh[20832]: load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
Aug 03 10:09:38 pve sh[20832]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5731, in run
Aug 03 10:09:38 pve sh[20832]: main(sys.argv[1:])
Aug 03 10:09:38 pve sh[20832]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5682, in main
Aug 03 10:09:38 pve sh[20832]: args.func(args)
Aug 03 10:09:38 pve sh[20832]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5438, in <lambda>
Aug 03 10:09:38 pve sh[20832]: func=lambda args: main_activate_space(name, args),
Aug 03 10:09:38 pve sh[20832]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4160, in main_activate_space
Aug 03 10:09:38 pve sh[20832]: osd_uuid = get_space_osd_uuid(name, dev)
Aug 03 10:09:38 pve sh[20832]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4115, in get_space_osd_uuid
Aug 03 10:09:38 pve sh[20832]: raise Error('%s is not a block device' % path)
Aug 03 10:09:38 pve sh[20832]: ceph_disk.main.Error: Error: /dev/sdb2 is not a block device
Aug 03 10:09:38 pve sh[20832]: Traceback (most recent call last):
Aug 03 10:09:38 pve sh[20832]: File "/usr/sbin/ceph-disk", line 11, in <module>
Aug 03 10:09:38 pve sh[20832]: load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
Aug 03 10:09:38 pve sh[20832]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5731, in run
Aug 03 10:09:38 pve sh[20832]: main(sys.argv[1:])
Aug 03 10:09:38 pve sh[20832]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5682, in main
Aug 03 10:09:38 pve sh[20832]: args.func(args)
Aug 03 10:09:38 pve sh[20832]: File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4891, in main_trigger
Aug 03 10:09:38 pve sh[20832]: raise Error('return code ' + str(ret))
Aug 03 10:09:38 pve sh[20832]: ceph_disk.main.Error: Error: return code 1
Aug 03 10:09:38 pve systemd[1]: ceph-disk@dev-sdb2.service: Main process exited, code=exited, status=1/FAILURE
Aug 03 10:09:38 pve systemd[1]: Failed to start Ceph disk activation: /dev/sdb2.
-- Subject: Unit ceph-disk@dev-sdb2.service has failed
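
To rule out a systemd/udev ordering problem, the activation can be rerun by hand with verbose output (a sketch; activate-block is the command visible in the failing log above, and trigger --sync mirrors what the ceph-disk@ unit runs):

# Replay the udev trigger synchronously, then retry block activation directly
/usr/sbin/ceph-disk --verbose --log-stdout trigger --sync /dev/sdb2
/usr/sbin/ceph-disk --verbose activate-block /dev/sdb2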

However, the disks are block devices.

root@pve:/var/log/ceph# fdisk -l /dev/sdb
Disk /dev/sdb: 1.1 TiB, 1200210141184 bytes, 2344160432 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 262144 bytes / 262144 bytes
Disklabel type: gpt
Disk identifier: C62137AA-BE1F-4544-958F-370580155B2E

Device Start End Sectors Size Type
/dev/sdb1 2048 206847 204800 100M Ceph OSD
/dev/sdb2 206848 2344160398 2343953551 1.1T unknown
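
To cross-check the "not a block device" claim, the kernel can be asked directly (a sketch using standard coreutils against the device from the failing unit):

test -b /dev/sdb2 && echo "block device" || echo "NOT a block device"
stat -c '%F' /dev/sdb2   # expect: block special file
ls -l /dev/sdb2          # a leading 'b' in the mode string also confirms it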

This is the log from my monitor:

2017-08-03 10:05:56.793725 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
2017-08-03 10:05:59.733745 7fdae38ff700 0 mon.0@0(leader).data_health(26) update_stats avail 83% total 98252 MB, used 15838 MB, avail 82413 MB
2017-08-03 10:06:00.793037 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
2017-08-03 10:06:00.793249 7fdadc8f1700 0 log_channel(cluster) log [DBG] : mgrmap e1196: no daemons active
2017-08-03 10:06:01.795457 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
2017-08-03 10:06:05.793640 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
2017-08-03 10:06:05.793788 7fdadc8f1700 0 log_channel(cluster) log [DBG] : mgrmap e1197: no daemons active
2017-08-03 10:06:06.795670 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
2017-08-03 10:06:10.794535 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
2017-08-03 10:06:10.794708 7fdadc8f1700 0 log_channel(cluster) log [DBG] : mgrmap e1198: no daemons active
2017-08-03 10:06:11.796827 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
2017-08-03 10:06:15.795481 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
2017-08-03 10:06:15.795659 7fdadc8f1700 0 log_channel(cluster) log [DBG] : mgrmap e1199: no daemons active
2017-08-03 10:06:16.797294 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
2017-08-03 10:06:20.796496 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
2017-08-03 10:06:20.796675 7fdadc8f1700 0 log_channel(cluster) log [DBG] : mgrmap e1200: no daemons active
2017-08-03 10:06:21.798873 7fdadc8f1700 -1 mon.0@0(leader).mgrstat failed to decode mgrstat state; luminous dev version?
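
For what it's worth, the repeated "no daemons active" mgrmap lines mean no ceph-mgr daemon is registered with the monitors, which Luminous expects to have running; a quick check (a sketch, assuming the mgr id follows the hostname "pve" seen in the logs above):

ceph -s                                 # the services section lists active/standby mgrs
systemctl status ceph-mgr@pve.service   # "pve" is assumed from the hostname above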

Any ideas?
 
You have to make sure all the machines in your Ceph cluster are running the same, latest code. The previous build came from a development branch and did not support mixing different versions across nodes. The latest version is a release candidate (RC).
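
One quick way to verify this is to ask the cluster what each running daemon reports (a sketch; ceph versions exists from Luminous onward, so it should work on these builds):

ceph versions         # per-daemon version summary across the whole cluster
ceph --version        # run on each node to check the locally installed binaries
dpkg -l | grep ceph   # cross-check the installed package versions on a Debian/Proxmox node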

I was bitten by this yesterday too. I had to shut everything down and make sure I had the same Ceph version on all my nodes.

I also ran into the ixgbe driver's handling of unsupported SFP+ modules changing again; it took me most of the day to realize one of my Ethernet ports was down.
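
If it is the usual unsupported-SFP lockout, the ixgbe module parameter can be set persistently (a sketch; allow_unsupported_sfp is a standard ixgbe option, and the file path below is just one common convention):

# /etc/modprobe.d/ixgbe.conf
options ixgbe allow_unsupported_sfp=1
# then: update-initramfs -u && reboot (or reload the ixgbe module)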
 
That is a harmless error that is just spamming the logs - your OSDs are up, after all ;) It is fixed in 12.1.2, which will be available soon via the usual channels.
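
Once 12.1.2 lands, the usual apt-based upgrade should clear the log spam (a sketch, assuming the standard Proxmox repositories and the mon id 0 from the unit listing above):

apt update && apt dist-upgrade         # pull the 12.1.2 packages when published
systemctl restart ceph-mon@0.service   # restart the monitor so it stops logging the decode error
ceph versions                          # confirm every daemon now reports 12.1.2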
 
