ceph hammer to jewel upgrade stuck

May 29, 2019
Hi everybody,

I'm still new to Ceph. After checking the forums I am still unable to solve a Ceph-related problem. I managed to follow the upgrade procedure from PVE 4.1 with hammer to PVE 4.4 (Virtual Environment 4.4-24) with jewel for two of my three nodes. The third node failed to upgrade due to a mistake on my end; it was unable to boot because the boot partition could not be found.

Following various forum posts, I reinstalled the node, removed the original node's information from the cluster, and then re-added the freshly installed node to the PVE cluster.

Now I would like to re-add the 5 OSDs to the Ceph cluster, but they do not show up. In the crush map I can see what look like placeholders for the original OSDs from the failed node.

[attachment: crush-map.png]

I zapped and re-created the OSDs on the re-installed node, but they just don't appear. I also see that on the other two old nodes the disks show "osd.X" under the "use" column in the GUI, whereas the fresh node shows "Partitions".

[attachment: ceph-status.png]

What am I missing? Do I need to edit the crush map?

I want to get the cluster healthy before proceeding with the upgrade to Ceph luminous and PVE 5.4. How should I proceed?

Kind regards,
Lorenz
 
What I gathered so far: the OSDs still exist in the cluster and have to be removed to (at least partially) clear the health state.

Off the top of my head, do the following:
Code:
# mark the OSD out so it no longer receives data
ceph osd out <osd.id>
# remove it from the crush map
ceph osd crush rm <osd.id>
# delete its authentication key
ceph auth del <osd.id>
# finally remove the OSD itself from the cluster
ceph osd rm <osd.id>
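
For example, assuming the OSDs that belonged to the failed node were osd.0 through osd.4 (adjust to whatever IDs show up as leftovers in your crush map):
Code:
ceph osd out osd.0
ceph osd crush rm osd.0
ceph auth del osd.0
ceph osd rm osd.0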

Then check that you have three MONs running. If the clock skew warning doesn't go away, you may need to restart the MON on the "new" node.
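
A quick sketch for checking the monitors and restarting the one on the re-installed node (the unit name below is an assumption, use whatever MON id ceph mon stat reports for that host):
Code:
# show the monitor map and quorum members
ceph mon stat
ceph quorum_status
# restart the monitor on the re-installed node, e.g. if its MON id is "pve02"
systemctl restart ceph-mon@pve02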

After that, follow the rest of the upgrade guide to set the tunables and get rid of the remaining health warnings.
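
For reference, a sketch of those steps (only run them once all OSDs are actually on jewel, and note that changing the tunables will trigger data movement; check the guide for the recommended tunables profile):
Code:
# replace the legacy (bobtail) tunables, e.g. with the hammer profile
ceph osd crush tunables hammer
# set the flags the post-upgrade health warnings ask for
ceph osd set sortbitwise
ceph osd set require_jewel_osds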

Please provide more details if in doubt (e.g. ceph osd tree, ceph -s and logs).

For reference, an older version of the wiki page:
https://pve.proxmox.com/mediawiki/index.php?title=Ceph_Server&direction=prev&oldid=9997
 
Hi Alvin,

thanks for the quick response. I have followed those steps, and it has left me with those "device 0 device0" entries in the crush map, which I cannot remove with the commands you mentioned.

Yes, the warnings about clock skew have already disappeared, and the legacy tunables warning should go away once I can complete the procedure. Right now, though, I don't know how to get past the issue of the OSDs of the new node not taking their place in the cluster.

Here is the output of ceph osd tree (JSON format):


Code:
{
    "nodes": [
        {
            "id": -1,
            "name": "default",
            "type": "root",
            "type_id": 10,
            "children": [
                -4,
                -3,
                -2
            ]
        },
        {
            "id": -2,
            "name": "pve02",
            "type": "host",
            "type_id": 1,
            "children": []
        },
        {
            "id": -3,
            "name": "pve03",
            "type": "host",
            "type_id": 1,
            "children": [
                9,
                8,
                7,
                6,
                5
            ]
        },
        {
            "id": 5,
            "name": "osd.5",
            "type": "osd",
            "type_id": 0,
            "crush_weight": 0.209991,
            "depth": 2,
            "exists": 1,
            "status": "up",
            "reweight": 1.000000,
            "primary_affinity": 1.000000
        },
        {
            "id": 6,
            "name": "osd.6",
            "type": "osd",
            "type_id": 0,
            "crush_weight": 0.209991,
            "depth": 2,
            "exists": 1,
            "status": "up",
            "reweight": 1.000000,
            "primary_affinity": 1.000000
        },
        {
            "id": 7,
            "name": "osd.7",
            "type": "osd",
            "type_id": 0,
            "crush_weight": 0.209991,
            "depth": 2,
            "exists": 1,
            "status": "up",
            "reweight": 1.000000,
            "primary_affinity": 1.000000
        },
        {
            "id": 8,
            "name": "osd.8",
            "type": "osd",
            "type_id": 0,
            "crush_weight": 0.209991,
            "depth": 2,
            "exists": 1,
            "status": "up",
            "reweight": 1.000000,
            "primary_affinity": 1.000000
        },
        {
            "id": 9,
            "name": "osd.9",
            "type": "osd",
            "type_id": 0,
            "crush_weight": 0.089996,
            "depth": 2,
            "exists": 1,
            "status": "up",
            "reweight": 1.000000,
            "primary_affinity": 1.000000
        },
        {
            "id": -4,
            "name": "pve04",
            "type": "host",
            "type_id": 1,
            "children": [
                11,
                10,
                14,
                13,
                12
            ]
        },
        {
            "id": 12,
            "name": "osd.12",
            "type": "osd",
            "type_id": 0,
            "crush_weight": 0.209991,
            "depth": 2,
            "exists": 1,
            "status": "up",
            "reweight": 1.000000,
            "primary_affinity": 1.000000
        },
        {
            "id": 13,
            "name": "osd.13",
            "type": "osd",
            "type_id": 0,
            "crush_weight": 0.209991,
            "depth": 2,
            "exists": 1,
            "status": "up",
            "reweight": 1.000000,
            "primary_affinity": 1.000000
        },
        {
            "id": 14,
            "name": "osd.14",
            "type": "osd",
            "type_id": 0,
            "crush_weight": 0.089996,
            "depth": 2,
            "exists": 1,
            "status": "up",
            "reweight": 1.000000,
            "primary_affinity": 1.000000
        },
        {
            "id": 10,
            "name": "osd.10",
            "type": "osd",
            "type_id": 0,
            "crush_weight": 0.209991,
            "depth": 2,
            "exists": 1,
            "status": "up",
            "reweight": 1.000000,
            "primary_affinity": 1.000000
        },
        {
            "id": 11,
            "name": "osd.11",
            "type": "osd",
            "type_id": 0,
            "crush_weight": 0.209991,
            "depth": 2,
            "exists": 1,
            "status": "up",
            "reweight": 1.000000,
            "primary_affinity": 1.000000
        }
    ],
    "stray": []
}

and ceph -s produces:

Code:
cluster fc88e526-76f9-4dc0-ad21-dd32837d3f73
     health HEALTH_WARN
            126 pgs degraded
            126 pgs stuck degraded
            300 pgs stuck unclean
            126 pgs stuck undersized
            126 pgs undersized
            recovery 29010/300885 objects degraded (9.642%)
            recovery 57409/300885 objects misplaced (19.080%)
            crush map has legacy tunables (require bobtail, min is firefly)
            no legacy OSD present but 'sortbitwise' flag is not set
            all OSDs are running jewel or later but the 'require_jewel_osds' osdmap flag is not set
     monmap e5: 3 mons at {0=10.0.100.4:6789/0,1=10.0.100.3:6789/0,pve02=10.0.100.2:6789/0}
            election epoch 1162, quorum 0,1,2 pve02,1,0
     osdmap e7872: 10 osds: 10 up, 10 in; 174 remapped pgs
      pgmap v65005248: 300 pgs, 1 pools, 388 GB data, 100295 objects
            1003 GB used, 914 GB / 1918 GB avail
            29010/300885 objects degraded (9.642%)
            57409/300885 objects misplaced (19.080%)
                 174 active+remapped
                 126 active+undersized+degraded
  client io 18491 B/s rd, 563 kB/s wr, 4 op/s rd, 92 op/s wr

Any pointers are greatly appreciated.

Kind regards,
Lorenz
 
I have followed those steps, and it has left me with those "device 0 device0" entries in the crush map, which I cannot remove with the commands you mentioned.
These are placeholders and can be ignored.

Depending on how (and to which software release) you re-installed, it may not be possible to add the OSDs; check the Ceph logs for more information when you create them. As far as I can see, the old OSDs don't exist in the cluster anymore, so they need to be added from scratch.
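
A minimal sketch for re-creating one OSD on the re-installed node (the device path /dev/sdb is just an example; zapping destroys everything on the disk):
Code:
# clear the old partition table and any leftover OSD data on the disk
ceph-disk zap /dev/sdb
# create and activate a new OSD on that disk through the PVE tooling
pveceph createosd /dev/sdb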

For reference, the PVE upgrade guide [1] and the Ceph upgrade guide [2].
[1] https://pve.proxmox.com/wiki/Upgrade_from_4.x_to_5.0
[2] https://pve.proxmox.com/wiki/Ceph_Jewel_to_Luminous
 
