[SOLVED] Problem upgrading from ceph Luminous to Nautilus : crush map has legacy tunables (require firefly, min is hammer)

alain

Renowned Member
Hi all,

And happy new year to all.

I just took advantage of the new year period, when there were few people in the lab, to finally upgrade my 5.4 clusters to 6.1. I tested the procedure on test clusters and all went fine. I followed the guide to upgrade from 5.4 to 6.0. I also have Ceph clusters, and I followed the guide to upgrade from Luminous to Nautilus.
https://pve.proxmox.com/wiki/Ceph_Luminous_to_Nautilus

I then went to apply the procedure on a production cluster and, unfortunately, I made a little mistake. I forgot to apply the two commands:
ceph-volume simple scan
ceph-volume simple activate --all

on one node. Then I got this error:

Code:
root@prox1orsay:~# ceph -s
  cluster:
    id:     b5a08127-b65a-430c-ad34-810752429977
    health: HEALTH_WARN
            crush map has legacy tunables (require firefly, min is hammer)

  services:
    mon: 3 daemons, quorum 0,1,2 (age 17s)
    mgr: prox1orsay(active, since 35m)
    osd: 24 osds: 24 up, 24 in

  data:
    pools:   3 pools, 1188 pgs
    objects: 169.70k objects, 658 GiB
    usage:   1.9 TiB used, 9.0 TiB / 11 TiB avail
    pgs:     1188 active+clean

  io:
    client:   90 KiB/s wr, 0 op/s rd, 7 op/s wr

So I have the warning message I mentioned in the title. I thought I had applied the same commands on every node, but when comparing the histories, I was forced to conclude that this was not the case. I think that is the reason why I get the warning on legacy tunables, because these OSDs have not been upgraded to Nautilus. I then issued the command 'ceph osd require-osd-release nautilus'.
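For reference, the release currently required by the OSD map can be checked like this:

Code:
# Check which release the OSD map currently requires
ceph osd dump | grep require_osd_release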

It is a three-node cluster with 8 OSDs each. The OSDs are still FileStore. On the first and third node, the command history is similar:

Code:
468  ceph osd set noout
  469  ceph osd dump | grep ^flags
  470  cd /etc/pve
  471  mc
  472  sed -i 's/luminous/nautilus/' /etc/apt/sources.list.d/ceph.list
  473  cd
  474  cat /etc/apt/sources.list.d/ceph.list
  475  apt update
  476  apt list --upgradable
  477  apt-dist upgrade
  478  apt dist-upgrade
  479  systemctl restart ceph-mon.target
  480  ceph mon dump | grep min_mon_release
  481  systemctl restart ceph-mgr.target
  482  ceph -s
  483  ceph -s
  484  ceph -s
  485  systemctl restart ceph-osd.target
  486  ceph-volume simple scan
  487  ceph-volume simple activate --all
  488  ceph -s
  489  ceph osd require-osd-release nautilus
  490  cep osd unset noout
  491  ceph osd unset noout
  492  ceph -s
  493  ceph config set mon mon_crush_min_required_version firefly
  494  ceph -s
  495  ceph mon enable-msgr2

But on the second node, the two ceph-volume commands are missing:

Code:
482  ceph osd dump | grep ^flags
  483  sed -i 's/luminous/nautilus/' /etc/apt/sources.list.d/ceph.list
  484  cat /etc/apt/sources.list.d/ceph.list
  485  apt update
  486  apt list --upgradable
  487  apt dist-upgrade
  488  systemctl restart ceph-mon.target
  489  systemctl restart ceph-mgr.target
  490  systemctl restart ceph-mgr.target
  491  ceph -s
  492  ps aux | grep ceph.mgr
  493  systemctl status ceph-mgr.target
  494  systemctl restart ceph-osd.target
  495  ceph mon dump
  496  ceph-volume simple scan
  497  ceph -s

So I tried to run the command 'ceph-volume simple scan' again on node 2, but this time I got an error:

Code:
root@prox2orsay:~# ceph-volume simple scan
 stderr: lsblk: /var/lib/ceph/osd/ceph-10: not a block device
 stderr: Bad argument "/var/lib/ceph/osd/ceph-10", expected an absolute path in /dev/ or /sys or a unit name: Invalid argument
Running command: /sbin/cryptsetup status /dev/sdd1
-->  RuntimeError: --force was not used and OSD metadata file exists: /etc/ceph/osd/10-2543a4e3-d8ef-476f-a6fa-2cfaa0b2fb6b.json
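If I read the message correctly, ceph-volume refuses to overwrite the metadata file it already wrote during the earlier scan unless --force is passed. I suppose re-running it would look something like this, but I am not sure it is safe at this point:

Code:
# Overwrite the stale /etc/ceph/osd/*.json files left by the earlier scan
ceph-volume simple scan --force
# Then re-create the systemd units and mounts for the scanned OSDs
ceph-volume simple activate --all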

At this point I need some advice. Is it still possible to upgrade the OSDs to Nautilus (without removing and re-adding the OSDs)? Should I revert the command 'ceph osd require-osd-release nautilus' to luminous in order to do so? Is there another possibility?

Here is the OSD tree. I am now on Proxmox 6.1, latest version...

Code:
root@prox1orsay:~# ceph osd tree
ID CLASS WEIGHT   TYPE NAME           STATUS REWEIGHT PRI-AFF
-1       10.91016 root default
-2        3.63672     host prox1orsay
 0   hdd  0.45459         osd.0           up  1.00000 1.00000
 3   hdd  0.45459         osd.3           up  1.00000 1.00000
 4   hdd  0.45459         osd.4           up  1.00000 1.00000
 5   hdd  0.45459         osd.5           up  1.00000 1.00000
 6   hdd  0.45459         osd.6           up  1.00000 1.00000
 7   hdd  0.45459         osd.7           up  1.00000 1.00000
 8   hdd  0.45459         osd.8           up  1.00000 1.00000
 9   hdd  0.45459         osd.9           up  1.00000 1.00000
-3        3.63672     host prox2orsay
 1   hdd  0.45459         osd.1           up  1.00000 1.00000
10   hdd  0.45459         osd.10          up  1.00000 1.00000
11   hdd  0.45459         osd.11          up  1.00000 1.00000
12   hdd  0.45459         osd.12          up  1.00000 1.00000
13   hdd  0.45459         osd.13          up  1.00000 1.00000
14   hdd  0.45459         osd.14          up  1.00000 1.00000
15   hdd  0.45459         osd.15          up  1.00000 1.00000
16   hdd  0.45459         osd.16          up  1.00000 1.00000
-4        3.63672     host prox3orsay
 2   hdd  0.45459         osd.2           up  1.00000 1.00000
17   hdd  0.45459         osd.17          up  1.00000 1.00000
18   hdd  0.45459         osd.18          up  1.00000 1.00000
19   hdd  0.45459         osd.19          up  1.00000 1.00000
20   hdd  0.45459         osd.20          up  1.00000 1.00000
21   hdd  0.45459         osd.21          up  1.00000 1.00000
22   hdd  0.45459         osd.22          up  1.00000 1.00000
23   hdd  0.45459         osd.23          up  1.00000 1.00000

As I had a warning, I entered the command 'ceph config set mon mon_crush_min_required_version firefly', but it did not help.
As you may have noticed, I also have a small problem with the managers, as ceph -s does not display the standby managers. It was already the case on a test cluster (also FileStore).
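To check whether the other managers are actually running at all, I suppose one can look at the service on each node and at the mgr map:

Code:
# On each node: is the manager daemon actually running?
systemctl status ceph-mgr.target
# Cluster-wide view: active manager plus the list of registered standbys
ceph mgr dump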

I would be grateful if someone can help solve this issue (which is my fault...).

Thanks in advance
Alain
 
Regarding the command to set the min version to firefly, I see a lot of messages in the logs saying:
"set_mon_vals failed to set mon_crush_min_required_version = firefly: Configuration option 'mon_crush_min_required_version' may not be modified at runtime"
 
I just verified, and all OSDs are indeed on the Nautilus version. So why the warning?

Code:
root@prox2orsay:~# ceph tell osd.* version
osd.0: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.1: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.2: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.3: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.4: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.5: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.6: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.7: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.8: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.9: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.10: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.11: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.12: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.13: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.14: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.15: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.16: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.17: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.18: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.19: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.20: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.21: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.22: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
osd.23: {
    "version": "ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)"
}
 
Hi all,

This thread is rather old, but today I solved the problem, and I thought it would be a good thing to let you know how I did it.

The problem was not solved during all this time, but it was only a warning, it seemed harmless, and I left it as it was.
But today I upgraded to Proxmox 6.4, and then from Nautilus to Octopus. I hoped that this would solve the problem, but it did not.

So I took a closer look. I have two other Ceph test clusters where the problem did not happen. I looked at the differences between them (all three had just been upgraded to Octopus). Looking at the crush map, my faulty cluster has these settings:

Code:
# ceph osd crush dump
...
   "tunables": {
        "choose_local_tries": 0,
        "choose_local_fallback_tries": 0,
        "choose_total_tries": 50,
        "chooseleaf_descend_once": 1,
        "chooseleaf_vary_r": 1,
        "chooseleaf_stable": 0,
        "straw_calc_version": 1,
        "allowed_bucket_algs": 22,
        "profile": "firefly",
        "optimal_tunables": 0,
        "legacy_tunables": 0,
        "minimum_required_version": "firefly",
        "require_feature_tunables": 1,
        "require_feature_tunables2": 1,
        "has_v2_rules": 0,
        "require_feature_tunables3": 1,
        "has_v3_rules": 0,
        "has_v4_buckets": 0,
        "require_feature_tunables5": 0,
        "has_v5_rules": 0
    },
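As a side note, there is also a shorter view of just this tunables block, which I assume behaves the same on Nautilus and Octopus:

Code:
# Show only the CRUSH tunables instead of the full crush dump
ceph osd crush show-tunables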

So optimal_tunables was 0 and minimum_required_version was firefly. On the other clusters, optimal_tunables was 1 and minimum_required_version was jewel. So I went back to the documentation of an old upgrade from 2018 (Jewel to Luminous):
https://pve.proxmox.com/wiki/Ceph_Jewel_to_Luminous

There I read that:
Also it is recommended to set the tunable to optimal, but this will produce a massive rebalance.

ceph osd set-require-min-compat-client jewel

ceph osd crush tunables optimal


And no doubt out of fear of the massive rebalance on my production cluster, I did not do this last step at the time.
This time, I decided to try it, and the warning message then disappeared (after 4 hours of rebalancing), and now the crush map looks the same as on my test clusters.
So the solution for me was:

Code:
ceph osd set-require-min-compat-client jewel
ceph osd crush tunables optimal
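For those wondering, the rebalance can simply be followed with the usual status commands until all PGs are back to active+clean:

Code:
# Follow the rebalance triggered by the tunables change
watch -n 10 ceph -s
# Once everything is active+clean again, the tunables warning should be gone
ceph health detail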

So problem finally solved.
 
