[SOLVED] Problem upgrading Ceph to Nautilus

mariodt · Aug 22, 2019

Hello,
I'm following this guide https://pve.proxmox.com/wiki/Ceph_Luminous_to_Nautilus to upgrade Ceph on my 6 nodes cluster.
Everything goes fine till "systemctl restart ceph-mgr.target", at this point the 3 managers don't restart.
Issuing "ceph -s" shows:
services:
...
mgr: no daemons active
...
Furthermore, issuing "systemctl restart ceph-osd.target" on one node, puts down the 4 OSDs on that node without starting them.

Thank you for your help!
Mario

Stoiko Ivanov · Aug 22, 2019

There should be some hints to what is not working in:
* the journal: `journalctl -r`
* the ceph logs (all files under '/var/log/ceph')

please check there and if it does not resolve the issue - post the logs here

mariodt · Aug 22, 2019

The hints are thousands, I paste the more interesting.
It seems to be a keyring/authentication problem after adapted the ceph.conf as stated in the guide.

journalctl -r:
...
Aug 22 08:46:00 c01 systemd[1]: Starting Proxmox VE replication runner...
Aug 22 08:45:26 c01 systemd[1]: Failed to start Ceph cluster manager daemon.
Aug 22 08:45:26 c01 systemd[1]: ceph-mgr@c01.service: Failed with result 'exit-code'.
Aug 22 08:45:26 c01 systemd[1]: ceph-mgr@c01.service: Start request repeated too quickly.
Aug 22 08:45:26 c01 systemd[1]: Stopped Ceph cluster manager daemon.
Aug 22 08:45:26 c01 systemd[1]: ceph-mgr@c01.service: Scheduled restart job, restart counter is at 3.
Aug 22 08:45:26 c01 systemd[1]: ceph-mgr@c01.service: Service RestartSec=10s expired, scheduling restart.
Aug 22 08:45:16 c01 systemd[1]: ceph-mgr@c01.service: Failed with result 'exit-code'.
Aug 22 08:45:16 c01 systemd[1]: ceph-mgr@c01.service: Main process exited, code=exited, status=1/FAILURE
Aug 22 08:45:16 c01 ceph-mgr[1997]: failed to fetch mon config (--no-mon-config to skip)
Aug 22 08:45:16 c01 ceph-mgr[1997]: 2019-08-22 08:45:16.499 7fa0dcabadc0 -1 AuthRegistry(0x7fffcd42d838) no keyring found at /etc/pve/priv/ceph.mgr.c01.keyring, disabling cephx
Aug 22 08:45:16 c01 ceph-mgr[1997]: 2019-08-22 08:45:16.499 7fa0dcabadc0 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.mgr.c01.keyring: (2) No such file or directory
Aug 22 08:45:16 c01 ceph-mgr[1997]: 2019-08-22 08:45:16.495 7fa0dcabadc0 -1 AuthRegistry(0x556dd8b18140) no keyring found at /etc/pve/priv/ceph.mgr.c01.keyring, disabling cephx
Aug 22 08:45:16 c01 ceph-mgr[1997]: 2019-08-22 08:45:16.495 7fa0dcabadc0 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.mgr.c01.keyring: (2) No such file or directory
Aug 22 08:45:16 c01 systemd[1]: Started Ceph cluster manager daemon.
Aug 22 08:45:16 c01 systemd[1]: Stopped Ceph cluster manager daemon.
...

ceph.log:
...
2019-08-22 08:29:30.161504 mon.0 mon.0 10.10.0.75:6789/0 607236 : cluster [DBG] mgrmap e204: c02(active)
2019-08-22 08:29:35.158691 mon.0 mon.0 10.10.0.75:6789/0 607239 : cluster [INF] Manager daemon c02 is unresponsive. No standby daemons available.
2019-08-22 08:29:35.158837 mon.0 mon.0 10.10.0.75:6789/0 607240 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)
2019-08-22 08:29:35.164213 mon.0 mon.0 10.10.0.75:6789/0 607241 : cluster [DBG] mgrmap e205: no daemons active
2019-08-22 08:30:19.923578 mon.0 mon.0 10.10.0.75:6789/0 607270 : cluster [INF] osd.21 marked itself down
2019-08-22 08:30:19.923689 mon.0 mon.0 10.10.0.75:6789/0 607271 : cluster [INF] osd.5 marked itself down
2019-08-22 08:30:19.924382 mon.0 mon.0 10.10.0.75:6789/0 607272 : cluster [INF] osd.6 marked itself down
2019-08-22 08:30:19.924461 mon.0 mon.0 10.10.0.75:6789/0 607273 : cluster [INF] osd.4 marked itself down
2019-08-22 08:30:20.111718 mon.0 mon.0 10.10.0.75:6789/0 607275 : cluster [WRN] Health check failed: 4 osds down (OSD_DOWN)
2019-08-22 08:30:20.111764 mon.0 mon.0 10.10.0.75:6789/0 607276 : cluster [WRN] Health check failed: 1 host (4 osds) down (OSD_HOST_DOWN)
...

Stoiko Ivanov · Aug 22, 2019

seems it might be related to a wrong path for the keyring? (/etc/pve/priv/ usually contains the client keyring) - please check the upgrade docs in detail - specifically https://pve.proxmox.com/wiki/Ceph_Luminous_to_Nautilus#Adapt_.2Fetc.2Fpve.2Fceph.conf

if this does not help, please post your ceph config

mariodt · Aug 22, 2019

New configuration after upgrade:

root@c01:/etc# cat /etc/pve/ceph.conf
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
auth supported = cephx
cluster network = 192.168.0.0/22
filestore xattr use omap = true
fsid = a88c6025-37fe-4f8b-95fe-13abd6237306
keyring = /etc/pve/priv/$cluster.$name.keyring
osd journal size = 5120
osd pool default min size = 1
mon allow pool delete = true
mon health preluminous compat warning = false
public network = 10.10.0.0/22
setuser match path = /var/lib/ceph/$type/$cluster-$id

[mon.0]
host = c00
mon addr = 10.10.0.75:6789

[mon.1]
host = c01
mon addr = 10.10.0.76:6789

[mon.2]
host = c02
mon addr = 10.10.0.77:6789

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

root@c01:/etc# ls -l /etc/pve/priv/
total 4
-rw------- 1 root www-data 1679 Aug 21 12:57 authkey.key
-rw------- 1 root www-data 2731 Aug 22 08:38 authorized_keys
drwx------ 2 root www-data 0 Feb 13 2015 ceph
-rw------- 1 root www-data 63 Feb 13 2015 ceph.client.admin.keyring
-rw------- 1 root www-data 214 Feb 13 2015 ceph.mon.keyring
-rw------- 1 root www-data 6188 Aug 22 08:38 known_hosts
drwx------ 2 root www-data 0 Feb 13 2015 lock
-rw------- 1 root www-data 1675 Feb 12 2015 pve-root-ca.key
-rw------- 1 root www-data 3 Nov 4 2016 pve-root-ca.srl

root@c01:/etc# find . -name "*keyring"
./ceph/ceph.client.admin.keyring
./pve/priv/ceph.client.admin.keyring
./pve/priv/ceph/prodstorage.keyring
./pve/priv/ceph.mon.keyring

Stoiko Ivanov · Aug 22, 2019

mariodt said:
[global]
....
keyring = /etc/pve/priv/$cluster.$name.keyring

I think this line does not belong here (as explained in the upgrade-docs)

mariodt · Aug 22, 2019

You are right, I missed it.
Now all the managers are running.
The problem persists for OSDs (filestore type): they are in "down/in" state and rebooting or issuing "systemctl restart ceph-osd.target" does not solve.
Note that in the old ceph.conf file there was the following option:
[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring
and now there isn't.

Any suggestions?
Thank you

Stoiko Ivanov · Aug 22, 2019

mariodt said:
keyring = /var/lib/ceph/osd/ceph-$id/keyring

try adding that again?
also - did you follow https://pve.proxmox.com/wiki/Ceph_Luminous_to_Nautilus#Restart_the_OSD_daemon_on_all_nodes
did the scan-tasks complete successfully ?
see also http://wordpress.hawkless.id.au/ind...ces-block-and-data-not-present-for-bluestore/

hope this helps!

mariodt · Aug 23, 2019

1) option added again, rebooted, not solved, option removed again;
2) I've restarted the OSDs only on one node (of 6) without success, if I run "systemctl restart ceph-osd.target" to restart OSDs on all nodes, all OSDs go down and VMs will be unresponsive. Do I need to shutdown ALL the VMs (about 50)?
3) what if I execute scan-task on one node without restarting the OSDs on the other nodes?
4) I can add the “type”: “filestore” line but only after executing "ceph-volume simple scan" (/etc/ceph/osd/{OSDID}-GUID.json files don't exist now), this depends on the previous task at 3);
5) after the reboot at 1), the monitor on that node is down (see below),

am I in serious trouble?

root@c01:~# ceph status
cluster:
health: HEALTH_WARN
noout flag(s) set
4 osds down
1 host (4 osds) down
Degraded data redundancy: 532934/4247247 objects degraded (12.548%), 386 pgs degraded, 438 pgs undersized
1/3 mons down, quorum 0,2

services:
mon: 3 daemons, quorum 0,2, out of quorum: 1
mgr: c00(active), standbys: c02, c01
osd: 31 osds: 27 up, 31 in
flags noout
...

root@c01:~# journalctl -r
...
Aug 22 14:39:12 c01 ceph-osd[43281]: failed to fetch mon config (--no-mon-config to skip)
...

Stoiko Ivanov · Aug 23, 2019

mariodt said:
option added again, rebooted, not solved, option removed again;

I would put it in place again....

mariodt said:
2) I've restarted the OSDs only on one node (of 6) without success, if I run "systemctl restart ceph-osd.target" to restart OSDs on all nodes, all OSDs go down and VMs will be unresponsive. Do I need to shutdown ALL the VMs (about 50)?

check the logs of why the OSDs are down (I guess there was a problem with the scan step...) - and fix that issue - once your OSDs are up you should be able to get access to your cluster again
As an alternative you could remove the filestore OSDs (one by one) and readd them to the cluster as newly created ones (with bluestore and with ceph volume)

mariodt said:
3) what if I execute scan-task on one node without restarting the OSDs on the other nodes?

IIRC the scan task is there to let the local system know which OSD it hosts - so you should not need to restart the OSD services on the other nodes
try scanning individual filestore partitions e.g.:
`ceph-volume simple scan /dev/sdb1` (as written in the docs)

mariodt said:
5) after the reboot at 1), the monitor on that node is down (see below),

check the logs why it is down

Hope this helps!

mariodt · Aug 24, 2019

Solved: reinserted the osd option, executed the scan-volume, fixed the json files, activated the volumes, manually started the mon.
Now the OSDs and the Mon are up and running, I can continue the upgrade on the other nodes...
Thank you very much Stoiko.
Mario

Stoiko Ivanov · Aug 26, 2019

Glad you found a solution to your issues! Please mark the thread as 'SOLVED' so that others know what to expect.

Thanks!

Robert.H · Oct 24, 2019

So we should KEEP in /etc/pve/ceph.conf:
[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring

Even though the instructions say:
move the 'keyring' option into the 'client' section, and remove it everywhere else.

Could the instruction be updated please?
https://pve.proxmox.com/wiki/Ceph_Luminous_to_Nautilus#Adapt_/etc/pve/ceph.conf

Thanks!

Alwin · Oct 25, 2019

Robert.H said:
Could the instruction be updated please?
https://pve.proxmox.com/wiki/Ceph_Luminous_to_Nautilus#Adapt_/etc/pve/ceph.conf

I have updated the phrasing, maybe that makes it more clear.

mariodt · Oct 25, 2019

Robert.H said:
So we should KEEP in /etc/pve/ceph.conf:
[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring

Even though the instructions say:
move the 'keyring' option into the 'client' section, and remove it everywhere else.

Could the instruction be updated please?
https://pve.proxmox.com/wiki/Ceph_Luminous_to_Nautilus#Adapt_/etc/pve/ceph.conf

Thanks!

Now my cluster is up and running WITH this option, from August 24th, BUT I've a mixed osds type environment: bluestore and filestore, is it likely that with only bluestore osds this option is not required?

Alwin · Oct 25, 2019

@mariodt, this should be independent of the osd backend store.

Robert.H · Oct 25, 2019

Alwin said:
I have updated the phrasing, maybe that makes it more clear.

Thanks @Alwin
I was hoping to see a clear indication to keep the keyring option under [osd] since it's using a different path than the client keyring. The ending of the phrase: "[...], and remove it everywhere else." will still cause confusion and should be removed/reworded imho.

Alwin · Oct 28, 2019

Since Nautilus, all daemons use the 'keyring' option for its keyring. Possibly that in some cases, it doesn't account for "old" (pre Nautilus) OSDs.

normic · Jul 25, 2020

@Alwin I must admit that @Robert.H is right, it's still not clear (for me).

I'm currently in the process of upgrading and came to the forum to recheck this step, as I have two other keyring entries in my config. One under [mds] and one under [osd].
So even with the code example it's not clear if the entry under [global] should be the single one in the whole config. As [mds] and [osd] lead to different locations than the [global] one.

Alwin · Jul 27, 2020

normic said:
@Alwin I must admit that @Robert.H is right, it's still not clear (for me).

From the upgrade guide:

Since Nautilus, all daemons use the 'keyring' option for its keyring, so you have to adapt this. The easiest way is to move the global 'keyring' option into the 'client' section, and remove it everywhere else.

https://pve.proxmox.com/wiki/Ceph_Luminous_to_Nautilus#Adapt_/etc/pve/ceph.conf

[SOLVED] Problem upgrading Ceph to Nautilus

Active Member

Proxmox Staff Member

Active Member

Proxmox Staff Member

Active Member

Proxmox Staff Member

Active Member

Proxmox Staff Member

Active Member

Proxmox Staff Member

Active Member

Proxmox Staff Member

Active Member

Proxmox Retired Staff

Active Member

Proxmox Retired Staff

Active Member

Proxmox Retired Staff

Member

Proxmox Retired Staff