Ceph is stuck after upgrade to Octopus

lsmal808

New Member
Jan 11, 2021
Hi all,

I have upgraded Ceph on two of our clusters following the Proxmox Nautilus to Octopus upgrade procedure. The first cluster works perfectly, but the second one doesn't. The upgrade went fine up to the step where you restart the OSDs; there it got stuck, and 3 of the 4 hosts now show as down together with their OSDs, even though the hosts are up, can reach each other, and the OSD services are running. During the upgrade I also got "1 backfillfull osd(s)" and "2 pool(s) backfillfull" health errors. I have set the full and backfill ratios, but the messages stay the same.
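
For reference, this is roughly how I set the ratios (the values below are examples, not necessarily the exact ones I used):

ceph osd set-nearfull-ratio 0.90
ceph osd set-backfillfull-ratio 0.92
ceph osd set-full-ratio 0.95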

[screenshot: HealthStatus.png]


The OSDs seem to be unresponsive: if I shut down the one host whose OSDs still show as up, Ceph keeps reporting those OSDs as up, although the monitor and manager do show the host going down.
[screenshot: OSD.png]
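
For completeness, these are the standard Ceph CLI commands I'm using to check the state described above:

ceph -s         # overall health and cluster status
ceph osd tree   # up/down state per OSD and host
ceph osd df     # per-OSD utilisation (relevant for the backfillfull warnings)
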
This is the ceph.conf:

[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 10.10.10.0/24
fsid = 3d6cfbaa-c7ac-447a-843d-9795f9ab4276
mon allow pool delete = true
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 3
public network = 10.10.10.0/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[osd]
osd max scrubs = 1
osd scrub begin hour = 19
osd scrub end hour = 5

[mon.W2]
host = W2
mon addr = 10.10.10.82:6789

[mon.W4]
host = W4
mon addr = 10.10.10.84:6789

[mon.W1]
host = W1
mon addr = 10.10.10.81:6789

[mon.W3]
host = W3
mon addr = 10.10.10.83:6789

The crush map:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host W2 {
    id -5       # do not change unnecessarily
    id -6 class hdd     # do not change unnecessarily
    # weight 29.108
    alg straw2
    hash 0      # rjenkins1
    item osd.2 weight 7.277
    item osd.3 weight 7.277
    item osd.8 weight 7.277
    item osd.9 weight 7.277
}
host W3 {
    id -7       # do not change unnecessarily
    id -8 class hdd     # do not change unnecessarily
    # weight 29.108
    alg straw2
    hash 0      # rjenkins1
    item osd.4 weight 7.277
    item osd.5 weight 7.277
    item osd.10 weight 7.277
    item osd.11 weight 7.277
}
host W4 {
    id -9       # do not change unnecessarily
    id -10 class hdd        # do not change unnecessarily
    # weight 29.108
    alg straw2
    hash 0      # rjenkins1
    item osd.12 weight 7.277
    item osd.13 weight 7.277
    item osd.14 weight 7.277
    item osd.15 weight 7.277
}
host W1 {
    id -3       # do not change unnecessarily
    id -4 class hdd     # do not change unnecessarily
    # weight 29.107
    alg straw2
    hash 0      # rjenkins1
    item osd.0 weight 7.277
    item osd.1 weight 7.277
    item osd.6 weight 7.277
    item osd.7 weight 7.277
}
root default {
    id -1       # do not change unnecessarily
    id -2 class hdd     # do not change unnecessarily
    # weight 116.428
    alg straw2
    hash 0      # rjenkins1
    item W2 weight 29.107
    item W3 weight 29.107
    item W4 weight 29.107
    item W1 weight 29.107
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
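
For reference, the map above was dumped and decompiled with the standard tools:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt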

Any advice would be appreciated.
 
Here are some errors that I found in the log files on the hosts.

On W2, W3 and W4 I get this error in ceph-volume.log:

Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/ceph_volume/main.py", line 144, in main
conf.ceph = configuration.load(conf.path)
File "/usr/lib/python3/dist-packages/ceph_volume/configuration.py", line 51, in load
raise exceptions.ConfigurationError(abspath=abspath)
ceph_volume.exceptions.ConfigurationError: Unable to load expected Ceph config at: /etc/ceph/ceph.conf
[2021-01-12 08:49:08,137][ceph_volume.main][ERROR ] ignoring inability to load ceph.conf
(the same traceback and "ignoring inability to load ceph.conf" error repeat several more times in the log)

In the ceph-mgr log on every host I found this error:

2021-01-12T08:49:28.088+0200 7fbdf9104700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.0 ()
2021-01-12T08:49:28.088+0200 7fbdf9104700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.1 ()
2021-01-12T08:49:28.088+0200 7fbdf9104700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.2 ()
2021-01-12T08:49:28.088+0200 7fbdf9104700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.3 ()
2021-01-12T08:49:28.088+0200 7fbdf9104700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.4 ()
2021-01-12T08:49:28.088+0200 7fbdf9104700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.5 ()
2021-01-12T08:49:28.088+0200 7fbdf9104700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.6 ()
2021-01-12T08:49:28.088+0200 7fbdf9104700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.7 ()
2021-01-12T08:49:28.092+0200 7fbdf9104700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.8 ()
2021-01-12T08:49:28.092+0200 7fbdf9104700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.9 ()
2021-01-12T08:49:28.092+0200 7fbdf9104700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.10 ()
2021-01-12T08:49:28.092+0200 7fbdf9104700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.11 ()
2021-01-12T08:49:29.060+0200 7fbdfb108700 1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
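
One way to check whether an OSD daemon responds locally, independent of the monitors, is the standard admin-socket query (shown for osd.0; run it on the host that carries that OSD):

ceph daemon osd.0 status
ceph daemon osd.0 version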
 
I see I uploaded the old ceph.conf above; here is the current one.

[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 10.10.10.0/24
fsid = 3d6cfbaa-c7ac-447a-843d-9795f9ab4276
mon allow pool delete = true
mon_host = 10.10.10.81 10.10.10.82 10.10.10.83 10.10.10.84
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 3
public network = 10.10.10.0/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.W2]
public_addr = 10.10.10.82

[mon.W4]
public_addr = 10.10.10.84

[mon.W1]
public_addr = 10.10.10.81

[mon.W3]
public_addr = 10.10.10.83
 
ceph_volume.exceptions.ConfigurationError: Unable to load expected Ceph config at: /etc/ceph/ceph.conf
There should be a symlink, /etc/ceph/ceph.conf -> /etc/pve/ceph.conf; is it present?

2021-01-12T08:49:28.088+0200 7fbdf9104700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.0 ()
Are the OSD daemons really running? What is the output of journalctl -u ceph-osd@0.service?
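
You can verify the link like this; the target should be the pmxcfs copy:

ls -l /etc/ceph/ceph.conf
# expected: /etc/ceph/ceph.conf -> /etc/pve/ceph.conf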
 
Yes, the link is present, and if I change the file on, say, W1, the changes are pushed to the others.

Here are the results for one OSD per host:

root@W1:~# journalctl -u ceph-osd@0.service
-- Logs begin at Tue 2021-01-12 08:49:17 SAST, end at Tue 2021-01-12 12:07:35 SAST. --
Jan 12 08:49:22 W1 systemd[1]: Starting Ceph object storage daemon osd.0...
Jan 12 08:49:22 W1 systemd[1]: Started Ceph object storage daemon osd.0.
Jan 12 08:49:33 W1 ceph-osd[2067]: 2021-01-12T08:49:33.996+0200 7f6f7760ce00 -1 osd.0 36214 log_to_monitors {default=true}

root@W2:~# journalctl -u ceph-osd@2.service
-- Logs begin at Tue 2021-01-12 08:48:55 SAST, end at Tue 2021-01-12 12:09:45 SA
Jan 12 08:48:59 W2 systemd[1]: Starting Ceph object storage daemon osd.2...
Jan 12 08:48:59 W2 systemd[1]: Started Ceph object storage daemon osd.2.
Jan 12 08:49:32 W2 ceph-osd[1945]: 2021-01-12T08:49:32.620+0200 7f505cfc9e00 -1

root@W3:~# journalctl -u ceph-osd@4.service
-- Logs begin at Tue 2021-01-12 08:49:10 SAST, end at Tue 2021-01-12 12:11:18 SAST. --
Jan 12 08:49:15 W3 systemd[1]: Starting Ceph object storage daemon osd.4...
Jan 12 08:49:15 W3 systemd[1]: Started Ceph object storage daemon osd.4.
Jan 12 08:49:31 W3 ceph-osd[1935]: 2021-01-12T08:49:31.683+0200 7fb21380ce00 -1 osd.4 36214 log_to_monitors {default=true}

root@W4:~# journalctl -u ceph-osd@12.service
-- Logs begin at Tue 2021-01-12 08:49:05 SAST, end at Tue 2021-01-12 12:12:14 SAST. --
Jan 12 08:49:10 W4 systemd[1]: Starting Ceph object storage daemon osd.12...
Jan 12 08:49:10 W4 systemd[1]: Started Ceph object storage daemon osd.12.
Jan 12 08:49:31 W4 ceph-osd[1987]: 2021-01-12T08:49:31.312+0200 7fe4594dce00 -1 osd.12 36214 log_to_monitors {default=true}
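
If it helps, the running daemon versions can be compared with the standard command, to confirm everything is actually on Octopus:

ceph versions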
 
Is it possible to purge Ceph but keep the OSDs, then try to install the previous version of Ceph with the OSDs?
 
Can you restart one of the OSDs and check its state and log?

Is it possible to purge Ceph but keep the OSDs, then try to install the previous version of Ceph with the OSDs?
No. That is not how it works. :)
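
To restart and inspect a single OSD, something along these lines (substitute the OSD id):

systemctl restart ceph-osd@2.service
systemctl status ceph-osd@2.service
journalctl -u ceph-osd@2.service --since "10 minutes ago"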
 
OK, I have restarted OSD 2 on its host; the service is active (running), but the OSD still shows as down. The journal message is the same as before. What do you mean by the log of it?

I thought so, just had to ask. Thanks in advance for the help.
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!