upgrade from Ceph Pacific to Quincy failed

hrghope

I want to upgrade from PVE 7 to PVE 8, so I upgraded Ceph first, following the help doc: https://pve.proxmox.com/wiki/Ceph_Pacific_to_Quincy

The upgrade process looked OK, but when I restarted the OSDs, most of them were in but down.
[screenshot]

There is no log more helpful than this:

2023-12-16T04:16:40.237470+0800 mon.pve (mon.2) 12703 : cluster [INF] disallowing boot of quincy+ OSD osd.0 v2:192.168.3.5:6808/720073 because require_osd_release < octopus

It seems the cluster is limiting the OSD release: once an OSD is upgraded to Quincy, the mon refuses to let it boot.
When I run this command:
ceph osd require-osd-release quincy
it gets stuck and all the mons die.
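(For reference, the current requirement and the versions the running daemons report can be checked with standard commands before trying to raise the release, e.g.:

ceph osd dump | grep require_osd_release
ceph versions

Both outputs also appear further down in this thread.)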

This is my mon dump:
[screenshot]

This is my osd dump (I omitted some pool info):
[screenshot]

Some other logs about the OSD release:

2023-12-16T04:59:59.999+0800 7f83cf1dd700 0 log_channel(cluster) log [WRN] : [WRN] OSD_UPGRADE_FINISHED: all OSDs are running pacific or later but require_osd_release < pacific
2023-12-16T04:59:59.999+0800 7f83cf1dd700 0 log_channel(cluster) log [WRN] : all OSDs are running pacific or later but require_osd_release < pacific

2023-12-16T05:57:55.469+0800 7f262efbf3c0 0 osd.3 2065970 crush map has features 288514119978713088, adjusting msgr requires for clients
2023-12-16T05:57:55.469+0800 7f262efbf3c0 0 osd.3 2065970 crush map has features 288514119978713088 was 8705, adjusting msgr requires for mons
2023-12-16T05:57:55.469+0800 7f262efbf3c0 0 osd.3 2065970 crush map has features 3314933069571702784, adjusting msgr requires for osds
2023-12-16T05:57:55.469+0800 7f262efbf3c0 1 osd.3 2065970 check_osdmap_features require_osd_release unknown -> nautilus


Can anyone help me? Will I lose my data?
Thanks a lot.

My PVE version:
pve-manager/7.4-17/513c62be (running kernel: 5.15.131-2-pve)
root@pve6:~# pveversions
-bash: pveversions: command not found
root@pve6:~# pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.15.131-2-pve)
pve-manager: 7.4-17 (running version: 7.4-17/513c62be)
pve-kernel-5.15: 7.4-9
pve-kernel-5.13: 7.1-9
pve-kernel-5.4: 6.4-6
pve-kernel-5.15.131-2-pve: 5.15.131-3
pve-kernel-5.15.126-1-pve: 5.15.126-1
pve-kernel-5.15.116-1-pve: 5.15.116-1
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 17.2.6-pve1
ceph-fuse: 17.2.6-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.4-1
proxmox-backup-file-restore: 2.4.4-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-5
pve-firmware: 3.6-6
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.14-pve1

Ceph versions:

root@pve6:~# ceph versions
{
"mon": {
"ceph version 17.2.6 (995dec2cdae920da21db2d455e55efbc339bde24) quincy (stable)": 3
},
"mgr": {
"ceph version 17.2.6 (995dec2cdae920da21db2d455e55efbc339bde24) quincy (stable)": 3
},
"osd": {
"ceph version 16.2.13 (b81a1d7f978c8d41cf452da7af14e190542d2ee2) pacific (stable)": 2
},
"mds": {},
"overall": {
"ceph version 16.2.13 (b81a1d7f978c8d41cf452da7af14e190542d2ee2) pacific (stable)": 2,
"ceph version 17.2.6 (995dec2cdae920da21db2d455e55efbc339bde24) quincy (stable)": 6
}
}

Why are my OSDs' versions not Quincy?
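(Note: "ceph versions" only counts daemons that are currently up and registered. The Quincy OSDs that are refused boot don't show up at all, and any OSD that has not been restarted since the package upgrade still reports the old binary. To check what a particular daemon or node would actually run, something like:

ceph tell osd.1 version      # only answers for OSDs that are up
ceph-osd --version           # run locally on the node, shows the installed binary

should do; ceph-osd --version reflects the installed package, not the running process.)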
 

It also looks to me like a step was missed somewhere. I have upgraded several times and have always had success with the instructions. I therefore always recommend reading the instructions carefully and completely before even laying a hand on the infrastructure.

It is also important that you first update and restart the Mons and then the OSDs themselves. Only when everything is at the current version can you set the minimum version on the cluster. But also remember that all clients must be up to date - if this is not the case, it can lead to problems.
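Roughly, the order from the guide looks like this (only a sketch; follow the wiki for the exact steps and repository changes):

systemctl restart ceph-mon.target        # on each node, one after the other
ceph mon dump | grep min_mon_release     # should report quincy once all mons are restarted
systemctl restart ceph-mgr.target        # on each node
systemctl restart ceph-osd.target        # node by node, waiting for the cluster to recover in between
ceph osd require-osd-release quincy      # only once every daemon runs quincy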
 
I'm sure I followed the upgrade doc strictly; the doc is https://pve.proxmox.com/wiki/Ceph_Pacific_to_Quincy

Only when everything is at the current version can you set the minimum version on the cluster.

I was confused by the result. After I upgraded by running "apt full-upgrade":
ceph: 17.2.6-pve1
ceph-fuse: 17.2.6-pve1
the Ceph package version was OK, but "ceph versions" showed:
"osd": {
"ceph version 16.2.13 (b81a1d7f978c8d41cf452da7af14e190542d2ee2) pacific (stable)": 2
},
It seems something is wrong?

But also remember that all clients must be up to date
I only use Ceph as PVE storage.
 
ceph osd require-osd-release octopus probably doesn't work either?

I only use Ceph as PVE storage.
Anyone who wants anything from Ceph is a client and connects with a specific version. For example, I have an SNMP collector running to collect metrics, which is also connected directly to Ceph - this is also a client and needs to be upgraded. Accordingly, your nodes are all clients too. So you have to bring all the nodes to the same state before you change it.

Just as a side note: the same applies if, for example, other clusters use the same storage or CephFS is used.
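If it helps: Ceph can show which releases/features the currently connected clients and daemons advertise with

ceph features

which groups them by release, so outdated clients are easy to spot.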
 
ceph osd require-osd-release octopus probably doesn't work either?



Yes, it got stuck too.
[screenshot]


One of the mons died; the log shows:
[screenshot]
 

ceph osd set-require-osd-release octopus osd.ID

Can you try setting it explicitly for an OSD? Ideally, one that is running and one that is no longer running.

Can you repost the output of ceph osd dump? You still had jewel and nautilus in there, I would be interested to know whether that has changed in the meantime due to the execution of the commands or not.
 
[screenshot]
It seems it can't be set for a specific OSD.

The osd dump:
epoch 2065994
fsid 3c48698c-fb70-4a69-a6a6-4d50d9c484bd
created 2021-02-06T19:22:24.572099+0800
modified 2023-12-17T06:30:57.608593+0800
flags nodown,noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub,sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 262
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client jewel
min_compat_client jewel
require_osd_release nautilus
stretch_mode_enabled false
pool 1 'ceph_data_l0' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 3713 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
removed_snaps [1~3]
pool 5 'ceph_data_l2' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 3713 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
removed_snaps [1~f]
pool 6 'ceph_work_l2' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 2065705 lfor 0/0/1154 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
removed_snaps [1~3,5~5,c~9,17~5,1d~3,21~2,24~6,2b~12,3f~4,44~c,53~f,63~2,67~7,6f~1,71~2,74~1]
pool 9 'ceph_nasdata_l1' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 3713 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
removed_snaps [1~3]
pool 13 'ceph_chia' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 3713 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
removed_snaps [1~7]
pool 14 'ceph_chia_temp' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 3713 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
removed_snaps [1~3]
pool 15 'k8s_l0' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 3713 flags hashpspool,selfmanaged_snaps stripe_width 0
removed_snaps [1~3]
pool 16 'k8s_l2' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 3713 flags hashpspool,selfmanaged_snaps stripe_width 0
removed_snaps [1~3]
pool 17 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 2065680 flags hashpspool stripe_width 0 pg_num_min 1 application mgr,mgr_devicehealth
pool 18 'ceph_work_test' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1788989 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
removed_snaps [1~13]
pool 21 'ceph_nasdata' erasure profile default size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 245072 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 8192
removed_snaps [1~3]
pool 22 'ceph_nasdata_meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 246825 flags hashpspool stripe_width 0 application rbd
max_osd 9
osd.0 down in weight 1 up_from 2065425 up_thru 2065702 down_at 2065949 last_clean_interval [2030487,2065413) [v2:192.168.3.5:6804/1562,v1:192.168.3.5:6805/1562] [v2:172.16.1.5:6804/1562,v1:172.16.1.5:6805/1562] exists ce93fd01-0bed-4f8b-a33d-9f10bb1b8c8f
osd.1 up in weight 1 up_from 2065432 up_thru 2065703 down_at 2065414 last_clean_interval [2030494,2065413) [v2:192.168.3.5:6800/1555,v1:192.168.3.5:6801/1555] [v2:172.16.1.5:6800/1555,v1:172.16.1.5:6801/1555] exists,up 843d105a-aebc-45c4-86c7-d97f101c8034
osd.2 up in weight 1 up_from 2065432 up_thru 2065930 down_at 2065414 last_clean_interval [2030494,2065413) [v2:192.168.3.5:6808/1558,v1:192.168.3.5:6809/1558] [v2:172.16.1.5:6808/1558,v1:172.16.1.5:6809/1558] exists,up f29395db-651d-4bed-a2a7-5836573c24f7
osd.3 down in weight 1 up_from 2030502 up_thru 2065435 down_at 2065682 last_clean_interval [1893743,2030475) [v2:192.168.3.8:6808/1541,v1:192.168.3.8:6809/1541] [v2:172.16.1.8:6808/1541,v1:172.16.1.8:6809/1541] exists 53e4ce6a-f4d6-4ae4-a091-753e43fafa0f
osd.4 down in weight 1 up_from 2030491 up_thru 2065434 down_at 2065682 last_clean_interval [1893745,2030475) [v2:192.168.3.8:6800/1546,v1:192.168.3.8:6801/1546] [v2:172.16.1.8:6800/1546,v1:172.16.1.8:6801/1546] exists be4cf232-80e9-4ded-805d-bf72c55a3c2d
osd.5 down in weight 1 up_from 2030499 up_thru 2065682 down_at 2065684 last_clean_interval [1641889,2030475) [v2:192.168.3.6:6804/1509,v1:192.168.3.6:6805/1509] [v2:172.16.1.6:6804/1509,v1:172.16.1.6:6805/1509] exists 2abf28b1-4a7f-4c01-a692-54f646ae20ef
osd.6 down in weight 1 up_from 2030494 up_thru 2065682 down_at 2065684 last_clean_interval [1641887,2030475) [v2:192.168.3.6:6808/1504,v1:192.168.3.6:6809/1504] [v2:172.16.1.6:6808/1504,v1:172.16.1.6:6809/1504] exists d06692c4-e9b2-4789-924c-9ad7e6274bf8
osd.7 down in weight 1 up_from 2030495 up_thru 2065432 down_at 2065682 last_clean_interval [1893747,2030475) [v2:192.168.3.8:6804/1548,v1:192.168.3.8:6805/1548] [v2:172.16.1.8:6804/1548,v1:172.16.1.8:6805/1548] exists 99c11298-d70c-4244-a03b-d43a0335e186
osd.8 down in weight 1 up_from 2030491 up_thru 2065682 down_at 2065684 last_clean_interval [1641884,2030475) [v2:192.168.3.6:6800/1505,v1:192.168.3.6:6801/1505] [v2:172.16.1.6:6800/1505,v1:172.16.1.6:6801/1505] exists fb29028c-03c3-410a-90f0-59935f54a55d
 
I am confused.

When I run
ceph osd set-require-osd-release octopus

the log shows:
./src/mon/OSDMonitor.cc: 11631: FAILED ceph_assert(osdmap.require_osd_release >= ceph_release_t::octopus)
Dec 18 17:28:04 pve8 ceph-mon[3143740]: ceph version 17.2.6 (995dec2cdae920da21db2d455e55efbc339bde24) quincy (stable)

The log shows version quincy, but the assert fails even though quincy >= octopus?
 
Please take a look at the log files from CEPH and, if necessary, syslog to see if there is anything there as to why the command takes so long to execute.

Was the cluster freshly installed with version 16, or has it been upgraded from older releases over time?
 
The command "ceph osd set-require-osd-release octopus" executes quickly, but the mons go down; maybe that is why it looks like the command takes a long time. The mon log shows:
[screenshot]
./src/mon/OSDMonitor.cc: 11631: FAILED ceph_assert(osdmap.require_osd_release >= ceph_release_t::octopus)
Dec 18 17:28:04 pve8 ceph-mon[3143740]: ceph version 17.2.6 (995dec2cdae920da21db2d455e55efbc339bde24) quincy (stable)

After the failed upgrade, I did not reinstall anything.
 
Have you tried restarting the mons? What do you see under journalctl from the mons?
 
I have tried restarting the mons many times.

I tried running commands like "journalctl -f -u ceph-mon@pve.service" on every node. The output was nothing interesting, like:
Dec 18 17:30:40 pve ceph-mon[374562]: 2023-12-18T17:30:40.278+0800 7f618a4be700 -1 mon.pve@2(probing) e16 get_health_metrics reporting 673 slow ops, oldest is log(1 entries from seq 139286 at 2023-12-18T17:27:01.950093+0800)
Dec 18 17:30:45 pve ceph-mon[374562]: 2023-12-18T17:30:45.278+0800 7f618a4be700 -1 mon.pve@2(probing) e16 get_health_metrics reporting 698 slow ops, oldest is log(1 entries from seq 139286 at 2023-12-18T17:27:01.950093+0800)
Dec 18 17:30:50 pve ceph-mon[374562]: 2023-12-18T17:30:50.278+0800 7f618a4be700 -1 mon.pve@2(probing) e16 get_health_metrics reporting 723 slow ops, oldest is log(1 entries from seq 139286 at 2023-12-18T17:27:01.950093+0800)
Dec 18 17:30:55 pve ceph-mon[374562]: 2023-12-18T17:30:55.278+0800 7f618a4be700 -1 mon.pve@2(probing) e16 get_health_metrics reporting 749 slow ops, oldest is log(1 entries from seq 139286 at 2023-12-18T17:27:01.950093+0800)
Dec 18 17:31:00 pve ceph-mon[374562]: 2023-12-18T17:31:00.282+0800 7f618a4be700 -1 mon.pve@2(probing) e16 get_health_metrics reporting 773 slow ops, oldest is log(1 entries from seq 139286 at 2023-12-18T17:27:01.950093+0800)
Dec 18 17:31:05 pve ceph-mon[374562]: 2023-12-18T17:31:05.282+0800 7f618a4be700 -1 mon.pve@2(probing) e16 get_health_metrics reporting 798 slow ops, oldest is log(1 entries from seq 139286 at 2023-12-18T17:27:01.950093+0800)
Dec 18 17:31:10 pve ceph-mon[374562]: 2023-12-18T17:31:10.282+0800 7f618a4be700 -1 mon.pve@2(probing) e16 get_health_metrics reporting 823 slow

I have 3 nodes: pve, pve6 and pve8. I ran commands like "journalctl -f -u ceph-mon@pve.service" on every node, then executed "ceph osd require-osd-release pacific".
The pve node got stuck, and the other nodes showed logs like:
Dec 18 17:41:49 pve6 ceph-mon[3064191]: ./src/mon/OSDMonitor.cc: 11631: FAILED ceph_assert(osdmap.require_osd_release >= ceph_release_t::octopus)
Dec 18 17:41:49 pve6 ceph-mon[3064191]: ceph version 17.2.6 (995dec2cdae920da21db2d455e55efbc339bde24) quincy (stable)
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x7f66add01282]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 2: /usr/lib/ceph/libceph-common.so.2(+0x25b420) [0x7f66add01420]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 3: (OSDMonitor::prepare_command_impl(boost::intrusive_ptr<MonOpRequest>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&)+0xbe9e) [0x556845980cee]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 4: (OSDMonitor::prepare_command(boost::intrusive_ptr<MonOpRequest>)+0x38f) [0x5568459912df]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 5: (OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x193) [0x55684599a883]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 6: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x2ce) [0x55684591a65e]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 7: (PaxosService::C_RetryMessage::_finish(int)+0x64) [0x55684585bd34]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 8: (Context::complete(int)+0x9) [0x5568457ecad9]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 9: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(ceph::common::CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa8) [0x556845818ae8]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 10: (Paxos::finish_round()+0x8e) [0x55684591369e]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 11: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x3d3) [0x5568459154b3]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 12: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x117b) [0x5568457ea6cb]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 13: (Monitor::_ms_dispatch(Message*)+0x40a) [0x5568457eacfa]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 14: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x59) [0x556845819e39]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 15: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x468) [0x7f66adf4d1d8]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 16: (DispatchQueue::entry()+0x5ef) [0x7f66adf4a8df]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 17: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f66ae00e1cd]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 18: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f66ad7f6ea7]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 19: clone()
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 0> 2023-12-18T17:41:49.602+0800 7f66a5b88700 -1 *** Caught signal (Aborted) **
Dec 18 17:41:49 pve6 ceph-mon[3064191]: in thread 7f66a5b88700 thread_name:ms_dispatch
Dec 18 17:41:49 pve6 ceph-mon[3064191]: ceph version 17.2.6 (995dec2cdae920da21db2d455e55efbc339bde24) quincy (stable)
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) [0x7f66ad802140]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 2: gsignal()
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 3: abort()
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x17e) [0x7f66add012dc]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 5: /usr/lib/ceph/libceph-common.so.2(+0x25b420) [0x7f66add01420]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 6: (OSDMonitor::prepare_command_impl(boost::intrusive_ptr<MonOpRequest>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&)+0xbe9e) [0x556845980cee]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 7: (OSDMonitor::prepare_command(boost::intrusive_ptr<MonOpRequest>)+0x38f) [0x5568459912df]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 8: (OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x193) [0x55684599a883]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 9: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x2ce) [0x55684591a65e]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 10: (PaxosService::C_RetryMessage::_finish(int)+0x64) [0x55684585bd34]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 11: (Context::complete(int)+0x9) [0x5568457ecad9]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 12: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(ceph::common::CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa8) [0x556845818ae8]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 13: (Paxos::finish_round()+0x8e) [0x55684591369e]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 14: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x3d3) [0x5568459154b3]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 15: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x117b) [0x5568457ea6cb]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 16: (Monitor::_ms_dispatch(Message*)+0x40a) [0x5568457eacfa]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 17: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x59) [0x556845819e39]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 18: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x468) [0x7f66adf4d1d8]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 19: (DispatchQueue::entry()+0x5ef) [0x7f66adf4a8df]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 20: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f66ae00e1cd]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 21: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f66ad7f6ea7]
Dec 18 17:41:49 pve6 ceph-mon[3064191]: 22: clone()
Dec 18 17:41:49 pve6 ceph-mon[3064191]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Dec 18 17:41:49 pve6 systemd[1]: ceph-mon@pve6.service: Main process exited, code=killed, status=6/ABRT
Dec 18 17:41:49 pve6 systemd[1]: ceph-mon@pve6.service: Failed with result 'signal'.
Dec 18 17:41:59 pve6 systemd[1]: ceph-mon@pve6.service: Scheduled restart job, restart counter is at 4.
Dec 18 17:41:59 pve6 systemd[1]: Stopped Ceph cluster monitor daemon.
Dec 18 17:41:59 pve6 systemd[1]: Started Ceph cluster monitor daemon.

There was nothing interesting in the mgr and OSD logs (/var/log/ceph), but ceph.log showed something like:
2023-12-18T21:23:25.658758+0800 mon.pve8 (mon.0) 477 : cluster [INF] Client client.admin marked osd.0 out, after it was down for 430 seconds
2023-12-18T21:23:26.529765+0800 mgr.pve (mgr.1127804197) 113258 : cluster [DBG] pgmap v113144: 609 pgs: 48 down, 180 peering, 381 unknown; 13 TiB data, 44 TiB used, 72 TiB / 116 TiB avail
2023-12-18T21:23:26.533990+0800 mon.pve8 (mon.0) 487 : cluster [WRN] Health check update: 6 osds down (OSD_DOWN)
2023-12-18T21:23:26.543548+0800 mon.pve8 (mon.0) 489 : cluster [DBG] osdmap e2065995: 9 total, 2 up, 8 in
2023-12-18T21:16:17.666675+0800 mon.pve8 (mon.0) 1 : cluster [INF] mon.pve8 calling monitor election
2023-12-18T21:16:17.886080+0800 mon.pve8 (mon.0) 2 : cluster [INF] mon.pve8 is new leader, mons pve8,pve6,pve in quorum (ranks 0,1,2)
2023-12-18T21:16:17.905158+0800 mon.pve8 (mon.0) 3 : cluster [DBG] monmap e16: 3 mons at {pve=[v2:192.168.3.5:3300/0,v1:192.168.3.5:6789/0],pve6=[v2:192.168.3.6:3300/0,v1:192.168.3.6:6789/0],pve8=[v2:192.168.3.8:3300/0,v1:192.168.3.8:6789/0]} removed_ranks: {}
2023-12-18T21:16:17.922234+0800 mon.pve8 (mon.0) 4 : cluster [DBG] fsmap
2023-12-18T21:16:17.922245+0800 mon.pve8 (mon.0) 5 : cluster [DBG] osdmap e2065994: 9 total, 2 up, 9 in
2023-12-18T21:16:17.923600+0800 mon.pve8 (mon.0) 6 : cluster [DBG] mgrmap e813: pve(active, since 2d), standbys: pve6, pve8
2023-12-18T21:16:17.933521+0800 mon.pve8 (mon.0) 7 : cluster [WRN] Health detail: HEALTH_WARN nodown,noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub flag(s) set; 7 osds down; 2 hosts (6 osds) down; all OSDs are running pacific or later but require_osd_release < pacific; Reduced data availability: 609 pgs inactive, 48 pgs down, 180 pgs peering; 3 pool(s) do not have an application enabled; 6 pool(s) have no replicas configured
2023-12-18T21:16:17.933548+0800 mon.pve8 (mon.0) 8 : cluster [WRN] [WRN] OSDMAP_FLAGS: nodown,noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub flag(s) set
2023-12-18T21:16:17.933557+0800 mon.pve8 (mon.0) 9 : cluster [WRN] [WRN] OSD_DOWN: 7 osds down
2023-12-18T21:16:17.933564+0800 mon.pve8 (mon.0) 10 : cluster [WRN] osd.0 (root=default,host=pve) is down
2023-12-18T21:16:17.933577+0800 mon.pve8 (mon.0) 11 : cluster [WRN] osd.3 (root=default,host=pve8) is down
2023-12-18T21:16:17.933584+0800 mon.pve8 (mon.0) 12 : cluster [WRN] osd.4 (root=default,host=pve8) is down
2023-12-18T21:16:17.933591+0800 mon.pve8 (mon.0) 13 : cluster [WRN] osd.5 (root=default,host=pve6) is down
2023-12-18T21:16:17.933598+0800 mon.pve8 (mon.0) 14 : cluster [WRN] osd.6 (root=default,host=pve6) is down
2023-12-18T21:16:17.933605+0800 mon.pve8 (mon.0) 15 : cluster [WRN] osd.7 (root=default,host=pve8) is down
2023-12-18T21:16:17.933612+0800 mon.pve8 (mon.0) 16 : cluster [WRN] osd.8 (root=default,host=pve6) is down
2023-12-18T21:16:17.933619+0800 mon.pve8 (mon.0) 17 : cluster [WRN] [WRN] OSD_HOST_DOWN: 2 hosts (6 osds) down
2023-12-18T21:16:17.933625+0800 mon.pve8 (mon.0) 18 : cluster [WRN] host pve8 (root=default) (3 osds) is down
2023-12-18T21:16:17.933632+0800 mon.pve8 (mon.0) 19 : cluster [WRN] host pve6 (root=default) (3 osds) is down
2023-12-18T21:16:17.933639+0800 mon.pve8 (mon.0) 20 : cluster [WRN] [WRN] OSD_UPGRADE_FINISHED: all OSDs are running pacific or later but require_osd_release < pacific
2023-12-18T21:16:17.933646+0800 mon.pve8 (mon.0) 21 : cluster [WRN] all OSDs are running pacific or later but require_osd_release < pacific
2023-12-18T21:16:17.933654+0800 mon.pve8 (mon.0) 22 : cluster [WRN] [WRN] PG_AVAILABILITY: Reduced data availability: 609 pgs inactive, 48 pgs down, 180 pgs peering

2023-12-18T21:15:10.338363+0800 mgr.pve (mgr.1127804197) 113002 : cluster [DBG] pgmap v112896: 609 pgs: 48 down, 180 peering, 381 unknown; 13 TiB data, 44 TiB used, 72 TiB / 116 TiB avail
2023-12-18T21:15:12.334073+0800 mon.pve6 (mon.1) 12235 : cluster [INF] disallowing boot of quincy+ OSD osd.3 v2:192.168.3.8:6800/187507 because require_osd_release < octopus
2023-12-18T21:15:12.339016+0800 mgr.pve (mgr.1127804197) 113003 : cluster [DBG] pgmap v112897: 609 pgs: 48 down, 180 peering, 381 unknown; 13 TiB data, 44 TiB used, 72 TiB / 116 TiB avail
2023-12-18T21:15:14.339964+0800 mgr.pve (mgr.1127804197) 113004 : cluster [DBG] pgmap v112898: 609 pgs: 48 down, 180 peering, 381 unknown; 13 TiB data, 44 TiB used, 72 TiB / 116 TiB avail
 
Some more information.

The issue may be the same as this one: https://tracker.ceph.com/issues/58156

I found another problem:
When I execute a command like "ceph osd require-osd-release pacific", the mon reports this error:
Dec 18 22:28:31 pve8 ceph-mon[3434097]: ./src/mon/OSDMonitor.cc: 11631: FAILED ceph_assert(osdmap.require_osd_release >= ceph_release_t::octopus)
Dec 18 22:28:31 pve8 ceph-mon[3434097]: ceph version 17.2.6 (995dec2cdae920da21db2d455e55efbc339bde24) quincy (stable)
But in the quincy branch (https://github.com/ceph/ceph/blob/quincy/src/mon/OSDMonitor.cc) there is no code like this:
[screenshot]
ceph_assert(osdmap.require_osd_release >= ceph_release_t::luminous);

But in the pacific code (https://github.com/ceph/ceph/blob/pacific-16.2.7_RC1/src/mon/OSDMonitor.cc):
[screenshot]
ceph_assert(osdmap.require_osd_release >= ceph_release_t::luminous);

ceph_assert(osdmap.require_osd_release >= ceph_release_t::octopus);

[screenshot]


So does a later quincy version fix the issue?
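(To compare against the tracker, the exact point release on each node can be checked with standard commands, e.g.:

ceph --version            # local binary, including the git commit
dpkg -l | grep ceph-mon   # installed package version on this node
ceph versions             # what the running daemons report cluster-wide
)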
 
I upgraded one node's Ceph to the newest version from the quincy test branch, and "ceph osd set-require-osd-release pacific" succeeded.

Problem resolved!
[screenshot]
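For anyone hitting the same assert, this is the rough sequence that is reported to have worked here, as a sketch only (the exact repository line for the Quincy test packages is in the Proxmox docs):

apt update && apt full-upgrade          # after switching the node to the newer quincy packages
systemctl restart ceph-mon.target       # restart the mons with the updated binary
ceph osd require-osd-release pacific    # reported to be accepted after the update
systemctl restart ceph-osd.target       # let the quincy OSDs boot again
ceph osd require-osd-release quincy     # finish raising the minimum
ceph osd unset noout                    # clear the flags set for the upgrade (repeat for the other flags)

Note that the upgrade guide uses the form "ceph osd require-osd-release", while some posts above use "set-require-osd-release".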
 
Was this ever resolved?
I ran into the same issue upgrading from Pacific to Quincy.
Thanks in advance.
 
