After updating to Ceph 18.2.2, OSDs never start

Nexsol

Apr 7, 2024
Hi there.

Please, someone help me!
I'm running a 7-node PVE cluster with an enterprise subscription (Cluster A).
I'm also running another PVE cluster for testing on the no-subscription repository (Cluster B).
I updated the test cluster and noticed that Ceph 18.2.2 had been released on the Proxmox no-subscription repository.
Cluster B seems to be OK.
I wanted to update only Ceph on Cluster A (the cluster with the enterprise subscription).
So I enabled the no-subscription Ceph repository on each node and updated Ceph from 18.2.1 to 18.2.2 on every node of Cluster A.
Then I restarted each MON and MGR; they are all on 18.2.2 now.

Next I restarted one OSD on one host (host1) to update it to 18.2.2, but it couldn't start and is still on 18.2.1.

I tried another OSD on the same host, but it didn't start either.


So I rebooted host1, but none of its OSDs came back up.
I have tried many times, but the OSDs on host1 never start; the error message is "command '/bin/systemctl start ceph-osd@1' failed: exit code 1".

Sometimes when I press the OSD start button I get a "Done!" message, but the OSD stays in the "Down" state.

For now, all OSDs are still on 18.2.1. The OSDs on the other hosts are still running, but I'm sure they will stop as soon as I restart those hosts.
How can I get all OSDs updated and started again?
Thank you

Nexsol
 
Hi,

I updated from an earlier version to 18.2.2 yesterday and had no issues; I updated using the upgrade button in the UI.
I have 3 nodes. I updated node 1, rebooted, and checked that Ceph was OK / waited for it to converge.
I did the same for node 2, and once it had converged, the same for node 3.

Have you tried updating all packages on one node and rebooting it?

Also, a tip: never update all nodes at the same time.
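
For reference, a rough CLI sketch of that per-node flow (the UI upgrade button does essentially the same thing; this is a sketch, not an exact transcript of what the UI runs):

Bash:
# On one node at a time:
apt update && apt full-upgrade   # pull in the new Ceph packages on this node
reboot                           # or restart the Ceph daemons individually

# After the node is back up, wait for the cluster to converge
# before touching the next node
ceph -s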
 
Same issue here - MONs and OSDs are both failing with similar error messages; it looks like it is related to the pg upmap feature.

Downgrading to 18.2.1-pve2 fixes the issue

Code:
Apr 08 13:19:11 pve-1 systemd[1]: Started ceph-mon@pve-1.service - Ceph cluster monitor daemon.
Apr 08 13:19:11 pve-1 ceph-mon[1424723]: ./src/osd/OSDMap.cc: In function 'void OSDMap::encode(ceph::buffer::v15_2_0::list&, uint64_t) const' thread 71732ffd2d40 time 2024-04-08T13:19:11.296346+0200
Apr 08 13:19:11 pve-1 ceph-mon[1424723]: ./src/osd/OSDMap.cc: 3242: FAILED ceph_assert(pg_upmap_primaries.empty())
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12a) [0x717331c777d9]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  2: /usr/lib/ceph/libceph-common.so.2(+0x29d974) [0x717331c77974]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  3: (OSDMap::encode(ceph::buffer::v15_2_0::list&, unsigned long) const+0x1187) [0x71733210e027]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  4: (OSDMonitor::update_from_paxos(bool*)+0x706) [0x5f95346d6746]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  5: (Monitor::refresh_from_paxos(bool*)+0x10c) [0x5f95344d33dc]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  6: (Monitor::preinit()+0x9ac) [0x5f95344ffc4c]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  7: main()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  8: /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x71733128724a]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  9: __libc_start_main()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  10: _start()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]: *** Caught signal (Aborted) **
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  in thread 71732ffd2d40 thread_name:ceph-mon
Apr 08 13:19:11 pve-1 ceph-mon[1424723]: 2024-04-08T13:19:11.293+0200 71732ffd2d40 -1 ./src/osd/OSDMap.cc: In function 'void OSDMap::encode(ceph::buffer::v15_2_0::list&, uint64_t) const' thread 71732ffd2d40 time 2024-04-08T13:19:11.296346+0200
Apr 08 13:19:11 pve-1 ceph-mon[1424723]: ./src/osd/OSDMap.cc: 3242: FAILED ceph_assert(pg_upmap_primaries.empty())
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12a) [0x717331c777d9]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  2: /usr/lib/ceph/libceph-common.so.2(+0x29d974) [0x717331c77974]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  3: (OSDMap::encode(ceph::buffer::v15_2_0::list&, unsigned long) const+0x1187) [0x71733210e027]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  4: (OSDMonitor::update_from_paxos(bool*)+0x706) [0x5f95346d6746]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  5: (Monitor::refresh_from_paxos(bool*)+0x10c) [0x5f95344d33dc]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  6: (Monitor::preinit()+0x9ac) [0x5f95344ffc4c]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  7: main()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  8: /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x71733128724a]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  9: __libc_start_main()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  10: _start()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x71733129c050]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  2: /lib/x86_64-linux-gnu/libc.so.6(+0x8ae2c) [0x7173312eae2c]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  3: gsignal()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  4: abort()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x185) [0x717331c77834]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  6: /usr/lib/ceph/libceph-common.so.2(+0x29d974) [0x717331c77974]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  7: (OSDMap::encode(ceph::buffer::v15_2_0::list&, unsigned long) const+0x1187) [0x71733210e027]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  8: (OSDMonitor::update_from_paxos(bool*)+0x706) [0x5f95346d6746]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  9: (Monitor::refresh_from_paxos(bool*)+0x10c) [0x5f95344d33dc]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  10: (Monitor::preinit()+0x9ac) [0x5f95344ffc4c]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  11: main()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  12: /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x71733128724a]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  13: __libc_start_main()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  14: _start()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]: 2024-04-08T13:19:11.297+0200 71732ffd2d40 -1 *** Caught signal (Aborted) **
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  in thread 71732ffd2d40 thread_name:ceph-mon
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x71733129c050]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  2: /lib/x86_64-linux-gnu/libc.so.6(+0x8ae2c) [0x7173312eae2c]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  3: gsignal()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  4: abort()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x185) [0x717331c77834]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  6: /usr/lib/ceph/libceph-common.so.2(+0x29d974) [0x717331c77974]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  7: (OSDMap::encode(ceph::buffer::v15_2_0::list&, unsigned long) const+0x1187) [0x71733210e027]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  8: (OSDMonitor::update_from_paxos(bool*)+0x706) [0x5f95346d6746]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  9: (Monitor::refresh_from_paxos(bool*)+0x10c) [0x5f95344d33dc]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  10: (Monitor::preinit()+0x9ac) [0x5f95344ffc4c]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  11: main()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  12: /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x71733128724a]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  13: __libc_start_main()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  14: _start()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:     -1> 2024-04-08T13:19:11.293+0200 71732ffd2d40 -1 ./src/osd/OSDMap.cc: In function 'void OSDMap::encode(ceph::buffer::v15_2_0::list&, uint64_t) const' thread 71732ffd2d40 time 2024-04-08T13:19:11.296346+0200
Apr 08 13:19:11 pve-1 ceph-mon[1424723]: ./src/osd/OSDMap.cc: 3242: FAILED ceph_assert(pg_upmap_primaries.empty())
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12a) [0x717331c777d9]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  2: /usr/lib/ceph/libceph-common.so.2(+0x29d974) [0x717331c77974]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  3: (OSDMap::encode(ceph::buffer::v15_2_0::list&, unsigned long) const+0x1187) [0x71733210e027]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  4: (OSDMonitor::update_from_paxos(bool*)+0x706) [0x5f95346d6746]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  5: (Monitor::refresh_from_paxos(bool*)+0x10c) [0x5f95344d33dc]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  6: (Monitor::preinit()+0x9ac) [0x5f95344ffc4c]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  7: main()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  8: /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x71733128724a]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  9: __libc_start_main()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  10: _start()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:      0> 2024-04-08T13:19:11.297+0200 71732ffd2d40 -1 *** Caught signal (Aborted) **
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  in thread 71732ffd2d40 thread_name:ceph-mon
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x71733129c050]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  2: /lib/x86_64-linux-gnu/libc.so.6(+0x8ae2c) [0x7173312eae2c]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  3: gsignal()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  4: abort()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x185) [0x717331c77834]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  6: /usr/lib/ceph/libceph-common.so.2(+0x29d974) [0x717331c77974]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  7: (OSDMap::encode(ceph::buffer::v15_2_0::list&, unsigned long) const+0x1187) [0x71733210e027]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  8: (OSDMonitor::update_from_paxos(bool*)+0x706) [0x5f95346d6746]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  9: (Monitor::refresh_from_paxos(bool*)+0x10c) [0x5f95344d33dc]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  10: (Monitor::preinit()+0x9ac) [0x5f95344ffc4c]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  11: main()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  12: /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x71733128724a]
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  13: __libc_start_main()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  14: _start()
Apr 08 13:19:11 pve-1 ceph-mon[1424723]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Downgrade with:
Bash:
apt install ceph=18.2.1-pve2 ceph-mon=18.2.1-pve2 ceph-mgr=18.2.1-pve2 ceph-osd=18.2.1-pve2 ceph-mds=18.2.1-pve2 ceph-base=18.2.1-pve2 librados2=18.2.1-pve2 ceph-mgr-modules-core=18.2.1-pve2 libradosstriper1=18.2.1-pve2 libsqlite3-mod-ceph=18.2.1-pve2 librbd1=18.2.1-pve2 ceph-common=18.2.1-pve2 libcephfs2=18.2.1-pve2 python3-cephfs=18.2.1-pve2 python3-ceph-argparse=18.2.1-pve2 python3-ceph-common=18.2.1-pve2 python3-rados=18.2.1-pve2 python3-rbd=18.2.1-pve2 python3-rgw=18.2.1-pve2 librgw2=18.2.1-pve2 ceph-fuse=18.2.1-pve2 ceph-volume=18.2.1-pve2
 
I've resolved the issue by reverting the upmaps, doing the following:

ceph pg dump -> note all upmapped PGs
ceph osd rm-pg-upmap-items <pgnum from previous command>

After all upmaps were gone, the monitor service came back to life on 18.2.2. The OSDs still crashed, so I downgraded that one node to 18.2.1 again and waited for Ceph to become healthy; after that the update went flawlessly and the OSDs started again as they should.
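
As a rough sketch, the current upmap overrides can also be read straight from the OSD map before removing them (the grep pattern and the PG id 2.7 below are purely illustrative; check your own output):

Bash:
# List current upmap overrides in the OSD map
ceph osd dump | grep -E 'pg_upmap'

# Remove the override for a single PG (2.7 is a placeholder PG id)
ceph osd rm-pg-upmap-items 2.7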
 
I upgraded one node to 18.2.2 and got this error. The monitor and OSDs on this node do not start.
What is the easiest way of downgrading to the previous release?

Never mind, I should read better.
 
The output of the following three commands would also be interesting:

Code:
ceph status
ceph features
ceph balancer status

Downgrading back to 18.2.1-pve2 worked, but I would like to get this issue solved or find a workaround so that I can update.
See below for the requested info.


ceph status
Code:
  cluster:
    id:     07f85e29-1217-46f1-a392-5cccfe47cd8c
    health: HEALTH_WARN
            Module 'dashboard' has failed dependency: PyO3 modules may only be initialized once per interpreter process

  services:
    mon: 3 daemons, quorum pve3,pve2,pve4 (age 4d)
    mgr: pve3(active, since 6d), standbys: pve1, pve2
    mds: 1/1 daemons up, 2 standby
    osd: 15 osds: 15 up (since 4d), 15 in (since 2w)

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 369 pgs
    objects: 1.87M objects, 7.1 TiB
    usage:   22 TiB used, 19 TiB / 42 TiB avail
    pgs:     368 active+clean
             1   active+clean+scrubbing+deep

  io:
    client:   341 B/s rd, 104 KiB/s wr, 0 op/s rd, 13 op/s wr


ceph features
JSON:
{
    "mon": [
        {
            "features": "0x3f01cfbffffdffff",
            "release": "luminous",
            "num": 3
        }
    ],
    "mds": [
        {
            "features": "0x3f01cfbffffdffff",
            "release": "luminous",
            "num": 3
        }
    ],
    "osd": [
        {
            "features": "0x3f01cfbffffdffff",
            "release": "luminous",
            "num": 15
        }
    ],
    "client": [
        {
            "features": "0x2f018fb87aa4aafe",
            "release": "luminous",
            "num": 4
        },
        {
            "features": "0x3f01cfbffffdffff",
            "release": "luminous",
            "num": 16
        }
    ],
    "mgr": [
        {
            "features": "0x3f01cfbffffdffff",
            "release": "luminous",
            "num": 3
        }
    ]
}

ceph balancer status
JSON:
{
    "active": true,
    "last_optimize_duration": "0:00:00.001862",
    "last_optimize_started": "Wed May  1 12:28:18 2024",
    "mode": "upmap",
    "no_optimization_needed": true,
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
    "plans": []
}
 
I had the exact same thing happen to me when I first tried updating a node to 18.2.2-pve1. I was seeing the same error messages in the logs. Downgrading that node back to 18.2.1-pve2 as described by blade5502 got that node's MON and OSDs back online. For reference, I have the balancer mode set to upmap and I have previously used the osdmaptool utility to offline balance as described here and here.

ceph status
Code:
  cluster:
    id:     ***
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ***,***,*** (age 29m)
    mgr: ***(active, since 106m)
    osd: 15 osds: 15 up (since 28m), 15 in (since 6M)

  data:
    pools:   3 pools, 641 pgs
    objects: 306.78k objects, 1.1 TiB
    usage:   3.1 TiB used, 4.5 TiB / 7.6 TiB avail
    pgs:     641 active+clean

  io:
    client:   11 KiB/s rd, 467 KiB/s wr, 1 op/s rd, 46 op/s wr

ceph features
Code:
{
    "mon": [
        {
            "features": "0x3f01cfbffffdffff",
            "release": "luminous",
            "num": 3
        }
    ],
    "osd": [
        {
            "features": "0x3f01cfbffffdffff",
            "release": "luminous",
            "num": 15
        }
    ],
    "client": [
        {
            "features": "0x3f01cfbffffdffff",
            "release": "luminous",
            "num": 15
        }
    ],
    "mgr": [
        {
            "features": "0x3f01cfbffffdffff",
            "release": "luminous",
            "num": 1
        }
    ]
}

ceph balancer status
Code:
{
    "active": true,
    "last_optimize_duration": "0:00:00.002891",
    "last_optimize_started": "Mon May 6 08:11:42 2024",
    "mode": "upmap",
    "no_optimization_needed": true,
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
    "plans": []
}

Before updating ceph again, I removed all pg-upmap-items and pg-upmap-primary entries from my cluster. I also disabled the balancer. After doing all of that, I was able to successfully update the node that had previously failed. All of the other nodes were also updated successfully with no issue.

I have not tried to reenable the balancer yet, but I plan to do that during the next maintenance window (just in case it causes the whole cluster to crash).
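
For reference, re-enabling it later should just be a matter of the following (assuming the mode is still meant to be upmap):

Bash:
ceph balancer mode upmap
ceph balancer on
ceph balancer status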
 
I had the exact same thing happen to me when I first tried updating a node to 18.2.2-pve1. I was seeing the same error messages in the logs. Downgrading that node back to 18.2.1-pve2 as described by blade5502 got that node's MON and OSDs back online. For reference, I have the balancer mode set to upmap and I have previously used the osdmaptool utility to offline balance as described here and here.

Before updating ceph again, I removed all pg-upmap-items and pg-upmap-primary entries from my cluster. I also disabled the balancer. After doing all of that, I was able to successfully update the node that had previously failed. All of the other nodes were also updated successfully with no issue.

I have not tried to reenable the balancer yet, but I plan to do that during the next maintenance window (just in case it causes the whole cluster to crash).
Can you elaborate on how you removed the pg-upmap entries?
I'm pretty sure I have also used the osdmaptool and set the balancer to upmap in the past.
 
Fixed it.
ceph balancer off
ceph osd dump
ceph osd rm-pg-upmap-primary <each upmap primary id from dump>
ceph osd rm-pg-upmap-items <each upmap item from dump>
I waited until the cluster had finished the backfills and then upgraded; it appears to work.
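
A scripted version of those steps, as a minimal sketch (the awk field positions assume the default plain-text layout of `ceph osd dump`, e.g. "pg_upmap_items 2.7 [12,5]" and "pg_upmap_primary 2.7 12"; verify against your own dump first):

Bash:
# Keep the balancer from recreating the upmaps while we remove them
ceph balancer off

# Remove every pg_upmap_items entry
for pg in $(ceph osd dump | awk '$1 == "pg_upmap_items" {print $2}'); do
    ceph osd rm-pg-upmap-items "$pg"
done

# Remove every pg_upmap_primary entry
for pg in $(ceph osd dump | awk '$1 == "pg_upmap_primary" {print $2}'); do
    ceph osd rm-pg-upmap-primary "$pg"
done

# Wait for the resulting backfill to finish before upgrading
ceph status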
 
Fixed it.
ceph balancer off
ceph osd dump
ceph osd rm-pg-upmap-primary <each upmap primary id from dump>
ceph osd rm-pg-upmap-items <each upmap item from dump>
I waited until the cluster had finished the backfills and then upgraded; it appears to work.

This is exactly what I did to fix my cluster. I've turned the balancer back on as well and everything seems fine.
 
Is it safe to execute the commands

ceph osd rm-pg-upmap-primary
ceph osd rm-pg-upmap-items

in a production environment? I mean, won't they remove the cluster data from the virtual machines?
 
Is it safe to execute the commands

ceph osd rm-pg-upmap-primary
ceph osd rm-pg-upmap-items

in a production environment? I mean, won't they remove the cluster data from the virtual machines?
Those commands do not remove any data. They just remove the upmap entries from the osdmap that override the default placement of those particular PGs.

After removing the upmaps, the cluster will backfill those PGs back to their normal locations. You can watch the progress of the backfill by running ceph status.
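
If you prefer a continuous view instead of re-running the command by hand, something like this works (the 5-second interval is arbitrary):

Bash:
# Refresh ceph status every few seconds until the backfill is done
watch -n 5 ceph status

# Or stream cluster events, including recovery/backfill progress, as they happen
ceph -w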
 
Fixed it.
ceph balancer off
ceph osd dump
ceph osd rm-pg-upmap-primary <each upmap primary id from dump>
ceph osd rm-pg-upmap-items <each upmap item from dump>
I waited until the cluster had finished the backfills and then upgraded; it appears to work.
Sorry for the late reply.
I gave up, recreated the Ceph cluster, and recovered all VMs from backup.
I should have tried your advice.
Thanks a lot!
 
Thanks, guys!
I gave up and recreated the cluster again.
But I really appreciate this forum.
Thanks again!
 
