Ceph warning post-upgrade to v8

@spirit, can you maybe poke the mailing list to see if they can backport the fix to Quincy? I checked their information, and it seems the Ceph Reef (18.x) release is waiting on a backport for a bug that occurs when installing ca-certificates-java on Debian stable, which probably means Ceph 18 isn't coming to Proxmox anytime soon.
 
Is there any update on this?

I'm currently looking into this.

So far I haven't been able to locate the actual culprit; the only thing we know for sure is that it's supposedly fixed in the latest version of Ceph (as of writing this), according to a reply to what @spirit had posted. How it got fixed remains a mystery, however.

My current best guess is that there's a Python package that has another Python package as a transitive dependency which uses PyO3, as neither Ceph nor its own Python modules depend on it anywhere.

It will probably take a while for me to dig that up (and maybe it's a thing that's not even compatible with Debian Bookworm), but I'll keep you posted.
 
So, I got an update. Things are looking rather bleak as of right now, unfortunately.

I've managed to find a Python traceback in the systemd journal, which occurs if you try to enable the dashboard via ceph mgr module enable dashboard:

Code:
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]: 2023-09-04T18:39:51.438+0200 7fecdc91e000 -1 mgr[py] Traceback (most recent call last):
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:   File "/usr/share/ceph/mgr/dashboard/__init__.py", line 60, in <module>
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:     from .module import Module, StandbyModule  # noqa: F401
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:   File "/usr/share/ceph/mgr/dashboard/module.py", line 30, in <module>
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:     from .controllers import Router, json_error_page
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:   File "/usr/share/ceph/mgr/dashboard/controllers/__init__.py", line 1, in <module>
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:     from ._api_router import APIRouter
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:   File "/usr/share/ceph/mgr/dashboard/controllers/_api_router.py", line 1, in <module>
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:     from ._router import Router
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:   File "/usr/share/ceph/mgr/dashboard/controllers/_router.py", line 7, in <module>
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:     from ._base_controller import BaseController
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:   File "/usr/share/ceph/mgr/dashboard/controllers/_base_controller.py", line 11, in <module>
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:     from ..services.auth import AuthManager, JwtManager
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:   File "/usr/share/ceph/mgr/dashboard/services/auth.py", line 12, in <module>
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:     import jwt
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:   File "/lib/python3/dist-packages/jwt/__init__.py", line 1, in <module>
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:     from .api_jwk import PyJWK, PyJWKSet
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:   File "/lib/python3/dist-packages/jwt/api_jwk.py", line 6, in <module>
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:     from .algorithms import get_default_algorithms
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:   File "/lib/python3/dist-packages/jwt/algorithms.py", line 6, in <module>
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:     from .utils import (
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:   File "/lib/python3/dist-packages/jwt/utils.py", line 7, in <module>
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:     from cryptography.hazmat.primitives.asymmetric.ec import EllipticCurve
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:   File "/lib/python3/dist-packages/cryptography/hazmat/primitives/asymmetric/ec.py", line 11, in <module>
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:     from cryptography.hazmat._oid import ObjectIdentifier
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:   File "/lib/python3/dist-packages/cryptography/hazmat/_oid.py", line 7, in <module>
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]:     from cryptography.hazmat.bindings._rust import (
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]: ImportError: PyO3 modules may only be initialized once per interpreter process
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]: 2023-09-04T18:39:51.438+0200 7fecdc91e000 -1 mgr[py] Class not found in module 'dashboard'
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]: 2023-09-04T18:39:51.438+0200 7fecdc91e000 -1 mgr[py] Error loading module 'dashboard': (2) No such file or directory
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]: 2023-09-04T18:39:51.470+0200 7fecdc91e000 -1 mgr[py] Module progress has missing NOTIFY_TYPES member
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]: 2023-09-04T18:39:51.502+0200 7fecdc91e000 -1 mgr[py] Module iostat has missing NOTIFY_TYPES member
Sep 04 18:39:51 ceph-01 ceph-mgr[15669]: 2023-09-04T18:39:51.502+0200 7fecdc91e000 -1 log_channel(cluster) log [ERR] : Failed to load ceph-mgr modules: dashboard

This traceback reveals the dependency chain between the Ceph dashboard and PyO3.

To sum the dependencies up:
  • The Ceph dashboard uses PyJWT, a Python library for JSON Web Tokens, for authentication.
  • PyJWT in turn uses cryptography, a Python library for cryptographic primitives.
  • Part of cryptography's source is written in Rust, and in order to interoperate with that Rust code, it makes use of PyO3.
  • (That Rust code calls OpenSSL among other things, which is written in C, so it really is turtles all the way down.)
My initial question was whether the issue was because of PyO3 itself or how it was used, so I went down the dependency chain. I eventually found issue 9016 on cryptography's side, and it seems that the Ceph dashboard isn't the only thing that's affected by this.

Following issue 9016 all the way down, it seems that the AUR maintainer for Ceph has also stumbled across this while rebuilding the aur-ceph package for Ceph v18. The maintainer opened a separate issue specifically for this problem; check it out if you want all the details.


To summarize all the issues above, it seems like the change in PyO3 that @spirit had found is indeed the cause of all of this. Basically, if your Python application uses sub-interpreters, which Python supports in its C-API, any module that uses bindings made via PyO3 blows up and throws an ImportError if it's loaded more than once. The Ceph mgr uses such a sub-interpreter model for running its mgr modules.

In the case of Ceph, this makes perfect sense (from my point of view), as it's better to keep every module isolated: if module A crashes, you don't want it to take down modules B, C, D, etc. as well, but you still want to share some data (e.g. libraries) between them. (I'm not yet entirely sure what Ceph does there under the hood, but that's my assumption at least.)

However, to cut the PyO3 developers some slack, their decision does make sense on their end, because they'd otherwise have to redesign PyO3. Redesigning an entire library is quite a herculean effort, as you can imagine. Again, all the details can be found in the AUR maintainer's tracking issue.

So, what happens now? There are a couple of possible options.

I'm going to keep looking into this to see if I can at least fix it on our side, but such a fix probably wouldn't be permanent or would have to be maintained for future versions of Ceph. This, however, doesn't fix the root cause of PyO3 not allowing sub-interpreters.

So, another idea is to contribute to PyO3 upstream, as that wouldn't just fix it in our case, but would "trickle down" to all libraries that use PyO3. Since I'm experienced in both Python and Rust, I could probably bridge the gap there. That being said, I'm not too familiar with all the nitty-gritty details of the Python C-API and the internals of the CPython implementation, but it's nothing that I can't teach myself. I'm not sure what the PyO3 maintainers' stance on fixing this is yet, but we'll see.

I'm confident that a fix is possible, but I cannot yet say when it will come, or whether it will come from our side (specifically for PVE), from PyO3's side, or whether Ceph will work it out on their end. For what it's worth, I've also reported my findings to the Ceph mailing list as a response to the existing thread, so maybe it will gain some traction there. Maybe Ceph already has a fix for this, but since the AUR maintainer is struggling with the same issue on Ceph 18.2, I doubt they do.
 
Wow! Thanks so much for the detailed update @Max Carrara ! I (and I'm pretty sure everyone else in this thread) greatly appreciate all of the time you spent down this rabbit hole. Also thanks for all of the links and references! Truly above and beyond
 
Wow! Thanks so much for the detailed update @Max Carrara ! I (and I'm pretty sure everyone else in this thread) greatly appreciate all of the time you spent down this rabbit hole. Also thanks for all of the links and references! Truly above and beyond
You're welcome!

To give another short update: I'm now poking around PyO3 to see if I can contribute; I've offered my help on their side. There aren't really any other options left, in my opinion. For example, you can't really change Ceph's sub-interpreter model (that would require a big rewrite), and from my perspective you can't really "fix" it in the libraries that use PyO3 either. So, might as well address the whole thing at its root.

Side note, a little "fun fact" to add to my write-up above: Debian Bookworm actually ships python3-cryptography version 38, which shouldn't be affected, but it is built using PyO3 version 0.17, which does include the check that throws the ImportError above. Obviously the Debian Rust maintainers aren't to blame here, as they probably just went with the latest stable compatible version (or similar), which is very reasonable to do.

Either way, I still can't really promise anything, but hey, at least there's a plan now. I'll keep you posted, but don't expect too many updates. My gut tells me that this will probably take a while to get fixed and trickle downstream.
 
This is just a warning.
Ceph is running well, and so is the dashboard.

I am using the Ceph dashboard because Proxmox provides access to the Ceph status only to admin users.
With the Ceph dashboard, I can give read-only access to any user.
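For reference, this is roughly how such a read-only user is created via Ceph's dashboard access-control commands (sketch only; the user name viewer, the password, and the file path are examples I picked):

```shell
# The dashboard refuses passwords on the command line, so put it in a file first
echo -n 'example-password' > /tmp/dashboard-pw.txt

# Create a user with the built-in read-only role
ceph dashboard ac-user-create viewer -i /tmp/dashboard-pw.txt read-only

# Clean up the password file
rm /tmp/dashboard-pw.txt
```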
 
Something to add that I noticed: after this, trying to list images on the RBD pool gives errors. The Terraform provider also stopped working for that pool. (I created a new pool on the same disks, and everything works like a charm on the new pool, both Terraform and RBD image listing.)
 
Something to add that I noticed: after this, trying to list images on the RBD pool gives errors. The Terraform provider also stopped working for that pool. (I created a new pool on the same disks, and everything works like a charm on the new pool, both Terraform and RBD image listing.)
Is it related to this thread?
 
This is just a warning.
Ceph is running well, and so is the dashboard.

I am using the Ceph dashboard because Proxmox provides access to the Ceph status only to admin users.
With the Ceph dashboard, I can give read-only access to any user.
The dashboard does not work for me. Because of this "warning", the dashboard does not activate:

Code:
ceph mgr module ls
MODULE
balancer           on (always on)
crash              on (always on)
devicehealth       on (always on)
orchestrator       on (always on)
pg_autoscaler      on (always on)
progress           on (always on)
rbd_support        on (always on)
status             on (always on)
telemetry          on (always on)
volumes            on (always on)
iostat             on
nfs                on
prometheus         on
alerts             -
dashboard          -
influx             -
insights           -
localpool          -
mirroring          -
osd_perf_query     -
osd_support        -
restful            -
selftest           -
snap_schedule      -
stats              -
telegraf           -
test_orchestrator  -
zabbix             -
 
The dashboard does not work for me. Because of this "warning", the dashboard does not activate:
I have the same issue setting up in PVE 8 with Reef v18.2.0 (i.e. I never had the dashboard installed before I upgraded to Reef).

You can enable the dashboard module with ceph mgr module enable dashboard --force, but don't get too excited... I still got stuck after that, because the ceph dashboard commands don't work at all, e.g.:

Code:
root@pve1:~# ceph dashboard create-self-signed-cert
no valid command found; 10 closest matches:
pg stat
pg getmap
pg dump [<dumpcontents:all|summary|sum|delta|pools|osds|pgs|pgs_brief>...]
pg dump_json [<dumpcontents:all|summary|sum|pools|osds|pgs>...]
pg dump_pools_json
pg ls-by-pool <poolstr> [<states>...]
pg ls-by-primary <id|osd.id> [<pool:int>] [<states>...]
pg ls-by-osd <id|osd.id> [<pool:int>] [<states>...]
pg ls [<pool:int>] [<states>...]
pg dump_stuck [<stuckops:inactive|unclean|stale|undersized|degraded>...] [<threshold:int>]
Error EINVAL: invalid command

I am not sure if this module supports self-test

Code:
root@pve1:~# ceph mgr module enable selftest
root@pve1:~# ceph mgr self-test module dashboard
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1776, in _handle_command
    return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/ceph/mgr/mgr_module.py", line 474, in call
    return self.func(mgr, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/ceph/mgr/selftest/module.py", line 137, in module
    r = self.remote(module, "self_test")
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/ceph/mgr/mgr_module.py", line 2192, in remote
    return self._ceph_dispatch_remote(module_name, method_name,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: Module not found

And yes, my module is loaded... (well, it says it is, but patently it isn't...)

Code:
root@pve1:~# ceph mgr module ls
MODULE                   
balancer           on (always on)
crash              on (always on)
devicehealth       on (always on)
orchestrator       on (always on)
pg_autoscaler      on (always on)
progress           on (always on)
rbd_support        on (always on)
status             on (always on)
telemetry          on (always on)
volumes            on (always on)
dashboard          on    
iostat             on    
nfs                on    
restful            on    
selftest           on    
alerts             -     
influx             -     
insights           -     
localpool          -     
mirroring          -     
osd_perf_query     -     
osd_support        -     
prometheus         -     
snap_schedule      -     
stats              -     
telegraf           -     
test_orchestrator  -     
zabbix             -

Attached is a log file from when I was trying to enable the module with --force. Do I have a Python issue beyond the one causing the UI error?
 

DO NOT USE --force

Code:
ceph mgr module enable dashboard --force

It disables all the managers; you will then need to destroy and recreate them.
I had this happen (once) in my multiple failed attempts to get the dashboard set up; all that was needed for me was to restart the managers.
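For anyone in the same spot, restarting the manager daemons can be done via systemd (sketch; pve1 is an example node name, as on PVE the instance is named after the node):

```shell
# Restart the manager instance on one node (replace pve1 with your node's name)
systemctl restart ceph-mgr@pve1.service

# Or restart every local manager instance at once via the target
systemctl restart ceph-mgr.target
```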
 
I had this happen (once) in my multiple failed attempts to get the dashboard set up; all that was needed for me was to restart the managers.
And of course, 10 minutes later I did another test to enable the dashboard, and yup, it broke my managers with TASK ERROR: command '/bin/systemctl start ceph-mgr@pve2' failed: exit code 1, resulting in me needing to destroy and re-create them. I will not play with the dashboard now that my cluster is no longer a PoC.
 
This occurs with a completely fresh install of Proxmox 8.x and Ceph Reef 18.x. Something tells me this is not just an issue with Ceph itself; from what I can search, the Ceph community is not complaining about this, and the only traction the above mailing list thread got was this reply: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/XMH2UQBNWWDH2HVWREN6VV523DQHFYOD/ , which is not true.

https://forum.proxmox.com/threads/ceph-dashboard-broken-in-proxmox-ve-8.133975/

How is it that the only people vocal about the dashboard issue are Proxmox users?
 
How is it that the only people vocal about the dashboard issue are Proxmox users?

We're not the only ones.

As per my post above, the maintainer of the Arch Linux package for Ceph (on the AUR) has also stumbled across this. See: https://github.com/bazaah/aur-ceph/issues/20

My suspicion is that if the PyO3 stuff doesn't get sorted out sooner or later, more and more users will start to run into this as they upgrade their clusters. A lot of people in our community are home-labbers who aren't afraid to try out new tech, so it does make sense that people here run into this a little earlier.


Also, a short update for the interested: I'm still researching a solution for the whole PyO3 ordeal; luckily, I'm not alone and am working on this together with another PyO3 enthusiast.

Because this issue is really, really, really tricky to fix, it will take a long time for it to be solved and to trickle down into the ecosystem (as I had already suspected). So, in the meantime, I'm also looking for a short-term solution for PVE users. This fix will probably act more like a band-aid on a broken bone, but it will hold things together until we've solved things upstream.

If anybody here wants to contribute, check out our tracking issue: https://github.com/PyO3/pyo3/issues/3451
Development discussion is over here: https://github.com/Aequitosh/pyo3/discussions/1
 
We're not the only ones.
thanks for your efforts

My perception of the impact to this community:
  1. For folks upgrading from 7.x with the dashboard installed, this is a minor annoyance, as it still works.
  2. For folks like myself who just started with Proxmox 8, it is impossible to install the dashboard and get it working at all with this bug (I spent hours trying).

Is this accurate?
 
thanks for your efforts

My perception of the impact to this community:
  1. For folks upgrading from 7.x with the dashboard installed, this is a minor annoyance, as it still works.
  2. For folks like myself who just started with Proxmox 8, it is impossible to install the dashboard and get it working at all with this bug (I spent hours trying).

Is this accurate?

AFAIK upgrading from 7.x will break the dashboard as well, though I'll test that again soon just to be sure. It would be very strange if the dashboard did end up working after an upgrade, as the installed packages should be identical in both cases.
 
end up working after an upgrade, as the installed packages should be identical in both cases.
I looked at the logs. I posit the difference is that during the upgrade, the bootstrap configuration of the dashboard doesn't need to happen again; the logs on my PVE 8 showed that it was failing during the bootstrap when creating keys or certificates or something. There are other threads where folks imply they have it running on their upgraded systems just by doing a force enable, but I have no idea what a working system looks like... I think @Drallas might be in this scenario, as I know he is on 8 and has a working dashboard?
 
I looked at the logs. I posit the difference is that during the upgrade, the bootstrap configuration of the dashboard doesn't need to happen again; the logs on my PVE 8 showed that it was failing during the bootstrap when creating keys or certificates or something. There are other threads where folks imply they have it running on their upgraded systems just by doing a force enable, but I have no idea what a working system looks like... I think @Drallas might be in this scenario, as I know he is on 8 and has a working dashboard?

In addition to my setup, the only thing I did to set up the dashboard is:
https://gist.github.com/Drallas/84ece855dc39b6af33f25d4b9f3a1fe3#setup-ceph-dashboard
 
