Ceph Nautilus and Octopus Security Update for "insecure global_id reclaim" CVE-2021-20288

If ceph health detail returns "HEALTH_OK", then you can relax.
Thank you, Rokaken. I am trying to convince my client to buy support beyond the community level.
Indeed, "ceph health detail" returns just "HEALTH_OK", so tonight I can sleep quietly.
So far I have managed everything manually by myself, even before migrating to Proxmox VE, but I need back-up.
 
Hello,
We just upgraded our cluster to 6.4 (and Ceph 15.2.11) yesterday. I restarted all OSDs, MONs and MGRs. Everything went fine.
I was starting to live-migrate all VMs when I noticed that I no longer have the "client is using insecure global_id reclaim" warning:

Code:
# ceph health detail
HEALTH_WARN mons are allowing insecure global_id reclaim
[WRN] AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED: mons are allowing insecure global_id reclaim
    mon.vm10 has auth_allow_insecure_global_id_reclaim set to true
    mon.vm12 has auth_allow_insecure_global_id_reclaim set to true
    mon.vm13 has auth_allow_insecure_global_id_reclaim set to true
    mon.vm14 has auth_allow_insecure_global_id_reclaim set to true
    mon.vm18 has auth_allow_insecure_global_id_reclaim set to true

So, can I go ahead and set auth_allow_insecure_global_id_reclaim to false right now?

Could it be that every client renewed its ticket and got a secure one? If I can save the time needed to live-migrate our 120 VMs on this cluster, I will be more than happy!

Thanks,
Julien
 
So, can I go ahead and set auth_allow_insecure_global_id_reclaim to false right now?
Yes.
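For reference, the command for that is:

Code:
ceph config set mon auth_allow_insecure_global_id_reclaim false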

Could it be that every client renewed its ticket and got a secure one?

Sorta. Live-migration with an already updated node as the target is a valid way to get the security fix enabled, as mentioned in the original post:
To address those, you first need to ensure that all VMs using Ceph on a storage without KRBD run the newer client library. For that, either fully restart the VMs (reboot over the API, or stop and start), or migrate them to another node in the cluster that already has that Ceph update installed.

The reason for that is that the new QEMU process started for the incoming live-migration loads the updated librbd library to talk to Ceph, so live-migration can be a valid update path for any new QEMU or library updates.
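For example, triggering such a live-migration from the CLI would look roughly like this (a sketch; the VMID 100 and the node name "pve-target" are placeholders):

Code:
# live-migrate VM 100 to a node that already has the Ceph update installed
qm migrate 100 pve-target --online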

If I can save the time needed to live-migrate our 120 VMs on this cluster, I will be more than happy!
Yes.
 
Sorry, it seems I've not been clear enough: I didn't live-migrate the virtual machines. AFAIK, the running KVM processes have not been restarted for the large majority of our VMs. I only moved a few of them (3 out of 120, actually).

That's what surprises me (and could save others some painful work as well).

I didn't even restart pvestatd and pvedaemon on all nodes. But I did restart the pve-cluster and pveproxy services on all of them (for a completely different reason: IPv6 support in pveproxy).
 
Am I the only one seeing this unexpected (good) behavior after upgrading a Ceph cluster to 15.2.11?
 
It is definitely odd. IMO, the existing VMs started before the upgrade have to still be using the old librbd and thus the old, problematic auth.

Are you sure that no reboot or migration was involved?

Also, were all ceph services, like monitors, MDS and managers, restarted after the upgrade?
 
Hello,
Yes, that's pretty odd for sure.

What has been done:
  • Upgraded 9 nodes from 6.3-? to 6.4-5 with apt update && apt dist-upgrade
  • Restarted all MGRs, MDS and OSDs sequentially (see the sketch after this list)
At this stage, I got a LOT of "client is using insecure global_id reclaim" warnings and one "mons are allowing insecure global_id reclaim" warning per MON (5 in total).
  • Moved 3 VMs and restarted pvestatd and pvedaemon on three nodes to see if the number of 'client' warnings was decreasing. As far as I remember, that was the case.
  • Rewrote /etc/hosts on all nodes to get rid of the ::ffff: trick, so that pveproxy listens on IPv6 too, as those machines are IPv6-only on their public interface (the Corosync and Ceph networks are in an IPv4 RFC1918 range).
  • Restarted pve-cluster sequentially on each node to have the proper server address shown in the cluster summary view (i.e., the one in the Corosync subnet).
  • Waited ~24h, as live-migrating all VMs on a Sunday evening is not a good practice (Murphy is never far away).
  • All 'client' warnings were gone in the meantime.
  • Posted my surprise here.
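For reference, the sequential restart of those daemons boils down to something like this on each node (a sketch using the systemd targets; doing it through the GUI works just as well):

Code:
# restart all manager, metadata and OSD daemons running on this node
systemctl restart ceph-mgr.target
systemctl restart ceph-mds.target
systemctl restart ceph-osd.target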

I checked the ceph health and I see that 3 warnings are back:
Code:
[WRN] AUTH_INSECURE_GLOBAL_ID_RECLAIM: clients are using insecure global_id reclaim
    client.admin at 10.152.12.62:0/2202023404 is using insecure global_id reclaim
    client.admin at 10.152.12.63:0/3054920097 is using insecure global_id reclaim
    client.admin at 10.152.12.62:0/2339559705 is using insecure global_id reclaim

I'll have to read a bit further on the 'insecure global_id' topic directly in the Ceph docs and forums, to better understand the problem and the solution, and to find a clue as to whether I can disallow insecure_global_id_reclaim.

I really wonder if someone is experiencing the same strange behavior.
 
Unlikely, but better to ask: did someone (if there are more admins) already set the config flag to disallow insecure global_id reclaim?

Can you check a VM that has not been migrated, restarted, etc., and see what libraries its main process has open?

Bash:
# qm list prints the PID
qm list
# print all open files of that process, which includes shared libraries
lsof -n -p PID
# only list mapped files that have been deleted (e.g., libraries replaced by an update)
lsof -n -p PID | grep DEL

If you want to post that, you could open a new thread, to avoid crowding this one.
 
Is there any way to check that all clients are effectively using the new, secure, global_id?

Followed the steps, including fully rebooting every Proxmox server and fully rebooting or live-migrating every VM. Everything seems fine, besides a "pools have too many placement groups" warning that I already expected, but I would like to be sure that all global_ids are good instead of waiting 72h for them to expire and maybe cause some trouble.

Thanks!
 
Is there any way to check that all clients are effectively using the new, secure, global_id?
Yes, once the client warning is gone.

The case from altinea seems a bit odd, but it's really the only one we currently know of that behaves in an unexpected, but not confirmed wrong, way. Actually, it's not clear whether the newer (dynamically loaded) libraries are really not loaded there, but I'll try to find some time to investigate this further and run the updates a few times; maybe I can reproduce it and update any recommendations we have here.

That said, our various test updates and also our internal production cluster all went fine when following the recommendations described in my original post in this thread, and if you live-migrate the VMs to an upgraded host you're definitely on the safe side. altinea just asked if that could be avoided in their case, as it seemed that all VMs were already using the new auth.

Followed the steps, including fully rebooting every Proxmox server and fully rebooting or live-migrating every VM. Everything seems fine, besides a "pools have too many placement groups" warning that I already expected, but I would like to be sure that all global_ids are good instead of waiting 72h for them to expire and maybe cause some trouble.
That sounds good. You could check a VM process to see if it loaded an outdated librbd: use qm list to get the PID and run lsof -n -p PID | grep DEL | grep rbd. It should return nothing, which means the process has loaded the rbd library currently installed on the system through the latest upgrade.

The Ceph warning is actually the better check, but this ensures you're not in an odd situation like the other user.
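For example (a sketch, with 12345 standing in for the PID that qm list reports):

Bash:
# list VMs and their PIDs
qm list
# empty output here means the process uses the librbd currently installed on the system
lsof -n -p 12345 | grep DEL | grep rbd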
 
We upgraded this system to the latest version recently, and with a subscription. With every upgrade, we reboot the whole system, and then we ran this:

Code:
ceph config set mon auth_allow_insecure_global_id_reclaim false

After 24 hrs, one of the KVMs started acting up: the KVM shows as up in the list but is not accessible. Then I found out that we can't log in to the GUI, and later we discovered that we can't control any of the KVMs and containers. So we resorted to rebooting the whole system. We were not sure what was going on, until we set this back to true:

Code:
ceph config set mon auth_allow_insecure_global_id_reclaim true

Everything went back online: the GUI is accessible again and the KVMs/containers start again.


I tried to do this as well:
Code:
# qm list prints the PID
qm list
# print all open files of that process, which includes shared libraries
lsof -n -p PID
# only list mapped files that have been deleted (e.g., libraries replaced by an update)
lsof -n -p PID | grep DEL

We get no result for this (at least on one of our nodes):

Code:
lsof -n -p PID | grep DEL


Here are our health details:

Code:
ceph health detail
HEALTH_WARN clients are using insecure global_id reclaim; mons are allowing insecure global_id reclaim
AUTH_INSECURE_GLOBAL_ID_RECLAIM clients are using insecure global_id reclaim
    client.admin at v1:192.168.1.10:0/3797016573 is using insecure global_id reclaim
    client.admin at v1:192.168.1.11:0/2168917268 is using insecure global_id reclaim
    client.admin at v1:192.168.1.11:0/30071291 is using insecure global_id reclaim
    client.admin at v1:192.168.1.12:0/1977813894 is using insecure global_id reclaim
    client.admin at v1:192.168.1.12:0/3901787205 is using insecure global_id reclaim
    client.admin at v1:192.168.1.13:0/2932947210 is using insecure global_id reclaim
    client.admin at v1:192.168.1.13:0/2582316675 is using insecure global_id reclaim
    client.admin at v1:192.168.1.13:0/2173085809 is using insecure global_id reclaim
AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED mons are allowing insecure global_id reclaim
    mon.pxceph has auth_allow_insecure_global_id_reclaim set to true
    mon.pxceph2 has auth_allow_insecure_global_id_reclaim set to true
    mon.pxceph3 has auth_allow_insecure_global_id_reclaim set to true
 
I've been facing the same issue (clients using insecure global_id) after the upgrade to Octopus 15.2.11 & pve-manager 6.4-8, but my config is not really standard; I don't know if this can help.
My Proxmox cluster:
  • 4 storage nodes, with Ceph installed;
  • 3 compute nodes without the Ceph part deployed.
Before this last upgrade, everything worked fine for 2 years (through many Ceph upgrades).
The problem is that the compute nodes use the Debian rados libs:
Code:
12.2.11+dfsg1-2.1+b1 500
    500 http://ftp.fr.debian.org/debian buster/main amd64 Packages
    500 http://ftp.debian.org/debian buster/main amd64 Packages

So, I added the ceph.list apt source, ran an update/upgrade, and with the updated lib (15.2.11-pve1) everything is working.
I've also added the link to ceph.conf to enable Ceph commands from the CLI.
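Concretely, that link is just a symlink to the cluster-wide config (a sketch of what I did; paths assume a standard PVE setup):

Code:
# make the PVE-managed ceph.conf available at the default Ceph client location
mkdir -p /etc/ceph
ln -s /etc/pve/ceph.conf /etc/ceph/ceph.conf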

But I don't deploy the full Proxmox Ceph packages on these compute nodes, because they won't do any Ceph job. Is that config safe?
 
I am running two clusters: one PVE cluster solely for the benefit of having a Ceph cluster (so no VMs on that one), plus my actual VM cluster. I updated the Ceph one to the latest PVE/Ceph 6.4.9/14.2.20 and afterwards I updated my PVE nodes as well. In that process, I performed live-migrations of all guests, since I always migrated all guests away before updating/restarting the respective PVE node.

However, after all PVEs have been updated and restarted, I am still seeing the warnings about insecure client global_id reclaim. Shouldn't the procedure have renewed all tickets, or do I still have to wait for 72 hrs before the warnings go away?

Could it be that apt-get update / apt-get dist-upgrade doesn't update the base Ceph packages beyond 12.2.11?
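I guess I could verify which client packages are actually installed with something like this (a sketch; the package selection is my assumption):

Code:
apt policy ceph-common librbd1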
 
As far as I can see, my PVE/Ceph cluster pulls the Ceph packages from a special source. Is it safe to also do that on my PVE/VM nodes? I'd assume so, but better safe than sorry.
 
As far as I can see, my PVE/Ceph cluster pulls the Ceph packages from a special source.
Which one would that be?

It should be from http://download.proxmox.com/. And yes, if you do have other nodes that are only Ceph clients, it can help to configure the same repository and then run updates. This should upgrade the Ceph client from the older default one that Debian ships to the current one we ship. Hopefully this will then make the warning about clients still using insecure global_id reclaim go away.
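For example, something like this on the client-only nodes (a sketch; the release name, here ceph-nautilus on Debian Buster, needs to match your cluster's Ceph release):

Code:
# add the Proxmox Ceph repository and pull in the newer client packages
echo "deb http://download.proxmox.com/debian/ceph-nautilus buster main" > /etc/apt/sources.list.d/ceph.list
apt update && apt dist-upgrade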
 
