Ceph Nautilus and Octopus Security Update for "insecure global_id reclaim" CVE-2021-20288

If ceph health detail returns "HEALTH_OK", then you can relax.
Thank you, Rokaken. I am trying to convince my client to buy support beyond the community level.
Indeed, "ceph health detail" returns just "HEALTH_OK", so tonight I can sleep quietly.
So far I have managed everything manually by myself, even before migrating to Proxmox VE, but I need back-up.
 
Hello,
We just upgraded our cluster to 6.4 (and Ceph 15.2.11) yesterday. I restarted all OSDs, MONs and MGRs. Everything went fine.
I was starting to live-migrate all VMs when I noticed that I no longer have the "client is using insecure global_id reclaim" warning:

Code:
# ceph health detail
HEALTH_WARN mons are allowing insecure global_id reclaim
[WRN] AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED: mons are allowing insecure global_id reclaim
    mon.vm10 has auth_allow_insecure_global_id_reclaim set to true
    mon.vm12 has auth_allow_insecure_global_id_reclaim set to true
    mon.vm13 has auth_allow_insecure_global_id_reclaim set to true
    mon.vm14 has auth_allow_insecure_global_id_reclaim set to true
    mon.vm18 has auth_allow_insecure_global_id_reclaim set to true

So, can I go ahead and set auth_allow_insecure_global_id_reclaim to false right now?

Could it be that every client renewed its ticket and got a secure one? If I can save the time needed to live-migrate our 120 VMs on this cluster, I will be more than happy!

Thanks,
Julien
 
So, can I go ahead and set auth_allow_insecure_global_id_reclaim to false right now?
Yes.
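For reference, the command for that is:

Code:
ceph config set mon auth_allow_insecure_global_id_reclaim false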

Could it be that every client renewed its ticket and got a secure one?

Sorta. Live-migration with an already updated node as the target is a valid way to get the security fix enabled, as mentioned in the original post:
To address those, you first need to ensure that all VMs using Ceph on a storage without KRBD run the newer client library. For that, either fully restart the VMs (reboot over the API, or stop and start), or migrate them to another node in the cluster that already has that Ceph update installed.

The reason for that is that the new QEMU process started for the incoming live-migration loads the updated librbd library to talk to Ceph, so live-migration can be a valid update path for any new QEMU or library updates.
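For example, triggering such a live-migration from the CLI would look roughly like this (a sketch; the VMID 100 and the node name "pve-target" are placeholders):

Code:
# live-migrate VM 100 to a node that already has the Ceph update installed
qm migrate 100 pve-target --online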

If I can save the time needed to live-migrate our 120 VMs on this cluster, I will be more than happy!
Yes.
 
Sorry, it seems I've not been clear enough: I didn't live-migrate the virtual machines. AFAIK, the running KVM processes have not been restarted for the large majority of our VMs. I only moved a few of them (3 out of 120, actually).

That's what surprises me (and could save others some painful work as well).

I didn't even restart pvestatd and pvedaemon on all nodes. But I did restart the pve-cluster and pveproxy services on all of them (for a completely different reason: IPv6 support in pveproxy).
 
Am I the only one seeing this unexpected (good) behavior after upgrading a Ceph cluster to 15.2.11?
 
It is definitely odd. IMO, the existing VMs started before the upgrade have to still be using the old librbd and thus the old, problematic auth.

Are you sure that no reboot or migration was involved?

Also, were all ceph services, like monitors, MDS and managers, restarted after the upgrade?
 
Hello,
Yes, that's pretty odd for sure.

What has been done:
  • Upgraded 9 nodes from 6.3-? to 6.4-5 with apt update && apt dist-upgrade
  • Restarted all MGRs, MDS and OSDs sequentially (see the sketch after this list)
At this stage, I got a LOT of "client is using insecure global_id reclaim" warnings and one "mons are allowing insecure global_id reclaim" warning per MON (5 in total).
  • Moved 3 VMs and restarted pvestatd and pvedaemon on three nodes to see if the number of 'client' warnings was decreasing. As far as I remember, that was the case.
  • Rewrote /etc/hosts on all nodes to get rid of the ::ffff: trick, so that pveproxy listens on IPv6 too, as those machines are IPv6-only on their public interface (the Corosync and Ceph networks are in an IPv4 RFC1918 range).
  • Restarted pve-cluster sequentially on each node to have the proper server address shown in the cluster summary view (i.e., the one in the Corosync subnet).
  • Waited ~24h, as live-migrating all VMs on a Sunday evening is not a good practice (Murphy is never far away).
  • All 'client' warnings were gone in the meantime.
  • Posted my surprise here.
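For reference, the sequential restart of those daemons boils down to something like this on each node (a sketch using the systemd targets; doing it through the GUI works just as well):

Code:
# restart all manager, metadata and OSD daemons running on this node
systemctl restart ceph-mgr.target
systemctl restart ceph-mds.target
systemctl restart ceph-osd.target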

I checked the ceph health and I see that 3 warnings are back:
Code:
[WRN] AUTH_INSECURE_GLOBAL_ID_RECLAIM: clients are using insecure global_id reclaim
    client.admin at 10.152.12.62:0/2202023404 is using insecure global_id reclaim
    client.admin at 10.152.12.63:0/3054920097 is using insecure global_id reclaim
    client.admin at 10.152.12.62:0/2339559705 is using insecure global_id reclaim

I'll have to read a bit further on the 'insecure global_id' topic directly in the Ceph docs and forums, to better understand the problem and the solution, and to find a clue as to whether I can disallow insecure_global_id_reclaim.

I really wonder if someone is experiencing the same strange behavior.
 
Unlikely, but better to ask: did someone (if there are more admins) already set the config flag to disallow insecure global_id reclaim?

Can you check a VM that has not been migrated, restarted, etc., and see what libraries its main process has open?

Bash:
# qm list prints the PID
qm list
# print all open files of that process, which includes shared libraries
lsof -n -p PID
# only list mapped files that have been deleted (e.g., libraries replaced by an update)
lsof -n -p PID | grep DEL

If you want to post that, you could open a new thread, to avoid crowding this one.
 
Is there any way to check that all clients are effectively using the new, secure, global_id?

Followed the steps, including fully rebooting every Proxmox server and fully rebooting or live-migrating every VM. Everything seems fine, besides a "pools have too many placement groups" warning that I already expected, but I would like to be sure that all global_ids are good instead of waiting 72h for them to expire and maybe cause some trouble.

Thanks!
 
Is there any way to check that all clients are effectively using the new, secure, global_id?
Yes, once the client warning is gone.

The case from altinea seems a bit odd, but it's really the only one we currently know of that behaves in an unexpected, but not confirmed wrong, way. Actually, it's not clear whether the newer (dynamically loaded) libraries are really not loaded there, but I'll try to find some time to investigate this further and run the updates a few times; maybe I can reproduce it and update any recommendations we have here.

That said, our various test updates and also our internal production cluster all went fine when following the recommendations described in my original post in this thread, and if you live-migrate the VMs to an upgraded host you're definitely on the safe side. altinea just asked if that could be avoided in their case, as it seemed that all VMs were already using the new auth.

Followed the steps, including fully rebooting every Proxmox server and fully rebooting or live-migrating every VM. Everything seems fine, besides a "pools have too many placement groups" warning that I already expected, but I would like to be sure that all global_ids are good instead of waiting 72h for them to expire and maybe cause some trouble.
That sounds good. You could check a VM process to see if it loaded an outdated librbd: use qm list to get the PID and run lsof -n -p PID | grep DEL | grep rbd. It should return nothing, which means the process has loaded the rbd library currently installed on the system through the latest upgrade.

The Ceph warning is actually the better check, but this ensures you're not in an odd situation like the other user.
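For example (a sketch, with 12345 standing in for the PID that qm list reports):

Bash:
# list VMs and their PIDs
qm list
# empty output here means the process uses the librbd currently installed on the system
lsof -n -p 12345 | grep DEL | grep rbd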
 
We upgraded this system to the latest version recently, and with a subscription. With every upgrade, we reboot the whole system, and then we ran this:

Code:
ceph config set mon auth_allow_insecure_global_id_reclaim false

After 24 hrs, one of the KVMs started acting up: the KVM shows as up in the list but is not accessible. Then I found out that we can't log in to the GUI, and later we discovered that we can't control any of the KVMs and containers. So we resorted to rebooting the whole system. We were not sure what was going on, until we set this back to true:

Code:
ceph config set mon auth_allow_insecure_global_id_reclaim true

Everything went back online: the GUI is accessible again and the KVMs/containers start again.


I tried to do this as well:
Code:
# qm list prints the PID
qm list
# print all open files of that process, which includes shared libraries
lsof -n -p PID
# only list mapped files that have been deleted (e.g., libraries replaced by an update)
lsof -n -p PID | grep DEL

We get no result for this (at least on one of our nodes):

Code:
lsof -n -p PID | grep DEL


Here are our health details:

Code:
ceph health detail
HEALTH_WARN clients are using insecure global_id reclaim; mons are allowing insecure global_id reclaim
AUTH_INSECURE_GLOBAL_ID_RECLAIM clients are using insecure global_id reclaim
    client.admin at v1:192.168.1.10:0/3797016573 is using insecure global_id reclaim
    client.admin at v1:192.168.1.11:0/2168917268 is using insecure global_id reclaim
    client.admin at v1:192.168.1.11:0/30071291 is using insecure global_id reclaim
    client.admin at v1:192.168.1.12:0/1977813894 is using insecure global_id reclaim
    client.admin at v1:192.168.1.12:0/3901787205 is using insecure global_id reclaim
    client.admin at v1:192.168.1.13:0/2932947210 is using insecure global_id reclaim
    client.admin at v1:192.168.1.13:0/2582316675 is using insecure global_id reclaim
    client.admin at v1:192.168.1.13:0/2173085809 is using insecure global_id reclaim
AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED mons are allowing insecure global_id reclaim
    mon.pxceph has auth_allow_insecure_global_id_reclaim set to true
    mon.pxceph2 has auth_allow_insecure_global_id_reclaim set to true
    mon.pxceph3 has auth_allow_insecure_global_id_reclaim set to true
 
I've been facing the same issue (clients using insecure global_id) after the upgrade to Octopus 15.2.11 & pve-manager 6.4-8, but my config is not really standard; I don't know if this can help.
My Proxmox cluster:
  • 4 storage nodes, with Ceph installed;
  • 3 compute nodes without the Ceph part deployed.
Before this last upgrade, everything worked fine for 2 years (through many Ceph upgrades).
The problem is that the compute nodes use the Debian rados libs:
Code:
12.2.11+dfsg1-2.1+b1 500
    500 http://ftp.fr.debian.org/debian buster/main amd64 Packages
    500 http://ftp.debian.org/debian buster/main amd64 Packages

So, I added the ceph.list apt source, ran an update/upgrade, and with the updated lib (15.2.11-pve1) everything is working.
I've also added the link to ceph.conf to enable Ceph commands from the CLI.
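Concretely, that link is just a symlink to the cluster-wide config (a sketch of what I did; paths assume a standard PVE setup):

Code:
# make the PVE-managed ceph.conf available at the default Ceph client location
mkdir -p /etc/ceph
ln -s /etc/pve/ceph.conf /etc/ceph/ceph.conf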

But I don't deploy the full Proxmox Ceph packages on these compute nodes, because they won't do any Ceph job. Is that config safe?
 
I am running two clusters: one PVE cluster solely for the benefit of having a Ceph cluster (so no VMs on that one), plus my actual VM cluster. I updated the Ceph one to the latest PVE/Ceph 6.4.9/14.2.20 and afterwards I updated my PVE nodes as well. In that process, I performed live-migrations of all guests, since I always migrated all guests away before updating/restarting the respective PVE node.

However, after all PVEs have been updated and restarted, I am still seeing the warnings about insecure client global_id reclaim. Shouldn't the procedure have renewed all tickets, or do I still have to wait for 72 hrs before the warnings go away?

Could it be that apt-get update / apt-get dist-upgrade doesn't update the base Ceph packages beyond 12.2.11?
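I guess I could verify which client packages are actually installed with something like this (a sketch; the package selection is my assumption):

Code:
apt policy ceph-common librbd1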
 
As far as I can see, my PVE/Ceph cluster pulls the Ceph packages from a special source. Is it safe to also do that on my PVE/VM nodes? I'd assume so, but better safe than sorry.
 
As far as I can see, my PVE/Ceph cluster pulls the Ceph packages from a special source.
Which one would that be?

It should be from http://download.proxmox.com/. And yes, if you do have other nodes that are only Ceph clients, it can help to configure the same repository and then run updates. This should upgrade the Ceph client from the older default one that Debian ships to the current one we ship. Hopefully this will then make the warning about clients still using insecure global_id reclaim go away.
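For example, something like this on the client-only nodes (a sketch; the release name, here ceph-nautilus on Debian Buster, needs to match your cluster's Ceph release):

Code:
# add the Proxmox Ceph repository and pull in the newer client packages
echo "deb http://download.proxmox.com/debian/ceph-nautilus buster main" > /etc/apt/sources.list.d/ceph.list
apt update && apt dist-upgrade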
 
