Ceph Nautilus and Octopus Security Update for "insecure global_id reclaim" CVE-2021-20288

t.lamprecht
We released updates for all of our currently supported Ceph releases to fix a security issue where Ceph was not ensuring that reconnecting/renewing clients were presenting an existing ticket when reclaiming their global_id value. An attacker that was able to authenticate could claim a global_id in use by a different client and potentially disrupt other cluster services.

Affected Versions:
  • for server: all previous versions
  • for clients:
    • kernel: none
    • user-space: all since (and including) Luminous 12.2.0
Attacker Requirements:
  • have a valid authentication key for the cluster
  • know or guess the global_id of another client
  • run a modified version of the Ceph client code to reclaim another client’s global_id
  • construct appropriate client messages or requests to disrupt service or exploit Ceph daemon assumptions about global_id uniqueness
This means that the risk on a default Proxmox VE managed Ceph setup is rather low; we still recommend upgrading in a timely manner.

Available Fixes:
  • Ceph Octopus: 15.2.11
  • Ceph Nautilus: 14.2.20
Applying the Fixes:

After you upgrade your Ceph server installation to the package versions including the fixes, you need to restart all monitors, managers, metadata servers (MDS) and OSDs!
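On a Proxmox VE managed Ceph setup this can be done per service type through their systemd targets; a rough sketch (run it node by node and wait for the cluster to recover in between):

Bash:
# restart monitor and manager instances on this node
systemctl restart ceph-mon.target ceph-mgr.target
# then any metadata servers
systemctl restart ceph-mds.target
# finally the OSDs on this node
systemctl restart ceph-osd.target
# verify everything is back up before moving on to the next node
ceph -s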

You will then still see two HEALTH warnings:
  1. client is using insecure global_id reclaim
  2. mons are allowing insecure global_id reclaim
To address those, first ensure that all VMs using Ceph on a storage without KRBD run the newer client library. For that, either fully restart the VMs (reboot over the API, or stop and start them), or migrate them to another node in the cluster that already has the Ceph update installed.
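As a sketch of the two options (VMID 100 and the target node name are placeholders for your setup):

Bash:
# live-migrate the VM to a node that already has the updated ceph packages
qm migrate 100 <target-node> --online
# or fully power-cycle it so the VM process starts fresh with the new librbd
qm stop 100 && qm start 100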
You also need to restart the pvestatd and pvedaemon Proxmox VE daemons, which periodically access the Ceph cluster to gather status data or to execute API calls. Either use the web-interface (Node -> System) or the command-line:
Bash:
systemctl try-reload-or-restart pvestatd.service pvedaemon.service

Next you can resolve the monitor warning by enforcing the stricter behavior that is now possible.
Execute the following command on one of the nodes in the Proxmox VE Ceph cluster:

Bash:
ceph config set mon auth_allow_insecure_global_id_reclaim false

Note: As said, this will cut off any old client once its ticket validity times out (72h by default).
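Should an overlooked old client get locked out after enabling the restriction, you can temporarily relax it again while you upgrade that client; this simply reverts the setting from above:

Bash:
# re-allow insecure global_id reclaim until the remaining old clients are updated
ceph config set mon auth_allow_insecure_global_id_reclaim true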

If you operate an external cluster and the Proxmox VE side only uses the client, you can still add our Ceph repository and run a normal upgrade process (apt update && apt dist-upgrade) to get the fixed client package versions.
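On such a client-only node that boils down to the usual commands, for example:

Bash:
apt update
apt dist-upgrade
# check which ceph client version is installed now
ceph --version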

See also:
https://docs.ceph.com/en/latest/security/CVE-2021-20288/
 
You're right, there we use our perl FFI to the ceph RADOS library, which needs to be reloaded to get the updated version.
I edited the post to reflect that. And yes, the two services you mentioned are correct, and should be enough.
 
Am I seeing this correctly? If the message "clients are using insecure global_id reclaim" has disappeared, is it safe to run this command: "ceph config set mon auth_allow_insecure_global_id_reclaim false"? (I have restarted all PVE/Ceph cluster nodes and all VMs were migrated during this procedure)
 
Am I seeing this correctly? If the message "clients are using insecure global_id reclaim" has disappeared, is it safe to run this command: "ceph config set mon auth_allow_insecure_global_id_reclaim false"? (I have restarted all PVE/Ceph cluster nodes and all VMs were migrated during this procedure)

Yes.
 
We run KRBD by default so my understanding is that we do not have to do anything besides updating and then restarting Ceph on each node, after which we can then cut off non-compliant clients.

running KRBD?
Code:
cat /etc/pve/storage.cfg
rbd: rbd_ssd
        content rootdir,images
        krbd 1
        pool rbd_ssd

[root@kvm1a ~]# rbd showmapped
id  pool     namespace  image          snap  device
0   rbd_ssd             vm-107-disk-0  -     /dev/rbd0
1   rbd_ssd             vm-106-disk-0  -     /dev/rbd1
2   rbd_ssd             vm-104-disk-0  -     /dev/rbd2
3   rbd_ssd             vm-107-disk-1  -     /dev/rbd3
4   rbd_ssd             vm-106-disk-1  -     /dev/rbd4

Code:
apt-get update; apt-get -y dist-upgrade; apt-get autoremove; apt-get autoclean;
ceph -s
systemctl restart ceph.target
  # do this on each node, waiting for mon/mgr/mds/osd/rgw to all fully recover, each time


Validate that the only warning relates to insecure clients being allowed:
Code:
[root@kvm1a ~]# ceph health detail
HEALTH_WARN mons are allowing insecure global_id reclaim; noout flag(s) set
[WRN] AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED: mons are allowing insecure global_id reclaim
    mon.kvm1a has auth_allow_insecure_global_id_reclaim set to true
    mon.kvm1b has auth_allow_insecure_global_id_reclaim set to true
    mon.kvm1c has auth_allow_insecure_global_id_reclaim set to true
[WRN] OSDMAP_FLAGS: noout flag(s) set

Bulk disable on all nodes:
Code:
for f in `ls -1A /etc/pve/nodes`; do ssh $f "ceph config set mon auth_allow_insecure_global_id_reclaim false"; done
 
We run KRBD by default so my understanding is that we do not have to do anything besides updating and then restarting Ceph on each node, after which we can then cut off non-compliant clients.
Yes, note the PVE daemons that need to get restarted too, though. Also, you can ensure that enabling the restriction is safe by checking if the client health warning is gone.

systemctl restart ceph.target
I'd in general recommend a more cautious approach; restarting all those daemons at once can lead to issues, especially if there are unexpected problems with the new version.
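For example, one could restart the OSDs one at a time and ask the cluster first whether it can cope with that OSD going down (a sketch; OSD ID 0 is a placeholder):

Bash:
# check that stopping this OSD does not make data unavailable
ceph osd ok-to-stop 0
# if it is OK, restart just this OSD and wait for the cluster to recover
systemctl restart ceph-osd@0.service
ceph -s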
Bulk disable on all nodes:
No, this only needs to be done once per Ceph server setup, as the monitors are clustered after all.
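You can verify that the restriction is active cluster-wide from any node with a single query:

Bash:
# should print "false" once the restriction is enforced
ceph config get mon auth_allow_insecure_global_id_reclaim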
 
How can I update 15.2.8 to 15.2.11? Are there any documents?
How did you install ceph server?
If you installed it over the PVE web-interface or using the pveceph CLI tool you should have our ceph repositories already set up.
Then it is enough to do a standard package update, either via the web-interface (Node -> Updates) or using the CLI:
Bash:
apt update
apt full-upgrade

If you do not have any repo set up (where did you get ceph from then?) you'd need to re-add that first; for the 15.2 release it's the Octopus repository: https://pve.proxmox.com/wiki/Package_Repositories#sysadmin_package_repositories_ceph
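For reference, on Proxmox VE 6.x (Debian Buster) the repository entry described on that wiki page looks roughly like this; double-check the exact suite against the linked documentation for your setup:

Code:
# /etc/apt/sources.list.d/ceph.list
deb http://download.proxmox.com/debian/ceph-octopus buster main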
 
Hi @t.lamprecht

Thank you for your clear instructions, the upgrade has worked well here, but I've realized using 'lsof -n | grep DEL' that even after restarting monitors, managers, metadata servers (MDS) and OSDs, I had processes using deleted / updated Ceph libraries.

In the end, since our cluster has the required redundancy, I decided to do a full reboot of each cluster node after all. After every node was updated and rebooted sequentially and auth_allow_insecure_global_id_reclaim was disabled, everything was fine.

Just as a hint: Posting this in the forums is good, but it will disappear in the forum history after some time unless pinned, and will not be of interest to most in a couple of months. Thus: what about posting the instructions in the wiki and then linking to them in the forum post? This way they will remain available in the wiki for those folks who only update their Ceph much later on (i.e. when upgrading from Ceph Luminous).

Nonetheless: Thank you very much for the proactive information!
 
Thank you for your clear instructions, the upgrade has worked well here, but I've realized using 'lsof -n | grep DEL' that even after restarting monitors, managers, metadata servers (MDS) and OSDs, I had processes using deleted / updated Ceph libraries.
How exactly did you restart the ceph services? If an in-place re-exec (reload) was used this could be normal, as having them still open does not necessarily mean that symbols from them are in use. Anyway, a full reboot certainly resulted in a fresh start.
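To narrow such a check down to the Ceph libraries, something along the lines of your lsof invocation can be used:

Code:
# list processes that still map deleted (replaced) ceph/rados/rbd libraries
lsof -n | grep DEL | grep -E 'ceph|rados|rbd'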

Just as a hint: Posting this in the forums is good, but it will disappear in the forum history after some time unless pinned, and will not be of interest to most in a couple of months. Thus: what about posting the instructions in the wiki and then linking to them in the forum post? This way they will remain available in the wiki for those folks who only update their Ceph much later on (i.e. when upgrading from Ceph Luminous).
Thanks for your input. My basic assumptions were the following:
* People updating frequently, as recommended, will get the warnings, search for them (potentially with the "proxmox" keyword added) and find this post.
* It is an issue to which the standard Proxmox VE managed Ceph server setup is normally not vulnerable; after all, the only clients are our daemons (which have access anyway), VMs over librbd/krbd (whose access is controlled by our stack too) and containers, which can only use the unaffected KRBD client anyway.
* The instructions share basically all but the "disable access for old clients" step with a common Ceph upgrade, where the web-interface already shows which services are still running an outdated version.

That was IMO reason enough that an un-stickied post is fine. But I pondered adding a more general sticky post, hinting at the upcoming EOL of Nautilus in ~July, where this issue and post could be mentioned too.

I'll link to this forum post in the Nautilus and Octopus upgrade sections, so that people coming from older releases get better visibility of this.
 
Hi

I've restarted the osd/mds etc. systemd targets (and I specifically did not use ceph.target, as discouraged in this thread).

Nonetheless, thank you very much; considering this release has only been out since early this week, that was very quick and I appreciate the instructions!
 
How did you install ceph server?
If you installed it over the PVE web-interface or using the pveceph CLI tool you should have our ceph repositories already set up.
Then it is enough to do a standard package update, either via the web-interface (Node -> Updates) or using the CLI:
Bash:
apt update
apt full-upgrade

If you do not have any repo set up (where did you get ceph from then?) you'd need to re-add that first; for the 15.2 release it's the Octopus repository: https://pve.proxmox.com/wiki/Package_Repositories#sysadmin_package_repositories_ceph
Thanks, it's done.
 
Hello. About the upgrade from 14.2.16 to 14.2.20, should I take any special action or is there any gotcha with cephfs clients? I have several cephfs clients running stock Debian 9 and Debian 10 kernels with the in-kernel driver. Thanks,
 
Hi,
Hello. About the upgrade from 14.2.16 to 14.2.20, should I take any special action or is there any gotcha with cephfs clients? I have several cephfs clients running stock Debian 9 and Debian 10 kernels with the in-kernel driver. Thanks,
No, my post should apply to them too, and as you say you use the in-kernel client, you should be safe to restrict access after upgrading. But, as mentioned before, it's good to check that the "client is using insecure global_id reclaim" warning is gone before doing so, as only that gives a guarantee that all currently connected clients are already using the safer auth.
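For example, a quick way to confirm that before flipping the switch:

Bash:
# no client-side AUTH_INSECURE_GLOBAL_ID_RECLAIM entries should show up anymore
ceph health detail | grep -i global_id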
 
Note: As said, this will cut off any old client once its ticket validity times out (72h by default).
To be sure I understand this correctly, what does "cut off" mean exactly?
Does that mean the client is shut down or another problem shows up after 72 hours?
Or that the connection is renewed and keeps working automatically with KRBD enabled?
I just want to be sure to understand this right.
 
Does that mean the client is shut down or another problem shows up after 72 hours?
That means any client not yet upgraded to a version which fixed the problematic behavior will not be able to talk with the Ceph cluster any more, i.e., it is cut off from participating in that Ceph setup.

Or that the connection is renewed and keeps working automatically with KRBD enabled?
No, problematic clients can neither open a new connection nor renew existing ones from the time auth_allow_insecure_global_id_reclaim is set to false; only the existing ticket they got, or renewed, before the restriction was enabled is still valid for another 72 hours (by default).

Note, krbd (= kernel) clients never had the problematic behavior in the first place, but user-space clients cannot automatically switch to the in-kernel one transparently.
 
That means any client not yet upgraded to a version which fixed the problematic behavior will not be able to talk with the Ceph cluster any more, i.e., it is cut off from participating in that Ceph setup.


No, problematic clients can neither open a new connection nor renew existing ones from the time auth_allow_insecure_global_id_reclaim is set to false; only the existing ticket they got, or renewed, before the restriction was enabled is still valid for another 72 hours (by default).

Note, krbd (= kernel) clients never had the problematic behavior in the first place, but user-space clients cannot automatically switch to the in-kernel one transparently.
Executed all instructions as far as I know precisely.
Live-migrated the VMs to another node (5 total) and back.
Is that enough to be sure no problems will show up?
 
Is that enough to be sure no problems will show up?
See what I wrote already in this post:
But, as mentioned before, it's good to check that the "client is using insecure global_id reclaim" warning is gone before doing so, as only that gives a guarantee that all currently connected clients are already using the safer auth.
 
