Ceph 15.2.11 upgrade: insecure client warning disappearing and reappearing

Altinea

Hello,
Following up on https://forum.proxmox.com/threads/c...l_id-reclaim-cve-2021-20288.88038/post-389914, I'm opening a new thread.

I was asked to check this:
Code:
# qm list prints the PID
qm list
# print all open files of that process, which includes shared libraries
lsof -n -p PID
# list only the mapped files that have been deleted
lsof -n -p PID| grep DEL

Note: I changed VMID to PID because using the VMID with lsof doesn't seem to make sense.
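To run that check across every VM on a node in one go, a small loop over `qm list` works (a sketch on my part, assuming the usual `qm list` layout where STATUS is the 3rd column and PID the 6th; the sample output below stands in for the real command):

```shell
# Sample `qm list` output standing in for the real command
# (columns: VMID NAME STATUS MEM(MB) BOOTDISK(GB) PID).
qm_output='      VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID
       200 server.domain.com    running    4096              32.00 167580
       300 some-stopped-vm      stopped    2048              16.00 0'

# Keep only running VMs and print their PIDs.
pids=$(echo "$qm_output" | awk 'NR>1 && $3 == "running" {print $6}')
echo "$pids"
# prints: 167580
```

On a real node, the same filter can feed lsof directly: `for pid in $(qm list | awk 'NR>1 && $3=="running" {print $6}'); do lsof -n -p "$pid" | grep DEL; done`.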

I don't think posting the full list of open files is pertinent, but here are the deleted files of a NOT migrated VM:
Code:
lsof -n -p 167580| grep DEL
kvm     167580 root  DEL       REG               0,25            42445468 /usr/bin/qemu-system-x86_64
kvm     167580 root  DEL       REG               0,25            28867024 /usr/lib/x86_64-linux-gnu/libsqlite3.so.0.8.6
kvm     167580 root  DEL       REG               0,25            28849634 /usr/lib/x86_64-linux-gnu/libp11-kit.so.0.3.0
kvm     167580 root  DEL       REG               0,25            28862381 /usr/lib/x86_64-linux-gnu/libgstapp-1.0.so.0.1404.0
kvm     167580 root  DEL       REG               0,25            43194708 /usr/lib/ceph/libceph-common.so.0
kvm     167580 root  DEL       REG               0,25            42418405 /usr/lib/x86_64-linux-gnu/liblber-2.4.so.2.10.10
kvm     167580 root  DEL       REG               0,25            42418406 /usr/lib/x86_64-linux-gnu/libldap_r-2.4.so.2.10.10
kvm     167580 root  DEL       REG               0,25            28859152 /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
kvm     167580 root  DEL       REG               0,25            28859153 /usr/lib/x86_64-linux-gnu/libssl.so.1.1
kvm     167580 root  DEL       REG               0,25            37213427 /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0.2
kvm     167580 root  DEL       REG               0,25            28849807 /usr/lib/x86_64-linux-gnu/libzstd.so.1.3.8
kvm     167580 root  DEL       REG               0,25            28851255 /lib/x86_64-linux-gnu/libudev.so.1.6.13
kvm     167580 root  DEL       REG               0,25            37213904 /usr/lib/x86_64-linux-gnu/libgnutls.so.30.23.2
kvm     167580 root  DEL       REG               0,25            28875485 /usr/lib/x86_64-linux-gnu/libjpeg.so.62.2.0
kvm     167580 root  DEL       REG               0,25            43194709 /usr/lib/librados.so.2.0.0
kvm     167580 root  DEL       REG               0,25            43194688 /usr/lib/librbd.so.1.12.0
kvm     167580 root  DEL       REG               0,25            28877389 /usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4.5.0
kvm     167580 root  DEL       REG               0,25            42419281 /usr/lib/libproxmox_backup_qemu.so.0
kvm     167580 root  DEL       REG               0,25            28849782 /lib/x86_64-linux-gnu/libsystemd.so.0.25.0
kvm     167580 root  DEL       REG                0,1          3758783448 /dev/zero

And, for comparison, here's the same check on a recently live-migrated VM:
Code:
lsof -n -p 1729001| grep DEL
kvm     1729001 root  DEL       REG                0,1          1866635234 /dev/zero

Both are running on the same node. The live-migrated VM made a round trip.

Using uptime and the PVE task history, I can confirm the first VM has been neither migrated nor rebooted.

The Ceph cluster still allows insecure global_id reclaim:
Code:
ceph config get mon auth_allow_insecure_global_id_reclaim
true
I don't know how to check whether this has been changed (some kind of Ceph history?), but that is highly improbable, at least by a human.

And to be complete, the librbd package was upgraded on Sunday evening:

Code:
# cat /var/log/dpkg.log |grep librbd
2021-05-09 21:12:39 upgrade librbd1:amd64 15.2.10-pve1 15.2.11-pve1
2021-05-09 21:12:39 status half-configured librbd1:amd64 15.2.10-pve1
2021-05-09 21:12:39 status unpacked librbd1:amd64 15.2.10-pve1
2021-05-09 21:12:39 status half-installed librbd1:amd64 15.2.10-pve1
2021-05-09 21:12:40 status unpacked librbd1:amd64 15.2.11-pve1
2021-05-09 21:13:30 configure librbd1:amd64 15.2.11-pve1 <none>
2021-05-09 21:13:30 status unpacked librbd1:amd64 15.2.11-pve1
2021-05-09 21:13:30 status half-configured librbd1:amd64 15.2.11-pve1
2021-05-09 21:13:30 status installed librbd1:amd64 15.2.11-pve1

I haven't checked all of them, but I can see the same results on the other nodes.

Is there a way to associate this:
Code:
client.admin at 10.152.12.62:0/2202023404 is using insecure global_id reclaim
with a VMID?
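For what it's worth, the health-detail line can at least be split into the client IP and the messenger nonce (the number after the slash), the two fields one could try to match elsewhere; a small sed sketch:

```shell
# A warning line as printed by `ceph health detail` (taken from above).
line='client.admin at 10.152.12.62:0/2202023404 is using insecure global_id reclaim'

# Extract the IP and the nonce (the number after the slash).
addr=$(echo "$line" | sed -E 's|.* at ([0-9.]+):[0-9]+/([0-9]+) .*|\1|')
nonce=$(echo "$line" | sed -E 's|.* at ([0-9.]+):[0-9]+/([0-9]+) .*|\2|')
echo "addr=$addr nonce=$nonce"
# prints: addr=10.152.12.62 nonce=2202023404
```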

I'll probably live-migrate all VMs in the next few days, so for me this is not really a problem, but it could potentially change the upgrade process for other users.

For reference, here's the Ceph documentation about this global_id reclaim issue:
https://docs.ceph.com/en/latest/security/CVE-2021-20288/

Best regards
 
Hi,
Note: I changed VMID to PID because using the VMID with lsof doesn't seem to make sense.
Yeah, true, I edited the original post of mine in the other thread, thanks for the pointer.

I don't think posting the full list of open files is pertinent, but here are the deleted files of a NOT migrated VM:
It could help to see if another librbd is already loaded, so if you don't mind, please post it.
Also, it would be nice for my understanding of things to have the QEMU command of the same VM, so posting
Bash:
qm showcmd VMID --pretty
would be great.

client.admin at 10.152.12.62:0/2202023404 is using insecure global_id reclaim
So you still get warnings about clients using insecure global_id reclaim? I thought you did not?

Or did you/somebody mute the health warning for a time?

Can you post the full output of:
ceph health detail
 
Hello,
So, first, yes, the warnings are back, but only a few at a time:
  • Right after upgrading, I got a dozen of them. I didn't count, but it was probably one per VM plus one or two per hypervisor
  • 24h later, I got absolutely none
  • ~48h after the upgrade, I got a few (4 or fewer)
  • That's still the case at the time of this writing (currently 2 AUTH_INSECURE_GLOBAL_ID_RECLAIM warnings)
My assumption at this stage is that the Ceph MONs forced clients to reconnect after the upgrade, and warnings were raised for a few hours after each client (insecurely) requested a new ticket.
Since then, clients have been renewing their tickets from time to time (at what rate?), and each renewal raises a new warning that is displayed in ceph health for a few hours (how many?).

After 24h I didn't get any warnings because, if the ticket lifetime is 72h, no refresh was necessary yet and all the previous warnings (from the upgrade) had timed out.

Could this be what is happening? I think this is a serious matter because, if it's right, we should check for at least 72h that no new client warning is raised BEFORE setting auth_allow_insecure_global_id_reclaim to false.

That raises 2 questions: how can I check how long a warning stays shown in ceph health? And how can I see the 'history' of ceph health? (Perhaps with a long-running ceph -w?)

This is probably not relevant anymore, and I've not been able to post the full open files list (15000 character limit), but here's a grep on rbd:
Bash:
lsof -n -p 167580 |grep rbd
kvm     167580 root  DEL       REG               0,25            43194688 /usr/lib/librbd.so.1.12.0
So no other librbd is loaded, if I understand correctly.

And here is the KVM command:
Bash:
# qm showcmd 200 --pretty
/usr/bin/kvm \
  -id 200 \
  -name server.domain.com \
  -no-shutdown \
  -chardev 'socket,id=qmp,path=/var/run/qemu-server/200.qmp,server,nowait' \
  -mon 'chardev=qmp,mode=control' \
  -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' \
  -mon 'chardev=qmp-event,mode=control' \
  -pidfile /var/run/qemu-server/200.pid \
  -daemonize \
  -smbios 'type=1,uuid=c07fd965-3d3b-4d4b-a1cb-ae2e3890f1d3' \
  -smp '4,sockets=1,cores=4,maxcpus=4' \
  -nodefaults \
  -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
  -vnc unix:/var/run/qemu-server/200.vnc,password \
  -cpu kvm64,enforce,+kvm_pv_eoi,+kvm_pv_unhalt,+lahf_lm,+sep \
  -m 4096 \
  -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' \
  -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' \
  -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' \
  -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' \
  -device 'VGA,id=vga,bus=pci.0,addr=0x2' \
  -chardev 'socket,path=/var/run/qemu-server/200.qga,server,nowait,id=qga0' \
  -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' \
  -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' \
  -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' \
  -iscsi 'initiator-name=iqn.1993-08.org.debian:01:dc3ed6fd1ff5' \
  -drive 'file=rbd:ceph-ssd-fast/vm-200-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/ceph-ssd-fast.keyring,if=none,id=drive-virtio0,format=raw,cache=none,aio=native,detect-zeroes=on' \
  -device 'virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=100' \
  -netdev 'type=tap,id=net0,ifname=tap200i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' \
  -device 'virtio-net-pci,mac=22:9B:BF:A8:73:32,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' \
  -machine 'type=pc+pve0'

Do you see anything relevant?

Thanks for your help
 
I played a bit with the Ceph tools and found this command:
Bash:
ceph tell mon.\* sessions

I tried to get some info from the MONs, and I found 2 clients with "global_id_status": "reclaim_insecure". All the others are in status "reclaim_ok", "new_ok" or "none" (the latter being the other MONs).

Here's the full output for a reclaim_insecure client:
Code:
    {
        "name": "client.199084333",
        "entity_name": "client.admin",
        "addrs": {
            "addrvec": [
                {
                    "type": "any",
                    "addr": "10.152.12.70:0",
                    "nonce": 647077118
                }
            ]
        },
        "socket_addr": {
            "type": "any",
            "addr": "10.152.12.70:0",
            "nonce": 647077118
        },
        "con_type": "client",
        "con_features": 4540138292840824831,
        "con_features_hex": "3f01cfb8ffecffff",
        "con_features_release": "luminous",
        "open": true,
        "caps": {
            "text": "allow *"
        },
        "authenticated": true,
        "global_id": 199084333,
        "global_id_status": "reclaim_insecure",
        "osd_epoch": 113584,
        "remote_host": "vm17"
    },

I'm looking for a way to associate the global_id and/or nonce with a running VM. If someone has any clue ...
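In the meantime, the sessions dump can at least be filtered down to the offenders with jq (a sketch; the JSON sample below mirrors the `ceph tell mon.\* sessions` entries, and in practice you would pipe the real command into jq instead):

```shell
# Sample data mirroring two entries of a `ceph tell mon.\* sessions` dump.
sessions='[
  {"name":"client.199084333","entity_name":"client.admin",
   "socket_addr":{"type":"any","addr":"10.152.12.70:0","nonce":647077118},
   "global_id":199084333,"global_id_status":"reclaim_insecure","remote_host":"vm17"},
  {"name":"client.199084400","entity_name":"client.admin",
   "socket_addr":{"type":"any","addr":"10.152.12.62:0","nonce":123456},
   "global_id":199084400,"global_id_status":"reclaim_ok","remote_host":"vm10"}
]'

# Keep only the sessions still using insecure reclaim and print
# host, addr/nonce and global_id, one per line.
insecure=$(echo "$sessions" | jq -r '.[]
  | select(.global_id_status == "reclaim_insecure")
  | "\(.remote_host) \(.socket_addr.addr)/\(.socket_addr.nonce) global_id=\(.global_id)"')
echo "$insecure"
# prints: vm17 10.152.12.70:0/647077118 global_id=199084333
```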
 
And as I'm writing this, there are no more AUTH_INSECURE_GLOBAL_ID_RECLAIM warnings ...
Bash:
# ceph health detail
HEALTH_WARN mons are allowing insecure global_id reclaim
[WRN] AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED: mons are allowing insecure global_id reclaim
    mon.vm10 has auth_allow_insecure_global_id_reclaim set to true
    mon.vm12 has auth_allow_insecure_global_id_reclaim set to true
    mon.vm13 has auth_allow_insecure_global_id_reclaim set to true
    mon.vm14 has auth_allow_insecure_global_id_reclaim set to true
    mon.vm18 has auth_allow_insecure_global_id_reclaim set to true

The VM previously used as an example didn't change its PID, so I can conclude it hasn't been restarted:
Bash:
# lsof -n -p 167580 |grep rbd
kvm     167580 root  DEL       REG               0,25            43194688 /usr/lib/librbd.so.1.12.0

So I still have no idea if I can safely set auth_allow_insecure_global_id_reclaim to false. That's my only real problem here.
 
