RBD stale after ceph rolling upgrade

Jules-

Renowned Member
Jun 7, 2016
After passing the stage where the CVE patch (CVE-2021-20288: unauthorized global_id reuse in cephx, which introduced mon_warn_on_insecure_global_id_reclaim) came into play, and after doing further rolling upgrades up to the latest version, we are facing weird behavior whenever we restart ceph.target on a single node.
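For context, the per-node step that triggers it is nothing special, just the usual rolling-restart pattern, roughly like this (sketch only, the noout handling is shown for completeness):

ceph osd set noout                 # avoid rebalancing during the restart window
systemctl restart ceph.target      # restarts all Ceph daemons (mon/mgr/osd) on this node
ceph -s                            # wait until all PGs are active+clean again
ceph osd unset noout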

All VMs (on the updated node and on the remaining nodes as well) will sooner or later, depending on the IOPS workload, stop responding to write requests and hit the kernel message "blocked for more than x seconds". The virtio block device timeout is much higher than virtio scsi, but it doesn't really matter, RBD VMs of both types become stale.
If the IOPS workload is high, it happens almost immediately; if the workload is low, it can take up to 4 hours until the bug is triggered.

The only way to resolve these unresponsive, hung block devices is to restart every single VM. After rebooting, the VMs complain about "orphan cleanup", but fortunately I have never seen failing storage yet, since fsck seems to be able to repair the corrupted inodes.
After restarting every single VM they continue to work properly until we restart ceph.target again.
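Recovery is nothing more than forcing the guests off and on again, e.g. (VMID 101 is just a placeholder):

qm stop 101 && qm start 101   # hard stop + start of the hung guest
# or
qm reset 101                  # hard reset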

The logs don't show any special issues except these:

2021-07-27 11:40:36.904 7f9c3b4f5700 1 mon.xxx-infra3@2(probing) e3 handle_auth_request failed to assign global_id
2021-07-27 11:40:36.924 7f9c3b4f5700 1 mon.xxx-infra3@2(probing) e3 handle_auth_request failed to assign global_id
2021-07-27 11:40:36.928 7f9c3b4f5700 1 mon.xxx-infra3@2(probing) e3 handle_auth_request failed to assign global_id
2021-07-27 11:40:37.304 7f9c3b4f5700 1 mon.xxx-infra3@2(probing) e3 handle_auth_request failed to assign global_id

2021-07-27 11:45:21.161 7f1400cc2700 0 auth: could not find secret_id=10715
2021-07-27 11:45:21.161 7f1400cc2700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=10715
2021-07-27 11:45:36.173 7f1400cc2700 0 auth: could not find secret_id=10715
2021-07-27 11:45:36.173 7f1400cc2700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=10715
2021-07-27 11:45:51.184 7f1400cc2700 0 auth: could not find secret_id=10715

Specs (3-node cluster):
Proxmox 6.4, KVM 5.2.0-6, using RBD images (mixed: virtio block / virtio scsi)
2 to 4 x SSD OSDs per Node
3 mons, 3 mgr

Affected releases:
- Ceph Nautilus 14.2.22 (prior version: 14.2.20)
- Ceph Octopus 15.2.13 (prior version: 15.2.11)


Any ideas what is causing this issue?


Kind Regards
Jules
 
Hi,

so, you have done

"ceph config set mon auth_allow_insecure_global_id_reclaim false"

Did you restart or live-migrate the VMs (to get a new qemu process running against the latest patched librbd) before this?
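i.e. something along these lines (vmid and target node are only examples):

ceph config set mon auth_allow_insecure_global_id_reclaim false
ceph config get mon auth_allow_insecure_global_id_reclaim   # verify it is really false
qm migrate 101 targetnode --online    # the new qemu process then uses the patched librbd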
 
Yes, this issue occurred even on live-migrated VMs. I have the feeling that a rolling restart takes a lot more time than it did before the CVE patch.
Maybe there is a problem with some sort of timeout being hit or something? The weird thing is that, regardless of what I try, I cannot reproduce this on a test cluster; I cannot force the issue there. It only happens on all those production clusters we recently upgraded and re-upgraded again and again. They were fine through 2 years of rolling upgrades until that CVE patch update got released.
 
