After the CVE patch (CVE-2021-20288: Unauthorized global_id reuse in cephx) behind mon_warn_on_insecure_global_id_reclaim came into play, and after further rolling upgrades up to the latest version, we are running into weird behavior whenever we restart ceph.target on a single node.
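To be precise, by "executing ceph.target" I mean restarting all Ceph daemons on that one node via systemd, roughly like this (just the plain restart, nothing special):

# restart the mon/mgr/osd daemons on this single node
systemctl restart ceph.target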
All VMs (on the updated node as well as on the remaining nodes) will, depending on the IOPS workload, sooner or later stop responding to write requests and hit the kernel message "blocked for more than x seconds". The virtio block device timeout is way higher than the virtio SCSI one, but it doesn't really matter: RBD-backed VMs of both types become stale.
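For reference, these are the guest-side knobs I am referring to (sda is just an example device name, not one of our actual disks):

# threshold for the "blocked for more than x seconds" hung task warning inside the guest
sysctl kernel.hung_task_timeout_secs
# per-device I/O timeout of a virtio-scsi disk
cat /sys/block/sda/device/timeout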
If the IOPS workload is high, it happens almost immediately; if the workload is low, it can take up to 4 hours until the bug is triggered.
The only way to recover from these unresponsive, hung block devices is to restart every single VM. After rebooting, the VMs complain about "orphan cleanup", but fortunately I have never seen storage actually fail yet, since fsck seems to be able to fix the corrupted inodes.
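For completeness, "restarting every single VM" is essentially a loop over the Proxmox CLI, something like this (the VM IDs are placeholders):

# hard-reset the hung VMs; 100 101 102 are example IDs
for id in 100 101 102; do qm reset $id; done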
After restarting every single VM, they continue to work properly until we restart ceph.target again.
The logs don't show any special issues except these:
2021-07-27 11:40:36.904 7f9c3b4f5700 1 mon.xxx-infra3@2(probing) e3 handle_auth_request failed to assign global_id
2021-07-27 11:40:36.924 7f9c3b4f5700 1 mon.xxx-infra3@2(probing) e3 handle_auth_request failed to assign global_id
2021-07-27 11:40:36.928 7f9c3b4f5700 1 mon.xxx-infra3@2(probing) e3 handle_auth_request failed to assign global_id
2021-07-27 11:40:37.304 7f9c3b4f5700 1 mon.xxx-infra3@2(probing) e3 handle_auth_request failed to assign global_id
2021-07-27 11:45:21.161 7f1400cc2700 0 auth: could not find secret_id=10715
2021-07-27 11:45:21.161 7f1400cc2700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=10715
2021-07-27 11:45:36.173 7f1400cc2700 0 auth: could not find secret_id=10715
2021-07-27 11:45:36.173 7f1400cc2700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=10715
2021-07-27 11:45:51.184 7f1400cc2700 0 auth: could not find secret_id=10715
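Since these messages look cephx/global_id related, the reclaim settings from the CVE fix can be double-checked on the mons like this (commands only, I have not pasted our values here):

# global_id reclaim options introduced with the CVE-2021-20288 fix
ceph config get mon auth_allow_insecure_global_id_reclaim
ceph config get mon mon_warn_on_insecure_global_id_reclaim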
Node specs (3 nodes):
Proxmox 6.4, KVM 5.2.0-6, using RBD images (mixed: virtio block / virtio SCSI)
2 to 4 x SSD OSDs per Node
3 mons, 3 mgr
Affected releases:
- Ceph Nautilus 14.2.22 (prior version: 14.2.20)
- Ceph Octopus 15.2.13 (prior version: 15.2.11)
Any ideas what is causing these issues?
Kind Regards
Jules