RBD I/O stops to separate Ceph cluster

lightspd

New Member
Jan 17, 2024
Hello,

I recently ran into a new issue and I'm not sure what the cause is or where to start looking. I have a Proxmox 7.4-18 cluster that talks to a separate Ceph Reef cluster. Normally there are no issues, but recently some drives started going bad on the Ceph side, which resulted in slow ops warnings. That's not usually a big deal; it's only been one or two slow ops at a time, and they went away quickly.

Now onto the issue: it locked up all I/O on the RBD mount in the VM (Ubuntu Server, if it matters). I stopped the VM, but its scope was still there and I had to reboot the Proxmox server to start the VM again. Checking the QEMU status just shows the #.scope unit as kvm, pvesm status shows the storage running, the Ceph cluster shows healthy, and I/O resumes fine after restarting the server and the VM. dmesg didn't show anything on the VM or the Proxmox box.
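For reference, the checks I ran were roughly along these lines (the VMID is a placeholder, and `ceph -s` was run against the external Ceph cluster):

```
# systemd scope for the VM on the Proxmox host (VMID is a placeholder)
systemctl status 100.scope

# storage status as Proxmox sees it
pvesm status

# overall health of the external Ceph cluster
ceph -s

# kernel messages on both the host and the guest
dmesg -T | tail -n 50
```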

I've had slow ops in the past but never had them lock up a VM, and now it's happened twice. I'm using KRBD and it's an EC 4+2 pool. I'm sure I'm missing something, but I wanted to see if anyone has ideas on the cause, or suggestions for additional troubleshooting steps if it happens again.
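For next time, what I'm assuming would be worth capturing while the slow ops are active is roughly the following (the OSD id is a placeholder); let me know if there's anything else worth grabbing:

```
# which OSDs are reporting slow ops and how their latency looks
ceph health detail
ceph osd perf

# in-flight ops on a suspect OSD (run on the node hosting that OSD)
ceph daemon osd.12 dump_ops_in_flight

# which images this host has mapped via KRBD
rbd showmapped
```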

Thanks
 
Is it just I/O failing in the guest, or is the whole VM process stuck?
Do you see any processes in `D` state when you check with `ps auxwf`?
Do you see any I/O errors or hung tasks in the journal?
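Something roughly like this should surface both (adjust the patterns as needed):

```
# processes stuck in uninterruptible sleep (D state)
ps auxwf | awk '$8 ~ /^D/'

# hung-task and I/O error messages from the kernel, current boot
journalctl -k -b | grep -iE 'hung_task|blocked for more than|i/o error'
```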
 
I/O is failing in the guest on the RBD mount. The VM process seems to stop and Proxmox shows it as shut down, but the PID never dies on its own.*
Yes, any write attempt to the mount results in a `D` state process.
I did not see any I/O errors or hung tasks in the journal, but I might have stopped the VM before a hung task showed up. I can say that in the past when this happened, trying to gracefully shut down failed with hung-task messages.

* I will add that I do have other VMs in the cluster using the same pool that did not experience an issue, but I/O was low on those when it happened. So it seems like it only affected either VMs writing to a specific OSD/PG or VMs with high I/O; I can't say which.
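If it happens again, something roughly like this should show whether the affected images were hitting the same OSDs (the pool and image names here are just placeholders):

```
# object prefix for one of the affected images
rbd info vm-pool/vm-100-disk-0 | grep block_name_prefix

# map one of that image's data objects to its PG and acting OSD set
# (use the EC data pool here if the image keeps its data in a separate pool)
ceph osd map vm-pool rbd_data.abc123.0000000000000000
```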

In older Proxmox/Ceph setups I've run, I/O has always recovered once the slow ops cleared, and it only happened when a server went down and Ceph went crazy for a minute.
 
