RBD I/O stops to separate Ceph cluster

lightspd

New Member
Jan 17, 2024
Hello,

I recently ran into a new issue and I'm not sure what the cause is or where to start looking. I have a Proxmox 7.4-18 cluster that talks to a separate Ceph Reef cluster. Normally there are no issues, but recently some drives started going bad on the Ceph side, which resulted in slow ops warnings. That's not usually a big deal; it's only been one or two slow ops at a time, and they went away quickly.

Now onto the issue: it locked up all I/O on the RBD mount in the VM (Ubuntu Server, if it matters). I stopped the VM, but its scope was still there and I had to reboot the Proxmox server to start the VM again. Checking the QEMU status just shows the #.scope unit as kvm, pvesm status shows the storage running, the Ceph cluster shows healthy, and I/O resumes fine after restarting the server and the VM. dmesg didn't show anything on the VM or the Proxmox box.
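For reference, the checks I ran were roughly along these lines (the VMID is a placeholder, and `ceph -s` was run against the external Ceph cluster):

```
# systemd scope for the VM on the Proxmox host (VMID is a placeholder)
systemctl status 100.scope

# storage status as Proxmox sees it
pvesm status

# overall health of the external Ceph cluster
ceph -s

# kernel messages on both the host and the guest
dmesg -T | tail -n 50
```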

I've had slow ops in the past but never had them lock up a VM, and now it's happened twice. I'm using KRBD and it's an EC 4+2 pool. I'm sure I'm missing something, but I wanted to see if anyone has ideas on the cause, or suggestions for additional troubleshooting steps if it happens again.
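For next time, what I'm assuming would be worth capturing while the slow ops are active is roughly the following (the OSD id is a placeholder); let me know if there's anything else worth grabbing:

```
# which OSDs are reporting slow ops and how their latency looks
ceph health detail
ceph osd perf

# in-flight ops on a suspect OSD (run on the node hosting that OSD)
ceph daemon osd.12 dump_ops_in_flight

# which images this host has mapped via KRBD
rbd showmapped
```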

Thanks
 
Is it just I/O failing in the guest, or is the whole VM process stuck?
Do you see any processes in `D` state when you check with `ps auxwf`?
Do you see any I/O errors or hung tasks in the journal?
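Something roughly like this should surface both (adjust the patterns as needed):

```
# processes stuck in uninterruptible sleep (D state)
ps auxwf | awk '$8 ~ /^D/'

# hung-task and I/O error messages from the kernel, current boot
journalctl -k -b | grep -iE 'hung_task|blocked for more than|i/o error'
```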
 
I/O is failing in the guest on the RBD mount. The VM process seems to stop and Proxmox shows it as shut down, but the PID never dies on its own.*
Yes, any write attempt to the mount results in a `D` state process.
I did not see any I/O errors or hung tasks in the journal, but I might have stopped the VM before a hung task showed up. I can say that in the past when this happened, trying to gracefully shut down failed with hung-task messages.

* I will add that I do have other VMs in the cluster using the same pool that did not experience an issue, but I/O was low on those when it happened. So it seems like it only affected either VMs writing to a specific OSD/PG or VMs with high I/O; I can't say which.
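If it happens again, something roughly like this should show whether the affected images were hitting the same OSDs (the pool and image names here are just placeholders):

```
# object prefix for one of the affected images
rbd info vm-pool/vm-100-disk-0 | grep block_name_prefix

# map one of that image's data objects to its PG and acting OSD set
# (use the EC data pool here if the image keeps its data in a separate pool)
ceph osd map vm-pool rbd_data.abc123.0000000000000000
```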

In older Proxmox/Ceph setups I've run, I/O has always recovered once the slow ops cleared, and it only happened when a server went down and Ceph went crazy for a minute.
 
