VM lockups with Ceph

It looks like the 3.10.0-5-pve kernel is the culprit behind my lockups. The Ceph guys had me back off to the 2.6.32-34-pve kernel, and I can't get it to lock up. Granted, I'm now taking about a 30% performance hit, but it's stable.


Maybe it's because it's slower that you don't reach the lockup?
Personally, I have done a lot of benchmarking recently with kernel 3.10, using both librbd and krbd on a full SSD setup, and I never had any problems.

One strange thing is that librbd runs entirely in userland and isn't tied to the kernel version.
So maybe the problem comes from qemu, as the kvm module differs between the two kernels
(or maybe it's a driver bug).

I did notice with krbd that live snapshots don't work with your patch, Spirit.

Thanks for the report, I'll try to fix that.

 



According to the Ceph guys, it didn't appear that there was a deadlock waiting on a mutex or anything like that. They're convinced it's a scheduling issue: a thread that should be scheduled just isn't, basically a stall of some sort, so they think it is the kernel. krbd with your patch worked fine on 3.10, though, when using direct I/O. I'd definitely try to get Proxmox to incorporate your patch, as it is a good alternative, though I'll probably just back off kernel versions so I don't have to deal with custom patches in my production environment.

Oh, and I mentioned that live snapshots don't work with the krbd patch (I think it is because the rbd device isn't being created and mounted for the RAM). I should mention that live snapshots do not work with librbd either. It looked like it was working, but it simply never finished ... I had to hard kill the VM and run "qm unlock 101".
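
For anyone who ends up in the same state, the recovery described above (hard kill, then unlock) can be sketched roughly as follows. This is only a sketch under assumptions: the VM ID 101 is the one from this post, the pid file path /var/run/qemu-server/<vmid>.pid is what a stock Proxmox install is assumed to use, and the `recover_vm` and `run` helpers and the `DRY_RUN` variable are made up for illustration. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: recover a VM whose live snapshot never finished.
# recover_vm, run and DRY_RUN are hypothetical helpers, not part of Proxmox.

# Print the command when DRY_RUN is 1 (the default), execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Usage: recover_vm <vmid> [pidfile]
recover_vm() {
    vmid="$1"
    # Assumed default pid file location; pass a second argument to override.
    pidfile="${2:-/var/run/qemu-server/${vmid}.pid}"

    # Hard kill the kvm process if its pid file is still present.
    if [ -f "$pidfile" ]; then
        run kill -9 "$(cat "$pidfile")"
    fi

    # Clear the stale lock that the unfinished snapshot left on the VM.
    run qm unlock "$vmid"
}
```

Running `recover_vm 101` prints the commands for review, and `DRY_RUN=0 recover_vm 101` actually executes them. A hard kill discards whatever the guest had not flushed to disk, so it is a last resort.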
 
