Ceph with krbd is unstable on Proxmox 6.2

Running VMs on Ceph with krbd has been unstable since we upgraded to Proxmox 6.2.

We are running Proxmox on 5 nodes with 2 NVMe disks for Ceph in each node and around 50 VMs; we don’t use LXC. We use krbd for our Ceph storage pool and VirtIO SCSI as the VM disk controller.
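
For reference, the storage definition looks roughly like this; the storage and pool names here are placeholders, not our actual config:

Code:
# /etc/pve/storage.cfg (sketch, placeholder names)
rbd: ceph-vm
        content images
        pool ceph-vm
        krbd 1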

After upgrading to Proxmox 6.2 we have had a lot of problems. We have a few old VMs with CentOS 5 and 6 (we can’t upgrade them right now because it’s out of our control). Since the upgrade, these servers have started having problems at random times. Here is an excerpt from syslog on a CentOS 5 guest:

Code:
May 28 08:27:00 kernel: end_request: I/O error, dev vda, sector 44822965
May 28 08:27:00 kernel: Buffer I/O error on device dm-0, logical block 5576717
May 28 08:27:00 kernel: lost page write due to I/O error on dm-0
May 28 08:27:00 kernel: Buffer I/O error on device dm-0, logical block 5576718
May 28 08:27:00 kernel: lost page write due to I/O error on dm-0
May 28 08:27:00 kernel: Aborting journal on device dm-0.
May 28 08:27:00 kernel: __journal_remove_journal_head: freeing b_committed_data
May 28 08:27:00 kernel: journal commit I/O error
May 28 08:27:00 kernel: ext3_abort called.
May 28 08:27:00 kernel: EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
May 28 08:27:00 kernel: Remounting filesystem read-only

We also have a FreeBSD 12.1 VM that gets a lot of SCSI errors in its logs:

Code:
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): WRITE(10). CDB: 2a 00 00 cf 20 6d 00 01 00 00
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): CAM status: SCSI Status Error
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI status: Check Condition
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI sense: ABORTED COMMAND asc:0,6 (I/O process terminated)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): Retrying command (per sense data)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): WRITE(10). CDB: 2a 00 00 cf 20 6d 00 01 00 00
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): CAM status: SCSI Status Error
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI status: Check Condition
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI sense: ABORTED COMMAND asc:0,6 (I/O process terminated)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): Retrying command (per sense data)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): WRITE(10). CDB: 2a 00 00 cf 20 6d 00 01 00 00
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): CAM status: SCSI Status Error
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI status: Check Condition
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI sense: ABORTED COMMAND asc:0,6 (I/O process terminated)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): Retrying command (per sense data)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): WRITE(10). CDB: 2a 00 00 cf 20 6d 00 01 00 00
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): CAM status: SCSI Status Error
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI status: Check Condition
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI sense: ABORTED COMMAND asc:0,6 (I/O process terminated)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): Retrying command (per sense data)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): WRITE(10). CDB: 2a 00 00 cf 20 6d 00 01 00 00
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): CAM status: SCSI Status Error
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI status: Check Condition
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI sense: ABORTED COMMAND asc:0,6 (I/O process terminated)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): Error 5, Retries exhausted

I haven’t had any visible problems with newer Linux VMs, but since it happens both on older Linux VMs and on FreeBSD, my guess is that it’s not related to the VM itself. I don’t know whether newer Linux kernels are better at recovering from these problems, so we just don’t notice them, or whether they don’t suffer from these failures at all.

I’ve done some testing with the FreeBSD VM, since it is the easiest one for me to watch when something goes wrong: it has some load and it logs the errors nicely.

  • I get the same errors on both 5.4.34-1-pve and 5.4.41-1-pve when I use krbd.
  • I do not get the errors if I don’t use krbd (I toggled it as sketched after this list). But this is not an option in the long run since it is much slower.
  • I also tried rebooting Proxmox with the previous 5.3.18-3-pve kernel, but I still get the same errors.
  • I did a test with debug logging turned on for Ceph in the kernel, but it’s a lot of information and I really don’t know what to look for. The log was too big to attach in this forum, so I have it on Google Drive if anyone is interested: https://drive.google.com/file/d/179lVk1bg9uCzW9j7fO8j5T6H-YeAxDoU/view?usp=sharing
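
Toggling krbd for these tests was just flipping the flag on the storage and then stopping/starting the VM so the disk gets reattached (storage name is a placeholder):

Code:
# disable krbd on the RBD storage, then stop/start the test VM
pvesm set ceph-vm --krbd 0
# re-enable it afterwards
pvesm set ceph-vm --krbd 1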

Has anyone experienced the same, or does anyone have a clue about what I should do or what could be wrong?
 
As a first thought, mitigation patches come to mind. Microcode and BIOS up-to-date?

  • I get the same errors on both 5.4.34-1-pve and 5.4.41-1-pve when I use krbd.
From which kernel version did you originally upgrade?

  • I do not get the errors if I don’t use krbd. But this is not an option in the long run since it is much slower.
Well, krbd uses page cache while librbd uses its own. So by default it is not really comparable. But that's another discussion. ;)
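
If you want a somewhat fairer comparison, you could give the librbd-backed disk an explicit cache mode, e.g. something like the following (VMID and volume name are placeholders, and note that qm set replaces the whole disk option string):

Code:
# example only: set writeback cache on a librbd-backed disk
qm set 100 --scsi0 ceph-vm:vm-100-disk-0,cache=writeback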
 
As a first thought, mitigation patches come to mind. Microcode and BIOS up-to-date?
I thought I had the latest BIOS, but there was a new version out. Anyway, I applied it along with the latest microcode updates, and the problem still persists.
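
In case it helps, this is how I checked the mitigation state on the nodes after the updates (standard kernel sysfs, nothing Proxmox-specific):

Code:
# list the active CPU vulnerability mitigations
grep . /sys/devices/system/cpu/vulnerabilities/*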

From which kernel version did you originally upgrade?
5.3.18-3-pve

Well, krbd uses page cache while librbd uses its own. So by default it is not really comparable. But that's another discussion. ;)
Of course, I’m just trying to narrow down where the problem is.

Any other thoughts on what I can try?
 
5.3.18-3-pve
Can you run the newest kernel on all nodes? The mitigation (Spectre/Meltdown) patches work together with the microcode. At least that would be one point to look into.

I thought I had the latest BIOS, but there was a new version out. Anyway, I applied it along with the latest microcode updates, and the problem still persists.
On all nodes?
 
Can you run the newest kernel on all nodes? The mitigation (Spectre/Meltdown) patches work together with the microcode. At least that would be one point to look into.
Yes, I'm running the newest kernel now on all nodes.

On all nodes?
Yes, I upgraded BIOS on all nodes.

I’m pretty sure it’s the same bug as the one described in this thread: https://forum.proxmox.com/threads/v...y-and-buffer-i-o-errors-since-qemu-3-0.55452/ and reported here: https://bugzilla.proxmox.com/show_bug.cgi?id=2311

The unstable VMs are stable with discard turned off.
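
For anyone hitting the same issue: turning discard off is just a disk-option change, roughly like this (VMID, bus and volume name are placeholders for whatever your VM actually uses):

Code:
# example only: set discard back to the default 'ignore' on the affected disk
qm set 100 --scsi0 ceph-vm:vm-100-disk-0,discard=ignore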
 
