Ceph with krbd is unstable on Proxmox 6.2

jasgripen · May 28, 2020

Running VMs on Ceph with krbd has been unstable since we upgraded to Proxmox 6.2.

We are running Proxmox on 5 nodes, each with 2 NVMe disks for Ceph, and around 50 VMs. We don’t use LXC. We use krbd for our Ceph storage pool and VirtIO SCSI as the VM disk controller.

After upgrading to Proxmox 6.2 we have had a lot of problems. We have a few old VMs running CentOS 5 and 6 (we can’t upgrade them right now because that is out of our control). Since the upgrade, these servers have started to have problems at random times. Here is the syslog from a CentOS 5 VM:

Code:

May 28 08:27:00 kernel: end_request: I/O error, dev vda, sector 44822965
May 28 08:27:00 kernel: Buffer I/O error on device dm-0, logical block 5576717
May 28 08:27:00 kernel: lost page write due to I/O error on dm-0
May 28 08:27:00 kernel: Buffer I/O error on device dm-0, logical block 5576718
May 28 08:27:00 kernel: lost page write due to I/O error on dm-0
May 28 08:27:00 kernel: Aborting journal on device dm-0.
May 28 08:27:00 kernel: __journal_remove_journal_head: freeing b_committed_data
May 28 08:27:00 kernel: journal commit I/O error
May 28 08:27:00 kernel: ext3_abort called.
May 28 08:27:00 kernel: EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
May 28 08:27:00 kernel: Remounting filesystem read-only

We also have a FreeBSD 12.1 vm that gets a lot of SCSI errors in the logs:

Code:

May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): WRITE(10). CDB: 2a 00 00 cf 20 6d 00 01 00 00
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): CAM status: SCSI Status Error
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI status: Check Condition
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI sense: ABORTED COMMAND asc:0,6 (I/O process terminated)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): Retrying command (per sense data)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): WRITE(10). CDB: 2a 00 00 cf 20 6d 00 01 00 00
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): CAM status: SCSI Status Error
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI status: Check Condition
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI sense: ABORTED COMMAND asc:0,6 (I/O process terminated)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): Retrying command (per sense data)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): WRITE(10). CDB: 2a 00 00 cf 20 6d 00 01 00 00
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): CAM status: SCSI Status Error
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI status: Check Condition
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI sense: ABORTED COMMAND asc:0,6 (I/O process terminated)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): Retrying command (per sense data)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): WRITE(10). CDB: 2a 00 00 cf 20 6d 00 01 00 00
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): CAM status: SCSI Status Error
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI status: Check Condition
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI sense: ABORTED COMMAND asc:0,6 (I/O process terminated)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): Retrying command (per sense data)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): WRITE(10). CDB: 2a 00 00 cf 20 6d 00 01 00 00
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): CAM status: SCSI Status Error
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI status: Check Condition
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI sense: ABORTED COMMAND asc:0,6 (I/O process terminated)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): Error 5, Retries exhausted

I haven’t had any visible problems with newer Linux VMs, but since it happens on both the older Linux VMs and on FreeBSD, my guess is that it’s not related to the guest itself. I don’t know whether newer Linux kernels are better at recovering from these errors, so that we don’t notice them, or whether they don’t suffer from these failures at all.

I’ve done some testing with the FreeBSD vm since it is the easiest vm for me to see when something goes wrong. It has some load and it logs the errors nicely.

I get the same error on both 5.4.34-1-pve and 5.4.41-1-pve when I use krbd.
I do not get the error if I don’t use krbd, but that is not an option in the long run since it is much slower.
I also tried rebooting Proxmox with the previous 5.3.18-3-pve kernel, but I still get the same errors.
I did a test with debug logging turned on for Ceph in the kernel, but it produces a lot of information and I really don’t know what to look for. The log was too big to attach in this forum, so I have put it on Google Drive in case anyone is interested: https://drive.google.com/file/d/179lVk1bg9uCzW9j7fO8j5T6H-YeAxDoU/view?usp=sharing
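For anyone who wants to compare test runs like these (krbd on or off, different kernels), a rough Python sketch for counting the guest-side errors per device could look like the following. The patterns are copied from the log excerpts in this post and are only an assumption about what other guests log; the names `PATTERNS` and `count_io_errors` are made up for illustration.

```python
import re
from collections import Counter

# Patterns derived from the CentOS 5 and FreeBSD 12.1 excerpts above.
# Other guest kernels may word these messages differently (assumption).
PATTERNS = {
    "linux_io_error": re.compile(r"end_request: I/O error, dev (\S+),"),
    "linux_buffer_io_error": re.compile(r"Buffer I/O error on device (\S+),"),
    "freebsd_aborted_command": re.compile(r"\((\w+):.*\): SCSI sense: ABORTED COMMAND"),
    "freebsd_retries_exhausted": re.compile(r"\((\w+):.*\): Error 5, Retries exhausted"),
}


def count_io_errors(lines):
    """Count matching error lines, keyed by (error kind, device name)."""
    counts = Counter()
    for line in lines:
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(kind, match.group(1))] += 1
                break  # one error kind per log line is enough
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_io_errors.py < /var/log/messages
    for (kind, device), n in sorted(count_io_errors(sys.stdin).items()):
        print(f"{kind:28s} {device:8s} {n}")
```

Running it on the guest log after each test gives one number per device and error kind, which is easier to compare between runs than reading the raw log.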

Has anyone experienced the same, or does anyone have a clue what I should do or what could be wrong?

Alwin · May 28, 2020

As a first thought, mitigation patches come to mind. Microcode and BIOS up-to-date?

jasgripen said:
I get the same error on both 5.4.34-1-pve and 5.4.41-1-pve when I use krbd.

From which kernel version did you originally upgrade?

jasgripen said:
I do not get the error if I don’t use krbd, but that is not an option in the long run since it is much slower.

Well, krbd uses page cache while librbd uses its own. So by default it is not really comparable. But that's another discussion.

jasgripen · Jun 1, 2020

Alwin said:
As a first thought, mitigation patches come to mind. Microcode and BIOS up-to-date?

I thought I had the latest BIOS, but a newer version was out. I applied it along with the latest microcode updates, and the problem still persists.

Alwin said:
From which kernel version did you originally upgrade?

5.3.18-3-pve

Alwin said:
Well, krbd uses page cache while librbd uses its own. So by default it is not really comparable. But that's another discussion.

Of course, I’m just trying to narrow down where the problem is.

Any other thoughts on what I can try?

jasgripen · Jun 2, 2020

It seems like I have the same problem as described in this thread:

https://forum.proxmox.com/threads/v...y-and-buffer-i-o-errors-since-qemu-3-0.55452/

Alwin · Jun 2, 2020

jasgripen said:
5.3.18-3-pve

Can you run the newest Kernel on all nodes? The mitigation (spectre/meltdown) patches work with the microcode. At least that would be one point to look into.

jasgripen said:
I thought I had the latest BIOS, but a newer version was out. I applied it along with the latest microcode updates, and the problem still persists.

On all nodes?

jasgripen · Jun 2, 2020

Alwin said:
Can you run the newest Kernel on all nodes? The mitigation (spectre/meltdown) patches work with the microcode. At least that would be one point to look into.

Yes, I'm running the newest kernel now on all nodes.

Alwin said:
On all nodes?

Yes, I upgraded BIOS on all nodes.

I’m pretty sure it’s the same bug as described in this thread https://forum.proxmox.com/threads/v...y-and-buffer-i-o-errors-since-qemu-3-0.55452/ and the bug reported here https://bugzilla.proxmox.com/show_bug.cgi?id=2311

The unstable VMs are stable with discard turned off.
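To see which disks of a VM still have discard enabled, the VM config text can be scanned for `discard=on`. The Python sketch below is only an illustration: it assumes the usual Proxmox layout of one `key: volume,option=value,...` line per disk, and the storage name `ceph-vm`, the VM ID 101 and the function name `disks_with_discard` are made up.

```python
DISK_PREFIXES = ("scsi", "virtio", "sata", "ide")


def disks_with_discard(config_text):
    """Return {disk key: volume} for disk lines that contain discard=on."""
    found = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            break  # assumed: a "[name]" header starts a snapshot section
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        # Disk keys end in a number (scsi0, virtio1); this skips e.g. "scsihw".
        prefix = key.rstrip("0123456789")
        if prefix == key or prefix not in DISK_PREFIXES:
            continue
        options = [part.strip() for part in value.split(",")]
        if "discard=on" in options:
            found[key] = options[0]  # first field is the volume
    return found


if __name__ == "__main__":
    # Hypothetical config text, e.g. the content of /etc/pve/qemu-server/101.conf
    example = """\
scsihw: virtio-scsi-pci
scsi0: ceph-vm:vm-101-disk-0,discard=on,size=32G
scsi1: ceph-vm:vm-101-disk-1,size=100G
"""
    for disk, volume in disks_with_discard(example).items():
        print(f"{disk}: {volume} has discard=on")
```

This only reports the setting; changing it is still done through the GUI or `qm set`, and the VM has to be stopped and started for the change to take effect.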

Alwin · Jun 2, 2020

jasgripen said:
I’m pretty sure it’s the same bug as described in this thread https://forum.proxmox.com/threads/v...y-and-buffer-i-o-errors-since-qemu-3-0.55452/ and bug reported here https://bugzilla.proxmox.com/show_bug.cgi?id=2311

While you are at it, could you please provide the data requested by Tim in the bug tracker?

jasgripen · Jun 2, 2020

Alwin said:
While you are at it, could you please provide the data requested by Tim in the bug tracker?

Yes, I'll try to do that.

atec666 · Sep 11, 2020

Same thing here!
Many buffer I/O errors on rbd6 and 2 when migrating a VM ...

atec666 · Sep 12, 2020

An update seems to have solved the issue.
