Running VMs on Ceph with krbd has been unstable since we upgraded to Proxmox 6.2
We are running Proxmox on 5 nodes, each with 2 NVMe disks for Ceph, and around 50 VMs. We don’t use LXC. We use krbd for our Ceph storage pool and VirtIO SCSI as the VM disk controller.
Since upgrading to Proxmox 6.2 we have had a lot of problems. We have a few old VMs running CentOS 5 and 6 (we can’t upgrade them right now because that is out of our control). Since the upgrade, these servers have started to have problems at random times. Here is an excerpt from the syslog of one of the CentOS 5 machines:
Code:
May 28 08:27:00 kernel: end_request: I/O error, dev vda, sector 44822965
May 28 08:27:00 kernel: Buffer I/O error on device dm-0, logical block 5576717
May 28 08:27:00 kernel: lost page write due to I/O error on dm-0
May 28 08:27:00 kernel: Buffer I/O error on device dm-0, logical block 5576718
May 28 08:27:00 kernel: lost page write due to I/O error on dm-0
May 28 08:27:00 kernel: Aborting journal on device dm-0.
May 28 08:27:00 kernel: __journal_remove_journal_head: freeing b_committed_data
May 28 08:27:00 kernel: journal commit I/O error
May 28 08:27:00 kernel: ext3_abort called.
May 28 08:27:00 kernel: EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
May 28 08:27:00 kernel: Remounting filesystem read-only
We also have a FreeBSD 12.1 VM that logs a lot of SCSI errors:
Code:
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): WRITE(10). CDB: 2a 00 00 cf 20 6d 00 01 00 00
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): CAM status: SCSI Status Error
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI status: Check Condition
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI sense: ABORTED COMMAND asc:0,6 (I/O process terminated)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): Retrying command (per sense data)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): WRITE(10). CDB: 2a 00 00 cf 20 6d 00 01 00 00
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): CAM status: SCSI Status Error
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI status: Check Condition
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI sense: ABORTED COMMAND asc:0,6 (I/O process terminated)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): Retrying command (per sense data)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): WRITE(10). CDB: 2a 00 00 cf 20 6d 00 01 00 00
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): CAM status: SCSI Status Error
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI status: Check Condition
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI sense: ABORTED COMMAND asc:0,6 (I/O process terminated)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): Retrying command (per sense data)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): WRITE(10). CDB: 2a 00 00 cf 20 6d 00 01 00 00
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): CAM status: SCSI Status Error
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI status: Check Condition
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI sense: ABORTED COMMAND asc:0,6 (I/O process terminated)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): Retrying command (per sense data)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): WRITE(10). CDB: 2a 00 00 cf 20 6d 00 01 00 00
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): CAM status: SCSI Status Error
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI status: Check Condition
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): SCSI sense: ABORTED COMMAND asc:0,6 (I/O process terminated)
May 27 10:30:48 mx1 (da0: vtscsi0:0:0:0): Error 5, Retries exhausted
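As an aside, the CDB bytes in the FreeBSD log can be decoded to see which write keeps failing. Below is a minimal Python sketch, not anything Proxmox or FreeBSD ships. The function names `decode_write10` and `summarise` are made up for illustration, and the 512-byte block size is an assumption, since the log does not state it. The byte layout itself (opcode 0x2A, big-endian LBA in bytes 2-5, transfer length in bytes 7-8) is the standard WRITE(10) format.

```python
import re

# Matches the hex bytes that follow "CDB:" in a FreeBSD CAM log line.
CDB_RE = re.compile(r"CDB:\s*((?:[0-9a-fA-F]{2}\s*)+)")


def decode_write10(cdb_hex, block_size=512):
    """Decode a WRITE(10) CDB given as space-separated hex bytes.

    block_size is an assumption (512 bytes); adjust it to the disk's
    real logical block size.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return {"lba": lba, "blocks": blocks, "bytes": blocks * block_size}


def summarise(lines):
    """Count how often each CDB was logged and how many commands gave up."""
    attempts = {}
    exhausted = 0
    for line in lines:
        match = CDB_RE.search(line)
        if match:
            key = match.group(1).strip()
            attempts[key] = attempts.get(key, 0) + 1
        elif "Retries exhausted" in line:
            exhausted += 1
    return attempts, exhausted


if __name__ == "__main__":
    print(decode_write10("2a 00 00 cf 20 6d 00 01 00 00"))
```

For the CDB in the log above this gives LBA 13574253 and a transfer length of 256 blocks, which is 128 KiB under the 512-byte assumption. Feeding the full excerpt to `summarise` shows the same write being logged five times before FreeBSD reports "Retries exhausted".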
I haven’t seen any visible problems with newer Linux VMs, but since it happens on both the older Linux VMs and on FreeBSD, my guess is that it’s not related to the guest itself. I don’t know whether newer Linux kernels are simply better at recovering from these errors, so that we don’t notice them, or whether they don’t hit these failures at all.
I’ve done some testing with the FreeBSD VM, since it is the easiest one for me to see when something goes wrong. It has some load and it logs the errors nicely.
- I get the same error on both 5.4.34-1-pve and 5.4.41-1-pve when I use krbd.
- I do not get the error if I don’t use krbd. But that is not an option in the long run, since it is much slower.
- I also tried rebooting Proxmox with the previous 5.3.18-3-pve kernel, but I still get the same errors.
- I ran a test with debug logging turned on for Ceph in the kernel, but it produces a lot of information and I really don’t know what to look for. The log was too big to attach in this forum, so I have put it on Google Drive if anyone is interested: https://drive.google.com/file/d/179lVk1bg9uCzW9j7fO8j5T6H-YeAxDoU/view?usp=sharing
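One way to make a debug log of that size manageable is to keep only the Ceph-related kernel lines logged within a few seconds of a guest error. The following is a rough Python sketch of that idea, under several assumptions: the log lines start with a syslog-style timestamp such as `May 27 10:30:48`, the year is 2020 (syslog timestamps omit it), and the guest and host clocks are roughly in sync. The function names, the keyword list and the sample lines are all hypothetical placeholders, not real kernel output.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = "%Y %b %d %H:%M:%S"
KEYWORDS = ("rbd", "ceph")  # "ceph" also matches "libceph"


def parse_stamp(line, year=2020):
    """Parse the leading 15-character syslog timestamp; None if absent."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", STAMP_FORMAT)
    except ValueError:
        return None


def window(lines, error_time, seconds=5, keywords=KEYWORDS, year=2020):
    """Return lines within +/- `seconds` of error_time that mention a keyword."""
    centre = datetime.strptime(f"{year} {error_time}", STAMP_FORMAT)
    delta = timedelta(seconds=seconds)
    hits = []
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue  # skip lines without a recognisable timestamp
        if abs(stamp - centre) <= delta and any(k in line for k in keywords):
            hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    # Placeholder lines for illustration only; not real kernel messages.
    sample = [
        "May 27 10:30:47 pve1 kernel: libceph: <placeholder message>",
        "May 27 10:30:48 pve1 kernel: <unrelated placeholder>",
        "May 27 10:35:00 pve1 kernel: rbd: <placeholder message>",
    ]
    for hit in window(sample, "May 27 10:30:48"):
        print(hit)
```

With a real log, `lines` would be the opened log file and `error_time` the timestamp from the guest, for example the `May 27 10:30:48` of the FreeBSD errors. If the clocks differ, widening `seconds` is the simplest workaround.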
Has anyone experienced the same thing, or does anyone have a clue what I should do or what could be wrong?