VMs remounting partition read-only and (Buffer) I/O errors since qemu 3.0

On our Proxmox/Ceph clusters we're seeing multiple Debian Jessie VMs remounting their /tmp partition read-only after some (Buffer) I/O errors are logged in syslog.

From what I can tell this started after updating to pve-kernel-4.15.18-16-pve, as we've not seen this issue before, and we've not had it on our older VMware cluster running the same Debian Jessie installations.

On the two clusters where we've seen the issue we reverted to pve-kernel-4.15.18-15-pve and the problem has not reappeared.

I'm able to trigger the I/O errors fairly easily by constantly writing many small files in a loop with dd and removing them again. The read-only remount seems a bit harder to trigger, but I was able to trigger it when cloning the machine in Proxmox.
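For reference, the reproduction loop I use looks roughly like this (a sketch; the path, block size and file count are illustrative, not the exact values from our script):

Code:
# Write many small files in a loop and remove them again (values are illustrative)
mkdir -p /tmp/iotest
while true; do
    for i in $(seq 1 500); do
        dd if=/dev/zero of=/tmp/iotest/file$i bs=64k count=1 oflag=direct 2>/dev/null
    done
    rm -f /tmp/iotest/file*
done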

From what I was able to gather, the only difference between the two kernels is the TCP SACK bug fixes, which seem somewhat unlikely to be the cause (we were already using the iptables workaround before the update and we have no matched packets in the firewall chain).
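For completeness, the SACK workaround in question is the commonly documented MSS filter, and the packet counters can be inspected with iptables -L -v; a sketch, assuming the rule sits in the INPUT chain:

Code:
# The commonly documented TCP SACK panic mitigation (assumed to sit in the INPUT chain)
iptables -A INPUT -p tcp --tcp-flags SYN SYN -m tcpmss --mss 1:500 -j DROP
# The pkts/bytes counters in the first two columns show whether the rule ever matched
iptables -L INPUT -v -n | grep tcpmss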

Is anyone else seeing such issues or does anyone have suggestions what might be causing this?

This gets logged in /var/log/syslog.


Code:
Jun 25 02:08:21 server kernel: [942597.045574] sd 0:0:0:0: [sda] 
Jun 25 02:08:21 server kernel: [942597.045591] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 25 02:08:21 server kernel: [942597.045593] sd 0:0:0:0: [sda] 
Jun 25 02:08:21 server kernel: [942597.045595] Sense Key : Aborted Command [current]
Jun 25 02:08:21 server kernel: [942597.045598] sd 0:0:0:0: [sda] 
Jun 25 02:08:21 server kernel: [942597.045600] Add. Sense: I/O process terminated
Jun 25 02:08:21 server kernel: [942597.045603] sd 0:0:0:0: [sda] CDB:
Jun 25 02:08:21 server kernel: [942597.045609] Write(10): 2a 00 01 ab e8 06 00 00 02 00
Jun 25 02:08:21 server kernel: [942597.045621] end_request: I/O error, dev sda, sector 28043270
Jun 25 02:08:21 server kernel: [942597.045626] Buffer I/O error on device sda8, logical block 131075
Jun 25 02:08:21 server kernel: [942597.045627] lost page write due to I/O error on sda8
Jun 25 02:09:12 server kernel: [942647.909617] sd 0:0:0:0: [sda] 
Jun 25 02:09:12 server kernel: [942647.909625] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 25 02:09:12 server kernel: [942647.909627] sd 0:0:0:0: [sda] 
Jun 25 02:09:12 server kernel: [942647.909628] Sense Key : Aborted Command [current]
Jun 25 02:09:12 server kernel: [942647.909630] sd 0:0:0:0: [sda] 
Jun 25 02:09:12 server kernel: [942647.909632] Add. Sense: I/O process terminated
Jun 25 02:09:12 server kernel: [942647.909633] sd 0:0:0:0: [sda] CDB:
Jun 25 02:09:12 server kernel: [942647.909638] Write(10): 2a 00 01 ab e8 06 00 00 02 00
Jun 25 02:09:12 server kernel: [942647.909643] end_request: I/O error, dev sda, sector 28043270
Jun 25 02:09:12 server kernel: [942647.909646] Buffer I/O error on device sda8, logical block 131075
Jun 25 02:09:12 server kernel: [942647.909648] lost page write due to I/O error on sda8
Jun 25 07:02:47 server kernel: [960262.885704] sd 0:0:0:0: [sda] 
Jun 25 07:02:47 server kernel: [960262.885731] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 25 07:02:47 server kernel: [960262.885733] sd 0:0:0:0: [sda] 
Jun 25 07:02:47 server kernel: [960262.885734] Sense Key : Aborted Command [current]
Jun 25 07:02:47 server kernel: [960262.885736] sd 0:0:0:0: [sda] 
Jun 25 07:02:47 server kernel: [960262.885737] Add. Sense: I/O process terminated
Jun 25 07:02:47 server kernel: [960262.885753] sd 0:0:0:0: [sda] CDB:
Jun 25 07:02:47 server kernel: [960262.885758] Write(10): 2a 00 01 ab e8 06 00 00 02 00
Jun 25 07:02:47 server kernel: [960262.885768] end_request: I/O error, dev sda, sector 28043270
Jun 25 07:02:47 server kernel: [960262.885773] Buffer I/O error on device sda8, logical block 131075
Jun 25 07:02:47 server kernel: [960262.885774] lost page write due to I/O error on sda8
Jun 25 21:22:36 server kernel: [1011851.372149] Aborting journal on device sda8-8.
Jun 25 21:22:36 server kernel: [1011851.373252] EXT4-fs (sda8): ext4_writepages: jbd2_start: 13312 pages, ino 33; err -30
Jun 25 21:22:36 server kernel: [1011851.377512] EXT4-fs error (device sda8): ext4_journal_check_start:56: Detected aborted journal
Jun 25 21:22:36 server kernel: [1011851.377994] EXT4-fs (sda8): Remounting filesystem read-only
Jun 25 21:22:36 server kernel: [1011851.378445] EXT4-fs (sda8): ext4_writepages: jbd2_start: 13312 pages, ino 31; err -30

pveversion -v output (note that I downgraded the kernel packages)

Code:
proxmox-ve: 5.4-1 (running kernel: 4.15.18-15-pve)
pve-manager: 5.4-6 (running version: 5.4-6/aa7856c5)
pve-kernel-4.15: 5.4-3
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.15.18-15-pve: 4.15.18-40
pve-kernel-4.15.18-11-pve: 4.15.18-34
ceph: 12.2.12-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-10
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-52
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-43
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
openvswitch-switch: 2.7.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-37
pve-container: 2.0-39
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-2
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-52
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
 
An update regarding this issue: my initial findings were wrong. It was not the kernel update causing the issue but rather the update from qemu 2.12 to 3.0, and only because we're using Ceph with KRBD instead of librbd.

I've tested qemu 3.0 with librbd for a week and the issue did not return on those machines. I've also tested 2.12 with KRBD and the issue has not surfaced either; both tests ran for over a week.

To make sure it was really an issue with qemu 3.0 and KRBD, I switched the VM running qemu 3.0 back to KRBD and the problem returned.

Keeping qemu 2.12 in use seems unwise with regard to security updates and feature fixes, but using KRBD improves performance drastically in our situation.
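For anyone who wants to reproduce this or compare the two paths, switching an RBD storage between KRBD and librbd is just the krbd flag on the storage definition; a sketch, assuming a storage ID of 'ceph' (running guests need to be stopped and started again, or migrated, to pick up the change):

Code:
# Toggle kernel RBD mapping on an existing RBD storage ('ceph' is an assumed storage ID)
pvesm set ceph --krbd 0   # disks are accessed through librbd inside qemu
pvesm set ceph --krbd 1   # disks are mapped on the host via the kernel RBD driver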

Could you please provide us with support regarding this issue?
 
Hi,

I can confirm this issue with the latest version, pve-qemu-kvm: 3.0.1-4 - when used with Ceph KRBD, VMs suffer from I/O errors and data loss.

@dietmar Any chance to fix this soon?
 
The latest version is pve-qemu-kvm: 4.0.0-3 (Proxmox VE 6.0).

@tom I meant the latest version for Debian Stretch / Proxmox 5.4. It would be highly appreciated to have this bug fixed in Proxmox 5. Besides that, can you confirm that the bug is fixed in pve-qemu-kvm: 4.0.0-3? I want to avoid upgrading the whole cluster (without any downgrade method) if there is a risk that this bug is still unfixed.
 
Hi,

I can confirm this issue with the latest version, pve-qemu-kvm: 3.0.1-4 - when used with Ceph KRBD, VMs suffer from I/O errors and data loss.

Are your virtual machines using scsi or virtio disks? Mine are scsi with a VirtIO SCSI controller so I can use discard, but I'm currently testing with virtio disks to see if I can still trigger the issue; so far 24 hours without any issues.

Do you have a way to (quickly) trigger the issue? Usually I'm able to trigger it within an hour, but sometimes it takes more than a day; that's why my initial report was a bit fuzzy and why I won't declare virtio disks a workaround until the test has run for at least a week.
 
Yeah, scsi with discard. VM config like:
Code:
scsi0: ceph:vm-100-disk-0,cache=writeback,discard=on,size=100G
scsihw: virtio-scsi-pci

It is rather easily triggered, just generate lots of I/O activity inside the VM, e.g.:
Code:
fio --rw=randwrite --iodepth=32 --ioengine=libaio --bs=4k --numjobs=1 --filename=/root/test --size=50G --runtime=60 --group_reporting --name=test --direct=1

I am surprised that such a serious bug (freezing VMs, data loss!) hasn't caused more complaints here, especially considering this bug has been persisting for months already. Are we the only ones using a recent Proxmox version with Ceph storage?
There seems to be little interest from the Proxmox team. :(
@dietmar @tom are you going to fix this?
 

Thanks for the test case, I'm testing it now to see if it's faster than my dd loop.

The lack of response is bothering me as well, though my initial report was somewhat incomplete. It seems many are using Ceph with librbd instead of KRBD; I was not able to reproduce the issue with librbd.

I see in the bugzilla report you've figured out that discard is causing this problem, which would explain why using virtio disks seems stable: there's no discard with virtio in Proxmox 5.x. We've also not seen this on our Windows VMs, as they're using virtio disks.

I'll try and see if I can trigger the issue with Proxmox 6.0 and virtio + discard as well though I might not be able to do so today.
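To double check whether discard is actually exposed inside a guest before and after switching disk types, standard Linux tooling is enough; a sketch (sda/sda8 and /tmp match the layout from my first post):

Code:
# Non-zero DISC-GRAN / DISC-MAX values mean the virtual disk advertises discard support
lsblk --discard /dev/sda
# Trimming the affected mount by hand exercises the same discard path as ext4's discard option
fstrim -v /tmp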
 
Yeah, thanks. I really do hope for some response from the Proxmox team.

librbd is super, super slow compared to KRBD, so that is not a viable alternative.
 
I can confirm that with Proxmox VE 6.0, Ceph + KRBD and virtio disks with discard=on the issue remains.
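For reference, the virtio variant of the disk line I tested looks roughly like this (storage name, disk number and size are taken from the earlier example and may differ in your setup):

Code:
virtio0: ceph:vm-100-disk-0,cache=writeback,discard=on,size=100G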

So to recap (please correct me if I'm wrong): when using PVE 5.4 or 6.0 with Ceph and KRBD, turning on discard results in data loss in the VM regardless of the guest OS, and the problem appears to have been introduced with pve-qemu-kvm versions newer than 2.x.
 
We have also identified this as a problem since upgrading from kernel 5.3.18-3-pve to either 5.4.35-1-pve or 5.4.41-1-pve.

We operate 10 Ceph clusters and have the following setup on all of them:
  • Ceph Nautilus 14.2.9
  • Kernel RBD with the following features enabled on each and every single image since kernel 5.3.18:
    • rbd_default_features = 31
      • Layering (1)
      • Striping v2 (2)
      • Exclusive locking (4)
      • Object Map (8)
      • Fast diff (16)
    • Added object-map and fast-diff to each pre-existing image
  • We use discard on VirtIO SCSI attached disks to get space reclamation

Herewith a sample VM definition:
Code:
agent: 1
boot: cdn
bootdisk: scsi0
cores: 1
cpu: Broadwell
ide2: none,media=cdrom
memory: 4096
name: lair-onos2
net0: virtio=F2:01:26:16:71:BB,bridge=vmbr0,tag=61
numa: 1
onboot: 1
ostype: l26
protection: 1
scsi0: rbd_ssd:vm-136-disk-0,cache=writeback,discard=on,size=40G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=794216ab-9bb1-4a8b-919b-f71295a51bf3
sockets: 2
vmgenid: cc07e4e2-c3ca-4ad6-870a-8eeb3537a9a1


Sample commands to add object-map and enable fast-diff on existing images:
Code:
ceph config rm global rbd_default_features;
ceph config-key rm config/global/rbd_default_features;
ceph config set global rbd_default_features 31;
ceph config dump | grep rbd_default_features;
for pool in rbd_ssd rbd_hdd; do for image in `rbd ls $pool`; do rbd feature enable $pool/$image exclusive-lock; rbd feature enable $pool/$image object-map fast-diff; done; done;
for pool in rbd_ssd rbd_hdd; do for image in `rbd ls $pool`; do echo -en "$pool/$image\t: "; rbd info $pool/$image | grep -P -e '\sfeatures:'; done; done;
for pool in rbd_ssd rbd_hdd; do for image in `rbd ls $pool -l | grep -v '^NAME' | awk '{print $1}'`; do echo -ne "$pool/$image\t:"; rbd info $pool/$image | grep -e flags; done; done
for pool in rbd_ssd rbd_hdd; do for image in `rbd ls $pool -l | grep -v '^NAME' | awk '{print $1}'`; do [ `rbd info $pool/$image | grep -c 'object map invalid'` -gt 0 ] && rbd object-map rebuild $pool/$image; done; done;


The commonality that we've been able to identify is that these errors occur on guests which don't have proper sector alignment.

In Windows, run 'diskpart':
diskpart.png

If your offset is 512 bytes (sector 1), the partition starts too early:
diskpart2.png


Would be interested to know if this affects anyone whose partitions are properly aligned...
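For Linux guests the same check can be done from inside the guest with standard tools; a sketch (device and partition number are examples):

Code:
# Print partition start sectors; a start that is a multiple of 2048 is 1 MiB aligned
parted /dev/sda 'unit s print'
# parted can also answer the alignment question directly (here for partition 1)
parted /dev/sda align-check optimal 1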

This is very evident ever since booting the hypervisor with 5.4.35 or 5.4.41:
system_events_153.png
 
I believe changes in kernel 5.4 now time out or return errors, and subsequently log problems, where previously this went unnoticed. A relatively slow, disk-only sandbox cluster occasionally exhibits these errors even with valid partition alignment, but other systems have stopped logging these errors after we transferred data to sector-aligned partitions.

We carefully reviewed the output of the following script and worked on any guest where the start location of any partition wasn't evenly divisible by 2048 sectors (1 MiB (1024 * 1024) / 512 bytes per virtual sector = 2048 sectors).

This essentially prints the partition table, setting units as sectors, for all RBD images mapped on the host:
Code:
for f in `rbd showmapped | grep -v ^id | awk '{print $5}'`; do parted $f 'unit s print'; done


Herewith an example where one should ignore the 'Microsoft reserved partition', which aligns the data partition to a location that evenly divides by 1 MiB:
parted_sector_alignment.png

In this example the first disk (OS) starts at exactly 1 MiB (start sector * sector size / 1048576: 2048 * 512 / 1048576 = 1).
The data volume is partitioned using GPT and the data partition starts at 129 MiB (264192 * 512 / 1048576).
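To avoid eyeballing the output, the divisibility check itself can be scripted; a rough sketch over the same mapped devices (untested as written, and the awk column may need adjusting if your rbd version prints a namespace column in showmapped):

Code:
# Flag any partition whose start sector is not a multiple of 2048 (i.e. not 1 MiB aligned)
for dev in `rbd showmapped | grep -v ^id | awk '{print $5}'`; do
    parted -sm $dev 'unit s print' | grep -E '^[0-9]+:' | while IFS=: read num start rest; do
        start=${start%s}
        [ $((start % 2048)) -ne 0 ] && echo "$dev partition $num starts at sector $start (unaligned)"
    done
done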
 
