VMs remounting partition read-only and (Buffer) I/O errors since qemu 3.0

On our Proxmox/Ceph clusters we're seeing multiple Debian Jessie VMs remounting their /tmp partition read-only after some (Buffer) I/O errors are logged in syslog.

From what I can tell this started after updating to pve-kernel-4.15.18-16-pve, as we've not seen this issue before, and we've not had it on our older VMware cluster running the same Debian Jessie installations.

On the two clusters where we've seen the issue we reverted to pve-kernel-4.15.18-15-pve and the problem has not reappeared.

I'm able to trigger the I/O errors fairly easily by constantly writing many small files in a loop with dd and removing them again. The read-only remount seems a bit harder to trigger, but I was able to trigger it when cloning the machine in Proxmox.
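For reference, the reproduction loop I use looks roughly like this (a sketch; the path, block size and file count are illustrative, not the exact values from our script):

Code:
# Write many small files in a loop and remove them again (values are illustrative)
mkdir -p /tmp/iotest
while true; do
    for i in $(seq 1 500); do
        dd if=/dev/zero of=/tmp/iotest/file$i bs=64k count=1 oflag=direct 2>/dev/null
    done
    rm -f /tmp/iotest/file*
done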

From what I was able to gather, the only difference between the two kernels is the TCP SACK bug fixes, which seem somewhat unlikely to be the cause (we were already using the iptables workaround before the update and we have no matched packets in the firewall chain).
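For completeness, the SACK workaround in question is the commonly documented MSS filter, and the packet counters can be inspected with iptables -L -v; a sketch, assuming the rule sits in the INPUT chain:

Code:
# The commonly documented TCP SACK panic mitigation (assumed to sit in the INPUT chain)
iptables -A INPUT -p tcp --tcp-flags SYN SYN -m tcpmss --mss 1:500 -j DROP
# The pkts/bytes counters in the first two columns show whether the rule ever matched
iptables -L INPUT -v -n | grep tcpmss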

Is anyone else seeing such issues or does anyone have suggestions what might be causing this?

This gets logged in /var/log/syslog.


Code:
Jun 25 02:08:21 server kernel: [942597.045574] sd 0:0:0:0: [sda] 
Jun 25 02:08:21 server kernel: [942597.045591] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 25 02:08:21 server kernel: [942597.045593] sd 0:0:0:0: [sda] 
Jun 25 02:08:21 server kernel: [942597.045595] Sense Key : Aborted Command [current]
Jun 25 02:08:21 server kernel: [942597.045598] sd 0:0:0:0: [sda] 
Jun 25 02:08:21 server kernel: [942597.045600] Add. Sense: I/O process terminated
Jun 25 02:08:21 server kernel: [942597.045603] sd 0:0:0:0: [sda] CDB:
Jun 25 02:08:21 server kernel: [942597.045609] Write(10): 2a 00 01 ab e8 06 00 00 02 00
Jun 25 02:08:21 server kernel: [942597.045621] end_request: I/O error, dev sda, sector 28043270
Jun 25 02:08:21 server kernel: [942597.045626] Buffer I/O error on device sda8, logical block 131075
Jun 25 02:08:21 server kernel: [942597.045627] lost page write due to I/O error on sda8
Jun 25 02:09:12 server kernel: [942647.909617] sd 0:0:0:0: [sda] 
Jun 25 02:09:12 server kernel: [942647.909625] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 25 02:09:12 server kernel: [942647.909627] sd 0:0:0:0: [sda] 
Jun 25 02:09:12 server kernel: [942647.909628] Sense Key : Aborted Command [current]
Jun 25 02:09:12 server kernel: [942647.909630] sd 0:0:0:0: [sda] 
Jun 25 02:09:12 server kernel: [942647.909632] Add. Sense: I/O process terminated
Jun 25 02:09:12 server kernel: [942647.909633] sd 0:0:0:0: [sda] CDB:
Jun 25 02:09:12 server kernel: [942647.909638] Write(10): 2a 00 01 ab e8 06 00 00 02 00
Jun 25 02:09:12 server kernel: [942647.909643] end_request: I/O error, dev sda, sector 28043270
Jun 25 02:09:12 server kernel: [942647.909646] Buffer I/O error on device sda8, logical block 131075
Jun 25 02:09:12 server kernel: [942647.909648] lost page write due to I/O error on sda8
Jun 25 07:02:47 server kernel: [960262.885704] sd 0:0:0:0: [sda] 
Jun 25 07:02:47 server kernel: [960262.885731] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 25 07:02:47 server kernel: [960262.885733] sd 0:0:0:0: [sda] 
Jun 25 07:02:47 server kernel: [960262.885734] Sense Key : Aborted Command [current]
Jun 25 07:02:47 server kernel: [960262.885736] sd 0:0:0:0: [sda] 
Jun 25 07:02:47 server kernel: [960262.885737] Add. Sense: I/O process terminated
Jun 25 07:02:47 server kernel: [960262.885753] sd 0:0:0:0: [sda] CDB:
Jun 25 07:02:47 server kernel: [960262.885758] Write(10): 2a 00 01 ab e8 06 00 00 02 00
Jun 25 07:02:47 server kernel: [960262.885768] end_request: I/O error, dev sda, sector 28043270
Jun 25 07:02:47 server kernel: [960262.885773] Buffer I/O error on device sda8, logical block 131075
Jun 25 07:02:47 server kernel: [960262.885774] lost page write due to I/O error on sda8
Jun 25 21:22:36 server kernel: [1011851.372149] Aborting journal on device sda8-8.
Jun 25 21:22:36 server kernel: [1011851.373252] EXT4-fs (sda8): ext4_writepages: jbd2_start: 13312 pages, ino 33; err -30
Jun 25 21:22:36 server kernel: [1011851.377512] EXT4-fs error (device sda8): ext4_journal_check_start:56: Detected aborted journal
Jun 25 21:22:36 server kernel: [1011851.377994] EXT4-fs (sda8): Remounting filesystem read-only
Jun 25 21:22:36 server kernel: [1011851.378445] EXT4-fs (sda8): ext4_writepages: jbd2_start: 13312 pages, ino 31; err -30

pveversion -v output (note that I downgraded the kernel packages)

Code:
proxmox-ve: 5.4-1 (running kernel: 4.15.18-15-pve)
pve-manager: 5.4-6 (running version: 5.4-6/aa7856c5)
pve-kernel-4.15: 5.4-3
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.15.18-15-pve: 4.15.18-40
pve-kernel-4.15.18-11-pve: 4.15.18-34
ceph: 12.2.12-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-10
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-52
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-43
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
openvswitch-switch: 2.7.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-37
pve-container: 2.0-39
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-2
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-52
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
 
An update regarding this issue: my initial findings were wrong. It was not the kernel update causing the issue but rather the update from qemu 2.12 to 3.0, and only because we're using Ceph with KRBD instead of librbd.

I've tested qemu 3.0 with librbd for a week and the issue did not return on those machines. I've also tested 2.12 with KRBD and the issue has not surfaced either; both tests ran for over a week.

To make sure it was really an issue with qemu 3.0 and KRBD, I switched the VM running qemu 3.0 back to KRBD and the problem returned.

Keeping qemu 2.12 in use seems unwise with regard to security updates and feature fixes, but using KRBD improves performance drastically in our situation.
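For anyone who wants to reproduce this or compare the two paths, switching an RBD storage between KRBD and librbd is just the krbd flag on the storage definition; a sketch, assuming a storage ID of 'ceph' (running guests need to be stopped and started again, or migrated, to pick up the change):

Code:
# Toggle kernel RBD mapping on an existing RBD storage ('ceph' is an assumed storage ID)
pvesm set ceph --krbd 0   # disks are accessed through librbd inside qemu
pvesm set ceph --krbd 1   # disks are mapped on the host via the kernel RBD driver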

Could you please provide us with support regarding this issue?
 
Hi,

I can confirm this issue with the latest version, pve-qemu-kvm: 3.0.1-4 - when used with Ceph KRBD, VMs suffer from I/O errors and data loss.

@dietmar Any chance to fix this soon?
 
The latest version is pve-qemu-kvm: 4.0.0-3 (Proxmox VE 6.0).

@tom I meant the latest version for Debian Stretch / Proxmox 5.4. It would be highly appreciated to have this bug fixed in Proxmox 5. Besides that, can you confirm that the bug is fixed in pve-qemu-kvm: 4.0.0-3? I want to avoid upgrading the whole cluster (without any downgrade method) if there is a risk that this bug is still unfixed.
 
Hi,

I can confirm this issue with the latest version, pve-qemu-kvm: 3.0.1-4 - when used with Ceph KRBD, VMs suffer from I/O errors and data loss.

Are your virtual machines using scsi or virtio disks? Mine are scsi with a VirtIO SCSI controller so I can use discard, but I'm currently testing with virtio disks to see if I can still trigger the issue; so far 24 hours without any issues.

Do you have a way to (quickly) trigger the issue? Usually I'm able to trigger it within an hour, but sometimes it takes more than a day; that's why my initial report was a bit fuzzy and why I won't declare virtio disks a workaround until the test has run for at least a week.
 
Yeah, scsi with discard. VM config like:
Code:
scsi0: ceph:vm-100-disk-0,cache=writeback,discard=on,size=100G
scsihw: virtio-scsi-pci

It is rather easily triggered, just generate lots of I/O activity inside the VM, e.g.:
Code:
fio --rw=randwrite --iodepth=32 --ioengine=libaio --bs=4k --numjobs=1 --filename=/root/test --size=50G --runtime=60 --group_reporting --name=test --direct=1

I am surprised that such a serious bug (freezing VMs, data loss!) hasn't caused more complaints here, especially considering this bug has been persisting for months already. Are we the only ones using a recent Proxmox version with Ceph storage?
There seems to be little interest from the Proxmox team. :(
@dietmar @tom are you going to fix this?
 

Thanks for the test case, I'm testing it now to see if it's faster than my dd loop.

The lack of response is bothering me as well, though my initial report was somewhat incomplete. It seems many are using Ceph with librbd instead of KRBD; I was not able to reproduce the issue with librbd.

I see in the bugzilla report you've figured out that discard is causing this problem, which would explain why using virtio disks seems stable: there's no discard with virtio in Proxmox 5.x. We've also not seen this on our Windows VMs, as they're using virtio disks.

I'll try and see if I can trigger the issue with Proxmox 6.0 and virtio + discard as well though I might not be able to do so today.
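To double check whether discard is actually exposed inside a guest before and after switching disk types, standard Linux tooling is enough; a sketch (sda/sda8 and /tmp match the layout from my first post):

Code:
# Non-zero DISC-GRAN / DISC-MAX values mean the virtual disk advertises discard support
lsblk --discard /dev/sda
# Trimming the affected mount by hand exercises the same discard path as ext4's discard option
fstrim -v /tmp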
 
Yeah, thanks. I really do hope for some response from the Proxmox team.

librbd is super, super slow compared to KRBD, so that is not a viable alternative.
 
I can confirm that with Proxmox VE 6.0, Ceph + KRBD and virtio disks with discard=on the issue remains.
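For reference, the virtio variant of the disk line I tested looks roughly like this (storage name, disk number and size are taken from the earlier example and may differ in your setup):

Code:
virtio0: ceph:vm-100-disk-0,cache=writeback,discard=on,size=100G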

So to recap (please correct me if I'm wrong): when using PVE 5.4 or 6.0 with Ceph and KRBD, turning on discard results in data loss in the VM regardless of the guest OS, and the problem appears to have been introduced with pve-qemu-kvm versions newer than 2.x.
 
We have also identified this as a problem since upgrading from kernel 5.3.18-3-pve to either 5.4.35-1-pve or 5.4.41-1-pve.

We operate 10 Ceph clusters and have the following setup on all of them:
  • Ceph Nautilus 14.2.9
  • Kernel RBD with the following features enabled on each and every single image since kernel 5.3.18:
    • rbd_default_features = 31
      • Layering (1)
      • Striping v2 (2)
      • Exclusive locking (4)
      • Object Map (8)
      • Fast diff (16)
    • Added object-map and fast-diff to each pre-existing image
  • We use discard on VirtIO SCSI attached disks to get space reclamation

Herewith a sample VM definition:
Code:
agent: 1
boot: cdn
bootdisk: scsi0
cores: 1
cpu: Broadwell
ide2: none,media=cdrom
memory: 4096
name: lair-onos2
net0: virtio=F2:01:26:16:71:BB,bridge=vmbr0,tag=61
numa: 1
onboot: 1
ostype: l26
protection: 1
scsi0: rbd_ssd:vm-136-disk-0,cache=writeback,discard=on,size=40G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=794216ab-9bb1-4a8b-919b-f71295a51bf3
sockets: 2
vmgenid: cc07e4e2-c3ca-4ad6-870a-8eeb3537a9a1


Sample commands to add object-map and enable fast-diff on existing images:
Code:
ceph config rm global rbd_default_features;
ceph config-key rm config/global/rbd_default_features;
ceph config set global rbd_default_features 31;
ceph config dump | grep rbd_default_features;
for pool in rbd_ssd rbd_hdd; do for image in `rbd ls $pool`; do rbd feature enable $pool/$image exclusive-lock; rbd feature enable $pool/$image object-map fast-diff; done; done;
for pool in rbd_ssd rbd_hdd; do for image in `rbd ls $pool`; do echo -en "$pool/$image\t: "; rbd info $pool/$image | grep -P -e '\sfeatures:'; done; done;
for pool in rbd_ssd rbd_hdd; do for image in `rbd ls $pool -l | grep -v '^NAME' | awk '{print $1}'`; do echo -ne "$pool/$image\t:"; rbd info $pool/$image | grep -e flags; done; done
for pool in rbd_ssd rbd_hdd; do for image in `rbd ls $pool -l | grep -v '^NAME' | awk '{print $1}'`; do [ `rbd info $pool/$image | grep -c 'object map invalid'` -gt 0 ] && rbd object-map rebuild $pool/$image; done; done;


The commonality that we've been able to identify is that these errors occur on guests which don't have proper sector alignment.

In Windows, run 'diskpart':
diskpart.png

If your offset is 512 bytes (sector 1), the partition starts too early:
diskpart2.png


Would be interested to know if this affects anyone whose partitions are properly aligned...
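For Linux guests the same check can be done from inside the guest with standard tools; a sketch (device and partition number are examples):

Code:
# Print partition start sectors; a start that is a multiple of 2048 is 1 MiB aligned
parted /dev/sda 'unit s print'
# parted can also answer the alignment question directly (here for partition 1)
parted /dev/sda align-check optimal 1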

This is very evident ever since booting the hypervisor with 5.4.35 or 5.4.41:
system_events_153.png
 
I believe changes in kernel 5.4 now time out or return errors, and subsequently log problems, where previously this went unnoticed. A relatively slow, disk-only sandbox cluster occasionally exhibits these errors even with valid partition alignment, but other systems have stopped logging these errors after we transferred data to sector-aligned partitions.

We carefully reviewed the output of the following script and worked on any guest where the start location of any partition wasn't evenly divisible by 2048 sectors (1 MiB (1024 * 1024) / 512 bytes per virtual sector = 2048 sectors).

This essentially prints the partition table, setting units as sectors, for all RBD images mapped on the host:
Code:
for f in `rbd showmapped | grep -v ^id | awk '{print $5}'`; do parted $f 'unit s print'; done


Herewith an example where one should ignore the 'Microsoft reserved partition', which aligns the data partition to a location that evenly divides by 1 MiB:
parted_sector_alignment.png

In this example the first disk (OS) starts at exactly 1 MiB (start sector * sector size / 1048576: 2048 * 512 / 1048576 = 1).
The data volume is partitioned using GPT and the data partition starts at 129 MiB (264192 * 512 / 1048576).
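To avoid eyeballing the output, the divisibility check itself can be scripted; a rough sketch over the same mapped devices (untested as written, and the awk column may need adjusting if your rbd version prints a namespace column in showmapped):

Code:
# Flag any partition whose start sector is not a multiple of 2048 (i.e. not 1 MiB aligned)
for dev in `rbd showmapped | grep -v ^id | awk '{print $5}'`; do
    parted -sm $dev 'unit s print' | grep -E '^[0-9]+:' | while IFS=: read num start rest; do
        start=${start%s}
        [ $((start % 2048)) -ne 0 ] && echo "$dev partition $num starts at sector $start (unaligned)"
    done
done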
 
