IO Errors in Proxmox VM

merlink

Jan 31, 2013
Hi all,

At the moment I'm testing whether Proxmox runs smoothly and could be integrated into our company.

But I get a strange error inside a Debian VM.

For testing we installed OpenMediaVault, which is a NAS distribution based on Debian.
So I created a VM with two virtual disks: the first is 10 GB, the second about 16 TB (a larger SAN provides this space).

This ran wonderfully, so I filled the NAS with media data for testing.

After 2 days and 1.8 TB of data uploaded via Samba, I got these errors in the Debian guest:

Jan 31 01:18:59 media kernel: [128592.136733] end_request: I/O error, dev vdb, sector 3878633888
Jan 31 01:18:59 media kernel: [128592.140719] Buffer I/O error on device vdb1, logical block 484828980
Jan 31 01:18:59 media kernel: [128592.140719] lost page write due to I/O error on vdb1
Jan 31 01:18:59 media kernel: [128592.140719] Buffer I/O error on device vdb1, logical block 484828981
Jan 31 01:18:59 media kernel: [128592.140719] lost page write due to I/O error on vdb1

A little later, this:

Jan 31 01:18:59 media kernel: [128592.140719] end_request: I/O error, dev vdb, sector 3878634896
Jan 31 01:18:59 media kernel: [128592.140719] end_request: I/O error, dev vdb, sector 3878635904
Jan 31 01:18:59 media kernel: [128592.140719] end_request: I/O error, dev vdb, sector 3878636912
Jan 31 01:18:59 media kernel: [128592.161932] end_request: I/O error, dev vdb, sector 3878636920
Jan 31 01:18:59 media kernel: [128592.161932] end_request: I/O error, dev vdb, sector 3878637928
Jan 31 01:18:59 media kernel: [128592.161932] end_request: I/O error, dev vdb, sector 3878638936

and it ends up with these messages:

Jan 31 01:19:00 media kernel: [128592.311082] end_request: I/O error, dev vdb, sector 3878868872
Jan 31 01:19:00 media kernel: [128592.351101] Aborting journal on device vdb1-8.
Jan 31 01:19:00 media kernel: [128592.352044] JBD2: Detected IO errors while flushing file data on vdb1-8
Jan 31 01:19:00 media kernel: [128592.352591] EXT4-fs error (device vdb1): ext4_journal_start_sb: Detected aborted journal
Jan 31 01:19:00 media kernel: [128592.352593] EXT4-fs (vdb1): Remounting filesystem read-only

So everything is dead.

On the Proxmox host, not a single log entry mentions any error.

For me this error is reproducible; I did it 3 times with SATA, SCSI and VirtIO drivers (standard config and a completely new VM install each time).

Last time with:
cache = directsync

and the VM option:
elevator = noop
This leads to: JBD: barrier-based sync failed on vda1-8 - disabling barriers
which shouldn't be a problem.
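
For reference, here is a minimal sketch of how a cache mode such as directsync can be set on an existing VirtIO disk from the Proxmox CLI (the VM ID 100 and the volume name are assumptions, not taken from this thread):

Code:
# Re-specify the existing disk with the desired cache mode.
# "100", "virtio1" and the volume name are placeholders; adjust to your VM.
qm set 100 --virtio1 local:100/vm-100-disk-2.raw,cache=directsync

# Verify the resulting configuration.
qm config 100 | grep virtio1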


At the moment (after a Proxmox reboot), when I start the VM I get this kernel log:

[ 7.821427] JBD: barrier-based sync failed on vda1-8 - disabling barriers
[ 8.955052] end_request: I/O error, dev vdb, sector 3867150344
[ 8.959047] Buffer I/O error on device vdb1, logical block 3867148296
[ 8.959047] lost page write due to I/O error on vdb1
[ 8.959047] Buffer I/O error on device vdb1, logical block 3867148297
[ 8.959047] lost page write due to I/O error on vdb1
[ 8.959047] Buffer I/O error on device vdb1, logical block 3867148298
[ 8.959047] lost page write due to I/O error on vdb1
[ 8.959047] Buffer I/O error on device vdb1, logical block 3867148299
[ 8.959047] lost page write due to I/O error on vdb1
[ 8.959047] Buffer I/O error on device vdb1, logical block 3867148300
[ 8.959047] lost page write due to I/O error on vdb1
[ 8.959047] Buffer I/O error on device vdb1, logical block 3867148301
[ 8.959047] lost page write due to I/O error on vdb1
[ 8.959047] Buffer I/O error on device vdb1, logical block 3867148302
[ 8.959047] lost page write due to I/O error on vdb1
[ 8.959047] Buffer I/O error on device vdb1, logical block 3867148303
[ 8.959047] lost page write due to I/O error on vdb1
[ 8.959047] Buffer I/O error on device vdb1, logical block 3867148304
[ 8.959047] lost page write due to I/O error on vdb1
[ 8.959047] Buffer I/O error on device vdb1, logical block 3867148305
[ 8.959047] lost page write due to I/O error on vdb1
[ 8.959047] end_request: I/O error, dev vdb, sector 3870951424
[ 8.959047] end_request: I/O error, dev vdb, sector 3871344648
[ 8.959047] end_request: I/O error, dev vdb, sector 3875538952
[ 10.613150] EXT4-fs (vdb1): mounted filesystem with ordered data mode


Any ideas?

Thank you.
 
Are you using software RAID?

It's been reported that with some Debian-based systems using software RAID you can get the message JBD: barrier-based sync failed on vda1-8. A way to stop these messages is to edit grub.conf or menu.lst and pass the following parameter to the kernel: barrier=off
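
As a rough sketch of what that can look like (the exact file depends on the distribution and GRUB version; the per-filesystem mount option barrier=0 in /etc/fstab is an alternative I am adding here, not part of the original suggestion):

Code:
# GRUB legacy: append the parameter to the kernel line in grub.conf / menu.lst,
# e.g.
#   kernel /vmlinuz-2.6.32-5-amd64 root=/dev/vda1 ro quiet barrier=off
#
# Alternative (my assumption, not from the post above): disable barriers per
# filesystem via the mount options in /etc/fstab, e.g. for an ext4 volume:
#   /dev/vdb1  /media  ext4  defaults,barrier=0  0  2
#
# Remount (or reboot) for the fstab change to take effect:
mount -o remount /media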

This only masks the potential real problem, though. Based on the remainder of the messages you posted, it's very likely you have a hardware problem of some type. Make sure you have good, verifiable backups.


Typically, when you see messages such as "lost page write" and "Buffer I/O error", it's a hardware problem.
The problem can be the drive (most common), RAM, and lastly the motherboard and/or related firmware.
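
If you want to rule out the drive first, here is a quick sketch of checking SMART health from the host (device names are assumptions; behind a hardware RAID controller you may need the vendor's own tools or smartctl's -d option instead):

Code:
# Install smartmontools if it is not already present (Debian-based host assumed).
apt-get install smartmontools

# Overall health verdict and full SMART report for a drive.
smartctl -H /dev/sda
smartctl -a /dev/sda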

If you google the "lost page" message you will find a number of postings discussing the same error messages, pertaining to a failed backup in Proxmox. Have a look at this one: http://forum.proxmox.com/threads/6544-FS-problems-on-proxmox-during-the-vzdump
I would recommend hardware RAID, since there are fewer interactions with the operating system and a good controller will tell you which drive is failing.
 
Hi all,

...

For testing we installed OpenMediaVault, which is a NAS distribution based on Debian.
So I created a VM with two virtual disks: the first is 10 GB, the second about 16 TB (a larger SAN provides this space).

This ran wonderfully, so I filled the NAS with media data for testing.

...

[ 7.821427] JBD: barrier-based sync failed on vda1-8 - disabling barriers
[ 8.955052] end_request: I/O error, dev vdb, sector 3867150344
[ 8.959047] Buffer I/O error on device vdb1, logical block 3867148296
[ 8.959047] lost page write due to I/O error on vdb1
...
[ 8.959047] end_request: I/O error, dev vdb, sector 3875538952
[ 10.613150] EXT4-fs (vdb1): mounted filesystem with ordered data mode


Any ideas?

Thank you.

Hi,
this looks like an issue on the NAS side. But how do you use the NAS? Can you post the storage.cfg and the VM config?
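
For reference, those files normally live here on the Proxmox host (the VM ID 100 is an assumption):

Code:
# Storage definitions used by Proxmox.
cat /etc/pve/storage.cfg

# Per-VM configuration (replace 100 with the actual VM ID).
cat /etc/pve/qemu-server/100.conf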

Any error messages on the OpenMediaVault side?

Is OpenMediaVault only for testing? Then perhaps you could also do the same test with openATTIC (also Debian-based, but aimed more at company use rather than home use like OpenMediaVault).

Udo
 
Hi all,

So I created a VM with two virtual disks: the first is 10 GB, the second about 16 TB (a larger SAN provides this space).

This ran wonderfully, so I filled the NAS with media data for testing.

After 2 days and 1.8 TB of data uploaded via Samba, I got these errors in the Debian guest:

I have a similar issue here:

Host:
Proxmox 2.3-13 (the next step in our approach to solving the problem should be updating Proxmox, I think)
16 TB HDD -> hardware RAID 5

Guest:
MS Windows 2008 R2

HDD (ide0) local, raw 100GB C:
HDD (ide1) local, raw 100GB E:
HDD (ide3) local, qcow2 12TB R:

The guest has been running for about a year without any problems; now HDD3 is getting bigger and bigger,
and at the moment we are at ~2 TB and are getting an I/O error (0x8007045D) when trying to copy files.

It doesn't happen every time, but copying large amounts of data will lead to this error.
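
One low-risk check I would add here (my suggestion; the image path is an assumption, and the VM should be stopped while running it) is to verify the qcow2 image for internal corruption:

Code:
# Check the 12 TB qcow2 image for corruption and leaked clusters.
# The path is a placeholder; use the one from your VM configuration.
qemu-img check /var/lib/vz/images/<vmid>/vm-<vmid>-disk-3.qcow2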

If anybody has an idea - let me know!

Best regards
 
I have had a similar issue. Here's my experience (unfortunately I don't have access to the machine right now, so I'm writing this from memory):

The setup is: Proxmox 3.2 on Debian Wheezy (I had issues with the Proxmox installer ISO booting off UEFI). Dell PERC H310 controller, 2 x 15k rpm SAS HDDs in RAID1, 3 x 7.2k SAS HDDs in RAID5. The CPU is a Sandy Bridge Xeon, with all virtualization features enabled in UEFI (including VT-d and SR-IOV, but I don't pass through any PCI hardware, nor do the Broadcom NICs on the Dell support SR-IOV, AFAIK).

At first, I tried passing the entire RAID5 to the guest (again, an OpenMediaVault installation) via virtio-scsi-pci (as scsi-block). The OpenMediaVault (OMV) kernel (the one provided by Debian Squeeze stable) seemed not to support virtio-scsi, so I installed 3.2.0 from squeeze-backports. I think 3.2 was before virtio-scsi went mainline, but for some reason it worked, and the virtual disk (as configured in the Dell PERC), /dev/sdb on the host, magically appeared as /dev/sda in the guest. I created a VG which contained only sda as a PV, and then created a single LV taking up all space within the VG. I went on to format the LV using ext4, which was painfully slow (moreover, guest AND host became mostly unresponsive while formatting; I got timeouts in the web GUI, SSH sessions stalling, etc.), which I tracked down to cache=writethrough being set for this disk in Proxmox. After I set it to none, formatting went really fast, but this is when I started getting the errors reported in the OP:

kernel: [128592.140719] end_request: I/O error, dev sda, sector 3878634896

at a rate of tens per second. I didn't have time to look for other errors, however, so I cannot really say whether any Buffer I/O errors or "lost page write" errors were reported. But this end_request message was printed so abundantly on the console that it was hard to miss.

I was alarmed, so I thought I would go the safe way and use a more conservative approach: I downgraded the OMV kernel to 2.6.32-5, deleted the LV, and let Proxmox handle the VG. I added a disk to the guest (OMV) via the GUI and set it to 3.0 TB. I then rebooted the guest (with the original kernel) and went on to format /dev/vdb1 (which was the partition OMV created on /dev/vdb, which on the host is /dev/VG/LV).
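
For completeness, a sketch of the CLI equivalent of adding such a disk (the VM ID, storage name, and slot are assumptions; I used the GUI):

Code:
# Allocate a new 3000 GB VirtIO disk for the VM on an LVM-backed storage.
# "101" and "datastore" are placeholders for the VM ID and storage name.
qm set 101 --virtio1 datastore:3000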

This time things were much worse: both guest and host stalled completely. SSH sessions to these machines died (on the client side), both the Proxmox and OMV web interfaces got timeouts and wouldn't recover, and even the iDRAC console refused to accept any input (which, in my book, means keypresses on a physical keyboard would not have been registered either). During this phase the HDD LEDs were flashing; apparently metadata was being written to the RAID5 array (OMV's version of e2fsprogs doesn't support lazy initialization).

In the end, I managed to restart the host OS by issuing several "graceful shutdown" requests on the iDRAC console. After a couple of minutes Proxmox registered the requests as "power button pressed" events and rebooted. I then went on to format the LV from the host, which went reasonably fast without any stalling (I used kpartx on the LV to expose the partition). I then booted the guest, and the filesystem was initialized. I went on to create shares and started copying files into them. While copying, the GUI's responsiveness did get affected a little this time as well, but only in the sense that it "felt" slower and not as snappy (the source files are located on slow media, so I'm guessing this copying process does not really stress the host-to-guest storage backend, in contrast to the gigabytes of zeros being written during ext4 formatting).
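
A rough sketch of that host-side formatting step (the VG and LV names are placeholders):

Code:
# Expose the partition(s) inside the LV as device-mapper nodes;
# kpartx prints the names of the mappings it creates.
kpartx -av /dev/VG/LV

# Format the exposed partition with ext4, using the exact mapping name
# printed above (e.g. /dev/mapper/VG-LV1, depending on the kpartx version).
mkfs.ext4 /dev/mapper/VG-LV1

# Remove the mappings again before handing the LV back to the guest.
kpartx -dv /dev/VG/LV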

It would be greatly appreciated if anyone could post similar findings and whether they have found any ways to mitigate the situation. In contrast to other posts, no iSCSI targets were used in my case (I deduced from the references to SAN devices that others may have been using them).

Best regards,
George
 
George (and OP): here's my setup in a nutshell. It has been very stable since setup (about two weeks ago) and is being used constantly by a CentOS VM.

I have configured three 2 TB drives in RAID5 on my M5016 ServeRAID controller in write-back mode (with BBU). Then I created a PV on the resulting volume at the host level (in Proxmox):

Code:
root@proxmox:~# pvdisplay
  --- Physical volume ---
  PV Name               /dev/sdb
  VG Name               datastore
  PV Size               3.64 TiB / not usable 4.00 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              953196
  Free PE               0
  Allocated PE          953196
  PV UUID               CHxPDo-GGDk-3FIc-r3mx-KqUo-UAKI-5nbZNX

and then created a VG using the entire PV

Code:
root@proxmox:~# vgdisplay
  --- Volume group ---
  VG Name               datastore
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  13
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               3.64 TiB
  PE Size               4.00 MiB
  Total PE              953196
  Alloc PE / Size       953196 / 3.64 TiB
  Free  PE / Size       0 / 0   
  VG UUID               bCMIXo-MXPD-rnD5-Ti76-DbhK-6ULM-u57u82
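
The commands that produce a layout like this are roughly the following (a sketch; /dev/sdb and the VG name match the output above):

Code:
# Initialize the RAID volume as an LVM physical volume.
pvcreate /dev/sdb

# Create a volume group spanning that single PV.
vgcreate datastore /dev/sdb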


Finally, I created a logical volume of 300 GB, mounted it at the host level under /mnt/backups, and added this as a backup destination in Proxmox's GUI. The other LV (the leftover 3.34 TB) was passed to the CentOS VM raw (unformatted) via its VM config file.
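
In command form this is roughly the following (a sketch; the LV names match the lvdisplay output below, while the virtio slot and VM ID are assumptions):

Code:
# Create the two logical volumes inside the "datastore" VG.
lvcreate -L 300G -n backups datastore
lvcreate -l 100%FREE -n storage datastore

# Format and mount the backup LV on the host.
mkfs.ext4 /dev/datastore/backups
mkdir -p /mnt/backups
mount /dev/datastore/backups /mnt/backups

# Pass the large LV to the guest as a raw block device by adding a line
# like this to /etc/pve/qemu-server/<vmid>.conf (slot and VM ID assumed):
#   virtio2: /dev/datastore/storage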

CentOS sees the LV as a standard disk (/dev/vdc); I created an ext4 FS on it, mounted it in the guest, and started using it.

Code:
root@proxmox:~# lvdisplay
  --- Logical volume ---
  LV Path                /dev/datastore/backups
  LV Name                backups
  VG Name                datastore
  LV UUID                bK6Bnl-6fzv-rZwB-CW2s-X1w6-F2r3-GkMvVO
  LV Write Access        read/write
  LV Creation host, time proxmox, 2014-04-12 20:23:10 -0400
  LV Status              available
  # open                 1
  LV Size                300.00 GiB
  Current LE             76800
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:3
   
  --- Logical volume ---
  LV Path                /dev/datastore/storage
  LV Name                storage
  VG Name                datastore
  LV UUID                b5XadK-ZXBl-gWWL-y0UY-QWVG-CRxm-w3Hp2p
  LV Write Access        read/write
  LV Creation host, time proxmox, 2014-04-12 20:27:27 -0400
  LV Status              available
  # open                 1
  LV Size                3.34 TiB
  Current LE             876396
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:4
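
On the guest side, the steps described above amount to something like this (a sketch; /dev/vdc is from the description, the mount point is an assumption):

Code:
# Inside the CentOS guest: create an ext4 filesystem on the passed-through LV.
mkfs.ext4 /dev/vdc

# Mount it (mount point assumed) and make it persistent via fstab.
mkdir -p /srv/storage
mount /dev/vdc /srv/storage
echo '/dev/vdc /srv/storage ext4 defaults 0 2' >> /etc/fstab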

I haven't seen any I/O errors like you guys have, and I haven't had any performance or other stability problems. I'm not sure whether I used the 3.64 TB RAID5 array the best way I could, but I couldn't get community feedback, so I went forward with this strategy before my VMs start corrupting SQL data due to storage issues.
 
