lost page write due to I/O error

Cayuga

May 3, 2011
I installed Proxmox a couple of weeks ago and am currently running a three-node cluster. Some of my Linux guests (kernel 2.6.32) are logging the following errors:

[ 1491.167866] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 1491.167869] sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
[ 1491.167872] Descriptor sense data with sense descriptors (in hex):
[ 1491.167874] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[ 1491.167886] 01 af 96 40
[ 1491.167889] sd 0:0:0:0: [sda] Add. Sense: No additional sense information
[ 1491.167893] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 01 af 96 40 00 00 08 00
[ 1491.167900] end_request: I/O error, dev sda, sector 28284480
[ 1491.167905] Buffer I/O error on device dm-0, logical block 3410328
[ 1491.167906] lost page write due to I/O error on dm-0
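
For anyone reading a trace like this, the CDB line can be decoded by hand to confirm which write failed. Here is a minimal Python sketch, assuming the standard WRITE(10) layout (opcode 0x2a, a 4-byte big-endian LBA in bytes 2-5, a 2-byte transfer length in bytes 7-8). The function name `decode_write10` is made up for illustration and is not part of any tool.

```python
import struct

def decode_write10(cdb_hex: str) -> dict:
    """Decode a SCSI WRITE(10) CDB given as a hex string (spaces allowed)."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    # Layout: opcode, flags, LBA (4 bytes, big-endian), group number,
    # transfer length in blocks (2 bytes, big-endian), control byte.
    _opcode, flags, lba, _group, length, _control = struct.unpack(">BBIBHB", cdb)
    return {"lba": lba, "blocks": length, "fua": bool(flags & 0x08)}

# CDB copied from the guest kernel log above.
info = decode_write10("2a 00 01 af 96 40 00 00 08 00")
print(info)  # {'lba': 28284480, 'blocks': 8, 'fua': False}
```

The decoded LBA matches the "sector 28284480" in the end_request line. If the guest disk uses 512-byte sectors (an assumption, the log does not say), 8 blocks is 4096 bytes, which fits the single "lost page write" reported on dm-0.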

I've tried with and without virtio and it makes no difference.

The underlying storage is iSCSI if that matters.

Any ideas about what I should look for or what I might be able to change to make these errors go away?
 
Dietmar,

Thanks for asking.

The only logs that were modified when the error last happened were: /var/log/daemon.log /var/log/syslog /var/log/auth.log
and none of them showed any errors.

I did see the following entries that don't have a time correlation to the guest errors:
May 2 10:07:11 danish kernel: scsi 2:0:0:0: Direct-Access OPNFILER VIRTUAL-DISK 0 PQ: 0 ANSI: 4
May 2 10:07:11 danish kernel: sd 2:0:0:0: Attached scsi generic sg2 type 0
May 2 10:07:11 danish kernel: sd 2:0:0:0: [sdc] 409600000 512-byte logical blocks: (209 GB/195 GiB)
May 2 10:07:11 danish kernel: sd 2:0:0:0: [sdc] Write Protect is off
May 2 10:07:11 danish kernel: sd 2:0:0:0: [sdc] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
May 2 10:07:11 danish kernel: sdc: unknown partition table
May 2 10:07:11 danish kernel: sd 2:0:0:0: [sdc] Attached SCSI disk
May 3 03:45:32 danish kernel: connection2:0: detected conn error (1011)
May 3 07:39:19 danish kernel: vmbr0: port 8(tap126i0d0) entering disabled state
May 3 07:39:19 danish kernel: vmbr0: port 8(tap126i0d0) entering disabled state
May 3 07:39:30 danish kernel: device tap126i0d0 entered promiscuous mode
May 3 07:39:30 danish kernel: vmbr0: port 8(tap126i0d0) entering forwarding state
May 3 07:47:41 danish kernel: vmbr0: port 7(tap125i0d0) entering disabled state
May 3 07:47:41 danish kernel: vmbr0: port 7(tap125i0d0) entering disabled state
May 3 07:48:03 danish kernel: device tap125i0d0 entered promiscuous mode
May 3 07:48:03 danish kernel: vmbr0: port 7(tap125i0d0) entering forwarding state
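
To line host events up against the guest errors, the iSCSI connection errors can be pulled out of the host log with their timestamps. This is only a sketch: the regular expression is written against the "detected conn error" line shown above, the helper name `find_conn_errors` is hypothetical, and the year (2011) is an assumption because syslog lines of this form do not carry one. Month names are parsed with `%b`, so it expects an English locale.

```python
import re
from datetime import datetime

# Matches lines such as:
# "May 3 03:45:32 danish kernel: connection2:0: detected conn error (1011)"
CONN_ERROR = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) kernel: "
    r"(connection\d+:\d+): detected conn error \((\d+)\)"
)

def find_conn_errors(lines, year):
    """Return (timestamp, host, connection, error code) for each match."""
    events = []
    for line in lines:
        m = CONN_ERROR.match(line)
        if m:
            stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            events.append((stamp, m.group(2), m.group(3), int(m.group(4))))
    return events

# Three lines copied from the host log above.
sample = [
    "May 2 10:07:11 danish kernel: sd 2:0:0:0: [sdc] Attached SCSI disk",
    "May 3 03:45:32 danish kernel: connection2:0: detected conn error (1011)",
    "May 3 07:39:19 danish kernel: vmbr0: port 8(tap126i0d0) entering disabled state",
]
events = find_conn_errors(sample, 2011)
for stamp, host, conn, code in events:
    print(stamp.isoformat(), host, conn, code)
```

In practice the lines would come from a file such as /var/log/syslog, and the resulting timestamps could be compared with the wall-clock time of the guest errors.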

Jeff
 
Maybe, but 2.6.32 is pretty new. I have older and newer guests that don't seem to exhibit this problem and most importantly, it makes me very nervous... I want to be able to run guests without worrying. I'm still assuming that I did something wrong and just want to fix it.

FYI - this was a VMware guest that had been running for a while. It started out life as a VMware LAMP appliance -- http://www.turnkeylinux.org/lampstack

Jeff
 
Maybe, but 2.6.32 is pretty new. I have older and newer guests that don't seem to exhibit this problem

Why don't you try the kernel from a working guest? That way you can check whether the problem is related to the guest kernel.
 
Dietmar,

Thanks for the suggestion. I opted instead (inspired by your suggestion) to update (via apt-get) to the latest kernel for the two offending machines. No errors in the first six hours!!! :-)

I'll post again later with another update.

Thanks again!

Jeff
 
Dietmar,

I have four guest machines that have had a kernel update applied. They have been running for between 4 and 18 hours without any I/O errors. If we can make it until Monday, I'll declare victory.

Thanks again for the excellent suggestion.

Jeff
 
I have four machines that have been up for 3+ days without any I/O errors. Time to declare victory :-)

Thanks again