ext3 I/O error

JustaGuy

This morning I got a few of these in syslog:

Aug 4 06:25:02 bascule kernel: Buffer I/O error on device dm-3, logical block 0
Aug 4 06:25:02 bascule kernel: lost page write due to I/O error on dm-3
Aug 4 06:25:02 bascule kernel: EXT3-fs error (device dm-3): ext3_get_inode_loc: unable to read inode block - inode=39288833, block=157155330

What's this dm-3 device?
 
Hi,
this can happen when a volume doesn't exist anymore.
Try a
Code:
pvscan
Are there any errors?
To find the right device use
Code:
ls -lsa /dev/dm-3
0 brw-rw---- 1 root disk 251, 3 15. Jul 09:35 /dev/dm-3
in this case you need the storage with major 251 and minor 3
Code:
dmsetup info
shows all dm devices - look for the right major:minor number.
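A quick way to see the whole mapping at once (just a sketch - the exact output format depends on the dmsetup version) is to let dmsetup list every device together with its major:minor pair:
Code:
# each line shows a device-mapper name followed by its (major:minor) pair
dmsetup ls
The entry whose numbers match /dev/dm-3 is the logical volume that produced the error.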

Udo
 
Thanks, Udo.

pvscan looked normal - 1 big ol' disk.

ls -lsa says /dev/dm-3 apparently doesn't exist.

And the dmsetup major/minor numbers don't go as high as 3.

----

I rebooted the server yesterday, and afterward saw no errors in syslog or dmesg.
It doesn't make sense to me.

I found these errors after starting to investigate why my backups weren't happening; I imagine that's a topic for another thread, though.
 
This error's back, and this time there are clues in the output of the commands that were suggested, though I'm unsure what to make of them.

There's a backup that's taking forever and I imagine this has something to do with it.

htop shows the vmtar process working on a .raw file in the /mnt/vzsnap0/images/204 directory, which appears to be associated with this erroring dm-3 device.
Except /mnt/vzsnap0 looks empty.
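To double-check whether /mnt/vzsnap0 is really mounted and what is still holding it open, something like this should tell me (a sketch - fuser comes from the psmisc package and may not be installed here):
Code:
# is the snapshot mount present, and on which device?
grep vzsnap /proc/mounts
# which processes still have files open underneath it?
fuser -vm /mnt/vzsnap0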

Code:
bascule:~# pvscan
  /dev/dm-3: read failed after 0 of 4096 at 1866700095488: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 1866700152832: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 4096: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  PV /dev/sda2   VG pve   lvm2 [1.82 TB / 2.99 GB free]
  Total: 1 [1.82 TB] / in use: 1 [1.82 TB] / in no VG: 0 [0   ]
Code:
bascule:~# ls -lsa /dev/dm-3
0 brw-rw---- 1 root disk 251, 3 Aug  7 00:58 /dev/dm-3
bascule:~# dmsetup info
Name:              pve-vzsnap--bascule--0-cow
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      251, 5
Number of targets: 1
UUID: LVM-G5c2Q5ourJn8WrjMXEy2d6Nsm3MBcdVJg6cdy93hdABFhqQUq6uLbmOGhbw9b2qd-cow

Name:              pve-data-real
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        2
Event number:      0
Major, minor:      251, 4
Number of targets: 1
UUID: LVM-G5c2Q5ourJn8WrjMXEy2d6Nsm3MBcdVJvHRMLL5YyqpVhr2cUuVw9eUEKalClmQK-real

Name:              pve-swap
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      251, 0
Number of targets: 1
UUID: LVM-G5c2Q5ourJn8WrjMXEy2d6Nsm3MBcdVJpNN0r1xayU1ccYwuFnNW1oR1f2ELyVxB

Name:              pve-root
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      251, 1
Number of targets: 1
UUID: LVM-G5c2Q5ourJn8WrjMXEy2d6Nsm3MBcdVJA3dIF6iJRU02poyL6kHiMNlOYfIu16FO

Name:              pve-data
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      251, 2
Number of targets: 1
UUID: LVM-G5c2Q5ourJn8WrjMXEy2d6Nsm3MBcdVJvHRMLL5YyqpVhr2cUuVw9eUEKalClmQK

Name:              pve-vzsnap--bascule--0
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      1
Major, minor:      251, 3
Number of targets: 1
UUID: LVM-G5c2Q5ourJn8WrjMXEy2d6Nsm3MBcdVJg6cdy93hdABFhqQUq6uLbmOGhbw9b2qd
Code:
bascule:~# lvscan
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  ACTIVE            '/dev/pve/swap' [19.00 GB] inherit
  ACTIVE            '/dev/pve/root' [100.00 GB] inherit
  inactive Original '/dev/pve/data' [1.70 TB] inherit
  inactive Snapshot '/dev/pve/vzsnap-bascule-0' [1.00 GB] inherit
Code:
bascule:~# vgscan
  Reading all physical volumes.  This may take a while...
  /dev/dm-3: read failed after 0 of 4096 at 1866700095488: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 1866700152832: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 4096: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  Found volume group "pve" using metadata type lvm2
bascule:~#
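Since dm-3 turns out to be the 1 GB vzsnap snapshot and lvscan now lists it as inactive, I wonder whether its copy-on-write space simply filled up during the long backup. A quick way to watch for that (a sketch - it assumes this LVM version supports the snap_percent reporting field) would be:
Code:
# list all LVs, including hidden ones, with how full each snapshot's COW area is
lvs -a -o +snap_percent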
 
I think this one 32GB machine that was erroring might be why the others never get a chance to finish. Here's the report from my email:
Code:
204: Aug 07 00:58:34 INFO: Starting Backup of VM 204 (qemu)
204: Aug 07 00:58:34 INFO: running
204: Aug 07 00:58:34 INFO: status = running
204: Aug 07 00:58:35 INFO: backup mode: snapshot
204: Aug 07 00:58:35 INFO: bandwidth limit: 100000 KB/s
204: Aug 07 00:58:35 INFO:   Logical volume "vzsnap-bascule-0" created
204: Aug 07 00:58:35 INFO: creating archive '/mnt/pve/Datastore11_vzdump/vzdump-qemu-204-2010_08_07-00_58_34.tgz'
204: Aug 07 00:58:35 INFO: adding '/mnt/pve/Datastore11_vzdump/vzdump-qemu-204-2010_08_07-00_58_34.tmp/qemu-server.conf' to archive ('qemu-server.conf')
204: Aug 07 00:58:35 INFO: adding '/mnt/vzsnap0/images/204/vm-204-disk-1.raw' to archive ('vm-disk-ide0.raw')
204: Aug 08 00:55:37 INFO: received signal - terminate process
204: Aug 08 00:55:37 INFO: archive file size: 0KB
204: Aug 08 00:55:38 INFO:   /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
204: Aug 08 00:55:38 INFO:   Logical volume "vzsnap-bascule-0" successfully removed
204: Aug 08 00:55:38 INFO: Finished Backup of VM 204 (23:57:04)

I SIGKILLed it, as well as the others queued after it.
Will revisit this one later & post what happens.

Any ideas on the I/O error?
How does that just start happening?

They're 6x 1TB 7200rpm SAS drives in RAID10 w/ 2 hot spares, on a Dell PERC 6/i controller.
 
is there anything in the RAID controller log?

AFAIK there's no way to know without going in through BIOS setup.
I'll reboot later & see.

Dell produced a Red Hat package that provides that function through their OpenManage software, and there is an alien'd .deb of it floating around the internet, but it wouldn't install for some reason.

If anyone's managed to install Dell OpenManage in pve, I'd be interested in hearing about how you overcame the obstacles that shut me down.
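In the meantime, a possible way to peek at the drives and the controller's event log from inside the OS without OpenManage (a sketch - it assumes smartmontools and LSI's MegaCli tool can be installed on the host, and the drive/adapter numbers are only examples):
Code:
# SMART data for the first drive behind the PERC (LSI MegaRAID) controller
smartctl -a -d megaraid,0 /dev/sda
# dump the controller event log to a file (binary may be named MegaCli64 or megacli)
MegaCli -AdpEventLog -GetEvents -f perc-events.log -aALL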
 

Have you tried these deb packages: https://subtrac.sara.nl/oss/omsa_2_deb ?
I installed dellomsa 5.5 without problems, but that was over a year ago. I remember that the version 6 packages had some problems then, but things may have changed since.
 

Yeah, that's the address I remember.

All the instructions I found led me to a dead end somehow.
There would be some step I had to do on the computer that wasn't mentioned in the instructions.
I don't recall what it was; it was about a year ago.

Right now I don't have the time for stuff like that when it won't work the first time.
...And now these dang disks demanding time & attention I can't afford anyway... geesh!
Maybe someday.
 

Tonight's backup of VM 204 seems normal.
Here's the previously problematic machine's report:

Code:
204: Aug 09 00:55:30 INFO: Starting Backup of VM 204 (qemu)
204: Aug 09 00:55:31 INFO: stopped
204: Aug 09 00:55:31 INFO: status = stopped
204: Aug 09 00:55:31 INFO: backup mode: stop
204: Aug 09 00:55:31 INFO: bandwidth limit: 100000 KB/s
204: Aug 09 00:55:31 INFO: creating archive '/mnt/pve/Datastore11_vzdump/vzdump-qemu-204-2010_08_09-00_55_30.tar'
204: Aug 09 00:55:31 INFO: adding '/mnt/pve/Datastore11_vzdump/vzdump-qemu-204-2010_08_09-00_55_30.tmp/qemu-server.conf' to archive ('qemu-server.conf')
204: Aug 09 00:55:31 INFO: adding '/var/lib/vz/images/204/vm-204-disk-1.raw' to archive ('vm-disk-ide0.raw')
204: Aug 09 01:24:48 INFO: Total bytes written: 33724988416 (18.31 MiB/s)
204: Aug 09 01:25:44 INFO: archive file size: 31.41GB
204: Aug 09 01:25:44 INFO: Finished Backup of VM 204 (00:30:14)

Too bad I can't OpenManage into the RAID card's log. Soon.
 
There's no log facility in the PERC 6/i Integrated BIOS Configuration Utility 1.22.02-0612.
All it can tell me is there aren't any S.M.A.R.T. errors on any of the disks.
Running a consistency check right now, for what it's worth.

Looks like it's going to be a long night.

[Attachment: Consistency_Check.jpg]

**EDIT
Four hours later the consistency check finished without reporting anything, so I'm left to assume it found all to be well.

It just so happens that the following boot was un-fsck'd boot #22, so pve-root got fsck'd, and that also finished without any problems.
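For the record, the every-N-mounts fsck behaviour can be checked and adjusted with tune2fs - a sketch, using pve-root as the example device:
Code:
# show the current mount count and the maximum before a forced fsck
tune2fs -l /dev/pve/root | grep -i 'mount count'
# example: raise the limit to 40 mounts (-1 would disable the forced check)
tune2fs -c 40 /dev/pve/root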
 
This usually indicates some disk failures - is there anything in the RAID controller log?

This log is huge. 6576 lines.

Dell tech reviewing now.


I learned these drives have old firmware, and that firmware changelogs generally address I/O errors such as these.
Since the logs were clean - i.e. no obvious errors - that would imply the controller isn't aware of the problem, if indeed the problem is in the disks.

I'm also told that write-back and full-on, non-adaptive read-ahead aren't always good, depending on the software doing the I/O. For example, I gather SQL databases don't want to deal with the risk of errors from non-ECC cache memory.

Do the pve devs have a position on the usage of readahead & write-back?
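For what it's worth, the OS-level read-ahead on the array can at least be inspected and tuned independently of the controller's cache policy (a sketch - /dev/sda is the disk underneath the PV from pvscan, and the values are in 512-byte sectors):
Code:
# current read-ahead for the whole array, in 512-byte sectors
blockdev --getra /dev/sda
# example: set it to 256 sectors (128 KB) and watch the effect on backups
blockdev --setra 256 /dev/sda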


Another thing is that there's a feature I wasn't aware of called NCQ, which isn't supported by my current firmware and might suddenly become enabled with the update I'll be looking into.
It seems like a lot - .20 versions between my currently installed f/w & the currently released f/w.

I know I won't be mixing drive brands again if I can help it, since now I have to keep up with both Hitachi & Seagate firmware releases.
 
I updated the drives from CC34 to CC46 and it had no effect on the problem.
I still have this in syslog:

Code:
Aug 16 06:25:07 bascule kernel: EXT3-fs error (device dm-3): ext3_find_entry: reading directory #2 offset 0
Aug 16 06:25:07 bascule kernel: Buffer I/O error on device dm-3, logical block 0
Aug 16 06:25:07 bascule kernel: lost page write due to I/O error on dm-3
 
After a long, slow process of reconstruction, which included numerous upgrades and a huge bag of zip ties, I can finally consider this resolved.
Now 3 backups in a row have completed without issue.

Not only was the firmware upgraded on 4 of the 6 drives, but an upgraded version of PVE was installed fresh.
The NFS server that had been in use at the time has since died and was replaced as well.
Also, the SAN link was upgraded from 1 Gb to 20 Gb.
 
Hi,
what kind of SAN do you use?
Ultimately storage will be tiered 7 ways, on & off-line.
This has only just this week reached a barely functional stage.

So far there are 2 PowerEdge servers: PVE on the 2950 and Debian Squeeze running nfs-kernel-server on the R210.
High-speed connectivity is via 2 SFP+ cables run directly between the 2 servers. Each is equipped with a dual-port Intel 82598 10Gb NIC bonded in balance-rr mode.
All these are Dell parts.

Low-speed access to the R210 is through a Netgear GS105 unmanaged Gb switch, connected via 2 Cat6 cables to a dual-port Broadcom NetXtremeII BCM5716 1Gb NIC bonded in balance-rr mode.
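For anyone curious, a balance-rr bond on Debian is just a few lines in /etc/network/interfaces - roughly like this sketch (interface names and addresses are made up, and the option spelling varies between ifenslave versions):
Code:
# /etc/network/interfaces (fragment)
auto bond0
iface bond0 inet static
    address 10.10.10.1
    netmask 255.255.255.0
    slaves eth2 eth3
    bond_mode balance-rr
    bond_miimon 100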

On the R210 there are 3 speeds available:

25GB on a Kingston SSDNow E-Series via SATA II
Dual 2TB Seagate Barracuda XT's in software RAID1 via SATA III
Six 1TB virtual drives provided by a Drobo-S full of 2TB SATA II drives, via an eSATA (III) cable.


It's barely functional, and nothing's working properly yet.
One VM's database is supposed to be mounted on that SSD, but it doesn't work unless I install the entire server on a raw image living on it, provided through PVE's NFS storage.
The connections are supposed to be brokered by Vyatta in a VM, and PVE's still using the 1Gb connection to reach the switch that has the 2Gb connection to the server.
There's still a lot to sort out.

This is because, among other things, I'm apparently slow to understand routing, and the learning curve's a hell of a thing.
Plus the Drobo is awfully slow. I suspect a faulty cable; replacing it and monitoring the effect is on the to-do list.
Currently I'm only able to achieve 100 GB/hour during backups.

This in itself is an enormous improvement from how it once was, when I had NFS over a 1Gb SheevaPlug with an eSATA drive.
What once took almost 18 hours now completes in less than 6, and it's only the first draft.

The project was discussed in this thread over the summer; I've learned a lot since.
I hope to have the 20Gb link sorted by Friday, but that's what I said last week too - lol.
iSCSI with a Myricom NIC, or Dolphin? What is your experience with that (throughput)?

Udo

I don't know what any of that is - sounds like hardware?
What I do know is that I wasted some time developing a sane naming convention back when I was considering iSCSI.

The decision to go with NFS came about when I learned an ext4 volume can be converted to btrfs once that's ready.
In order to make use of its deduplication, it made more sense to have the server rather than the client manage the filesystem.
Therefore, with the exception of the Drobo, whose sparse provisioning only understands ext3, all the storage volumes are btrfs-ready ext4.
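The conversion itself would presumably be done with the stock btrfs-convert tool once btrfs is considered stable - something like this sketch (the device name is made up, and the filesystem has to be unmounted and clean first):
Code:
# make sure the ext4 filesystem is clean, then convert it in place
fsck.ext4 -f /dev/sdb1
btrfs-convert /dev/sdb1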
 
What is your experience with that (throughput)?

Udo

ping:
Code:
rtt min/avg/max/mdev = 0.064/0.079/0.106/0.017 ms
First of 3 simultaneous qmrestores to complete:
Code:
INFO: 4294967296 bytes (4.3 GB) copied, 62.6381 s, 68.6 MB/s
:cool:
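To see what the bonded link itself can do, separate from disk and NFS overhead, an iperf run between the two boxes would be the obvious next test (a sketch - the hostname and stream count are made up):
Code:
# on the R210
iperf -s
# on the 2950: 30-second test with 4 parallel streams
iperf -c r210 -t 30 -P 4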
 
