VM IO Failure

drjaymz@

I recently migrated / imported a bunch of machines from KVM.

Today one of them is throwing an IO error which doesn't appear to have caused an issue - but maybe I just haven't discovered what that issue is yet.

Code:
Mar 27 11:20:56 gesyar3 kernel: [948952.805107] EXT3-fs (vdb): error in ext3_new_inode: IO failure
Mar 27 13:29:48 gesyar3 kernel: [956684.459644] EXT3-fs (vdb): error in ext3_new_inode: IO failure
Mar 27 13:33:24 gesyar3 kernel: [956900.330197] EXT3-fs (vdb): error in ext3_new_inode: IO failure
Mar 27 13:42:42 gesyar3 kernel: [957458.132799] EXT3-fs (vdb): error in ext3_new_inode: IO failure
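
The first thing I'd want to know is whether vdb itself reported block-level errors around those times, or whether it's only ext3 complaining. For reference, this is roughly what I'd grep for inside the guest (just a sketch - the exact message wording varies between kernel versions):

Code:
# look for block-layer errors on vdb, not just the ext3 ones
dmesg | grep -iE 'vdb|i/o error'
grep -iE 'vdb|i/o error' /var/log/messages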

This VM is running an old(ish) SUSE Enterprise guest.

Code:
# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       7.9G  3.1G  4.5G  41% /
udev            1.9G  104K  1.9G   1% /dev
tmpfs           1.9G  704K  1.9G   1% /dev/shm
/dev/vdb        148G  115G   26G  82% /home

I then restored a backup that was made after the error and checked the filesystem of the restored disk.

Code:
/dev/rpool/data# e2fsck vm-302-disk-1
e2fsck 1.46.5 (30-Dec-2021)
vm-302-disk-1: recovering journal
vm-302-disk-1 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found.  Fix<y>? yes
Inode 7987207 was part of the orphaned inode list.  FIXED.
Inode 7987208 was part of the orphaned inode list.  FIXED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (8779614, counted=8779592).
Fix<y>? yes
Inode bitmap differences:  -(7987203--7987208)
Fix<y>? yes
Free inodes count wrong for group #376 (8040, counted=8042).
Fix<y>? yes
Free inodes count wrong for group #975 (8180, counted=8190).
Fix<y>? yes
Free inodes count wrong (9495526, counted=9495533).
Fix<y>? yes


vm-302-disk-1: ***** FILE SYSTEM WAS MODIFIED *****
vm-302-disk-1: 334867/9830400 files (6.2% non-contiguous), 30542008/39321600 blocks

Initially I thought I was being clever by creating a copy and checking that, but now I realise that if you snapshot a filesystem whilst it's running you'll probably always see inode issues, because files are half open or partially written?
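
If I wanted a clean check without shutting the VM down, I think the way to do it would be to quiesce the guest filesystem through the guest agent first, snapshot the zvol, and then run a read-only fsck against a clone of that snapshot. A rough sketch, assuming the guest agent is installed in the VM and that the home disk is rpool/data/vm-302-disk-1 with the filesystem directly on the device (as it is here):

Code:
# freeze the guest filesystems via the QEMU guest agent
qm guest cmd 302 fsfreeze-freeze

# snapshot the zvol while it is quiesced, then thaw straight away
zfs snapshot rpool/data/vm-302-disk-1@fsck
qm guest cmd 302 fsfreeze-thaw

# clone the snapshot to get a block device and check it read-only
zfs clone rpool/data/vm-302-disk-1@fsck rpool/data/vm-302-disk-1-fsck
e2fsck -fn /dev/zvol/rpool/data/vm-302-disk-1-fsck

# clean up afterwards
zfs destroy rpool/data/vm-302-disk-1-fsck
zfs destroy rpool/data/vm-302-disk-1@fsck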

So I don't really know what to make of these errors. I got on to them because a user complained that the database said it was shutting down - which I think is what it says when a write fails or it can't write at all.
It didn't actually shut down and integrity seems fine. It's a Progress database, which I don't expect anyone to know about.
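
It might also be worth checking what ext3 is configured to do when it hits an error on that filesystem, and whether /home quietly went read-only - that would explain the database complaining about failed writes while everything else looked normal. A quick sketch of what I'd look at inside the guest:

Code:
# what does ext3 do when it hits an error on this filesystem?
tune2fs -l /dev/vdb | grep -i 'errors behavior'

# and did /home get remounted read-only after the IO failure?
mount | grep /home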

The IO failures roughly align with the */15 replication windows - so could it be related to that? I am running 3 nodes and replicating this VM to the other two.
But I checked, and the 13:42 error is 3 minutes before the replication run, so that can probably be ruled out.
I checked the forums but didn't see anything too similar.
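
For reference, this is roughly how I'd cross-check the error times against the replication jobs on the host (a sketch - exact log locations depend on the PVE version, and the date below is just an example):

Code:
# when did the replication jobs for this VM last run?
pvesr status

# search the host journal around the error times for replication activity
journalctl --since "2022-03-27 11:00" --until "2022-03-27 14:00" | grep -iE 'replicat|pvesr'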

The Proxmox storage is ZFS across 5 SSDs, i.e. software RAID rather than hardware RAID. IO wait etc. stays so low that the graph doesn't even bother to draw it.
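
And on the host side, in case the pool or one of the SSDs is quietly unhappy, these are the checks I'd run (device names are just examples):

Code:
# any read/write/checksum errors reported by ZFS?
zpool status -v rpool

# recent ZFS events (IO errors, delays, etc.)
zpool events | tail -n 50

# SMART health and attributes of each SSD (repeat per device)
smartctl -H /dev/sda
smartctl -A /dev/sda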

Any ideas would be helpful.
 