I recently migrated / imported a bunch of machines from KVM.
Today one of them is throwing an IO error which doesn't appear to have caused an issue - but maybe I just haven't discovered what that issue is yet.
Code:
Mar 27 11:20:56 gesyar3 kernel: [948952.805107] EXT3-fs (vdb): error in ext3_new_inode: IO failure
Mar 27 13:29:48 gesyar3 kernel: [956684.459644] EXT3-fs (vdb): error in ext3_new_inode: IO failure
Mar 27 13:33:24 gesyar3 kernel: [956900.330197] EXT3-fs (vdb): error in ext3_new_inode: IO failure
Mar 27 13:42:42 gesyar3 kernel: [957458.132799] EXT3-fs (vdb): error in ext3_new_inode: IO failure
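So far I have only looked inside the guest. The rough cross-checks I was planning on the host side are something like this (rpool and vdb are just the names from my setup):
Code:
# inside the guest: anything else from the block layer around those times?
dmesg | grep -iE "vdb|i/o error"

# on the Proxmox host: does the kernel or ZFS report anything in the same window?
journalctl -k | grep -iE "zio|i/o error"
zpool status -v rpool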
The VM is running an old(ish) SUSE Linux Enterprise release.
Code:
# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       7.9G  3.1G  4.5G  41% /
udev            1.9G  104K  1.9G   1% /dev
tmpfs           1.9G  704K  1.9G   1% /dev/shm
/dev/vdb        148G  115G   26G  82% /home
I then restored a backup that was made after the error and checked the filesystem on the restored disk.
Code:
/dev/rpool/data# e2fsck vm-302-disk-1
e2fsck 1.46.5 (30-Dec-2021)
vm-302-disk-1: recovering journal
vm-302-disk-1 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found. Fix<y>? yes
Inode 7987207 was part of the orphaned inode list. FIXED.
Inode 7987208 was part of the orphaned inode list. FIXED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (8779614, counted=8779592).
Fix<y>? yes
Inode bitmap differences: -(7987203--7987208)
Fix<y>? yes
Free inodes count wrong for group #376 (8040, counted=8042).
Fix<y>? yes
Free inodes count wrong for group #975 (8180, counted=8190).
Fix<y>? yes
Free inodes count wrong (9495526, counted=9495533).
Fix<y>? yes
vm-302-disk-1: ***** FILE SYSTEM WAS MODIFIED *****
vm-302-disk-1: 334867/9830400 files (6.2% non-contiguous), 30542008/39321600 blocks
Initially I thought I was being clever by creating a copy and checking that, but now I realise that if you snapshot a filesystem while it's running you'll probably always see inode issues, because files are half open or partially written?
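If it helps, the kind of copy-and-check I had in mind was roughly this - a snapshot of the zvol, cloned, then a read-only e2fsck so nothing on the copy gets modified (the dataset names are just from my setup):
Code:
# snapshot the running zvol and clone it
zfs snapshot rpool/data/vm-302-disk-1@check
zfs clone rpool/data/vm-302-disk-1@check rpool/data/vm-302-disk-1-check

# read-only, forced check on the clone (no repairs)
e2fsck -fn /dev/rpool/data/vm-302-disk-1-check

# clean up: destroy the clone first, then the snapshot
zfs destroy rpool/data/vm-302-disk-1-check
zfs destroy rpool/data/vm-302-disk-1@check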
So I don't really know what to make of these errors. I got on to them because a user complained that the database said it was shutting down - which I think is what it says when a write fails or it is unable to write.
It didn't shut down and integrity seems fine. It's a Progress database, which I don't expect anyone to know about.
The IO failures roughly align with the */15 replication windows - so could it be related to that? I am running 3 nodes and replicating this VM to the other two.
But I checked, and the 13:42 error is 3 minutes before the replication runs, so I can probably rule that out.
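In case it matters, this is roughly how I compared the times (assuming the jobs show up in pvesr status; on the SLES guest the kernel messages land in /var/log/messages):
Code:
# on the Proxmox host: last sync time of each replication job
pvesr status

# inside the guest: the error timestamps again, for comparison
grep "ext3_new_inode" /var/log/messages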
I checked the forums but didn't see anything too similar.
The Proxmox setup is 5 SSDs in a ZFS software RAID, i.e. not hardware RAID. IO wait etc. stays so low that the graph doesn't bother to draw it.
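If it helps, these are the sort of checks I can run on the host to rule out the disks themselves (sdX below is a placeholder for each of the 5 SSDs):
Code:
# any read/write/checksum errors or degraded vdevs?
zpool status -v rpool

# recent ZFS error events, if any
zpool events -v | grep -i error

# SMART health of each SSD (sdX is a placeholder)
smartctl -a /dev/sdX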
Any ideas would be helpful.