Continued IO Errors

stra4d

If I keep getting I/O errors (and the filesystem gets remounted read-only), does that mean the virtual disk is corrupt, or that the underlying physical disk on the server itself has issues?

I am trying to make backups (creating copies of files) on a VM, but during the copy I keep getting I/O errors on /dev/vda1 inside the VM. I can run fsck to fix things and everything seems fine. Then, after trying the backup again, it's back to I/O errors and a read-only mount, then fsck again, and so on.
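For reference, this is roughly what I do inside the VM each time (a sketch; the device and commands reflect my setup):

Code:
# inside the VM, after the root filesystem flips to read-only
dmesg -T | tail -n 50          # shows the block-layer I/O errors on vda1
# repair from a rescue shell or live environment so the filesystem is not mounted
fsck -f /dev/vda1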

Everything is up to date on the Proxmox server:

Code:
# pveversion -v
proxmox-ve: 5.4-2 (running kernel: 4.15.18-26-pve)
pve-manager: 5.4-13 (running version: 5.4-13/aee6f0ec)
pve-kernel-4.15: 5.4-14
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-26-pve: 4.15.18-54
pve-kernel-4.15.18-25-pve: 4.15.18-53
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-19-pve: 4.15.18-45
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.15.18-14-pve: 4.15.18-39
pve-kernel-4.15.18-13-pve: 4.15.18-37
pve-kernel-4.15.18-12-pve: 4.15.18-36
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.15.18-3-pve: 4.15.18-22
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.16-3-pve: 4.13.16-50
pve-kernel-4.13.16-2-pve: 4.13.16-48
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-4-pve: 4.13.13-35
pve-kernel-4.4.98-3-pve: 4.4.98-103
pve-kernel-4.4.83-1-pve: 4.4.83-96
pve-kernel-4.4.21-1-pve: 4.4.21-71
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-12
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-56
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-14
libpve-storage-perl: 5.0-44
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-7
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-38
pve-container: 2.0-41
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-7
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-4
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-55
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2
 
The answers to your question can be found in /var/log/messages and/or by launching the VM manually with logging enabled (-d argument).
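For example, something like this on the Proxmox host (a minimal sketch; adjust the grep pattern as needed):

Code:
# on the Proxmox host
grep -iE 'error|fail' /var/log/messages | tail -n 50
# or watch kernel messages live while reproducing the problem
dmesg -Tw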
So if there is nothing in /var/log/messages on the Proxmox server, then it is most likely the VM. Will restoring from backup (using qmrestore) 'recreate' the VM disk, or will I end up with the same issues after restoring?
 
Based on your reported problem it is highly unlikely that you will have "nothing in /var/log/messages." In any case, it's impossible for me to say whether the problem will resolve on VM restoration, since we haven't established why it was happening in the first place.
 
So as a recap, there did not appear to be anything in /var/log/messages on the Proxmox server. After restoring from backup and restarting some services on the VM, I got the issue again:

(attached screenshot: Capture1.JPG)

Also on the console:

(attached screenshot: Capture2.JPG)

Searching for this issue (Page cache invalidation failure) only turns up a few results in Russian.
 
The /var/log/messages from the proxmox server:
Code:
Apr 29 14:05:46 <server> pvedaemon[2858]: <root@pam> successful auth for user 'root@pam'
Apr 29 14:05:53 <server> pvedaemon[2858]: <root@pam> starting task UPID:rciblade360:00000E49:02252476:5EA9C201:qmstart:113:root@pam:
Apr 29 14:05:56 <server> kernel: [359881.706125] device tap113i0 entered promiscuous mode
Apr 29 14:05:56 <server> kernel: [359881.725580] vmbr0: port 2(tap113i0) entered blocking state
Apr 29 14:05:56 <server> kernel: [359881.725583] vmbr0: port 2(tap113i0) entered disabled state
Apr 29 14:05:56 <server> kernel: [359881.725745] vmbr0: port 2(tap113i0) entered blocking state
Apr 29 14:05:56 <server> kernel: [359881.725747] vmbr0: port 2(tap113i0) entered forwarding state
Apr 29 14:05:57 <server> pvedaemon[2858]: <root@pam> end task UPID:rciblade360:00000E49:02252476:5EA9C201:qmstart:113:root@pam: OK
Apr 29 14:06:25 <server> pvedaemon[1465]: <root@pam> starting task UPID:rciblade360:00000EE5:0225310B:5EA9C221:vncproxy:113:root@pam:
Apr 29 14:20:46 <server> pvedaemon[1465]: <root@pam> successful auth for user 'root@pam'
 
You can use dmesg to display the relevant messages if that helps, but those screencaps contain all the info you need. Begin with turning off the cache for your VM.
(attached screenshot: 1588185472916.png, showing the disk cache option in the VM hardware settings)

If that fixes the problem, great. BTW, was this VM converted from another hypervisor?
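If the GUI isn't handy, the same change can be made from the CLI, roughly like this (a sketch; the VM ID 113 and the virtio0 disk key / volume spec are taken from the logs in this thread, so confirm them with qm config first):

Code:
# on the Proxmox host: confirm which disk key and volume the VM uses
qm config 113 | grep -E 'virtio|scsi|ide|sata'
# re-set the same volume with caching disabled (volume spec as reported by qm config)
qm set 113 --virtio0 drive2_ISO:113/vm-113-disk-0.qcow2,cache=none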
 
dmesg returns:

Code:
[349765.146475] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[349765.146856] File: /drive2/iso/images/113/vm-113-disk-0.qcow2 PID: 10529 Comm: kworker/6:1
[354909.318533] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[354909.318857] File: /drive2/iso/images/113/vm-113-disk-0.qcow2 PID: 12731 Comm: kworker/11:0

[360103.974076] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[360103.974396] File: /drive2/iso/images/113/vm-113-disk-0.qcow2 PID: 1234 Comm: kworker/2:2
 
/etc/pve/storage.cfg:
Code:
dir: drive2_ISO
        path /drive2/iso
        content images,iso
        maxfiles 1

dir: local
        path /var/lib/vz
        content vztmpl,rootdir,images,iso
        maxfiles 0

dir: drive2_vm
        path /drive2/vm
        content rootdir,images
        maxfiles 1

dir: drive2_backups
        path /drive2/vzdumpbackups
        content backup
        maxfiles 2
        shared 0

lsblk:
Code:
NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda            8:0    0 279.4G  0 disk
  sda1         8:1    0     1M  0 part
  sda2         8:2    0   128M  0 part
  sda3         8:3    0 279.2G  0 part
    pve-swap 253:0    0    11G  0 lvm  [SWAP]
    pve-root 253:1    0  69.8G  0 lvm  /
    pve-data 253:2    0 182.5G  0 lvm  /var/lib/vz
sdb            8:16   0 698.6G  0 disk
  sdb1         8:17   0 698.6G  0 part /drive2
sdc            8:32   0 698.7G  0 disk
  sdc1         8:33   0 698.7G  0 part
sr0           11:0    1  1024M  0 rom

smartctl:
Code:
# smartctl --all /dev/sdb1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-26-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HP
Product:              LOGICAL VOLUME
Revision:             3.66
User Capacity:        750,122,819,584 bytes [750 GB]
Logical block size:   512 bytes
Rotation Rate:        15000 rpm
Logical Unit id:      0x600508b1001c55b299d54a6a1e44a663
Serial number:        50014380132EEBC0
Device type:          disk
Local Time is:        Wed Apr 29 16:10:41 2020 EDT
SMART support is:     Unavailable - device lacks SMART capability.

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C

Error Counter logging not supported

Device does not support Self Test logging
 
Ok, we're getting somewhere.

1. Migrate the disk to local storage. Does the problem remain?
2. What generation is the RAID controller? This should provide the relevant info: lspci -vv | grep RAID -A2
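For step 1, the move can be done from the CLI, roughly like this (a sketch; VM ID 113 and the virtio0 disk key are assumptions, check with qm config):

Code:
# on the Proxmox host: move the disk image to the 'local' storage for testing
qm move_disk 113 virtio0 local
# add --delete 1 to drop the old copy once the move has succeeded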

I vaguely remember issues with certain HP controllers and Linux kernel versions, but I would need more detail before proceeding. What I do suggest in ANY case is to obtain the most current SPP ISO from HP for your server and update all firmware. See https://techlibrary.hpe.com/us/en/e...ts/service_pack/spp/index.aspx?version=Gen8.1
 
lspci:
Code:
# lspci -vv | grep RAID -A2
05:00.0 RAID bus controller: Hewlett-Packard Company Smart Array G6 controllers (rev 01)
        Subsystem: Hewlett-Packard Company Smart Array P410i
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+

The server is an HP DL360 G7. We have 3 of them, running something like 20 VMs, which all run without issue. This is the only one that runs an Ubuntu server VM, however.
 
Unfortunately, HP does not provide service packs without a warranty or a support agreement. $$$
 
Update: restoring from backup did not change anything, so it must not fully re-create the disk. Moving the disk (it was somehow in the ISO storage directory instead of with the other VMs on drive2) to another directory on the same drive created a new disk image, and that seems to have fixed the issue. Perhaps fully removing the VM before restoring, or restoring to another VM ID, would have re-created the disk and also fixed the issue.
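For the record, restoring to a fresh VM ID, which also writes a brand-new disk image, would look roughly like this (a sketch; the backup filename is a placeholder and 213 is just an unused example ID):

Code:
# on the Proxmox host: restore the backup to a new VM ID so a new disk image is created
qmrestore /drive2/vzdumpbackups/dump/vzdump-qemu-113-<timestamp>.vma.lzo 213 --storage drive2_vm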

Big thanks to @alexskysilk for all your help here!
 
I once had similar strange issues; it turned out to be my power supply, more specifically too many drives connected to the same power line.
 
Thanks. You mean a desktop-style power supply where you can choose which power cable does what? This is a dual-power-supply server, so that is not really possible as far as I know.
 
