Sporadic Buffer I/O error on device vda1 inside guest,RAW on LVM on top of DRBD

stanislav · Feb 16, 2015

Hello!

I have several proxmox clusters separeated geographically.
Each cluster contains couples of servers sharing LVM over drbd, as a DRBD interlink i have Intel 10G ethernet cards. All servers have top-level RAID controllers with bbu.
All servers have pve-enterprise repository access via community subscription.

Since update in december some guests get sporadically messages like that independend of filesystem type. Some VMs have xfs, some ext4.

Code:

[2015-02-14 16:37:06]  end_request: I/O error, dev vda, sector 15763032
[2015-02-14 16:37:06]  Buffer I/O error on device vda1, logical block 1970123
[2015-02-14 16:37:06]  EXT4-fs warning (device vda1): ext4_end_bio:250: I/O error -5 writing to inode 398637 (offset 0 size 4096 starting block 1970380)
[2015-02-14 16:37:06]  end_request: I/O error, dev vda, sector 15763064
[2015-02-14 16:37:06]  Buffer I/O error on device vda1, logical block 1970127
[2015-02-14 16:37:06]  EXT4-fs warning (device vda1): ext4_end_bio:250: I/O error -5 writing to inode 398637 (offset 16384 size 4096 starting block 1970384)
[2015-02-14 16:37:06]  end_request: I/O error, dev vda, sector 15763144
[2015-02-14 16:37:06]  Buffer I/O error on device vda1, logical block 1970137
[2015-02-14 16:37:06]  EXT4-fs warning (device vda1): ext4_end_bio:250: I/O error -5 writing to inode 398637 (offset 57344 size 4096 starting block 1970394)
[2015-02-14 16:37:06]  end_request: I/O error, dev vda, sector 15763176
[2015-02-14 16:37:06]  Buffer I/O error on device vda1, logical block 1970141
[2015-02-14 16:37:06]  EXT4-fs warning (device vda1): ext4_end_bio:250: I/O error -5 writing to inode 398637 (offset 73728 size 4096 starting block 1970398)
[2015-02-14 16:37:06]  end_request: I/O error, dev vda, sector 15763256
[2015-02-14 16:37:06]  Buffer I/O error on device vda1, logical block 1970151
[2015-02-14 16:37:06]  EXT4-fs warning (device vda1): ext4_end_bio:250: I/O error -5 writing to inode 398637 (offset 114688 size 4096 starting block 1970408)

File system check don't find any corruptions, but some Windows VMs got data loss. BTW, filesystem check on windows VMs isn't able to find any inconsistencies.

Code:

[CODE]roxmox01:~# pveversion -v
proxmox-ve-2.6.32: 3.3-139 (running kernel: 2.6.32-34-pve)
pve-manager: 3.3-5 (running version: 3.3-5/bfebec03)
pve-kernel-2.6.32-32-pve: 2.6.32-136
pve-kernel-2.6.32-28-pve: 2.6.32-124
pve-kernel-2.6.32-30-pve: 2.6.32-130
pve-kernel-2.6.32-34-pve: 2.6.32-140
pve-kernel-2.6.32-26-pve: 2.6.32-114
pve-kernel-2.6.32-23-pve: 2.6.32-109
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-1
pve-cluster: 3.0-15
qemu-server: 3.3-3
pve-firmware: 1.1-3
libpve-common-perl: 3.0-19
libpve-access-control: 3.0-15
libpve-storage-perl: 3.0-25
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-10
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

proxmox01:~# cat /etc/drbd.d/r0.res
resource r0 {
protocol C;
startup {
wfc-timeout 0; # non-zero wfc-timeout can be dangerous (http://forum.proxmox.com/threads/3465-Is-it-safe-to-use-wfc-timeout-in-DRBD-configuration)
degr-wfc-timeout 60;
become-primary-on both;
}
net {
cram-hmac-alg sha1;
shared-secret "lai8IezievuCh0eneiph0eetaigaiMee";
allow-two-primaries;
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;
max-buffers 8000;
max-epoch-size 8000;
sndbuf-size 0;
}

syncer {
al-extents 3389;
verify-alg crc32c;
}

disk {
no-disk-barrier;
no-disk-flushes;
}

on proxmox01 {
device /dev/drbd0;
disk /dev/sdb;
address 192.168.1.147:7788;
meta-disk internal;
}
on proxmox02 {
device /dev/drbd0;
disk /dev/sdb;
address 192.168.1.148:7788;
meta-disk internal;
}
}[/CODE]

Any ideas?

e100 · Feb 18, 2015

What cache mode are you using for the virtual disks?

What raid controller and disks are you using?

Any related else in the proxmox host?

I have seen similar errors a few years ago when I had some buggy raid card and hard drive firmware that resulted in various timeouts, disks being ejected from the array etc.

stanislav · Feb 18, 2015

Hello, e100!

All virtual disks are in writethrough mode.

Servers with most issues have LSI MegaRAID SAS 9280-4i4e controllers, with cachecade as well as without.
Some servers have Adaptec ASR7805Q with MaxCache. All controllers have latest firmware.

On proxmox host self - nothing related, even nothing in ring buffer, no suspicious I/O load. All servers are from supermicro, enterprise grade 2HU machines with xeon CPUs and ECC RAM.

We have this configuration for almost 2 years, and until December we had no any issue.

King Regards!
Stanislav

e100 · Feb 18, 2015

The only suggestion I have is you should try using directsync instead of writethrough to see if it makes a difference.

What model CPUs do you have?

stanislav · Feb 19, 2015

I will change writethrough to directsync and continue observation.
BTW, though buffer io errors - there were no data loss on ext4 and xfs partitions. fsck doesn't report any filesystem errors.

CPUs are
Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
Intel(R) Xeon(R) CPU E5-2695 v2 @ 2.40GHz
Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz

Regards!

gianni.bet · Jun 17, 2015

stanislav said:
I will change writethrough to directsync and continue observation.
BTW, though buffer io errors - there were no data loss on ext4 and xfs partitions. fsck doesn't report any filesystem errors.

CPUs are
Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
Intel(R) Xeon(R) CPU E5-2695 v2 @ 2.40GHz
Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz

Regards!

Hi we have the same exactly problem

The change to directsync did solve the issue???

Our versions:

root@kvmg1-ver3:~# pveversion -v
proxmox-ve-2.6.32: 3.3-147 (running kernel: 2.6.32-37-pve)
pve-manager: 3.4-1 (running version: 3.4-1/3f2d890e)
pve-kernel-2.6.32-37-pve: 2.6.32-147
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.3-20
pve-firmware: 1.1-3
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-31
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-12
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

Gianni.bet

macday · Jun 18, 2015

Hello,

I have the same kind of problem. But my guest FS is BTRFS. It was an read error on vda1 but BTRFS self healing features read it from another block and the where no more errors on this.

...waiting for DRBD9

gianni.bet · Jun 18, 2015

macday said:
Hello,

I have the same kind of problem. But my guest FS is BTRFS. It was an read error on vda1 but BTRFS self healing features read it from another block and the where no more errors on this.

...waiting for DRBD9

Hi thanks for posting..
How can you say that is a read error?? Your VM affected did survive with this error??
And ... if we have Windows VM guests

no btrfs at all!!
Damn we have a LOT of VM in drbd...........

gianni.bet

xtavras · Jun 29, 2015

gianni.bet said:
Hi we have the same exactly problem

The change to directsync did solve the issue???

Hi gianni.bet,
what I/O subsystem do you use? HW RAID + DRBD? It's interesting on what layer did this happen.

macday · Jun 29, 2015

I never had issues like this since I changed my VM-Cache Mode to writethrough.

xtavras · Jun 29, 2015

macday said:
I never had issues like this since I changed my VM-Cache Mode to writethrough.

Hm, interesting, we've used "directsync" since it's like "writethrough" but even more "harder" with caching (if I understand it correctly) and it didn't help, but I will try "writethrough" nevertheless, thanks.

e100 · Jun 29, 2015

Apparently writethrough seems to be the best option when using DRBD.
All other cache modes are known to cause out of sync (OOS) blocks in DRBD.

Oddly I have issues with some nodes where if I have guests using writethrough DRBD split brains under heavy IO load (eg during backups)

gianni.bet · Jul 1, 2015

xtavras said:
Hi gianni.bet,
what I/O subsystem do you use? HW RAID + DRBD? It's interesting on what layer did this happen.

Yes in this cluster I have HW RAID + DRBD and the cache was in writethrough but there are 11 VM in this two-node cluster, all in writethrough, the one "problematic" is a heavy loaded varnish cache... now I set it in directsync and since then I haven't any error..... P.S No another error appears after 10 days..... grrrrrr

gianni.bet

xtavras · Jul 23, 2015

Unfortunately, changing caching method to "directsync" or "writesync" doesn't fix it for us, we are still have problems with "Buffer I/O" errors, upgrade to kernel to 3.10 and drbd 8.4 doesn't help either, even switching to slave on drbd and "invalidating" one of the node brings nothing, strange enough, we don't have issues on another Proxmox cluster with older version of Proxmox.

Here is pveversion of affected cluster.

Code:

proxmox-ve-2.6.32: 3.3-147 (running kernel: 3.10.0-10-pve)
pve-manager: 3.4-1 (running version: 3.4-1/3f2d890e)
pve-kernel-2.6.32-32-pve: 2.6.32-136
pve-kernel-3.10.0-10-pve: 3.10.0-34
pve-kernel-2.6.32-28-pve: 2.6.32-124
pve-kernel-2.6.32-30-pve: 2.6.32-130
pve-kernel-2.6.32-37-pve: 2.6.32-147
pve-kernel-2.6.32-34-pve: 2.6.32-140
pve-kernel-2.6.32-26-pve: 2.6.32-114
pve-kernel-2.6.32-23-pve: 2.6.32-109
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.3-20
pve-firmware: 1.1-3
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-31
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-12
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

And here where we don't have this problem.

Code:

proxmox-ve-2.6.32: 3.2-129 (running kernel: 2.6.32-30-pve)
pve-manager: 3.2-4 (running version: 3.2-4/e24a91c1)
pve-kernel-2.6.32-28-pve: 2.6.32-124
pve-kernel-2.6.32-30-pve: 2.6.32-130
pve-kernel-2.6.32-26-pve: 2.6.32-114
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-12
qemu-server: 3.1-16
pve-firmware: 1.1-3
libpve-common-perl: 3.0-18
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-19
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-6
vzctl: 4.0-1pve5
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.7-8
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.2-1

Both are using enterprise repos.
Some regression in another components, e.g. qemu-server?

gianni.bet · Jul 23, 2015

Strange thing ...me too in another "old" cluster 3.1 (same design 2 node drbd in active-active) never had any issue and the virtual machine have all the vm disks in writethrough mode since the beginning

This is the pveversion:

proxmox-ve-2.6.32: 3.1-114 (running kernel: 2.6.32-26-pve)
pve-manager: 3.1-21 (running version: 3.1-21/93bf03d4)
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-19-pve: 2.6.32-96
pve-kernel-2.6.32-16-pve: 2.6.32-82
pve-kernel-2.6.32-22-pve: 2.6.32-107
pve-kernel-2.6.32-26-pve: 2.6.32-114
pve-kernel-2.6.32-23-pve: 2.6.32-109
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.0-2
pve-cluster: 3.0-8
qemu-server: 3.1-8
pve-firmware: 1.0-23
libpve-common-perl: 3.0-8
libpve-access-control: 3.0-7
libpve-storage-perl: 3.0-17
pve-libspice-server1: 0.12.4-2
vncterm: 1.1-4
vzctl: 4.0-1pve4
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.4-17
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.1-1

This means that old 3.x versions don't have this issue...

gianni.bet

stanislav · Jul 30, 2015

Hello!

I was able to reproduce this error by running "dbench -D /root -s -S -t 600 2" inside of the guest.
After 387 seconds it broke.

Code:

   2   1302642    28.39 MB/sec  execute 382 sec  latency 7.610 ms
   2   1305139    28.38 MB/sec  execute 383 sec  latency 7.179 ms
   2   1307585    28.37 MB/sec  execute 384 sec  latency 7.167 ms
   2   1309804    28.36 MB/sec  execute 385 sec  latency 7.396 ms
   2   1312452    28.36 MB/sec  execute 386 sec  latency 9.038 ms
   2   1315195    28.36 MB/sec  execute 387 sec  latency 6.408 ms
[1320800] write failed on handle 11882 (Input/output error)
Child failed with status 1
webmail02-dev:~# [1310908] datasync directory "/root/clients/client0/~dmtmp/PM" failed: Input/output error


Message from syslogd@webmail02-dev at Jul 30 17:37:39 ...
 kernel:[  773.302715] journal commit I/O error

This is ringbuffer part:

Code:

[Thu Jul 30 17:24:50 2015] FS-Cache: Loaded
                  50       FS-Cache: Netfs 'nfs' registered for caching
                  50       Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
                  59       eth0: no IPv6 routers present
[Thu Jul 30 17:37:38 2015] end_request: I/O error, dev vda, sector 6277528
[Thu Jul 30 17:37:38 2015] end_request: I/O error, dev vda, sector 5499816
[Thu Jul 30 17:37:38 2015] Buffer I/O error on device vda2, logical block 492149
[Thu Jul 30 17:37:38 2015] Buffer I/O error on device vda2, logical block 492150
[Thu Jul 30 17:37:38 2015] Buffer I/O error on device vda2, logical block 492151
[Thu Jul 30 17:37:38 2015] Buffer I/O error on device vda2, logical block 492152
[Thu Jul 30 17:37:38 2015] Buffer I/O error on device vda2, logical block 492153
[Thu Jul 30 17:37:38 2015] Buffer I/O error on device vda2, logical block 492154
[Thu Jul 30 17:37:38 2015] Buffer I/O error on device vda2, logical block 492155
[Thu Jul 30 17:37:38 2015] Buffer I/O error on device vda2, logical block 492156
[Thu Jul 30 17:37:38 2015] Buffer I/O error on device vda2, logical block 492157
[Thu Jul 30 17:37:38 2015] Buffer I/O error on device vda2, logical block 492158
[Thu Jul 30 17:37:38 2015] Buffer I/O error on device vda2, logical block 492159
[Thu Jul 30 17:37:38 2015] Buffer I/O error on device vda2, logical block 492160
[Thu Jul 30 17:37:38 2015] Buffer I/O error on device vda2, logical block 492161
[Thu Jul 30 17:37:38 2015] Buffer I/O error on device vda2, logical block 492162
[Thu Jul 30 17:37:38 2015] Buffer I/O error on device vda2, logical block 492163
[Thu Jul 30 17:37:38 2015] Buffer I/O error on device vda2, logical block 492164
[Thu Jul 30 17:37:38 2015] EXT4-fs warning (device vda2): ext4_end_bio:250: I/O error -5 writing to inode 8349 (offset 65536 size 65536 starting block 687493
)
[Thu Jul 30 17:37:38 2015] Aborting journal on device vda2-8.
[Thu Jul 30 17:37:38 2015] journal commit I/O error
[Thu Jul 30 17:37:38 2015] EXT4-fs error (device vda2): ext4_journal_start_sb:327: Detected aborted journal
[Thu Jul 30 17:37:38 2015] EXT4-fs (vda2): Remounting filesystem read-only
Message from syslogd@webmail02-dev at Jul 30 17:37:39 ...
 kernel:[  773.302715] journal commit I/O error

Furthermore i did strace on kvm process inside proxmox host, suspicious part begins with:

Code:

17:37:36 accept4(7, {sa_family=AF_FILE, NULL}, [2], SOCK_CLOEXEC) = 27
17:37:36 fcntl(27, F_GETFL)             = 0x2 (flags O_RDWR)
17:37:36 fcntl(27, F_SETFL, O_RDWR|O_NONBLOCK) = 0
17:37:36 fstat(27, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
17:37:36 fcntl(27, F_GETFL)             = 0x802 (flags O_RDWR|O_NONBLOCK)
17:37:36 write(27, "{\"QMP\": {\"version\": {\"qemu\": {\"m"..., 105) = 105
17:37:36 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
17:37:36 ppoll([{fd=21, events=POLLIN|POLLERR|POLLHUP}, {fd=22, events=POLLIN|POLLERR|POLLHUP}, {fd=20, events=POLLIN|POLLERR|POLLHUP}, {fd=3, events=POLLIN|POLLERR|POLLHUP}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=27, events=POLLIN}], 7, {0, 0}, NULL, 8) = 3 ([{fd=6, revents=POLLIN}, {fd=5, revents=POLLIN}, {fd=27, revents=POLLIN}], left {0, 0})
17:37:36 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
17:37:36 read(5, "\1\0\0\0\0\0\0\0", 512) = 8
17:37:36 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
17:37:36 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
17:37:36 recvmsg(27, {msg_name(0)=NULL, msg_iov(1)=[{"{", 1}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 1
17:37:36 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
17:37:36 ppoll([{fd=21, events=POLLIN|POLLERR|POLLHUP}, {fd=22, events=POLLIN|POLLERR|POLLHUP}, {fd=20, events=POLLIN|POLLERR|POLLHUP}, {fd=3, events=POLLIN|POLLERR|POLLHUP}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=27, events=POLLIN}], 7, {0, 0}, NULL, 8) = 2 ([{fd=6, revents=POLLIN}, {fd=27, revents=POLLIN}], left {0, 0})
17:37:36 read(6, "\6\0\0\0\0\0\0\0", 16) = 8
17:37:36 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
17:37:36 ioctl(10, 0x4020aea5, 0x7fff04cf24f0) = 1
17:37:36 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
17:37:36 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
17:37:36 recvmsg(27, {msg_name(0)=NULL, msg_iov(1)=[{"\"", 1}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 1
17:37:36 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
17:37:36 ppoll([{fd=21, events=POLLIN|POLLERR|POLLHUP}, {fd=22, events=POLLIN|POLLERR|POLLHUP}, {fd=20, events=POLLIN|POLLERR|POLLHUP}, {fd=3, events=POLLIN|POLLERR|POLLHUP}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=27, events=POLLIN}], 7, {0, 42205124}, NULL, 8) = 3 ([{fd=21, revents=POLLIN}, {fd=6, revents=POLLIN}, {fd=27, revents=POLLIN}], left {0, 42203692})
17:37:36 read(6, "\4\0\0\0\0\0\0\0", 16) = 8
17:37:36 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
17:37:36 recvmsg(27, {msg_name(0)=NULL, msg_iov(1)=[{"e", 1}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 1
17:37:36 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
17:37:36 read(21, "\1\0\0\0\0\0\0\0", 512) = 8
17:37:36 futex(0x7f49da2ead38, FUTEX_WAKE_PRIVATE, 1) = 1
17:37:36 ppoll([{fd=21, events=POLLIN|POLLERR|POLLHUP}, {fd=22, events=POLLIN|POLLERR|POLLHUP}, {fd=20, events=POLLIN|POLLERR|POLLHUP}, {fd=3, events=POLLIN|POLLERR|POLLHUP}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=27, events=POLLIN}], 7, {0, 0}, NULL, 8) = 3 ([{fd=6, revents=POLLIN}, {fd=5, revents=POLLIN}, {fd=27, revents=POLLIN}], left {0, 0})
17:37:36 read(6, "\2\0\0\0\0\0\0\0", 16) = 8
17:37:36 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
17:37:36 futex(0x7f49da2ead38, FUTEX_WAKE_PRIVATE, 1) = 1

Any ideas?

Best regards!
Stanislav

xtavras · Aug 11, 2015

FYI, update to pve >= 3.4-6 fixes this problem.

BrettR · Oct 11, 2018

Hello Proxmox Wizards...

The Linux VMs within my proxmox system have recently started to display the following error
messages..when I try to dynamically map a USB drive into the VM.
----------------------------------------------
[ ] sd 5:0:0:0 [sda] No Caching mode page found
[ ] sd 5:0:0:0 [sda] Assuming drive cache write through
[ ] blk_update_request: I/O error, dev sda, sector 0
[ ] buffer I/O error on dev sda, logical block 0, async page read
[ ] blk_upate_request: I/O error on dev sda, sector 0
...

These errors appear within an Ubuntu Linux VM..when I issue the following commands
on the proxmox API server console:
qm monitor 810
device_add usb-host,vendorid=0x1058,productid=0x1048,id=usb0

Previously, the above 'device_add' command would cause a mounted USB drive to suddenly appear
within the VM. The USB device would show up as /dev/sda1 (and not /dev/sda)

Here is my pve version information:
-------------------------------------------------------------------------
pveversion -v
proxmox-ve: 5.1-32 (running kernel: 4.13.13-2-pve)
pve-manager: 5.1-41 (running version: 5.1-41/0b958203)
pve-kernel-4.13.13-2-pve: 4.13.13-32
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-18
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
--------------------------------------------

I've checked the disk space within the 810 VM..and there seems to be available space..
I've also checked the disk space within my proxmox API server..and it has available space.

The disk caching settings for the 810 VM are:
Previously set to 'no caching'
I then tried setting the caching to 'write through'..but that didn't resolve the issue.

*** Any ideas on what might have triggered the above errors?

(Also, I've tried multiple USB devices..and each of them trigger the same type of errors.)
(I"ve also tried using different USB ports on the proxmox API server;
(I've also rebooted the proxmox API server)

Thanks in advance!!

Brett

Stoiko Ivanov · Oct 11, 2018

Please upgrade to the latest version (5.2.2), and try running with the latest 4.15. kernel - quite a few changes happened since 4.13.13-2

* Does the disk work when you mount it on the PVE-node?

Lokytech · Nov 19, 2021

Hello everyone,
I'm sorry to dig up an old thread but i'm exactly in this situation after upgrading from 6.4 (everything was fine) to 7.0 (last week) and 7.1 (yesterday).

Nothing in the log of the host. Network/Disk/CPU/RAM all ok.

But inside the VM (KVM) it's another story.

It's independant of the targeted storage (ceph-rbd or MD3200i).
If i generate high IO inside the VM, i got those message :

Code:

Nov 19 03:37:04 VM-5 kernel: [13492.511828] Buffer I/O error on device dm-5, logical block 293225478
Nov 19 03:37:04 VM-5 kernel: [13492.511973] Buffer I/O error on device dm-5, logical block 293225479
Nov 19 03:37:04 VM-5 kernel: [13492.512116] Buffer I/O error on device dm-5, logical block 293225480
Nov 19 03:37:04 VM-5 kernel: [13492.512258] Buffer I/O error on device dm-5, logical block 293225481
Nov 19 03:37:04 VM-5 kernel: [13492.512549] print_req_error: I/O error, dev vdb, sector 38961152
Nov 19 03:37:04 VM-5 kernel: [13492.512705] print_req_error: I/O error, dev vdb, sector 38963712
Nov 19 03:37:04 VM-5 kernel: [13492.512865] EXT4-fs warning (device dm-5): ext4_end_bio:323: I/O error 10 writing to inode 36965380 (offset 28764536832 size 8388608 starting block 293226304)
Nov 19 03:37:04 VM-5 kernel: [13492.513010] print_req_error: I/O error, dev vdb, sector 38965248
Nov 19 03:37:04 VM-5 kernel: [13492.513168] print_req_error: I/O error, dev vdb, sector 38967808
Nov 19 03:37:04 VM-5 kernel: [13492.513324] EXT4-fs warning (device dm-5): ext4_end_bio:323: I/O error 10 writing to inode 36965390 (offset 3061841920 size 8388608

I tried the different available kernel :
- 5.13.19-1-pve
- 5.11.22-7-pve
- 5.4.143-1-pve
- Target storage on Ceph OR iSCSI LVM
- Different host (R710, R540, Poweredge 2950)
- Different VM (all debian buster)
- I broke the multipath to be only on one link instead of 8
- VM with cache=none/writethrough/directsync

The I/O error on iSCSI is 10 but on Ceph is -5

Code:

Nov 17 09:17:02 VM-4 kernel: [  111.693582] Buffer I/O error on device dm-1, logical block 40967
Nov 17 09:17:02 VM-4 kernel: [  111.693659] Buffer I/O error on device dm-1, logical block 40968
Nov 17 09:17:02 VM-4 kernel: [  111.693736] Buffer I/O error on device dm-1, logical block 40969
Nov 17 09:17:02 VM-4 kernel: [  111.693949] blk_update_request: I/O error, dev vda, sector 833536
Nov 17 09:17:02 VM-4 kernel: [  111.694038] EXT4-fs warning (device dm-1): ext4_end_bio:314: I/O error -5 writing to inode 914713 (offset 0 size 8388608 starting block 41216)
Nov 17 09:17:02 VM-4 kernel: [  111.694161] blk_update_request: I/O error, dev vda, sector 835584
Nov 17 09:17:02 VM-4 kernel: [  111.694247] blk_update_request: I/O error, dev vda, sector 837616

During the high IO operation, i'm sending 100MB/s sustained traffic through netcat and disk IO are also at 100MB/s to the storage.
Storage backend network is 10Gb network.

Code:

$> pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.4.143-1-pve)
pve-manager: 7.1-5 (running version: 7.1-5/6fe299a0)
pve-kernel-5.13: 7.1-4
pve-kernel-helper: 7.1-4
pve-kernel-5.11: 7.0-10
pve-kernel-5.4: 6.4-7
pve-kernel-5.13.19-1-pve: 5.13.19-2
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 16.2.6-pve2
ceph-fuse: 16.2.6-pve2
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-3
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.14-1
proxmox-backup-file-restore: 2.0.14-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.4-2
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-1
pve-qemu-kvm: 6.1.0-2
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-3
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3

or

Code:

pveversion -v
proxmox-ve: 7.0-2 (running kernel: 5.11.22-7-pve)
pve-manager: 7.0-14+1 (running version: 7.0-14+1/08975a4c)
pve-kernel-helper: 7.1-4
pve-kernel-5.11: 7.0-10
pve-kernel-5.4: 6.4-7
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-5-pve: 5.11.22-10
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-4.13.13-1-pve: 4.13.13-31
ceph-fuse: 16.2.6-pve2
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-12
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-3
libpve-storage-perl: 7.0-13
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.13-1
proxmox-backup-file-restore: 2.0.13-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.1-1
pve-docs: 7.0-5
pve-edk2-firmware: 3.20210831-1
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.5-1
pve-qemu-kvm: 6.1.0-1
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-18
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3

Would anyone have any idea where should i look ?
Right now i'm trying to mount a disk on ceph and iscsi in order to run the dbench test to check if i can reproduce the problem on a host.

Sporadic Buffer I/O error on device vda1 inside guest,RAW on LVM on top of DRBD

New Member

Famous Member

New Member

Famous Member

New Member

Member

Member

Member

Renowned Member

Member

Renowned Member

Famous Member

Member

Renowned Member

Member

New Member

Renowned Member

Member

Proxmox Staff Member

Renowned Member

We value your privacy