We have a small PVE cluster that was running PVE 6.4; after we upgraded the nodes to PVE 7.0, our backup jobs for VMs started failing (CTs are unaffected). We have collected the following diagnostic information.
The majority of our compute servers are HPE ProLiant DL388 Gen10, with Intel X710-DA2 SFP+ NIC (HPE Ethernet 10Gb 2-port 562SFP+ Adapter) connecting to our storage server, an HPE MSA 1050. The storage server exposes a single LUN which is used as an LVM PV (and a VG), and CTs and VMs use LVs on that VG, so it's essentially LVM-over-iSCSI.
When we run a backup job for a VM, vzdump's output shows a few megabytes to tens of megabytes read and then stalls, while dmesg -w prints "connection1:0 detected conn error (1020)" every 13 seconds (very precisely), followed by "blk_update_request: I/O error, dev sdd, sector 16546623488 op 0x0: (READ) flags 0x4000 phys_seg 256 prio class 2". At this point the whole iSCSI interface is unavailable and stuck for about a minute before it self-remediates. LXC containers and VMs running on that host start remounting their filesystems read-only and must be restarted before they're operational again.
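For anyone who wants to compare notes: this is how the session state and the configured timeouts can be dumped, assuming the stock open-iscsi tooling that PVE ships (the grep pattern is just for readability).

# dump session state, negotiated parameters and timeouts
iscsiadm -m session -P 3

# per-connection keepalive/timeout knobs live in the initiator config
grep -E 'noop_out|replacement_timeout' /etc/iscsi/iscsid.conf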
tcpdump shows 3–6 seconds of inactivity before the storage server sends a TCP RST. With the VM powered off, we could not reproduce the issue with any operation we tried: full-disk dd (both read and write tests), e2fsck -f (which reports the filesystem as clean), and migrating VM disks to and from this storage. Running vzdump against VMs on other storage (local-lvm) turns up no problems, nor does backing up LXC containers on the same LVM-over-iSCSI storage.
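Roughly what those checks looked like; the interface, target IP, and VG/LV names below are placeholders, not our real ones.

# watch the iSCSI TCP stream to catch the RST timing (3260 = iSCSI target port)
tcpdump -ni eth2 host 192.0.2.10 and port 3260

# full-disk sequential read of the VM's LV
dd if=/dev/vg_san/vm-100-disk-0 of=/dev/null bs=1M status=progress

# write test against a scratch LV (destructive, so not run on the VM's disk)
dd if=/dev/zero of=/dev/vg_san/scratch bs=1M status=progress

# offline filesystem check
e2fsck -f /dev/vg_san/vm-100-disk-0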
We tried the same operation on 4 more servers of the same model (DL388 Gen10; our storage is shared, so migration takes only a few seconds), and they all showed the same symptoms. Two other servers of a different model (ML350 Gen10, with Broadcom NetXtreme II BCM57810) run vzdump at full speed without any issue.
We have tried the following troubleshooting methods to no avail.
- We installed pve-kernel-5.4.128-1-pve from PVE 6, rebooted into this kernel, and did not observe any difference (downgrade steps sketched after this list).
- We installed open-iscsi 2.0.874-7.1 (and its missing dependencies) from Debian Buster and rebooted; no difference either.
- We updated the storage server's firmware to the latest version; no change.
- We updated the Intel NIC's firmware to the latest version; no change either.
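For anyone wanting to reproduce the downgrades: the 5.4 kernel series is still in the PVE repositories, while open-iscsi came from the Debian Buster pool as a plain .deb (the exact filename below is illustrative).

# the 5.4 kernel is still available from the PVE repos
apt install pve-kernel-5.4.128-1-pve
# then reboot and select the 5.4 kernel in the boot menu

# open-iscsi 2.0.874-7.1 from the Buster pool (filename illustrative)
dpkg -i open-iscsi_2.0.874-7.1_amd64.deb
apt-mark hold open-iscsi   # keep apt from pulling the Bullseye version back in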
Do you have any idea what we should look into next?
Update: We tried a few more things, and some of them made a difference.
- We installed pve-qemu-kvm 5.2.0-6 from PVE 6 and the situation improved: there are still occasional "detected conn error (1020)" lines in dmesg, but the backup reached the disk array's top speed and eventually succeeded.
- We installed qemu-server 6.4-2 from PVE 6 (together with pve-qemu-kvm 5.2) and things got slightly worse, with more frequent connection errors and a lower average speed, but the backup job still succeeded.
- Setting bwlimit for vzdump mitigated the issue; a low enough value eliminated all iSCSI connection errors. We tested values up to 20 MB/s on two hosts of the "problematic" model and backups completed cleanly. However, different hosts of the same model produced different results at higher limits, with no clear pattern across hosts or VMs.
- We tried setting aio=native or aio=threads in the disk configuration, but neither made any difference in combination with the other changes (i.e. pve-qemu-kvm 6.0.0-3 still fails outright, while pve-qemu-kvm 5.2.0-6 succeeds with intermittent errors).

We're currently running pve-qemu-kvm 5.2.0-6 and everything else up-to-date.
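For reference, the two workarounds look roughly like this; VM ID 100 and the storage/disk names are placeholders, and vzdump's bwlimit is in KiB/s.

# one-off backup with a 20 MB/s cap (20480 KiB/s)
vzdump 100 --bwlimit 20480

# or set the limit globally for all jobs in /etc/vzdump.conf
bwlimit: 20480

# override the AIO backend on a specific disk
qm set 100 --scsi0 san-lvm:vm-100-disk-0,aio=native

The aio flag can also be edited directly in the VM's config under /etc/pve/qemu-server/.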