vzdump causes iSCSI connection lost on one server model with PVE 7, but not on another model

iBug

We have a small PVE cluster that ran PVE 6.4, and after we upgraded it to PVE 7.0, our backup jobs for VMs (but not CTs) started to fail. We have collected the following diagnostics.

The majority of our compute servers are HPE ProLiant DL388 Gen10, with Intel X710-DA2 SFP+ NIC (HPE Ethernet 10Gb 2-port 562SFP+ Adapter) connecting to our storage server, an HPE MSA 1050. The storage server exposes a single LUN which is used as an LVM PV (and a VG), and CTs and VMs use LVs on that VG, so it's essentially LVM-over-iSCSI.

When we run a backup job for a VM, the vzdump output shows a few megabytes to tens of megabytes read before it gets stuck, with dmesg -w showing "connection1:0 detected conn error (1020)" every 13 seconds (it's very precise), followed by "blk_update_request: I/O error, dev sdd, sector 16546623488 op 0x0: (READ) flags 0x4000 phys_seg 256 prio class 2". At this point, the whole iSCSI interface is unavailable and stuck for about a minute before it recovers on its own. LXC containers and VMs running on this host start remounting their filesystems read-only and must be restarted before they're operational again.

tcpdump shows 3 to 6 seconds of inactivity before the storage server issues a TCP RST. With the VM powered off, we could not reproduce the issue with any other kind of operation, including full-disk dd (both read and write tests), e2fsck -f (which reports the filesystem as clean), and migrating VM disks from and to this storage. Running vzdump on VMs on other storage (local-lvm) causes no problems, nor does backing up LXC containers on the same LVM-over-iSCSI storage.

We tried the same operation on 4 more servers of the same model (DL388 Gen10; our storage is shared, so migration takes just a few seconds), and they all showed the same symptoms. Two other servers of a different model (ML350 Gen10, with Broadcom NetXtreme II BCM57810) can run vzdump at full speed without any issue.

We have tried the following troubleshooting methods to no avail.
  • We installed pve-kernel-5.4.128-1-pve from PVE 6, rebooted with this kernel and did not observe any difference.
  • We installed open-iscsi 2.0.874-7.1 (and missing dependencies) from Debian Buster, rebooted, no difference either.
  • We updated the firmware of the storage server to the latest version, no difference either.
  • We updated the firmware for the Intel NIC to the latest version, no difference either.
We don't have access to an updated system BIOS or firmware for the host servers themselves, so we couldn't test that. We also don't have the expertise to do a full downgrade from Bullseye to Buster. We don't have a spare host for another PVE 6 installation at present, but we could try to set one aside if it's really needed.

Do you have any idea what we should look into next?



Update: We tried a few more things, and some of them made a difference.
  • We installed pve-qemu-kvm 5.2.0-6 from PVE 6 and the situation improved. There are still "detected conn error (1020)" lines in dmesg from time to time, but the backup reached the disk array's top speed and finally succeeded.
  • We installed qemu-server 6.4-2 from PVE 6 (with pve-qemu-kvm 5.2) and things got slightly worse, with more frequent connection errors and a lower average speed, but the backup job still succeeded.
  • Setting bwlimit for vzdump mitigated the issue. A low enough value eliminated all iSCSI connection errors. We tested up to 20 MB/s across two hosts of the same "problematic" model and it went fine.
    However, different hosts of the same model gave different results for higher bwlimit values, with no clear pattern across hosts or VMs.
In addition, we tried adding aio=native or aio=threads to the disk configuration, but neither changed the outcome of any configuration (i.e. pve-qemu-kvm 6.0.0-3 still fails completely, while pve-qemu-kvm 5.2.0-6 succeeds with intermittent errors).
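
For anyone who wants to repeat the bwlimit test: as far as we understand, vzdump takes bwlimit in KiB/s, so a figure such as 20 MB/s has to be converted first. The snippet below is only a sketch. The function names and the VM ID 100 are made up for illustration, and it assumes "MB" is read as MiB.

```python
def mib_per_s_to_bwlimit(mib_per_s):
    """Convert MiB/s to a KiB/s integer (the unit we assume bwlimit uses)."""
    return int(mib_per_s * 1024)


def vzdump_command(vmid, mib_per_s):
    """Build a vzdump command line with a bandwidth cap (hypothetical helper)."""
    return ["vzdump", str(vmid), "--bwlimit", str(mib_per_s_to_bwlimit(mib_per_s))]


if __name__ == "__main__":
    # VM ID 100 is a placeholder, not one of the VMs from this thread.
    print(" ".join(vzdump_command(100, 20)))
```

Stepping the limit up from a known-good value is one way to find the point where connection errors start on a given host.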

We're currently running pve-qemu-kvm 5.2.0-6 and everything else up-to-date.
 
Hi,

I am getting similar issues with LVM over an iSCSI connection from a 3-node cluster running a freshly installed (today) Proxmox 7 (pve-manager/7.1-6/4e61e21c, running kernel 5.13.19-1-pve) to a TrueNAS 12 storage server (Dell R510) via a clustered 10 GbE fiber connection into some Arista switches.

The same storage server also serves, in the same way, a different PVE cluster of 2 nodes (pve-manager/5.4-13/aee6f0ec, running kernel 4.15.18-27-pve), which runs just fine. It is connected to the same switch with an identical network configuration.

I have tried tweaking the iscsid.conf configuration without luck.

I can't really tell whether this is a bug in Proxmox 7 or not, since I do believe they tested it quite thoroughly before releasing the new version.

Would be nice to know if anybody else bumped into similar problems.
 
Hello,

I am getting back with some more info on this.

It does indeed seem to be an issue with the new Proxmox release. Using the same hardware (previously installed with version 7 and fully upgraded, both OS and Proxmox packages from the pve-no-subscription repository), I swapped the OS hard drives, installed a fresh version 6, and upgraded it to the latest packages (again both OS and Proxmox, from pve-no-subscription).

root@pvetest:~# pveversion
pve-manager/6.4-13/9f411e79 (running kernel: 5.4.143-1-pve)

I re-applied the exact same network configuration and IP addresses (no changes whatsoever to MTU, OVS, or the network configuration), then went to the old running version 5 cluster and copied over the storage configuration for the iSCSI and LVM storage entries.

I created a VM with a fresh install of an Ubuntu image, finished the installation without issues, and started the backup process.

While installing and while backing up to a remote NFS server, I monitored the TrueNAS iSCSI traffic both from the Proxmox server and from the BSD console. There are no more hangs or timeouts, the traffic flows uninterrupted, and the backup completes without any new messages like:

Code:
WARNING: 172.21.12.21 (iqn.2021-12.int.****.*****01:98d65d6e1fa3): no ping reply (NOP-Out) after 5 seconds; dropping connection
WARNING: 172.21.12.21 (iqn.2021-12.int.****.*****01:98d65d6e1fa3): no ping reply (NOP-Out) after 5 seconds; dropping connection

on the storage server.

I'm not quite sure where the problem is, though a look at /etc/lvm/lvm.conf on versions 7, 6 and 5 shows quite a few differences.
The iscsid.conf configuration was also the same between the tests.

Code:
Nov 28 02:53:16 ********** kernel: [  100.272019] kvm: SMP vm created on host with unstable TSC; guest TSC will not be reliable
Nov 28 02:53:24 ********** kernel: [  108.261737]  connection1:0: detected conn error (1020)
Nov 28 02:53:26 ********** kernel: [  110.263878] sd 3:0:0:0: Power-on or device reset occurred
Nov 28 02:53:36 ********** kernel: [  120.310461]  connection1:0: detected conn error (1020)
Nov 28 02:55:40 ********** kernel: [  244.008941] task:kworker/u48:1   state:D stack:    0 pid:  551 ppid:     2 flags:0x00004000
Nov 28 02:55:40 ********** kernel: [  244.008946] Workqueue: iscsi_eh __iscsi_block_session [scsi_transport_iscsi]
Nov 28 02:55:40 ********** kernel: [  244.008963] Call Trace:
Nov 28 02:55:40 ********** kernel: [  244.008967]  __schedule+0x2fa/0x910
Nov 28 02:55:40 ********** kernel: [  244.008974]  schedule+0x4f/0xc0
Nov 28 02:55:40 ********** kernel: [  244.008976]  schedule_preempt_disabled+0xe/0x10
Nov 28 02:55:40 ********** kernel: [  244.008979]  __mutex_lock.constprop.0+0x305/0x4d0
Nov 28 02:55:40 ********** kernel: [  244.008982]  ? select_task_rq_fair+0x16e/0x1300
Nov 28 02:55:40 ********** kernel: [  244.008986]  ? iscsi_dbg_trace+0x63/0x80 [scsi_transport_iscsi]
Nov 28 02:55:40 ********** kernel: [  244.008994]  __mutex_lock_slowpath+0x13/0x20
Nov 28 02:55:40 ********** kernel: [  244.008997]  mutex_lock+0x34/0x40
Nov 28 02:55:40 ********** kernel: [  244.008999]  device_block+0x28/0xc0
Nov 28 02:55:40 ********** kernel: [  244.009003]  starget_for_each_device+0xcd/0xf0
Nov 28 02:55:40 ********** kernel: [  244.009006]  ? scsi_mq_put_budget+0x60/0x60
Nov 28 02:55:40 ********** kernel: [  244.009009]  ? scsi_run_queue_async+0x60/0x60
Nov 28 02:55:40 ********** kernel: [  244.009011]  target_block+0x30/0x40
Nov 28 02:55:40 ********** kernel: [  244.009013]  device_for_each_child+0x5e/0xa0
Nov 28 02:55:40 ********** kernel: [  244.009015]  scsi_target_block+0x41/0x50
Nov 28 02:55:40 ********** kernel: [  244.009017]  __iscsi_block_session+0x6c/0xd0 [scsi_transport_iscsi]
Nov 28 02:55:40 ********** kernel: [  244.009026]  process_one_work+0x220/0x3c0
Nov 28 02:55:40 ********** kernel: [  244.009029]  worker_thread+0x53/0x420
Nov 28 02:55:40 ********** kernel: [  244.009030]  ? process_one_work+0x3c0/0x3c0
Nov 28 02:55:40 ********** kernel: [  244.009032]  kthread+0x12b/0x150
Nov 28 02:55:40 ********** kernel: [  244.009034]  ? set_kthread_struct+0x50/0x50
Nov 28 02:55:40 ********** kernel: [  244.009037]  ret_from_fork+0x22/0x30
Nov 28 02:55:40 ********** kernel: [  244.042796] perf: interrupt took too long (6497 > 2500), lowering kernel.perf_event_max_sample_rate to 30750
Nov 28 02:56:26 ********** kernel: [  290.280899] sd 3:0:0:0: [sdb] tag#1605 FAILED Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK cmd_age=180s
Nov 28 02:56:26 ********** kernel: [  290.280903] sd 3:0:0:0: [sdb] tag#1605 CDB: Read(16) 88 00 00 00 00 00 73 01 70 00 00 00 08 00 00 00
Nov 28 02:56:26 ********** kernel: [  290.280997] sd 3:0:0:0: [sdb] tag#1606 FAILED Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK cmd_age=180s
Nov 28 02:56:26 ********** kernel: [  290.281000] sd 3:0:0:0: [sdb] tag#1606 CDB: Read(16) 88 00 00 00 00 00 73 01 78 00 00 00 08 00 00 00
Nov 28 02:56:26 ********** kernel: [  290.281078] sd 3:0:0:0: [sdb] tag#1607 FAILED Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK cmd_age=180s
Nov 28 02:56:26 ********** kernel: [  290.281081] sd 3:0:0:0: [sdb] tag#1607 CDB: Read(16) 88 00 00 00 00 00 73 01 80 00 00 00 08 00 00 00
Nov 28 02:56:26 ********** kernel: [  290.281156] sd 3:0:0:0: [sdb] tag#1608 FAILED Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK cmd_age=180s
Nov 28 02:56:26 ********** kernel: [  290.281158] sd 3:0:0:0: [sdb] tag#1608 CDB: Read(16) 88 00 00 00 00 00 73 01 90 00 00 00 08 00 00 00
Nov 28 02:56:26 ********** kernel: [  290.281236] sd 3:0:0:0: [sdb] tag#1610 FAILED Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK cmd_age=180s
Nov 28 02:56:26 ********** kernel: [  290.281238] sd 3:0:0:0: [sdb] tag#1610 CDB: Read(16) 88 00 00 00 00 00 73 01 88 00 00 00 08 00 00 00
Nov 28 02:56:26 ********** kernel: [  290.281313] sd 3:0:0:0: [sdb] tag#1611 FAILED Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK cmd_age=180s
Nov 28 02:56:26 ********** kernel: [  290.281315] sd 3:0:0:0: [sdb] tag#1611 CDB: Read(16) 88 00 00 00 00 00 00 00 01 00 00 00 01 00 00 00
Nov 28 02:56:26 ********** kernel: [  290.330931] sd 3:0:0:0: [sdb] tag#1622 FAILED Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK cmd_age=180s
Nov 28 02:56:26 ********** kernel: [  290.330934] sd 3:0:0:0: [sdb] tag#1622 CDB: Read(16) 88 00 00 00 00 00 73 01 98 00 00 00 08 00 00 00
Nov 28 02:56:31 ********** kernel: [  295.304897] sd 3:0:0:0: [sdb] tag#1617 FAILED Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK cmd_age=180s
Nov 28 02:56:31 ********** kernel: [  295.304901] sd 3:0:0:0: [sdb] tag#1617 CDB: Read(16) 88 00 00 00 00 00 73 01 c8 00 00 00 08 00 00 00
Nov 28 02:56:31 ********** kernel: [  295.304987] sd 3:0:0:0: [sdb] tag#1619 FAILED Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK cmd_age=180s
Nov 28 02:56:31 ********** kernel: [  295.304989] sd 3:0:0:0: [sdb] tag#1619 CDB: Read(16) 88 00 00 00 00 00 73 01 b0 00 00 00 08 00 00 00
Nov 28 02:56:31 ********** kernel: [  295.305066] sd 3:0:0:0: [sdb] tag#1620 FAILED Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK cmd_age=180s
Nov 28 02:56:31 ********** kernel: [  295.305068] sd 3:0:0:0: [sdb] tag#1620 CDB: Read(16) 88 00 00 00 00 00 73 01 d0 00 00 00 08 00 00 00
Nov 28 02:56:31 ********** kernel: [  295.305143] sd 3:0:0:0: [sdb] tag#1621 FAILED Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK cmd_age=180s
Nov 28 02:56:31 ********** kernel: [  295.305145] sd 3:0:0:0: [sdb] tag#1621 CDB: Read(16) 88 00 00 00 00 00 73 01 d8 00 00 00 08 00 00 00
Nov 28 02:56:31 ********** kernel: [  295.305222] sd 3:0:0:0: [sdb] tag#1622 FAILED Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK cmd_age=180s
Nov 28 02:56:31 ********** kernel: [  295.305224] sd 3:0:0:0: [sdb] tag#1622 CDB: Read(16) 88 00 00 00 00 00 73 01 b8 00 00 00 08 00 00 00
Nov 28 02:56:31 ********** kernel: [  295.305298] sd 3:0:0:0: [sdb] tag#1623 FAILED Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK cmd_age=180s
Nov 28 02:56:31 ********** kernel: [  295.305300] sd 3:0:0:0: [sdb] tag#1623 CDB: Read(16) 88 00 00 00 00 00 73 01 c0 00 00 00 08 00 00 00
Nov 28 02:56:31 ********** kernel: [  295.305373] sd 3:0:0:0: [sdb] tag#1624 FAILED Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK cmd_age=180s
Nov 28 02:56:31 ********** kernel: [  295.305375] sd 3:0:0:0: [sdb] tag#1624 CDB: Read(16) 88 00 00 00 00 00 73 01 e0 00 00 00 08 00 00 00
Nov 28 02:56:31 ********** kernel: [  295.305449] sd 3:0:0:0: [sdb] tag#1625 FAILED Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK cmd_age=180s
Nov 28 02:56:31 ********** kernel: [  295.305451] sd 3:0:0:0: [sdb] tag#1625 CDB: Read(16) 88 00 00 00 00 00 73 01 a8 00 00 00 08 00 00 00
Nov 28 02:56:31 ********** kernel: [  295.305529] sd 3:0:0:0: [sdb] tag#1626 FAILED Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK cmd_age=180s
Nov 28 02:56:31 ********** kernel: [  295.305531] sd 3:0:0:0: [sdb] tag#1626 CDB: Read(16) 88 00 00 00 00 00 73 01 a0 00 00 00 08 00 00 00
Nov 28 02:56:36 ********** kernel: [  300.324877] sd 3:0:0:0: [sdb] tag#1663 FAILED Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK cmd_age=180s
Nov 28 02:56:36 ********** kernel: [  300.324882] sd 3:0:0:0: [sdb] tag#1663 CDB: Read(16) 88 00 00 00 00 00 73 01 e8 00 00 00 08 00 00 00
Nov 28 02:56:36 ********** kernel: [  300.348905] sd 3:0:0:0: [sdb] tag#1605 FAILED Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK cmd_age=180s
Nov 28 02:56:36 ********** kernel: [  300.348912] sd 3:0:0:0: [sdb] tag#1605 CDB: Read(16) 88 00 00 00 00 00 ff ff ff 80 00 00 00 08 00 00
Nov 28 02:57:40 ********** kernel: [  364.840936] task:kworker/u48:1   state:D stack:    0 pid:  551 ppid:     2 flags:0x00004000
Nov 28 02:57:40 ********** kernel: [  364.840941] Workqueue: iscsi_eh __iscsi_block_session [scsi_transport_iscsi]
Nov 28 02:57:40 ********** kernel: [  364.840958] Call Trace:
Nov 28 02:57:40 ********** kernel: [  364.840962]  __schedule+0x2fa/0x910
Nov 28 02:57:40 ********** kernel: [  364.840968]  schedule+0x4f/0xc0
Nov 28 02:57:40 ********** kernel: [  364.840970]  schedule_preempt_disabled+0xe/0x10
Nov 28 02:57:40 ********** kernel: [  364.840973]  __mutex_lock.constprop.0+0x305/0x4d0
Nov 28 02:57:40 ********** kernel: [  364.840976]  ? select_task_rq_fair+0x16e/0x1300
Nov 28 02:57:40 ********** kernel: [  364.840981]  ? iscsi_dbg_trace+0x63/0x80 [scsi_transport_iscsi]
Nov 28 02:57:40 ********** kernel: [  364.840989]  __mutex_lock_slowpath+0x13/0x20
Nov 28 02:57:40 ********** kernel: [  364.840992]  mutex_lock+0x34/0x40
Nov 28 02:57:40 ********** kernel: [  364.840994]  device_block+0x28/0xc0
Nov 28 02:57:40 ********** kernel: [  364.840998]  starget_for_each_device+0xcd/0xf0
Nov 28 02:57:40 ********** kernel: [  364.841001]  ? scsi_mq_put_budget+0x60/0x60
Nov 28 02:57:40 ********** kernel: [  364.841004]  ? scsi_run_queue_async+0x60/0x60
Nov 28 02:57:40 ********** kernel: [  364.841006]  target_block+0x30/0x40
Nov 28 02:57:40 ********** kernel: [  364.841008]  device_for_each_child+0x5e/0xa0
Nov 28 02:57:40 ********** kernel: [  364.841010]  scsi_target_block+0x41/0x50
Nov 28 02:57:40 ********** kernel: [  364.841012]  __iscsi_block_session+0x6c/0xd0 [scsi_transport_iscsi]
Nov 28 02:57:40 ********** kernel: [  364.841021]  process_one_work+0x220/0x3c0
Nov 28 02:57:40 ********** kernel: [  364.841024]  worker_thread+0x53/0x420
Nov 28 02:57:40 ********** kernel: [  364.841025]  ? process_one_work+0x3c0/0x3c0
Nov 28 02:57:40 ********** kernel: [  364.841027]  kthread+0x12b/0x150
Nov 28 02:57:40 ********** kernel: [  364.841029]  ? set_kthread_struct+0x50/0x50
Nov 28 02:57:40 ********** kernel: [  364.841032]  ret_from_fork+0x22/0x30

An interesting point in the testing (on version 7) is that when you start the backup of a VM, it hangs for a couple of seconds before it starts dumping anything (at 0%). Correlating that with the tcpdump, there is a gap of a few seconds (approx. 5) in the packet flow, which is enough to trip the NOP-Out ping timeout on the TrueNAS iSCSI target, so it drops the connection.
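
To make that correlation less of an eyeball exercise, here is a minimal sketch that flags idle gaps in a list of packet timestamps. It is not the tooling I actually used. The function name is made up, the sample times are invented rather than taken from my capture, and the 5-second threshold is simply the value from the TrueNAS warning. The timestamps could come from something like tcpdump -tt output.

```python
def find_idle_gaps(timestamps, threshold=5.0):
    """Return (start, end, duration) for every pair of consecutive packets
    that are at least `threshold` seconds apart."""
    gaps = []
    for earlier, later in zip(timestamps, timestamps[1:]):
        duration = later - earlier
        if duration >= threshold:
            gaps.append((earlier, later, duration))
    return gaps


if __name__ == "__main__":
    # Invented capture times in seconds, for illustration only.
    sample = [0.00, 0.01, 0.02, 5.30, 5.31, 5.32]
    for start, end, duration in find_idle_gaps(sample):
        print(f"idle for {duration:.2f}s between t={start:.2f} and t={end:.2f}")
```

Any gap this reports at or above the target's ping timeout is a candidate for the moment the connection gets dropped.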


Code:
INFO: starting new backup job: vzdump 444 --compress zstd --mode snapshot --node pve**** --remove 0 --storage storage-nfs-*****-backup
INFO: Starting Backup of VM 444 (qemu)
INFO: Backup started at 2021-11-28 02:53:14
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: test
INFO: include disk 'virtio0' 'storage-lvm-*****:vm-444-disk-0' 10G
INFO: creating vzdump archive '/mnt/pve/storage-backup/dump/vzdump-qemu-444-2021_11_28-02_53_14.vma.zst'
INFO: starting kvm to execute backup task
INFO: started backup task '14433e69-bcb2-46f3-ad16-abedbf9b4b72'
INFO: started backup task '14433e69-bcb2-46f3-ad16-abedbf9b4b72'
INFO:   0% (27.0 MiB of 10.0 GiB) in 3s, read: 9.0 MiB/s, write: 241.3 KiB/s
INFO:   0% (45.0 MiB of 10.0 GiB) in 3m 21s, read: 93.1 KiB/s, write: 0 B/s
ERROR: job failed with err -5 - Input/output error
INFO: aborting backup job
INFO: stopping kvm after backup task
VM still running - terminating now with SIGTERM
VM still running - terminating now with SIGKILL
Volume group "**********" not found
can't deactivate LV '/dev/***********/vm-444-disk-0':   Cannot process volume group *************
volume deactivation failed: storage-lvm-********:vm-444-disk-0 at /usr/share/perl5/PVE/Storage.pm line 1142.
ERROR: Backup of VM 444 failed - job failed with err -5 - Input/output error
INFO: Failed at 2021-11-28 03:00:48
INFO: Backup job finished with errors
TASK ERROR: job errors

Code:
pve-manager/7.1-6/4e61e21c (running kernel: 5.13.19-1-pve)

The backup compression is zstd. I also tried playing around with the bandwidth limits in the datacenter.cfg file, and slow or fast, the outcome is the same. What iBug observed about "speed limiting" is possibly related to the fact that on a very fast network the timeout happens much sooner on the PVE side, but that can be very misleading.

Furthermore, I took another standby server, configured it with LIO, and exported a LUN to the Proxmox 7 setup. It ran without an issue, unlike the TrueNAS (v12) setup.
 
Also tested with

Code:
pve-manager/7.1-6/4e61e21c (running kernel: 5.11.22-4-pve)

and at 28% of the backup:

Code:
Nov 28 03:12:44 ********** kernel: [   72.880477] device tap444i0 entered promiscuous mode
Nov 28 03:13:05 ********** kernel: [   93.712445]  connection1:0: detected conn error (1020)
Nov 28 03:13:07 ********** kernel: [   95.720056] sd 3:0:0:0: Power-on or device reset occurred
Nov 28 03:13:12 ********** kernel: [  100.866692]  connection1:0: detected conn error (1020)
Nov 28 03:13:14 ********** kernel: [  102.870596] sd 3:0:0:0: Power-on or device reset occurred
Nov 28 03:13:23 ********** kernel: [  111.757008]  connection1:0: detected conn error (1020)
Nov 28 03:13:25 ********** kernel: [  113.764469] sd 3:0:0:0: Power-on or device reset occurred
Nov 28 03:13:31 ********** kernel: [  119.804100]  connection1:0: detected conn error (1020)
Nov 28 03:13:33 ********** kernel: [  121.812530] sd 3:0:0:0: Power-on or device reset occurred

The only difference on this kernel is that the backup keeps running even though the iSCSI connection has already been dropped.
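
To put numbers on how often the connection drops, the spacing between the "detected conn error" lines in the kernel log above can be computed from the uptime stamps. This is just a sketch: the function name and regular expression are my own, and the sample lines are copied from the log above with the hostname masked as posted.

```python
import re

# Captures the kernel uptime stamp of an iSCSI "detected conn error" line.
CONN_ERROR = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+connection\d+:\d+: detected conn error \(\d+\)"
)


def conn_error_intervals(lines):
    """Return the seconds between consecutive conn-error lines."""
    stamps = []
    for line in lines:
        match = CONN_ERROR.search(line)
        if match:
            stamps.append(float(match.group(1)))
    return [round(later - earlier, 3) for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    log = [
        "Nov 28 03:13:05 ********** kernel: [   93.712445]  connection1:0: detected conn error (1020)",
        "Nov 28 03:13:07 ********** kernel: [   95.720056] sd 3:0:0:0: Power-on or device reset occurred",
        "Nov 28 03:13:12 ********** kernel: [  100.866692]  connection1:0: detected conn error (1020)",
        "Nov 28 03:13:23 ********** kernel: [  111.757008]  connection1:0: detected conn error (1020)",
        "Nov 28 03:13:31 ********** kernel: [  119.804100]  connection1:0: detected conn error (1020)",
    ]
    print(conn_error_intervals(log))
```

On this sample the errors are roughly 7 to 11 seconds apart, so at least here they are not as evenly spaced as the 13 seconds iBug saw.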

Code:
INFO: starting kvm to execute backup task
INFO: started backup task '72a8a080-d245-4d2b-84bf-b99260bdf626'
INFO:  11% (1.2 GiB of 10.0 GiB) in 3s, read: 402.0 MiB/s, write: 61.7 MiB/s
INFO:  14% (1.5 GiB of 10.0 GiB) in 6s, read: 97.0 MiB/s, write: 95.7 MiB/s
INFO:  18% (1.9 GiB of 10.0 GiB) in 9s, read: 132.9 MiB/s, write: 119.8 MiB/s
INFO:  24% (2.4 GiB of 10.0 GiB) in 12s, read: 188.7 MiB/s, write: 111.6 MiB/s
INFO:  26% (2.6 GiB of 10.0 GiB) in 15s, read: 78.3 MiB/s, write: 62.5 MiB/s
INFO:  27% (2.8 GiB of 10.0 GiB) in 18s, read: 55.3 MiB/s, write: 18.0 MiB/s
INFO:  28% (2.8 GiB of 10.0 GiB) in 23s, read: 1.8 MiB/s, write: 0 B/s
INFO:  30% (3.0 GiB of 10.0 GiB) in 30s, read: 29.9 MiB/s, write: 593.7 KiB/s
INFO:  33% (3.3 GiB of 10.0 GiB) in 33s, read: 102.9 MiB/s, write: 67.0 MiB/s
INFO:  34% (3.4 GiB of 10.0 GiB) in 49s, read: 5.8 MiB/s, write: 652.5 KiB/s
INFO:  36% (3.7 GiB of 10.0 GiB) in 1m 38s, read: 5.6 MiB/s, write: 0 B/s
INFO:  37% (3.7 GiB of 10.0 GiB) in 1m 55s, read: 3.4 MiB/s, write: 0 B/s
INFO:  39% (3.9 GiB of 10.0 GiB) in 2m 7s, read: 15.3 MiB/s, write: 0 B/s
INFO:  44% (4.5 GiB of 10.0 GiB) in 2m 10s, read: 189.3 MiB/s, write: 1.7 MiB/s
INFO:  47% (4.7 GiB of 10.0 GiB) in 2m 17s, read: 38.4 MiB/s, write: 0 B/s
 
Has there been any progress with this? We rolled back to PVE6 so we had a stable system but given its upcoming EoL we need a solution for PVE7. I'm not sure why this issue isn't more widespread.
 
@avladulescu @benh7 Strangely enough, our issues went away after we enabled jumbo frames (9000-byte MTU) on the management network that the problematic NIC is connected to. You might want to give this a try unless you have incompatible devices.

We were able to bisect the cause to QEMU 6.0 adding high load to the iSCSI interface (see this) and temporarily downgraded to QEMU 5.2, though we're currently staying with jumbo frames and never ran into this again.
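
If you try jumbo frames, it is worth checking that 9000-byte packets really pass end to end before trusting the result. A common check is a ping with the don't-fragment bit set and the largest payload that fits the MTU. The sketch below only builds that command. The function names are made up, 192.0.2.10 is a documentation address standing in for your iSCSI portal, and it assumes IPv4 without IP options and the Linux iputils ping.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes


def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def df_ping_command(host, mtu=9000, count=3):
    """Build a ping command with fragmentation prohibited (-M do)."""
    return ["ping", "-M", "do", "-c", str(count), "-s", str(max_ping_payload(mtu)), host]


if __name__ == "__main__":
    # Placeholder target; replace it with the storage server's address.
    print(" ".join(df_ping_command("192.0.2.10")))
```

If that ping fails while a default-size ping works, some device on the path is probably still at a smaller MTU.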
 
Thanks for the hint on the bug mate.

We do run jumbo frames inside our networks and have separate Arista switches with MLAG for "storage network" traffic, so I'll dig into it and check the linked bug.

Tell me, did you put QEMU 5.2 on hold in apt, so that running a system package upgrade doesn't break things with the latest 7.x packages?

thx
Alex
 
@iBug Thanks, that confirms what I've found as well. On my PVE7.1 test server I removed the multipath config and found I still had the issue with a single iSCSI connection to our SAN. I enabled jumbo frames on the SAN and changed the MTU of my NIC to 9000. Suddenly I had a stable iSCSI connection.

Today, I have enabled multipath again and our system is still stable. This appears to have fixed the issue so after a bit more testing I'll be attempting an upgrade to PVE7.1 on our live system again.
 
@avladulescu For the time being we did put pve-qemu-kvm on hold. When we found that 9 KB MTU solved the issue for us, we released the hold and followed the latest versions. We've been running fine since.

@benh7 Glad that I could help.