Redhat VirtIO developers would like to coordinate with Proxmox devs re: "[vioscsi] Reset to device ... system unresponsive"

OK, switching to threads does not help either. And again, the errors only occur a few minutes after the backup has been completed.
 
How to do that (reduce the number of concurrent backups)? In my cluster there is only 1 backup job, and if I'm not mistaken, PVE serializes backups from all nodes within the cluster.
Yes, a backup job will run in parallel on each node (one backup per node at a time).
P.S. Right, the VM hangs after backup (when another backup is in progress). Windows and Linux VMs are affected.
What error message do you get in Linux?
Not in my case. 40GbE link. There is a powerful CPU and HW RAID with SSD disks on the PBS server.
Ceph datastore performance is a key limiting factor (for backup speed)

And one more thing:
VM "hangs" (with different event to windows syslog + ID 129 as well) when another VM is backing up (that VM is located on another node)
Is there any way to limit the read/write bandwidth of the PBS client (something like bwlimit in vzdump.conf)?
Yes, exactly that setting. That should limit the background copying of the image during backup, so Ceph might not get overwhelmed and guest writes might have a better chance to get through.

How should I change my VM config to incorporate this tuning?

P.S. there is another thread: https://github.com/virtio-win/kvm-guest-drivers-windows/issues/756
That's unfortunately not possible without using the args override to specify arbitrary CLI arguments.
 
What error message do you get in Linux?

1) It died silently. It was not even possible to log in via the console (it was not frozen, but no services were available).

Yes, exactly that setting. That should limit the background copying of the image during backup, so Ceph might not get overwhelmed and guest writes might have a better chance to get through.

2) Let me make it clear: does PBS respect the bwlimit parameter in vzdump.conf? In other words: if I set bwlimit in vzdump.conf, will the PBS client limit the backup speed? Am I correct?



3) And which kind of limit would be the most suitable:
  • a) bwlimit in vzdump.conf
  • b) traffic control in PBS
  • c) Datacenter - Options - Bandwidth Limits


Thanks in advance
 
But what about the problems I have described? The error only occurs a few minutes after the backup. And not just on one system.
 

Have you tried tuning virtio-scsi parameters as advised by virtio devs?
  • HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device
    • IoTimeoutValue = 0x5a (90)
  • HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device
    • PhysicalBreaks = 0x3f (63)
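
In case it saves someone a lookup, a minimal sketch of applying those two values from an elevated command prompt inside the guest (same values as above; a reboot is needed for the driver to pick them up):

Code:
REM IoTimeoutValue = 0x5a (90 seconds), PhysicalBreaks = 0x3f (63), as suggested above
reg add "HKLM\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device" /v IoTimeoutValue /t REG_DWORD /d 90 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device" /v PhysicalBreaks /t REG_DWORD /d 63 /f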
 

I will try the following:
  • tune virtio-scsi driver settings via the Windows registry
  • set ionice=8 in vzdump.conf on all nodes in the cluster
 
2) Let me make it clear: does PBS respect the bwlimit parameter in vzdump.conf? In other words: if I set bwlimit in vzdump.conf, will the PBS client limit the backup speed? Am I correct?
Yes.
3) And which kind of limit would be the most suitable:
  • a) bwlimit in vzdump.conf
Yes, that is the setting you should use.
  • b) traffic control in PBS
If you limit on the PBS-side, then guest writes during backup will also suffer from that. It's better to try and limit the reading side, i.e. Proxmox VE.
  • c) Datacenter - Options - Bandwidth Limits
Those are various cluster-wide defaults for certain operations, but there is none for backups.
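
For what it's worth, a minimal sketch of option a) in /etc/vzdump.conf (the 150000 KiB/s value is purely illustrative; pick something below what your Ceph pool can comfortably sustain):

Code:
# /etc/vzdump.conf -- node-wide vzdump defaults; bwlimit is given in KiB/s
bwlimit: 150000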

I will try the following:
  • tune virtio-scsi driver settings via the Windows registry
  • set ionice=8 in vzdump.conf on all nodes in the cluster
The ionice setting doesn't usually take any effect, in particular with VM backups to PBS it doesn't: https://pve.proxmox.com/pve-docs/chapter-vzdump.html#vzdump_configuration
 
I think it can't be right that I have to set timeouts to such a high value just so that it doesn't lead to an error. Or that I have to limit bandwidth just so that something runs without errors.
 

What do you mean by "such high values"?
Defaults are: 2 MB with a 60 s timeout; tuned: 256 KB with a 90 s timeout.
You are free to test just one of them anyway
 
I have never had to change such values in environments with Hyper-V and Veeam. That's just unfamiliar for me at first. Of course I will test this and will be happy to provide feedback. With such findings, perhaps the whole thing can be taken further in the right direction.
 

PVE and VirtIO devs are already aware of this problem, and I hope they will find out what goes wrong and fix it ASAP
 
I think a fix is especially needed in the design/architecture, i.e. VM I/O needs to be made independent of the backup path/speed/availability.

THIS is the REAL issue (and has always been).

But I guess it's not easy for the Proxmox devs to solve, because QEMU upstream is also involved in this issue.
 
The ionice setting doesn't usually take any effect, in particular with VM backups to PBS it doesn't: https://pve.proxmox.com/pve-docs/chapter-vzdump.html#vzdump_configuration

However, with ionice=8 set in vzdump.conf, I can see it in the backup log

INFO: starting new backup job: vzdump --exclude 101,100,103 --notes-template '{{guestname}}' --storage PBS --mode snapshot --mailto ... --all 1 --mailnotification failure --node 063-pve-04446
INFO: Starting Backup of VM 6302 (qemu)
INFO: Backup started at 2024-01-09 19:37:52
INFO: status = running
INFO: VM Name: rdc001
INFO: include disk 'scsi0' 'rbd:vm-6302-disk-0' 100G
INFO: backup mode: snapshot
INFO: bandwidth limit: 150000 KB/s
INFO: ionice priority: 8
INFO: creating Proxmox Backup Server archive 'vm/6302/2024-01-09T16:37:52Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task 'bac78841-47d5-4bf1-853c-170bfa279b7f'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: OK (4.2 GiB of 100.0 GiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 4.2 GiB dirty of 100.0 GiB total
INFO: 15% (660.0 MiB of 4.2 GiB) in 3s, read: 220.0 MiB/s, write: 217.3 MiB/s
INFO: 25% (1.1 GiB of 4.2 GiB) in 6s, read: 145.3 MiB/s, write: 145.3 MiB/s
INFO: 35% (1.5 GiB of 4.2 GiB) in 9s, read: 148.0 MiB/s, write: 132.0 MiB/s
INFO: 45% (1.9 GiB of 4.2 GiB) in 12s, read: 141.3 MiB/s, write: 138.7 MiB/s
 
I performed some tests on my cluster and can confirm that tuning vzdump.conf can be used as a workaround.

Max average throughput of my 5-node cluster (with Ceph) = ~800 MiB/s, and of the PBS storage = ~300 MiB/s.

1) I set vzdump.conf as follows:
Code:
bwlimit: 150000
ionice: 8

2) On PBS I limited the input rate to 250 MiB/s with a traffic control rule
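
For reference, a sketch of how such a rule can be created on the PBS side (rule name and network are placeholders; the rule can also be managed via the PBS GUI):

Code:
# on the PBS host: cap incoming backup traffic from the PVE cluster network (illustrative values)
proxmox-backup-manager traffic-control create limit-pve --network 192.168.0.0/24 --rate-in 250MB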

With these options set I got:
  • in the worst case (when there are lots of clean blocks, i.e. unchanged data), all 5 nodes will occupy 5*150 = 750 MiB/s of Ceph throughput (still some room left for VM I/O on Ceph)
...
INFO: 46% (42.4 GiB of 92.0 GiB) in 26m 46s, read: 67.1 MiB/s, write: 0 B/s
INFO: 47% (43.4 GiB of 92.0 GiB) in 26m 53s, read: 148.6 MiB/s, write: 0 B/s
INFO: 48% (44.2 GiB of 92.0 GiB) in 26m 59s, read: 146.7 MiB/s, write: 0 B/s
INFO: 49% (45.1 GiB of 92.0 GiB) in 27m 15s, read: 55.2 MiB/s, write: 5.0 MiB/s
INFO: 50% (46.1 GiB of 92.0 GiB) in 27m 38s, read: 44.0 MiB/s, write: 30.3 MiB/s
INFO: 51% (47.0 GiB of 92.0 GiB) in 27m 47s, read: 104.0 MiB/s, write: 0 B/s
INFO: 52% (47.9 GiB of 92.0 GiB) in 27m 53s, read: 146.7 MiB/s, write: 0 B/s
INFO: 53% (48.9 GiB of 92.0 GiB) in 28m, read: 146.3 MiB/s, write: 0 B/s
INFO: 54% (49.8 GiB of 92.0 GiB) in 28m 46s, read: 20.1 MiB/s, write: 6.3 MiB/s
...

  • in the case of dirty blocks (changed data), backup speed will be limited by the total PBS input rate limit

  • backup (vzdump) starts with ionice=8 and has a lower priority than other processes (works when idle)
The total backup process took ~30% longer, but there were no errors in the VMs. All VMs were online and all services could be accessed.



However, the best solution (IMO) would be a feature that serializes backups across the whole cluster (back up VMs one by one), as well as fixing the virtio-scsi driver.
 
However, with ionice=8 set in vzdump.conf, I can see it in the backup log
Yes, it will be set for the vzdump process and child processes. But it only takes effect when you use the BFQ scheduler. And for VMs, the backup is done via the QEMU process, to which the ionice setting is not applied (that would also affect guest writes). And when using PBS, vzdump doesn't spawn a compressor child process. So in the VM+PBS case, or if not using the BFQ scheduler, the setting is essentially without effect.
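
For completeness, if you do want ionice to matter for the cases it still covers, a quick sketch for checking and switching the I/O scheduler of a host disk (sdX is a placeholder; this still won't influence the QEMU-internal backup reads described above):

Code:
# show the active scheduler (the one in [brackets]) for every block device
grep "" /sys/block/*/queue/scheduler
# switch one disk to BFQ (not persistent across reboots)
echo bfq > /sys/block/sdX/queue/scheduler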
 
This should be fixed with pve-qemu-kvm package updated to 8.1.2-5.
 
Hi,
no, that is not true. If your PBS connection is too slow or gets lost, the issue described in this thread is still present.
 
So backup fleecing doesn't help...? Or only in some scenarios?

Is it fair to say that the PBS issue is just one manifestation of a group of IO / hang problems that many are hoping Vadim @ RH's work on the virtio drivers might fix? Is it even possible that the PBS issue is unrelated to the original issue that Vadim reached out for?

I - like many - saw the IO issue present itself every time I performed a Windows cumulative update. Sometimes this is obscured as a boot failure because, after crashing, the VM might attempt a boot-time chkdsk, resulting in more IO, which exacerbates the problem. This was often reported during initial patching following Windows setup. These reports were ultimately unresolved because users were leaving the workaround in place (using IDE or SATA instead - I believe the default LSI works too), e.g. as mentioned here. Updating just one VM could bring the whole server down, even if I shut down most of my other VMs on the affected host. In my case, I think several issues were at play, which I have attempted to provide some details of in the thread I started here.

OK, switching to threads does not help either. And again, the errors only occur a few minutes after the backup has been completed.

Worked for me per https://bugzilla.kernel.org/show_bug.cgi?id=199727#c24. (not my post btw)

I note you do have to enable iothread=1, not just aio=threads. Very important..!! You did do that, right..?

I suspect the issue is in the io_uring and native Async IO implementations. Perhaps a shared library threads doesn't use...? Whether it is Proxmox-specific or upstream I couldn't say. It's unfortunate I never tried threads in the 19 months (!!) I was troubleshooting this. All the materials I read sold me on io_uring or native, and as a result I never even trialled threads. I suspect this will prove to be a common theme amongst users.
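
For anyone who wants to test the same combination, a sketch of setting it on the PVE side (VMID 100 and the local-lvm disk name are placeholders; iothread needs the VirtIO SCSI single controller):

Code:
# use the VirtIO SCSI single controller so the disk gets its own I/O thread
qm set 100 --scsihw virtio-scsi-single
# re-specify the existing disk with aio=threads and iothread=1
qm set 100 --scsi0 local-lvm:vm-100-disk-0,aio=threads,iothread=1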

Have you tried tuning virtio-scsi parameters as advised by virtio devs?
  • HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device
    • IoTimeoutValue = 0x5a (90)
  • HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device
    • PhysicalBreaks = 0x3f (63)

The IoTimeoutValue has been implemented since Windows 8. It has been much more common practice in MSCS and SAN-based environments to set the HKLM\System\CurrentControlSet\Services\Disk\TimeoutValue global value instead (due to historical reasons, i.e. it has been available since NT 4.0 IIRC). The default for this is 60s, which really should be plenty. In the absence of either registry value, the default is 10s. That being said, it is possible the driver does not properly implement the StorPort miniport-specific IoTimeoutValue and IoLatencyCap tuning parameters in the registry (there are others too, but the defaults should suffice). These would exist at HKLM\System\CurrentControlSet\Services\vioscsi\Parameters (no "Device" key at the end). This can also be implemented at the class level at HKLM\System\CurrentControlSet\Services\Disk\IoTimeoutValue, as version 0.1.248 of the driver does. It is my view that the defaults (60s) for TimeoutValue and IoTimeoutValue will be sufficient in almost all cases.
More info:
https://learn.microsoft.com/en-us/w...ge/registry-entries-for-scsi-miniport-drivers
https://learn.microsoft.com/en-us/w...egistry-entries-for-storport-miniport-drivers
https://github.com/virtio-win/kvm-guest-drivers-windows/issues/907 (CAUTION: There are more than a few errors in this post, but it is an informative thread)
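
If you want to check what is actually configured in a given guest, a quick sketch from an elevated prompt (keys as named above; reg query only reads, it changes nothing):

Code:
REM check whether the global disk class timeout is explicitly configured (value in seconds)
reg query "HKLM\SYSTEM\CurrentControlSet\Services\Disk" /v TimeoutValue
REM list any StorPort miniport tuning values under the vioscsi Parameters key (no "Device" subkey)
reg query "HKLM\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters"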

The PhysicalBreaks registry entry actually needs to be set to less than the max_segments size for each backing block device.
This can be determined by issuing the command grep "" /sys/block/*/queue/max_segments.
If, for example, your block device has a max_segments of 60, it should be set to no more than 59, but one might consider setting it to 32 instead.
More info: https://github.com/virtio-win/kvm-guest-drivers-windows/issues/827
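
Continuing that hypothetical example (a backing device with max_segments of 60), applying it inside the guest from an elevated prompt could look like this (59 and 32 are the values mentioned above; reboot afterwards):

Code:
REM PhysicalBreaks must stay below the host's max_segments (60 in this example), so 59 at most
reg add "HKLM\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device" /v PhysicalBreaks /t REG_DWORD /d 59 /f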
 
