Well, we finally (almost) solved the issue by increasing the capacity of the backup path, mainly by adding extra RAM on the storage side. Now it can cache far more data, and Proxmox can send data to it "at full throttle" without having to wait for write operations. It still happens under very rare conditions, but given the number of VMs and how rarely it occurs, we're fine with the current situation.
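If you can't simply add hardware, one mitigation worth trying is to cap how fast restores and migrations are allowed to push data. This is only a sketch, assuming a reasonably recent PVE release that has the datacenter-wide bwlimit option in /etc/pve/datacenter.cfg (values are in KiB/s, and the numbers below are placeholders, not a recommendation):

    bwlimit: restore=204800,migration=204800,default=307200

That obviously just trades restore speed for responsiveness of the running VMs, but at least it keeps a single restore job from saturating the whole path.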
The problem was always less serious with backups, especially if you applied the tweaks I posted above... some VMs were more susceptible (Debian 7), some not at all (Ubuntu 14/15/16.04, Debian 9). But the real problem was always with restores and migrations: try to restore (or migrate) a big VM to local storage while you have active web or application serving VMs running, and you will see a lot of these errors on their consoles (these screengrabs are fresh, taken today).
Debian 6, IDE qcow2 on ZFS
Debian 7, Virtio qcow2 on ZFS
Debian 7, Virtio qcow2 on ZFS
Of course the same thing happens with the IDE, Virtio and Virtio-SCSI interfaces; only the console errors differ. Network connections are disrupted, tasks are blocked or hung, and sometimes the kernel freaks out entirely. This is a QEMU / KVM / kernel issue, and no one seems to acknowledge it; even big companies like Red Hat only post mitigation strategies, as if this were an expected side effect of using KVM. The weird thing is that not even the Proxmox developers have acknowledged this as a real problem, despite the fact that many of us have been reporting these issues for years.
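For anyone landing here from a search, the sort of guest-side mitigation those articles describe usually boils down to making the guest flush dirty pages earlier and relaxing the hung-task watchdog. Take this as an illustrative sketch (not necessarily identical to the tweaks I posted above, and the values need tuning for your workload), set inside the VM, e.g. via /etc/sysctl.conf:

    vm.dirty_background_ratio = 5
    vm.dirty_ratio = 10
    kernel.hung_task_timeout_secs = 300

To be clear, this only helps the guests ride out the I/O stall a bit more gracefully, and the last line merely changes when the "blocked for more than 120 seconds" messages appear; the stall on the host side is still there, which is exactly my point.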
Here is my bug report on the Proxmox Bugzilla:
https://bugzilla.proxmox.com/show_bug.cgi?id=1453