Hi,
We had a similar problem, but it is really difficult to put together a precise bug report!
Yes, the problems started after an upgrade, from PVE 3.1 to the latest version at that time (2014-09-29).
Before that, I upgraded a first Proxmox cluster as soon as v3.3 was out. No problems. Mainly Linux machines.
A week later, a second Proxmox cluster, five nodes, was upgraded. All at once, via cssh:
apt-get update ; apt-get dist-upgrade
No problems. Mainly Linux VMs. "Mainly" because the Windows machines are under very low load.
A week later again, a 3-node cluster was updated. One node at a time: moving the VMs off, updating, restarting the node, moving the VMs back, and so on.
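For reference, the per-node procedure was roughly the following (just a sketch; the VM ID and node names are placeholders, and every VM on the node gets the same treatment):

# live-migrate each VM off the node that will be updated
qm migrate 101 node2 --online
# update and reboot the now-empty node
apt-get update ; apt-get dist-upgrade
reboot
# once the updated node is back up, migrate the VMs home again (run from the node they were parked on)
qm migrate 101 node1 --online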
Mainly BIG Windows machines: 2012R2, RDS farms (with hundreds of user sessions...), brokers, RemoteApp, App-V and web gateways. 20 GB of RAM on each RDS server. All those Windows VMs use virtio. No ballooning service enabled (RDS doesn't like it!).
Bad luck: VMs randomly became unresponsive, sometimes 2-3 times a day, sometimes not at all for 3 days.
Nothing special about those machines. Nearly identical ones, namely the DFS servers, never crashed or became unresponsive.
A special note: ALL those unresponsive machines log event ID 129, viostor reset. Not on their own disk, which became unreachable, but on a Kibana / Logstash / Elasticsearch log centralizer on the network.
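(If anyone wants to check their own logs: we just query the centralizer for that event ID, something like the line below. The field names are only how our Logstash pipeline happens to map the Windows event log, so treat them as an assumption.)

curl 'http://localhost:9200/logstash-*/_search?q=EventID:129+AND+SourceName:viostor&pretty'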
We decided to move back from virtio to IDE for the disks and e1000 for the NICs. No more problems after that! Not one. It is not the best solution from a performance point of view, but after 2 weeks without a single crash, it seems to be a stable one.
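(For anyone wanting to do the same: the change is only the bus/model in the VM config, e.g. in /etc/pve/qemu-server/<vmid>.conf, roughly as sketched below. Storage name, disk size and MAC address are placeholders, and the guest needs a full stop/start for the change to take effect.)

# before (virtio disk and NIC):
virtio0: local:100/vm-100-disk-1.raw,size=60G
net0: virtio=DE:AD:BE:EF:00:01,bridge=vmbr0
bootdisk: virtio0
# after (IDE disk and e1000 NIC):
ide0: local:100/vm-100-disk-1.raw,size=60G
net0: e1000=DE:AD:BE:EF:00:01,bridge=vmbr0
bootdisk: ide0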
I have dpkg.log if needed (Monday).
The virtio driver version doesn't seem to play a role: everything from the latest one down to two versions earlier triggered the problem.
What changed in QEMU?
Christophe.