Maybe it is worth noting that after we moved our cluster traffic to a dedicated network we had no more such issues.
When we had these issues, the QEMU agent was enabled and in use on most of the problematic VMs.
Try to use the QEMU agent, it solved this problem here.
We only use it for cluster traffic, nothing else.
We also have a dedicated cluster and migration network (2 nodes only), but we are still facing this issue.
Is there anything else I can do to assist in debugging this issue? We really would like to run backups again. Restarting the VMs every other day is quite annoying.
Here too.
When we had these issues, the QEMU agent was enabled and in use on most of the problematic VMs.
Please test with the new version for some time. We didn't have any problems for 1.5 weeks, but then 5 VMs died. That was with the old version, though, so we are hoping the new version fixes it. Many thanks!
I tested with pve-qemu-kvm 4.0.0-7, and it works very well so far! Thanks for getting this fixed!
Yes, you need to shut down the VMs and start them again. Hibernating and starting them again might also help, but I didn't test this.
Updating pve-qemu-kvm didn't help for me.
Or do I need to restart the broken VM?
Just for clarification, I have a question in the context of starting/restarting a machine:
If I migrate a VM online, a new QEMU thread is started on the destination host. Does this count as a restart of the QEMU process, the same way as if I had shut down the machine and then started it again?
INFO: starting new backup job: vzdump 1001 --remove 0 --node p2 --mode snapshot --storage nfs-n3-pvebackup --mailto sd@schnied.net --compress lzo
INFO: Starting Backup of VM 1001 (qemu)
INFO: Backup started at 2019-10-16 07:57:35
INFO: status = running
INFO: update VM 1001: -lock backup
INFO: VM Name: f2.in.of.sd.vc
INFO: include disk 'scsi0' 'ssd_vm:vm-1001-disk-0' 8G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/nfs-n3-pvebackup/dump/vzdump-qemu-1001-2019_10_16-07_57_35.vma.lzo'
ERROR: got timeout
INFO: aborting backup job
ERROR: VM 1001 qmp command 'backup-cancel' failed - got timeout
ERROR: Backup of VM 1001 failed - got timeout
INFO: Failed at 2019-10-16 08:07:40
INFO: Backup job finished with errors
TASK ERROR: job errors
VM 1001 qmp command 'change' failed - got timeout
TASK ERROR: Failed to run vncproxy.
2019-10-16 08:18:07 ERROR: migration aborted (duration 00:00:03): VM 1001 qmp command 'query-machines' failed - got timeout
TASK ERROR: migration aborted
EDIT:
Sorry, I see now that I have to restart the VMs.
I will try it (but it's quite a lot of work).
#############
Hey,
after suffering for a while now, I'm a bit confused about the solution.
- fabian said: please try pve-qemu-kvm 4.0.0-6 from pvetest.
- vmctec said: "Upgraded to pve-manager/6.0-7/28984024", but in my case that would be a downgrade; 6.0-7 is from Sept 3.
Which package causes the problem?
See post #59.
And what was the exact issue?
I know what the issue was, I started this thread...
See post #59.
The VM didn't react; we got timeouts randomly after 1 or 2 days, during our nightly backups or when we tried to use the console.
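In case it helps anyone checking which VMs are hit: the common symptom in the task logs posted in this thread is a line of the form "VM <id> qmp command '<cmd>' failed - got timeout". Below is a minimal Python sketch that scans log text for those lines and groups the timed-out commands per VM. The function name `find_qmp_timeouts` and the sample text are my own and only for illustration; the pattern is taken from the log lines quoted in this thread and may not match other versions' wording.

```python
import re

# Pattern based on the log lines quoted in this thread, e.g.:
#   ERROR: VM 1001 qmp command 'backup-cancel' failed - got timeout
#   VM 1001 qmp command 'change' failed - got timeout
QMP_TIMEOUT = re.compile(r"VM (\d+) qmp command '([^']+)' failed - got timeout")


def find_qmp_timeouts(log_text):
    """Return {vmid: [qmp commands that timed out]} for the given log text.

    Hypothetical helper for illustration, not part of any Proxmox tool.
    """
    hits = {}
    for line in log_text.splitlines():
        match = QMP_TIMEOUT.search(line)
        if match:
            vmid, command = int(match.group(1)), match.group(2)
            hits.setdefault(vmid, []).append(command)
    return hits


# Sample lines copied from the logs posted in this thread.
sample = """\
ERROR: got timeout
ERROR: VM 1001 qmp command 'backup-cancel' failed - got timeout
VM 1001 qmp command 'change' failed - got timeout
2019-10-16 08:18:07 ERROR: migration aborted (duration 00:00:03): VM 1001 qmp command 'query-machines' failed - got timeout
"""

if __name__ == "__main__":
    print(find_qmp_timeouts(sample))
```

A VM that shows up here with several different commands (backup, console, migration) is most likely one whose QEMU process no longer answers on the monitor socket at all, which fits the advice above to shut it down and start it again.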