Backup fails

Tom7320

Hi

Yesterday I updated PVE from 6.1 to 6.2. Nothing else changed. Since then, the scheduled backup of one of my four VMs no longer works:

Code:
vzdump --all 1 --quiet 1 --storage backup-hdd --compress zstd --node pve --mode snapshot --mailnotification always --mailto ts@fenta.org


100: 2020-07-05 01:00:02 INFO: Starting Backup of VM 100 (qemu)
100: 2020-07-05 01:00:02 INFO: status = running
100: 2020-07-05 01:00:02 INFO: VM Name: WinSrv2016
100: 2020-07-05 01:00:02 INFO: include disk 'scsi0' 'local-lvm:vm-100-disk-0' 50G
100: 2020-07-05 01:00:02 INFO: include disk 'scsi1' 'local-lvm:vm-100-disk-1' 600G
100: 2020-07-05 01:00:02 INFO: backup mode: snapshot
100: 2020-07-05 01:00:02 INFO: ionice priority: 7
100: 2020-07-05 01:00:02 INFO: creating archive '/mnt/backup/dump/vzdump-qemu-100-2020_07_05-01_00_02.vma.zst'
100: 2020-07-05 01:00:02 INFO: issuing guest-agent 'fs-freeze' command
100: 2020-07-05 01:00:04 INFO: issuing guest-agent 'fs-thaw' command
100: 2020-07-05 01:00:05 INFO: started backup task '06e9d1f7-1fdf-4ea3-9d4e-68e3fc25f25f'
100: 2020-07-05 01:00:05 INFO: resuming VM again
100: 2020-07-05 01:00:08 INFO: status: 0% (525664256/697932185600), sparse 0% (189841408), duration 3, read/write 175/111 MB/s
100: 2020-07-05 01:01:30 INFO: status: 1% (6994001920/697932185600), sparse 0% (764583936), duration 85, read/write 78/71 MB/s
100: 2020-07-05 01:03:00 INFO: status: 2% (13963755520/697932185600), sparse 0% (1589239808), duration 175, read/write 77/68 MB/s
100: 2020-07-05 01:04:16 INFO: status: 3% (20939866112/697932185600), sparse 0% (3565346816), duration 251, read/write 91/65 MB/s
100: 2020-07-05 01:05:30 INFO: status: 4% (27968405504/697932185600), sparse 0% (5616566272), duration 325, read/write 94/67 MB/s
100: 2020-07-05 01:06:13 INFO: status: 5% (34965159936/697932185600), sparse 1% (10454634496), duration 368, read/write 162/50 MB/s
100: 2020-07-05 01:06:51 INFO: status: 6% (42029613056/697932185600), sparse 2% (15803584512), duration 406, read/write 185/45 MB/s
100: 2020-07-05 01:07:07 INFO: status: 7% (48911482880/697932185600), sparse 3% (22685388800), duration 422, read/write 430/0 MB/s
100: 2020-07-05 01:17:44 ERROR: VM 100 qmp command 'query-backup' failed - got timeout
100: 2020-07-05 01:17:44 INFO: aborting backup job
100: 2020-07-05 01:27:44 ERROR: VM 100 qmp command 'backup-cancel' failed - unable to connect to VM 100 qmp socket - timeout after 5980 retries
100: 2020-07-05 01:27:45 ERROR: Backup of VM 100 failed - VM 100 qmp command 'query-backup' failed - got timeout

After that the VM is no longer reachable. I also cannot shut down or reset the VM ( VM 100 qmp command 'system_reset' failed - unable to connect to VM 100 qmp socket - timeout after 31 retries ). The GUI says "Guest Agent not running". I have to force-stop the VM, which is horrible...

Interestingly enough, triggering the backup job manually ("Run now" in the GUI) does work. I tried it twice yesterday. Coincidence? Backups of the other VMs (Linux and Windows Server) also work. Only this one VM is affected.

What's the problem? How can I solve it?

THX a lot!

Thorsten

PS: Windows VirtIO drivers are installed from Fedora stable ISO (0.1.171).
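
In the log above, the last progress line is at 01:07:07 and the 'query-backup' timeout only shows up at 01:17:44, so the job was silent for over ten minutes before it was aborted. For spotting such gaps in longer vzdump logs, something like the following might help. It is only a rough sketch: `find_stalls` and the 300-second threshold are my own hypothetical choices, not part of any PVE tooling, and the regex assumes the "VMID: date time LEVEL: message" line shape seen in this log.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the log in this thread:
# "100: 2020-07-05 01:07:07 INFO: message"
LINE_RE = re.compile(
    r"^(\d+): (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (INFO|ERROR): (.*)$"
)


def find_stalls(log_text, threshold_s=300):
    """Return (gap_seconds, previous_message, next_message) for every pair of
    consecutive log lines that are more than threshold_s seconds apart."""
    stalls = []
    prev = None
    for raw in log_text.splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # ignore anything that is not a timestamped log line
        ts = datetime.strptime(m.group(2), "%Y-%m-%d %H:%M:%S")
        msg = m.group(4)
        if prev is not None:
            gap = (ts - prev[0]).total_seconds()
            if gap > threshold_s:
                stalls.append((int(gap), prev[1], msg))
        prev = (ts, msg)
    return stalls


# Last four lines of the failed backup log from this thread.
sample = """\
100: 2020-07-05 01:07:07 INFO: status: 7% (48911482880/697932185600), sparse 3% (22685388800), duration 422, read/write 430/0 MB/s
100: 2020-07-05 01:17:44 ERROR: VM 100 qmp command 'query-backup' failed - got timeout
100: 2020-07-05 01:17:44 INFO: aborting backup job
100: 2020-07-05 01:27:44 ERROR: VM 100 qmp command 'backup-cancel' failed - unable to connect to VM 100 qmp socket - timeout after 5980 retries
"""

for seconds, before, after in find_stalls(sample):
    print(f"{seconds}s of silence before: {after}")
```

Run against a full task log, this would list every point where the backup stopped reporting progress, which may help correlate the hang with other events on the host.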
 
This is turning into a serious problem. The Windows Server 2016 VM mentioned above sometimes becomes unreachable after a while. RDP does not work, and neither does the console:
Code:
VM 100 qmp command 'change' failed - unable to connect to VM 100 qmp socket - timeout after 600 retries
TASK ERROR: Failed to run vncproxy.
CPU usage climbs and the machine becomes unreachable:

[Screenshots: 1593954757374.png, 1593955343429.png]

No changes within the VM. This never happened before the PVE 6.2 update.... :oops:
 
I am seeing the same issue with my Windows 2019 Standard VM:

VM 100 qmp command 'query-backup' failed

I have VirtIO drivers from 1.185 installed.
However, I have the same drivers on my other Windows 2016 Essentials VM, and that one works.
No QEMU Guest Agent is installed, though.
 
In my case it turned out that the problem was an external drive. The drive was attached via USB and passed through to a Windows VM. Both the file system (NTFS) and the drive were fine: no hardware or file system issues. It looks as if some component of PVE (LVM???) tried indefinitely to access the drive, which caused PVE's service daemons to time out. As soon as I removed the external drive everything was fine again. Have there been any changes to LVM or other components between 6.1 and 6.2 that could cause these strange issues???
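
For anyone who wants to check quickly whether a VM has a host USB device passed through, here is a minimal sketch. It assumes the VM config (for example the output of `qm config <vmid>`) uses the usual `usbN: host=...` form; the helper name `usb_passthrough_entries` and the sample config, including the `1234:5678` device ID, are hypothetical and not taken from the affected system.

```python
import re

# Assumption: USB devices appear as "usbN: ..." lines, and "host=" marks a
# device passed through from the host (as opposed to e.g. SPICE redirection).
USB_RE = re.compile(r"^(usb\d+):\s*(.+)$")


def usb_passthrough_entries(config_text):
    """Return {"usbN": "value"} for every usbN line that passes a host device."""
    entries = {}
    for line in config_text.splitlines():
        m = USB_RE.match(line.strip())
        if m and "host=" in m.group(2):
            entries[m.group(1)] = m.group(2)
    return entries


# Hypothetical config excerpt for illustration only.
sample_config = """\
name: WinSrv2016
scsi0: local-lvm:vm-100-disk-0,size=50G
scsi1: local-lvm:vm-100-disk-1,size=600G
usb0: host=1234:5678
usb1: spice
"""

print(usb_passthrough_entries(sample_config))
```

If this reports an entry for the VM that hangs during backup, temporarily removing that device would be one way to test whether the passthrough is involved.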
 
A summary of the changes in Proxmox VE 6.2 can be found in the wiki. Have you managed to reproduce this? Otherwise it will be very hard to debug.
 
I really would like to help debug this issue. It definitely appeared after the update from 6.1 to 6.2. Nothing else changed. The PVE host is a standard HPE ProLiant DL360 Gen9 E5-2620v4. The external USB drive is a Tandberg RDX. After the update strange things happened: backups did not work as described, the GUI stopped responding, PVE service daemons hung and even prevented a reboot of the system, etc. As soon as I removed the RDX everything worked as smoothly as before with 6.1. Plug in the drive and same sh*** again. And as I said: the file system on the drive was NTFS, and the drive was passed through to a Windows VM.

I can't play around with this semi-productive system too much since it has some work to do. What else can I do to debug this problem?

Thx a lot!
 
There have been some changes around backup and QEMU, most importantly the switch to QEMU 5.0 in PVE 6.2. Which version are you on now exactly (output of pveversion -v)? Do you know which exact version was working previously?

Possibly related to bug report 2673.
 
Current version:

Code:
root@pve:~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-6 (running version: 6.2-6/ee1d7754)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-3
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-8
pve-cluster: 6.1-8
pve-container: 3.1-8
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
root@pve:~#

Unfortunately I can't tell you the exact previous version... :confused:
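
Since the exact previous version is unknown here, it may be worth saving the `pveversion -v` output before each upgrade so two states can be compared later. A small sketch for pulling the backup-relevant packages out of such output follows; `parse_pveversion` is a hypothetical helper, and it assumes the "package: version (optional details)" line format shown above.

```python
def parse_pveversion(output):
    """Map package name -> version string from `pveversion -v` style output."""
    versions = {}
    for line in output.splitlines():
        name, sep, rest = line.partition(": ")
        if not sep or " " in name:
            continue  # skip shell prompts and anything not shaped "pkg: version"
        # Drop trailing details such as "(running kernel: ...)".
        versions[name.strip()] = rest.split(" (")[0].strip()
    return versions


# Excerpt of the output posted above.
sample = """\
root@pve:~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-6 (running version: 6.2-6/ee1d7754)
pve-qemu-kvm: 5.0.0-4
qemu-server: 6.2-3
"""

versions = parse_pveversion(sample)
for pkg in ("pve-qemu-kvm", "qemu-server", "pve-manager"):
    print(pkg, versions[pkg])
```

Comparing two such dictionaries (before and after an upgrade) would show which packages changed, which is the information asked for above.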
 
Hi! Looks like I'm facing the same issue. It happens during backup creation: the Windows VM just gets stuck, with the same resource usage graph. Overall, the same symptoms.

Any news on this? @Tom7320, maybe you've found a workaround?

In the release notes of Proxmox 6.3 I see a couple of related fixes/improvements, but I'm not sure whether those actually fix this.
This bug report https://bugzilla.proxmox.com/show_bug.cgi?id=2673 is still open, as far as I can see.
 
Unfortunately I switched to a very stupid solution: I no longer use external drives that are passed through to VMs, since I never found a "real" solution... :(
 
@Tom7320, thank you. At least now I have an idea of what can cause the problem and how I can deal with it.
P.S. Surprised to see such a fast response in a 5-month-old topic :cool:

Forgot to mention in my first post: I am using Proxmox 6.2 at the moment.
 
Sorry for not having a real solution yet. But I'm still hoping.... ;)
I never tried 6.3 in relation to this problem.
 