Daily backup gets stuck at 'include disk'

Dough29
Hello,

I have a 6-node Proxmox cluster running version 7.2-4.

Each of these nodes has a daily backup task:
- node-1 at 2:00am
- node-2 at 2:30am
...
- node-6 at 4:30am

The target storage is an NFS share with about 2 TB of free space.
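
For context, the share is set up as a regular NFS storage in /etc/pve/storage.cfg, roughly like this (the export path and server address below are placeholders, not my real values):

Code:
nfs: pve-shared
        export /volume1/proxmox
        path /mnt/pve/pve-shared
        server 192.168.1.10
        content images,backup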

I always end up with 2 or 3 of the backup tasks stuck, with the last log line being: INFO: include disk 'scsi0' 'pve-shared:502/vm-502-disk-0.qcow2' 32G

Here is a full log from one of these failing backup tasks:

Code:
INFO: starting new backup job: vzdump --node pve-a84d07 --mode snapshot --prune-backups 'keep-last=7' --mailnotification failure --all 1 --compress zstd --quiet 1 --notes-template '{{guestname}}' --storage pve-shared
INFO: Starting Backup of VM 203 (qemu)
INFO: Backup started at 2022-05-27 02:30:03
INFO: status = running
INFO: VM Name: Docker-Host-16E229
INFO: include disk 'scsi0' 'pve-shared:203/vm-203-disk-1.qcow2' 32G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: skip unused drive 'pve-shared:203/vm-203-disk-0.qcow2' (not included into backup)
INFO: creating vzdump archive '/mnt/pve/pve-shared/dump/vzdump-qemu-203-2022_05_27-02_30_03.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task 'cf8e3543-d959-4756-a400-a57b28cf8c53'
INFO: resuming VM again
INFO:   1% (369.0 MiB of 32.0 GiB) in 3s, read: 123.0 MiB/s, write: 90.2 MiB/s
[...]
INFO: 100% (32.0 GiB of 32.0 GiB) in 5m 39s, read: 95.5 MiB/s, write: 0 B/s
INFO: backup is sparse: 22.48 GiB (70%) total zero data
INFO: transferred 32.00 GiB in 339 seconds (96.7 MiB/s)
INFO: archive file size: 3.45GB
INFO: adding notes to backup
INFO: prune older backups with retention: keep-last=7
INFO: removing backup 'pve-shared:backup/vzdump-qemu-203-2022_05_20-15_32_10.vma.zst'
INFO: pruned 1 backup(s) not covered by keep-retention policy
INFO: Finished Backup of VM 203 (00:05:41)
INFO: Backup finished at 2022-05-27 02:35:44
INFO: Starting Backup of VM 502 (qemu)
INFO: Backup started at 2022-05-27 02:35:44
INFO: status = running
INFO: VM Name: K3S-Master-2
INFO: include disk 'scsi0' 'pve-shared:502/vm-502-disk-0.qcow2' 32G

##################################################
The backup was stuck here; I then rebooted this node
##################################################

interrupted by signal
could not parse qemu-img info command output for '/mnt/pve/pve-shared/images/502/vm-502-disk-0.qcow2' - malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/share/perl5/PVE/Storage/Plugin.pm line 894.

INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/mnt/pve/pve-shared/dump/vzdump-qemu-502-2022_05_27-02_35_44.vma.zst'
ERROR: got timeout
INFO: aborting backup job
ERROR: VM 502 qmp command 'backup-cancel' failed - interrupted by signal
INFO: resuming VM again
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Connection refused
ERROR: Backup of VM 502 failed - VM 502 qmp command 'cont' failed - unable to connect to VM 502 qmp socket - timeout after 449 retries
INFO: Failed at 2022-05-27 09:43:52
ERROR: Backup job failed - Connection refused
TASK ERROR: Connection refused

As you can see, it successfully backed up the first VM and then got stuck on the second (out of 3 in total).

In the morning, what I do to get everything back is to reboot each failing node.

Does anyone have an idea what is going on?

Is there a way to get a DEBUG log level for backup tasks?
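
In the meantime, would running the backup for a single VM manually from a shell on the node give more to go on? Something like this is what I have in mind (same options as the scheduled job, VM ID taken from the log above):

Code:
# run the backup in the foreground and watch where it stops
vzdump 502 --storage pve-shared --mode snapshot --compress zstd

# in another shell, check for a qemu-img process hanging on the NFS mount
ps aux | grep [q]emu-img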

Thank you for reading!
 
Did you get to the bottom of this? I'm seeing the same:

Code:
110: 2022-08-03 21:34:41 INFO: pruned 1 backup(s) not covered by keep-retention policy
110: 2022-08-03 21:34:41 INFO: Finished Backup of VM 110 (00:01:06)

111: 2022-08-03 21:34:41 INFO: Starting Backup of VM 111 (qemu)
111: 2022-08-03 21:34:41 INFO: status = running
111: 2022-08-03 21:34:41 INFO: VM Name: Plex
111: 2022-08-03 21:34:41 INFO: include disk 'scsi0' 'nas-proxmox-vm:111/vm-111-disk-0.qcow2' 32G
111: 2022-08-04 06:22:30 INFO: backup mode: snapshot
111: 2022-08-04 06:22:30 INFO: ionice priority: 7
111: 2022-08-04 06:22:30 INFO: creating vzdump archive '/mnt/pve/nas-proxmox-backup/dump/vzdump-qemu-111-2022_08_03-21_34_41.vma.zst'
111: 2022-08-04 06:22:30 INFO: issuing guest-agent 'fs-freeze' command
111: 2022-08-04 06:22:35 INFO: issuing guest-agent 'fs-thaw' command
111: 2022-08-04 06:22:36 INFO: started backup task 'c1a2ce9f-c4e4-4d9e-8eb9-ab105a223791'
111: 2022-08-04 06:22:36 INFO: resuming VM again

The previous backup finished at 21:34 and the next one (Plex) started, but it then hung at 'include disk' for a number of hours. It did eventually finish successfully.

Speed for the earlier and later backups is fine, and it doesn't seem to hang on the snapshot itself; I could understand the delay if it did.
 
Looks like I've got a zombie:

Code:
root     1657739  0.0  0.0      0     0 ?        Zl   07:41   0:00 [qemu-img] <defunct>

Code:
INFO: Starting Backup of VM 101 (qemu)
INFO: Backup started at 2022-08-04 07:41:30
INFO: status = running
INFO: VM Name: Downloads
INFO: include disk 'scsi0' 'nas-proxmox-vm:101/vm-101-disk-0.qcow2' 32G
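
In case it helps anyone else, this is roughly how I'm checking what the defunct qemu-img is attached to (PID taken from the ps output above):

Code:
# show the parent of the defunct qemu-img process
ps -o ppid= -p 1657739
# then look at that parent - I'd expect it to be the backup/task worker holding it
ps -fp <parent-pid>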

EDIT: I'm going to try with IO Thread enabled on the disk(s) (VirtIO SCSI Single controller) for a few days and see if this helps.
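
For anyone wanting to try the same, enabling it from the CLI would look roughly like this (using one of the VMs from my log above); as far as I know the change only takes effect after the VM is fully restarted:

Code:
# switch to the VirtIO SCSI single controller and enable IO Thread on the disk
qm set 101 --scsihw virtio-scsi-single
qm set 101 --scsi0 nas-proxmox-vm:101/vm-101-disk-0.qcow2,iothread=1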
 
What storage type are the disks on? It seems like qemu-img blocks, which can happen with network storage like NFS or CIFS if there is a problem accessing the storage.
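
To check that, you could run the same kind of call the storage plugin makes yourself and see whether it returns promptly, e.g. (disk path taken from the error earlier in the thread, substitute one of your own disks):

Code:
# roughly what PVE runs to gather disk info; if this blocks, that matches the hang
timeout 10 qemu-img info --output=json /mnt/pve/pve-shared/images/502/vm-502-disk-0.qcow2 || echo "qemu-img timed out or failed"

# and check how the NFS share is mounted
findmnt -t nfs,nfs4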
 
Yep, it is on NFS, on a Synology NAS (RAID10), but it's not heavily utilised at that time :-/

Any other tasks run on either side of the backup window and don't clash with it.

The strange thing for me is that I can often re-run the backup successfully during the day, when the NAS is more heavily utilised by other clients. That said, I did manage to replicate the hang this morning when I ran the backup manually.

I'll perhaps add a dedicated backup drive to run some tests.
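
For the tests, I'll probably start with something basic against the NFS mount itself, along these lines (assuming the usual /mnt/pve/<storage> mount path):

Code:
# rough sequential write test with direct I/O, bypassing the page cache
dd if=/dev/zero of=/mnt/pve/nas-proxmox-vm/ddtest.img bs=1M count=1024 oflag=direct
rm /mnt/pve/nas-proxmox-vm/ddtest.img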
 
Well, another backup finishes right before that one, so maybe the storage is still busy flushing that archive.
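
One quick way to see whether that is what's happening would be to watch the dirty/writeback counters while the previous archive is being finished, e.g.:

Code:
# pages still waiting to be flushed out (e.g. to the NFS server)
watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'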
 
Thanks - certainly a valid point for consideration.

Since enabling IO Thread on all disks, I've not managed to replicate it, even while loading up the storage with other tasks. I've run about 5 backups so far this morning.

Will let it run for a few days and see.
 
