Daily backup gets stuck at 'include disk'

Hello,

I have a 6-node Proxmox cluster running version 7.2-4.

Each of these nodes has a daily backup task:
- node-1 at 2:00am
- node-2 at 2:30am
...
- node-6 at 4:30am

Target storage is an NFS share with about 2 TB of free space.

I always get 2 to 3 of the backup tasks stuck, with the last log line being: INFO: include disk 'scsi0' 'pve-shared:502/vm-502-disk-0.qcow2' 32G

Here is a full log from one of these failing backup tasks:

Code:
INFO: starting new backup job: vzdump --node pve-a84d07 --mode snapshot --prune-backups 'keep-last=7' --mailnotification failure --all 1 --compress zstd --quiet 1 --notes-template '{{guestname}}' --storage pve-shared
INFO: Starting Backup of VM 203 (qemu)
INFO: Backup started at 2022-05-27 02:30:03
INFO: status = running
INFO: VM Name: Docker-Host-16E229
INFO: include disk 'scsi0' 'pve-shared:203/vm-203-disk-1.qcow2' 32G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: skip unused drive 'pve-shared:203/vm-203-disk-0.qcow2' (not included into backup)
INFO: creating vzdump archive '/mnt/pve/pve-shared/dump/vzdump-qemu-203-2022_05_27-02_30_03.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task 'cf8e3543-d959-4756-a400-a57b28cf8c53'
INFO: resuming VM again
INFO:   1% (369.0 MiB of 32.0 GiB) in 3s, read: 123.0 MiB/s, write: 90.2 MiB/s
[...]
INFO: 100% (32.0 GiB of 32.0 GiB) in 5m 39s, read: 95.5 MiB/s, write: 0 B/s
INFO: backup is sparse: 22.48 GiB (70%) total zero data
INFO: transferred 32.00 GiB in 339 seconds (96.7 MiB/s)
INFO: archive file size: 3.45GB
INFO: adding notes to backup
INFO: prune older backups with retention: keep-last=7
INFO: removing backup 'pve-shared:backup/vzdump-qemu-203-2022_05_20-15_32_10.vma.zst'
INFO: pruned 1 backup(s) not covered by keep-retention policy
INFO: Finished Backup of VM 203 (00:05:41)
INFO: Backup finished at 2022-05-27 02:35:44
INFO: Starting Backup of VM 502 (qemu)
INFO: Backup started at 2022-05-27 02:35:44
INFO: status = running
INFO: VM Name: K3S-Master-2
INFO: include disk 'scsi0' 'pve-shared:502/vm-502-disk-0.qcow2' 32G

##################################################
##################################################
It was stuck here; this node needed a reboot
##################################################
##################################################

interrupted by signal
could not parse qemu-img info command output for '/mnt/pve/pve-shared/images/502/vm-502-disk-0.qcow2' - malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/share/perl5/PVE/Storage/Plugin.pm line 894.

INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/mnt/pve/pve-shared/dump/vzdump-qemu-502-2022_05_27-02_35_44.vma.zst'
ERROR: got timeout
INFO: aborting backup job
ERROR: VM 502 qmp command 'backup-cancel' failed - interrupted by signal
INFO: resuming VM again
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Connection refused
ERROR: Backup of VM 502 failed - VM 502 qmp command 'cont' failed - unable to connect to VM 502 qmp socket - timeout after 449 retries
INFO: Failed at 2022-05-27 09:43:52
ERROR: Backup job failed - Connection refused
TASK ERROR: Connection refused

As you can see, it successfully backed up the first VM and then got stuck on the second (out of a total of 3).

In the morning, to get everything back, I reboot each failing node.
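
(Before rebooting I can at least check whether the NFS mount itself still responds - just a generic sanity check, with the path taken from my backup share in the log above; if the ls hangs past the timeout, the mount is wedged rather than just the backup task.)

Code:
# If this times out, the NFS mount is unresponsive, not just vzdump.
timeout 10 ls /mnt/pve/pve-shared/dump && echo "mount responds" || echo "mount appears hung"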

Does anyone have an idea what is going on?

Is there a way to get DEBUG-level logging for backup tasks?

Thank you for reading!
 
Did you get to the bottom of this? I'm seeing the same:

Code:
110: 2022-08-03 21:34:41 INFO: pruned 1 backup(s) not covered by keep-retention policy
110: 2022-08-03 21:34:41 INFO: Finished Backup of VM 110 (00:01:06)

111: 2022-08-03 21:34:41 INFO: Starting Backup of VM 111 (qemu)
111: 2022-08-03 21:34:41 INFO: status = running
111: 2022-08-03 21:34:41 INFO: VM Name: Plex
111: 2022-08-03 21:34:41 INFO: include disk 'scsi0' 'nas-proxmox-vm:111/vm-111-disk-0.qcow2' 32G
111: 2022-08-04 06:22:30 INFO: backup mode: snapshot
111: 2022-08-04 06:22:30 INFO: ionice priority: 7
111: 2022-08-04 06:22:30 INFO: creating vzdump archive '/mnt/pve/nas-proxmox-backup/dump/vzdump-qemu-111-2022_08_03-21_34_41.vma.zst'
111: 2022-08-04 06:22:30 INFO: issuing guest-agent 'fs-freeze' command
111: 2022-08-04 06:22:35 INFO: issuing guest-agent 'fs-thaw' command
111: 2022-08-04 06:22:36 INFO: started backup task 'c1a2ce9f-c4e4-4d9e-8eb9-ab105a223791'
111: 2022-08-04 06:22:36 INFO: resuming VM again

The previous backup finished at 21:34 and the next (Plex) started, but then it hung on 'include disk' for a number of hours - it did eventually finish successfully.

Speed for the earlier and later backups is fine, and it doesn't seem to hang on the snapshot, which would at least have explained the delay.
 
Looks like I've got a zombie:

Code:
root     1657739  0.0  0.0      0     0 ?        Zl   07:41   0:00 [qemu-img] <defunct>
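
(If anyone else wants to check for the same thing, a quick filter like this should surface it - just a generic ps one-liner, nothing Proxmox-specific.)

Code:
# List processes in zombie (Z) or uninterruptible I/O wait (D) state;
# a defunct or D-state qemu-img points at blocked storage access,
# not at vzdump itself.
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^[ZD]/'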

Code:
INFO: Starting Backup of VM 101 (qemu)
INFO: Backup started at 2022-08-04 07:41:30
INFO: status = running
INFO: VM Name: Downloads
INFO: include disk 'scsi0' 'nas-proxmox-vm:101/vm-101-disk-0.qcow2' 32G

EDIT: Going to try with IO Thread enabled on the disk(s) (VirtIO SCSI Single) for a few days and see if this helps.
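
(Roughly the change I'm making, for reference - the VM ID and volume name are from my setup, so substitute your own; as far as I understand, iothread on a SCSI disk needs the VirtIO SCSI single controller and a full stop/start of the VM to take effect.)

Code:
# Switch the controller to VirtIO SCSI single, then enable iothread on the disk.
qm set 111 --scsihw virtio-scsi-single
qm set 111 --scsi0 nas-proxmox-vm:111/vm-111-disk-0.qcow2,iothread=1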
 
What storage type are the disks on? It seems like qemu-img blocks, which can happen with network storage like NFS or CIFS if there is a problem accessing the storage.
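
One way to check is to run the same probe the storage plugin runs (qemu-img info) by hand against one of the images on the mount - e.g. with the path from the error log earlier in the thread:

Code:
# If this hangs (or only returns after a long wait), the problem is
# access to the NFS share, not vzdump; --output=json is the format
# PVE's storage plugin parses.
timeout 30 qemu-img info --output=json /mnt/pve/pve-shared/images/502/vm-502-disk-0.qcow2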
 
Yep, it is on NFS, on a Synology NAS in RAID10, but it's not heavily utilised at that time :-/

Any/all other tasks happen on either side of the backup window and don't clash.

The strange thing for me is that I can often re-run the backup successfully during the day, when the storage is more heavily utilised by other clients - although I did manage to replicate the hang this morning when I ran the backup manually.

I'll perhaps add a dedicated backup drive to run some tests.
 
Well, another backup finished right before, so maybe it's busy flushing that file.
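
If you want to verify that, watching the NFS client-side latency during the backup window should show it - for example (assuming nfsiostat from nfs-common is installed; the mount point is taken from your log):

Code:
# Sample NFS client stats for the backup mount every 5 seconds;
# rising avg RTT/exe times while the previous archive is still being
# flushed would support the "busy flushing" theory.
nfsiostat 5 /mnt/pve/nas-proxmox-backup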
 
Thanks - certainly a valid point for consideration.

Since enabling IO Thread on all disks, I've not managed to replicate it, even when loading up the storage with other tasks - I've run about 5 backups so far this morning.
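
(For reference, the repeated test runs are just manual single-VM jobs against the same target - roughly this, with the storage name inferred from the dump path in my logs:)

Code:
# Re-run the backup for one VM with the same mode/compression as the
# scheduled job; storage name assumed from the /mnt/pve/... dump path.
vzdump 111 --storage nas-proxmox-backup --mode snapshot --compress zstd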

Will let it run for a few days and see.