Daily backup gets stuck at 'include disk'

Hello,

I have a 6-node Proxmox cluster running version 7.2-4.

Each of these nodes has a daily backup task:
- node-1 at 2:00am
- node-2 at 2:30am
...
- node-6 at 4:30am

Target storage is an NFS share with about 2 TB of free space.

I always get 2 to 3 of the backup tasks stuck, with the last log line being: INFO: include disk 'scsi0' 'pve-shared:502/vm-502-disk-0.qcow2' 32G

Here is a full log from one of these failing backup tasks:

Code:
INFO: starting new backup job: vzdump --node pve-a84d07 --mode snapshot --prune-backups 'keep-last=7' --mailnotification failure --all 1 --compress zstd --quiet 1 --notes-template '{{guestname}}' --storage pve-shared
INFO: Starting Backup of VM 203 (qemu)
INFO: Backup started at 2022-05-27 02:30:03
INFO: status = running
INFO: VM Name: Docker-Host-16E229
INFO: include disk 'scsi0' 'pve-shared:203/vm-203-disk-1.qcow2' 32G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: skip unused drive 'pve-shared:203/vm-203-disk-0.qcow2' (not included into backup)
INFO: creating vzdump archive '/mnt/pve/pve-shared/dump/vzdump-qemu-203-2022_05_27-02_30_03.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task 'cf8e3543-d959-4756-a400-a57b28cf8c53'
INFO: resuming VM again
INFO:   1% (369.0 MiB of 32.0 GiB) in 3s, read: 123.0 MiB/s, write: 90.2 MiB/s
[...]
INFO: 100% (32.0 GiB of 32.0 GiB) in 5m 39s, read: 95.5 MiB/s, write: 0 B/s
INFO: backup is sparse: 22.48 GiB (70%) total zero data
INFO: transferred 32.00 GiB in 339 seconds (96.7 MiB/s)
INFO: archive file size: 3.45GB
INFO: adding notes to backup
INFO: prune older backups with retention: keep-last=7
INFO: removing backup 'pve-shared:backup/vzdump-qemu-203-2022_05_20-15_32_10.vma.zst'
INFO: pruned 1 backup(s) not covered by keep-retention policy
INFO: Finished Backup of VM 203 (00:05:41)
INFO: Backup finished at 2022-05-27 02:35:44
INFO: Starting Backup of VM 502 (qemu)
INFO: Backup started at 2022-05-27 02:35:44
INFO: status = running
INFO: VM Name: K3S-Master-2
INFO: include disk 'scsi0' 'pve-shared:502/vm-502-disk-0.qcow2' 32G

##################################################
##################################################
It was stuck here; this node needed a reboot
##################################################
##################################################

interrupted by signal
could not parse qemu-img info command output for '/mnt/pve/pve-shared/images/502/vm-502-disk-0.qcow2' - malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/share/perl5/PVE/Storage/Plugin.pm line 894.

INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/mnt/pve/pve-shared/dump/vzdump-qemu-502-2022_05_27-02_35_44.vma.zst'
ERROR: got timeout
INFO: aborting backup job
ERROR: VM 502 qmp command 'backup-cancel' failed - interrupted by signal
INFO: resuming VM again
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Connection refused
ERROR: Backup of VM 502 failed - VM 502 qmp command 'cont' failed - unable to connect to VM 502 qmp socket - timeout after 449 retries
INFO: Failed at 2022-05-27 09:43:52
ERROR: Backup job failed - Connection refused
TASK ERROR: Connection refused

As you can see, it successfully backed up the first VM and then got stuck on the second (out of a total of 3).

In the morning, to get everything back, I reboot each failing node.
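
(Before rebooting I can at least check whether the NFS mount itself still responds - just a generic sanity check, with the path taken from my backup share in the log above; if the ls hangs past the timeout, the mount is wedged rather than just the backup task.)

Code:
# If this times out, the NFS mount is unresponsive, not just vzdump.
timeout 10 ls /mnt/pve/pve-shared/dump && echo "mount responds" || echo "mount appears hung"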

Does anyone have an idea what is going on?

Is there a way to get DEBUG-level logging for backup tasks?

Thank you for reading!
 
Did you get to the bottom of this? I'm seeing the same:

Code:
110: 2022-08-03 21:34:41 INFO: pruned 1 backup(s) not covered by keep-retention policy
110: 2022-08-03 21:34:41 INFO: Finished Backup of VM 110 (00:01:06)

111: 2022-08-03 21:34:41 INFO: Starting Backup of VM 111 (qemu)
111: 2022-08-03 21:34:41 INFO: status = running
111: 2022-08-03 21:34:41 INFO: VM Name: Plex
111: 2022-08-03 21:34:41 INFO: include disk 'scsi0' 'nas-proxmox-vm:111/vm-111-disk-0.qcow2' 32G
111: 2022-08-04 06:22:30 INFO: backup mode: snapshot
111: 2022-08-04 06:22:30 INFO: ionice priority: 7
111: 2022-08-04 06:22:30 INFO: creating vzdump archive '/mnt/pve/nas-proxmox-backup/dump/vzdump-qemu-111-2022_08_03-21_34_41.vma.zst'
111: 2022-08-04 06:22:30 INFO: issuing guest-agent 'fs-freeze' command
111: 2022-08-04 06:22:35 INFO: issuing guest-agent 'fs-thaw' command
111: 2022-08-04 06:22:36 INFO: started backup task 'c1a2ce9f-c4e4-4d9e-8eb9-ab105a223791'
111: 2022-08-04 06:22:36 INFO: resuming VM again

The previous backup finished at 21:34 and the next (Plex) started, but then it hung on 'include disk' for a number of hours - it did eventually finish successfully.

Speed for the earlier and later backups is fine, and it doesn't seem to hang on the snapshot, which would at least have explained the delay.
 
Looks like I've got a zombie:

Code:
root     1657739  0.0  0.0      0     0 ?        Zl   07:41   0:00 [qemu-img] <defunct>
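
(If anyone else wants to check for the same thing, a quick filter like this should surface it - just a generic ps one-liner, nothing Proxmox-specific.)

Code:
# List processes in zombie (Z) or uninterruptible I/O wait (D) state;
# a defunct or D-state qemu-img points at blocked storage access,
# not at vzdump itself.
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^[ZD]/'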

Code:
INFO: Starting Backup of VM 101 (qemu)
INFO: Backup started at 2022-08-04 07:41:30
INFO: status = running
INFO: VM Name: Downloads
INFO: include disk 'scsi0' 'nas-proxmox-vm:101/vm-101-disk-0.qcow2' 32G

EDIT: Going to try with IO Thread enabled on the disk(s) (VirtIO SCSI Single) for a few days and see if this helps.
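
(Roughly the change I'm making, for reference - the VM ID and volume name are from my setup, so substitute your own; as far as I understand, iothread on a SCSI disk needs the VirtIO SCSI single controller and a full stop/start of the VM to take effect.)

Code:
# Switch the controller to VirtIO SCSI single, then enable iothread on the disk.
qm set 111 --scsihw virtio-scsi-single
qm set 111 --scsi0 nas-proxmox-vm:111/vm-111-disk-0.qcow2,iothread=1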
 
What storage type are the disks on? It seems like qemu-img blocks, which can happen with network storage like NFS or CIFS if there is a problem accessing the storage.
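
One way to check is to run the same probe the storage plugin runs (qemu-img info) by hand against one of the images on the mount - e.g. with the path from the error log earlier in the thread:

Code:
# If this hangs (or only returns after a long wait), the problem is
# access to the NFS share, not vzdump; --output=json is the format
# PVE's storage plugin parses.
timeout 30 qemu-img info --output=json /mnt/pve/pve-shared/images/502/vm-502-disk-0.qcow2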
 
Yep, it is on NFS, on a Synology NAS in RAID10, but it's not heavily utilised at that time :-/

Any/all other tasks happen on either side of the backup window and don't clash.

The strange thing for me is that I can often re-run the backup successfully during the day, when the storage is more heavily utilised by other clients - although I did manage to replicate the hang this morning when I ran the backup manually.

I'll perhaps add a dedicated backup drive to run some tests.
 
Well, another backup finished right before, so maybe it's busy flushing that file.
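
If you want to verify that, watching the NFS client-side latency during the backup window should show it - for example (assuming nfsiostat from nfs-common is installed; the mount point is taken from your log):

Code:
# Sample NFS client stats for the backup mount every 5 seconds;
# rising avg RTT/exe times while the previous archive is still being
# flushed would support the "busy flushing" theory.
nfsiostat 5 /mnt/pve/nas-proxmox-backup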
 
Thanks - certainly a valid point for consideration.

Since enabling IO Thread on all disks, I've not managed to replicate it, even when loading up the storage with other tasks - I've run about 5 backups so far this morning.
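
(For reference, the repeated test runs are just manual single-VM jobs against the same target - roughly this, with the storage name inferred from the dump path in my logs:)

Code:
# Re-run the backup for one VM with the same mode/compression as the
# scheduled job; storage name assumed from the /mnt/pve/... dump path.
vzdump 111 --storage nas-proxmox-backup --mode snapshot --compress zstd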

Will let it run for a few days and see.