Backup job failed due to insufficient disk space

xilluminate · Nov 6, 2020

Hi Proxmox forum,

I am currently testing the Proxmox Backup Server at home on my own standalone Proxmox VE host. I have configured one backup job that backs up all my important VMs. One of these VMs is quite large: around 3.6TB of provisioned space, of which about 1.3TB is used.

From my understanding of how incremental backups work, PBS uses the qemu-bitmap method if available. Recently my PVE host needed a reboot, so the qemu-bitmap was lost. The next scheduled run triggered a full backup, as expected. However, I only have space for 1 full backup on my backup store, which has a size of 3TB. The backup job failed due to insufficient disk space. The large VM itself also stopped responding to any requests. The PVE WebGUI reported that the guest agent was not running, and opening a console session was not possible. After stopping and starting the VM again, it seemed fine.

Other VMs continued to run fine after the failed backup job event.

Now my questions are:

- From my understanding PBS uses deduplication, why didn't it use deduplication for the second full backup that has now failed?
- Does PBS check if it has enough disk space for an incremental or full backup? And if so, why did it fail to do so this time?
- Why did the VM stop working instead of running just fine like the other VMs?

For reference I will leave the PVE version, PBS version and the failed backup job logs as attachment in this post.

Thanks in advance.

Kind Regards,

xilluminate

dcsapak · Nov 9, 2020

xilluminate said:
- From my understanding PBS uses deduplication, why didn't it use deduplication for the second full backup that has now failed?

it did, but maybe the data in the vm changed enough so that the backup server ran out of space?
(there only needs to be 1 bit different for a whole 4MB chunk to change)
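
To illustrate the point about chunks, here is a minimal Python sketch. It is not PBS code: it assumes fixed-size 4 MiB chunks identified by a SHA-256 digest, and the function names `chunk_digests` and `changed_chunks` are made up for this example. It shows that flipping a single bit produces a chunk the store has never seen, so the whole 4 MiB has to be written again.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB, the chunk size mentioned above


def chunk_digests(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split data into fixed-size chunks and return one SHA-256 hex digest per chunk."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def changed_chunks(old: bytes, new: bytes, chunk_size: int = CHUNK_SIZE) -> int:
    """Count the chunks of `new` whose digest does not occur anywhere in `old`."""
    known = set(chunk_digests(old, chunk_size))
    return sum(1 for digest in chunk_digests(new, chunk_size) if digest not in known)


if __name__ == "__main__":
    old_image = bytes(3 * CHUNK_SIZE)   # toy 12 MiB "disk" filled with zeros
    new_image = bytearray(old_image)
    new_image[CHUNK_SIZE + 123] ^= 0x01  # flip one bit inside the second chunk
    print(changed_chunks(old_image, bytes(new_image)), "chunk(s) must be stored again")
```

If a guest rewrites small amounts of data spread across many chunks, the number of new chunks can be far larger than the amount of changed data suggests.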

can you also provide the backup log from the pbs side? also a 'df' output from there?

xilluminate said:
- Does PBS check if it has enough disk space for an incremental or full backup? And if so, why did it fail to do so this time?

no it does not and cannot, and there is no such thing in pbs as an incremental or full backup, all backups on a datastore share the chunks, and
those get deduplicated. there is no way to check beforehand if there is enough space, and if we would cancel if no 'full' backup would fit,
we would lose the benefit of the deduplication
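
A rough way to picture this is the toy chunk store below. It is a hypothetical in-memory model (the class `ToyDatastore` and its `backup` method are invented for illustration and are not the PBS implementation), assuming chunks are keyed by their SHA-256 digest. The space a snapshot consumes is only known once each chunk has been hashed and looked up, which is why a size check before the backup starts would not be meaningful.

```python
import hashlib


class ToyDatastore:
    """Hypothetical in-memory stand-in for a shared chunk store (not PBS code)."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}

    def backup(self, data: bytes, chunk_size: int) -> int:
        """Store every chunk of `data` and return the number of bytes newly written."""
        written = 0
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:  # only unknown chunks consume space
                self.chunks[digest] = chunk
                written += len(chunk)
        return written


if __name__ == "__main__":
    store = ToyDatastore()
    # Tiny 4-byte chunks keep the example readable.
    print(store.backup(b"AAAABBBBCCCCDDDD", 4))  # first snapshot: every chunk is new
    print(store.backup(b"AAAABBBBCCCCEEEE", 4))  # second snapshot: only the last chunk is new
```

Every snapshot reads the same way on restore, yet the second one costs only the chunks that differ from what the datastore already holds.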

xilluminate said:
- Why did the VM stop working instead of running just fine like the other VMs?

cannot say without further logs, maybe the journal/syslog of the pve side can help (maybe also logs from inside the vm)

xilluminate · Nov 9, 2020

Hello dcsapak,

Thank you for your reply.

dcsapak said:
it did, but maybe the data in the vm changed enough so that the backup server ran out of space?
(there only needs to be 1 bit different for a whole 4MB chunk to change)

can you also provide the backup log from the pbs side? also a 'df' output from there?

I have attached the backup logs from the PBS side. The df output:

Bash:

root@backup1:~# df
Filesystem            1K-blocks       Used  Available Use% Mounted on
udev                    1991392          0    1991392   0% /dev
tmpfs                    403056      40932     362124  11% /run
/dev/mapper/pbs-root   24381792    1665984   21454240   8% /
tmpfs                   2015260          0    2015260   0% /dev/shm
tmpfs                      5120          0       5120   0% /run/lock
tmpfs                   2015260          0    2015260   0% /sys/fs/cgroup
backups              5672196608 2044167424 3628029184  37% /backups
tmpfs                    403052          0     403052   0% /run/user/0

However, I have already deleted and re-created the full repository. So I don't think the df output is relevant anymore. If you want to see the backup storage usage history I have gathered a Grafana graph:

dcsapak said:
no it does not and cannot, and there is no such thing in pbs as an incremental or full backup, all backups on a datastore share the chunks, and
those get deduplicated. there is no way to check beforehand if there is enough space, and if we would cancel if no 'full' backup would fit,
we would lose the benefit of the deduplication

Alright, understood. I am still learning about PBS so forgive me if my interpretations are wrong.

dcsapak said:
cannot say without further logs, maybe the journal/syslog of the pve side can help (maybe also logs from inside the vm)

I have attached both the syslog of the PVE host and VM.

After a quick search I think these logs are the most interesting lines:

Bash:

Nov  6 13:12:19 pve vzdump[14552]: VM 102 qmp command failed - VM 102 qmp command 'backup-cancel' failed - unable to connect to VM 102 qmp socket - timeout after 5985 retries
Nov  6 13:12:19 pve vzdump[14552]: ERROR: Backup of VM 102 failed - VM 102 qmp command 'query-backup' failed - got timeout
Nov  6 13:12:19 pve vzdump[14552]: INFO: Starting Backup of VM 105 (lxc)
Nov  6 13:12:19 pve zed: eid=17 class=history_event pool_guid=0xF4EEE9E6F0F9E061
Nov  6 13:12:21 pve vzdump[14552]: ERROR: Backup of VM 105 failed - command 'lxc-usernsexec -m u:0:100000:65536 -m g:0:100000:65536 -- /usr/bin/proxmox-backup-client backup '--crypt-mode=encrypt' '--keyfd=11' pct.conf:/var/tmp/vzdumptmp14552_105/etc/vzdump/pct.conf root.pxar:/mnt/vzsnap0 --include-dev /mnt/vzsnap0/./ --skip-lost-and-found --backup-type ct --backup-id 105 --backup-time 1604664739 --repository pve1@pbs@10.13.40.5:stripe' failed: exit code 255
Nov  6 13:12:21 pve vzdump[14552]: INFO: Backup job finished with errors
Nov  6 13:12:21 pve vzdump[14552]: job errors
Nov  6 13:12:21 pve vzdump[14425]: <root@pam> end task UPID:pve:000038D8:000D92E3:5FA483F2:vzdump::root@pam: job errors
Nov  6 13:12:21 pve postfix/pickup[8647]: 26F3D221AC: uid=0 from=<root>
Nov  6 13:12:21 pve postfix/cleanup[23691]: 26F3D221AC: message-id=<20201106121221.26F3D221AC@pve.local>
Nov  6 13:12:21 pve postfix/qmgr[2929]: 26F3D221AC: from=<root@pve.local>, size=26829, nrcpt=1 (queue active)
Nov  6 13:12:21 pve zed: eid=18 class=history_event pool_guid=0xF4EEE9E6F0F9E061
Nov  6 13:12:23 pve postfix/smtp[23694]: 26F3D221AC: to=<redacted>, relay=<redacted>:25, delay=2.8, delays=0.04/0.02/1.8/0.89, dsn=2.6.0, status=sent (250 2.6.0 <20201106121221.26F3D221AC@pve.local> [InternalId=15006615752936, Hostname=<redacted>] 33599 bytes in 0.256, 128.151 KB/sec Queued mail for delivery -> 250 2.1.5)
Nov  6 13:12:23 pve postfix/qmgr[2929]: 26F3D221AC: removed

In the vm-syslog.txt you can see that the syslog of the vm itself stops around this time.
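
For anyone searching a long syslog for the same kind of lines, a small filter along these lines may help. It is only a sketch: the regular expression is modelled on the `vzdump` error lines quoted above, the function name `failed_backups` is made up, and the log path in the usage comment is an assumption that may differ on other systems.

```python
import re

# Matches lines such as:
#   "... vzdump[14552]: ERROR: Backup of VM 102 failed - <reason>"
FAILED_RE = re.compile(r"vzdump\[\d+\]: ERROR: Backup of VM (\d+) failed - (.*)")


def failed_backups(lines):
    """Yield (vmid, reason) for each vzdump backup failure found in the given lines."""
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            yield int(match.group(1)), match.group(2).strip()


# Example usage (path is an assumption):
#   with open("/var/log/syslog", errors="replace") as log:
#       for vmid, reason in failed_backups(log):
#           print(vmid, reason)
```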

Thanks in advance.

Kind Regards,

xilluminate

dcsapak · Nov 10, 2020

ok seems that the disk simply ran full... maybe the data changed in a very unlucky way, such that there were a bunch of new chunks...

Bengt Nolin · Dec 20, 2020

I have also gotten this a few times, and the reason was that some tasks create multi-gigabyte files in the "/var/log/proxmox-backup/tasks/" directory, not that the datastore had been exhausted. In my scenario it is because the datastore is remote and there were network issues, causing every chunk to get logged together with an error message into a 1-2 GB file.

Fortunately this is still a test/evaluation installation and a future production installation will use local storage, and I might even symlink /var/log/proxmox-backup to the same partition as the datastore. At least then the consumed space will be clearly visible from PVE and there will be a lot more storage available.

But I still think there are probably ways to handle this a little differently. Maybe large task logs could be automatically compressed? I guess I could throw logrotate at it, but that would remove the possibility of viewing the task logs from PBS.
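
To check whether oversized task logs are the culprit, something like the following sketch could be run on the PBS host. It only walks the directory named above and lists big files; the helper name `large_files` and the 100 MiB threshold are arbitrary choices for this example, not anything PBS provides.

```python
import os


def large_files(root: str, min_bytes: int):
    """Return (size, path) pairs for files under `root` of at least `min_bytes`, largest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if size >= min_bytes:
                found.append((size, path))
    return sorted(found, reverse=True)


if __name__ == "__main__":
    # Directory taken from the post above; the threshold is an assumption.
    for size, path in large_files("/var/log/proxmox-backup/tasks/", 100 * 1024 * 1024):
        print(f"{size / 1024**3:6.2f} GiB  {path}")
```

The script only reports sizes and does not delete or rotate anything, so the task logs stay viewable from PBS.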


