Backup of VM failed (with exit code 5)

I have the same problem. I tried to back up a VM but there was no space left on the target storage. The vzdump process continues to run but nothing happens... All other scheduled backups are halted, waiting for the first vzdump to complete (which never happens)...

After reading this thread, I ran "killall vzdump" to allow the other backups to work. But the backup script is not able to "umount /mnt/vzsnap0", and cannot "lvremove /dev/pve/vzsnap-pve02-fl-0"...

[..]
Thanks!

I did a very bad thing which was causing instability and crashes of the whole cluster.
It was entirely my fault; it only additionally triggered some hidden bugs.

I configured both nodes to write their backups onto the disks of the other node,
to be able to restart the services quickly if one node fully dies.

I accidentally made them back up at the same time _and_, more importantly, I didn't
use a separate LVM storage _and_ I didn't leave about 10% of the disk free for
snapshotting.

This continuously forced the LVM snapshots to run out of storage. There is some
crisis management in the kernel/LVM, but if you kick it again and again the filesystems
fail and the machine dies step by step. Both of them died, mixed up the DRBD
storage, lost some of the network links and so on. In short: don't try this at home ;-)
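
By the way, if anyone wants to see this coming: you can watch how full the snapshot gets while vzdump is running. A rough sketch (the LV name below is only an example taken from earlier in this thread, and depending on the LVM version the column is called Snap% or Data%):

Code:
  # list all LVs; for a snapshot the Snap%/Data% column shows how full it is
  lvs
  # more detail for a single snapshot (name is just an example)
  lvdisplay /dev/pve/vzsnap-pve02-fl-0 | grep -i 'allocated to snapshot'

Once that percentage hits 100% the snapshot is invalidated and the backup fails.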

I changed the storage more than a month ago. I'm using the storage from PVE's install routine for
the OS and ISOs, added a RAID 10 for the VMs (LVM/DRBD and qcow, 50% each)
and added a separate RAID 1 for backups. In short: it works fine now, as stable as expected.

I even had to increase the DRBD timeout to 15 seconds because a split-brain happened
while there was no real network outage (confirmed by the monitoring of the enterprise switches). It
never happened again... and no degradation.
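
For reference, the timeout I mean is the one in the net section of the DRBD resource config. A minimal sketch, with a placeholder resource name and values from memory (check drbd.conf(5) before copying):

Code:
  # /etc/drbd.d/r0.res -- "r0" is just an example name
  resource r0 {
    net {
      timeout     150;  # units of 0.1 s, so 150 = 15 s (default 60 = 6 s)
      # timeout has to stay below ping-int and connect-int (both in seconds),
      # so those probably need to be raised as well:
      ping-int    20;
      connect-int 20;
    }
  }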

I don't distribute the DRBD/LVM VMs over the two nodes. They all run on
one node or the other, to prevent a DRBD split-brain from causing major damage
on a hot move (I additionally added the strongly recommended mail alerting for DRBD).
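
The mail alerting is just the split-brain handler that ships with DRBD, something like this (same example resource as above; the script path is the one from the Debian package, and the mail goes to root here):

Code:
  resource r0 {
    handlers {
      # sends a notification mail when DRBD detects a split brain
      split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    }
  }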

I'm very happy with PVE again.

Keep in mind to reserve spare storage for LVM. And keep the snapshot lifetime
short by having enough bandwidth :)).
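
Checking whether a volume group still has enough headroom for the vzdump snapshot is quick ("pve" is the default VG name of a standard install, adjust if yours differs):

Code:
  # VFree should stay comfortably above the snapshot size vzdump allocates
  vgs
  # or per volume group, more verbose
  vgdisplay pve | grep -i free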

Bye
 
Finally found a workaround... No need to reboot the host anymore :D

Killing vzdump is not enough. In my case I also needed to kill vmtar, which is blocking the resources on /mnt/vzsnap0. Once everything is killed, the next vzdump is able to unmount the volume. Maybe killing only vmtar would do the trick and let the vzdump script resume (to be tested...)

Command: ps -xajf
Code:
  ...
    1  2825  2825  2825 ?           -1 Ss       0   0:27 pvedaemon worker
 2825 16290  2825  2825 ?           -1 S        0   0:09  \_ pvedaemon worker
 2825 32468  2825  2825 ?           -1 S        0   0:07  \_ pvedaemon worker
    1  2856  2856  2856 ?           -1 Ss     102   0:01 /usr/sbin/ntpd -p /var/run/ntpd.pid -u 102:106 -g
    1  2883  2882  2882 ?           -1 Sl       0 2687:22 /usr/bin/kvm -monitor unix:/var/run/qemu-server/104.mon,server,nowait -vnc unix:/var/run/qemu-server/104.vnc,password -pidfile /var/run/qemu-server/1
    1  2905  2904  2904 ?           -1 Sl       0 2294:39 /usr/bin/kvm -monitor unix:/var/run/qemu-server/103.mon,server,nowait -vnc unix:/var/run/qemu-server/103.vnc,password -pidfile /var/run/qemu-server/1
    1  2923  2922  2922 ?           -1 Sl       0 2573:55 /usr/bin/kvm -monitor unix:/var/run/qemu-server/101.mon,server,nowait -vnc unix:/var/run/qemu-server/101.vnc,password -pidfile /var/run/qemu-server/1
    1  2939  2939  2939 ?           -1 Ss       1   0:00 /usr/sbin/atd
    1  2959  2959  2959 ?           -1 Ss       0   0:06 /usr/sbin/cron
 2959 10174  2959  2959 ?           -1 S        0   0:00  \_ /USR/SBIN/CRON
10174 10179 10179 10179 ?           -1 Ss       0   0:00      \_ /usr/bin/perl -w /usr/sbin/vzdump --quiet --snapshot --storage xxx_quotidien --mailto abc@def.ca 101
10179 10691 10179 10179 ?           -1 S        0   0:00          \_ sh -c /usr/lib/qemu-server/vmtar '/backuptmpfs/xxx/quotidien/vzdump-qemu-101-2011_01_27-02_28_43.tmp/qemu-server.conf' 'qemu-se
10691 10692 10179 10179 ?           -1 R        0 10064:35          |   \_ /usr/lib/qemu-server/vmtar /backuptmpfs/xxx/quotidien/vzdump-qemu-101-2011_01_27-02_28_43.tmp/qemu-server.conf qemu-serve
10179 16941 10179 10179 ?           -1 D        0   0:00          \_ /usr/bin/perl -w /usr/sbin/qm --skiplock set 101 --lock 
    1  2982  2982  2982 ?           -1 Ss       0   0:29 /usr/sbin/apache2 -k start
...
 
Killing vmtar works better because vzdump ends properly with a status "backup failed"....
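
For anyone cleaning this up by hand, a rough sketch of what I do (double-check the PIDs before killing anything; the mount point is the standard /mnt/vzsnap0):

Code:
  # see which processes still hold files open under the snapshot mount point
  fuser -vm /mnt/vzsnap0
  # find the stuck vmtar and kill only that one
  ps -ef | grep [v]mtar
  kill -9 <vmtar PID>        # or: pkill -9 vmtar
  # vzdump should then end with "backup failed" and the next run
  # can unmount /mnt/vzsnap0 again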
 
Hi all, I had problems like this with vzdump, sometimes because of a failed NFS device, other times I'm not sure why.
I want to share the method I'm using when vzdump fails:

1.- Look for the parent process of the vzdump script:

prx01:~# ps -ef | grep vzdump
. . . 10097 . . .

2.- With the pstree command, look for the processes hanging off that vzdump parent PID:

prx01:~# pstree -np 10097
sh(10097)───vzdump(29477)───sh(29526)─┬─vmtar(29527)
                                      ├─gzip(29528)
                                      └─cstream(29529)

3.- Kill all the processes in the reverse order they appear (leaf processes first):

#kill -9 29529 29528 29527 29526 29477 10097

4.- Unmount the snapshot mount point (normally /mnt/vzsnap0):

#mount
. . .
/dev/mapper/DRBD0-vzsnap--prx01--0 on /mnt/vzsnap0 type ext3 (rw)

#umount /mnt/vzsnap0

5.- Remove, if possible, the logical volume created by vzdump. If it is not possible, no problem: vzdump removes it on the next execution.

prx01:/var/log/vzdump# lvscan
. . .
/dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
ACTIVE '/dev/pve/swap' [15,00 GB] inherit
ACTIVE '/dev/pve/root' [34,00 GB] inherit
ACTIVE '/dev/pve/data' [82,72 GB] inherit
ACTIVE '/dev/DRBD1/vm-101-disk-1' [100,00 GB] inherit
inactive Original '/dev/DRBD0/DRBD0VOLUMEN' [713,75 GB] inherit
inactive Snapshot '/dev/DRBD0/vzsnap-prx01-0' [1,00 GB] inherit

#lvremove /dev/DRBD0/vzsnap-prx01-0
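
If lvremove complains that the snapshot is still in use, I retry it after the umount, or skip the confirmation with the force flag (same LV name as above):

#lvremove -f /dev/DRBD0/vzsnap-prx01-0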


6.- Unlock all our KVM VMIDs and remove the vzdump lock file.

#qm unlock 327
#qm unlock 101
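
If there are many locked VMs, a small loop saves typing (sketch only; it assumes the first column of the qm list output is the VMID):

#for vmid in $(qm list | awk 'NR>1 {print $1}'); do qm unlock $vmid; done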

#rm /var/run/vzdump.lock


7.- Finally, remove the temporary and working files from the destination of our backups, for example /mnt/usb_storage/daily:

-. Remove the *.tmp directories:
#cd /mnt/usb_storage/daily
#rm -R *.tmp

-. Remove the *.dat files:

#rm *.dat


That's all. If somebody sees something that could be dangerous, please let me know.

Thanks for being there.

Best regards, Manuel.
 
Hi all, I had problems like this with vzdump, sometimes because of a failed NFS device, other times I'm not sure why.
I want to share the method I'm using when vzdump fails:

1.- Look for the parent process of the vzdump script:

prx01:~# ps -ef | grep vzdump
. . . 10097 . . .

......

Can I safely try your method? Has anyone experience with this method? It seems reasonable to try. I hope v2 won't have this issue. I have it on 3 of 5 Proxmox nodes, and it happens rarely.
 
