I have the same problem. I tried to back up a VM, but there was no space left on the specific storage. vzdump continues to run but nothing happens... All other scheduled backups are halted, waiting for the first vzdump to complete (which never happens)...
After reading this thread, I ran "killall vzdump" to allow the other backups to run. But the backup script is not able to "umount /mnt/vsnap0" and cannot "lvremove /dev/pve/vzsnap-pve02-fl-0"...
[..]
Thanks!
I did a very bad thing, which caused instability and crashes of the whole cluster.
It was entirely my fault; the hidden bugs were only triggered on top of that.
I configured both nodes to back up to the disks of the other node,
to be able to restart the services quickly if one node dies completely.
I accidentally made them back up at the same time _and_, more importantly, I didn't
use a separate LVM storage _and_ I didn't leave about 10% of the disk free for
snapshotting.
This continuously forced the LVM snapshots to run out of storage. There is some
crisis management in kernel/LVM, but if you kick it again and again the filesystems
fail and the machine dies step by step. Both of them died, mixed up the DRBD
storage, lost some of the network links and so on. In short: don't try this at home ;-)
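For anyone who wants to check the 10% rule before a backup starts, here is a rough sketch in Python. It assumes you feed it the text printed by "vgs --noheadings --units b --nosuffix -o vg_name,vg_size,vg_free"; the function names and the 0.10 default are my own illustration, not anything vzdump provides.

```python
def vg_free_fraction(vgs_output, vg_name):
    """Return the unallocated fraction (0.0 to 1.0) of one volume group.

    vgs_output is expected to be the text printed by:
      vgs --noheadings --units b --nosuffix -o vg_name,vg_size,vg_free
    i.e. one line per volume group: name, size in bytes, free bytes.
    """
    for line in vgs_output.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[0] == vg_name:
            size, free = int(fields[1]), int(fields[2])
            return free / size
    raise ValueError("volume group not found: " + vg_name)


def has_snapshot_headroom(vgs_output, vg_name, min_free=0.10):
    """True if at least min_free of the volume group is still unallocated.

    min_free=0.10 mirrors the "about 10%" rule of thumb from this post;
    it is an assumption, not a hard limit enforced by LVM.
    """
    return vg_free_fraction(vgs_output, vg_name) >= min_free
```

If has_snapshot_headroom() returns False for the volume group that holds the VM disks, it is probably better to skip or postpone the snapshot backup than to let the snapshot fill up.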
I changed the storage more than a month ago. I'm using the storage from PVE's install routine for
OS and ISOs, added a RAID 10 for the VMs (LVM/DRBD and qcow, 50% each)
and added a separate RAID 1 for backups. In short: it works fine now, as stable as I know it.
I even had to increase the DRBD timeout to 15 seconds because split-brain happened
while there was no real network outage (according to monitoring of the enterprise switches). This
never happened again... and no degradation.
I don't distribute the DRBD/LVM VMs over the two nodes. They all run on
one or the other to prevent DRBD split-brain (I additionally set up the strongly
recommended mail alerting for DRBD) and to prevent major damage on hot move.
I'm very happy with PVE again.
Keep in mind to leave spare storage for LVM. And keep the lifetime of the snapshot
short by having enough bandwidth ;-)
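To put rough numbers on that: the snapshot has to absorb everything the VM overwrites while the backup is running, so a slower backup needs more snapshot space. A back-of-the-envelope sketch in Python, where all names and rates are made-up illustration values and the estimate is pessimistic (a copy-on-write snapshot only stores the first overwrite of each chunk):

```python
def backup_seconds(data_bytes, bandwidth_bytes_per_s):
    """How long the snapshot must stay alive: data to copy / backup bandwidth."""
    return data_bytes / bandwidth_bytes_per_s


def snapshot_bytes_needed(data_bytes, bandwidth_bytes_per_s, write_bytes_per_s):
    """Pessimistic snapshot size: everything the VM writes during the backup.

    Assumes every write hits a not-yet-copied chunk, which overestimates
    the real copy-on-write usage.
    """
    duration = backup_seconds(data_bytes, bandwidth_bytes_per_s)
    return duration * write_bytes_per_s


GiB = 1024 ** 3
MiB = 1024 ** 2

# Example: 100 GiB disk, backup target reachable at 50 MiB/s,
# VM writing about 1 MiB/s on average during the backup.
needed = snapshot_bytes_needed(100 * GiB, 50 * MiB, 1 * MiB)
print(needed / GiB)
```

With these example values the backup runs for 2048 seconds and the snapshot needs up to 2 GiB; halving the bandwidth doubles both.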
Bye