CT on NFS: backup freezes and the container does not start again

jkalousek

I have 3 nodes. All CTs run off an NFS server and are backed up (using stop mode) to another NFS server, everything over a 10Gb link.
Everything runs without problems, but roughly every 14 days one node freezes during the backup of a CT, and until I stop the job manually it shows only:

Code:
INFO: starting new backup job: vzdump 105 --mode stop --mailto -redacted- --mailnotification failure --prune-backups 'keep-daily=6,keep-monthly=2,keep-weekly=2' --storage IKO-serverBackup --compress zstd --quiet 1 --node peT320 --notes-template '{{guestname}}'
INFO: Starting Backup of VM 105 (lxc)
INFO: Backup started at 2024-02-07 23:59:08
INFO: status = stopped
INFO: backup mode: stop
INFO: bandwidth limit: 81920 KB/s
INFO: ionice priority: 7
INFO: CT Name: -redacted-
INFO: including mount point rootfs ('/') in backup
INFO: creating vzdump archive '/mnt/pve/IKO-serverBackup/dump/vzdump-lxc-105-2024_02_07-23_59_08.tar.zst'

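While it hangs like that, something like the following should show whether the archive process is actually stuck in uninterruptible sleep ('D' in the STAT column), which would point at a stalled I/O call against the NFS mount (the PID is whatever tar/zstd child vzdump spawned):

Code:
# find the vzdump worker and its tar/zstd children; 'D' in STAT = uninterruptible sleep
ps -eo pid,ppid,stat,wchan:32,args | grep -E '[v]zdump|[t]ar|[z]std'
# kernel stack of the stuck process (replace <PID> with the one found above)
cat /proc/<PID>/stack
# any hung-task warnings from the kernel
dmesg | grep -i 'blocked for more than'
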
After I stop the backup via the GUI I get:
Code:
umount: /mnt/vzsnap0/: not mounted.
command 'umount -l -d /mnt/vzsnap0/' failed: exit code 32
ERROR: Backup of VM 105 failed - command 'mount /dev/loop0 /mnt/vzsnap0//' failed: interrupted by signal
INFO: Failed at 2024-02-08 06:13:04
INFO: Backup job finished with errors
INFO: notified via target -redacted-
TASK ERROR: job errors

After that the container is just left in the stopped state, and if I try to start it manually everything freezes after a few seconds (including the GUI). I then have to connect via the CLI and manually kill the starting container, which never comes up.
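
For reference, this is roughly the cleanup I would expect to need before another start attempt (CT 105, /dev/loop0 and /mnt/vzsnap0/ are taken from the error output above), although so far only a full node reboot actually gets things moving again:

Code:
# drop the backup lock vzdump may have left on the container
pct unlock 105
# check for a leftover snapshot mount / loop device from the aborted backup
findmnt /mnt/vzsnap0
losetup -l
# lazy-unmount and detach if anything is still attached
umount -l /mnt/vzsnap0
losetup -d /dev/loop0
# then try a foreground start with debug output
pct start 105 --debug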

At first I thought the NFS server was not responding, but when I list the contents of the NFS storage via the GUI everything loads right up and it all seems active.
I had a look at 'lsof', but there do not seem to be any stalled open files under the NFS mount.
There are also no zombie processes in 'top'.
But when I try to unmount the NFS share that the container uses, I get 'device is busy'.
The only thing that gets everything running again is restarting the node.
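
Since 'lsof' and 'top' come up empty, the only place I have found that actually lists what is blocked is the kernel side. Assuming SysRq is enabled (kernel.sysrq), something like this dumps all D-state (blocked) tasks with their stacks, plus the NFS mount options and per-mount counters; IKO-serverBackup is the backup storage from the job above, and the CT storage mount can be checked the same way:

Code:
# dump all tasks in uninterruptible (blocked) state to the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 100
# NFS mounts with their effective mount options
nfsstat -m
# per-mount NFS operation and retransmission counters
grep -A 30 'IKO-serverBackup' /proc/self/mountstats
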
I tried both NFSv3 and NFSv4 and there does not seem to be any difference.
The other nodes are running fine, even with more containers on them, without any problems.
I'm using TrueNAS Core as the NFS target for both backup and CT storage. The backups are bandwidth-limited (bwlimit) so they will not saturate the network, and they are all scheduled at different times so they do not overlap.
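
For completeness, switching between NFSv3 and v4 just means changing the options line on the storage definition; the backup storage entry in /etc/pve/storage.cfg would look roughly like this (export path and server address are placeholders here, not the real ones):

Code:
nfs: IKO-serverBackup
	export /mnt/tank/proxmox-backup
	path /mnt/pve/IKO-serverBackup
	server 192.168.1.10
	content backup
	options vers=4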

I have left the CT in this broken state so I can provide logs if requested; this container is not critical, but I need to get it running again.
Does anyone have a suggestion as to what could be causing this, or where I should look?
 
