PVE 8.1.3 loses NFS connection during Scheduled Backup Task

Feb 20, 2023
Hello Forum

It seems that the active PVE node loses the connection to the shared storage nodes (TrueNAS via NFS) during backup. The setup looks like this: the two PVE servers each connect with 1 x 1 Gbit/s for the management and VM interface to the internal switch. Two TrueNAS servers provide shared storage; one NAS is equipped with SSDs, the other with spinning HDDs for backups via NFS and for file sharing via SMB/CIFS. All servers are connected via a dedicated 10 Gbit/s backend switch.

There are 11 LXCs and 8 VMs on the PVE nodes. The backup of the LXCs/VMs is triggered by an automatic backup job at 01:00 (LXC) and 03:00 (VM). The OS disks run on the SSD NAS (shared) and the backups are stored on the HDD NAS (shared).

Now to the actual problem: when backing up the LXCs, the following error has been occurring very often recently: ERROR: can't acquire lock '/var/run/vzdump.lock' - got timeout. This affects the second backup job, which cannot start because the lock still held by the first backup job prevents it.
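For reference, a quick way to check from the node's shell whether the first job is still holding that lock (the path is taken straight from the error message):

Code:
# Is the lock file still present, and is a vzdump process still running?
ls -l /var/run/vzdump.lock
ps aux | grep '[v]zdump'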

The LXC backup job usually aborts at a different point each time; it may fail on the first, third or fourth container of the job. Access to the GUI and SSH is still possible, but I can only cancel the backup job to a limited extent, and usually I then lose all LXC and VM systems in the GUI. All guest systems continue to run, only the LXC that remains in backup status hangs and never recovers.

After that, only a restart of the active node brings everything back to a working state. The backup jobs for the LXCs run in stop mode (not suspend) and the compression is set to ZSTD.
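For reference, a one-off manual run of a single container with the same settings (stop mode, ZSTD) looks roughly like this; the container ID and storage name are only placeholders for my setup:

Code:
# Manual backup of one container, same settings as the scheduled job
# 101 and hdd-nas-backup are placeholders for the CT ID and the NFS backup storage
vzdump 101 --mode stop --compress zstd --storage hdd-nas-backup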

Most LXCs run Docker nested inside the container, and yes, I know that this is suboptimal. However, this should not affect the backup of the LXC containers, as they are stopped for the backup. What additional information is needed to get support here in the forum, so that I can fix the problem and have my backups created automatically at night again?

I have tried to analyze the log but cannot interpret it clearly, so I am attaching an excerpt.
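For reference, something like this should pull the relevant messages from the backup window out of the journal and kernel log (times and filters are just examples):

Code:
# Journal entries from the backup window, filtered for NFS/vzdump messages
journalctl --since "01:00" --until "05:00" | grep -iE 'nfs|vzdump'
# Kernel messages such as "nfs: server ... not responding"
dmesg -T | grep -i nfs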

Help is welcome and I am grateful for any kind of advice.

Regards
Nico

pveversion -v
proxmox-ve: 8.1.0 (running kernel: 6.5.11-7-pve)
pve-manager: 8.1.3 (running version: 8.1.3/b46aac3b42da5d15)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
pve-kernel-5.15: 7.4-3
proxmox-kernel-6.5: 6.5.11-7
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.5.11-6-pve-signed: 6.5.11-6
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
proxmox-kernel-6.2.16-18-pve: 6.2.16-18
proxmox-kernel-6.2.16-15-pve: 6.2.16-15
proxmox-kernel-6.2.16-12-pve: 6.2.16-12
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx7
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.7
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.2-1
proxmox-backup-file-restore: 3.1.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.3
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-2
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.1.4
pve-qemu-kvm: 8.1.2-4
pve-xtermjs: 5.3.0-2
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1
 

Howdy,

I saw this and ran into the exact same issue. It appears, at least on my end, that it is caused by running multiple parallel backup jobs or by overwhelmed I/O (I/O contention). Usually, when you encounter this and the backup is "hung" or "stuck", running the following via the PVE shell
Code:
ps aux | grep vzdump
will show you the task and its PID, while the GUI shows the guest as locked with the floppy disk icon on the thumbnail. The output of the command above also includes a process state code for each task, something like Ds or S+.

Ds: specifically, the D means the process is in an uninterruptible sleep state. This typically occurs when the process is waiting for I/O to complete (such as disk or network access, e.g. if you're hosting an NFS backup target on a TrueNAS VM, which was my case and seems similar to yours). The process cannot be killed until the I/O operation finishes or times out. The s means it is a session leader (leader of its process group). Summed up, it is very difficult to get rid of such a process without rebooting the node, unless the external event it is waiting for (like access to the NFS share) concludes. A command such as kill -9 <PID> tends not to work when the process state shows Ds.

S+ is an interruptible sleep state: the process is waiting for an event and can be interrupted by a signal such as SIGKILL. The + means it is in the foreground process group of a terminal, i.e. it is actively associated with the current shell.
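A quick way to see those state codes for just the backup processes (and, for a process stuck in D, the kernel wait channel it is blocked in) is:

Code:
# PID, state code and kernel wait channel of any running vzdump processes
ps -o pid,stat,wchan:30,cmd -C vzdump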

I found that rebooting the VM hosting the NFS share allowed the pending I/O operation to complete (at least enough so that I could unlock the guest), so in those cases it may not be necessary to reboot the entire node. To avoid this in the future, improve the I/O situation, for example by providing multiple paths so that parallel backup jobs can run without saturating the system (I/O contention), or simply run the backup jobs sequentially.
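Once the NFS share responds again and the stuck vzdump process has exited, the guest that was left in backup status can usually be unlocked from the node's shell without rebooting the whole node (the path and IDs below are just examples from my setup):

Code:
# Check that the backup mount answers again before touching anything
timeout 5 ls /mnt/pve/pve-backups/dump
# Then clear the leftover backup lock on the affected guest
pct unlock 112   # for a container
qm unlock 985    # for a VM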

On a somewhat related note, when you run into these issues, check the backup directory via the shell (or something like WinSCP) for leftover *.tmp and *.vma.dat files. You will find them whenever you have incomplete or interrupted backups. They can eat away heavily at your storage and they do not show up in the PVE node's web GUI, so you will be blind to them unless you go digging. In my case I found them where my backup NFS share was mounted: /mnt/pve-pool/pve-pool-backups/dump/. I had a few hundred GB of data that was just clogging up storage.

Due to permission issues you likely won't be able to clear these "abandoned" files via WinSCP or PVE's GUI; you'll need to go through the shell. Here's how I was able to clear them:


Code:
# First command and shell output
root@pve:~# ls -lh /mnt/
total 8.0K
drwxr-xr-x 3 root root 4.0K Oct 17  2023 pve
drwxr-xr-x 2 root root 4.0K Feb 14  2024 vzsnap0

# Second command and shell output
root@pve:~# ls -lh /mnt/pve/
total 512
drwxrwxrwx 5 3000 3000 5 Oct 17  2023 pve-backups
root@pve:~# ls -lh /mnt/pve/pve-backups/
total 26K
drwxr-xr-x 4 3000 3000 23 Jan  9 11:59 dump
drwxr-xr-x 2 3000 3000  2 Oct 17  2023 images
drwxr-xr-x 2 3000 3000  2 Oct 17  2023 snippets

# Third command and shell output
root@pve:~# ls -lh /mnt/pve/pve-backups/dump/
total 19G
-rw-r--r-- 1 3000 3000  828 Jan  9 11:50 vzdump-lxc-112-2025_01_09-11_50_02.log
-rw-r--r-- 1 3000 3000 714M Jan  9 11:50 vzdump-lxc-112-2025_01_09-11_50_02.vma.dat
drwxr-xr-x 2 3000 3000    2 Jan  9 11:59 vzdump-qemu-985-2025_01_09-11_59_38.tmp
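To get an idea of how much space these leftovers are eating before you delete anything, a quick size check helps (paths are from my setup, adjust to yours):

Code:
# Total size of leftover temp directories and partial backup files
du -sch /mnt/pve/pve-backups/dump/*.tmp /mnt/pve/pve-backups/dump/*.vma.dat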

Now that you've located the "abandoned" backup files, here are the commands to clear them out (using my directory structure; substitute yours where necessary, e.g. cd /mnt/path/to/your/backups/dump/):

Change working directory to where the files are located:
cd /mnt/pve/pve-backups/dump/

Find and delete any file matching the file extension below:
find . -type f -name "*.vma.dat" -delete
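If you'd rather check first, running the same find with -print instead of -delete only lists what would be removed:
find . -type f -name "*.vma.dat" -print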

Recursively delete the .tmp directories left behind by broken/interrupted backup jobs. The find command above will not handle these, as they are directories with nested files. Be very careful and make sure you are in the proper directory, because rm -rf will nuke anything and everything you point it at if you are not very specific.
rm -rf vzdump-qemu-<VMID>-<DATE>.tmp
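A slightly safer alternative is to let find match only the temp directories sitting directly in the dump directory and remove those, so a typo can't take out anything else:
find . -maxdepth 1 -type d -name "vzdump-*.tmp" -exec rm -rf {} +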

Hopefully you've already figured out the solution to your issue. If not, then maybe this will provide some help, or at least help newcomers and other Proxmox users in the future, or serve as a reference point if you forget exactly what you did last time, like myself.

Cheers,

drnarf
 