Hello Forum
It seems that the active PVE node loses the connection to the shared TrueNAS/NFS storage during backups. The setup looks like this: the two PVE servers each connect their management and VM interface to the internal switch with 1 x 1 Gbit/s. Two TrueNAS servers provide the shared storage: one NAS is equipped with SSDs, the other with spinning HDDs for backups via NFS and for file sharing via SMB/CIFS. All servers are also connected via a dedicated 10 Gbit/s backend switch.
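For reference, the two NFS storages are defined in /etc/pve/storage.cfg roughly like this (storage IDs, server addresses and export paths are placeholders here, not my real values):

nfs: truenas-ssd
        server 10.10.10.11
        export /mnt/ssd-pool/pve
        path /mnt/pve/truenas-ssd
        content images,rootdir

nfs: truenas-hdd
        server 10.10.10.12
        export /mnt/hdd-pool/pve-backup
        path /mnt/pve/truenas-hdd
        content backup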
The PVE nodes run 11 LXC containers and 8 VMs. Backups are triggered by automatic backup jobs at 01:00 (LXC) and 03:00 (VM). The OS disks live on the SSD NAS (shared) and the backups are written to the HDD NAS (shared).
Now to the actual problem: when backing up the LXCs, the following error has been occurring very often recently: ERROR: can't acquire lock '/var/run/vzdump.lock' - got timeout. This comes from the second backup job, which cannot start because the lock still held by the first backup job prevents it.
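As far as I understand, the lock error is only a symptom: vzdump holds a node-wide lock for the whole duration of the first job, so the second job waits for it and eventually times out. When it happens I check which process still holds the lock with something like this (just the commands I use, maybe there is a better way):

# show any vzdump process(es) still running on the node
ps aux | grep '[v]zdump'
# show which process keeps the lock file open
fuser -v /var/run/vzdump.lock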
The LXC backup job does not always fail at the same point; sometimes it hangs on the first container of the job, sometimes on the third or fourth. Access to the GUI and SSH is still possible, but I can only cancel the backup job to a limited extent, and usually all LXC and VM entries then disappear from the GUI. All guest systems keep running; only the LXC that remains in backup status hangs and never recovers.
After that, only a reboot of the active node brings everything back into a working state. The LXC backup jobs run in stop mode (not suspend) and compression is set to ZSTD; the job definition is sketched below.
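For completeness, the LXC job in /etc/pve/jobs.cfg looks roughly like this (job ID, storage name and VMIDs are placeholders, the VM job at 03:00 is defined the same way):

vzdump: backup-lxc-nightly
        schedule 01:00
        mode stop
        compress zstd
        storage truenas-hdd
        vmid 101,102,103
        enabled 1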
Most LXCs run Docker inside the container, and yes, I know that this is suboptimal. However, this should not affect the backup of the LXC containers, since they are stopped for the backup. What additional information is needed here in the forum so that I can track down the problem and have my backups created automatically again at night?
I have tried to analyze the logs but cannot interpret them conclusively, so I have attached an excerpt.
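If it helps, the next time it hangs I can also collect the following on the node and post the output (time range is a placeholder, adjusted to the night in question):

# journal around the backup window
journalctl -b --since "01:00" --until "03:30"
# NFS client statistics and currently mounted NFS shares
nfsstat -c
findmnt -t nfs,nfs4
# storage status as PVE sees it
pvesm status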
Help is welcome and I am grateful for any kind of advice.
Regards
Nico
pveversion -v
proxmox-ve: 8.1.0 (running kernel: 6.5.11-7-pve)
pve-manager: 8.1.3 (running version: 8.1.3/b46aac3b42da5d15)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
pve-kernel-5.15: 7.4-3
proxmox-kernel-6.5: 6.5.11-7
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.5.11-6-pve-signed: 6.5.11-6
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
proxmox-kernel-6.2.16-18-pve: 6.2.16-18
proxmox-kernel-6.2.16-15-pve: 6.2.16-15
proxmox-kernel-6.2.16-12-pve: 6.2.16-12
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx7
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.7
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.2-1
proxmox-backup-file-restore: 3.1.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.3
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-2
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.1.4
pve-qemu-kvm: 8.1.2-4
pve-xtermjs: 5.3.0-2
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1