Backup to NFS kills PVE Node

screenie

Member
Jul 21, 2009
I set up a new 5.3 two-node cluster and created some CTs on a local LVM volume.
When I back up a CT to an NFS storage, the job starts, reports that it cannot do a snapshot, and continues in suspend mode, but it never finishes.
The node, or rather the PVE GUI, becomes unresponsive, all NFS mounts hang, and even a reboot does not work - the node has to be power-cycled.
This happens on both 5.3 nodes, while a productive 3.4 two-node cluster set up the same way with the same NFS mounts never had this issue.

Code:
root@fralxpve01:~# pveversion
pve-manager/5.3-12/5fbbbaf6 (running kernel: 4.15.18-12-pve)
root@fralxpve01:~# pvecm status
Quorum information
------------------
Date:             Tue Apr  9 11:42:33 2019
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1/240
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.16.209.27 (local)
0x00000002          1 10.16.209.28
root@fralxpve01:~#

root@fralxpve02:~# pveversion
pve-manager/5.3-12/5fbbbaf6 (running kernel: 4.15.18-12-pve)
root@fralxpve02:~# pvecm status
Quorum information
------------------
Date:             Tue Apr  9 11:45:11 2019
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1/240
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.16.209.27
0x00000002          1 10.16.209.28 (local)
root@fralxpve02:~#

Code:
Virtual Environment 5.3-12
Node 'fralxpve02'
Logs
INFO: starting new backup job: vzdump 1000 --storage backup.nfs --mode snapshot --compress gzip --remove 0 --node fralxpve02
INFO: Starting Backup of VM 1000 (lxc)
INFO: status = running
INFO: CT Name: fralxnoc
INFO: mode failure - some volumes do not support snapshots
INFO: trying 'suspend' mode instead
INFO: backup mode: suspend
INFO: ionice priority: 7
INFO: CT Name: fralxnoc
INFO: temporary directory is on NFS, disabling xattr and acl support, consider configuring a local tmpdir via /etc/vzdump.conf
INFO: starting first sync /proc/26257/root// to /mnt/pve/backup.nfs/dump/vzdump-lxc-1000-2019_04_09-10_34_39.tmp
INFO: Number of files: 27,838 (reg: 21,702, dir: 2,206, link: 3,898, dev: 2, special: 30)
INFO: Number of created files: 27,837 (reg: 21,702, dir: 2,205, link: 3,898, dev: 2, special: 30)
INFO: Number of deleted files: 0
INFO: Number of regular files transferred: 21,692
INFO: Total file size: 732,235,590 bytes
INFO: Total transferred file size: 724,878,042 bytes
INFO: Literal data: 724,878,042 bytes
INFO: Matched data: 0 bytes
INFO: File list size: 1,048,536
INFO: File list generation time: 0.001 seconds
INFO: File list transfer time: 0.000 seconds
INFO: Total bytes sent: 726,631,831
INFO: Total bytes received: 437,816
INFO: sent 726,631,831 bytes received 437,816 bytes 9,759,324.12 bytes/sec
INFO: total size is 732,235,590 speedup is 1.01
INFO: first sync finished (74 seconds)
INFO: suspend vm
INFO: starting final sync /proc/26257/root// to /mnt/pve/backup.nfs/dump/vzdump-lxc-1000-2019_04_09-10_34_39.tmp
INFO: Number of files: 27,838 (reg: 21,702, dir: 2,206, link: 3,898, dev: 2, special: 30)
INFO: Number of created files: 0
INFO: Number of deleted files: 0
INFO: Number of regular files transferred: 0
INFO: Total file size: 732,235,590 bytes
INFO: Total transferred file size: 0 bytes
INFO: Literal data: 0 bytes
INFO: Matched data: 0 bytes
INFO: File list size: 0
INFO: File list generation time: 0.001 seconds
INFO: File list transfer time: 0.000 seconds
INFO: Total bytes sent: 651,037
INFO: Total bytes received: 2,392
INFO: sent 651,037 bytes received 2,392 bytes 435,619.33 bytes/sec
INFO: total size is 732,235,590 speedup is 1,120.60
INFO: final sync finished (1 seconds)
INFO: resume vm
INFO: vm is online again after 3 seconds
INFO: creating archive '/mnt/pve/backup.nfs/dump/vzdump-lxc-1000-2019_04_09-10_34_39.tar.gz'
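
As the "temporary directory is on NFS" line in the log suggests, the suspend-mode rsync can be staged on local disk instead by setting tmpdir in /etc/vzdump.conf. A minimal sketch (the directory is only an example; any local filesystem with enough free space for the CT contents works):

```
# /etc/vzdump.conf
# stage suspend-mode backups on local storage instead of NFS
tmpdir: /var/tmp
```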

The backup task never finishes, as the NFS mounts no longer work on that node (screenshots attached):

Code:
root@fralxpve02:~# df -h
Filesystem                           Size  Used Avail Use% Mounted on
udev                                  32G     0   32G   0% /dev
tmpfs                                6.3G   11M  6.3G   1% /run
/dev/sda3                             64G  2.7G   58G   5% /
tmpfs                                 32G   60M   32G   1% /dev/shm
tmpfs                                5.0M     0  5.0M   0% /run/lock
tmpfs                                 32G     0   32G   0% /sys/fs/cgroup
/dev/sda1                            333M  132K  333M   1% /boot/efi
/dev/mapper/vg001-lvol001            605G  5.3G  569G   1% /srv/vz.local
/dev/fuse                             30M   24K   30M   1% /etc/pve
10.16.209.29:/vol99fs/data_wansec    200G  524M  200G   1% /mnt/pve/vz.nfs-ssd
10.16.209.29:/vol1bfs/bkup_wansec    650G  218G  432G  34% /mnt/pve/backup.nfs
10.16.209.29:/vol1bfs/bkup_fralxnoc  437G  4.7G  433G   2% /mnt/pve/archive.nfs
10.16.209.29:/vol175fs/data_wansec    33G  381M   32G   2% /mnt/pve/templates.nfs
tmpfs                                6.3G     0  6.3G   0% /run/user/0
root@fralxpve02:~#
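
For what it's worth, once an NFS server stops answering, a plain ls on the mount can hang the shell forever. A small probe with timeout(1) avoids that; this is just a sketch, using the mount point from this thread as an example:

```shell
#!/bin/sh
# Probe whether a (possibly hung) NFS mount still answers.
# The mount point below is the one from this thread; adjust as needed.
check_mount() {
    # timeout(1) kills the ls if the server has stopped responding
    if timeout 5 ls "$1" >/dev/null 2>&1; then
        echo "responsive"
    else
        echo "hung or missing"
    fi
}

check_mount /mnt/pve/backup.nfs
```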

I need help figuring out why snapshot mode can't be used with the CTs on local LVM, and why the hanging NFS mounts kill the node.

The issue is reproducible on both nodes every time a backup job starts or a CT is backed up manually - I have disabled the backup jobs for now to keep the servers working.

The 3.4 cluster is running on a pair of Dell PowerEdge R710 servers.
The 5.3 cluster is running on a pair of HP ProLiant DL360 Gen10 servers.
Storage is a NetApp AllFlash array running 9.1P10.

thanks

mira

Proxmox Staff Member
Aug 1, 2018
Please post the output of 'pct config <CTID>' as well as the output of 'pveversion -v'.
 

screenie

Please post the output of 'pct config <CTID>' as well as the output of 'pveversion -v'.

Code:
root@fralxpve02:~# pct config 1000
arch: amd64
cores: 2
description: new fralxnoc instance
hostname: fralxnoc
memory: 1024
net0: name=eth0,bridge=vmbr2,gw=10.16.209.1,hwaddr=A6:DD:A6:2F:9C:7C,ip=10.16.209.182/24,tag=209,type=veth
onboot: 1
ostype: debian
rootfs: vz.local:1000/vm-1000-disk-0.raw,size=50G
swap: 1024


root@fralxpve02:~# pveversion -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-12-pve)
pve-manager: 5.3-12 (running version: 5.3-12/5fbbbaf6)
pve-kernel-4.15: 5.3-3
pve-kernel-4.15.18-12-pve: 4.15.18-35
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.15.18-10-pve: 4.15.18-32
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-48
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-12
libpve-storage-perl: 5.0-39
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-24
pve-cluster: 5.0-34
pve-container: 2.0-35
pve-docs: 5.3-3
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-18
pve-firmware: 2.0-6
pve-ha-manager: 2.0-8
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-2
pve-xtermjs: 3.10.1-2
qemu-server: 5.0-47
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
root@fralxpve02:~#
 

mira

What exactly is it that backs '/srv/vz.local'?
 

screenie

'/srv/vz.local' holds all the locally stored CTs/VMs and sits on this LVM device:

Code:
root@fralxpve02:~# pvdisplay
--- Physical volume ---
PV Name /dev/sda4
VG Name vg001
PV Size 651.91 GiB / not usable 3.00 MiB
Allocatable yes
PE Size 4.00 MiB
Total PE 166888
Free PE 9532
Allocated PE 157356
PV UUID YUbFsq-ATdw-KECH-U3ws-OQeM-svPZ-PxDJTH

root@fralxpve02:~# lvdisplay
--- Logical volume ---
LV Path /dev/vg001/lvol001
LV Name lvol001
VG Name vg001
LV UUID oGRczr-KcF4-dejq-Yeqg-fvN1-16c6-kwS573
LV Write Access read/write
LV Creation host, time fralxpve02, 2019-01-22 23:45:02 +0100
LV Status available
# open 1
LV Size 614.67 GiB
Current LE 157356
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:0

root@fralxpve02:~# mount | grep vz.local
/dev/mapper/vg001-lvol001 on /srv/vz.local type ext4 (rw,noatime,nodiratime,discard,stripe=64,data=ordered)
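
For context on the snapshot warning in the backup log: this is a plain (non-thin) LVM volume formatted ext4 and added as a directory storage, and a container rootfs stored as a raw file on a directory storage cannot be snapshotted, so vzdump falls back to suspend mode. A hypothetical storage.cfg entry matching this setup (names taken from the thread) might look like:

```
# /etc/pve/storage.cfg (illustrative)
dir: vz.local
        path /srv/vz.local
        content rootdir,images
```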
 

screenie

In case someone runs into the same issue: it seems vzdump has a problem when the NFS target name in storage.cfg contains a dot.
A dot in the source directory where the containers are located does not matter.
Copying files manually to an NFS mount point whose path contains a dot is also not a problem - but the backup process kills the PVE host in this case.
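
If the dot in the storage ID is really the trigger, an untested workaround would be to add the same NFS export again under a dot-free ID in /etc/pve/storage.cfg and point the backup jobs at it. The server and export below are the ones from the df output; the new ID is my invention:

```
# /etc/pve/storage.cfg (illustrative sketch)
# existing entry used the ID "backup.nfs" (contains a dot)
nfs: backup-nfs
        server 10.16.209.29
        export /vol1bfs/bkup_wansec
        path /mnt/pve/backup-nfs
        content backup
```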
 
