I set up a new 5.3 two-node cluster and created some CTs on a local LVM volume.
When I back up a CT to NFS storage, the job starts, reports that it cannot do a snapshot, and continues in suspend mode, but it never finishes.
The node, or rather the PVE GUI, becomes unresponsive, all NFS mounts hang, and even a reboot does not work; the node has to be power-cycled.
This happens on both 5.3 nodes, while a productive 3.4 two-node cluster that is set up the same way with the same NFS mounts has never had this issue.
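If it helps: as far as I understand, vzdump's snapshot mode needs a snapshot-capable storage (lvmthin, zfspool, ...), and a directory storage sitting on a plain thick-LVM volume does not qualify. This is how I would check the storage definitions and LV layout (generic commands, nothing cluster-specific):
Code:
# show how the storages are defined (type dir/lvm/lvmthin/nfs/...)
cat /etc/pve/storage.cfg
pvesm status
# show whether the LVs are thick or thin
# (thin pools carry 't', thin volumes 'V' in the first lv_attr column)
lvs -o lv_name,vg_name,lv_attr,pool_lv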
Code:
root@fralxpve01:~# pveversion
pve-manager/5.3-12/5fbbbaf6 (running kernel: 4.15.18-12-pve)
root@fralxpve01:~# pvecm status
Quorum information
------------------
Date: Tue Apr 9 11:42:33 2019
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 1/240
Quorate: Yes
Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.16.209.27 (local)
0x00000002 1 10.16.209.28
root@fralxpve01:~#
root@fralxpve02:~# pveversion
pve-manager/5.3-12/5fbbbaf6 (running kernel: 4.15.18-12-pve)
root@fralxpve02:~# pvecm status
Quorum information
------------------
Date: Tue Apr 9 11:45:11 2019
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000002
Ring ID: 1/240
Quorate: Yes
Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.16.209.27
0x00000002 1 10.16.209.28 (local)
root@fralxpve02:~#
Code:
Virtual Environment 5.3-12
Node 'fralxpve02'
INFO: starting new backup job: vzdump 1000 --storage backup.nfs --mode snapshot --compress gzip --remove 0 --node fralxpve02
INFO: Starting Backup of VM 1000 (lxc)
INFO: status = running
INFO: CT Name: fralxnoc
INFO: mode failure - some volumes do not support snapshots
INFO: trying 'suspend' mode instead
INFO: backup mode: suspend
INFO: ionice priority: 7
INFO: CT Name: fralxnoc
INFO: temporary directory is on NFS, disabling xattr and acl support, consider configuring a local tmpdir via /etc/vzdump.conf
INFO: starting first sync /proc/26257/root// to /mnt/pve/backup.nfs/dump/vzdump-lxc-1000-2019_04_09-10_34_39.tmp
INFO: Number of files: 27,838 (reg: 21,702, dir: 2,206, link: 3,898, dev: 2, special: 30)
INFO: Number of created files: 27,837 (reg: 21,702, dir: 2,205, link: 3,898, dev: 2, special: 30)
INFO: Number of deleted files: 0
INFO: Number of regular files transferred: 21,692
INFO: Total file size: 732,235,590 bytes
INFO: Total transferred file size: 724,878,042 bytes
INFO: Literal data: 724,878,042 bytes
INFO: Matched data: 0 bytes
INFO: File list size: 1,048,536
INFO: File list generation time: 0.001 seconds
INFO: File list transfer time: 0.000 seconds
INFO: Total bytes sent: 726,631,831
INFO: Total bytes received: 437,816
INFO: sent 726,631,831 bytes received 437,816 bytes 9,759,324.12 bytes/sec
INFO: total size is 732,235,590 speedup is 1.01
INFO: first sync finished (74 seconds)
INFO: suspend vm
INFO: starting final sync /proc/26257/root// to /mnt/pve/backup.nfs/dump/vzdump-lxc-1000-2019_04_09-10_34_39.tmp
INFO: Number of files: 27,838 (reg: 21,702, dir: 2,206, link: 3,898, dev: 2, special: 30)
INFO: Number of created files: 0
INFO: Number of deleted files: 0
INFO: Number of regular files transferred: 0
INFO: Total file size: 732,235,590 bytes
INFO: Total transferred file size: 0 bytes
INFO: Literal data: 0 bytes
INFO: Matched data: 0 bytes
INFO: File list size: 0
INFO: File list generation time: 0.001 seconds
INFO: File list transfer time: 0.000 seconds
INFO: Total bytes sent: 651,037
INFO: Total bytes received: 2,392
INFO: sent 651,037 bytes received 2,392 bytes 435,619.33 bytes/sec
INFO: total size is 732,235,590 speedup is 1,120.60
INFO: final sync finished (1 seconds)
INFO: resume vm
INFO: vm is online again after 3 seconds
INFO: creating archive '/mnt/pve/backup.nfs/dump/vzdump-lxc-1000-2019_04_09-10_34_39.tar.gz'
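The tmpdir hint in the log looks worth following up: with the temporary directory on local storage instead of NFS, at least the rsync stage would not touch the filer. A minimal /etc/vzdump.conf change would be (the path below is just an example on my local LVM mount):
Code:
# /etc/vzdump.conf
# keep vzdump's temporary files off NFS (example path; directory must exist)
tmpdir: /srv/vz.local/tmp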
The backup task never finishes, because the NFS mounts no longer work on that node:
Code:
root@fralxpve02:~# df -h
Filesystem Size Used Avail Use% Mounted on
udev 32G 0 32G 0% /dev
tmpfs 6.3G 11M 6.3G 1% /run
/dev/sda3 64G 2.7G 58G 5% /
tmpfs 32G 60M 32G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/sda1 333M 132K 333M 1% /boot/efi
/dev/mapper/vg001-lvol001 605G 5.3G 569G 1% /srv/vz.local
/dev/fuse 30M 24K 30M 1% /etc/pve
10.16.209.29:/vol99fs/data_wansec 200G 524M 200G 1% /mnt/pve/vz.nfs-ssd
10.16.209.29:/vol1bfs/bkup_wansec 650G 218G 432G 34% /mnt/pve/backup.nfs
10.16.209.29:/vol1bfs/bkup_fralxnoc 437G 4.7G 433G 2% /mnt/pve/archive.nfs
10.16.209.29:/vol175fs/data_wansec 33G 381M 32G 2% /mnt/pve/templates.nfs
tmpfs 6.3G 0 6.3G 0% /run/user/0
root@fralxpve02:~#
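When the node is in this state, I would expect the backup process and anything else touching the mounts to be stuck in uninterruptible sleep. These generic diagnostics should confirm that (the filer IP is taken from the df output above):
Code:
# processes in state 'D' (uninterruptible sleep, typically a dead NFS mount)
ps axl | awk '$10 ~ /D/'
# kernel messages about the NFS client/server
dmesg | grep -i nfs
# does the filer still answer RPC requests at all?
rpcinfo -p 10.16.209.29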
I need help figuring out why snapshot mode can't be used with the CTs on local LVM, and why the hanging NFS mounts kill the node.
The issue is reproducible on both nodes every time a backup job starts or a CT backup is run manually; I have disabled the backup jobs for now to keep the servers working.
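Since the 3.4 cluster mounts the same exports without problems, one difference I plan to check is the negotiated NFS version and mount options on each node; PVE 5.x may negotiate NFSv4 where the old cluster uses v3 (that is only an assumption on my part):
Code:
# effective NFS mount options (vers=, proto=, hard/soft, timeo, ...)
nfsstat -m
# or simply
mount -t nfs,nfs4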
The 3.4 cluster is running on a pair of Dell PowerEdge R710
The 5.3 cluster is running on a pair of HP ProLiant DL360 Gen10
Storage is a NetApp AllFlash running 9.1P10
thanks