Hi
I am experiencing a strange issue with 2 VMs on our 4-node Proxmox cluster.
The 2 VMs are on the same host.
When either a scheduled or a manual VZDump starts, it locks the VM completely and nothing more happens. The only way to get past the problem is to stop the vzdump job and do an unlock and reset from the shell. It seems as if the OS of the VM locks up.
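For reference, the recovery from the shell looks roughly like this (using VMID 103 from the log below as an example; the task PID is a placeholder):

Code:
# stop the hanging vzdump task
# (we normally stop it from the GUI task list; killing the process also works)
ps aux | grep vzdump
kill <pid-of-the-vzdump-task>

# then clear the backup lock and hard-reset the guest
qm unlock 103
qm reset 103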
Here is the output from the job:
Code:
INFO: starting new backup job: vzdump 103 --compress gzip --storage backups-nfs --remove 0 --mailto ***@***.dk --mode snapshot --node ***
INFO: Starting Backup of VM 103 (qemu)
INFO: status = running
INFO: update VM 103: -lock backup
INFO: VM Name: vm103
INFO: include disk 'scsi0' 'local-zfs:vm-103-disk-1' 80G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/backups-nfs/dump/vzdump-qemu-103-2019_01_12-04_14_12.vma.gz'
I have removed sensitive information such as email addresses and hostnames, but otherwise the output is untouched.
When the job reaches the point where it should create the dump file, it stays there for about 5-10 seconds before the VM locks up and our monitoring system starts reporting errors.
For the scheduled backups, this problem prevents the rest of the VMs on the node from getting backed up.
The error only occurs for these two specific VMs on this specific node, so if I exclude them from the scheduled job, everything is fine. Even backups of larger VMs run fine. There are also no snapshots on these VMs.
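To rule the snapshot part out, this is roughly how it can be checked (assuming the default rpool/data dataset layout for local-zfs; the dataset path may differ on other setups):

Code:
# no Proxmox-level snapshots on the VM
qm listsnapshot 103

# no ZFS snapshots on the underlying zvol either
# (rpool/data is the default dataset path for local-zfs)
zfs list -t snapshot -r rpool/data/vm-103-disk-1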
All VMs are being backed up to the same location, which is named backups-nfs in the job above.
We have also tried rebooting the node, but it made no difference. Here are the package versions on the node:
Code:
proxmox-ve: 5.3-1 (running kernel: 4.15.18-9-pve)
pve-manager: 5.3-5 (running version: 5.3-5/97ae681d)
pve-kernel-4.15: 5.2-12
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.18-8-pve: 4.15.18-28
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.15.18-3-pve: 4.15.18-22
pve-kernel-4.15.17-3-pve: 4.15.17-14
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-33
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-5
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-31
pve-container: 2.0-31
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-16
pve-firmware: 2.0-6
pve-ha-manager: 2.0-5
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-43
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1
VMs are running on a ZFS RAID1 root on Intel NVMe drives.
I am currently migrating one of these VMs to another node to see if the problem goes away. Has anyone previously seen a similar issue to what I have described above?
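In case it matters, the migration is being done as a plain online migration, roughly like this (the target node name is a placeholder):

Code:
# live-migrate VM 103 to another node in the cluster
qm migrate 103 <target-node> --online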