vzdump not working as expected on fresh install but was in previous versions

stra4d

Well-Known Member
Mar 1, 2012
I have a fresh install of Proxmox 3.4 (see version info below). The previous server was version 3.1 and the backup method we were using worked there. Mon-Thurs we run a snapshot-mode vzdump as follows:
Code:
vzdump --mode snapshot --compress gzip --storage drive2_backups --maxfiles 1 --mailto scott@me.whatever --all

When the snapshot backup is running, the VMs crash, and thus the backup does not finish (see below for the vzdump output and VM logs).

On Friday we do a shutdown backup as follows (this is working fine):
Code:
vzdump --compress gzip --storage drive2_backups --maxfiles 1 --mailto scott@rci.ca --all


Backups go to a second drive (not local storage). Some VMs are on local storage, some are on the second drive.
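In case it helps anyone reproduce the setup, the weekly schedule could be wired up as a cron entry roughly like this (just a sketch: the /etc/cron.d file name is made up, the 20:00 start is taken from the log timestamps below, and the commands are the two shown above):
Code:
# /etc/cron.d/vzdump-weekly  (hypothetical file name)
# Mon-Thu 20:00: snapshot-mode backup of all VMs
0 20 * * 1-4  root  vzdump --mode snapshot --compress gzip --storage drive2_backups --maxfiles 1 --mailto scott@me.whatever --all
# Fri 20:00: the shutdown backup (no --mode), which works fine
0 20 * * 5    root  vzdump --compress gzip --storage drive2_backups --maxfiles 1 --mailto scott@rci.ca --all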

Code:
# pveversion -v
proxmox-ve-2.6.32: 3.3-147 (running kernel: 2.6.32-37-pve)
pve-manager: 3.4-1 (running version: 3.4-1/3f2d890e)
pve-kernel-2.6.32-37-pve: 2.6.32-147
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.3-20
pve-firmware: 1.1-3
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-31
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-12
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

VZDUMP log (middle removed for clarity):
Code:
100: Jun 02 20:00:09 INFO: Starting Backup of VM 100 (qemu)
100: Jun 02 20:00:09 INFO: status = running
100: Jun 02 20:00:10 INFO: update VM 100: -lock backup
100: Jun 02 20:00:10 INFO: backup mode: snapshot
100: Jun 02 20:00:10 INFO: ionice priority: 7
100: Jun 02 20:00:10 INFO: creating archive '/drive2/vzdumpbackups/dump/vzdump-qemu-100-2015_06_02-20_00_09.vma.gz'
100: Jun 02 20:00:10 INFO: started backup task 'a32f70bb-9501-4b5d-a821-82fbe518c364'
100: Jun 02 20:00:13 INFO: status: 0% (38928384/15032385536), sparse 0% (8904704), duration 3, 12/10 MB/s
100: Jun 02 20:00:19 INFO: status: 1% (167247872/15032385536), sparse 0% (71995392), duration 9, 21/10 MB/s
100: Jun 02 20:00:27 INFO: status: 2% (319291392/15032385536), sparse 0% (78508032), duration 17, 19/18 MB/s
100: Jun 02 20:00:33 INFO: status: 3% (471334912/15032385536), sparse 0% (82812928), duration 23, 25/24 MB/s
...
100: Jun 02 20:09:16 INFO: status: 50% (7523532800/15032385536), sparse 3% (508485632), duration 546, 10/10 MB/s
100: Jun 02 20:09:26 INFO: status: 51% (7668891648/15032385536), sparse 3% (513105920), duration 556, 14/14 MB/s
100: Jun 02 20:09:29 ERROR: VM 100 not running
100: Jun 02 20:09:29 INFO: aborting backup job
100: Jun 02 20:09:29 ERROR: VM 100 not running
100: Jun 02 20:09:30 ERROR: Backup of VM 100 failed - VM 100 not running

/var/log/messages from VM:
Code:
Jun  2 20:07:06 ediserv nmbd[2676]: [2015/06/02 20:07:06, 0] nmbd/nmbd_become_lmb.c:become_local_master_stage2(396)
Jun  2 20:07:06 ediserv nmbd[2676]:   *****
Jun  2 20:07:06 ediserv nmbd[2676]:
Jun  2  20:07:06 ediserv nmbd[2676]:   Samba name server EDISERV is now a local  master browser for workgroup MYGROUP on subnet 192.168.7.207
Jun  2 20:07:06 ediserv nmbd[2676]:
Jun  2 20:07:06 ediserv nmbd[2676]:   *****
Jun  2 20:08:41 ediserv shutdown[14971]: shutting down for system halt
Jun  2 20:08:43 ediserv smartd[2867]: smartd received signal 15: Terminated
Jun  2 20:08:43 ediserv smartd[2867]: smartd is exiting (exit status 0)
Jun  2 20:08:43 ediserv avahi-daemon[2783]: Got SIGTERM, quitting.
Jun  2 20:08:43 ediserv avahi-daemon[2783]: Leaving mDNS multicast group on interface eth0.IPv6 with address fe80::8c61:ceff:fe5b:3365.
Jun  2 20:08:43 ediserv avahi-daemon[2783]: Leaving mDNS multicast group on interface eth0.IPv4 with address 192.168.7.207.
Jun  2 20:08:46 ediserv nmbd[2676]: [2015/06/02 20:08:46, 0] nmbd/nmbd.c:terminate(58)
Jun  2 20:08:46 ediserv nmbd[2676]:   Got SIGTERM: going down...
Jun  2 20:08:46 ediserv xinetd[2566]: Exiting...
Jun  2 20:09:04 ediserv gconfd (root-2894): GConf server is not in use, shutting down.
Jun  2 20:09:04 ediserv gconfd (root-2894): Exiting
Jun  2 20:09:09 ediserv ntpd[2578]: ntpd exiting on signal 15
Jun  2 20:09:09 ediserv hcid[2310]: Got disconnected from the system message bus
Jun  2 20:09:09 ediserv nm-system-settings: disconnected from the system bus, exiting.
Jun  2 20:09:10 ediserv auditd[2144]: The audit daemon is exiting.
Jun  2 20:09:10 ediserv kernel: audit(1433290150.008:1077): audit_pid=0 old=2144 by auid=4294967295
Jun  2 20:09:10 ediserv pcscd: pcscdaemon.c:572:signal_trap() Preparing for suicide
Jun  2 20:09:10 ediserv pcscd: hotplug_libusb.c:376:HPRescanUsbBus() Hotplug stopped
Jun  2 20:09:11 ediserv pcscd: readerfactory.c:1379:RFCleanupReaders() entering cleaning function
Jun  2 20:09:11 ediserv pcscd: pcscdaemon.c:532:at_exit() cleaning /var/run
Jun  2 20:09:11 ediserv kernel: Kernel logging (proc) stopped.
Jun  2 20:09:11 ediserv kernel: Kernel log daemon terminating.
Jun  2 20:09:12 ediserv exiting on signal 15
Jun  2 21:26:28 ediserv syslogd 1.4.1: restart.
 
Same results:
Code:
Jun 03 20:09:43 INFO: status: 53% (7978483712/15032385536), sparse 3% (527204352), duration 572, 11/11 MB/s
Jun 03 20:09:53 INFO: status: 54% (8126726144/15032385536), sparse 3% (531656704), duration 582, 14/14 MB/s
Jun 03 20:10:05 INFO: status: 55% (8271167488/15032385536), sparse 3% (535670784), duration 594, 12/11 MB/s
Jun 03 20:10:14 INFO: status: 56% (8419213312/15032385536), sparse 3% (537821184), duration 603, 16/16 MB/s
Jun 03 20:10:18 ERROR: VM 100 not running
Jun 03 20:10:18 INFO: aborting backup job
Jun 03 20:10:18 ERROR: VM 100 not running
Jun 03 20:10:19 ERROR: Backup of VM 100 failed - VM 100 not running

Other VMs have also stopped (some, not all). The next VM is no longer running when its turn to back up comes:
Code:
Jun 03 20:10:19 INFO: Starting Backup of VM 101 (qemu)
Jun 03 20:10:19 INFO: status = stopped
Jun 03 20:10:19 INFO: update VM 101: -lock backup
Jun 03 20:10:19 INFO: backup mode: stop
Jun 03 20:10:19 INFO: ionice priority: 7
Jun 03 20:10:19 INFO: creating archive '/drive2/vzdumpbackups/dump/vzdump-qemu-101-2015_06_03-20_10_19.vma.gz'
Jun 03 20:10:19 INFO: starting kvm to execute backup task

Updated package versions:
Code:
# pveversion -v
proxmox-ve-2.6.32: 3.3-147 (running kernel: 2.6.32-37-pve)
pve-manager: 3.4-6 (running version: 3.4-6/102d4547)
pve-kernel-2.6.32-37-pve: 2.6.32-150
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-17
qemu-server: 3.4-6
pve-firmware: 1.1-4
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-33
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-10
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1
 
It seems that 4 out of the 5 qemu VMs crash roughly 50% of the way through the first VM's snapshot backup. Not sure why. Any ideas? Could it be related to the fact that the disks were created under 3.1 and are now running under 3.4?
 
I have tried suspend mode (--mode suspend) and still get inconsistent results. The backups now finish, but 3 out of the 5 VMs crash partway through the backup of the first VM. The machine has the standard LVM setup plus a second drive added for additional space (not LVM). Here is the info and the backup results:
Code:
VMID  NAME     DISK LOCATION  BACKUP MODE  STATUS AT START  RUNNING AFTER
100   ediserv  drive2         suspend      running          YES
101   rciserv  local          stop         stopped          NO
102   devserv  local          stop         stopped          NO
103   zmail    drive2         stop         stopped          NO
110   arrow    drive2         suspend      running          YES


The first VM backup starts at 20:00:09. Here is a section of /var/log/messages from the first crashed VM:
Code:
Jun  8 20:08:30 rciserv nmbd[2707]: [2015/06/08 20:08:30, 0] nmbd/nmbd_become_lmb.c:become_local_master_stage2(396)
Jun  8 20:08:30 rciserv nmbd[2707]:   *****
Jun  8 20:08:30 rciserv nmbd[2707]:
Jun  8 20:08:30 rciserv nmbd[2707]:   Samba name server RCISERV is now a local master browser for workgroup MYGROUP on subnet 192.168.7.106
Jun  8 20:08:30 rciserv nmbd[2707]:
Jun  8 20:08:30 rciserv nmbd[2707]:   *****
Jun  8 20:08:47 rciserv shutdown[22122]: shutting down for system halt
Jun  8 20:08:49 rciserv smartd[2896]: smartd received signal 15: Terminated
Jun  8 20:08:49 rciserv smartd[2896]: smartd is exiting (exit status 0)
Jun  8 20:08:49 rciserv avahi-daemon[2814]: Got SIGTERM, quitting.
Jun  8 20:08:49 rciserv avahi-daemon[2814]: Leaving mDNS multicast group on interface eth0.IPv6 with address fe80::646f:e6ff:fec7:4699.
Jun  8 20:08:49 rciserv avahi-daemon[2814]: Leaving mDNS multicast group on interface eth0.IPv4 with address 192.168.7.106.
Jun  8 20:08:51 rciserv nmbd[2707]: [2015/06/08 20:08:51, 0] nmbd/nmbd.c:terminate(58)
Jun  8 20:08:51 rciserv nmbd[2707]:   Got SIGTERM: going down...
Jun  8 20:08:52 rciserv xinetd[2604]: Exiting...
Jun  8 20:09:13 rciserv ntpd[3062]: ntpd exiting on signal 15
Jun  8 20:09:14 rciserv nm-system-settings: disconnected from the system bus, exiting.
Jun  8 20:09:14 rciserv gconfd (root-2901): GConf server is not in use, shutting down.
Jun  8 20:09:14 rciserv gconfd (root-2901): Exiting
Jun  8 20:09:14 rciserv auditd[2157]: The audit daemon is exiting.
Jun  8 20:09:14 rciserv kernel: audit(1433808554.559:518): audit_pid=0 old=2157 by auid=4294967295
Jun  8 20:09:14 rciserv pcscd: pcscdaemon.c:572:signal_trap() Preparing for suicide
 
Update - more info on disks. I have added the disk type to my info; not sure if raw vs qcow2 has any effect:

Backup:
vzdump --mode suspend --compress gzip --storage drive2_backups --maxfiles 1 --mailto scott@me.com --all

BACKUP RESULTS (--mode suspend)
Code:
VMID  NAME     DISK LOCATION  DISK TYPE  BACKUP MODE  STATUS AT START  RUNNING AFTER
100   ediserv  drive2         qcow2      suspend      running          YES
101   rciserv  local          raw        stop         stopped          NO
102   devserv  local          raw        stop         stopped          NO
103   zmail    drive2         qcow2      stop         stopped          NO
110   arrow    drive2         qcow2      suspend      running          YES
 
I found it - /var/log/secure showed a root login followed by a shutdown. This was caused by an old server (still running for testing) that was shutting things down. Duh!
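For anyone chasing something similar, this is roughly how it shows up inside the guest (log file names vary by distro):
Code:
# look for root logins and shutdown entries around the time the VM "crashed"
grep -iE 'session opened for user root|shutdown' /var/log/secure
# the halt also shows up in the login history
last -x shutdown reboot | head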
 
