LXC backup sometimes forgets to remove snapshot information from the container config

dirk.nilius
When backing up an LXC container via ZFS snapshots, a [vzdump] section is added to the container configuration file. After the backup finishes, the ZFS snapshot is correctly destroyed, but sometimes the [vzdump] section stays in the config file.
When the next backup starts, there is a fatal error saying that a snapshot already exists but cannot be deleted. After removing the added section from the config file, backups work again.

Known bug?
 
It looks like there is a bug in writing the LXC .conf file during backup. Now I have a container whose config file is completely empty after a backup. After a backup there are also many XXX.conf.tmp.XXX files in the LXC config folder that never get deleted.
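Something like this should list the leftovers on a node (a quick sketch on my side; the /etc/pve/nodes/<node>/lxc path layout is an assumption):

#!/usr/bin/perl
# quick sketch: list vzdump leftovers across all nodes --
# orphaned .conf.tmp.* files and configs that still carry a [vzdump] section
use strict;
use warnings;

print "leftover tmp file: $_\n" for glob "/etc/pve/nodes/*/lxc/*.conf.tmp.*";

for my $conf (glob "/etc/pve/nodes/*/lxc/*.conf") {
    open(my $fh, '<', $conf) or next;
    my $raw = do { local $/; <$fh> };    # slurp the whole config
    close($fh);
    print "stale [vzdump] section in: $conf\n" if $raw =~ /^\[vzdump\]/m;
}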
 

Can you specify this case in detail so that we can reproduce it in our lab? If yes, please file a bug at https://bugzilla.proxmox.com
 
I have temporarily activated the pve-no-subscription repository instead of the enterprise one. I have some other nasty problems after upgrading to 4.0, so I hope to get fixes a little faster.

proxmox-ve: 4.0-19 (running kernel: 4.2.3-2-pve)
pve-manager: 4.0-57 (running version: 4.0-57/cc7c2b53)
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.2.3-2-pve: 4.2.3-19
lvm2: 2.02.116-pve1
corosync-pve: 2.3.5-1
libqb0: 0.17.2-1
pve-cluster: 4.0-24
qemu-server: 4.0-35
pve-firmware: 1.1-7
libpve-common-perl: 4.0-36
libpve-access-control: 4.0-9
libpve-storage-perl: 4.0-29
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-12
pve-container: 1.0-20
pve-firewall: 2.0-13
pve-ha-manager: 1.0-13
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.4-3
lxcfs: 0.10-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve4~jessie

The leftover XXX.conf.tmp.XXX files appear on all nodes. But the problem with the retained snapshot information in the .conf file is specific to one container. Here is the config:

arch: amd64
cpulimit: 8
cpuunits: 1024
hostname: ckc-b-ober.ckc.de
memory: 8192
net0: bridge=vmbr0,firewall=1,gw=192.168.13.253,hwaddr=36:31:32:35:62:63,ip=192.168.13.67/24,name=eth0,type=veth
onboot: 1
ostype: ubuntu
parent: vzdump
rootfs: zfs:subvol-240-disk-1,size=250G
swap: 12288

Any ideas?
 
The config file looks like this after a backup:

arch: amd64
cpulimit: 8
cpuunits: 1024
hostname: ckc-b-ober.ckc.de
memory: 8192
net0: bridge=vmbr0,firewall=1,gw=192.168.13.253,hwaddr=36:31:32:35:62:63,ip=192.168.13.67/24,name=eth0,type=veth
onboot: 1
ostype: ubuntu
parent: vzdump
rootfs: zfs:subvol-240-disk-1,size=250G
swap: 12288

[vzdump]
#vzdump backup snapshot
arch: amd64
cpulimit: 8
cpuunits: 1024
hostname: ckc-b-ober.ckc.de
memory: 8192
net0: bridge=vmbr0,firewall=1,gw=192.168.13.253,hwaddr=36:31:32:35:62:63,ip=192.168.13.67/24,name=eth0,type=veth
onboot: 1
ostype: ubuntu
rootfs: zfs:subvol-240-disk-1,size=250G
snaptime: 1446850992
swap: 12288



But the ZFS snapshot is correctly deleted:

zfs list -t snapshot
no datasets available
 
When trying to make a new backup:

INFO: starting new backup job: vzdump 240 --remove 0 --node ckc-b-p0005 --compress lzo --mode snapshot --storage FreeNAS
INFO: Starting Backup of VM 240 (lxc)
INFO: status = running
INFO: found old vzdump snapshot (force removal)
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot snapshot
ERROR: Backup of VM 240 failed - snapshot name 'vzdump' already used
INFO: Backup job finished with errors
TASK ERROR: job errors

But deleting the snapshot within the web interface doesn't work.
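Since the GUI refuses to delete it, the manual cleanup can be scripted roughly like this (just a sketch on my side, not an official tool; take a copy of the config first, vmid 240 as in this thread):

#!/usr/bin/perl
# one-off cleanup sketch: drop the stale [vzdump] snapshot section and the
# dangling "parent: vzdump" line from a container config
use strict;
use warnings;

my $vmid = 240;
my $conf = "/etc/pve/lxc/$vmid.conf";    # points to the local node's config dir

open(my $in, '<', $conf) or die "cannot read $conf: $!";
my $raw = do { local $/; <$in> };        # slurp the whole config
close($in);

$raw =~ s/^parent:\s*vzdump\n//m;        # dangling pointer to the gone snapshot
$raw =~ s/\n+\[vzdump\].*\z/\n/s;        # the whole [vzdump] section up to EOF

open(my $out, '>', $conf) or die "cannot write $conf: $!";
print $out $raw;
close($out);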
 
It's a 3-node cluster.

Quorum information
------------------
Date: Sat Nov 7 11:35:09 2015
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000002
Ring ID: 6500
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 192.168.13.12 (local)
0x00000003 1 192.168.13.13
0x00000001 1 192.168.13.14
 
I am unable to reproduce that here. Are there any errors in /var/log/syslog? Can you reproduce that with smaller disks, or is this related to the large 250GB disk?
 
Here is the log from the whole day when the backup ran:

Nov 7 00:00:04 ckc-b-p0005 vzdump[32477]: <root@pam> starting task UPID:ckc-b-p0005:00007EF8:01BF8802:563D30F4:vzdump::root@pam:
Nov 7 01:36:57 ckc-b-p0005 vzdump[32477]: <root@pam> end task UPID:ckc-b-p0005:00007EF8:01BF8802:563D30F4:vzdump::root@pam: OK
Nov 7 03:55:08 ckc-b-p0005 pvedailycron[17742]: <root@pam> starting task UPID:ckc-b-p0005:00004595:01D50D81:563D680C:aptupdate::root@pam:
Nov 7 03:55:20 ckc-b-p0005 pvedailycron[17742]: <root@pam> end task UPID:ckc-b-p0005:00004595:01D50D81:563D680C:aptupdate::root@pam: OK
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150240] pveproxy D ffff88081f056a00 0 3520 1 0x00000004
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150245] ffff88067be07dc8 0000000000000082 ffff88081bb69900 ffff8803d0375780
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150248] ffff880803ad0000 ffff88067be08000 ffff880803ad00ac ffff8803d0375780
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150250] 00000000ffffffff ffff880803ad00b0 ffff88067be07de8 ffffffff818036a7
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150252] Call Trace:
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150262] [<ffffffff818036a7>] schedule+0x37/0x80
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150265] [<ffffffff8180395e>] schedule_preempt_disabled+0xe/0x10
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150267] [<ffffffff81805665>] __mutex_lock_slowpath+0xb5/0x120
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150270] [<ffffffff818056f3>] mutex_lock+0x23/0x40
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150274] [<ffffffff8120d19a>] filename_create+0x7a/0x160
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150276] [<ffffffff8120e133>] SyS_mkdir+0x53/0x100
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150279] [<ffffffff818077b2>] entry_SYSCALL_64_fastpath+0x16/0x75
Nov 7 06:29:38 ckc-b-p0005 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="1104" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Nov 7 06:31:40 ckc-b-p0005 rsyslogd0: action 'action 17' resumed (module 'builtin:ompipe') [try http://www.rsyslog.com/e/0 ]
Nov 7 06:31:40 ckc-b-p0005 rsyslogd-2359: action 'action 17' resumed (module 'builtin:ompipe') [try http://www.rsyslog.com/e/2359 ]
Nov 7 06:53:47 ckc-b-p0005 rsyslogd-2007: action 'action 17' suspended, next retry is Sat Nov 7 06:54:17 2015 [try http://www.rsyslog.com/e/2007 ]

As you can see, there are only the backup start and finish messages. No errors :(

I have another container with a 250 GB disk and no problems there. I have not seen this on any other container. But maybe this is also related to the cluster node: all nodes have different hardware, and the other large-disk container runs on another node.
 
Now another problem comes in. After a container backup the snapshot information was correctly removed, but the wrong disk information was written:

local:202/vm-202-disk-1.raw,size=25G

instead of

zfs:subvol-202-disk-1,size=25G

So the conclusion is that the config file handler is broken. If you have problems reproducing this, give me a link to the git repository of this tool and I will do a review.
 
Dietmar,

I found a few minutes to look into the code. First I found a problem that causes the old snapshot not to be force-removed.

PVE/VZDump/LXC.pm:131

PVE::LXC::snapshot_delete($vmid, 'vzdump', 0);

calls snapshot_delete with force 0 instead of 1. As a result, in

PVE/LXC.pm from line 1827:

eval {
    # this call fails because the ZFS snapshot is already gone
    PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snapname);
};
my $err = $@;    # $err is now set

if (!$err || ($err && $force)) {    # $err is set but $force is not, so this is false
    lock_container($vmid, 10, $del_snap);
    if ($err) {
        die "Can't delete snapshot: $vmid $snapname $err\n";
    }
}

-> so neither the config cleanup ($del_snap) nor the die is reached; the function returns without an error and the [vzdump] section stays in the config.


So I would think that

PVE::LXC::snapshot_delete($vmid, 'vzdump', 1);

will remove the snapshot from the config but raise an error during the backup because of this:

if (!$err || ($err && $force)) {    # $err is set and now $force as well
    lock_container($vmid, 10, $del_snap);
    if ($err) {    # $err is set because the ZFS snapshot could not be destroyed
        die "Can't delete snapshot: $vmid $snapname $err\n";
    }
}

This is not so nice. But if the backup runs again it should work.

That should solve the problem that, once this issue has occurred, you have to clean up the config manually.

But for now I can't see why the config file becomes broken in the first place.
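For what it's worth, this is roughly the behaviour I would expect from the force path (just a sketch against the PVE/LXC.pm snippet above, reusing its variables; not a tested patch):

eval {
    PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snapname);
};
my $err = $@;

if ($err && !$force) {
    # without force a failed storage-level delete should still abort
    die "Can't delete snapshot: $vmid $snapname $err\n";
}

# on success, or when force is set, always drop the [vzdump] section from
# the config; if the ZFS snapshot was already gone, just warn about it
warn "snapshot '$snapname': storage removal failed: $err" if $err;
lock_container($vmid, 10, $del_snap);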
 
