LXC backup sometimes forgets to remove snapshot information from the container config

dirk.nilius
When backing up an LXC container via ZFS snapshots, a [vzdump] section is added to the container configuration file. After the backup finishes, the ZFS snapshot is correctly destroyed, but sometimes the [vzdump] section stays in the config file.
When the next backup starts, there is a fatal error saying that a snapshot already exists but cannot be deleted. After removing the added section from the config file, backups work again.

Known bug?
 
It looks like there is a bug in writing the LXC .conf file during backup. Now I have a container whose config file is completely empty after a backup. After a backup there are also many XXX.conf.tmp.XXX files in the LXC config folder that never get deleted.
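Something like this should list the leftovers on a node (a quick sketch on my side; the /etc/pve/nodes/<node>/lxc path layout is an assumption):

#!/usr/bin/perl
# quick sketch: list vzdump leftovers across all nodes --
# orphaned .conf.tmp.* files and configs that still carry a [vzdump] section
use strict;
use warnings;

print "leftover tmp file: $_\n" for glob "/etc/pve/nodes/*/lxc/*.conf.tmp.*";

for my $conf (glob "/etc/pve/nodes/*/lxc/*.conf") {
    open(my $fh, '<', $conf) or next;
    my $raw = do { local $/; <$fh> };    # slurp the whole config
    close($fh);
    print "stale [vzdump] section in: $conf\n" if $raw =~ /^\[vzdump\]/m;
}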
 

Can you specify this case in detail so that we can reproduce it in our lab? If yes, please file a bug at https://bugzilla.proxmox.com
 
I have temporarily activated the pve-no-subscription repository instead of the enterprise one. I have some other nasty problems after upgrading to 4.0, so I hope to get fixes a little faster.

proxmox-ve: 4.0-19 (running kernel: 4.2.3-2-pve)
pve-manager: 4.0-57 (running version: 4.0-57/cc7c2b53)
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.2.3-2-pve: 4.2.3-19
lvm2: 2.02.116-pve1
corosync-pve: 2.3.5-1
libqb0: 0.17.2-1
pve-cluster: 4.0-24
qemu-server: 4.0-35
pve-firmware: 1.1-7
libpve-common-perl: 4.0-36
libpve-access-control: 4.0-9
libpve-storage-perl: 4.0-29
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-12
pve-container: 1.0-20
pve-firewall: 2.0-13
pve-ha-manager: 1.0-13
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.4-3
lxcfs: 0.10-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve4~jessie

The leftover XXX.conf.tmp.XXX files appear on all nodes. But the problem with the retained snapshot information in the .conf file is specific to one container. Here is the config:

arch: amd64
cpulimit: 8
cpuunits: 1024
hostname: ckc-b-ober.ckc.de
memory: 8192
net0: bridge=vmbr0,firewall=1,gw=192.168.13.253,hwaddr=36:31:32:35:62:63,ip=192.168.13.67/24,name=eth0,type=veth
onboot: 1
ostype: ubuntu
parent: vzdump
rootfs: zfs:subvol-240-disk-1,size=250G
swap: 12288

Any ideas?
 
The config file looks like this after a backup:

arch: amd64
cpulimit: 8
cpuunits: 1024
hostname: ckc-b-ober.ckc.de
memory: 8192
net0: bridge=vmbr0,firewall=1,gw=192.168.13.253,hwaddr=36:31:32:35:62:63,ip=192.168.13.67/24,name=eth0,type=veth
onboot: 1
ostype: ubuntu
parent: vzdump
rootfs: zfs:subvol-240-disk-1,size=250G
swap: 12288

[vzdump]
#vzdump backup snapshot
arch: amd64
cpulimit: 8
cpuunits: 1024
hostname: ckc-b-ober.ckc.de
memory: 8192
net0: bridge=vmbr0,firewall=1,gw=192.168.13.253,hwaddr=36:31:32:35:62:63,ip=192.168.13.67/24,name=eth0,type=veth
onboot: 1
ostype: ubuntu
rootfs: zfs:subvol-240-disk-1,size=250G
snaptime: 1446850992
swap: 12288



But the ZFS snapshot is correctly deleted:

zfs list -t snapshot
no datasets available
 
When trying to make a new backup:

INFO: starting new backup job: vzdump 240 --remove 0 --node ckc-b-p0005 --compress lzo --mode snapshot --storage FreeNAS
INFO: Starting Backup of VM 240 (lxc)
INFO: status = running
INFO: found old vzdump snapshot (force removal)
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot snapshot
ERROR: Backup of VM 240 failed - snapshot name 'vzdump' already used
INFO: Backup job finished with errors
TASK ERROR: job errors

But deleting the snapshot within the web interface doesn't work.
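Since the GUI refuses to delete it, the manual cleanup can be scripted roughly like this (just a sketch on my side, not an official tool; take a copy of the config first, vmid 240 as in this thread):

#!/usr/bin/perl
# one-off cleanup sketch: drop the stale [vzdump] snapshot section and the
# dangling "parent: vzdump" line from a container config
use strict;
use warnings;

my $vmid = 240;
my $conf = "/etc/pve/lxc/$vmid.conf";    # points to the local node's config dir

open(my $in, '<', $conf) or die "cannot read $conf: $!";
my $raw = do { local $/; <$in> };        # slurp the whole config
close($in);

$raw =~ s/^parent:\s*vzdump\n//m;        # dangling pointer to the gone snapshot
$raw =~ s/\n+\[vzdump\].*\z/\n/s;        # the whole [vzdump] section up to EOF

open(my $out, '>', $conf) or die "cannot write $conf: $!";
print $out $raw;
close($out);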
 
It's a 3-node cluster.

Quorum information
------------------
Date: Sat Nov 7 11:35:09 2015
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000002
Ring ID: 6500
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 192.168.13.12 (local)
0x00000003 1 192.168.13.13
0x00000001 1 192.168.13.14
 
I am unable to reproduce that here. Are there any errors in /var/log/syslog? Can you reproduce that with smaller disks, or is this related to the large 250GB disk?
 
Here is the log from the whole day when the backup ran:

Nov 7 00:00:04 ckc-b-p0005 vzdump[32477]: <root@pam> starting task UPID:ckc-b-p0005:00007EF8:01BF8802:563D30F4:vzdump::root@pam:
Nov 7 01:36:57 ckc-b-p0005 vzdump[32477]: <root@pam> end task UPID:ckc-b-p0005:00007EF8:01BF8802:563D30F4:vzdump::root@pam: OK
Nov 7 03:55:08 ckc-b-p0005 pvedailycron[17742]: <root@pam> starting task UPID:ckc-b-p0005:00004595:01D50D81:563D680C:aptupdate::root@pam:
Nov 7 03:55:20 ckc-b-p0005 pvedailycron[17742]: <root@pam> end task UPID:ckc-b-p0005:00004595:01D50D81:563D680C:aptupdate::root@pam: OK
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150240] pveproxy D ffff88081f056a00 0 3520 1 0x00000004
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150245] ffff88067be07dc8 0000000000000082 ffff88081bb69900 ffff8803d0375780
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150248] ffff880803ad0000 ffff88067be08000 ffff880803ad00ac ffff8803d0375780
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150250] 00000000ffffffff ffff880803ad00b0 ffff88067be07de8 ffffffff818036a7
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150252] Call Trace:
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150262] [<ffffffff818036a7>] schedule+0x37/0x80
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150265] [<ffffffff8180395e>] schedule_preempt_disabled+0xe/0x10
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150267] [<ffffffff81805665>] __mutex_lock_slowpath+0xb5/0x120
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150270] [<ffffffff818056f3>] mutex_lock+0x23/0x40
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150274] [<ffffffff8120d19a>] filename_create+0x7a/0x160
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150276] [<ffffffff8120e133>] SyS_mkdir+0x53/0x100
Nov 7 06:27:50 ckc-b-p0005 kernel: [316555.150279] [<ffffffff818077b2>] entry_SYSCALL_64_fastpath+0x16/0x75
Nov 7 06:29:38 ckc-b-p0005 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="1104" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Nov 7 06:31:40 ckc-b-p0005 rsyslogd0: action 'action 17' resumed (module 'builtin:ompipe') [try http://www.rsyslog.com/e/0 ]
Nov 7 06:31:40 ckc-b-p0005 rsyslogd-2359: action 'action 17' resumed (module 'builtin:ompipe') [try http://www.rsyslog.com/e/2359 ]
Nov 7 06:53:47 ckc-b-p0005 rsyslogd-2007: action 'action 17' suspended, next retry is Sat Nov 7 06:54:17 2015 [try http://www.rsyslog.com/e/2007 ]

As you can see, there are only the backup start and finish messages. No errors :(

I have another container with a 250 GB disk and no problems there. I have not seen this on any other container. But maybe this is also related to the cluster node: all nodes have different hardware, and the other large-disk container runs on another node.
 
Now another problem comes in. After a container backup the snapshot information was correctly removed, but the wrong disk information was written:

local:202/vm-202-disk-1.raw,size=25G

instead of

zfs:subvol-202-disk-1,size=25G

So the conclusion is that the config file handler is broken. If you have problems reproducing this, give me a link to the git repository of this tool and I will do a review.
 
Dietmar,

I found a few minutes to look into the code. First I found a problem that causes the old snapshot not to be force-removed.

PVE/VZDump/LXC.pm:131

PVE::LXC::snapshot_delete($vmid, 'vzdump', 0);

calls snapshot_delete with force 0 instead of 1. As a result, in

PVE/LXC.pm from line 1827:

eval {
    # this call fails because the ZFS snapshot is already gone
    PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snapname);
};
my $err = $@;    # $err is now set

if (!$err || ($err && $force)) {    # $err is set but $force is not, so this is false
    lock_container($vmid, 10, $del_snap);
    if ($err) {
        die "Can't delete snapshot: $vmid $snapname $err\n";
    }
}

-> so neither the config cleanup ($del_snap) nor the die is reached; the function returns without an error and the [vzdump] section stays in the config.


So I would think that

PVE::LXC::snapshot_delete($vmid, 'vzdump', 1);

will remove the snapshot from the config but raise an error during the backup because of this:

if (!$err || ($err && $force)) {    # $err is set and now $force as well
    lock_container($vmid, 10, $del_snap);
    if ($err) {    # $err is set because the ZFS snapshot could not be destroyed
        die "Can't delete snapshot: $vmid $snapname $err\n";
    }
}

This is not so nice. But if the backup runs again it should work.

That should solve the problem that, once this issue has occurred, you have to clean up the config manually.

But for now I can't see why the config file becomes broken in the first place.
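For what it's worth, this is roughly the behaviour I would expect from the force path (just a sketch against the PVE/LXC.pm snippet above, reusing its variables; not a tested patch):

eval {
    PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snapname);
};
my $err = $@;

if ($err && !$force) {
    # without force a failed storage-level delete should still abort
    die "Can't delete snapshot: $vmid $snapname $err\n";
}

# on success, or when force is set, always drop the [vzdump] section from
# the config; if the ZFS snapshot was already gone, just warn about it
warn "snapshot '$snapname': storage removal failed: $err" if $err;
lock_container($vmid, 10, $del_snap);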
 
