[SOLVED] LXC Backup randomly hangs at suspend

I am certain the bugs will be fixed. Backup bugs are evil, and they are probably related to other issues that are already being worked on.

So LXC backups need to be done attended, and I will schedule accordingly: KVM backups can run at any time, LXC backups only when someone is checking on their progress.

Also - pve-zsync backups are working great. In seconds I get a ZFS snapshot of the VMs onto a backup PVE system. Restoring is not hard - I've done it and have notes to refer to.
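For anyone who wants to set up the same thing, creating a job looks roughly like this (a sketch from memory - the host, dataset and options below are placeholders; check pve-zsync's help output for the exact syntax):

Code:
# create a recurring sync job for guest 100 towards the backup host, keeping 7 snapshots
pve-zsync create --source 100 --dest backup-host:rpool/backup --name daily-100 --maxsnap 7 --verbose

# or push a one-off sync manually
pve-zsync sync --source 100 --dest backup-host:rpool/backup --verbose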
 
@whyitsderp

No, it is not solved yet. It seems to happen mostly on systems with NFS. Does your system use NFS mounts also?

Snapshot and suspend mode both use lxc-freeze. I suppose you could temporarily divert the real binary to /usr/bin/lxc-freeze.real and replace /usr/bin/lxc-freeze with a wrapper that enforces a timeout, like this:

Code:
#!/bin/bash
# wrapper: give the real (diverted) lxc-freeze at most 5 seconds, then give up
timeout 5 /usr/bin/lxc-freeze.real "$@"

If the container isn't frozen within 5 seconds, it probably won't be frozen in an hour either, and whatever it is waiting on all this time is not likely to resolve during the tiny snapshot interval anyway.
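One way to put that diversion in place and undo it again (a sketch using dpkg-divert so the renamed binary is tracked; the wrapper filename is just an example, and you should try this on a test host first):

Code:
# move the real binary aside and register the diversion
dpkg-divert --local --divert /usr/bin/lxc-freeze.real --rename /usr/bin/lxc-freeze

# install the wrapper script from above (saved e.g. as lxc-freeze-wrapper.sh) in its place
install -m 755 lxc-freeze-wrapper.sh /usr/bin/lxc-freeze

# later, to undo: remove the wrapper, then let dpkg-divert move the real binary back
rm /usr/bin/lxc-freeze
dpkg-divert --remove --rename /usr/bin/lxc-freeze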

No, not using NFS storage; this is a local disk that everything gets dumped to. As others have mentioned, it seems to be hit or miss - some VMs work during a scheduled backup, others do not. *sniffle* Kinda makes me miss 3.4, but it is what it is, and I'm sure the Proxmox dev team is working on the issue.
 
Same problem here: my daily LXC backup randomly freezes when trying to back up one specific container in suspend mode.

Here is the log output:

Code:
Jan 21 02:27:47 INFO: Starting Backup of VM 157 (lxc)
Jan 21 02:27:47 INFO: status = running
Jan 21 02:27:47 INFO: mode failure - some volumes does not support snapshots
Jan 21 02:27:47 INFO: trying 'suspend' mode instead
Jan 21 02:27:47 INFO: backup mode: suspend
Jan 21 02:27:47 INFO: ionice priority: 7
Jan 21 02:27:47 INFO: starting first sync /proc/15205/root// to /var/lib/vz/dump/dump/vzdump-lxc-157-2016_01_21-02_27_47.tmp
Jan 21 02:28:58 INFO: Number of files: 56,075 (reg: 45,292, dir: 4,849, link: 5,897, dev: 2, special: 35)
Jan 21 02:28:58 INFO: Number of created files: 56,074 (reg: 45,292, dir: 4,848, link: 5,897, dev: 2, special: 35)
Jan 21 02:28:58 INFO: Number of deleted files: 0
Jan 21 02:28:58 INFO: Number of regular files transferred: 45,283
Jan 21 02:28:58 INFO: Total file size: 1,690,029,085 bytes
Jan 21 02:28:58 INFO: Total transferred file size: 1,685,071,849 bytes
Jan 21 02:28:58 INFO: Literal data: 1,685,071,849 bytes
Jan 21 02:28:58 INFO: Matched data: 0 bytes
Jan 21 02:28:58 INFO: File list size: 1,113,978
Jan 21 02:28:58 INFO: File list generation time: 0.001 seconds
Jan 21 02:28:58 INFO: File list transfer time: 0.000 seconds
Jan 21 02:28:58 INFO: Total bytes sent: 1,688,913,160
Jan 21 02:28:58 INFO: Total bytes received: 907,124
Jan 21 02:28:58 INFO: sent 1,688,913,160 bytes  received 907,124 bytes  23,633,850.13 bytes/sec
Jan 21 02:28:58 INFO: total size is 1,690,029,085  speedup is 1.00
Jan 21 02:28:58 INFO: first sync finished (71 seconds)
Jan 21 02:28:58 INFO: suspend vm
 
Some follow-up information on what I observed when this happened again: the processes running inside the container get put into the D state by lxc-freeze. However, lxc-freeze itself does not enter the D state when this issue occurs.

Example:
root@freya:~# ps axl | awk '$10 ~ /D/'
1 0 330 24927 20 0 82224 44 refrig Ds ? 0:14 /usr/sbin/apache2 -k start
4 0 334 330 20 0 23988 216 refrig Dsl ? 0:00 PassengerWatchdog
0 0 339 334 20 0 117572 1976 refrig Dl ? 2:07 PassengerHelperAgent
...etc.
root@freya:~# ps axl | grep freeze
0 0 29954 26193 20 0 30240 2960 hrtime S ? 0:00 lxc-freeze -n 141

If lxc-freeze is killed at this point, the backup job skips to the next container. lxc-unfreeze or a container restart is then required to get the skipped container running again.
When the failed container is restarted, a newly submitted backup job usually succeeds. When it is only lxc-unfreeze'd, the retry usually fails again.
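For reference, the recovery described above comes down to something like this (CTID 141 from the example; standard LXC/pct commands):

Code:
# thaw the container that was left frozen after killing lxc-freeze
lxc-unfreeze -n 141

# or, since a merely thawed container tends to fail again on retry, restart it instead
pct stop 141
pct start 141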

Disclaimer: These are all my personal observations from tests with a sample size of just one or two. Your mileage may vary.
 
I'm at my day job so I can't provide many details, but I experienced this last night on a box I'm assembling. No NFS, but backing up to the same volume as the container lives on.

This happened to one of my containers, while the other two (and two VMs) kept running happily.

I should note that the problematic LXC container had also frozen a few hours prior to the backup; lxc-attach did not respond, and I had to restart the container manually.
 
Hello,

We are seeing LXC container suspend backups freeze intermittently as well. We're using the latest Proxmox 4.1. Any further updates on this issue?
 
I found many AppArmor errors in my /var/log/syslog, googled them, and placed an exception in /etc/ (sorry for the vagueness; I'm at work with no access to my home search history), and I have not had a freeze on suspend since. It's only been two days, and I'm going to let the system burn in for a while longer, but that's one more data point for us.
 
I found many AppArmor errors in my /var/log/syslog, googled them, and placed an exception in /etc/ (sorry for the vagueness; I'm at work with no access to my home search history), and I have not had a freeze on suspend since. It's only been two days, and I'm going to let the system burn in for a while longer, but that's one more data point for us.

Could you tell me where in /etc/ you placed the exception?
 
Not sure if this is related, but I just had an LXC container freeze during a snapshot-mode backup.
I noticed the machine wasn't responding, so I checked what was going on - after a bit of searching I noticed an lxc-freeze command for the container (the backup had been idling for a good 4 hours; it usually takes 20 minutes to complete).

INFO: Starting Backup of VM 102 (lxc)
INFO: status = running
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot snapshot
// I killed lxc-freeze here
ERROR: Backup of VM 102 failed - VM is locked (snapshot)
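Clearing the leftover lock afterwards is roughly this (a sketch with the standard pct commands; the snapshot name is an assumption):

Code:
# release the snapshot/backup lock left behind by the failed task
pct unlock 102

# if a stale vzdump snapshot remains, list and remove it
pct listsnapshot 102
pct delsnapshot 102 vzdump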

I dug around in the code a bit (PVE/VZDump.pm), and the only time it should try to suspend/freeze the container is when a mysterious snapshot_count is set,
which... gets set if the container has multiple disk volumes?

Code:
# zfs list
NAME                      USED  AVAIL  REFER  MOUNTPOINT
local                     385G  1.33T   104K  /local
local/subvol-100-disk-1  57.5G   143G  57.5G  /local/subvol-100-disk-1
local/subvol-101-disk-1    96K  50.0G    96K  /local/subvol-101-disk-1
local/subvol-101-disk-2  2.53G  47.5G  2.53G  /local/subvol-101-disk-2
local/subvol-102-disk-1   108G  32.0G   108G  /local/subvol-102-disk-1
local/subvol-103-disk-1  5.61G  44.4G  5.61G  /local/subvol-103-disk-1
local/subvol-104-disk-1  27.6G  36.4G  27.6G  /local/subvol-104-disk-1
local/subvol-105-disk-1  28.9G  3.08G  28.9G  /local/subvol-105-disk-1

Not sure why container 101 has two disks; the UI only lists disk-2 as used.
As far as I understand, each VM has an independent task, so even if it thinks VM 101 has two disks, that should not influence the other VMs, correct?

I did another snapshot manually just now, and that one did not seem to freeze the VM.
 
Could you post the complete config of this container?

Code:
pct config 102

Are you running an up-to-date version of PVE? Could you post the output of
Code:
pveversion -v
 
#pct config 102
arch: amd64
cpulimit: 4
cpuunits: 1024
hostname:xxx
memory: 12288
net0: bridge=vmbr0,gw=xxx,hwaddr=xxx,ip=xxx,name=eth0,type=veth
ostype: debian
rootfs: zfsLocal:subvol-102-disk-1,size=140G
swap: 12800
# pveversion -v
proxmox-ve: 4.1-34 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-5 (running version: 4.1-5/f910ef5c)
pve-kernel-4.2.6-1-pve: 4.2.6-34
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-31
qemu-server: 4.0-49
pve-firmware: 1.1-7
libpve-common-perl: 4.0-45
libpve-access-control: 4.0-11
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-3
pve-container: 1.0-39
pve-firewall: 2.0-15
pve-ha-manager: 1.0-19
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-6
lxcfs: 0.13-pve3
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie


Should be up to date; it's a fresh install - I installed it 5 days ago and just checked for updates again.

It happened again today, by the way - that's 2 out of 10 runs (two snapshots a day).
 
lxc-freeze is called for every snapshot backup, regardless of the number of mounted volumes/mountpoints (snapshotting anything other than the rootfs is not supported at the moment anyway). This is handled in the snapshot code itself, not in the vzdump code. Since the problem doesn't seem to occur every time you make a backup (snapshot), you could try running atop or a similar tool to record the state of all processes and correlate that with the times when lxc-freeze fails/hangs. Maybe it is possible to find the culprit then.
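A minimal way to run such a recording (assuming atop is installed; the file path and interval are just examples):

Code:
# write a raw sample of all process states every 10 seconds
atop -w /var/log/atop-backup.raw 10 &

# after a hang, replay the recording and step through the samples around the backup window
atop -r /var/log/atop-backup.raw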
 
Ah, I see - it freezes the container while waiting for the filesystem to create the snapshot volume.
I still had the pstree output saved from when it froze:
├─cron───cron───vzdump───task UPID:host:───lxc-freeze # <-- hanging lxc-freeze
├─lxc-start───init─┬─atd # <-- container in question
│ ├─cron───cron───sh───php5───sh─┬─awk
│ │ └─ps
│ ├─dbus-daemon
│ ├─exim4
│ ├─2*[getty]
│ ├─nginx───8*[nginx]
│ ├─node─┬─{SignalSender}
│ │ └─4*[{node}]
│ ├─php5-fpm───62*[php5-fpm]
│ ├─redis-server───2*[{redis-server}]
│ ├─rpcbind
│ ├─rsyslogd─┬─{rs:main Q:Reg}
│ │ └─2*[{rsyslogd}]
│ ├─sshd───sshd
│ └─zabbix_agentd───5*[zabbix_agentd]

I don't see any odd processes running really

I'll be sure to take more detailed info of the processes if it happens again
 
It's happening again.
Stack trace of where lxc-freeze is stuck:
(gdb) bt
#0 0x00007fcd89ebc060 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fcd89ebbf14 in __sleep (seconds=0, seconds@entry=1) at ../sysdeps/unix/sysv/linux/sleep.c:137
#2 0x00007fcd8ac1d202 in do_freeze_thaw (freeze=freeze@entry=1, name=0x1ece300 "102", lxcpath=0x1ece320 "/var/lib/lxc") at freezer.c:74
#3 0x00007fcd8ac1d3df in lxc_freeze (name=<optimized out>, lxcpath=<optimized out>) at freezer.c:81
#4 0x00007fcd8ac49bbc in do_lxcapi_freeze (c=0x1eceb80) at lxccontainer.c:424
#5 lxcapi_freeze (c=0x1eceb80) at lxccontainer.c:430
#6 0x0000000000400b1e in main (argc=<optimized out>, argv=<optimized out>) at lxc_freeze.c:83
It seems to be waiting in the loop at https://github.com/lxc/lxc/blob/master/src/lxc/freezer.c#L71
strace
Code:
# strace -p 13563
Process 13563 attached
restart_syscall(<... resuming interrupted call ...>) = 0
pipe([3, 4])  = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fcd8b088a50) = 24322
close(4)  = 0
read(3, "\10\0\0\0", 4)  = 4
read(3, "FREEZING", 8)  = 8
close(3)  = 0
wait4(24322, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 24322
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=24322, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7ffdf91fb380)  = 0
pipe([3, 4])  = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fcd8b088a50) = 24323
close(4)  = 0
read(3, "\10\0\0\0", 4)  = 4
read(3, "FREEZING", 8)  = 8
close(3)  = 0
wait4(24323, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 24323
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=24323, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7ffdf91fb380)  = 0
pipe([3, 4])  = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fcd8b088a50) = 24324
close(4)  = 0
read(3, "\10\0\0\0", 4)  = 4
read(3, "FREEZING", 8)  = 8
close(3)  = 0
wait4(24324, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 24324
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=24324, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
Processes: as far as I can tell everything's frozen? Full process list continued here: https://gist.github.com/imerr/ecddf68d177c7f50f88b#file-lxcfrezeproxmox-L51
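For what it's worth, the state that loop keeps polling can also be checked by hand. On a PVE 4.x host with the cgroup v1 freezer it should look roughly like the following (paths and container name are assumptions based on the setup above, not taken from the logs):

Code:
# current freezer state of container 102: THAWED, FREEZING (still trying) or FROZEN (what lxc-freeze waits for)
cat /sys/fs/cgroup/freezer/lxc/102/freezer.state

# list tasks in that cgroup that are not yet in the frozen (D) state
for pid in $(cat /sys/fs/cgroup/freezer/lxc/102/tasks); do
    state=$(awk '/^State:/ {print $2}' /proc/$pid/status 2>/dev/null)
    [ "$state" != "D" ] && echo "$pid $state $(cat /proc/$pid/comm 2>/dev/null)"
done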
 
@iMer: Thanks for the additional information.

Could you tell us more about the container in question? Which distribution and version is used? What are the usual network and I/O load patterns? Is there any observable difference in network and I/O load during backups in general, or specifically during the backups that fail?

To dig down further, could you try running ps in a loop (saving the output), then in parallel manually run lxc-freeze -n <CTID> and post the results? This should allow us to see in which order processes are frozen and whether anything unusual occurs.
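Something along these lines should do (a rough sketch - the CTID, sample interval and output file are placeholders):

Code:
# on the host: sample all processes a few times per second
while true; do
    date '+%s.%N' >> /root/freeze-ps.log
    ps -eo pid,ppid,stat,wchan:20,comm >> /root/freeze-ps.log
    echo '----' >> /root/freeze-ps.log
    sleep 0.2
done &

# in parallel, trigger the freeze manually
lxc-freeze -n <CTID>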

Access to an (anonymized) copy of the container might help in finding the root cause. If that is not possible, you could also try to create a copy of the container (keeping everything as close to the original as possible!), disable network access (e.g., by DROPping all input/output using the PVE firewall) and check whether the problem persists. If it does not, the culprit is probably one of the network-related services (nginx, zabbix, redis, ?).

@other affected users:

Could you describe your setups in more detail? I.e., which services are running in the container? Which distributions are you using? What storage are you using for the containers themselves? Has anybody observed this issue with a container that does not have (a lot of) network I/O?

Having a setup/configuration that allows us to reproduce this issue reliably would help a lot!
 
The container is a converted OpenVZ Debian Wheezy one; updating it is on my todo list, but I haven't gotten to it yet.

Network traffic to the VM looks fairly tame across the board.
[screenshot: network traffic graph - 2016-02-03_15-01-16_NA9oPhmFPy.png]

The freezes were on the 29th and the 3rd, around 6 am.

As for the ps loop: I couldn't really get much resolution out of it (see attached file).
I just did a bash loop (on the host - or did you mean inside the container?):

Code:
while true
do
    ps options >> ps.txt
    echo '----' >> ps.txt
done

Any other suggestions?

I'll give the copy without network a try and report back if anything happens
 

Attachment: ps.txt (39.2 KB)

Hi iMer and other people taking part in this thread,

Can you please try running your backup to local storage instead of NFS, to see whether the lxc-freeze command succeeds in that case?

It would be interesting to know whether the problem also appears then (my guess is no).
 
