[SOLVED] LXC Backup randomly hangs at suspend

pa657 · Dec 28, 2015

Hello,
Not always but almost always, the backup hangs at this step :

Code:

INFO: Starting Backup of VM 112 (lxc)
INFO: status = running
INFO: mode failure - some volumes does not support snapshots
INFO: trying 'suspend' mode instead
INFO: backup mode: suspend
INFO: ionice priority: 7
INFO: starting first sync /proc/31901/root// to /disk2/backup/vz/dump/vzdump-lxc-112-2015_12_27-18_41_46.tmp
INFO: Number of files: 557,820 (reg: 485,738, dir: 62,260, link: 9,783, dev: 33, special: 6)
INFO: Number of created files: 557,819 (reg: 485,738, dir: 62,259, link: 9,783, dev: 33, special: 6)
INFO: Number of deleted files: 0
INFO: Number of regular files transferred: 468,545
INFO: Total file size: 25,628,021,190 bytes
INFO: Total transferred file size: 23,971,267,967 bytes
INFO: Literal data: 23,971,267,967 bytes
INFO: Matched data: 0 bytes
INFO: File list size: 23,459,109
INFO: File list generation time: 0.008 seconds
INFO: File list transfer time: 0.000 seconds
INFO: Total bytes sent: 24,016,188,421
INFO: Total bytes received: 11,432,920
INFO: sent 24,016,188,421 bytes received 11,432,920 bytes 57,551,188.84 bytes/sec
INFO: total size is 25,628,021,190 speedup is 1.07
INFO: first sync finished (417 seconds)
INFO: suspend vm

Thanks for your help.

peterx · Dec 28, 2015

Same problem here.
KVM's are backuped without any problem.
LXC containers are backup once on a clean nfs share, next backups fail, at first on the last lxc container and then on the backup of the first.
vzdump.conf changed:

# vzdump default settings

tmpdir: /tmp
#dumpdir: DIR
#storage: STORAGE_ID
#mode: snapshot|suspend|stop
#bwlimit: KBPS
#ionice: PRI
#lockwait: MINUTES
#stopwait: MINUTES
#size: MB
#maxfiles: N
#script: FILENAME
#exclude-path: PATHLIST
#pigz: N:

peterx · Jan 4, 2016

Nobody having the solution?
Peter

ugenk · Jan 10, 2016

same story.
i've looked with strace on lxc-freeze:

Process 25173 attached
restart_syscall(<... resuming interrupted call ...>) = 0
pipe([3, 4]) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5cb76e5a50) = 32292
close(4) = 0
read(3, "\10\0\0\0", 4) = 4
read(3, "FREEZING", 8) = 8
close(3) = 0
wait4(32292, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 32292
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=32292, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7ffd4aed18d0) = 0
pipe([3, 4]) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5cb76e5a50) = 32309
close(4) = 0
read(3, "\10\0\0\0", 4) = 4
read(3, "FREEZING", 8) = 8
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=32309, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
close(3) = 0
wait4(32309, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 32309
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7ffd4aed18d0) = 0
pipe([3, 4]) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5cb76e5a50) = 32362
close(4) = 0
read(3, "\10\0\0\0", 4) = 4
read(3, "FREEZING", 8) = 8
close(3) = 0
wait4(32362, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 32362
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=32362, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---

peterx · Jan 10, 2016

Still the same problem here. Did a complete reinstall of proxmox 4.1, with all the latest updates. Vm and lxc vt's backup fine two times. Then the vm's are still well backupped but the lyxcct's start to fail at suspend vm. At first the second one then the first one.
After a restart of the proxmox machine it backups well a few times and then the problems start again.
Peter

ugenk · Jan 10, 2016

peterx, I have exactly the same problems.

windinternet · Jan 10, 2016

Maybe some details to allow for reproduction. What storage is used for the LXC containers? Are they all running or are there also some containers that are already stopped before the backups?

peterx · Jan 10, 2016

In my case on a NFS share and all the vm and ct are running

windinternet · Jan 10, 2016

peterx said:
In my case on a NFS share and all the vm and ct are running

on NFS, the containers themselves or the backup?

ugenk · Jan 10, 2016

containers - on regular
/dev/mapper/pve-data on /var/lib/vz type ext4 (rw,relatime,data=ordered)

backups - on NFS:
xxx:/mnt/disk1/proxmox2 on /mnt/pve/backup1 type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=x.x.x.x,mountvers=3,mountport=60126,mountproto=udp,local_lock=none,addr=x.x.x.x)

windinternet · Jan 10, 2016

Do the containers you backup use connections to the same NFS server? Can you reproduce the problem by just using vzdump repeatedly on one container?

Code:

vzdump containerid -mode suspend

It just seems as if the backup itself freezes also. You can check the state of a container with lxc-info -n containerid.

cFire · Jan 10, 2016

It looks like I'm having the same issue with a similar setup. Except my backups are written to an external drive rather than an NFS.
lxc-info shows "state: FREEZING"
A section of the strace output it attached below. Best I can tell from a quick look at it is that it is just looping over some status updating.

[pid 15672] write(4, "\10\0\0\0", 4) = 4
[pid 15672] write(4, "FREEZING", 8) = 8
[pid 15672] exit_group(0) = ?
[pid 15672] +++ exited with 0 +++
close(4) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=15672, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
read(3, "\10\0\0\0", 4) = 4
read(3, "FREEZING", 8) = 8
close(3) = 0
wait4(15672, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 15672
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7ffea10109a0) = 0
pipe([3, 4]) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5e7fe2ea50) = 15674
Process 15674 attached
[pid 15674] set_robust_list(0x7f5e7fe2ea60, 24) = 0
[pid 15674] socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0) = 5
[pid 15674] connect(5, {sa_family=AF_LOCAL, sun_path="/sys/fs/cgroup/cgmanager/sock"}, 31) = 0
[pid 15674] fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)
[pid 15674] fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 15674] geteuid() = 0
[pid 15674] getsockname(5, {sa_family=AF_LOCAL, NULL}, [2]) = 0
[pid 15674] poll([{fd=5, events=POLLOUT}], 1, 0) = 1 ([{fd=5, revents=POLLOUT}])
[pid 15674] sendto(5, "\0", 1, MSG_NOSIGNAL, NULL, 0) = 1
[pid 15674] sendto(5, "AUTH EXTERNAL 30\r\n", 18, MSG_NOSIGNAL, NULL, 0) = 18
[pid 15674] poll([{fd=5, events=POLLIN}], 1, 4294967295) = 1 ([{fd=5, revents=POLLIN}])
[pid 15674] read(5, "OK 77cd8f7839a3483509ae732a568a5"..., 2048) = 37
[pid 15674] poll([{fd=5, events=POLLOUT}], 1, 4294967295) = 1 ([{fd=5, revents=POLLOUT}])
[pid 15674] sendto(5, "NEGOTIATE_UNIX_FD\r\n", 19, MSG_NOSIGNAL, NULL, 0) = 19
[pid 15674] poll([{fd=5, events=POLLIN}], 1, 4294967295) = 1 ([{fd=5, revents=POLLIN}])
[pid 15674] read(5, "AGREE_UNIX_FD\r\n", 2048) = 15
[pid 15674] poll([{fd=5, events=POLLOUT}], 1, 4294967295) = 1 ([{fd=5, revents=POLLOUT}])
[pid 15674] sendto(5, "BEGIN\r\n", 7, MSG_NOSIGNAL, NULL, 0) = 7
[pid 15674] poll([{fd=5, events=POLLIN|POLLOUT}], 1, 4294967295) = 1 ([{fd=5, revents=POLLOUT}])
[pid 15674] sendmsg(5, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\0018\0\0\0\1\0\0\0h\0\0\0\1\1o\0\36\0\0\0/org/lin"..., 120}, {" \0\0\0org.linuxcontainers.cgmanage"..., 56}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 176
[pid 15674] poll([{fd=5, events=POLLIN}], 1, 25000) = 1 ([{fd=5, revents=POLLIN}])
[pid 15674] recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\10\0\0\0\1\0\0\0\17\0\0\0\5\1u\0\1\0\0\0\10\1g\0\1v\0\0"..., 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 40
[pid 15674] recvmsg(5, 0x7ffea1010610, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
[pid 15674] socket(PF_LOCAL, SOCK_STREAM, 0) = 8
[pid 15674] connect(8, {sa_family=AF_LOCAL, sun_path=@"/var/lib/lxc/106/command"}, 27) = 0
[pid 15674] getuid() = 0
[pid 15674] getgid() = 0
[pid 15674] sendmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=15674, uid=0, gid=0}}, msg_flags=0}, MSG_NOSIGNAL) = 16
[pid 15674] recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0\0\0\0\0j\36\0\0\0\0\0\0", 16}], msg_controllen=0, msg_flags=0}, 0) = 16
[pid 15674] close(8) = 0
[pid 15674] sendmsg(5, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1\20\0\0\0\2\0\0\0x\0\0\0\1\1o\0\36\0\0\0/org/lin"..., 136}, {"\7\0\0\0freezer\0j\36\0\0", 16}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 152
[pid 15674] poll([{fd=5, events=POLLIN}], 1, 25000) = 1 ([{fd=5, revents=POLLIN}])
[pid 15674] recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\r\0\0\0\2\0\0\0\17\0\0\0\5\1u\0\2\0\0\0\10\1g\0\1s\0\0"..., 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 45
[pid 15674] recvmsg(5, 0x7ffea1010690, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
[pid 15674] sendmsg(5, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1\34\0\0\0\3\0\0\0y\0\0\0\1\1o\0\36\0\0\0/org/lin"..., 144}, {"\7\0\0\0freezer\0\4\0\0\0/lxc\0\0\0\0:=\0\0", 28}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 172
[pid 15674] poll([{fd=5, events=POLLIN}], 1, 25000) = 1 ([{fd=5, revents=POLLIN}])
[pid 15674] recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\0\0\0\0\3\0\0\0\10\0\0\0\5\1u\0\3\0\0\0", 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 24
[pid 15674] recvmsg(5, 0x7ffea10106a0, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
[pid 15674] sendmsg(5, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1&\0\0\0\4\0\0\0y\0\0\0\1\1o\0\36\0\0\0/org/lin"..., 144}, {"\7\0\0\0freezer\0\3\0\0\000106\0\r\0\0\0freezer."..., 38}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 182
[pid 15674] poll([{fd=5, events=POLLIN}], 1, 25000) = 1 ([{fd=5, revents=POLLIN}])
[pid 15674] recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\r\0\0\0\4\0\0\0\17\0\0\0\5\1u\0\4\0\0\0\10\1g\0\1s\0\0"..., 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 45
[pid 15674] recvmsg(5, 0x7ffea10106b0, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
[pid 15674] close(5) = 0
[pid 15674] write(4, "\10\0\0\0", 4) = 4
[pid 15674] write(4, "FREEZING", 8) = 8
[pid 15674] exit_group(0) = ?
[pid 15674] +++ exited with 0 +++
close(4) = 0

I can pastebin longer strace output if that'll help in any way.

ugenk · Jan 10, 2016

problem definitely is not related to storage engine - it is in lxc-freeze.

ugenk · Jan 10, 2016

btw,
root@proxmox2:~# cat /etc/vzdump.conf | grep -v ^#
tmpdir: /tmp
dumpdir: /tmp

deranjer · Jan 10, 2016

Same issue here, with NFS storage for backups only, the LXC containers run on an LVM striped disk array.

windinternet · Jan 10, 2016

Nominally the freeze should be near instantaneous but in your cases it seems to stretch out indefinitely. The lxc-freeze command does not return until the state becomes FROZEN instead of FREEZING.

Are there processes in uninterruptible state before the freeze in the container? Any stat columns with a D?

Code:

ps -xf -ostat,comm,pid,ppid

cFire · Jan 11, 2016

You may be on to something there. I ran the backup jobs with -mode stop once and the suspend and snapshot options seem to work again after that. (So far at least.)
I'll keep an eye on it and see if next time it fails and processes are in the D-state

windinternet · Jan 11, 2016

Then you will have to look at the process list from the host console, because any process entering the containers cgroup would "freeze" immediately.

I looked at the source code of lxc and the lxc-freeze program just loops and checks once per second if the entire group has become FROZEN. If you kill the lxc-freeze command, the backup should continue. Proxmox should kill it too after a timeout, because if the D state processes don't wake after a while, they likely won't give the rsync any trouble.

cFire · Jan 11, 2016

The checking once a second looks to be consistent with the strace output I saw. However proxmox definitely does not kill it in this case within any reasonable timeout. I've let the backup process run for several hours on a few occasions, without making any progress past the container freezing.

windinternet · Jan 12, 2016

marigo said:
I have the same issue when a -stop lxc backup starts. This is the log Proxmox produces. I had no issues with Proxmox4.0.

This is a different issue. Please start a new thread for it.

[SOLVED] LXC Backup randomly hangs at suspend

Active Member

Renowned Member

Renowned Member

Member

Renowned Member

Member

Member

Renowned Member

Member

Member

Member

New Member

Member

Member

Active Member

Member

New Member

Member

New Member

Member