[SOLVED] LXC Backup randomly hangs at suspend

pa657

New Member
Dec 7, 2015
7
0
1
Hello,
Not always but almost always, the backup hangs at this step :

Code:
INFO: Starting Backup of VM 112 (lxc)
INFO: status = running
INFO: mode failure - some volumes does not support snapshots
INFO: trying 'suspend' mode instead
INFO: backup mode: suspend
INFO: ionice priority: 7
INFO: starting first sync /proc/31901/root// to /disk2/backup/vz/dump/vzdump-lxc-112-2015_12_27-18_41_46.tmp
INFO: Number of files: 557,820 (reg: 485,738, dir: 62,260, link: 9,783, dev: 33, special: 6)
INFO: Number of created files: 557,819 (reg: 485,738, dir: 62,259, link: 9,783, dev: 33, special: 6)
INFO: Number of deleted files: 0
INFO: Number of regular files transferred: 468,545
INFO: Total file size: 25,628,021,190 bytes
INFO: Total transferred file size: 23,971,267,967 bytes
INFO: Literal data: 23,971,267,967 bytes
INFO: Matched data: 0 bytes
INFO: File list size: 23,459,109
INFO: File list generation time: 0.008 seconds
INFO: File list transfer time: 0.000 seconds
INFO: Total bytes sent: 24,016,188,421
INFO: Total bytes received: 11,432,920
INFO: sent 24,016,188,421 bytes received 11,432,920 bytes 57,551,188.84 bytes/sec
INFO: total size is 25,628,021,190 speedup is 1.07
INFO: first sync finished (417 seconds)
INFO: suspend vm
Thanks for your help.
 

peterx

Active Member
May 5, 2008
39
1
28
Same problem here.
KVM's are backuped without any problem.
LXC containers are backup once on a clean nfs share, next backups fail, at first on the last lxc container and then on the backup of the first.
vzdump.conf changed:

# vzdump default settings

tmpdir: /tmp
#dumpdir: DIR
#storage: STORAGE_ID
#mode: snapshot|suspend|stop
#bwlimit: KBPS
#ionice: PRI
#lockwait: MINUTES
#stopwait: MINUTES
#size: MB
#maxfiles: N
#script: FILENAME
#exclude-path: PATHLIST
#pigz: N:
 

ugenk

New Member
Jan 10, 2016
6
0
1
33
same story.
i've looked with strace on lxc-freeze:

Process 25173 attached
restart_syscall(<... resuming interrupted call ...>) = 0
pipe([3, 4]) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5cb76e5a50) = 32292
close(4) = 0
read(3, "\10\0\0\0", 4) = 4
read(3, "FREEZING", 8) = 8
close(3) = 0
wait4(32292, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 32292
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=32292, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7ffd4aed18d0) = 0
pipe([3, 4]) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5cb76e5a50) = 32309
close(4) = 0
read(3, "\10\0\0\0", 4) = 4
read(3, "FREEZING", 8) = 8
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=32309, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
close(3) = 0
wait4(32309, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 32309
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7ffd4aed18d0) = 0
pipe([3, 4]) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5cb76e5a50) = 32362
close(4) = 0
read(3, "\10\0\0\0", 4) = 4
read(3, "FREEZING", 8) = 8
close(3) = 0
wait4(32362, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 32362
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=32362, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
 

peterx

Active Member
May 5, 2008
39
1
28
Still the same problem here. Did a complete reinstall of proxmox 4.1, with all the latest updates. Vm and lxc vt's backup fine two times. Then the vm's are still well backupped but the lyxcct's start to fail at suspend vm. At first the second one then the first one.
After a restart of the proxmox machine it backups well a few times and then the problems start again.
Peter
 
  • Like
Reactions: felx

windinternet

Member
Oct 8, 2015
159
7
18
Maybe some details to allow for reproduction. What storage is used for the LXC containers? Are they all running or are there also some containers that are already stopped before the backups?
 

ugenk

New Member
Jan 10, 2016
6
0
1
33
containers - on regular
/dev/mapper/pve-data on /var/lib/vz type ext4 (rw,relatime,data=ordered)

backups - on NFS:
xxx:/mnt/disk1/proxmox2 on /mnt/pve/backup1 type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=x.x.x.x,mountvers=3,mountport=60126,mountproto=udp,local_lock=none,addr=x.x.x.x)
 

windinternet

Member
Oct 8, 2015
159
7
18
Do the containers you backup use connections to the same NFS server? Can you reproduce the problem by just using vzdump repeatedly on one container?

Code:
vzdump containerid -mode suspend
It just seems as if the backup itself freezes also. You can check the state of a container with lxc-info -n containerid.
 

cFire

New Member
Jan 10, 2016
4
2
3
33
It looks like I'm having the same issue with a similar setup. Except my backups are written to an external drive rather than an NFS.
lxc-info shows "state: FREEZING"
A section of the strace output it attached below. Best I can tell from a quick look at it is that it is just looping over some status updating.
[pid 15672] write(4, "\10\0\0\0", 4) = 4
[pid 15672] write(4, "FREEZING", 8) = 8
[pid 15672] exit_group(0) = ?
[pid 15672] +++ exited with 0 +++
close(4) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=15672, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
read(3, "\10\0\0\0", 4) = 4
read(3, "FREEZING", 8) = 8
close(3) = 0
wait4(15672, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 15672
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7ffea10109a0) = 0
pipe([3, 4]) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5e7fe2ea50) = 15674
Process 15674 attached
[pid 15674] set_robust_list(0x7f5e7fe2ea60, 24) = 0
[pid 15674] socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0) = 5
[pid 15674] connect(5, {sa_family=AF_LOCAL, sun_path="/sys/fs/cgroup/cgmanager/sock"}, 31) = 0
[pid 15674] fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)
[pid 15674] fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 15674] geteuid() = 0
[pid 15674] getsockname(5, {sa_family=AF_LOCAL, NULL}, [2]) = 0
[pid 15674] poll([{fd=5, events=POLLOUT}], 1, 0) = 1 ([{fd=5, revents=POLLOUT}])
[pid 15674] sendto(5, "\0", 1, MSG_NOSIGNAL, NULL, 0) = 1
[pid 15674] sendto(5, "AUTH EXTERNAL 30\r\n", 18, MSG_NOSIGNAL, NULL, 0) = 18
[pid 15674] poll([{fd=5, events=POLLIN}], 1, 4294967295) = 1 ([{fd=5, revents=POLLIN}])
[pid 15674] read(5, "OK 77cd8f7839a3483509ae732a568a5"..., 2048) = 37
[pid 15674] poll([{fd=5, events=POLLOUT}], 1, 4294967295) = 1 ([{fd=5, revents=POLLOUT}])
[pid 15674] sendto(5, "NEGOTIATE_UNIX_FD\r\n", 19, MSG_NOSIGNAL, NULL, 0) = 19
[pid 15674] poll([{fd=5, events=POLLIN}], 1, 4294967295) = 1 ([{fd=5, revents=POLLIN}])
[pid 15674] read(5, "AGREE_UNIX_FD\r\n", 2048) = 15
[pid 15674] poll([{fd=5, events=POLLOUT}], 1, 4294967295) = 1 ([{fd=5, revents=POLLOUT}])
[pid 15674] sendto(5, "BEGIN\r\n", 7, MSG_NOSIGNAL, NULL, 0) = 7
[pid 15674] poll([{fd=5, events=POLLIN|POLLOUT}], 1, 4294967295) = 1 ([{fd=5, revents=POLLOUT}])
[pid 15674] sendmsg(5, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\0018\0\0\0\1\0\0\0h\0\0\0\1\1o\0\36\0\0\0/org/lin"..., 120}, {" \0\0\0org.linuxcontainers.cgmanage"..., 56}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 176
[pid 15674] poll([{fd=5, events=POLLIN}], 1, 25000) = 1 ([{fd=5, revents=POLLIN}])
[pid 15674] recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\10\0\0\0\1\0\0\0\17\0\0\0\5\1u\0\1\0\0\0\10\1g\0\1v\0\0"..., 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 40
[pid 15674] recvmsg(5, 0x7ffea1010610, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
[pid 15674] socket(PF_LOCAL, SOCK_STREAM, 0) = 8
[pid 15674] connect(8, {sa_family=AF_LOCAL, sun_path=@"/var/lib/lxc/106/command"}, 27) = 0
[pid 15674] getuid() = 0
[pid 15674] getgid() = 0
[pid 15674] sendmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=15674, uid=0, gid=0}}, msg_flags=0}, MSG_NOSIGNAL) = 16
[pid 15674] recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0\0\0\0\0j\36\0\0\0\0\0\0", 16}], msg_controllen=0, msg_flags=0}, 0) = 16
[pid 15674] close(8) = 0
[pid 15674] sendmsg(5, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1\20\0\0\0\2\0\0\0x\0\0\0\1\1o\0\36\0\0\0/org/lin"..., 136}, {"\7\0\0\0freezer\0j\36\0\0", 16}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 152
[pid 15674] poll([{fd=5, events=POLLIN}], 1, 25000) = 1 ([{fd=5, revents=POLLIN}])
[pid 15674] recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\r\0\0\0\2\0\0\0\17\0\0\0\5\1u\0\2\0\0\0\10\1g\0\1s\0\0"..., 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 45
[pid 15674] recvmsg(5, 0x7ffea1010690, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
[pid 15674] sendmsg(5, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1\34\0\0\0\3\0\0\0y\0\0\0\1\1o\0\36\0\0\0/org/lin"..., 144}, {"\7\0\0\0freezer\0\4\0\0\0/lxc\0\0\0\0:=\0\0", 28}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 172
[pid 15674] poll([{fd=5, events=POLLIN}], 1, 25000) = 1 ([{fd=5, revents=POLLIN}])
[pid 15674] recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\0\0\0\0\3\0\0\0\10\0\0\0\5\1u\0\3\0\0\0", 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 24
[pid 15674] recvmsg(5, 0x7ffea10106a0, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
[pid 15674] sendmsg(5, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1&\0\0\0\4\0\0\0y\0\0\0\1\1o\0\36\0\0\0/org/lin"..., 144}, {"\7\0\0\0freezer\0\3\0\0\000106\0\r\0\0\0freezer."..., 38}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 182
[pid 15674] poll([{fd=5, events=POLLIN}], 1, 25000) = 1 ([{fd=5, revents=POLLIN}])
[pid 15674] recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\r\0\0\0\4\0\0\0\17\0\0\0\5\1u\0\4\0\0\0\10\1g\0\1s\0\0"..., 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 45
[pid 15674] recvmsg(5, 0x7ffea10106b0, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
[pid 15674] close(5) = 0
[pid 15674] write(4, "\10\0\0\0", 4) = 4
[pid 15674] write(4, "FREEZING", 8) = 8
[pid 15674] exit_group(0) = ?
[pid 15674] +++ exited with 0 +++
close(4) = 0
I can pastebin longer strace output if that'll help in any way.
 
  • Like
Reactions: ugenk

ugenk

New Member
Jan 10, 2016
6
0
1
33
problem definitely is not related to storage engine - it is in lxc-freeze.
 

ugenk

New Member
Jan 10, 2016
6
0
1
33
btw,
root@proxmox2:~# cat /etc/vzdump.conf | grep -v ^#
tmpdir: /tmp
dumpdir: /tmp
 

deranjer

Member
Dec 28, 2010
50
1
8
Same issue here, with NFS storage for backups only, the LXC containers run on an LVM striped disk array.
 

windinternet

Member
Oct 8, 2015
159
7
18
Nominally the freeze should be near instantaneous but in your cases it seems to stretch out indefinitely. The lxc-freeze command does not return until the state becomes FROZEN instead of FREEZING.

Are there processes in uninterruptible state before the freeze in the container? Any stat columns with a D?
Code:
ps -xf -ostat,comm,pid,ppid
 

cFire

New Member
Jan 10, 2016
4
2
3
33
You may be on to something there. I ran the backup jobs with -mode stop once and the suspend and snapshot options seem to work again after that. (So far at least.)
I'll keep an eye on it and see if next time it fails and processes are in the D-state
 

windinternet

Member
Oct 8, 2015
159
7
18
Then you will have to look at the process list from the host console, because any process entering the containers cgroup would "freeze" immediately.

I looked at the source code of lxc and the lxc-freeze program just loops and checks once per second if the entire group has become FROZEN. If you kill the lxc-freeze command, the backup should continue. Proxmox should kill it too after a timeout, because if the D state processes don't wake after a while, they likely won't give the rsync any trouble.
 

cFire

New Member
Jan 10, 2016
4
2
3
33
The checking once a second looks to be consistent with the strace output I saw. However proxmox definitely does not kill it in this case within any reasonable timeout. I've let the backup process run for several hours on a few occasions, without making any progress past the container freezing.
 
  • Like
Reactions: deranjer

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE and Proxmox Mail Gateway. We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get your own in 60 seconds.

Buy now!