[SOLVED] LXC Backup randomly hangs at suspend

Discussion in 'Proxmox VE: Installation and configuration' started by pa657, Dec 28, 2015.

  1. pa657

    pa657 New Member

    Joined:
    Dec 7, 2015
    Messages:
    7
    Likes Received:
    0
    Hello,
    Not always but almost always, the backup hangs at this step :

    Code:
    INFO: Starting Backup of VM 112 (lxc)
    INFO: status = running
    INFO: mode failure - some volumes does not support snapshots
    INFO: trying 'suspend' mode instead
    INFO: backup mode: suspend
    INFO: ionice priority: 7
    INFO: starting first sync /proc/31901/root// to /disk2/backup/vz/dump/vzdump-lxc-112-2015_12_27-18_41_46.tmp
    INFO: Number of files: 557,820 (reg: 485,738, dir: 62,260, link: 9,783, dev: 33, special: 6)
    INFO: Number of created files: 557,819 (reg: 485,738, dir: 62,259, link: 9,783, dev: 33, special: 6)
    INFO: Number of deleted files: 0
    INFO: Number of regular files transferred: 468,545
    INFO: Total file size: 25,628,021,190 bytes
    INFO: Total transferred file size: 23,971,267,967 bytes
    INFO: Literal data: 23,971,267,967 bytes
    INFO: Matched data: 0 bytes
    INFO: File list size: 23,459,109
    INFO: File list generation time: 0.008 seconds
    INFO: File list transfer time: 0.000 seconds
    INFO: Total bytes sent: 24,016,188,421
    INFO: Total bytes received: 11,432,920
    INFO: sent 24,016,188,421 bytes received 11,432,920 bytes 57,551,188.84 bytes/sec
    INFO: total size is 25,628,021,190 speedup is 1.07
    INFO: first sync finished (417 seconds)
    INFO: suspend vm
    Thanks for your help.
     
  2. peterx

    peterx Member

    Joined:
    May 5, 2008
    Messages:
    39
    Likes Received:
    1
    Same problem here.
    KVM's are backuped without any problem.
    LXC containers are backup once on a clean nfs share, next backups fail, at first on the last lxc container and then on the backup of the first.
    vzdump.conf changed:

    # vzdump default settings

    tmpdir: /tmp
    #dumpdir: DIR
    #storage: STORAGE_ID
    #mode: snapshot|suspend|stop
    #bwlimit: KBPS
    #ionice: PRI
    #lockwait: MINUTES
    #stopwait: MINUTES
    #size: MB
    #maxfiles: N
    #script: FILENAME
    #exclude-path: PATHLIST
    #pigz: N:
     
  3. peterx

    peterx Member

    Joined:
    May 5, 2008
    Messages:
    39
    Likes Received:
    1
    Nobody having the solution?
    Peter
     
  4. ugenk

    ugenk New Member

    Joined:
    Jan 10, 2016
    Messages:
    6
    Likes Received:
    0
    same story.
    i've looked with strace on lxc-freeze:

    Process 25173 attached
    restart_syscall(<... resuming interrupted call ...>) = 0
    pipe([3, 4]) = 0
    clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5cb76e5a50) = 32292
    close(4) = 0
    read(3, "\10\0\0\0", 4) = 4
    read(3, "FREEZING", 8) = 8
    close(3) = 0
    wait4(32292, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 32292
    --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=32292, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
    rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
    rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
    rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    nanosleep({1, 0}, 0x7ffd4aed18d0) = 0
    pipe([3, 4]) = 0
    clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5cb76e5a50) = 32309
    close(4) = 0
    read(3, "\10\0\0\0", 4) = 4
    read(3, "FREEZING", 8) = 8
    --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=32309, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
    close(3) = 0
    wait4(32309, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 32309
    rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
    rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
    rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    nanosleep({1, 0}, 0x7ffd4aed18d0) = 0
    pipe([3, 4]) = 0
    clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5cb76e5a50) = 32362
    close(4) = 0
    read(3, "\10\0\0\0", 4) = 4
    read(3, "FREEZING", 8) = 8
    close(3) = 0
    wait4(32362, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 32362
    --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=32362, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
     
  5. peterx

    peterx Member

    Joined:
    May 5, 2008
    Messages:
    39
    Likes Received:
    1
    Still the same problem here. Did a complete reinstall of proxmox 4.1, with all the latest updates. Vm and lxc vt's backup fine two times. Then the vm's are still well backupped but the lyxcct's start to fail at suspend vm. At first the second one then the first one.
    After a restart of the proxmox machine it backups well a few times and then the problems start again.
    Peter
     
    felx likes this.
  6. ugenk

    ugenk New Member

    Joined:
    Jan 10, 2016
    Messages:
    6
    Likes Received:
    0
    peterx, I have exactly the same problems.
     
  7. windinternet

    windinternet Member

    Joined:
    Oct 8, 2015
    Messages:
    159
    Likes Received:
    7
    Maybe some details to allow for reproduction. What storage is used for the LXC containers? Are they all running or are there also some containers that are already stopped before the backups?
     
  8. peterx

    peterx Member

    Joined:
    May 5, 2008
    Messages:
    39
    Likes Received:
    1
    In my case on a NFS share and all the vm and ct are running
     
  9. windinternet

    windinternet Member

    Joined:
    Oct 8, 2015
    Messages:
    159
    Likes Received:
    7
    on NFS, the containers themselves or the backup?
     
  10. ugenk

    ugenk New Member

    Joined:
    Jan 10, 2016
    Messages:
    6
    Likes Received:
    0
    containers - on regular
    /dev/mapper/pve-data on /var/lib/vz type ext4 (rw,relatime,data=ordered)

    backups - on NFS:
    xxx:/mnt/disk1/proxmox2 on /mnt/pve/backup1 type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=x.x.x.x,mountvers=3,mountport=60126,mountproto=udp,local_lock=none,addr=x.x.x.x)
     
  11. windinternet

    windinternet Member

    Joined:
    Oct 8, 2015
    Messages:
    159
    Likes Received:
    7
    Do the containers you backup use connections to the same NFS server? Can you reproduce the problem by just using vzdump repeatedly on one container?

    Code:
    vzdump containerid -mode suspend
    
    It just seems as if the backup itself freezes also. You can check the state of a container with lxc-info -n containerid.
     
  12. cFire

    cFire New Member

    Joined:
    Jan 10, 2016
    Messages:
    4
    Likes Received:
    2
    It looks like I'm having the same issue with a similar setup. Except my backups are written to an external drive rather than an NFS.
    lxc-info shows "state: FREEZING"
    A section of the strace output it attached below. Best I can tell from a quick look at it is that it is just looping over some status updating.
    [pid 15672] write(4, "\10\0\0\0", 4) = 4
    [pid 15672] write(4, "FREEZING", 8) = 8
    [pid 15672] exit_group(0) = ?
    [pid 15672] +++ exited with 0 +++
    close(4) = 0
    --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=15672, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
    read(3, "\10\0\0\0", 4) = 4
    read(3, "FREEZING", 8) = 8
    close(3) = 0
    wait4(15672, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 15672
    rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
    rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
    rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    nanosleep({1, 0}, 0x7ffea10109a0) = 0
    pipe([3, 4]) = 0
    clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5e7fe2ea50) = 15674
    Process 15674 attached
    [pid 15674] set_robust_list(0x7f5e7fe2ea60, 24) = 0
    [pid 15674] socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0) = 5
    [pid 15674] connect(5, {sa_family=AF_LOCAL, sun_path="/sys/fs/cgroup/cgmanager/sock"}, 31) = 0
    [pid 15674] fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)
    [pid 15674] fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
    [pid 15674] geteuid() = 0
    [pid 15674] getsockname(5, {sa_family=AF_LOCAL, NULL}, [2]) = 0
    [pid 15674] poll([{fd=5, events=POLLOUT}], 1, 0) = 1 ([{fd=5, revents=POLLOUT}])
    [pid 15674] sendto(5, "\0", 1, MSG_NOSIGNAL, NULL, 0) = 1
    [pid 15674] sendto(5, "AUTH EXTERNAL 30\r\n", 18, MSG_NOSIGNAL, NULL, 0) = 18
    [pid 15674] poll([{fd=5, events=POLLIN}], 1, 4294967295) = 1 ([{fd=5, revents=POLLIN}])
    [pid 15674] read(5, "OK 77cd8f7839a3483509ae732a568a5"..., 2048) = 37
    [pid 15674] poll([{fd=5, events=POLLOUT}], 1, 4294967295) = 1 ([{fd=5, revents=POLLOUT}])
    [pid 15674] sendto(5, "NEGOTIATE_UNIX_FD\r\n", 19, MSG_NOSIGNAL, NULL, 0) = 19
    [pid 15674] poll([{fd=5, events=POLLIN}], 1, 4294967295) = 1 ([{fd=5, revents=POLLIN}])
    [pid 15674] read(5, "AGREE_UNIX_FD\r\n", 2048) = 15
    [pid 15674] poll([{fd=5, events=POLLOUT}], 1, 4294967295) = 1 ([{fd=5, revents=POLLOUT}])
    [pid 15674] sendto(5, "BEGIN\r\n", 7, MSG_NOSIGNAL, NULL, 0) = 7
    [pid 15674] poll([{fd=5, events=POLLIN|POLLOUT}], 1, 4294967295) = 1 ([{fd=5, revents=POLLOUT}])
    [pid 15674] sendmsg(5, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\0018\0\0\0\1\0\0\0h\0\0\0\1\1o\0\36\0\0\0/org/lin"..., 120}, {" \0\0\0org.linuxcontainers.cgmanage"..., 56}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 176
    [pid 15674] poll([{fd=5, events=POLLIN}], 1, 25000) = 1 ([{fd=5, revents=POLLIN}])
    [pid 15674] recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\10\0\0\0\1\0\0\0\17\0\0\0\5\1u\0\1\0\0\0\10\1g\0\1v\0\0"..., 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 40
    [pid 15674] recvmsg(5, 0x7ffea1010610, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
    [pid 15674] socket(PF_LOCAL, SOCK_STREAM, 0) = 8
    [pid 15674] connect(8, {sa_family=AF_LOCAL, sun_path=@"/var/lib/lxc/106/command"}, 27) = 0
    [pid 15674] getuid() = 0
    [pid 15674] getgid() = 0
    [pid 15674] sendmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=15674, uid=0, gid=0}}, msg_flags=0}, MSG_NOSIGNAL) = 16
    [pid 15674] recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0\0\0\0\0j\36\0\0\0\0\0\0", 16}], msg_controllen=0, msg_flags=0}, 0) = 16
    [pid 15674] close(8) = 0
    [pid 15674] sendmsg(5, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1\20\0\0\0\2\0\0\0x\0\0\0\1\1o\0\36\0\0\0/org/lin"..., 136}, {"\7\0\0\0freezer\0j\36\0\0", 16}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 152
    [pid 15674] poll([{fd=5, events=POLLIN}], 1, 25000) = 1 ([{fd=5, revents=POLLIN}])
    [pid 15674] recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\r\0\0\0\2\0\0\0\17\0\0\0\5\1u\0\2\0\0\0\10\1g\0\1s\0\0"..., 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 45
    [pid 15674] recvmsg(5, 0x7ffea1010690, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
    [pid 15674] sendmsg(5, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1\34\0\0\0\3\0\0\0y\0\0\0\1\1o\0\36\0\0\0/org/lin"..., 144}, {"\7\0\0\0freezer\0\4\0\0\0/lxc\0\0\0\0:=\0\0", 28}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 172
    [pid 15674] poll([{fd=5, events=POLLIN}], 1, 25000) = 1 ([{fd=5, revents=POLLIN}])
    [pid 15674] recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\0\0\0\0\3\0\0\0\10\0\0\0\5\1u\0\3\0\0\0", 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 24
    [pid 15674] recvmsg(5, 0x7ffea10106a0, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
    [pid 15674] sendmsg(5, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1&\0\0\0\4\0\0\0y\0\0\0\1\1o\0\36\0\0\0/org/lin"..., 144}, {"\7\0\0\0freezer\0\3\0\0\000106\0\r\0\0\0freezer."..., 38}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 182
    [pid 15674] poll([{fd=5, events=POLLIN}], 1, 25000) = 1 ([{fd=5, revents=POLLIN}])
    [pid 15674] recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\r\0\0\0\4\0\0\0\17\0\0\0\5\1u\0\4\0\0\0\10\1g\0\1s\0\0"..., 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 45
    [pid 15674] recvmsg(5, 0x7ffea10106b0, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
    [pid 15674] close(5) = 0
    [pid 15674] write(4, "\10\0\0\0", 4) = 4
    [pid 15674] write(4, "FREEZING", 8) = 8
    [pid 15674] exit_group(0) = ?
    [pid 15674] +++ exited with 0 +++
    close(4) = 0
    I can pastebin longer strace output if that'll help in any way.
     
    ugenk likes this.
  13. ugenk

    ugenk New Member

    Joined:
    Jan 10, 2016
    Messages:
    6
    Likes Received:
    0
    problem definitely is not related to storage engine - it is in lxc-freeze.
     
  14. ugenk

    ugenk New Member

    Joined:
    Jan 10, 2016
    Messages:
    6
    Likes Received:
    0
    btw,
    root@proxmox2:~# cat /etc/vzdump.conf | grep -v ^#
    tmpdir: /tmp
    dumpdir: /tmp
     
  15. deranjer

    deranjer Member

    Joined:
    Dec 28, 2010
    Messages:
    50
    Likes Received:
    1
    Same issue here, with NFS storage for backups only, the LXC containers run on an LVM striped disk array.
     
  16. windinternet

    windinternet Member

    Joined:
    Oct 8, 2015
    Messages:
    159
    Likes Received:
    7
    Nominally the freeze should be near instantaneous but in your cases it seems to stretch out indefinitely. The lxc-freeze command does not return until the state becomes FROZEN instead of FREEZING.

    Are there processes in uninterruptible state before the freeze in the container? Any stat columns with a D?
    Code:
    ps -xf -ostat,comm,pid,ppid
    
     
  17. cFire

    cFire New Member

    Joined:
    Jan 10, 2016
    Messages:
    4
    Likes Received:
    2
    You may be on to something there. I ran the backup jobs with -mode stop once and the suspend and snapshot options seem to work again after that. (So far at least.)
    I'll keep an eye on it and see if next time it fails and processes are in the D-state
     
  18. windinternet

    windinternet Member

    Joined:
    Oct 8, 2015
    Messages:
    159
    Likes Received:
    7
    Then you will have to look at the process list from the host console, because any process entering the containers cgroup would "freeze" immediately.

    I looked at the source code of lxc and the lxc-freeze program just loops and checks once per second if the entire group has become FROZEN. If you kill the lxc-freeze command, the backup should continue. Proxmox should kill it too after a timeout, because if the D state processes don't wake after a while, they likely won't give the rsync any trouble.
     
  19. cFire

    cFire New Member

    Joined:
    Jan 10, 2016
    Messages:
    4
    Likes Received:
    2
    The checking once a second looks to be consistent with the strace output I saw. However proxmox definitely does not kill it in this case within any reasonable timeout. I've let the backup process run for several hours on a few occasions, without making any progress past the container freezing.
     
    deranjer likes this.
  20. windinternet

    windinternet Member

    Joined:
    Oct 8, 2015
    Messages:
    159
    Likes Received:
    7
    This is a different issue. Please start a new thread for it.
     
  1. This site uses cookies to help personalise content, tailor your experience and to keep you logged in if you register.
    By continuing to use this site, you are consenting to our use of cookies.
    Dismiss Notice