[SOLVED] LXC Backup randomly hangs at suspend

Was this ever solved, by chance? I've just upgraded to Proxmox 4.1-2 and my backups freeze my LXC containers. The backups were configured for snapshot mode at first, but after the first freeze I changed them to suspend, with the same issue: "INFO: suspend vm", and it hangs there. It never times out and never does anything until I manually shut down the container, which causes the backup to fail.
 
@whyitsderp

No, it is not solved yet. It seems to happen mostly on systems with NFS. Does your system use NFS mounts also?

Snapshot and suspend both use lxc-freeze. I suppose you could temporarily divert the real binary and replace /usr/bin/lxc-freeze with a wrapper that enforces a timeout, like this:

Code:
#!/bin/bash
# Give the real lxc-freeze 5 seconds; timeout exits with status 124 if it hangs.
exec timeout 5 /usr/bin/lxc-freeze.real "$@"
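
On a Debian-based system like Proxmox VE, dpkg-divert is one way to put such a wrapper in place without a package upgrade overwriting it; a minimal sketch (the wrapper filename is hypothetical):

Code:
# Move the real binary aside and record the diversion with dpkg:
dpkg-divert --divert /usr/bin/lxc-freeze.real --rename /usr/bin/lxc-freeze
# Install the wrapper script from above:
install -m 755 lxc-freeze-wrapper.sh /usr/bin/lxc-freeze
# To undo later: rm /usr/bin/lxc-freeze, then
# dpkg-divert --rename --remove /usr/bin/lxc-freeze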

If the container isn't frozen within 5 seconds, it probably won't be frozen even after an hour or more, and the processes the freezer keeps waiting for are not likely to become freezable during the tiny snapshot interval.
 
Hello
We've had the same issue for a while. We were using NFS as a backup target.

To accomplish LXC backups we have used stop mode for about a month.

Later due to cluster issues I eliminated all NFS storage.

After reading that the LXC backup issue may be related to NFS, I tried a suspend-mode backup. The job is still running after 20 minutes, and this is a small LXC.

The test is on an all-ZFS storage system.
Backup is to a local ZFS directory.

So the issue here has nothing to do with NFS.

Code:
INFO: starting new backup job: vzdump 3122 --compress lzo --node sys3 --storage dump-save --mode snapshot --remove 0
INFO: Starting Backup of VM 3122 (lxc)
INFO: status = running
INFO: backup mode: snapshot
INFO: ionice priority: 8
INFO: create storage snapshot snapshot

Code:
sys3  ~ # pveversion -v
proxmox-ve: 4.1-32 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-4 (running version: 4.1-4/ccba54b0)
pve-kernel-4.2.6-1-pve: 4.2.6-32
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.2.3-1-pve: 4.2.3-18
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-30
qemu-server: 4.0-44
pve-firmware: 1.1-7
libpve-common-perl: 4.0-43
libpve-access-control: 4.0-11
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-20
pve-container: 1.0-37
pve-firewall: 2.0-15
pve-ha-manager: 1.0-17
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-5
lxcfs: 0.13-pve3
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve6~jessie
 
@RobFantini

So you see an lxc-freeze stuck in the process list? Can you check the container's process list for processes in a different state? All frozen processes should be listed with state "D".


Code:
ps -exf -ostat,pid,comm

Also:
The /sys/fs/cgroup/freezer/lxc/<containerid>/tasks file should list all the frozen tasks, while cgroup.procs should list all processes that are meant to freeze.
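
For example, something along these lines (a sketch only, using container ID 3122 from the log above) would show any difference between the two lists and the state of each process:

Code:
cd /sys/fs/cgroup/freezer/lxc/3122
# IDs that appear in only one of the two files:
comm -3 <(sort tasks) <(sort cgroup.procs)
# State, PID and command of every process the cgroup is supposed to freeze:
ps -o stat=,pid=,comm= -p "$(paste -sd, cgroup.procs)"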
 
@RobFantini

So you see an lxc-freeze stuck in the process list? Can you check the container's process list for processes in a different state? All frozen processes should be listed with state "D".


Code:
ps -exf -ostat,pid,comm

Also:
The /sys/fs/cgroup/freezer/lxc/<containerid>/tasks file should list all the frozen tasks, while cgroup.procs should list all processes that are meant to freeze.

Code:
ps -exf -ostat,pid,comm
...
Ss  18273 task UPID:sys3:
S  18276  \_ lxc-freeze

I'm not sure if this helps; I do not see 18273 or 18276 in the files at /sys/fs/cgroup/freezer/lxc/3122:

Code:
grep 18273 /sys/fs/cgroup/freezer/lxc/3122/*
grep 18276 /sys/fs/cgroup/freezer/lxc/3122/*

Neither returns anything.
 
But where are the other processes, the ones that run inside the container? There should at the very least be an init process and some gettys, etc.
 
But where are the other processes, the ones that run inside the container? There should at the very least be an init process and some gettys, etc.

Code:
sys3  /sys/fs/cgroup/freezer/lxc/3122 # cat cgroup.procs
8942
9811
9937
10005
10075
10134
10143
10197
10546
10843
10853
10895
10896
15764
16601
21128
22915
24399
24400
24401
24402
24403
26951

I'm not sure that is what you are looking for.
 
What I want to know is whether there is a difference between the ID list in the tasks file and the one in the cgroup.procs file, and whether the output of the ps -exf -ostat,pid,comm command shows any process not in "D" state.

All of this is meant to find out which process is stopping the cgroup from freezing, and to see whether this is some strange bug in freezer cgroups that hasn't been reported until now. I suspect it will be a process that was in uninterruptible state beforehand, maybe waiting on some TCP buffer that only exchanges single-packet messages every 24 hours or something like that.

Actually, I think it might be even better to look at the process states before you do the freeze, because at this point of the ongoing freeze it may be impossible to identify the culprit from userspace.
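
A quick way to capture that (again a sketch, assuming container ID 3122; run it just before the backup starts) might be:

Code:
# Record each task's pre-freeze state; anything already in "D"
# (uninterruptible sleep) is the likely culprit.
for pid in $(cat /sys/fs/cgroup/freezer/lxc/3122/tasks); do
    ps -o stat=,pid=,comm=,wchan= -p "$pid"
done > /tmp/pre-freeze-states.txt
grep '^D' /tmp/pre-freeze-states.txt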
 
If you want me to run tests then please supply specific commands to run before and during the attempted backup. I'm in the middle of other things here, so I cannot do much more than that for now.

So I'll stop the backup and unlock the VM.

When I stopped it on the PVE web page:
Code:
INFO: starting new backup job: vzdump 3122 --compress lzo --node sys3 --storage dump-save --mode snapshot --remove 0
INFO: Starting Backup of VM 3122 (lxc)
INFO: status = running
INFO: backup mode: snapshot
INFO: ionice priority: 8
INFO: create storage snapshot snapshot
ERROR: Backup of VM 3122 failed - VM is locked (snapshot)
INFO: Backup job finished with errors
TASK ERROR: job errors
 
Yesterday, 15/1/2016, I updated the node and rebooted it. The backup to NFS was perfect. Today the scheduled backup failed; the VMs were OK.
The first CT hung on "suspend vm". Still the same pattern: one backup is perfect, then the next backups have this issue; after a reboot, one good backup again, etc.
So the problem still persists.
Why do we see no reaction to this issue from the Proxmox team? If it is hard to find and you are working on it, tell us.

Peter
 
I have exactly the same problem.
My workaround, until the Proxmox team can fix this, was to schedule every single CT in its own backup job, in 15-minute steps.
But this morning, after I checked the backup logs, my plan was destroyed.
The very first backup job got stuck at "suspend vm", and of course the other jobs couldn't get the global lock and crashed too.
 
This is a serious problem for me, and I would like to know whether it has the attention of the Proxmox team.
So, dear Dietmar, give us one of your famous one-word answers to my question: is the Proxmox team working on this problem?

Peter
 
I have exactly the same problem.
My workaround, until the Proxmox team can fix this, was to schedule every single CT in its own backup job, in 15-minute steps.
But this morning, after I checked the backup logs, my plan was destroyed.
The very first backup job got stuck at "suspend vm", and of course the other jobs couldn't get the global lock and crashed too.

This is what we've concluded to get backups done:

as of now:

lxc : 'stop' mode backups are working.

kvm : 'suspend' works here.

I've set up two backup jobs per node: stop for LXC and suspend for KVM.
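
For reference, such jobs end up as ordinary cron lines in /etc/pve/vzdump.cron; a sketch with made-up schedules and VMIDs (the storage name dump-save is taken from earlier in the thread):

Code:
# stop-mode job for the containers:
0 2 * * * root vzdump 3122 4444 --mode stop --compress lzo --storage dump-save --quiet 1
# suspend-mode job for the KVM guests:
0 4 * * * root vzdump 100 101 --mode suspend --compress lzo --storage dump-save --quiet 1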
 
Same here, backing up to an ordinary folder just for testing.
Code:
101: Jan 18 18:00:18 INFO: first sync finished (17 seconds)
101: Jan 18 18:00:18 INFO: suspend vm
101: Jan 18 20:34:41 INFO: lxc-freeze: freezer.c: do_freeze_thaw: 64 Failed to get new freezer state for /var/lib/lxc:101
101: Jan 18 20:34:41 INFO: lxc-freeze: lxc_freeze.c: main: 84 Failed to freeze /var/lib/lxc:101
101: Jan 18 20:34:41 ERROR: Backup of VM 101 failed - command 'lxc-freeze -n 101' failed: exit code 1
 
This is what we've concluded to get backups done:

as of now:

lxc : 'stop' mode backups are working.

kvm : 'suspend' works here.

I've set up two backup jobs per node: stop for LXC and suspend for KVM.

I tried this too, but I once had the situation that vzdump didn't start the LXC after it backed it up.
 
I tried this too, but I once had the situation that vzdump didn't start the LXC after it backed it up.

We have had the same issue. The backup completes, which is a good thing, but then randomly some LXCs do not start. From the PVE web page:
Code:
INFO: restarting vm
INFO: lxc-start: lxc_start.c: main: 344 The container failed to start.
command 'lxc-start -n 4444' failed: exit code 1
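
When that happens, retrying the start by hand with debug logging (a sketch; -l and -o are the standard lxc-start logging options in lxc 1.x) may show why it fails:

Code:
# Start the container again with verbose logging to a file:
lxc-start -n 4444 -l DEBUG -o /tmp/lxc-4444-start.log
# Inspect the log for the actual error:
less /tmp/lxc-4444-start.log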


The backup email reports no issues:
Code:
4444: Jan 19 02:03:59 INFO: Starting Backup of VM 4444 (lxc)
4444: Jan 19 02:03:59 INFO: status = running
4444: Jan 19 02:03:59 INFO: backup mode: stop
4444: Jan 19 02:03:59 INFO: ionice priority: 8
4444: Jan 19 02:03:59 INFO: stopping vm
4444: Jan 19 02:04:11 INFO: creating archive '/bkup/dump/vzdump-lxc-4444-2016_01_19-02_03_59.tar.lzo'
4444: Jan 19 02:05:19 INFO: Total bytes written: 7220039680 (6.8GiB, 101MiB/s)
4444: Jan 19 02:05:19 INFO: archive file size: 2.02GB
4444: Jan 19 02:05:19 INFO: delete old backup '/bkup/dump/vzdump-lxc-4444-2016_01_09-02_23_51.tar.lzo'
4444: Jan 19 02:05:19 INFO: delete old backup '/bkup/dump/vzdump-lxc-4444-2016_01_16-02_19_46.tar.lzo'
4444: Jan 19 02:05:19 INFO: restarting vm
4444: Jan 19 02:05:25 INFO: Finished Backup of VM 4444 (00:01:26)
 
So in the end I can choose between three things: 1) no backups, 2) backup mode "stop" with the risk of VMs not starting, 3) backup mode "suspend" with the risk of no backup.
I feel like I'm on "Herzblatt"...
 
