[SOLVED] LXC Backup randomly hangs at suspend

eislon · Feb 13, 2016

I have this problem too. I normally back up to NFS, but because of this problem I added a local disk and ran the backups to it. The first backup did seem OK, but on local storage the backup also hangs at the suspend command.

Is this solved yet?

EricM · Feb 13, 2016

Here are the logs: http://pastebin.com/6nj4gMqN
and the gdb bt: http://pastebin.com/rhsxd3DH

I hope that helps you solve this issue.

FYI: Debian 8.3 with the latest Proxmox packages.

Jacob Tranholm · Feb 13, 2016

fabian said:
@iMer Please attach the full gdb output and the full output of manu's ps command to this thread. Thanks in advance!

All of our servers are in production use, so these lockups during vzdump backups are disasters. For most of our LXC-based VMs I have been forced to switch backup methods. On all 3 of our Proxmox VE servers I only have one LXC VM left using vzdump for backups, and that's only because I haven't found any other method for creating a snapshot of the 400GB B+ data tree with billions of very small files (without taking the server offline).

So you should seek the testing data elsewhere…

iMer · Feb 13, 2016

@fabian: The fix I tried (just checking whether the /proc/$PID dir exists) seems to have worked great for me so far.
The other containers I set up haven't hung so far, but without production data it's kinda hard to give the cronjob something proper to work on.
I'll try to sort something out next week when I'm less busy.
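
For anyone wondering what a check like that can look like, here is a minimal Python sketch of testing whether a process still exists by looking for its /proc/$PID directory. It is not the actual patch, and the names `pid_exists` and `proc_root` are made up for illustration.

```python
import os

def pid_exists(pid: int, proc_root: str = "/proc") -> bool:
    """Return True if a directory named after `pid` exists under proc_root.

    This only works on systems that expose a procfs (e.g. Linux).
    """
    return os.path.isdir(os.path.join(proc_root, str(pid)))

if __name__ == "__main__":
    # On Linux the current process always has its own /proc entry.
    print(pid_exists(os.getpid()))
```

A caller would run such a check before waiting on a PID, so that a process that has already gone away is not waited on forever.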

fabian · Feb 18, 2016

Could you try the updated lxcfs packages in pve-test ( http://download.proxmox.com/debian/dists/jessie/pvetest/binary-amd64/lxcfs_2.0.0-pve1_amd64.deb and http://download.proxmox.com/debian/dists/jessie/pvetest/binary-amd64/lxcfs-dbg_2.0.0-pve1_amd64.deb ). This is a new upstream version which does not fix the issue completely (yet), but should move it from "occurs rarely" to "occurs almost never" territory. A complete fix is in the works and will hopefully follow soon.

Note that you need to stop your affected containers after updating before the new lxcfs binary is used, and should probably try this on non-production systems first (I did not encounter issues so far, but better be safe than sorry!)

Jacob Tranholm · Feb 22, 2016

I have tried your new lxcfs package on all 3 of our Proxmox VE servers. I followed your advice and first tested whether everything worked on one of the servers. When I didn't see any problems, I installed the package on the other two Proxmox VE servers.

The servers have now been running with the new lxcfs package for a few days, and I have re-enabled scheduled vzdumps for all of the LXC-based VMs. Up until now I haven't seen any lockups… Before the lxcfs package update, the VMs locked up about one third of the time.

So I would like to say thanks… And from my experience it looks like your changes are in the correct area.

fabian · Feb 23, 2016

Sounds good! Please keep us posted if you experience any issues with the updated packages.

dzunk · Feb 24, 2016

Hey, just wanted to check in and see if there is a timeframe for the lxcfs update/the next stable release - we're running PVE 3.4 in production, and in starting the process of upgrading, I set up a test server on 4.1 and ran into this dealbreaker issue immediately (within the first 24 hours).

Since the server in question is just for testing the upgrade, I can use pvetest for now. I would love to see it in stable before end of support for 3.4 in April though!

fabian · Feb 25, 2016

The more (positive) feedback we get, the sooner it will move to the non-test repositories. Did the package from pve-test fix the issue for you?

peterx · Feb 25, 2016

For a few days now, since the last updates, I have been running backups every night without any problem: LXC, on NFS, LZO, snapshot mode.
Peter

dzunk · Feb 25, 2016

So far so good, backups ran last night without a problem on:

Code:

lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve1

Using an NFS target, GZIP, suspend mode. I'm going to migrate a few more containers over today and see what happens.
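
To check the installed lxcfs version across several hosts, a listing like the one above can be parsed with a few lines of Python. This is only a sketch: it assumes the simple "name: version" line format shown in this post, and `parse_versions` is a made-up helper name, not part of any Proxmox tool.

```python
def parse_versions(text: str) -> dict:
    """Map package name to version for lines shaped like 'name: version'."""
    versions = {}
    for line in text.splitlines():
        # Split on the first colon only; the version part may contain more.
        name, sep, version = line.partition(":")
        if sep and name.strip() and version.strip():
            versions[name.strip()] = version.strip()
    return versions

# Sample input copied from the listing in this post.
sample = """\
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve1
"""
versions = parse_versions(sample)
print(versions.get("lxcfs"))  # prints 2.0.0-pve1
```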

RobFantini · Feb 27, 2016

We did 12 LXC backups in snapshot mode after the upgrade and a reboot.

10 worked the first time [big improvement].

2 backups failed, on different hosts.

Here is the error output:

Code:

2228: Feb 25 22:26:30 INFO: Starting Backup of VM 2228 (lxc)
2228: Feb 25 22:26:30 INFO: status = running
2228: Feb 25 22:26:30 INFO: found old vzdump snapshot (force removal)
2228: Feb 25 22:26:30 ERROR: Backup of VM 2228 failed - Can't delete snapshot: 2228 vzdump zfs error: could not find any snapshots to destroy; check snapshot names.

Code:

INFO: Starting Backup of VM 7597 (lxc)
INFO: status = running
INFO: found old vzdump snapshot (force removal)
ERROR: Backup of VM 7597 failed - Can't delete snapshot: 7597 vzdump zfs error: could not find any snapshots to destroy; check snapshot names.

In both cases rerunning the backup worked.

Today there were no backup errors:

Code:

  VMID  STATUS  TIME  SIZE  FILENAME
  2100  ok  00:00:28  298MB  /bkup2/dump/vzdump-lxc-2100-2016_02_27-16_32_02.tar.lzo
  2214  ok  00:00:52  1.03GB  /bkup2/dump/vzdump-lxc-2214-2016_02_27-16_32_30.tar.lzo
  2217  ok  00:04:20  3.17GB  /bkup2/dump/vzdump-lxc-2217-2016_02_27-16_33_22.tar.lzo
  2219  ok  00:01:57  959MB  /bkup2/dump/vzdump-lxc-2219-2016_02_27-16_37_42.tar.lzo
  2227  ok  00:03:34  2.62GB  /bkup2/dump/vzdump-lxc-2227-2016_02_27-16_39_39.tar.lzo
  2228  ok  00:01:41  960MB  /bkup2/dump/vzdump-lxc-2228-2016_02_27-16_43_13.tar.lzo
  2249  ok  00:01:54  985MB  /bkup2/dump/vzdump-lxc-2249-2016_02_27-16_44_54.tar.lzo

fabian · Feb 29, 2016

Those two errors are because the config referenced a snapshot which does not actually exist on the storage level (probably an old failed backup run where you killed lxc-freeze?). This is/was expected behaviour and fixes itself on subsequent backup runs, like you said. We improved the error handling in that area already, so this hopefully does not happen again.

Thanks for the feedback!
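
The mismatch described above (a snapshot named in the config that is missing on the storage) comes down to a set difference. The Python sketch below illustrates the idea only and is not how vzdump implements it. The helper `stale_snapshots` and both input lists are hypothetical, and collecting the real names from the container config and from the storage is left out.

```python
def stale_snapshots(config_snapshots, storage_snapshots):
    """Return snapshot names referenced in the config but absent on storage."""
    return sorted(set(config_snapshots) - set(storage_snapshots))

# Hypothetical data: the config still mentions 'vzdump',
# but the storage has no snapshot by that name.
config_snapshots = ["vzdump"]
storage_snapshots = []
print(stale_snapshots(config_snapshots, storage_snapshots))  # prints ['vzdump']
```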

fips · Feb 29, 2016

Well, I installed those 2 patches on my 3 hosts, and since that day I can run backups on NFS storage like a charm.
BUT for the third time already it has happened that 1 host cant quit processes and start more and more until it stuck.
I can't stop containers or migrate them, I just get a timeout. I can't even reboot via shell, I have to do it via IPMI.

As I said, it has happened since that patch and only on 1 host. To investigate, I moved all containers except 2 to another host.

Syslog says just before reboot:
Feb 29 10:32:20 vmbase3 pvedaemon[6326]: Use of uninitialized value in concatenation (.) or string at /usr/share/perl5/PVE/Tools.pm line 827.
Feb 29 10:32:20 vmbase3 pvedaemon[6326]: Use of uninitialized value in concatenation (.) or string at /usr/share/perl5/PVE/Tools.pm line 827.
Feb 29 10:32:20 vmbase3 pvedaemon[6326]: Use of uninitialized value in concatenation (.) or string at /usr/share/perl5/PVE/Tools.pm line 827.
Feb 29 10:32:20 vmbase3 pvedaemon[3226]: Argument "\n" isn't numeric in int at /usr/share/perl5/PVE/Tools.pm line 840, <GEN1142> line 1.
Feb 29 10:32:20 vmbase3 pvedaemon[3226]: Argument "\n" isn't numeric in int at /usr/share/perl5/PVE/Tools.pm line 841, <GEN1142> line 2.
Feb 29 10:32:20 vmbase3 pvedaemon[3226]: Argument "\n" isn't numeric in int at /usr/share/perl5/PVE/Tools.pm line 842, <GEN1142> line 3.

fabian · Feb 29, 2016

The patch did not change anything in our own codebase - only in lxcfs. Are the other packages up to date? Could you post the output of "pveversion -v"?

What exactly do you mean by "1 host cant quit processes and start more and more until it stuck"? vzdump processes? lxc processes?

camaran · Mar 3, 2016

Hi, is this problem fixed in the latest Proxmox version?

pa657 · Mar 3, 2016

Since the last version, for the time being, no more issues for me...

fabian · Mar 4, 2016

Should be fixed with lxcfs-2.0.0-pve1

camaran · Mar 4, 2016

And has it been released?

fabian · Mar 4, 2016

Yes, it is available in both pve-no-subscription and pve-enterprise.
