[SOLVED] LXC Backup randomly hangs at suspend

I have this problem too. I normally back up to NFS, but because of this problem I added a local disk and ran the backups there. The first backup seemed OK, but on local storage the backup also hangs at the suspend command.

Is this solved yet?
 
@iMer Please attach the full gdb output and the full output of manu's ps command to this thread. Thanks in advance!

All of our servers are in production use, so these lockups during vzdump backups are disasters. For most of our LXC-based VMs I have been forced to switch backup methods. Across all 3 of our Proxmox VE servers I have only one LXC VM left using vzdump for backups, and that's only because I haven't found any other method for creating a snapshot of the 400GB B+ data tree with billions of very small files (without taking the server offline).

So you will have to look for test data elsewhere…
 
@fabian: The fix I tried (just checking whether the /proc/$PID dir exists) seems to have worked great for me so far.
The other containers I set up haven't hung so far, but without production data it's kind of hard to give the cronjob something proper to work on.
I'll try to sort something out next week when I'm less busy.
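
For reference, the kind of check described above can be sketched in a few lines of Python. This is only an illustration of testing whether a /proc/$PID directory exists, not the actual change that was tried; the function name `pid_exists` and the `proc_root` parameter are made up for this example.

```python
import os

def pid_exists(pid, proc_root="/proc"):
    """Return True if <proc_root>/<pid> is a directory.

    On Linux, every live process has a directory under /proc named after
    its PID, so a missing directory means the process is gone.
    proc_root is a hypothetical parameter, added only so the check can be
    pointed at a different directory when testing.
    """
    return os.path.isdir(os.path.join(proc_root, str(pid)))
```

On a Linux host with /proc mounted, `pid_exists(os.getpid())` should return True for the running interpreter itself.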
 
Could you try the updated lxcfs packages in pve-test ( http://download.proxmox.com/debian/dists/jessie/pvetest/binary-amd64/lxcfs_2.0.0-pve1_amd64.deb and http://download.proxmox.com/debian/dists/jessie/pvetest/binary-amd64/lxcfs-dbg_2.0.0-pve1_amd64.deb ). This is a new upstream version which does not fix the issue completely (yet), but should move it from "occurs rarely" to "occurs almost never" territory. A complete fix is in the works and will hopefully follow soon.

Note that you need to stop your affected containers after updating before the new lxcfs binary is used, and should probably try this on non-production systems first (I did not encounter issues so far, but better be safe than sorry!)
 
I have tried your new lxcfs package on all 3 of our Proxmox VE servers. I followed your advice and first tested whether everything worked on one of the servers. When I didn't see any problems, I installed the package on the other two Proxmox VE servers.

The servers have now been running with the new lxcfs package for a few days, and I have re-enabled scheduled vzdumps for all of the LXC-based VMs. So far I haven't seen any lockups… Before the lxcfs package update, the VMs locked up about one third of the time.

So I would like to say thanks… From my experience, it looks like your changes are in the correct area.
 
Sounds good! Please keep us posted if you experience any issues with the updated packages.
 
Hey, just wanted to check in and see if there is a timeframe for the lxcfs update/the next stable release. We're running PVE 3.4 in production, and when starting the upgrade process I set up a test server on 4.1 and ran into this dealbreaker issue immediately (within the first 24 hours).

Since the server in question is just for testing the upgrade, I can use pvetest for now, but I would love to see it in stable before end of support for 3.4 in April!
 
The more (positive) feedback we get, the sooner it will move to the non-test repositories. Did the package from pve-test fix the issue for you?
 
Since a few days after the last updates, I have been running backups every night without any problem: lxc, on NFS, LZO, snapshot.
Peter
 
So far so good, backups ran last night without a problem on:
Code:
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve1
Using an NFS target, GZIP, suspend mode. I'm going to migrate a few more containers over today and see what happens.
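
When comparing hosts, a version listing like the one above can be turned into a lookup table. The following is a rough sketch that assumes simple `name: version` lines as shown in this post; `parse_versions` is a hypothetical helper, not part of any Proxmox tool.

```python
def parse_versions(text):
    """Parse 'name: version' lines into a dict.

    Assumes one package per line, split at the first colon; lines without
    a colon or without a version are skipped.
    """
    versions = {}
    for line in text.splitlines():
        name, sep, version = line.partition(":")
        if sep and version.strip():
            versions[name.strip()] = version.strip()
    return versions

# The two lines reported in this post.
listing = "lxc-pve: 1.1.5-7\nlxcfs: 2.0.0-pve1\n"
installed = parse_versions(listing)
has_new_lxcfs = installed.get("lxcfs") == "2.0.0-pve1"
```

Here `has_new_lxcfs` only says whether the listing names exactly the lxcfs version discussed in this thread; it does not compare version ordering.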
 
We ran 12 LXC backups in snapshot mode after the upgrade and a reboot.

10 worked the first time (a big improvement).

2 backups failed, on different hosts.

Here is the error output:
Code:
2228: Feb 25 22:26:30 INFO: Starting Backup of VM 2228 (lxc)
2228: Feb 25 22:26:30 INFO: status = running
2228: Feb 25 22:26:30 INFO: found old vzdump snapshot (force removal)
2228: Feb 25 22:26:30 ERROR: Backup of VM 2228 failed - Can't delete snapshot: 2228 vzdump zfs error: could not find any snapshots to destroy; check snapshot names.
Code:
INFO: Starting Backup of VM 7597 (lxc)
INFO: status = running
INFO: found old vzdump snapshot (force removal)
ERROR: Backup of VM 7597 failed - Can't delete snapshot: 7597 vzdump zfs error: could not find any snapshots to destroy; check snapshot names.

In both cases rerunning the backup worked.
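
To find which containers need a rerun in a longer log, the failed VMIDs can be pulled out with a small script. This is a sketch that assumes the "ERROR: Backup of VM <id> failed - <reason>" wording seen in the output above; `failed_backups` is a hypothetical name.

```python
import re

# Matches the error line format shown in the vzdump output above.
FAILED = re.compile(r"ERROR: Backup of VM (\d+) failed - (.*)")

def failed_backups(log_text):
    """Return a list of (vmid, reason) tuples for every failed backup."""
    return [(int(m.group(1)), m.group(2)) for m in FAILED.finditer(log_text)]
```

Feeding it the two logs above would yield one entry each for VM 2228 and VM 7597, which are the ones that had to be rerun.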


Today there were no backup errors:
Code:
  VMID  STATUS  TIME  SIZE  FILENAME
  2100  ok  00:00:28  298MB  /bkup2/dump/vzdump-lxc-2100-2016_02_27-16_32_02.tar.lzo
  2214  ok  00:00:52  1.03GB  /bkup2/dump/vzdump-lxc-2214-2016_02_27-16_32_30.tar.lzo
  2217  ok  00:04:20  3.17GB  /bkup2/dump/vzdump-lxc-2217-2016_02_27-16_33_22.tar.lzo
  2219  ok  00:01:57  959MB  /bkup2/dump/vzdump-lxc-2219-2016_02_27-16_37_42.tar.lzo
  2227  ok  00:03:34  2.62GB  /bkup2/dump/vzdump-lxc-2227-2016_02_27-16_39_39.tar.lzo
  2228  ok  00:01:41  960MB  /bkup2/dump/vzdump-lxc-2228-2016_02_27-16_43_13.tar.lzo
  2249  ok  00:01:54  985MB  /bkup2/dump/vzdump-lxc-2249-2016_02_27-16_44_54.tar.lzo
 
Those two errors are because the config referenced a snapshot which does not actually exist on the storage level (probably an old failed backup run where you killed lxc-freeze?). This is/was expected behaviour and fixes itself on subsequent backup runs, like you said. We improved the error handling in that area already, so this hopefully does not happen again.

Thanks for the feedback!
 
Well, I installed those 2 patches on my 3 hosts, and since that day I can run backups on NFS storage like a charm.
BUT it has now happened for the third time that 1 host can't quit processes and starts more and more until it is stuck.
I can't stop containers or migrate them; I just get a timeout. I can't even reboot via the shell; I have to do it via IPMI.

As I said, this has been happening since that patch and only on 1 host. To investigate, I moved all containers except 2 to another host.

Syslog says just before reboot:
Feb 29 10:32:20 vmbase3 pvedaemon[6326]: Use of uninitialized value in concatenation (.) or string at /usr/share/perl5/PVE/Tools.pm line 827.
Feb 29 10:32:20 vmbase3 pvedaemon[6326]: Use of uninitialized value in concatenation (.) or string at /usr/share/perl5/PVE/Tools.pm line 827.
Feb 29 10:32:20 vmbase3 pvedaemon[6326]: Use of uninitialized value in concatenation (.) or string at /usr/share/perl5/PVE/Tools.pm line 827.
Feb 29 10:32:20 vmbase3 pvedaemon[3226]: Argument "\n" isn't numeric in int at /usr/share/perl5/PVE/Tools.pm line 840, <GEN1142> line 1.
Feb 29 10:32:20 vmbase3 pvedaemon[3226]: Argument "\n" isn't numeric in int at /usr/share/perl5/PVE/Tools.pm line 841, <GEN1142> line 2.
Feb 29 10:32:20 vmbase3 pvedaemon[3226]: Argument "\n" isn't numeric in int at /usr/share/perl5/PVE/Tools.pm line 842, <GEN1142> line 3.
 
The patch did not change anything in our own codebase, only in lxcfs. Are the other packages up to date? Could you post the output of "pveversion -v"?

What exactly do you mean by "1 host cant quit processes and start more and more until it stuck"? vzdump processes? lxc processes?
 
Should be fixed with lxcfs-2.0.0-pve1
 
yes, available in both pve-no-subscription and pve-enterprise.
 
