[SOLVED] ceph lxc vzdump backups fail

Hi, I'm trying to run vzdump snapshot backups of my CentOS 7 LXC containers to a local ZFS RAID 1 storage. The LXC containers run on Ceph BlueStore.
BTW: I've also tried backing up to a shared NFS storage and a local LVM storage.

The backup of some containers hangs until I manually stop it. I think only the bigger (200GB+) containers have this problem.

One side effect is that I can't shut down or reboot from within a CentOS 7 container that lives on a Proxmox host that has had this problem. The weird thing is that Ubuntu and Debian containers don't seem to have these problems.

What can I try next?

The full log:
Code:
  vzdump 137 --mode snapshot --remove 0 --node node003 --compress lzo --storage local --mailto ***@***.nl
 
  137: 2018-10-22 12:07:31 INFO: Starting Backup of VM 137 (lxc)
  137: 2018-10-22 12:07:31 INFO: status = running
  137: 2018-10-22 12:07:31 INFO: CT Name: ***
  137: 2018-10-22 12:07:31 INFO: backup mode: snapshot
  137: 2018-10-22 12:07:31 INFO: ionice priority: 7
  137: 2018-10-22 12:07:31 INFO: create storage snapshot 'vzdump'
  137: 2018-10-22 12:07:34 INFO: creating archive '/var/lib/vz/dump/vzdump-lxc-137-2018_10_22-12_07_31.tar.lzo'
  137: 2018-10-22 15:23:04 INFO: remove vzdump snapshot
  137: 2018-10-22 15:23:05 ERROR: Backup of VM 137 failed - command 'set -o pipefail && tar cpf - --totals --one-file-system -p --sparse --numeric-owner --acls --xattrs '--xattrs-include=user.*' '--xattrs-include=security.capability' '--warning=no-file-ignored' '--warning=no-xattr-write' --one-file-system '--warning=no-file-ignored' '--directory=/var/tmp/vzdumptmp2179734' ./etc/vzdump/pct.conf '--directory=/mnt/vzsnap0' --no-anchored '--exclude=lost+found' --anchored '--exclude=./tmp/?*' '--exclude=./var/tmp/?*' '--exclude=./var/run/?*.pid' ./ | lzop >/var/lib/vz/dump/vzdump-lxc-137-2018_10_22-12_07_31.tar.dat' failed: interrupted by signal

I stopped this backup after a few hours, but I've also tried letting backups run for several days.
 
Is there something to see in the syslog/journal? Does the backup work in stop/suspend mode?
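For example (just a sketch, with the VMID and time range taken from the log above, adjust to your run), something like this would show kernel messages from the window where the backup hung, and run a one-off stop-mode backup for comparison:
Code:
  # kernel messages around the time the archive step stalled
  journalctl -k --since "2018-10-22 12:00" --until "2018-10-22 16:00"
  # one-off backup of the same CT in stop mode
  vzdump 137 --mode stop --remove 0 --compress lzo --storage local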
 
thx :) This is running in production, so I can't do a stop/suspend backup. I'm trying to clone it right now to try some other things.
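In case it helps someone else, a full clone to another storage can be done roughly like this (the new VMID 9137 and the target storage name are just examples; depending on the PVE version the CT may need to be stopped or snapshotted first):
Code:
  pct clone 137 9137 --full --storage local-zfs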

syslog:
Code:
Oct 25 09:44:01 node007 pvedaemon[3776888]: command 'umount -l -d /mnt/vzsnap0/' failed: exit code 32
Oct 25 09:44:01 node007 pvedaemon[3776888]: ERROR: Backup of VM 109 failed - command 'mount -o ro,noload /dev/rbd1 /mnt/vzsnap0//' failed: interrupted by signal
Oct 25 09:44:01 node007 pvedaemon[3776888]: INFO: Backup job finished with errors
Oct 25 09:44:01 node007 pvedaemon[3776888]: job errors
Oct 25 09:44:01 node007 postfix/pickup[3831999]: 940E229C5D: uid=0 from=<root>
Oct 25 09:44:01 node007 pvedaemon[2580596]: <root@pam> end task UPID:node007:0039A178:04116EE3:5BD16C2E:vzdump::root@pam: job errors

journal:
Code:
Oct 25 10:02:08 node007 pvedaemon[12704]: INFO: starting new backup job: vzdump 109 --storage local --mailto ***@***.nl --remove 0 --compress 0 --node node007 --mode snapshot
Oct 25 10:02:08 node007 pvedaemon[12704]: INFO: Starting Backup of VM 109 (lxc)
Oct 25 10:02:09 node007 kernel: rbd: rbd1: capacity 806380109824 features 0x1
Oct 25 10:02:09 node007 kernel: EXT4-fs (rbd1): write access unavailable, skipping orphan cleanup
Oct 25 10:02:09 node007 kernel: EXT4-fs (rbd1): mounted filesystem without journal. Opts: noload

and this?:
Code:
Oct 25 10:04:43 node007 kernel: EXT4-fs error (device rbd1): ext4_lookup:1575: inode #1048713: comm tar: deleted inode referenced: 1058321
Oct 25 10:04:43 node007 kernel:
                                Assertion failure in rbd_queue_workfn() at line 4035:

                                        rbd_assert(op_type == OBJ_OP_READ || rbd_dev->spec->snap_id == CEPH_NOSNAP);
Oct 25 10:04:43 node007 kernel: ------------[ cut here ]------------
Oct 25 10:04:43 node007 kernel: kernel BUG at drivers/block/rbd.c:4035!
Oct 25 10:04:43 node007 kernel: invalid opcode: 0000 [#1] SMP PTI
Oct 25 10:04:43 node007 kernel: Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 nf_nat veth rbd libceph ebtable_filter ebtables ip6t_REJECT nf_reject_ipv6 nf_conntra
Oct 25 10:04:43 node007 kernel:  cryptd soundcore joydev input_leds intel_cstate intel_rapl_perf pcspkr mei_me shpchp mei lpc_ich ioatdma wmi mac_hid ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad vhost
Oct 25 10:04:43 node007 kernel: CPU: 14 PID: 211 Comm: kworker/14:1 Tainted: P           O     4.15.18-5-pve #1
Oct 25 10:04:43 node007 kernel: Hardware name: Supermicro Super Server/X10SRW-F, BIOS 2.0a 08/02/2016
Oct 25 10:04:43 node007 kernel: Workqueue: rbd rbd_queue_workfn [rbd]
Oct 25 10:04:43 node007 kernel: RIP: 0010:rbd_queue_workfn+0x462/0x4f0 [rbd]
Oct 25 10:04:43 node007 kernel: RSP: 0018:ffff9be2871c3e18 EFLAGS: 00010286
Oct 25 10:04:43 node007 kernel: RAX: 0000000000000086 RBX: ffff90e85b8c9800 RCX: 0000000000000006
Oct 25 10:04:43 node007 kernel: RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff90e87f796490
Oct 25 10:04:43 node007 kernel: RBP: ffff9be2871c3e60 R08: 0000000000000000 R09: 0000000000000574
Oct 25 10:04:43 node007 kernel: R10: 000000000000014d R11: 00000000ffffffff R12: ffff90e87826efc0
Oct 25 10:04:43 node007 kernel: R13: ffff90e6ca156c00 R14: 0000000000000000 R15: 0000000000001000
Oct 25 10:04:43 node007 kernel: FS:  0000000000000000(0000) GS:ffff90e87f780000(0000) knlGS:0000000000000000
Oct 25 10:04:43 node007 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 25 10:04:43 node007 kernel: CR2: 000055ccc2e94008 CR3: 00000006e780a005 CR4: 00000000003606e0
Oct 25 10:04:43 node007 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 25 10:04:43 node007 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Oct 25 10:04:43 node007 kernel: Call Trace:
Oct 25 10:04:43 node007 kernel:  ? __schedule+0x3e8/0x870
Oct 25 10:04:43 node007 kernel:  process_one_work+0x1e0/0x400
Oct 25 10:04:43 node007 kernel:  worker_thread+0x4b/0x420
Oct 25 10:04:43 node007 kernel:  kthread+0x105/0x140
Oct 25 10:04:43 node007 kernel:  ? process_one_work+0x400/0x400
Oct 25 10:04:43 node007 kernel:  ? kthread_create_worker_on_cpu+0x70/0x70
Oct 25 10:04:43 node007 kernel:  ret_from_fork+0x35/0x40
Oct 25 10:04:43 node007 kernel: Code: 00 48 83 78 20 fe 0f 84 6a fc ff ff 48 c7 c1 a8 d8 fd c0 ba c3 0f 00 00 48 c7 c6 b0 ec fd c0 48 c7 c7 90 cd fd c0 e8 6e 66 31 d4 <0f> 0b 48 8b 75 d0 4d 89 d0 44 89 f1 4c 89 fa 48 8
Oct 25 10:04:43 node007 kernel: RIP: rbd_queue_workfn+0x462/0x4f0 [rbd] RSP: ffff9be2871c3e18
Oct 25 10:04:43 node007 kernel: ---[ end trace 045d1b7cd55a306d ]---
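Judging from the assert shown above (op_type == OBJ_OP_READ || rbd_dev->spec->snap_id == CEPH_NOSNAP), the kernel BUG fires when a write request reaches an rbd device that is mapped from a snapshot, which should only ever see reads. As a sketch (the pool/image names are placeholders, not my actual ones), this is how you could check for leftover vzdump snapshot mappings after a failed run:
Code:
  # which rbd devices are currently mapped, and from which snapshot
  rbd showmapped
  # snapshots still present on the container's image
  rbd snap ls <pool>/<image>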
 
Some new info: I located some more CTs that have this problem. They are all CentOS, but were installed at different times and run all kinds of software. Some are big, some are small. I tried a stop-mode backup on two of them so far, and that worked on both CTs.

I ran pct fsck on them, but found no errors. Any more suggestions?
 
And the .tar.dat file stops growing on all of the failed CTs:

Code:
root@node004:~# du -s /var/lib/vz/dump/vzdump-lxc-138-2018_10_25-16_23_13.tar.dat
73201   /var/lib/vz/dump/vzdump-lxc-138-2018_10_25-16_23_13.tar.dat

It's always around (maybe exactly) the same size when it stops.
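If someone wants to watch this happen, this is roughly what I'm doing (the filename is from the run above; finding the right tar process by matching its command line with pgrep is just an assumption, and <pid> is whatever pgrep reports):
Code:
  # watch the archive grow (or stop growing)
  watch -n 30 du -sh /var/lib/vz/dump/vzdump-lxc-138-2018_10_25-16_23_13.tar.dat
  # find the tar process of the backup and see what it is blocked on
  pgrep -af 'tar cpf'
  cat /proc/<pid>/wchan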
 
