[SOLVED] ceph lxc vzdump backups fail

Hi, I'm trying to run vzdump snapshot backups of my CentOS 7 LXC containers to a local ZFS RAID 1 storage. The LXC containers run on Ceph BlueStore.
BTW: I've also tried backing up to a shared NFS storage and to a local LVM storage.

The backup of some containers hangs until I stop it manually. I think only the bigger (200 GB+) containers have this problem.

One side effect is that I can't do shutdowns/reboots from within a CentOS 7 container that lives on a Proxmox host that has had this problem. The weird thing is that Ubuntu and Debian containers don't seem to have these problems.

What can I try next?

The full log:
Code:
  vzdump 137 --mode snapshot --remove 0 --node node003 --compress lzo --storage local --mailto ***@***.nl
 
  137: 2018-10-22 12:07:31 INFO: Starting Backup of VM 137 (lxc)
  137: 2018-10-22 12:07:31 INFO: status = running
  137: 2018-10-22 12:07:31 INFO: CT Name: ***
  137: 2018-10-22 12:07:31 INFO: backup mode: snapshot
  137: 2018-10-22 12:07:31 INFO: ionice priority: 7
  137: 2018-10-22 12:07:31 INFO: create storage snapshot 'vzdump'
  137: 2018-10-22 12:07:34 INFO: creating archive '/var/lib/vz/dump/vzdump-lxc-137-2018_10_22-12_07_31.tar.lzo'
  137: 2018-10-22 15:23:04 INFO: remove vzdump snapshot
  137: 2018-10-22 15:23:05 ERROR: Backup of VM 137 failed - command 'set -o pipefail && tar cpf - --totals --one-file-system -p --sparse --numeric-owner --acls --xattrs '--xattrs-include=user.*' '--xattrs-include=security.capability' '--warning=no-file-ignored' '--warning=no-xattr-write' --one-file-system '--warning=no-file-ignored' '--directory=/var/tmp/vzdumptmp2179734' ./etc/vzdump/pct.conf '--directory=/mnt/vzsnap0' --no-anchored '--exclude=lost+found' --anchored '--exclude=./tmp/?*' '--exclude=./var/tmp/?*' '--exclude=./var/run/?*.pid' ./ | lzop >/var/lib/vz/dump/vzdump-lxc-137-2018_10_22-12_07_31.tar.dat' failed: interrupted by signal

I stopped this backup after a few hours, but I've also tried letting them run for several days.
 
Is there something to see in the syslog/journal? Does the backup work in stop/suspend mode?
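For example, something like this (CT ID 137 and the time window are taken from the log above; adjust to the run that hung, and note the stop-mode test is just a sketch):

Code:
# journal entries (kernel + pvedaemon) around the time the backup hung
journalctl --since "2018-10-22 12:00" --until "2018-10-22 16:00"
# one-off test backup of the same CT in stop mode instead of snapshot mode
vzdump 137 --mode stop --remove 0 --compress lzo --storage local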
 
thx :) This is running in production, so I can't do stop/suspend backups. I'm cloning it right now to try some other things.
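For reference, the clone is just a full copy to a spare CT ID, something like this (the target ID 999 is only an example):

Code:
# full (non-linked) clone of CT 137 to a new CT 999
pct clone 137 999 --full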

syslog:
Code:
Oct 25 09:44:01 node007 pvedaemon[3776888]: command 'umount -l -d /mnt/vzsnap0/' failed: exit code 32
Oct 25 09:44:01 node007 pvedaemon[3776888]: ERROR: Backup of VM 109 failed - command 'mount -o ro,noload /dev/rbd1 /mnt/vzsnap0//' failed: interrupted by signal
Oct 25 09:44:01 node007 pvedaemon[3776888]: INFO: Backup job finished with errors
Oct 25 09:44:01 node007 pvedaemon[3776888]: job errors
Oct 25 09:44:01 node007 postfix/pickup[3831999]: 940E229C5D: uid=0 from=<root>
Oct 25 09:44:01 node007 pvedaemon[2580596]: <root@pam> end task UPID:node007:0039A178:04116EE3:5BD16C2E:vzdump::root@pam: job errors

journal:
Code:
Oct 25 10:02:08 node007 pvedaemon[12704]: INFO: starting new backup job: vzdump 109 --storage local --mailto ***@***.nl --remove 0 --compress 0 --node node007 --mode snapshot
Oct 25 10:02:08 node007 pvedaemon[12704]: INFO: Starting Backup of VM 109 (lxc)
Oct 25 10:02:09 node007 kernel: rbd: rbd1: capacity 806380109824 features 0x1
Oct 25 10:02:09 node007 kernel: EXT4-fs (rbd1): write access unavailable, skipping orphan cleanup
Oct 25 10:02:09 node007 kernel: EXT4-fs (rbd1): mounted filesystem without journal. Opts: noload

And what about this, also from the journal?:
Code:
Oct 25 10:04:43 node007 kernel: EXT4-fs error (device rbd1): ext4_lookup:1575: inode #1048713: comm tar: deleted inode referenced: 1058321
Oct 25 10:04:43 node007 kernel:
                                Assertion failure in rbd_queue_workfn() at line 4035:

                                        rbd_assert(op_type == OBJ_OP_READ || rbd_dev->spec->snap_id == CEPH_NOSNAP);
Oct 25 10:04:43 node007 kernel: ------------[ cut here ]------------
Oct 25 10:04:43 node007 kernel: kernel BUG at drivers/block/rbd.c:4035!
Oct 25 10:04:43 node007 kernel: invalid opcode: 0000 [#1] SMP PTI
Oct 25 10:04:43 node007 kernel: Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 nf_nat veth rbd libceph ebtable_filter ebtables ip6t_REJECT nf_reject_ipv6 nf_conntra
Oct 25 10:04:43 node007 kernel:  cryptd soundcore joydev input_leds intel_cstate intel_rapl_perf pcspkr mei_me shpchp mei lpc_ich ioatdma wmi mac_hid ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad vhost
Oct 25 10:04:43 node007 kernel: CPU: 14 PID: 211 Comm: kworker/14:1 Tainted: P           O     4.15.18-5-pve #1
Oct 25 10:04:43 node007 kernel: Hardware name: Supermicro Super Server/X10SRW-F, BIOS 2.0a 08/02/2016
Oct 25 10:04:43 node007 kernel: Workqueue: rbd rbd_queue_workfn [rbd]
Oct 25 10:04:43 node007 kernel: RIP: 0010:rbd_queue_workfn+0x462/0x4f0 [rbd]
Oct 25 10:04:43 node007 kernel: RSP: 0018:ffff9be2871c3e18 EFLAGS: 00010286
Oct 25 10:04:43 node007 kernel: RAX: 0000000000000086 RBX: ffff90e85b8c9800 RCX: 0000000000000006
Oct 25 10:04:43 node007 kernel: RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff90e87f796490
Oct 25 10:04:43 node007 kernel: RBP: ffff9be2871c3e60 R08: 0000000000000000 R09: 0000000000000574
Oct 25 10:04:43 node007 kernel: R10: 000000000000014d R11: 00000000ffffffff R12: ffff90e87826efc0
Oct 25 10:04:43 node007 kernel: R13: ffff90e6ca156c00 R14: 0000000000000000 R15: 0000000000001000
Oct 25 10:04:43 node007 kernel: FS:  0000000000000000(0000) GS:ffff90e87f780000(0000) knlGS:0000000000000000
Oct 25 10:04:43 node007 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 25 10:04:43 node007 kernel: CR2: 000055ccc2e94008 CR3: 00000006e780a005 CR4: 00000000003606e0
Oct 25 10:04:43 node007 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 25 10:04:43 node007 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Oct 25 10:04:43 node007 kernel: Call Trace:
Oct 25 10:04:43 node007 kernel:  ? __schedule+0x3e8/0x870
Oct 25 10:04:43 node007 kernel:  process_one_work+0x1e0/0x400
Oct 25 10:04:43 node007 kernel:  worker_thread+0x4b/0x420
Oct 25 10:04:43 node007 kernel:  kthread+0x105/0x140
Oct 25 10:04:43 node007 kernel:  ? process_one_work+0x400/0x400
Oct 25 10:04:43 node007 kernel:  ? kthread_create_worker_on_cpu+0x70/0x70
Oct 25 10:04:43 node007 kernel:  ret_from_fork+0x35/0x40
Oct 25 10:04:43 node007 kernel: Code: 00 48 83 78 20 fe 0f 84 6a fc ff ff 48 c7 c1 a8 d8 fd c0 ba c3 0f 00 00 48 c7 c6 b0 ec fd c0 48 c7 c7 90 cd fd c0 e8 6e 66 31 d4 <0f> 0b 48 8b 75 d0 4d 89 d0 44 89 f1 4c 89 fa 48 8
Oct 25 10:04:43 node007 kernel: RIP: rbd_queue_workfn+0x462/0x4f0 [rbd] RSP: ffff9be2871c3e18
Oct 25 10:04:43 node007 kernel: ---[ end trace 045d1b7cd55a306d ]---
 
Some new info: I located some more CTs that have this problem. They are all CentOS, but were installed at different times and run all kinds of software. Some are big, some are small. So far I've tried a stop-mode backup on two of them, and it worked on both CTs.

I ran pct fsck on them (roughly as shown below), but it found no errors. Any more suggestions?
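Since pct fsck won't run on a running CT, that was roughly this per container (CT 137 as the example):

Code:
pct stop 137   # fsck needs the CT stopped
pct fsck 137   # check the container's root filesystem
pct start 137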
 
And: the .dat file stops growing on all of the failed CTs:

Code:
root@node004:~# du -s /var/lib/vz/dump/vzdump-lxc-138-2018_10_25-16_23_13.tar.dat
73201   /var/lib/vz/dump/vzdump-lxc-138-2018_10_25-16_23_13.tar.dat

It's always around (maybe exactly) the same size when it stops.
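To pin down exactly when it stops growing, the size can be polled while the backup runs, e.g.:

Code:
# re-check the archive size every 60 seconds during the backup
watch -n 60 du -s /var/lib/vz/dump/vzdump-lxc-138-2018_10_25-16_23_13.tar.dat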