LXC Backups Hang (via NFS and CIFS)

fstrankowski

Hello everyone,

I'd like to ask for help with a problem I recently ran into. We're running 3 Proxmox clusters across 3 datacenters, and backup jobs run at night on all 3 of them, targeting both CIFS and NFS storage. From time to time the backup job for an LXC container just hangs and renders the whole host unusable (host IO delay > 5). I then have to move the remaining VMs off the node, restart the whole hypervisor, and manually unlock the container that hung (pct unlock <num>).

The problem is not related to missing permissions on the NFS or CIFS shares, because the backups usually run just fine. Every now and then it simply looks as if the drive got disconnected. Has anyone encountered a similar problem?
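For reference, the manual recovery looks roughly like this (just a sketch; the VMIDs and target node are examples, not our actual setup):

Code:
# Rough sketch of the manual recovery; VMIDs 100/101 and the
# target node name are examples.
qm migrate 100 <target-node> --online   # move remaining VMs off the node
# ... reboot the hypervisor ...
pct unlock 101                          # clear the stale backup lock on the hung container
pct status 101                          # confirm the container is usable again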

Thanks in advance.
 
Hi,
I captured this from the kernel log during a backup of an LXC container:

Code:
Sep 11 12:51:40 PX10-BW-N03 pvedaemon[10867]: <xxx@pve> starting task UPID:PX10-BW-N03:000041E1:000075F3:5B979E3C:vzdump::xxx@pve:
Sep 11 12:51:41 PX10-BW-N03 kernel: [  303.122196] rbd: rbd3: capacity 8589934592 features 0x1
Sep 11 12:51:41 PX10-BW-N03 kernel: [  303.160660] EXT4-fs (rbd3): mounted filesystem without journal. Opts: noload
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862007] EXT4-fs error (device rbd3): ext4_find_extent:915: inode #163: comm tar: pblk 38627 bad header/extent: invalid magic - magic 7466, entries 30836, max 769(0), depth 65281(0)
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862086]
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862086] Assertion failure in rbd_queue_workfn() at line 4035:
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862086]
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862086]      rbd_assert(op_type == OBJ_OP_READ || rbd_dev->spec->snap_id == CEPH_NOSNAP);
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862086]
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862164] ------------[ cut here ]------------
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862165] kernel BUG at drivers/block/rbd.c:4035!
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862187] invalid opcode: 0000 [#1] SMP PTI
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862202] Modules linked in: nfsv3 nfs_acl nfs lockd grace cmac arc4 md4 nls_utf8 cifs ccm fscache ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack veth rbd libceph ip_set ip6table_filter ip6_tables xfs iptable_filter 8021q garp mrp bonding softdog nfnetlink_log nfnetlink nls_iso8859_1 dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel zfs(PO) kvm irqbypass crct10dif_pclmul crc32_pclmul mgag200 ghash_clmulni_intel pcbc ttm aesni_intel aes_x86_64 crypto_simd zunicode(PO) drm_kms_helper ipmi_ssif glue_helper zavl(PO) cryptd icp(PO) drm intel_cstate i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt cdc_ether
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862430]  usbnet snd_pcm mii snd_timer joydev input_leds snd soundcore intel_rapl_perf mei_me mxm_wmi pcspkr mei lpc_ich shpchp ipmi_si ipmi_devintf wmi mac_hid ipmi_msghandler acpi_power_meter acpi_pad zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser sunrpc rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq hid_generic usbkbd usbmouse usbhid hid uas usb_storage i2c_i801 megaraid_sas bnx2x ptp pps_core mdio libcrc32c
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862578] CPU: 0 PID: 6182 Comm: kworker/0:4 Tainted: P           O     4.15.18-2-pve #1
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862602] Hardware name: LENOVO Lenovo Flex System x240 M5 Compute Node -[9532B2G]-/-[9532AC1]-, BIOS -[C4E134K-2.61]- 03/05/2018
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862641] Workqueue: rbd rbd_queue_workfn [rbd]
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862659] RIP: 0010:rbd_queue_workfn+0x462/0x4f0 [rbd]
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862675] RSP: 0018:ffffa3c30c7f7e18 EFLAGS: 00010286
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862692] RAX: 0000000000000086 RBX: ffff89e1fc056800 RCX: 0000000000000006
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862712] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff89e240016490
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862733] RBP: ffffa3c30c7f7e60 R08: 0000000000000000 R09: 0000000000000712
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862753] R10: 0000000000000073 R11: 00000000ffffffff R12: ffff89e21b4df5c0
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862774] R13: ffff89e1d6580480 R14: 0000000000000000 R15: 0000000000001000
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862795] FS:  0000000000000000(0000) GS:ffff89e240000000(0000) knlGS:0000000000000000
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862818] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862835] CR2: 000056282ba0c000 CR3: 0000001be080a002 CR4: 00000000001606f0
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862855] Call Trace:
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.862873]  ? __schedule+0x3e8/0x870
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.863663]  process_one_work+0x1e0/0x400
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.864473]  worker_thread+0x4b/0x420
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.865381]  kthread+0x105/0x140
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.866193]  ? process_one_work+0x400/0x400
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.866980]  ? kthread_create_worker_on_cpu+0x70/0x70
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.867764]  ? do_syscall_64+0x73/0x130
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.868583]  ? SyS_exit_group+0x14/0x20
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.869366]  ret_from_fork+0x35/0x40
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.870121] Code: 00 48 83 78 20 fe 0f 84 6a fc ff ff 48 c7 c1 a8 58 c3 c0 ba c3 0f 00 00 48 c7 c6 b0 6c c3 c0 48 c7 c7 90 4d c3 c0 e8 3e e5 6b fc <0f> 0b 48 8b 75 d0 4d 89 d0 44 89 f1 4c 89 fa 48 89 df 4c 89 55
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.871699] RIP: rbd_queue_workfn+0x462/0x4f0 [rbd] RSP: ffffa3c30c7f7e18
Sep 11 12:52:22 PX10-BW-N03 kernel: [  344.872505] ---[ end trace 7d0aec6097202426 ]---
Sep 11 12:58:23 PX10-BW-N03 pvedaemon[10867]: <xxx@pve> end task UPID:PX10-BW-N03:000041E1:000075F3:5B979E3C:vzdump::xxx@pve: job errors
 
Another one:

Code:
Sep  8 00:00:01 PX10-HBS-N04 vzdump[3606131]: <root@pam> starting task UPID:PX10-HBS-N04:00370679:05595DA3:5B92F4E1:vzdump::root@pam:
Sep  8 00:00:02 PX10-HBS-N04 kernel: [897424.397945] rbd: rbd3: capacity 8589934592 features 0x1
Sep  8 00:00:02 PX10-HBS-N04 kernel: [897424.446142] EXT4-fs (rbd3): write access unavailable, skipping orphan cleanup
Sep  8 00:00:02 PX10-HBS-N04 kernel: [897424.446177] EXT4-fs (rbd3): mounted filesystem without journal. Opts: noload
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554098] EXT4-fs error (device rbd3): ext4_lookup:1575: inode #17913: comm tar: deleted inode referenced: 18577
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554158]
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554158] Assertion failure in rbd_queue_workfn() at line 4035:
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554158]
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554158]    rbd_assert(op_type == OBJ_OP_READ || rbd_dev->spec->snap_id == CEPH_NOSNAP);
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554158]
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554231] ------------[ cut here ]------------
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554232] kernel BUG at drivers/block/rbd.c:4035!
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554254] invalid opcode: 0000 [#1] SMP PTI
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554268] Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_tcpudp ipt_REJECT nf_reject_ipv4 xt_conntrack nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack veth rbd libceph nfsv3 nfs_acl nfs lockd grace cmac arc4 md4 nls_utf8 cifs ccm fscache ip_set ip6table_filter ip6_tables xfs iptable_filter 8021q garp mrp bonding softdog nfnetlink_log nfnetlink nls_iso8859_1 dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel zfs(PO) kvm irqbypass crct10dif_pclmul mgag200 crc32_pclmul ghash_clmulni_intel ttm pcbc zunicode(PO) drm_kms_helper aesni_intel aes_x86_64 crypto_simd zavl(PO) glue_helper cryptd icp(PO) drm intel_cstate ipmi_ssif i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554497]  snd_pcm snd_timer cdc_ether snd usbnet soundcore mii pcspkr mei_me intel_rapl_perf mxm_wmi mei lpc_ich shpchp ipmi_si ipmi_devintf wmi ipmi_msghandler acpi_power_meter mac_hid acpi_pad zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core sunrpc iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq uas usb_storage i2c_i801 megaraid_sas bnx2x ptp pps_core mdio libcrc32c
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554636] CPU: 4 PID: 3280683 Comm: kworker/4:2 Tainted: P           O     4.15.18-2-pve #1
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554660] Hardware name: LENOVO Lenovo Flex System x240 M5 Compute Node -[9532B2G]-/-[9532AC1]-, BIOS -[C4E134K-2.61]- 03/05/2018
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554699] Workqueue: rbd rbd_queue_workfn [rbd]
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554717] RIP: 0010:rbd_queue_workfn+0x462/0x4f0 [rbd]
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554733] RSP: 0018:ffffaf6588c63e18 EFLAGS: 00010286
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554750] RAX: 0000000000000086 RBX: ffff89b985eaa000 RCX: 0000000000000006
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554770] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff89bbe0116490
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554790] RBP: ffffaf6588c63e60 R08: 0000000000000000 R09: 0000000000000776
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554811] R10: 0000000000000221 R11: 00000000ffffffff R12: ffff89bb1f7d8240
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554832] R13: ffff89b8fc8d1b00 R14: 0000000000000000 R15: 0000000000001000
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554853] FS:  0000000000000000(0000) GS:ffff89bbe0100000(0000) knlGS:0000000000000000
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.554876] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.555709] CR2: 000055abd1461568 CR3: 0000000c3d00a004 CR4: 00000000001626e0
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.556548] Call Trace:
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.557413]  ? __schedule+0x3e8/0x870
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.558196]  process_one_work+0x1e0/0x400
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.559024]  worker_thread+0x4b/0x420
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.559805]  kthread+0x105/0x140
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.560613]  ? process_one_work+0x400/0x400
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.561396]  ? kthread_create_worker_on_cpu+0x70/0x70
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.562177]  ret_from_fork+0x35/0x40
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.562937] Code: 00 48 83 78 20 fe 0f 84 6a fc ff ff 48 c7 c1 a8 98 e4 c0 ba c3 0f 00 00 48 c7 c6 b0 ac e4 c0 48 c7 c7 90 8d e4 c0 e8 3e a5 0a d1 <0f> 0b 48 8b 75 d0 4d 89 d0 44 89 f1 4c 89 fa 48 89 df 4c 89 55
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.564628] RIP: rbd_queue_workfn+0x462/0x4f0 [rbd] RSP: ffffaf6588c63e18
Sep  8 00:00:45 PX10-HBS-N04 kernel: [897466.565518] ---[ end trace c72d9181b05c9860 ]---
 
Hi,
today we found the real problem.
We tested it on one Proxmox node with a single LXC container.

Sometimes the backup finished without errors, and sometimes the backup process hung with the kernel error shown above.

The error always points to an ext4 filesystem error inside the vzdump snapshot. But why would we get an EXT4 error on a read-only snapshot? The VM image itself is always clean on fsck. So the problem must happen when the snapshot is created ... we thought.

After reading the Ceph documentation, we found this note, in big letters, about snapshots:

>Note STOP I/O BEFORE snapshotting an image. If the image contains a filesystem, the filesystem must be in a consistent state BEFORE snapshotting.

So I looked into the vzdump Perl module for LXC containers, and it turned out that the container is only frozen before the snapshot is taken when it has more than one volume.

So I changed it to always freeze the container before taking the snapshot, and the problem hasn't happened since ...
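Conceptually, the ordering the fix enforces looks like this (just a sketch with example names; vzdump does the equivalent internally through its Perl code, not via these exact commands):

Code:
# Illustrative ordering only; VMID 123 and the pool/image names are examples.
lxc-freeze -n 123                          # quiesce all container I/O first
rbd snap create rbd/vm-123-disk-0@vzdump   # snapshot a consistent filesystem
lxc-unfreeze -n 123                        # resume; the backup then reads from the snapshot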

https://bugzilla.proxmox.com/show_bug.cgi?id=1911

Kind regards
Thorsten
 
Thank you!
I found this thread, and it's exactly what is happening with us.

Did you find a quicker way of removing the @vzdump snapshots and the mapped images that this hang leaves behind? I just remove them with

Code:
rbd snap rm Pool/vm-1223-disk-0@vzdump
rbd unmap /dev/rbd20

Some of these fail, so I then reboot the host and try again.
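One ordering I still want to try is unmapping first (same names as in my commands above; whether this actually avoids the failures is just a guess):

Code:
# Cleanup sketch; unmapping before 'snap rm' is an assumption meant
# to avoid busy-image failures. Pool/device names as above.
rbd showmapped                            # find which /dev/rbdX maps the snapshot
rbd unmap /dev/rbd20                      # release the kernel mapping first
rbd snap rm Pool/vm-1223-disk-0@vzdump    # then remove the vzdump snapshot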


Many thanks
 
