Backup hangup with Ceph/rbd

kawataso · Jul 30, 2018

I use Ceph / rbd as storage and operate the container environment, but backup occasionally fails. Is there someone in a similar situation?

Code:

[304948.926528] EXT4-fs error (device rbd5): ext4_lookup:1575: inode #2621882: comm tar: deleted inode referenced: 2643543
[304948.927428]
                Assertion failure in rbd_queue_workfn() at line 4035:

                        rbd_assert(op_type == OBJ_OP_READ || rbd_dev->spec->snap_id == CEPH_NOSNAP);

[304948.931565] ------------[ cut here ]------------
[304948.931566] kernel BUG at drivers/block/rbd.c:4035!
[304948.932423] invalid opcode: 0000 [#1] SMP PTI
[304948.933255] Modules linked in: xt_set ip_set_hash_ip xt_multiport xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw veth 8021q garp mrp rbd libceph nfsv3 nfs_acl nfs lockd grace fscache ip_set ip6table_filter ip6_tables xfs libcrc32c mptctl mptbase iptable_filter dell_rbu softdog nfnetlink_log nfnetlink intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper dcdbas cryptd snd_pcm intel_cstate intel_rapl_perf
[304948.940099]  snd_timer snd mgag200 ttm soundcore drm_kms_helper pcspkr joydev input_leds ipmi_ssif drm cdc_ether usbnet mii i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt lpc_ich mei_me mei ipmi_si shpchp ipmi_devintf ipmi_msghandler wmi mac_hid acpi_power_meter vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp sunrpc libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs xor zstd_compress raid6_pq hid_generic usbmouse usbkbd usbhid hid tg3 ahci ptp libahci pps_core mpt3sas raid_class scsi_transport_sas megaraid_sas
[304948.947276] CPU: 3 PID: 3819431 Comm: kworker/3:1 Tainted: P           O     4.15.18-1-pve #1
[304948.948549] Hardware name: Dell Inc. PowerEdge R420/XXXXXX, BIOS 2.4.2 01/29/2015
[304948.949836] Workqueue: rbd rbd_queue_workfn [rbd]
[304948.951124] RIP: 0010:rbd_queue_workfn+0x462/0x4f0 [rbd]
[304948.952414] RSP: 0018:ffffb6e0cf33be18 EFLAGS: 00010286
[304948.953699] RAX: 0000000000000086 RBX: ffff8ca8c1260800 RCX: 0000000000000000
[304948.954998] RDX: 0000000000000000 RSI: ffff8cabdead6498 RDI: ffff8cabdead6498
[304948.956297] RBP: ffffb6e0cf33be60 R08: 0000000000000101 R09: 0000000000000498
[304948.957598] R10: 000000000000039a R11: 00000000ffffffff R12: ffff8ca8dc0f6480
[304948.958907] R13: ffff8ca92b8a3780 R14: 0000000000000000 R15: 0000000000001000
[304948.960219] FS:  0000000000000000(0000) GS:ffff8cabdeac0000(0000) knlGS:0000000000000000
[304948.961548] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[304948.962875] CR2: 00007f5d79bb2000 CR3: 00000002ba40a001 CR4: 00000000000626e0
[304948.964222] Call Trace:
[304948.965573]  ? __schedule+0x3e8/0x870
[304948.966907]  process_one_work+0x1e0/0x400
[304948.968236]  worker_thread+0x4b/0x420
[304948.969546]  kthread+0x105/0x140
[304948.970861]  ? process_one_work+0x400/0x400
[304948.972187]  ? kthread_create_worker_on_cpu+0x70/0x70
[304948.973503]  ? do_syscall_64+0x73/0x130
[304948.974824]  ? SyS_exit_group+0x14/0x20
[304948.976155]  ret_from_fork+0x35/0x40
[304948.977488] Code: 00 48 83 78 20 fe 0f 84 6a fc ff ff 48 c7 c1 a8 28 0d c1 ba c3 0f 00 00 48 c7 c6 b0 3c 0d c1 48 c7 c7 90 1d 0d c1 e8 ae 0c 82 d8 <0f> 0b 48 8b 75 d0 4d 89 d0 44 89 f1 4c 89 fa 48 89 df 4c 89 55
[304948.980361] RIP: rbd_queue_workfn+0x462/0x4f0 [rbd] RSP: ffffb6e0cf33be18
[304948.981831] ---[ end trace aba98a911c548647 ]---


# ps
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     3646254  0.3  0.4  86352 68836 ?        D    Jul29   2:09 tar cpf - --totals --one-file-system -p --sparse --numeric-owner --acls --xattrs --xattrs-include=user.* --xattrs-include=security.capability --warning=no-file-ignored --warning=no-xattr-write --one-file-system --warning=no-file-ignored --directory=/mnt/pve/%NFSHOST%/dump/vzdump-lxc-215-2018_07_29-22_13_33.tmp ./etc/vzdump/pct.conf --directory=/mnt/vzsnap0 --no-anchored --exclude=lost+found --anchored --exclude=./tmp/?* --exclude=./var/tmp/?* --exclude=./var/run/?*.pid ./

Alwin · Jul 30, 2018

Can you please give us some more information on your setup?

kawataso · Jul 31, 2018

I made a block diagram of the cluster.
Clusters are made up of 6 units, and 2 models are mixed. OSD is running only on R420, but all hosts are accessing Ceph. LXC container and KVM VM are running on all hosts.
The stack trace when vzdump hangs occasionally occurred on any node irregularly, and occurrence frequency is about once every two weeks. It seems to be occurring during dump processing of LXC container.
Is there a command you want to know about the execution result?

Alwin · Jul 31, 2018

What 'pveversion -v' are you running? Are the same containers crashing? Do the containers have external mounts?

Side note: You only need three MONs, more then that comes into play, when you have 1000s of nodes/clients. It reduces the latency and saves resources on the OSD nodes. Also your corosync traffic needs to be separated to have a stable working cluster.

kawataso · Jul 31, 2018

- What 'pveversion -v' are you running?

Code:

# pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.18-1-pve)
pve-manager: 5.2-5 (running version: 5.2-5/eb24855a)
pve-kernel-4.15: 5.2-4
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-1-pve: 4.15.18-16
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.16-3-pve: 4.13.16-50
pve-kernel-4.13.16-2-pve: 4.13.16-48
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.4.83-1-pve: 4.4.83-96
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.4.35-1-pve: 4.4.35-77
ceph: 12.2.5-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-35
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-9
libpve-storage-perl: 5.0-24
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-1
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-28
pve-container: 2.0-24
pve-docs: 5.2-4
pve-firewall: 3.0-13
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-29
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9

# uname -a
Linux HOST4 4.15.18-1-pve #1 SMP PVE 4.15.18-16 (Mon, 23 Jul 2018 15:59:19 +0200) x86_64 GNU/Linux

NOTE: Each host is rolling update at any time and keeps up to date as much as possible. Although I have renewed kernel several times over the last six months, the frequency of problems does not change.

- Are the same containers crashing?
No.
After detecting the vzdump hang up, we move (migrate with restart mode) to deal with the problem.
Although pct unlock is required, container migration always succeeds.

- Do the containers have external mounts?
No.

Alwin · Jul 31, 2018

And what do you see in the logs (especially around vzdump hangs)?

kawataso · Jul 31, 2018

When vzdump hangs up, the output stops at line 8 of the task log. When hang up is detected and the task is stopped, the process is canceled, processing of the next container starts after the error is recorded, but the entire task is canceled immediately after that and the log output also ends there.

Code:

INFO: Starting Backup of VM 215 (lxc)
INFO: status = running
INFO: CT Name: XXXCT1
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
/dev/rbd5
INFO: creating archive '/mnt/pve/ckrs03-nfs/dump/vzdump-lxc-215-2018_07_29-22_13_33.tar.gz'
INFO: remove vzdump snapshot
rbd: sysfs write failed
can't unmap rbd volume vm-215-disk-1: rbd: sysfs write failed
ERROR: Backup of VM 215 failed - command 'set -o pipefail && tar cpf - --totals --one-file-system -p --sparse --numeric-owner --acls --xattrs '--xattrs-include=user.*' '--xattrs-include=security.capability' '--warning=no-file-ignored' '--warning=no-xattr-write' --one-file-system '--warning=no-file-ignored' '--directory=/mnt/pve/NFSHOST/dump/vzdump-lxc-215-2018_07_29-22_13_33.tmp' ./etc/vzdump/pct.conf '--directory=/mnt/vzsnap0' --no-anchored '--exclude=lost+found' --anchored '--exclude=./tmp/?*' '--exclude=./var/tmp/?*' '--exclude=./var/run/?*.pid' ./ | gzip >/mnt/pve/NFSHOST/dump/vzdump-lxc-215-2018_07_29-22_13_33.tar.dat' failed: interrupted by signal
INFO: Starting Backup of VM 2238 (lxc)
INFO: status = running
INFO: CT Name: XXXCT2
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
/dev/rbd6
INFO: creating archive '/mnt/pve/ckrs03-nfs/dump/vzdump-lxc-2238-2018_07_30-09_25_25.tar.gz'

Alwin · Jul 31, 2018

And what's in the syslog/journal? Please try the backup with lzo compression, just to see if this makes a difference. It would give us a point to start.

kawataso · Aug 1, 2018

I can not disclose everything about syslog, but I extracted excerpts that seems necessary. An excerpt of the part where the backup of the container with ID 215 is started, then the kernel BUG is recorded and the fact that I canceled the task the next morning is recorded. Since it was not possible to judge whether there is a relation, I extracted a log on ceph - mgr. I intend to extract carefully, but please let me know if there are any other logs that I think are missing. Most of the saved logs are logs of processes that are regularly executed, such as CRON jobs.
The journal log has already disappeared over time.

I changed the backup compression method from GZIP to LZO, so I will monitor if there is any change.

Code:

Jul 29 22:13:33 HOST4 vzdump[3477076]: INFO: Starting Backup of VM 215 (lxc)
Jul 29 22:13:35 HOST4 kernel: [300558.445168] rbd: rbd5: capacity 53687091200 features 0x1
Jul 29 22:13:35 HOST4 kernel: [300558.917672] EXT4-fs (rbd5): write access unavailable, skipping orphan cleanup
Jul 29 22:13:35 HOST4 kernel: [300558.918917] EXT4-fs (rbd5): mounted filesystem without journal. Opts: noload
(snip)
Jul 29 23:26:45 HOST4 kernel: [304948.926528] EXT4-fs error (device rbd5): ext4_lookup:1575: inode #2621882: comm tar: deleted inode referenced: 2643543
Jul 29 23:26:45 HOST4 kernel: [304948.927428]
Jul 29 23:26:45 HOST4 kernel: [304948.927428] Assertion failure in rbd_queue_workfn() at line 4035:
Jul 29 23:26:45 HOST4 kernel: [304948.927428]
Jul 29 23:26:45 HOST4 kernel: [304948.927428]  rbd_assert(op_type == OBJ_OP_READ || rbd_dev->spec->snap_id == CEPH_NOSNAP);
Jul 29 23:26:45 HOST4 kernel: [304948.927428]
Jul 29 23:26:45 HOST4 kernel: [304948.931565] ------------[ cut here ]------------
Jul 29 23:26:45 HOST4 kernel: [304948.931566] kernel BUG at drivers/block/rbd.c:4035!
Jul 29 23:26:45 HOST4 kernel: [304948.932423] invalid opcode: 0000 [#1] SMP PTI
Jul 29 23:26:45 HOST4 kernel: [304948.933255] Modules linked in: xt_set ip_set_hash_ip xt_multiport xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw veth 8021q garp mrp rbd libceph nfsv3 nfs_acl nfs lockd grace fscache ip_set ip6table_filter ip6_tables xfs libcrc32c mptctl mptbase iptable_filter dell_rbu softdog nfnetlink_log nfnetlink intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper dcdbas cryptd snd_pcm intel_cstate intel_rapl_perf
Jul 29 23:26:45 HOST4 kernel: [304948.940099]  snd_timer snd mgag200 ttm soundcore drm_kms_helper pcspkr joydev input_leds ipmi_ssif drm cdc_ether usbnet mii i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt lpc_ich mei_me mei ipmi_si shpchp ipmi_devintf ipmi_msghandler wmi mac_hid acpi_power_meter vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp sunrpc libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs xor zstd_compress raid6_pq hid_generic usbmouse usbkbd usbhid hid tg3 ahci ptp libahci pps_core mpt3sas raid_class scsi_transport_sas megaraid_sas
Jul 29 23:26:45 HOST4 kernel: [304948.947276] CPU: 3 PID: 3819431 Comm: kworker/3:1 Tainted: P           O     4.15.18-1-pve #1
Jul 29 23:26:45 HOST4 kernel: [304948.948549] Hardware name: Dell Inc. PowerEdge R420/072XWF, BIOS 2.4.2 01/29/2015
Jul 29 23:26:45 HOST4 kernel: [304948.949836] Workqueue: rbd rbd_queue_workfn [rbd]
Jul 29 23:26:45 HOST4 kernel: [304948.951124] RIP: 0010:rbd_queue_workfn+0x462/0x4f0 [rbd]
Jul 29 23:26:45 HOST4 kernel: [304948.952414] RSP: 0018:ffffb6e0cf33be18 EFLAGS: 00010286
Jul 29 23:26:45 HOST4 kernel: [304948.953699] RAX: 0000000000000086 RBX: ffff8ca8c1260800 RCX: 0000000000000000
Jul 29 23:26:45 HOST4 kernel: [304948.954998] RDX: 0000000000000000 RSI: ffff8cabdead6498 RDI: ffff8cabdead6498
Jul 29 23:26:45 HOST4 kernel: [304948.956297] RBP: ffffb6e0cf33be60 R08: 0000000000000101 R09: 0000000000000498
Jul 29 23:26:45 HOST4 kernel: [304948.957598] R10: 000000000000039a R11: 00000000ffffffff R12: ffff8ca8dc0f6480
Jul 29 23:26:45 HOST4 kernel: [304948.958907] R13: ffff8ca92b8a3780 R14: 0000000000000000 R15: 0000000000001000
Jul 29 23:26:45 HOST4 kernel: [304948.960219] FS:  0000000000000000(0000) GS:ffff8cabdeac0000(0000) knlGS:0000000000000000
Jul 29 23:26:45 HOST4 kernel: [304948.961548] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 29 23:26:45 HOST4 kernel: [304948.962875] CR2: 00007f5d79bb2000 CR3: 00000002ba40a001 CR4: 00000000000626e0
Jul 29 23:26:45 HOST4 kernel: [304948.964222] Call Trace:
Jul 29 23:26:45 HOST4 kernel: [304948.965573]  ? __schedule+0x3e8/0x870
Jul 29 23:26:45 HOST4 kernel: [304948.966907]  process_one_work+0x1e0/0x400
Jul 29 23:26:45 HOST4 kernel: [304948.968236]  worker_thread+0x4b/0x420
Jul 29 23:26:45 HOST4 kernel: [304948.969546]  kthread+0x105/0x140
Jul 29 23:26:45 HOST4 kernel: [304948.970861]  ? process_one_work+0x400/0x400
Jul 29 23:26:45 HOST4 kernel: [304948.972187]  ? kthread_create_worker_on_cpu+0x70/0x70
Jul 29 23:26:45 HOST4 kernel: [304948.973503]  ? do_syscall_64+0x73/0x130
Jul 29 23:26:45 HOST4 kernel: [304948.974824]  ? SyS_exit_group+0x14/0x20
Jul 29 23:26:45 HOST4 kernel: [304948.976155]  ret_from_fork+0x35/0x40
Jul 29 23:26:45 HOST4 kernel: [304948.977488] Code: 00 48 83 78 20 fe 0f 84 6a fc ff ff 48 c7 c1 a8 28 0d c1 ba c3 0f 00 00 48 c7 c6 b0 3c 0d c1 48 c7 c7 90 1d 0d c1 e8 ae 0c 82 d8 <0f> 0b 48 8b 75 d0 4d 89 d0 44 89 f1 4c 89 fa 48 89 df 4c 89 55
Jul 29 23:26:45 HOST4 kernel: [304948.980361] RIP: rbd_queue_workfn+0x462/0x4f0 [rbd] RSP: ffffb6e0cf33be18
Jul 29 23:26:45 HOST4 kernel: [304948.981831] ---[ end trace aba98a911c548647 ]---
(snip)
Jul 30 06:25:01 HOST4 ceph-mgr[5613]: 2018-07-30 06:25:01.899334 7fa5f9af7700 -1 received  signal: Hangup from  PID: 834618 task name: killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw  UID: 0
Jul 30 06:25:01 HOST4 ceph-osd[7183]: 2018-07-30 06:25:01.899418 7f90db802700 -1 received  signal: Hangup from  PID: 834618 task name: killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw  UID: 0
Jul 30 06:25:01 HOST4 ceph-osd[6917]: 2018-07-30 06:25:01.899418 7fb51e4a5700 -1 received  signal: Hangup from  PID: 834618 task name: killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw  UID: 0
Jul 30 06:25:01 HOST4 ceph-mon[5595]: 2018-07-30 06:25:01.904786 7fed2ef0e700 -1 Fail to open '/proc/834618/cmdline' error = (2) No such file or directory
Jul 30 06:25:01 HOST4 ceph-mon[5595]: 2018-07-30 06:25:01.904819 7fed2ef0e700 -1 received  signal: Hangup from  PID: 834618 task name: <unknown> UID: 0
(snip)
Jul 30 09:25:25 HOST4 vzdump[3477076]: ERROR: Backup of VM 215 failed - command 'set -o pipefail && tar cpf - --totals --one-file-system -p --sparse --numeric-owner --acls --xattrs '--xattrs-include=user.*' '--xattrs-include=security.capability' '--warning=no-file-ignored' '--warning=no-xattr-write' --one-file-system '--warning=no-file-ignored' '--directory=/mnt/pve/NFSHOST/dump/vzdump-lxc-215-2018_07_29-22_13_33.tmp' ./etc/vzdump/pct.conf '--directory=/mnt/vzsnap0' --no-anchored '--exclude=lost+found' --anchored '--exclude=./tmp/?*' '--exclude=./var/tmp/?*' '--exclude=./var/run/?*.pid' ./ | gzip >/mnt/pve/NFSHOST/dump/vzdump-lxc-215-2018_07_29-22_13_33.tar.dat' failed: interrupted by signal
Jul 30 09:25:25 HOST4 vzdump[3477076]: INFO: Starting Backup of VM 2238 (lxc)
Jul 30 09:25:27 HOST4 kernel: [340870.123684] rbd: rbd6: capacity 32212254720 features 0x1
Jul 30 09:25:27 HOST4 kernel: [340870.284952] EXT4-fs (rbd6): write access unavailable, skipping orphan cleanup
Jul 30 09:25:27 HOST4 kernel: [340870.287101] EXT4-fs (rbd6): mounted filesystem without journal. Opts: noload

(repeated)
Jul 30 00:40:01 HOST4 CRON[4043654]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif
 [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Jul 30 00:41:00 HOST4 systemd[1]: Starting Proxmox VE replication runner...
Jul 30 00:41:01 HOST4 systemd[1]: Started Proxmox VE replication runner.
Jul 30 00:42:00 HOST4 systemd[1]: Starting Proxmox VE replication runner...
Jul 30 00:42:01 HOST4 systemd[1]: Started Proxmox VE replication runner.
Jul 30 00:42:05 HOST4 pmxcfs[5260]: [status] notice: received log
Jul 30 00:43:00 HOST4 systemd[1]: Starting Proxmox VE replication runner...
Jul 30 00:43:01 HOST4 systemd[1]: Started Proxmox VE replication runner.
Jul 30 00:44:00 HOST4 systemd[1]: Starting Proxmox VE replication runner...
Jul 30 00:44:01 HOST4 systemd[1]: Started Proxmox VE replication runner.
Jul 30 00:44:40 HOST4 rrdcached[4920]: flushing old values
Jul 30 00:44:40 HOST4 rrdcached[4920]: rotating journals
Jul 30 00:44:40 HOST4 rrdcached[4920]: started new journal /var/lib/rrdcached/journal/rrd.journal.1532879080.242131
Jul 30 00:44:40 HOST4 rrdcached[4920]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1532871880.242013
Jul 30 00:45:00 HOST4 systemd[1]: Starting Proxmox VE replication runner...
Jul 30 00:45:01 HOST4 systemd[1]: Started Proxmox VE replication runner.

Alwin · Aug 1, 2018

Where you snipped the logfile, are there more messages concerning the backup of lxc 215? As for now, it seems that the mapped snapshot disappeared before the backup finished.

kawataso · Aug 2, 2018

The log presented in # 9 is excerpted from /var/log/syslog and the rotated archive, but I do not think there is any other related logs.
I changed the compression method from last night to LZO, but the same problem was reproduced at once. However, it is occurring in HOST5 instead of the same host. I will collect logs and then post them.

kawataso · Aug 2, 2018

First, it is an excerpt from syslog including kernel's Stack trace. I pulled out what I think is related to lxc 224 which failed backup this time. I started canceling the task last night and canceling the task today, so the target syslog file is divided into two. Logs available from journalctl were similar.
Next is the result of HOST5 pveversion and the information of the tar process where IO was stopped.
The next is snapshot information created for lxc 224 backup. I confirmed it just before canceling the task, but the snapshot appeared to be present, but it was not mounted on vzsnap0 for the snapshot.
Next is the backup task log. It includes the logs of the backup targets before and after. Since the entire task has been canceled, only the snapshot and the map to rbd are done, and the container backup is not done.
Since we checked the mapping of rbd and the state of snapshot after canceling the task, we will post it. A log in which the mounted vzsnap0 was unmounted was not recorded (including a normal backup).

Code:

## from syslog.1
Aug  1 22:35:44 HOST5 vzdump[1987963]: INFO: Starting Backup of VM 224 (lxc)
(snip)
Aug  1 22:43:33 HOST5 kernel: [650826.159211] EXT4-fs error (device rbd7): ext4_lookup:1575: inode #2097306: comm tar: deleted inode referenced: 2097331
Aug  1 22:43:33 HOST5 kernel: [650826.159992]
Aug  1 22:43:33 HOST5 kernel: [650826.159992] Assertion failure in rbd_queue_workfn() at line 4035:
Aug  1 22:43:33 HOST5 kernel: [650826.159992]
Aug  1 22:43:33 HOST5 kernel: [650826.159992]  rbd_assert(op_type == OBJ_OP_READ || rbd_dev->spec->snap_id == CEPH_NOSNAP);
Aug  1 22:43:33 HOST5 kernel: [650826.159992]
Aug  1 22:43:33 HOST5 kernel: [650826.163339] ------------[ cut here ]------------
Aug  1 22:43:33 HOST5 kernel: [650826.163342] kernel BUG at drivers/block/rbd.c:4035!
Aug  1 22:43:33 HOST5 kernel: [650826.164051] invalid opcode: 0000 [#1] SMP PTI
Aug  1 22:43:33 HOST5 kernel: [650826.164761] Modules linked in: binfmt_misc xt_multiport xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw 8021q garp mrp veth rbd libceph nfsv3 nfs_acl nfs lockd grace fscache ip_set ip6table_filter ip6_tables xfs libcrc32c mptctl mptbase iptable_filter dell_rbu softdog nfnetlink_log nfnetlink intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel dcdbas mgag200 aes_x86_64 crypto_simd ttm glue_helper cryptd snd_pcm drm_kms_helper intel_cstate
Aug  1 22:43:33 HOST5 kernel: [650826.171536]  snd_timer intel_rapl_perf snd soundcore drm pcspkr i2c_algo_bit fb_sys_fops syscopyarea ipmi_ssif joydev input_leds sysfillrect sysimgblt cdc_ether usbnet mii lpc_ich mei_me mei ipmi_si ipmi_devintf shpchp ipmi_msghandler wmi acpi_power_meter mac_hid vhost_net vhost tap ib_iser rdma_cm iw_cm sunrpc ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs xor zstd_compress raid6_pq hid_generic usbmouse usbkbd usbhid hid tg3 ptp pps_core mpt3sas raid_class scsi_transport_sas ahci libahci megaraid_sas
Aug  1 22:43:33 HOST5 kernel: [650826.178348] CPU: 2 PID: 2148816 Comm: kworker/2:1 Tainted: P           O     4.15.18-1-pve #1
Aug  1 22:43:33 HOST5 kernel: [650826.179871] Hardware name: Dell Inc. PowerEdge R420/072XWF, BIOS 2.4.2 01/29/2015
Aug  1 22:43:33 HOST5 kernel: [650826.181436] Workqueue: rbd rbd_queue_workfn [rbd]
Aug  1 22:43:33 HOST5 kernel: [650826.183023] RIP: 0010:rbd_queue_workfn+0x462/0x4f0 [rbd]
Aug  1 22:43:33 HOST5 kernel: [650826.184626] RSP: 0018:ffff9f86cccd7e18 EFLAGS: 00010286
Aug  1 22:43:33 HOST5 kernel: [650826.186247] RAX: 0000000000000086 RBX: ffff8f0a83b52800 RCX: 0000000000000000
Aug  1 22:43:33 HOST5 kernel: [650826.187904] RDX: 0000000000000000 RSI: ffff8f0d9ea96498 RDI: ffff8f0d9ea96498
Aug  1 22:43:33 HOST5 kernel: [650826.189581] RBP: ffff9f86cccd7e60 R08: 0000000000000101 R09: 000000000000054b
Aug  1 22:43:33 HOST5 kernel: [650826.191283] R10: 00000000000000e5 R11: 00000000ffffffff R12: ffff8f0a814a6000
Aug  1 22:43:33 HOST5 kernel: [650826.193007] R13: ffff8f0a897cac80 R14: 0000000000000000 R15: 0000000000001000
Aug  1 22:43:33 HOST5 kernel: [650826.194752] FS:  0000000000000000(0000) GS:ffff8f0d9ea80000(0000) knlGS:0000000000000000
Aug  1 22:43:33 HOST5 kernel: [650826.196541] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug  1 22:43:33 HOST5 kernel: [650826.198349] CR2: 00007f47206da000 CR3: 000000004180a002 CR4: 00000000000626e0
Aug  1 22:43:33 HOST5 kernel: [650826.200189] Call Trace:
Aug  1 22:43:33 HOST5 kernel: [650826.202052]  ? __schedule+0x3e8/0x870
Aug  1 22:43:33 HOST5 kernel: [650826.203918]  process_one_work+0x1e0/0x400
Aug  1 22:43:33 HOST5 kernel: [650826.205800]  worker_thread+0x4b/0x420
Aug  1 22:43:33 HOST5 kernel: [650826.207689]  kthread+0x105/0x140
Aug  1 22:43:33 HOST5 kernel: [650826.209583]  ? process_one_work+0x400/0x400
Aug  1 22:43:33 HOST5 kernel: [650826.211499]  ? kthread_create_worker_on_cpu+0x70/0x70
Aug  1 22:43:33 HOST5 kernel: [650826.213442]  ? do_syscall_64+0x73/0x130
Aug  1 22:43:33 HOST5 kernel: [650826.215386]  ? kthread_create_worker_on_cpu+0x70/0x70
Aug  1 22:43:33 HOST5 kernel: [650826.217357]  ret_from_fork+0x35/0x40
Aug  1 22:43:33 HOST5 kernel: [650826.219336] Code: 00 48 83 78 20 fe 0f 84 6a fc ff ff 48 c7 c1 a8 a8 0b c1 ba c3 0f 00 00 48 c7 c6 b0 bc 0b c1 48 c7 c7 90 9d 0b c1 e8 ae 8c 83 ea <0f> 0b 48 8b 75 d0 4d 89 d0 44 89 f1 4c 89 fa 48 89 df 4c 89 55
Aug  1 22:43:33 HOST5 kernel: [650826.223596] RIP: rbd_queue_workfn+0x462/0x4f0 [rbd] RSP: ffff9f86cccd7e18
Aug  1 22:43:33 HOST5 kernel: [650826.225802] ---[ end trace 077dfe1021fa9487 ]---

## from syslog
Aug  2 09:39:47 HOST5 vzdump[1987963]: ERROR: Backup of VM 224 failed - command 'set -o pipefail && tar cpf - --totals --one-file-system -p --sparse --numeric-owner --acls --xattrs '--xattrs-include=user.*' '--xattrs-include=security.capability' '--warning=no-file-ignored' '--warning=no-xattr-write' --one-file-system '--warning=no-file-ignored' '--directory=/mnt/pve/NFSHOST/dump/vzdump-lxc-224-2018_08_01-22_35_44.tmp' ./etc/vzdump/pct.conf '--directory=/mnt/vzsnap0' --no-anchored '--exclude=lost+found' --anchored '--exclude=./tmp/?*' '--exclude=./var/tmp/?*' '--exclude=./var/run/?*.pid' ./ | lzop >/mnt/pve/NFSHOST/dump/vzdump-lxc-224-2018_08_01-22_35_44.tar.dat' failed: interrupted by signal
Aug  2 09:39:47 HOST5 vzdump[1987963]: INFO: Starting Backup of VM 225 (lxc)
Aug  2 09:39:50 HOST5 kernel: [690202.276308] rbd: rbd8: capacity 53687091200 features 0x1
Aug  2 09:39:50 HOST5 kernel: [690202.504630] EXT4-fs (rbd8): write access unavailable, skipping orphan cleanup
Aug  2 09:39:50 HOST5 kernel: [690202.505113] EXT4-fs (rbd8): mounted filesystem without journal. Opts: noload

# pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.18-1-pve)
pve-manager: 5.2-5 (running version: 5.2-5/eb24855a)
pve-kernel-4.15: 5.2-4
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-1-pve: 4.15.18-15
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.16-3-pve: 4.13.16-50
pve-kernel-4.13.16-2-pve: 4.13.16-48
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.4.83-1-pve: 4.4.83-96
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.4.35-1-pve: 4.4.35-77
ceph: 12.2.5-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-35
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-9
libpve-storage-perl: 5.0-24
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-1
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-28
pve-container: 2.0-24
pve-docs: 5.2-4
pve-firewall: 3.0-13
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-29
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9

## D state proccess
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     2255388  0.0  0.0  34072  9920 ?        D    Aug01   0:23 tar cpf - --totals --one-file-system -p --sparse --numeric-owner --acls --xattrs --xattrs-include=user.* --xattrs-include=security.capability --warning=no-file-ignored --warning=no-xattr-write --one-file-system --warning=no-file-ignored --directory=/mnt/pve/NFSHOST/dump/vzdump-lxc-224-2018_08_01-22_35_44.tmp ./etc/vzdump/pct.conf --directory=/mnt/vzsnap0 --no-anchored --exclude=lost+found --anchored --exclude=./tmp/?* --exclude=./var/tmp/?* --exclude=./var/run/?*.pid ./

# rbd snap ls ceph-pool/vm-224-disk-1
SNAPID NAME       SIZE TIMESTAMP
 15633 vzdump 51200 MB Wed Aug  1 22:35:46 2018
 
# mount | grep vzsnap0
(not mounted)

## TASK Log
INFO: Starting Backup of VM 223 (lxc)
INFO: status = running
INFO: CT Name: XXXXCT223
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
/dev/rbd7
INFO: creating archive '/mnt/pve/NFSHOST/dump/vzdump-lxc-223-2018_08_01-22_18_50.tar.lzo'
INFO: Total bytes written: 11089131520 (11GiB, 11MiB/s)
INFO: archive file size: 4.01GB
INFO: delete old backup '/mnt/pve/NFSHOST/dump/vzdump-lxc-223-2018_07_30-22_49_34.tar.gz'
INFO: remove vzdump snapshot
Removing snap: 100% complete...done.
INFO: Finished Backup of VM 223 (00:16:54)
INFO: Starting Backup of VM 224 (lxc)
INFO: status = running
INFO: CT Name: XXXXCT224
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
/dev/rbd7
INFO: creating archive '/mnt/pve/NFSHOST/dump/vzdump-lxc-224-2018_08_01-22_35_44.tar.lzo'
INFO: remove vzdump snapshot
rbd: sysfs write failed
can't unmap rbd volume vm-224-disk-1: rbd: sysfs write failed
ERROR: Backup of VM 224 failed - command 'set -o pipefail && tar cpf - --totals --one-file-system -p --sparse --numeric-owner --acls --xattrs '--xattrs-include=user.*' '--xattrs-include=security.capability' '--warning=no-file-ignored' '--warning=no-xattr-write' --one-file-system '--warning=no-file-ignored' '--directory=/mnt/pve/NFSHOST/dump/vzdump-lxc-224-2018_08_01-22_35_44.tmp' ./etc/vzdump/pct.conf '--directory=/mnt/vzsnap0' --no-anchored '--exclude=lost+found' --anchored '--exclude=./tmp/?*' '--exclude=./var/tmp/?*' '--exclude=./var/run/?*.pid' ./ | lzop >/mnt/pve/NFSHOST/dump/vzdump-lxc-224-2018_08_01-22_35_44.tar.dat' failed: interrupted by signal
INFO: Starting Backup of VM 225 (lxc)
INFO: status = running
INFO: CT Name: XXXXCT225
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
/dev/rbd8
INFO: creating archive '/mnt/pve/NFSHOST/dump/vzdump-lxc-225-2018_08_02-09_39_47.tar.lzo'
(TASK Canceled at 09:32249 JST)

## after TASK Canceled
# rbd showmapped
id pool      image          snap   device
0  ceph-pool vm-221-disk-1  -      /dev/rbd0
1  ceph-pool vm-226-disk-1  -      /dev/rbd1
2  ceph-pool vm-224-disk-1  -      /dev/rbd2
3  ceph-pool vm-223-disk-1  -      /dev/rbd3
4  ceph-pool vm-225-disk-1  -      /dev/rbd4
5  ceph-pool vm-229-disk-1  -      /dev/rbd5
6  ceph-pool vm-9004-disk-1 -      /dev/rbd6
7  ceph-pool vm-224-disk-1  vzdump /dev/rbd7
8  ceph-pool vm-225-disk-1  vzdump /dev/rbd8
# rbd snap ls ceph-pool/vm-224-disk-1
SNAPID NAME       SIZE TIMESTAMP
 15633 vzdump 51200 MB Wed Aug  1 22:35:46 2018
# rbd snap ls ceph-pool/vm-225-disk-1
SNAPID NAME       SIZE TIMESTAMP
 15646 vzdump 51200 MB Thu Aug  2 09:39:50 2018

Alwin · Aug 2, 2018

My conclusions from the output so far.

kawataso said:
Aug 1 22:43:33 HOST5 kernel: [650826.159211] EXT4-fs error (device rbd7): ext4_lookup:1575: inode #2097306: comm tar: deleted inode referenced: 2097331

This probably can be neglected, as the mapped rbd image is mounted read-only without loading the journal.

kawataso said:
Aug 1 22:43:33 HOST5 kernel: [650826.159992] Assertion failure in rbd_queue_workfn() at line 4035: Aug 1 22:43:33 HOST5 kernel: [650826.159992] Aug 1 22:43:33 HOST5 kernel: [650826.159992] rbd_assert(op_type == OBJ_OP_READ || rbd_dev->spec->snap_id == CEPH_NOSNAP); Aug 1 22:43:33 HOST5 kernel: [650826.159992] Aug 1 22:43:33 HOST5 kernel: [650826.163339] ------------[ cut here ]------------ Aug 1 22:43:33 HOST5 kernel: [650826.163342] kernel BUG at drivers/block/rbd.c:4035!

Here, ceph bails out and the access for tar to the image is gone, at least the mount is gone.

kawataso said:
mpt3sas raid_class scsi_transport_sas ahci libahci megaraid_sas

Are you running ceph OSDs on RAID0/JBOD? This might be also a point where ceph might fail. On the mailling list and on the bug tracker often entries are around where the RAID controller is a culprit.

On pvetest there would be a minor kernel update and the ceph update to 12.2.7. While I didn't find anything directly related to the above, it might be worth a try. The whole cluster would need to be upgraded though.

kawataso said:
7 ceph-pool vm-224-disk-1 vzdump /dev/rbd7
8 ceph-pool vm-225-disk-1 vzdump /dev/rbd8
# rbd snap ls ceph-pool/vm-224-disk-1
SNAPID NAME SIZE TIMESTAMP
15633 vzdump 51200 MB Wed Aug 1 22:35:46 2018

While the snapshot is still there, does the container get migrated? When the container migrates is the same snapshot still mapped? It may well be that the mapping still exists, though the snapshot that originally created the mapping doesn't exist anymore and the backup fails, as it can't get a new map on the snapshot.

kawataso · Aug 6, 2018

Thank you for investigating.

R420 is equipped with a RAID controller, but Ceph is manipulating the disk directly in non-RAID mode. As a precaution, I investigated the firmware version of the RAID controller, but the latest version was applied. We also investigated the firmware version of the disk, and as updatable firmware was published, it is in the process of being applied.

The migration just after backup failed was successful. Snapshots existed even after the migration was complete. When R420 was restarted in order to solve the problem of vzdump, the mapped rbd device disappeared, but since the snapshot itself still remained, it was deleted with rbd snap purge. The subsequent backup is done normally.

Although we will consider applying pvetest, it is difficult to do immediately.

Backups continued to be successfully acquired this weekend. While updating the firmware of the disk, continue to look at the situation.

Alwin · Aug 6, 2018

kawataso said:
The migration just after backup failed was successful. Snapshots existed even after the migration was complete. When R420 was restarted in order to solve the problem of vzdump, the mapped rbd device disappeared, but since the snapshot itself still remained, it was deleted with rbd snap purge. The subsequent backup is done normally.

If you do a reboot every time, to every host where it occurs, then the left over mapped image is out of the question.

kawataso said:
Although we will consider applying pvetest, it is difficult to do immediately.

The packages should land soon in the main repository.

kawataso said:
Backups continued to be successfully acquired this weekend. While updating the firmware of the disk, continue to look at the situation.

Hopefully this resolves the issue.

kawataso · Aug 10, 2018

A similar problem has occurred once, but new useful information could not be obtained.

After that, all servers have been updated to the latest version by now, Ceph has also been updated to 12.2.7-pve1. Since the firmware update of the planned HDD was also completed, I will continue to look at the situation.

Code:

proxmox-ve: 5.2-2 (running kernel: 4.15.18-1-pve)
pve-manager: 5.2-6 (running version: 5.2-6/bcd5f008)
pve-kernel-4.15: 5.2-4
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-1-pve: 4.15.18-17
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.16-3-pve: 4.13.16-50
pve-kernel-4.13.16-2-pve: 4.13.16-48
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.4.83-1-pve: 4.4.83-96
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.4.35-1-pve: 4.4.35-77
ceph: 12.2.7-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-37
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-9
libpve-storage-perl: 5.0-24
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-29
pve-container: 2.0-24
pve-docs: 5.2-5
pve-firewall: 3.0-13
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-30
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9

karnz · Oct 10, 2018

Also found almost same error here. but fixed with the patch from
https://bugzilla.proxmox.com/show_bug.cgi?id=1911
https://forum.proxmox.com/threads/lxc-backups-hang-via-nfs-and-cifs.46669/

Log.

Oct 11 00:48:13 pcs7 kernel: [71975.855598] print_req_error: I/O error, dev rbd1, sector 0
Oct 11 00:48:14 pcs7 kernel: [71976.938409] EXT4-fs error (device rbd1): htree_dirblock_to_tree:977: inode #1190583: comm rsync: Directory block failed checksum
Oct 11 00:48:14 pcs7 kernel: [71976.938720] rbd: rbd1: write 1000 at 0 (0)
Oct 11 00:48:14 pcs7 kernel: [71976.938738] Buffer I/O error on dev rbd1, logical block 0, lost sync page write
Oct 11 00:48:14 pcs7 kernel: [71976.939155] EXT4-fs (rbd1): previous I/O error to superblock detected
Oct 11 00:48:14 pcs7 kernel: [71976.939422] print_req_error: I/O error, dev rbd1, sector 0
Oct 11 00:48:14 pcs7 kernel: [71976.939792] EXT4-fs (rbd1): previous I/O error to superblock detected
Oct 11 00:48:14 pcs7 kernel: [71976.940107] Buffer I/O error on dev rbd1, logical block 0, lost sync page write
Oct 11 00:48:14 pcs7 kernel: [71976.940830] rbd: rbd1: write 1000 at 0 (0)
Oct 11 00:48:14 pcs7 kernel: [71976.941220] EXT4-fs warning (device rbd1): ext4_dirent_csum_verify:353: inode #1068358: comm rsync: No space for directory leaf checksum. Please run e2fsck -D.
Oct 11 00:48:14 pcs7 kernel: [71976.941578] rbd: rbd1: write 1000 at 0 (0)
Oct 11 00:48:14 pcs7 kernel: [71976.941611] Buffer I/O error on dev rbd1, logical block 0, lost sync page write
Oct 11 00:48:14 pcs7 kernel: [71976.941971] EXT4-fs (rbd1): previous I/O error to superblock detected
Oct 11 00:48:14 pcs7 kernel: [71976.943258] print_req_error: I/O error, dev rbd1, sector 0
Oct 11 00:48:14 pcs7 kernel: [71976.943631] EXT4-fs (rbd1): previous I/O error to superblock detected
Oct 11 00:48:14 pcs7 kernel: [71976.943957] Buffer I/O error on dev rbd1, logical block 0, lost sync page write
Oct 11 00:48:14 pcs7 kernel: [71976.944645] rbd: rbd1: write 1000 at 0 (0)
Oct 11 00:48:14 pcs7 kernel: [71976.945253] rbd: rbd1: write 1000 at 0 (0)
Oct 11 00:48:14 pcs7 kernel: [71976.946954] rbd: rbd1: write 1000 at 0 (0)
Oct 11 00:48:14 pcs7 kernel: [71976.947532] rbd: rbd1: result -2 xferred 1000
Oct 11 00:48:14 pcs7 kernel: [71976.948909] rbd: rbd1: result -2 xferred 1000
Oct 11 00:48:14 pcs7 kernel: [71976.953199] rbd: rbd1: result -2 xferred 1000
Oct 11 00:48:14 pcs7 kernel: [71977.736404] RIP: 0010:mark_buffer_dirty+0xaf/0xf0
Oct 11 00:48:14 pcs7 kernel: [71977.736407] RBP: ffffaab209783e30 R08: 000000009bcd84c6 R09: 0000000000000000
Oct 11 00:48:14 pcs7 kernel: [71977.736410] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 11 00:48:14 pcs7 kernel: [71977.736418] kmmpd+0x2c1/0x3f0
Oct 11 00:48:14 pcs7 kernel: [71977.736425] ? do_syscall_64+0x73/0x130
Oct 11 00:48:14 pcs7 kernel: [71977.736448] ---[ end trace a310bc28f0936e60 ]---
Oct 11 00:48:14 pcs7 kernel: [71977.737704] rbd: rbd1: result -2 xferred 1000
Oct 11 00:48:15 pcs7 pvedaemon[649876]: command '/usr/bin/rsync --stats -X -A --numeric-ids -aH --whole-file --sparse --one-file-system /var/lib/lxc/106/.copy-volume-2/ /var/lib/lxc/106/.copy-volume-1' failed: exit code 23

PVE version

proxmox-ve: 5.2-2 (running kernel: 4.15.18-7-pve)
pve-manager: 5.2-9 (running version: 5.2-9/4b30e8f9)
pve-kernel-4.15: 5.2-10
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-7-pve: 4.15.18-26
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.15.15-1-pve: 4.15.15-6
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.16-3-pve: 4.13.16-50
pve-kernel-4.13.16-2-pve: 4.13.16-48
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 12.2.8-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-40
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-28
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-36
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1

RobFantini · Oct 11, 2018

are any of the containers using a kernel older then 2.6.32 ?

we have an accounting system that uses debian etch . whenever we tried to run it using a container we'd have similar backup issues. I'll pass on more info if you are running an old kernel in the pct

sk4hrr · Oct 18, 2018

kawataso said:
A similar problem has occurred once, but new useful information could not be obtained.

Same for me, but if you wan't something, just ask!

Code:

Oct 17 23:15:01 pve3 CRON[113537]: (root) CMD (vzdump 89013 89017 89016 89001 89254 --mode snapshot --compress lzo --quiet 1 --mailnotification failure --storage local
--mailto maintenance@adami.fr)
Oct 17 23:15:01 pve3 pmxcfs[1611]: [status] notice: received log
Oct 17 23:15:02 pve3 vzdump[113538]: <root@pam> starting task UPID:pve3:0001BB83:001537D2:5BC7A656:vzdump::root@pam:
Oct 17 23:15:02 pve3 vzdump[113539]: INFO: starting new backup job: vzdump 89013 89017 89016 89001 89254 --mailto maintenance@adami.fr --storage local --quiet 1 --mode
snapshot --compress lzo --mailnotification failure
Oct 17 23:15:02 pve3 vzdump[113539]: INFO: Starting Backup of VM 89001 (lxc)
Oct 17 23:15:04 pve3 kernel: [13907.443703] EXT4-fs (rbd8): mounted filesystem without journal. Opts: noload
Oct 17 23:15:04 pve3 kernel: [13907.603899] Modules linked in: xt_multiport veth rbd libceph ip_set ip6table_filter ip6_tables xfs iptable_filter 8021q garp mrp softdog
 nfnetlink_log nfnetlink intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel gpio_ich ast kvm ttm drm_kms_helper irqbypass drm i2c_algo_bit crct
10dif_pclmul crc32_pclmul mxm_wmi ghash_clmulni_intel fb_sys_fops pcbc aesni_intel syscopyarea aes_x86_64 sysfillrect crypto_simd glue_helper sysimgblt ipmi_ssif cryptd
 joydev input_leds intel_cstate ioatdma snd_pcm snd_timer shpchp intel_rapl_perf intel_pch_thermal snd soundcore mei_me lpc_ich pcspkr mei mac_hid wmi ipmi_si ipmi_devi
ntf ipmi_msghandler acpi_pad vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core sunrpc iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables auto
fs4
Oct 17 23:15:04 pve3 kernel: [13907.604191] CPU: 3 PID: 108915 Comm: kworker/3:2 Not tainted 4.15.18-7-pve #1
Oct 17 23:15:04 pve3 kernel: [13907.604211] Hardware name: Supermicro Super Server/X10SDV-4C-TLN2F, BIOS 1.3 01/05/2018
Oct 17 23:15:04 pve3 kernel: [13907.604239] Workqueue: rbd rbd_queue_workfn [rbd]
Oct 17 23:15:04 pve3 kernel: [13907.604256] RIP: 0010:rbd_queue_workfn+0x462/0x4f0 [rbd]
Oct 17 23:15:04 pve3 kernel: [13907.604272] RSP: 0018:ffffaeaf8649fe18 EFLAGS: 00010286
Oct 17 23:15:04 pve3 kernel: [13907.604288] RAX: 0000000000000086 RBX: ffff92367915c000 RCX: 0000000000000006
Oct 17 23:15:04 pve3 kernel: [13907.604309] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff92393fcd6490
Oct 17 23:15:04 pve3 kernel: [13907.604329] RBP: ffffaeaf8649fe60 R08: 0000000000000000 R09: 000000000000046a
Oct 17 23:15:04 pve3 kernel: [13907.604392] FS:  0000000000000000(0000) GS:ffff92393fcc0000(0000) knlGS:0000000000000000
Oct 17 23:15:04 pve3 kernel: [13907.604506]  ? __schedule+0x3e8/0x870
Oct 17 23:15:04 pve3 kernel: [13907.608331]  ? do_syscall_64+0x73/0x130
Oct 17 23:15:34 pve3 pmxcfs[1611]: [status] notice: received log
Oct 17 23:16:00 pve3 systemd[1]: Starting Proxmox VE replication runner...
Oct 17 23:16:01 pve3 systemd[1]: Started Proxmox VE replication runner.
Oct 17 23:17:00 pve3 systemd[1]: Starting Proxmox VE replication runner...
Oct 17 23:17:00 pve3 systemd[1]: Started Proxmox VE replication runner.
Oct 17 23:17:01 pve3 CRON[114518]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Oct 17 23:18:00 pve3 systemd[1]: Starting Proxmox VE replication runner...
Oct 17 23:18:00 pve3 systemd[1]: Started Proxmox VE replication runner.
Oct 17 23:18:54 pve3 pmxcfs[1611]: [dcdb] notice: data verification successful

Code:

proxmox-ve: 5.2-2 (running kernel: 4.15.18-7-pve)
pve-manager: 5.2-9 (running version: 5.2-9/4b30e8f9)
pve-kernel-4.15: 5.2-10
pve-kernel-4.15.18-7-pve: 4.15.18-27
ceph: 12.2.8-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-40
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-28
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-36
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3

I noticed that if I started the backup manually, at another time, it worked. Could this be related to the fact that the backup is launched on all hosts? (too few io disk)

I have strange messages in dmesg:

Code:

[13907.603865] ------------[ cut here ]------------
[13907.603866] kernel BUG at drivers/block/rbd.c:4035!
[13907.603886] invalid opcode: 0000 [#1] SMP PTI
[13907.603899] Modules linked in: xt_multiport veth rbd libceph ip_set ip6table_filter ip6_tables xfs iptable_filter 8021q garp mrp softdog nfnetlink_log nfnetlink intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel gpio_ich ast kvm ttm drm_kms_helper irqbypass drm i2c_algo_bit crct10dif_pclmul crc32_pclmul mxm_wmi ghash_clmulni_intel fb_sys_fops pcbc aesni_intel syscopyarea aes_x86_64 sysfillrect crypto_simd glue_helper sysimgblt ipmi_ssif cryptd joydev input_leds intel_cstate ioatdma snd_pcm snd_timer shpchp intel_rapl_perf intel_pch_thermal snd soundcore mei_me lpc_ich pcspkr mei mac_hid wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core sunrpc iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4
[13907.604122]  raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear hid_generic usbkbd usbmouse usbhid hid raid1 ahci i2c_i801 libahci ixgbe dca ptp pps_core mdio
[13907.604191] CPU: 3 PID: 108915 Comm: kworker/3:2 Not tainted 4.15.18-7-pve #1
[13907.604211] Hardware name: Supermicro Super Server/X10SDV-4C-TLN2F, BIOS 1.3 01/05/2018
[13907.604239] Workqueue: rbd rbd_queue_workfn [rbd]
[13907.604256] RIP: 0010:rbd_queue_workfn+0x462/0x4f0 [rbd]
[13907.604272] RSP: 0018:ffffaeaf8649fe18 EFLAGS: 00010286
[13907.604288] RAX: 0000000000000086 RBX: ffff92367915c000 RCX: 0000000000000006
[13907.604309] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff92393fcd6490
[13907.604329] RBP: ffffaeaf8649fe60 R08: 0000000000000000 R09: 000000000000046a
[13907.604350] R10: 000000000000022f R11: 00000000ffffffff R12: ffff92386879bcc0
[13907.604370] R13: ffff9234721c8b80 R14: 0000000000000000 R15: 0000000000001000
[13907.604392] FS:  0000000000000000(0000) GS:ffff92393fcc0000(0000) knlGS:0000000000000000
[13907.604415] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13907.604432] CR2: 0000563cf91fd478 CR3: 000000074b80a003 CR4: 00000000003606e0
[13907.604453] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[13907.604473] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[13907.604494] Call Trace:
[13907.604506]  ? __schedule+0x3e8/0x870
[13907.604519]  process_one_work+0x1e0/0x400
[13907.605290]  worker_thread+0x4b/0x420
[13907.606066]  kthread+0x105/0x140
[13907.606827]  ? process_one_work+0x400/0x400
[13907.607580]  ? kthread_create_worker_on_cpu+0x70/0x70
[13907.608331]  ? do_syscall_64+0x73/0x130
[13907.609070]  ? SyS_exit_group+0x14/0x20
[13907.609837]  ret_from_fork+0x35/0x40
[13907.610574] Code: 00 48 83 78 20 fe 0f 84 6a fc ff ff 48 c7 c1 a8 e8 a1 c0 ba c3 0f 00 00 48 c7 c6 b0 fc a1 c0 48 c7 c7 90 dd a1 c0 e8 0e 57 ed c3 <0f> 0b 48 8b 75 d0 4d 89 d0 44 89 f1 4c 89 fa 48 89 df 4c 89 55
[13907.612155] RIP: rbd_queue_workfn+0x462/0x4f0 [rbd] RSP: ffffaeaf8649fe18
[13907.612962] ---[ end trace a8996e2c7a871c7d ]---
[15344.742259] audit: type=1400 audit(1539812341.814:84): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-89197_</var/lib/lxc>" name="/" pid=124161 comm="(ionclean)" flags="rw, rslave"
....
[53542.499472] rbd: rbd9: capacity 128849018880 features 0x1
[53542.559617] EXT4-fs (rbd9): write access unavailable, skipping orphan cleanup
[53542.560930] EXT4-fs (rbd9): mounted filesystem without journal. Opts: noload
[54130.445181] rbd: rbd9: capacity 5368709120 features 0x1
[54130.600445] rbd: rbd10: capacity 12884901888 features 0x1
[54130.677114] EXT4-fs (rbd9): write access unavailable, skipping orphan cleanup
[54130.678363] EXT4-fs (rbd9): mounted filesystem without journal. Opts: noload
[54130.733458] EXT4-fs (rbd10): mounted filesystem without journal. Opts: noload
[54293.050540] rbd: rbd9: capacity 5368709120 features 0x1
[54293.124291] EXT4-fs (rbd9): write access unavailable, skipping orphan cleanup
[54293.125501] EXT4-fs (rbd9): mounted filesystem without journal. Opts: noload
[54455.646285] rbd: rbd9: capacity 128849018880 features 0x1
[54455.703957] EXT4-fs (rbd9): write access unavailable, skipping orphan cleanup
[54455.705113] EXT4-fs (rbd9): mounted filesystem without journal. Opts: noload
[54944.461322] audit: type=1400 audit(1539851941.661:174): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-89197_</var/lib/lxc>" name="/" pid=429231 comm="(ionclean)" flags="rw, rslave"

RobFantini said:
are any of the containers using a kernel older then 2.6.32 ?

The containers use the host kernel, right?

RobFantini said:
we have an accounting system that uses debian etch

I think LXC allows you to run anything that has a Linux kernel, but I use the list of templates as a compatibility matrix.

sk4hrr · Oct 19, 2018

It's solved for me, I save in "suspend" mode.

Backup hangup with Ceph/rbd

New Member

Proxmox Retired Staff

New Member

Attachments

Proxmox Retired Staff

New Member

Proxmox Retired Staff

New Member

Proxmox Retired Staff

New Member

Proxmox Retired Staff

New Member

New Member

Proxmox Retired Staff

New Member

Proxmox Retired Staff

New Member

Renowned Member

Famous Member

New Member

New Member