I've had a number of stuck backup jobs for LXC containers that use a Ceph snapshot as the source. Have you noticed this too? It's been happening for a while:
Code:
2017-12-10 05:00:15.946415 osd.0 osd.0 10.1.1.10:6801/30151 14 : cluster [ERR] 1.3d missing primary copy of 1:bc3c5d99:::rbd_data.a6d574b0dc51.00000000000004f7:head, will try copies on 3,5
2017-12-11 05:04:40.047933 osd.0 osd.0 10.1.1.10:6804/2946 121 : cluster [ERR] 1.6d missing primary copy of 1:b665eacf:::rbd_data.8c342ae8944a.000000000000081a:head, will try copies on 3,5
2017-12-12 05:03:26.166045 osd.0 osd.0 10.1.1.10:6804/2946 238 : cluster [ERR] 1.f1 missing primary copy of 1:8f0ff2c2:::rbd_data.8c342ae8944a.0000000000000628:head, will try copies on 2,5
2017-12-23 05:10:58.722254 osd.2 osd.2 10.1.1.9:6804/2182 383 : cluster [ERR] 3.3c missing primary copy of 3:3d470d6b:::rbd_data.a6239643c9869.00000000000001ab:head, will try copies on 0,4
2018-01-03 05:08:28.078099 osd.3 osd.3 10.1.1.9:6804/2177 105 : cluster [ERR] 3.14 missing primary copy of 3:2812d53a:::rbd_data.a6239643c9869.00000000000006ba:head, will try copies on 0,4
2018-01-04 05:10:02.519346 osd.2 osd.2 10.1.1.9:6804/2245 45 : cluster [ERR] 3.5 missing primary copy of 3:a17478e7:::rbd_data.a6239643c9869.000000000000026a:head, will try copies on 0,5
2018-01-06 05:02:28.087476 osd.1 osd.1 10.1.1.10:6804/3394 37 : cluster [ERR] 3.b missing primary copy of 3:d1c318cd:::rbd_data.a41d774b0dc51.0000000000000519:head, will try copies on 2,4
2018-01-06 05:08:36.092738 osd.0 osd.0 10.1.1.10:6800/3249 65 : cluster [ERR] 3.45 missing primary copy of 3:a36530a0:::rbd_data.a41d774b0dc51.0000000000000cbe:head, will try copies on 3,5
2018-01-07 05:01:35.730387 osd.1 osd.1 10.1.1.10:6804/3394 66 : cluster [ERR] 3.4 missing primary copy of 3:20c0df81:::rbd_data.a6239643c9869.0000000000000834:head, will try copies on 2,5
2018-01-07 15:16:56.002832 osd.1 osd.1 10.1.1.10:6804/3356 5 : cluster [ERR] 3.65 missing primary copy of 3:a657a9ec:::rbd_data.a6239643c9869.0000000000000167:head, will try copies on 2,4
2018-01-08 05:02:30.962180 osd.0 osd.0 10.1.1.10:6800/3033 31 : cluster [ERR] 3.17 missing primary copy of 3:e9e1edfd:::rbd_data.a6239643c9869.00000000000006a4:head, will try copies on 3,5
2018-01-18 05:10:33.630255 osd.3 osd.3 10.1.1.9:6804/2172 47 : cluster [ERR] 3.39 missing primary copy of 3:9ccaaf7f:::rbd_data.a6239643c9869.000000000000062c:head, will try copies on 0,4
2018-01-18 18:59:53.061005 osd.3 osd.3 10.1.1.9:6804/2202 23 : cluster [ERR] 3.71 missing primary copy of 3:8ff57dec:::rbd_data.a6239643c9869.00000000000003e4:head, will try copies on 1,5
2018-01-18 19:22:27.504886 osd.2 osd.2 10.1.1.9:6800/2030 1 : cluster [ERR] 3.5d missing primary copy of 3:bb371d0a:::rbd_data.a6239643c9869.0000000000000275:head, will try copies on 0,4
2018-01-18 19:48:34.393222 osd.0 osd.0 10.1.1.10:6800/5528 15 : cluster [ERR] 3.74 missing primary copy of 3:2e9b9ab2:::rbd_data.a6239643c9869.00000000000009c3:head, will try copies on 3,4
2018-01-18 20:09:55.249787 osd.0 osd.0 10.1.1.10:6800/3006 1 : cluster [ERR] 3.5a missing primary copy of 3:5ad483d1:::rbd_data.a6239643c9869.000000000000005b:head, will try copies on 3,5
2018-01-18 21:08:21.173157 osd.2 osd.2 10.1.1.9:6800/2020 17 : cluster [ERR] 3.4b missing primary copy of 3:d239927e:::rbd_data.a6239643c9869.00000000000004ad:head, will try copies on 1,5
2018-01-18 22:43:41.818465 osd.0 osd.0 10.1.1.10:6800/2984 17 : cluster [ERR] 3.64 missing primary copy of 3:2789172e:::rbd_data.a6239643c9869.00000000000007a5:head, will try copies on 2,4
2018-01-18 22:56:42.256424 osd.4 osd.4 10.1.1.2:6804/15194 83 : cluster [ERR] 3.60 missing primary copy of 3:064cef74:::rbd_data.a6239643c9869.0000000000000294:head, will try copies on 1,3
2018-01-18 23:24:47.344713 osd.0 osd.0 10.1.1.10:6800/2980 1 : cluster [ERR] 3.23 missing primary copy of 3:c484296e:::rbd_data.a6239643c9869.0000000000000515:head, will try copies on 2,5
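To see which PG (and which OSDs) one of these objects lives on, and to re-scrub it, I use something like the following; a sketch, with the pool name and object name taken from the rbd path and the log lines above:
Code:
# map one of the failing objects to its placement group and acting OSD set
# (the pool is called "ceph" here, matching the rbd path used further down)
ceph osd map ceph rbd_data.a6239643c9869.00000000000004ad
# -> pg 3.4b on acting set [2,1,5], per the log lines above

# re-run a deep scrub on that PG and watch the cluster log for the result
ceph pg deep-scrub 3.4b
ceph -w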
While it looks like disk errors, and this home cluster does have a bunch of old junk computers in it, the failures follow the OSDs where the backup is running, and `ceph pg deep-scrub <pg>` comes back clean. Also, the Ceph error is recoverable, while the hang comes from the mounted snapshot's /dev/rbd1 device:
Code:
[ 633.220900] libceph: get_reply osd0 tid 118337 data 2256896 > preallocated 131072, skipping
[ 5254.083541] libceph: get_reply osd2 tid 95813 data 1851392 > preallocated 131072, skipping
[ 8153.398333] libceph: get_reply osd0 tid 59025 data 946176 > preallocated 131072, skipping
[186162.844452] libceph: get_reply osd4 tid 111482 data 2768896 > preallocated 131072, skipping
[ 1903.885819] libceph: get_reply osd0 tid 75276 data 4079616 > preallocated 131072, skipping
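For context, my understanding of what a snapshot-mode container backup does for a volume on krbd-backed RBD is roughly the following (simplified; the mountpoint is illustrative, but the vzdump snapshot name and the /dev/rbd1 device match what I see during a backup):
Code:
rbd snap create ceph/vm-100-disk-1@vzdump    # point-in-time copy of the volume
rbd map ceph/vm-100-disk-1@vzdump            # kernel client maps it, e.g. as /dev/rbd1
mount -o ro /dev/rbd1 /mnt/vzdump-snap       # mount the snapshot read-only
# ... archive the mounted filesystem ...
# it is the reads from /dev/rbd1 in this step that hang with the libceph errors above
umount /mnt/vzdump-snap
rbd unmap /dev/rbd1
rbd snap rm ceph/vm-100-disk-1@vzdump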
That /dev/rbd1 device ends up broken enough that a reboot is required; even removing the snapshot through pct fails:
Code:
% pct delsnapshot 100 vzdump
rbd: sysfs write failed
can't unmap rbd volume vm-100-disk-1: rbd: sysfs write failed
# but removing it by hand works:
rbd snap rm ceph/vm-100-disk-1@vzdump
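When it gets into this state, the cleanup I can still try by hand before giving up and rebooting looks roughly like this; a sketch, and the unmap is the step that actually fails (the same `rbd: sysfs write failed` as above), so I have not verified that a forced unmap helps once reads on the device are already hung:
Code:
# list the currently mapped rbd devices and the image/snapshot behind each one
rbd showmapped
# try to unmap the stuck snapshot device (the device number is from my setup)
rbd unmap /dev/rbd1
# newer kernel rbd clients also accept a forced unmap, which I have not verified here
rbd unmap -o force /dev/rbd1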
I've changed the backup mode from snapshot to stop and it's working great! The full OSD logs with the scrubs are below, but I think this is more of a snapshot /dev/rbd1 mounting bug than anything else. If it isn't just me, it's not really a Proxmox bug, but at this point I'm not sure where I should file it.
Code:
[ 633.220900] libceph: get_reply osd0 tid 118337 data 2256896 > preallocated 131072, skipping
2018-01-18 20:09:55.249787 osd.0 osd.0 10.1.1.10:6800/3006 1 : cluster [ERR] 3.5a missing primary copy of 3:5ad483d1:::rbd_data.a6239643c9869.000000000000005b:head, will try copies on 3,5
[ 5254.083541] libceph: get_reply osd2 tid 95813 data 1851392 > preallocated 131072, skipping
2018-01-18 21:08:21.169251 7f07e404f700 -1 bluestore(/var/lib/ceph/osd/ceph-2) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x3d000, got 0x6706be76, expected 0x1e054be5, device location [0x6fdead000~1000], logical extent 0x23d000~1000, object #3:d239927e:::rbd_data.a6239643c9869.00000000000004ad:head#
2018-01-18 21:08:21.173148 7f07e404f700 -1 log_channel(cluster) log [ERR] : 3.4b missing primary copy of 3:d239927e:::rbd_data.a6239643c9869.00000000000004ad:head, will try copies on 1,5
2018-01-18 21:31:12.940846 7f5bf4b68700 0 log_channel(cluster) log [DBG] : 3.4b deep-scrub starts
2018-01-18 21:31:14.284975 7f5bf4b68700 -1 osd.2 pg_epoch: 1175 pg[3.4b( v 1162'47442 (1082'45914,1162'47442] local-lis/les=1174/1175 n=90 ec=296/296 lis/c 1174/1174 les/c/f 1175/1175/0 1174/1174/1174) [2,1,5] r=0 lpr=1174 crt=1162'47442 lcod 0'0 mlcod 0'0 active+clean+scrubbing+deep] _scan_snaps no head for 3:d269f5bc:::rbd_data.a6239643c9869.0000000000002060:237 (have MIN)
2018-01-18 21:31:14.285000 7f5bf4b68700 -1 osd.2 pg_epoch: 1175 pg[3.4b( v 1162'47442 (1082'45914,1162'47442] local-lis/les=1174/1175 n=90 ec=296/296 lis/c 1174/1174 les/c/f 1175/1175/0 1174/1174/1174) [2,1,5] r=0 lpr=1174 crt=1162'47442 lcod 0'0 mlcod 0'0 active+clean+scrubbing+deep] _scan_snaps no head for 3:d269f5bc:::rbd_data.a6239643c9869.0000000000002060:226 (have MIN)
2018-01-18 21:31:14.285008 7f5bf4b68700 -1 osd.2 pg_epoch: 1175 pg[3.4b( v 1162'47442 (1082'45914,1162'47442] local-lis/les=1174/1175 n=90 ec=296/296 lis/c 1174/1174 les/c/f 1175/1175/0 1174/1174/1174) [2,1,5] r=0 lpr=1174 crt=1162'47442 lcod 0'0 mlcod 0'0 active+clean+scrubbing+deep] _scan_snaps no head for 3:d269f5bc:::rbd_data.a6239643c9869.0000000000002060:210 (have MIN)
2018-01-18 21:31:15.878975 7f5bf4b68700 -1 osd.2 pg_epoch: 1175 pg[3.4b( v 1175'47444 (1082'45914,1175'47444] local-lis/les=1174/1175 n=90 ec=296/296 lis/c 1174/1174 les/c/f 1175/1175/0 1174/1174/1174) [2,1,5] r=0 lpr=1174 crt=1175'47444 lcod 1175'47443 mlcod 1175'47443 active+clean+scrubbing+deep] _scan_snaps no head for 3:d2ce786c:::rbd_data.a6239643c9869.000000000000207f:28b (have MIN)
2018-01-18 21:31:15.879002 7f5bf4b68700 -1 osd.2 pg_epoch: 1175 pg[3.4b( v 1175'47444 (1082'45914,1175'47444] local-lis/les=1174/1175 n=90 ec=296/296 lis/c 1174/1174 les/c/f 1175/1175/0 1174/1174/1174) [2,1,5] r=0 lpr=1174 crt=1175'47444 lcod 1175'47443 mlcod 1175'47443 active+clean+scrubbing+deep] _scan_snaps no head for 3:d2ce786c:::rbd_data.a6239643c9869.000000000000207f:27f (have MIN)
2018-01-18 21:31:15.879010 7f5bf4b68700 -1 osd.2 pg_epoch: 1175 pg[3.4b( v 1175'47444 (1082'45914,1175'47444] local-lis/les=1174/1175 n=90 ec=296/296 lis/c 1174/1174 les/c/f 1175/1175/0 1174/1174/1174) [2,1,5] r=0 lpr=1174 crt=1175'47444 lcod 1175'47443 mlcod 1175'47443 active+clean+scrubbing+deep] _scan_snaps no head for 3:d2ce786c:::rbd_data.a6239643c9869.000000000000207f:279 (have MIN)
2018-01-18 21:31:15.879017 7f5bf4b68700 -1 osd.2 pg_epoch: 1175 pg[3.4b( v 1175'47444 (1082'45914,1175'47444] local-lis/les=1174/1175 n=90 ec=296/296 lis/c 1174/1174 les/c/f 1175/1175/0 1174/1174/1174) [2,1,5] r=0 lpr=1174 crt=1175'47444 lcod 1175'47443 mlcod 1175'47443 active+clean+scrubbing+deep] _scan_snaps no head for 3:d2ce786c:::rbd_data.a6239643c9869.000000000000207f:267 (have MIN)
2018-01-18 21:31:15.879024 7f5bf4b68700 -1 osd.2 pg_epoch: 1175 pg[3.4b( v 1175'47444 (1082'45914,1175'47444] local-lis/les=1174/1175 n=90 ec=296/296 lis/c 1174/1174 les/c/f 1175/1175/0 1174/1174/1174) [2,1,5] r=0 lpr=1174 crt=1175'47444 lcod 1175'47443 mlcod 1175'47443 active+clean+scrubbing+deep] _scan_snaps no head for 3:d2ce786c:::rbd_data.a6239643c9869.000000000000207f:251 (have MIN)
2018-01-18 21:31:15.879030 7f5bf4b68700 -1 osd.2 pg_epoch: 1175 pg[3.4b( v 1175'47444 (1082'45914,1175'47444] local-lis/les=1174/1175 n=90 ec=296/296 lis/c 1174/1174 les/c/f 1175/1175/0 1174/1174/1174) [2,1,5] r=0 lpr=1174 crt=1175'47444 lcod 1175'47443 mlcod 1175'47443 active+clean+scrubbing+deep] _scan_snaps no head for 3:d2ce786c:::rbd_data.a6239643c9869.000000000000207f:23b (have MIN)
2018-01-18 21:31:15.879036 7f5bf4b68700 -1 osd.2 pg_epoch: 1175 pg[3.4b( v 1175'47444 (1082'45914,1175'47444] local-lis/les=1174/1175 n=90 ec=296/296 lis/c 1174/1174 les/c/f 1175/1175/0 1174/1174/1174) [2,1,5] r=0 lpr=1174 crt=1175'47444 lcod 1175'47443 mlcod 1175'47443 active+clean+scrubbing+deep] _scan_snaps no head for 3:d2ce786c:::rbd_data.a6239643c9869.000000000000207f:223 (have MIN)
2018-01-18 21:31:15.879043 7f5bf4b68700 -1 osd.2 pg_epoch: 1175 pg[3.4b( v 1175'47444 (1082'45914,1175'47444] local-lis/les=1174/1175 n=90 ec=296/296 lis/c 1174/1174 les/c/f 1175/1175/0 1174/1174/1174) [2,1,5] r=0 lpr=1174 crt=1175'47444 lcod 1175'47443 mlcod 1175'47443 active+clean+scrubbing+deep] _scan_snaps no head for 3:d2ce786c:::rbd_data.a6239643c9869.000000000000207f:20d (have MIN)
2018-01-18 21:31:19.885360 7f5bf4b68700 0 log_channel(cluster) log [DBG] : 3.4b deep-scrub ok
[ 8153.398333] libceph: get_reply osd0 tid 59025 data 946176 > preallocated 131072, skipping
2018-01-18 22:43:41.789123 7f05e97ba700 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x35000, got 0x6706be76, expected 0x3247a483, device location [0x7bca35000~1000], logical extent 0x335000~1000, object #3:2789172e:::rbd_data.a6239643c9869.00000000000007a5:head#
2018-01-18 22:43:41.818465 osd.0 osd.0 10.1.1.10:6800/2984 17 : cluster [ERR] 3.64 missing primary copy of 3:2789172e:::rbd_data.a6239643c9869.00000000000007a5:head, will try copies on 2,4
2018-01-18 22:46:15.017339 7f05e97ba700 0 log_channel(cluster) log [DBG] : 3.64 deep-scrub starts
2018-01-18 22:46:19.295938 7f05e97ba700 0 log_channel(cluster) log [DBG] : 3.64 deep-scrub ok
[186162.844452] libceph: get_reply osd4 tid 111482 data 2768896 > preallocated 131072, skipping
2018-01-18 22:56:42.245043 7f08153b9700 -1 bluestore(/var/lib/ceph/osd/ceph-4) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x7b000, got 0x6706be76, expected 0xb619ce69, device location [0x70e07b000~1000], logical extent 0x17b000~1000, object #3:064cef74:::rbd_data.a6239643c9869.0000000000000294:head#
2018-01-18 22:56:42.256424 osd.4 osd.4 10.1.1.2:6804/15194 83 : cluster [ERR] 3.60 missing primary copy of 3:064cef74:::rbd_data.a6239643c9869.0000000000000294:head, will try copies on 1,3
2018-01-18 22:59:13.061571 7f08153b9700 0 log_channel(cluster) log [DBG] : 3.60 deep-scrub starts
2018-01-18 22:59:18.362073 7f08153b9700 0 log_channel(cluster) log [DBG] : 3.60 deep-scrub ok
[ 1903.885819] libceph: get_reply osd0 tid 75276 data 4079616 > preallocated 131072, skipping
2018-01-18 23:24:47.082488 7fb6c8524700 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x1d000, got 0x6706be76, expected 0x210a620a, device location [0x761fed000~1000], logical extent 0x1d000~1000, object #3:c484296e:::rbd_data.a6239643c9869.0000000000000515:head#
2018-01-18 23:24:47.344713 osd.0 osd.0 10.1.1.10:6800/2980 1 : cluster [ERR] 3.23 missing primary copy of 3:c484296e:::rbd_data.a6239643c9869.0000000000000515:head, will try copies on 2,5
2018-01-18 23:27:53.695821 7fb6c8524700 0 log_channel(cluster) log [DBG] : 3.23 deep-scrub starts
2018-01-18 23:27:58.363954 7fb6c8524700 0 log_channel(cluster) log [DBG] : 3.23 deep-scrub ok
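For reference, the workaround on my side is just the backup mode change; a sketch of the two ways to do it (the VMID and the storage name "backup" are from my setup / illustrative):
Code:
# one-off backup of the container in stop mode instead of snapshot mode
vzdump 100 --mode stop --storage backup

# or make stop the default mode for scheduled jobs in /etc/vzdump.conf:
# mode: stop
The scheduled job itself can also be switched to stop mode in the datacenter backup settings.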
Code:
root@MacPro:~# ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS TYPE NAME
-1 5.45755 - 5588G 105G 5483G 1.88 1.00 - root default
-5 1.81918 - 1862G 35964M 1827G 1.89 1.00 - host MacPro
2 hdd 0.90959 1.00000 931G 21076M 910G 2.21 1.17 75 osd.2
3 hdd 0.90959 1.00000 931G 14888M 916G 1.56 0.83 53 osd.3
-3 1.81918 - 1862G 35884M 1827G 1.88 1.00 - host Super
0 hdd 0.90959 1.00000 931G 21026M 910G 2.20 1.17 75 osd.0
1 hdd 0.90959 1.00000 931G 14857M 916G 1.56 0.83 53 osd.1
-7 1.81918 - 1862G 35864M 1827G 1.88 1.00 - host nas
4 hdd 0.90959 1.00000 931G 16653M 915G 1.75 0.93 60 osd.4
5 hdd 0.90959 1.00000 931G 19210M 912G 2.01 1.07 68 osd.5
TOTAL 5588G 105G 5483G 1.88
MIN/MAX VAR: 0.83/1.17 STDDEV: 0.28
root@MacPro:/etc/pve/rob# pveversion -v
proxmox-ve: 5.1-36 (running kernel: 4.13.13-5-pve)
pve-manager: 5.1-42 (running version: 5.1-42/724a6cb3)
pve-kernel-4.13.13-5-pve: 4.13.13-36
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-19
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-16
pve-qemu-kvm: 2.9.1-6
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
ceph: 12.2.2-pve1