Backup slowing down to a crawl

sseidel

Renowned Member
Jul 8, 2015
49
7
73
I have this problem with a backup one one VM: the backup starts "ok" (not as fast as it could, but reasonable speed) but at some point slows down to almost no progress. Here are two logs:

Code:
INFO: Starting Backup of VM 111 (qemu)
INFO: status = running
INFO: update VM 111: -lock backup
INFO: include disk 'scsi0' 'rbd_ct:vm-111-disk-1' 140G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/backup01/dump/vzdump-qemu-111-2018_06_23-21_30_02.vma.gz'
INFO: started backup task 'e1795f14-3f61-4c01-85d8-4ff1d371579c'
INFO: status: 0% (160432128/150323855360), sparse 0% (7118848), duration 3, read/write 53/51 MB/s
INFO: status: 1% (1543241728/150323855360), sparse 0% (204550144), duration 34, read/write 44/38 MB/s
INFO: status: 2% (3028484096/150323855360), sparse 0% (527478784), duration 68, read/write 43/34 MB/s
INFO: status: 3% (4515430400/150323855360), sparse 0% (639668224), duration 105, read/write 40/37 MB/s
INFO: status: 4% (6023806976/150323855360), sparse 0% (655355904), duration 152, read/write 32/31 MB/s
INFO: status: 5% (7571767296/150323855360), sparse 0% (813867008), duration 189, read/write 41/37 MB/s
INFO: status: 6% (9069592576/150323855360), sparse 0% (945946624), duration 224, read/write 42/39 MB/s
INFO: status: 7% (10555293696/150323855360), sparse 0% (950132736), duration 253, read/write 51/51 MB/s
INFO: status: 8% (12089229312/150323855360), sparse 0% (1085124608), duration 292, read/write 39/35 MB/s
INFO: status: 9% (13573160960/150323855360), sparse 0% (1259769856), duration 334, read/write 35/31 MB/s
INFO: status: 10% (15121907712/150323855360), sparse 0% (1383661568), duration 380, read/write 33/30 MB/s
INFO: status: 11% (16557539328/150323855360), sparse 0% (1411604480), duration 425, read/write 31/31 MB/s
INFO: status: 12% (18057658368/150323855360), sparse 1% (1567252480), duration 480, read/write 27/24 MB/s
INFO: status: 13% (19542573056/150323855360), sparse 1% (1711927296), duration 530, read/write 29/26 MB/s
INFO: status: 14% (21064908800/150323855360), sparse 1% (1724657664), duration 562, read/write 47/47 MB/s
INFO: status: 15% (22555197440/150323855360), sparse 1% (1872805888), duration 622, read/write 24/22 MB/s
INFO: status: 16% (24084348928/150323855360), sparse 1% (1996996608), duration 674, read/write 29/27 MB/s
INFO: status: 17% (25559367680/150323855360), sparse 1% (2018693120), duration 3169, read/write 0/0 MB/s
INFO: status: 18% (27088584704/150323855360), sparse 1% (2235613184), duration 3409, read/write 6/5 MB/s
INFO: status: 19% (28577103872/150323855360), sparse 1% (2392182784), duration 3439, read/write 49/44 MB/s
INFO: status: 20% (30070669312/150323855360), sparse 1% (2542043136), duration 3464, read/write 59/53 MB/s
INFO: status: 21% (31578128384/150323855360), sparse 1% (2701664256), duration 3495, read/write 48/43 MB/s
INFO: status: 22% (33075363840/150323855360), sparse 1% (2715508736), duration 3538, read/write 34/34 MB/s
INFO: status: 23% (34611200000/150323855360), sparse 1% (2863001600), duration 3568, read/write 51/46 MB/s
INFO: status: 24% (36127047680/150323855360), sparse 2% (3041021952), duration 3595, read/write 56/49 MB/s
INFO: status: 25% (37624610816/150323855360), sparse 2% (3100639232), duration 3628, read/write 45/43 MB/s
INFO: status: 26% (39113195520/150323855360), sparse 2% (3423080448), duration 3653, read/write 59/46 MB/s
INFO: status: 27% (40614952960/150323855360), sparse 2% (3566981120), duration 3684, read/write 48/43 MB/s
INFO: status: 28% (42146136064/150323855360), sparse 2% (3687231488), duration 3714, read/write 51/47 MB/s
INFO: status: 29% (43623120896/150323855360), sparse 2% (3714830336), duration 3745, read/write 47/46 MB/s
INFO: status: 30% (45119176704/150323855360), sparse 2% (3849728000), duration 3772, read/write 55/50 MB/s
INFO: status: 31% (46631878656/150323855360), sparse 2% (3980304384), duration 3808, read/write 42/38 MB/s
INFO: status: 32% (48129703936/150323855360), sparse 2% (4001726464), duration 3834, read/write 57/56 MB/s
INFO: status: 33% (49622089728/150323855360), sparse 2% (4138037248), duration 3870, read/write 41/37 MB/s
INFO: status: 34% (51130269696/150323855360), sparse 2% (4212781056), duration 3899, read/write 52/49 MB/s
INFO: status: 35% (52621737984/150323855360), sparse 2% (4348125184), duration 3935, read/write 41/37 MB/s
INFO: status: 36% (54156656640/150323855360), sparse 2% (4509302784), duration 3965, read/write 51/45 MB/s
INFO: status: 37% (55641440256/150323855360), sparse 3% (4591120384), duration 3997, read/write 46/43 MB/s
INFO: status: 38% (57130156032/150323855360), sparse 3% (4816203776), duration 4027, read/write 49/42 MB/s
INFO: status: 39% (58682245120/150323855360), sparse 3% (4968542208), duration 4056, read/write 53/48 MB/s
INFO: status: 40% (60205367296/150323855360), sparse 3% (5023625216), duration 4087, read/write 49/47 MB/s
INFO: status: 41% (61672587264/150323855360), sparse 3% (5185970176), duration 4116, read/write 50/44 MB/s
INFO: status: 42% (63137841152/150323855360), sparse 3% (5373067264), duration 4143, read/write 54/47 MB/s
INFO: status: 43% (64650412032/150323855360), sparse 3% (5526986752), duration 4179, read/write 42/37 MB/s
INFO: status: 44% (66161737728/150323855360), sparse 3% (5537914880), duration 4210, read/write 48/48 MB/s
INFO: status: 45% (67707994112/150323855360), sparse 3% (5670223872), duration 4243, read/write 46/42 MB/s
INFO: status: 46% (69182095360/150323855360), sparse 3% (5804797952), duration 4290, read/write 31/28 MB/s
INFO: status: 47% (70708887552/150323855360), sparse 3% (5817257984), duration 4319, read/write 52/52 MB/s
INFO: status: 48% (72157298688/150323855360), sparse 3% (5940146176), duration 4349, read/write 48/44 MB/s
INFO: status: 49% (73661284352/150323855360), sparse 4% (6074302464), duration 4377, read/write 53/48 MB/s
INFO: status: 50% (75180015616/150323855360), sparse 4% (6164926464), duration 4403, read/write 58/54 MB/s
INFO: status: 51% (76709560320/150323855360), sparse 4% (6395613184), duration 4436, read/write 46/39 MB/s
INFO: status: 52% (78199914496/150323855360), sparse 4% (6560391168), duration 4469, read/write 45/40 MB/s
INFO: status: 53% (79700033536/150323855360), sparse 4% (6689472512), duration 4499, read/write 50/45 MB/s
INFO: status: 54% (81214046208/150323855360), sparse 4% (6704939008), duration 5162, read/write 2/2 MB/s
INFO: status: 55% (82687688704/150323855360), sparse 4% (6818947072), duration 5189, read/write 54/50 MB/s
INFO: status: 56% (84224507904/150323855360), sparse 4% (6971334656), duration 5220, read/write 49/44 MB/s
INFO: status: 57% (85702606848/150323855360), sparse 4% (7118102528), duration 5251, read/write 47/42 MB/s
INFO: status: 58% (87223631872/150323855360), sparse 4% (7305052160), duration 5280, read/write 52/46 MB/s
INFO: status: 59% (88717918208/150323855360), sparse 4% (7443169280), duration 5313, read/write 45/41 MB/s
INFO: status: 60% (90210435072/150323855360), sparse 4% (7454748672), duration 11384, read/write 0/0 MB/s
INFO: status: 61% (91735457792/150323855360), sparse 4% (7471779840), duration 11707, read/write 4/4 MB/s
INFO: status: 62% (93207396352/150323855360), sparse 4% (7490129920), duration 12014, read/write 4/4 MB/s
INFO: status: 63% (94771609600/150323855360), sparse 4% (7505657856), duration 12334, read/write 4/4 MB/s
INFO: status: 64% (96271925248/150323855360), sparse 5% (7545704448), duration 12624, read/write 5/5 MB/s
INFO: status: 65% (97735147520/150323855360), sparse 5% (7562711040), duration 12914, read/write 5/4 MB/s
INFO: status: 66% (99223470080/150323855360), sparse 5% (7588704256), duration 17714, read/write 0/0 MB/s
INFO: status: 67% (100719984640/150323855360), sparse 5% (7613255680), duration 105541, read/write 0/0 MB/s
INFO: status: 68% (102227705856/150323855360), sparse 5% (7629295616), duration 105770, read/write 6/6 MB/s
INFO: status: 69% (103729004544/150323855360), sparse 5% (7629561856), duration 106032, read/write 5/5 MB/s
INFO: status: 70% (105229123584/150323855360), sparse 5% (7632371712), duration 106412, read/write 3/3 MB/s
INFO: status: 71% (106735927296/150323855360), sparse 5% (7811944448), duration 106672, read/write 5/5 MB/s
INFO: status: 72% (108233687040/150323855360), sparse 5% (7822983168), duration 106946, read/write 5/5 MB/s
INFO: status: 73% (109748355072/150323855360), sparse 5% (7836348416), duration 107162, read/write 7/6 MB/s
INFO: status: 74% (111242838016/150323855360), sparse 5% (7843069952), duration 107383, read/write 6/6 MB/s
INFO: status: 75% (112789225472/150323855360), sparse 5% (7914864640), duration 107543, read/write 9/9 MB/s
INFO: status: 76% (114267979776/150323855360), sparse 5% (8054267904), duration 107761, read/write 6/6 MB/s
INFO: status: 77% (115761676288/150323855360), sparse 5% (8169975808), duration 124941, read/write 0/0 MB/s
INFO: status: 78% (117255962624/150323855360), sparse 5% (8304369664), duration 124979, read/write 39/35 MB/s
INFO: status: 79% (118781509632/150323855360), sparse 5% (8423411712), duration 125018, read/write 39/36 MB/s
INFO: status: 80% (120286019584/150323855360), sparse 5% (8544096256), duration 125057, read/write 38/35 MB/s

Code:
INFO: Starting Backup of VM 111 (qemu)
INFO: status = running
INFO: update VM 111: -lock backup
INFO: include disk 'scsi0' 'rbd_ct:vm-111-disk-1' 140G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/backup01/dump/vzdump-qemu-111-2018_06_25-21_30_02.vma.gz'
INFO: started backup task '40e230ba-c332-4f45-95d8-25fde55f3210'
INFO: status: 0% (133038080/150323855360), sparse 0% (7147520), duration 3, read/write 44/41 MB/s
INFO: status: 1% (1535639552/150323855360), sparse 0% (204435456), duration 29, read/write 53/46 MB/s
INFO: status: 2% (3046703104/150323855360), sparse 0% (528662528), duration 64, read/write 43/33 MB/s
INFO: status: 3% (4556390400/150323855360), sparse 0% (642711552), duration 122, read/write 26/24 MB/s
INFO: status: 4% (6028394496/150323855360), sparse 0% (658427904), duration 171, read/write 30/29 MB/s
INFO: status: 5% (7574847488/150323855360), sparse 0% (816939008), duration 210, read/write 39/35 MB/s
INFO: status: 6% (9019916288/150323855360), sparse 0% (835088384), duration 3481, read/write 0/0 MB/s
INFO: status: 7% (10540482560/150323855360), sparse 0% (844193792), duration 4734, read/write 1/1 MB/s
INFO: status: 8% (12067667968/150323855360), sparse 0% (980254720), duration 10253, read/write 0/0 MB/s
INFO: status: 9% (13531021312/150323855360), sparse 0% (990515200), duration 18648, read/write 0/0 MB/s
INFO: status: 10% (15044706304/150323855360), sparse 0% (1018007552), duration 19170, read/write 2/2 MB/s
INFO: status: 11% (16538664960/150323855360), sparse 0% (1031659520), duration 19469, read/write 4/4 MB/s
INFO: status: 12% (18044354560/150323855360), sparse 0% (1055346688), duration 21679, read/write 0/0 MB/s
INFO: status: 13% (19545063424/150323855360), sparse 0% (1078284288), duration 22382, read/write 2/2 MB/s
INFO: status: 14% (21065629696/150323855360), sparse 0% (1090793472), duration 23589, read/write 1/1 MB/s
INFO: status: 15% (22583640064/150323855360), sparse 0% (1100865536), duration 23860, read/write 5/5 MB/s
INFO: status: 16% (24122556416/150323855360), sparse 0% (1113579520), duration 28958, read/write 0/0 MB/s
INFO: status: 17% (25558908928/150323855360), sparse 0% (1129041920), duration 29286, read/write 4/4 MB/s
INFO: status: 18% (27067154432/150323855360), sparse 0% (1141547008), duration 31973, read/write 0/0 MB/s
INFO: status: 19% (28564258816/150323855360), sparse 0% (1161191424), duration 32290, read/write 4/4 MB/s

As you can see, there are sometimes "gaps" like in Log 1 between 59% and 60% it took more than 6000 seconds. When the backup is in this state (slow), the system is completely idle.
No CPU activity.
iostat -kx 2 shows almost no disk activity.
Network is mostly idle.
When I log in to the VM and do a dd from the disk, it reads with >100 MB/s.
When I do a dd on the RBD device on the host (we use Ceph with KRBD), I get >100MB/s.
When I write to the backup storage (GlusterFS) I get > 100 MB/s, also sustained for a long time.

What else can I test/change? Other VMs (also similar in size) do not show this behaviour. This VM backed up until last week with no problem. I did have this problem before on other VMs, too. It's quite random.

Thanks,

Stefan
 
What does the whole backup log show?
 
What exactly are you missing? The first backup got cancelled by some external circumstance, the other backup is still "running". I think when it eventually finishes, the final lines are just the normal ones that appear after every backup.
 
I think when it eventually finishes, the final lines are just the normal ones that appear after every backup.
These would be interesting.
 
Well after one day and only 10% more progress, I cancelled the backup. I rebooted the VM, took another backup, that finished normally:

Code:
INFO: Starting Backup of VM 111 (qemu)
INFO: status = running
INFO: update VM 111: -lock backup
INFO: include disk 'scsi0' 'rbd_ct:vm-111-disk-1' 140G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/backup01/dump/vzdump-qemu-111-2018_06_27-08_43_27.vma.gz'
INFO: started backup task '6c0db273-5e4c-4c73-9f17-fc26dd4447c3'
INFO: status: 0% (113704960/150323855360), sparse 0% (7151616), duration 3, read/write 37/35 MB/s
INFO: status: 1% (1505230848/150323855360), sparse 0% (204472320), duration 37, read/write 40/35 MB/s
INFO: status: 2% (3018194944/150323855360), sparse 0% (528248832), duration 75, read/write 39/31 MB/s
INFO: status: 3% (4548132864/150323855360), sparse 0% (640499712), duration 114, read/write 39/36 MB/s
INFO: status: 4% (6020988928/150323855360), sparse 0% (656150528), duration 150, read/write 40/40 MB/s
INFO: status: 5% (7552237568/150323855360), sparse 0% (814780416), duration 183, read/write 46/41 MB/s
INFO: status: 6% (9069264896/150323855360), sparse 0% (946798592), duration 221, read/write 39/36 MB/s
INFO: status: 7% (10548084736/150323855360), sparse 0% (950984704), duration 259, read/write 38/38 MB/s
INFO: status: 8% (12034244608/150323855360), sparse 0% (1085603840), duration 290, read/write 47/43 MB/s
INFO: status: 9% (13543538688/150323855360), sparse 0% (1260208128), duration 324, read/write 44/39 MB/s
INFO: status: 10% (15086387200/150323855360), sparse 0% (1395085312), duration 365, read/write 37/34 MB/s
INFO: status: 11% (16550789120/150323855360), sparse 0% (1422729216), duration 404, read/write 37/36 MB/s
INFO: status: 12% (18053464064/150323855360), sparse 1% (1568886784), duration 434, read/write 50/45 MB/s
INFO: status: 13% (19562037248/150323855360), sparse 1% (1746784256), duration 468, read/write 44/39 MB/s
INFO: status: 14% (21099380736/150323855360), sparse 1% (1770450944), duration 510, read/write 36/36 MB/s
INFO: status: 15% (22570205184/150323855360), sparse 1% (1916268544), duration 542, read/write 45/41 MB/s
INFO: status: 16% (24071831552/150323855360), sparse 1% (2038984704), duration 586, read/write 34/31 MB/s
INFO: status: 17% (25562382336/150323855360), sparse 1% (2131615744), duration 629, read/write 34/32 MB/s
INFO: status: 18% (27084652544/150323855360), sparse 1% (2412929024), duration 668, read/write 39/31 MB/s
INFO: status: 19% (28577693696/150323855360), sparse 1% (2548760576), duration 708, read/write 37/33 MB/s
INFO: status: 20% (30139154432/150323855360), sparse 1% (2714460160), duration 755, read/write 33/29 MB/s
INFO: status: 21% (31586910208/150323855360), sparse 1% (2729578496), duration 797, read/write 34/34 MB/s
INFO: status: 22% (33128906752/150323855360), sparse 1% (2885095424), duration 828, read/write 49/44 MB/s
INFO: status: 23% (34600976384/150323855360), sparse 2% (3058458624), duration 866, read/write 38/34 MB/s
INFO: status: 24% (36144807936/150323855360), sparse 2% (3063468032), duration 907, read/write 37/37 MB/s
INFO: status: 25% (37589549056/150323855360), sparse 2% (3435851776), duration 938, read/write 46/34 MB/s
INFO: status: 26% (39089143808/150323855360), sparse 2% (3585204224), duration 969, read/write 48/43 MB/s
INFO: status: 27% (40628912128/150323855360), sparse 2% (3588595712), duration 1014, read/write 34/34 MB/s
INFO: status: 28% (42095738880/150323855360), sparse 2% (3736375296), duration 1053, read/write 37/33 MB/s
INFO: status: 29% (43598807040/150323855360), sparse 2% (3870527488), duration 1076, read/write 65/59 MB/s
INFO: status: 30% (45098336256/150323855360), sparse 2% (3882651648), duration 1101, read/write 59/59 MB/s
INFO: status: 31% (46655799296/150323855360), sparse 2% (4018577408), duration 1125, read/write 64/59 MB/s
INFO: status: 32% (48113188864/150323855360), sparse 2% (4155551744), duration 1153, read/write 52/47 MB/s
INFO: status: 33% (49647386624/150323855360), sparse 2% (4205989888), duration 1181, read/write 54/52 MB/s
INFO: status: 34% (51135119360/150323855360), sparse 2% (4248768512), duration 1212, read/write 47/46 MB/s
INFO: status: 35% (52676722688/150323855360), sparse 2% (4386631680), duration 1234, read/write 70/63 MB/s
INFO: status: 36% (54148464640/150323855360), sparse 3% (4546338816), duration 1261, read/write 54/48 MB/s
INFO: status: 37% (55640326144/150323855360), sparse 3% (4628127744), duration 1293, read/write 46/44 MB/s
INFO: status: 38% (57143001088/150323855360), sparse 3% (4853731328), duration 1313, read/write 75/63 MB/s
INFO: status: 39% (58668351488/150323855360), sparse 3% (5004193792), duration 1339, read/write 58/52 MB/s
INFO: status: 40% (60138717184/150323855360), sparse 3% (5025550336), duration 1367, read/write 52/51 MB/s
INFO: status: 41% (61670096896/150323855360), sparse 3% (5223153664), duration 1392, read/write 61/53 MB/s
INFO: status: 42% (63178080256/150323855360), sparse 3% (5413937152), duration 1418, read/write 57/50 MB/s
INFO: status: 43% (64653754368/150323855360), sparse 3% (5557985280), duration 1445, read/write 54/49 MB/s
INFO: status: 44% (66152300544/150323855360), sparse 3% (5609910272), duration 1475, read/write 49/48 MB/s
INFO: status: 45% (67658842112/150323855360), sparse 3% (5902344192), duration 1498, read/write 65/52 MB/s
INFO: status: 46% (69149392896/150323855360), sparse 4% (6025437184), duration 1529, read/write 48/44 MB/s
INFO: status: 47% (70668255232/150323855360), sparse 4% (6027403264), duration 1554, read/write 60/60 MB/s
INFO: status: 48% (72220082176/150323855360), sparse 4% (6150447104), duration 1583, read/write 53/49 MB/s
INFO: status: 49% (73665085440/150323855360), sparse 4% (6284599296), duration 1605, read/write 65/59 MB/s
INFO: status: 50% (75196923904/150323855360), sparse 4% (6319857664), duration 1633, read/write 54/53 MB/s
INFO: status: 51% (76734922752/150323855360), sparse 4% (6606049280), duration 1660, read/write 56/46 MB/s
INFO: status: 52% (78211645440/150323855360), sparse 4% (6764396544), duration 1686, read/write 56/50 MB/s
INFO: status: 53% (79705079808/150323855360), sparse 4% (6903480320), duration 1710, read/write 62/56 MB/s
INFO: status: 54% (81198514176/150323855360), sparse 4% (6908284928), duration 1738, read/write 53/53 MB/s
INFO: status: 55% (82694832128/150323855360), sparse 4% (7030169600), duration 1763, read/write 59/54 MB/s
INFO: status: 56% (84239712256/150323855360), sparse 4% (7176273920), duration 1789, read/write 59/53 MB/s
INFO: status: 57% (85708898304/150323855360), sparse 4% (7329325056), duration 1810, read/write 69/62 MB/s
INFO: status: 58% (87245455360/150323855360), sparse 5% (7516274688), duration 1841, read/write 49/43 MB/s
INFO: status: 59% (88692293632/150323855360), sparse 5% (7647744000), duration 1867, read/write 55/50 MB/s
INFO: status: 60% (90219806720/150323855360), sparse 5% (7664001024), duration 1899, read/write 47/47 MB/s
INFO: status: 61% (91746861056/150323855360), sparse 5% (7790157824), duration 1927, read/write 54/50 MB/s
INFO: status: 62% (93201170432/150323855360), sparse 5% (7907659776), duration 1952, read/write 58/53 MB/s
INFO: status: 63% (94723112960/150323855360), sparse 5% (8053018624), duration 1982, read/write 50/45 MB/s
INFO: status: 64% (96252985344/150323855360), sparse 5% (8062681088), duration 2010, read/write 54/54 MB/s
INFO: status: 65% (97725972480/150323855360), sparse 5% (8181846016), duration 2037, read/write 54/50 MB/s
INFO: status: 66% (99214098432/150323855360), sparse 5% (8292610048), duration 2060, read/write 64/59 MB/s
INFO: status: 67% (100759371776/150323855360), sparse 5% (8313462784), duration 2091, read/write 49/49 MB/s
INFO: status: 68% (102260015104/150323855360), sparse 5% (8675581952), duration 2116, read/write 60/45 MB/s
INFO: status: 69% (103726645248/150323855360), sparse 5% (8830193664), duration 2141, read/write 58/52 MB/s
INFO: status: 70% (105290137600/150323855360), sparse 5% (8955375616), duration 2167, read/write 60/55 MB/s
INFO: status: 71% (106742153216/150323855360), sparse 5% (8988864512), duration 2199, read/write 45/44 MB/s
INFO: status: 72% (108266913792/150323855360), sparse 6% (9127522304), duration 2226, read/write 56/51 MB/s
INFO: status: 73% (109756678144/150323855360), sparse 6% (9263489024), duration 2251, read/write 59/54 MB/s
INFO: status: 74% (111263612928/150323855360), sparse 6% (9270956032), duration 2280, read/write 51/51 MB/s
INFO: status: 75% (112758226944/150323855360), sparse 6% (9401061376), duration 2310, read/write 49/45 MB/s
INFO: status: 76% (114289016832/150323855360), sparse 6% (9564672000), duration 2332, read/write 69/62 MB/s
INFO: status: 77% (115802570752/150323855360), sparse 6% (9564708864), duration 2361, read/write 52/52 MB/s
INFO: status: 78% (117267431424/150323855360), sparse 6% (9698009088), duration 2387, read/write 56/51 MB/s
INFO: status: 79% (118772334592/150323855360), sparse 6% (9831723008), duration 2413, read/write 57/52 MB/s
INFO: status: 80% (120345198592/150323855360), sparse 6% (10156380160), duration 2438, read/write 62/49 MB/s
INFO: status: 81% (121779126272/150323855360), sparse 6% (10174070784), duration 2470, read/write 44/44 MB/s
INFO: status: 82% (123269611520/150323855360), sparse 6% (10310459392), duration 2499, read/write 51/46 MB/s
INFO: status: 83% (124801122304/150323855360), sparse 6% (10438639616), duration 2528, read/write 52/48 MB/s
INFO: status: 84% (126306615296/150323855360), sparse 6% (10454446080), duration 2561, read/write 45/45 MB/s
INFO: status: 85% (127834456064/150323855360), sparse 7% (10580299776), duration 2594, read/write 46/42 MB/s
INFO: status: 86% (129283915776/150323855360), sparse 7% (10746122240), duration 2618, read/write 60/53 MB/s
INFO: status: 87% (130835677184/150323855360), sparse 7% (10789761024), duration 2646, read/write 55/53 MB/s
INFO: status: 88% (132300668928/150323855360), sparse 7% (10924752896), duration 2674, read/write 52/47 MB/s
INFO: status: 89% (133794496512/150323855360), sparse 7% (11193409536), duration 2699, read/write 59/49 MB/s
INFO: status: 90% (135291666432/150323855360), sparse 7% (11306799104), duration 2733, read/write 44/40 MB/s
INFO: status: 91% (136839168000/150323855360), sparse 7% (11344449536), duration 2762, read/write 53/52 MB/s
INFO: status: 92% (138311630848/150323855360), sparse 7% (11469615104), duration 2809, read/write 31/28 MB/s
INFO: status: 93% (139823022080/150323855360), sparse 7% (11570159616), duration 2836, read/write 55/52 MB/s
INFO: status: 94% (141317308416/150323855360), sparse 7% (11639746560), duration 2865, read/write 51/49 MB/s
INFO: status: 95% (142822342656/150323855360), sparse 7% (11882467328), duration 2894, read/write 51/43 MB/s
INFO: status: 96% (144341925888/150323855360), sparse 7% (12009033728), duration 2928, read/write 44/40 MB/s
INFO: status: 97% (145823891456/150323855360), sparse 8% (12031270912), duration 2968, read/write 37/36 MB/s
INFO: status: 98% (147322699776/150323855360), sparse 8% (12183891968), duration 2999, read/write 48/43 MB/s
INFO: status: 99% (148850802688/150323855360), sparse 8% (12228165632), duration 3030, read/write 49/47 MB/s
INFO: status: 100% (150323855360/150323855360), sparse 8% (12297965568), duration 3061, read/write 47/45 MB/s
INFO: transferred 150323 MB in 3061 seconds (49 MB/s)
INFO: archive file size: 77.89GB
INFO: delete old backup '/mnt/pve/backup01/dump/vzdump-qemu-111-2018_06_20-21_44_45.vma.gz'
INFO: Finished Backup of VM 111 (00:51:29)
INFO: Backup job finished successfully
TASK OK

Very unsatisfactory, who knows when it'll happen again.

Stefan
 
Sadly this is not conclusive, but it could be that on the node there wasn't enough memory available and with the reboot of the VM it might have been freed. Or the VM was too busy writing to disk, that the couldn't pick up the dirty bits fast enough. Either way just guessing.
 
Unfortunately no, node has 128GB RAM, only 20GB or so were used. Disk activity was close to none.
 
Started on another host, next day, same problem.

I did some stuff with the QEMU monitor, and even after cancelling the backup, it shows:
Code:
info block-jobs
Type backup, device drive-scsi0: Completed 37059821568 of 150323855360 bytes, speed limit 10485760000 bytes/s
(I had set the speed limit while I was experimenting with it, it was 0 before.)
I can issue block_job_cancel with no error messages but it doesn't have any effect. I guess there's a bug in the qemu backup code that will somehow slow down the block jobs and make it uncancellable.

Stefan
 
To identify which storage the target or source are to look at, a backup to local storage would be good. A small VM could give already some insight. Also you might want to try to change the disk storage, from krbd (rbd_ct) to librbd (rbd_vm). This way Qemu is talking through librbd with the cluster.
 
I did a backup of a small VM to local storage. As mentioned above, it finished in normal speed. I also have another VM of about the same size as the one that's having troubles, and it backs up to the same identical storage without problems (yet). You can see from the logs that also this troublesome VM starts out ok, but then there seems to be an incident that practically stops progress.

I did a fstrim inside the VM and the progress of the block job increased quite a bit. So my suspicion is that writes to the device are handled ok and written to the VMA file, but no new reads are issued.

Even if librbd would work, that's not a solution. krbd is already slower than we would want (mind you, this is an all SSD setup with 10GB Ethernet), librbd would degrade performance even more.

Stefan
 
Sadly this is not conclusive, but it could be that on the node there wasn't enough memory available and with the reboot of the VM it might have been freed. Or the VM was too busy writing to disk, that the couldn't pick up the dirty bits fast enough. Either way just guessing.
Well, the other 20 or so VMs on the same source storage (Ceph) are backing up just fine to the same target storage (GlusterFS), so I highly doubt that is the source of the problem. Indeed, 5 hosts can simultaneously (thanks to the silly cluster backup mechanism) back up VMs and they all work ok, except this one.

I found one difference though: most VM disks are set to writeback cache, whereas this one has directsync. I changed it to the default (no cache) and will see if that makes a difference. If it works then, would that warrant a proper bug report?

Thanks,

Stefan
 
I found one difference though: most VM disks are set to writeback cache, whereas this one has directsync. I changed it to the default (no cache) and will see if that makes a difference. If it works then, would that warrant a proper bug report?
This is by design, the qemu needs to read each block separately, as each block needs to go as per 'directsync' to the storage immediately. With caching activated the qemu process can read a couple of blocks at once, as these are going into the cache first, till they land on the storage.
 
This is by design, the qemu needs to read each block separately, as each block needs to go as per 'directsync' to the storage immediately. With caching activated the qemu process can read a couple of blocks at once, as these are going into the cache first, till they land on the storage.
Are you saying that backups don't work with "directsync"? Because usually, the first backup after a fresh VM start has worked, only after a couple of days (or backups?) the backups slow down eventually. (I let one run over the weekend, it basically only made progress for VM disk writes).

Stefan
 
Are you saying that backups don't work with "directsync"?
They are slower then with writeback. Depending on the amount of data that changes during the backup, it can take a very long time to get all of it. Ceph objects are 4 MB by default, a write of 4K (assumption) from the VM will be written to the Ceph OSDs instead of 4 MB, with directsync.

Because usually, the first backup after a fresh VM start has worked, only after a couple of days (or backups?) the backups slow down eventually. (I let one run over the weekend, it basically only made progress for VM disk writes).
After VM start, the caches and RAM are empty, writes to disk (inside the vM) are less often happening, then after the caches and RAM are filled during runtime.
 
So by "slow", you mean 407 kByte/s slow? Because that's the average speed of this (cancelled) backup job:

Code:
INFO: Starting Backup of VM 111 (qemu)
INFO: status = running
INFO: update VM 111: -lock backup
INFO: include disk 'scsi0' 'rbd_ct:vm-111-disk-1' 140G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/backup01/dump/vzdump-qemu-111-2018_06_29-21_30_02.vma.gz'
INFO: started backup task 'e5941003-c0ca-4b0a-9394-9ff91e71ccd0'
INFO: status: 0% (231866368/150323855360), sparse 0% (116277248), duration 3, read/write 77/38 MB/s
INFO: status: 1% (1524170752/150323855360), sparse 0% (768454656), duration 25, read/write 58/29 MB/s
INFO: status: 2% (3009413120/150323855360), sparse 0% (1297092608), duration 51, read/write 57/36 MB/s
INFO: status: 3% (4513005568/150323855360), sparse 0% (1438490624), duration 100, read/write 30/27 MB/s
INFO: status: 4% (6023086080/150323855360), sparse 0% (1464827904), duration 142, read/write 35/35 MB/s
INFO: status: 5% (7529955328/150323855360), sparse 1% (1627811840), duration 188, read/write 32/29 MB/s
INFO: status: 6% (9059696640/150323855360), sparse 1% (1761214464), duration 233, read/write 33/31 MB/s
INFO: status: 7% (10524622848/150323855360), sparse 1% (1765400576), duration 288, read/write 26/26 MB/s
INFO: status: 8% (12049448960/150323855360), sparse 1% (1899954176), duration 338, read/write 30/27 MB/s
INFO: status: 9% (13573685248/150323855360), sparse 1% (2074218496), duration 375, read/write 41/36 MB/s
INFO: status: 10% (15057944576/150323855360), sparse 2% (3392716800), duration 3634, read/write 0/0 MB/s
INFO: status: 11% (16563765248/150323855360), sparse 3% (4811284480), duration 3916, read/write 5/0 MB/s
INFO: status: 12% (18087346176/150323855360), sparse 4% (6203879424), duration 4126, read/write 7/0 MB/s
INFO: status: 13% (19580518400/150323855360), sparse 5% (7548432384), duration 4342, read/write 6/0 MB/s
INFO: status: 14% (21047410688/150323855360), sparse 5% (8392679424), duration 18783, read/write 0/0 MB/s
INFO: status: 15% (22548578304/150323855360), sparse 6% (9565417472), duration 19327, read/write 2/0 MB/s
INFO: status: 16% (24069537792/150323855360), sparse 6% (10155896832), duration 19701, read/write 4/2 MB/s
INFO: status: 17% (25651445760/150323855360), sparse 7% (10918473728), duration 20234, read/write 2/1 MB/s
INFO: status: 18% (27058896896/150323855360), sparse 8% (12237852672), duration 20608, read/write 3/0 MB/s
INFO: status: 19% (28609478656/150323855360), sparse 9% (13769359360), duration 21188, read/write 2/0 MB/s
INFO: status: 20% (30070603776/150323855360), sparse 9% (14962089984), duration 25922, read/write 0/0 MB/s
INFO: status: 21% (31603884032/150323855360), sparse 10% (16380370944), duration 68182, read/write 0/0 MB/s
INFO: status: 22% (33179041792/150323855360), sparse 11% (16994811904), duration 68199, read/write 92/56 MB/s
INFO: status: 23% (34617819136/150323855360), sparse 11% (17040240640), duration 68230, read/write 46/44 MB/s
INFO: status: 24% (36145463296/150323855360), sparse 11% (17612312576), duration 68250, read/write 76/47 MB/s
INFO: status: 25% (37592760320/150323855360), sparse 11% (17667997696), duration 68277, read/write 53/51 MB/s
INFO: status: 26% (39112933376/150323855360), sparse 12% (18762498048), duration 105993, read/write 0/0 MB/s
INFO: status: 27% (40594636800/150323855360), sparse 13% (20244201472), duration 106728, read/write 2/0 MB/s
INFO: status: 28% (42121494528/150323855360), sparse 14% (21762658304), duration 107300, read/write 2/0 MB/s
INFO: status: 29% (43610537984/150323855360), sparse 15% (23244042240), duration 107590, read/write 5/0 MB/s
INFO: status: 30% (45103579136/150323855360), sparse 16% (24681234432), duration 110177, read/write 0/0 MB/s
INFO: status: 31% (46619099136/150323855360), sparse 17% (26196754432), duration 111834, read/write 0/0 MB/s

If that is the case, then basically you should mark "directsync" as not compatibly with backup. Which is unfortunate, because according to our testing, directsync is the fastest performing mode for high random disk IO applications like large, busy databases.

Stefan
 
I have changed the disk cache to "writethrough" and for the last days the daily backup has always run successfully. So either it's a bug in "directsync" or "directsync" should be marked as incompatible with backup.
 
I already know a lot about caching, thank you. However, can you explain in detail why directsync does not work with the VMA backup? And I mean in technical terms.
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!