VZDump very slow in 10 Gbit Network

Frank bartels

Member
Mar 6, 2018
17
0
21
53
Hi all,

We have a two-node Proxmox HA cluster with an ISCSI storage (LVM) in our company where our VMs are located.
The ISCSI Network is a pure 10 Gbps network. The VM and Nodes are also connected via 10 Gbit network cards. We use a Ggabit network for managment of the nodes. A QNAB NAS is connected via NFS as backup storage.
When we make a backup with vzdump we never get over 100mb / s. This is very slow for us.
Are there ways to become more efficient here?
Proxmox 5.1 HA Cluster HP VSA SAN as ISCSI Target. QNAB NAS with NFS.

Thanks a lot.

VZDump Log:

INFO: starting new backup job: vzdump 200 --compress gzip --node tom --mode snapshot --remove 0 --storage HAPVE01-Backup-122
INFO: Starting Backup of VM 200 (qemu)
INFO: status = running
INFO: update VM 200: -lock backup
INFO: VM Name: StarNet-Config
INFO: include disk 'virtio0' 'HAPVE01-LVM:vm-200-disk-1' 40G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/HAPVE01-Backup-122/dump/vzdump-qemu-200-2018_03_06-13_32_32.vma.gz'
INFO: started backup task '6dd0bec5-12e3-41b1-a1f6-c5f64e5b0e76'
INFO: status: 0% (259260416/42949672960), sparse 0% (84836352), duration 3, read/write 86/58 MB/s
INFO: status: 1% (440926208/42949672960), sparse 0% (87392256), duration 6, read/write 60/59 MB/s
INFO: status: 2% (925040640/42949672960), sparse 0% (89833472), duration 12, read/write 80/80 MB/s
INFO: status: 3% (1333919744/42949672960), sparse 0% (101175296), duration 17, read/write 81/79 MB/s
INFO: status: 4% (1731461120/42949672960), sparse 0% (102916096), duration 22, read/write 79/79 MB/s
INFO: status: 5% (2223046656/42949672960), sparse 0% (131661824), duration 28, read/write 81/77 MB/s
INFO: status: 6% (2647392256/42949672960), sparse 0% (159465472), duration 39, read/write 38/36 MB/s
INFO: status: 7% (3041001472/42949672960), sparse 0% (174211072), duration 44, read/write 78/75 MB/s
INFO: status: 8% (3481337856/42949672960), sparse 0% (194121728), duration 50, read/write 73/70 MB/s
INFO: status: 9% (3930456064/42949672960), sparse 0% (209485824), duration 56, read/write 74/72 MB/s
INFO: status: 10% (4351524864/42949672960), sparse 0% (233906176), duration 61, read/write 84/79 MB/s
INFO: status: 11% (4796973056/42949672960), sparse 0% (309735424), duration 73, read/write 37/30 MB/s
INFO: status: 12% (5165744128/42949672960), sparse 0% (310501376), duration 78, read/write 73/73 MB/s
INFO: status: 13% (5608833024/42949672960), sparse 0% (341909504), duration 84, read/write 73/68 MB/s
INFO: status: 14% (6081085440/42949672960), sparse 1% (460169216), duration 90, read/write 78/58 MB/s
INFO: status: 15% (6464667648/42949672960), sparse 1% (549703680), duration 102, read/write 31/24 MB/s
INFO: status: 16% (6909329408/42949672960), sparse 1% (693112832), duration 108, read/write 74/50 MB/s
INFO: status: 17% (7322992640/42949672960), sparse 1% (833314816), duration 114, read/write 68/45 MB/s
INFO: status: 18% (7765098496/42949672960), sparse 2% (1275420672), duration 119, read/write 88/0 MB/s
INFO: status: 19% (8188133376/42949672960), sparse 3% (1698455552), duration 124, read/write 84/0 MB/s
INFO: status: 20% (8642297856/42949672960), sparse 4% (2140332032), duration 129, read/write 90/2 MB/s
INFO: status: 21% (9022472192/42949672960), sparse 5% (2460610560), duration 133, read/write 95/14 MB/s
INFO: status: 22% (9484304384/42949672960), sparse 6% (2922442752), duration 138, read/write 92/0 MB/s
INFO: status: 23% (9961668608/42949672960), sparse 7% (3399806976), duration 143, read/write 95/0 MB/s
INFO: status: 24% (10369826816/42949672960), sparse 8% (3807965184), duration 147, read/write 102/0 MB/s
INFO: status: 25% (10757013504/42949672960), sparse 9% (4192145408), duration 151, read/write 96/0 MB/s
INFO: status: 26% (11192958976/42949672960), sparse 10% (4552478720), duration 156, read/write 87/15 MB/s
INFO: status: 27% (11668815872/42949672960), sparse 11% (5028335616), duration 161, read/write 95/0 MB/s
INFO: status: 28% (12072845312/42949672960), sparse 12% (5432365056), duration 165, read/write 101/0 MB/s
INFO: status: 29% (12535660544/42949672960), sparse 13% (5895180288), duration 170, read/write 92/0 MB/s
INFO: status: 30% (12905742336/42949672960), sparse 14% (6263156736), duration 174, read/write 92/0 MB/s
INFO: status: 31% (13404602368/42949672960), sparse 15% (6743293952), duration 179, read/write 99/3 MB/s
INFO: status: 32% (13827768320/42949672960), sparse 16% (7166459904), duration 183, read/write 105/0 MB/s
INFO: status: 33% (14200406016/42949672960), sparse 17% (7539097600), duration 187, read/write 93/0 MB/s
INFO: status: 34% (14647820288/42949672960), sparse 18% (7986511872), duration 192, read/write 89/0 MB/s
INFO: status: 35% (15107686400/42949672960), sparse 19% (8433217536), duration 197, read/write 91/2 MB/s
INFO: status: 36% (15479472128/42949672960), sparse 20% (8602988544), duration 203, read/write 61/33 MB/s
INFO: status: 37% (15947202560/42949672960), sparse 21% (9070718976), duration 208, read/write 93/0 MB/s
INFO: status: 38% (16400121856/42949672960), sparse 22% (9523638272), duration 213, read/write 90/0 MB/s
INFO: status: 39% (16761094144/42949672960), sparse 23% (9884610560), duration 217, read/write 90/0 MB/s
INFO: status: 40% (17227448320/42949672960), sparse 24% (10338422784), duration 222, read/write 93/2 MB/s
INFO: status: 41% (17614045184/42949672960), sparse 24% (10677186560), duration 226, read/write 96/11 MB/s
INFO: status: 42% (18077646848/42949672960), sparse 25% (11140788224), duration 231, read/write 92/0 MB/s
INFO: status: 43% (18484625408/42949672960), sparse 26% (11547766784), duration 235, read/write 101/0 MB/s
INFO: status: 44% (18904776704/42949672960), sparse 27% (11967918080), duration 239, read/write 105/0 MB/s
INFO: status: 45% (19335413760/42949672960), sparse 28% (12396494848), duration 243, read/write 107/0 MB/s
INFO: status: 46% (19834077184/42949672960), sparse 29% (12831596544), duration 248, read/write 99/12 MB/s
INFO: status: 47% (20201144320/42949672960), sparse 30% (13198663680), duration 252, read/write 91/0 MB/s
INFO: status: 48% (20718551040/42949672960), sparse 31% (13716070400), duration 257, read/write 103/0 MB/s
INFO: status: 49% (21123694592/42949672960), sparse 32% (14121213952), duration 261, read/write 101/0 MB/s
INFO: status: 50% (21509636096/42949672960), sparse 33% (14506573824), duration 265, read/write 96/0 MB/s
INFO: status: 51% (21929984000/42949672960), sparse 34% (14867251200), duration 269, read/write 105/14 MB/s
INFO: status: 52% (22398369792/42949672960), sparse 35% (15335636992), duration 274, read/write 93/0 MB/s
INFO: status: 53% (22782148608/42949672960), sparse 36% (15719415808), duration 278, read/write 95/0 MB/s
INFO: status: 54% (23251910656/42949672960), sparse 37% (16189177856), duration 283, read/write 93/0 MB/s
INFO: status: 55% (23624024064/42949672960), sparse 38% (16561287168), duration 287, read/write 93/0 MB/s
INFO: status: 56% (24119279616/42949672960), sparse 39% (17056542720), duration 292, read/write 99/0 MB/s
INFO: status: 57% (24504303616/42949672960), sparse 40% (17441566720), duration 296, read/write 96/0 MB/s
INFO: status: 58% (24964562944/42949672960), sparse 41% (17901826048), duration 301, read/write 92/0 MB/s
INFO: status: 59% (25425805312/42949672960), sparse 42% (18363068416), duration 306, read/write 92/0 MB/s
INFO: status: 60% (25794641920/42949672960), sparse 43% (18731900928), duration 310, read/write 92/0 MB/s
INFO: status: 61% (26253983744/42949672960), sparse 44% (19191242752), duration 315, read/write 91/0 MB/s
INFO: status: 62% (26685407232/42949672960), sparse 45% (19622666240), duration 319, read/write 107/0 MB/s
INFO: status: 63% (27061452800/42949672960), sparse 46% (19998711808), duration 323, read/write 94/0 MB/s
INFO: status: 64% (27567456256/42949672960), sparse 47% (20504715264), duration 328, read/write 101/0 MB/s
INFO: status: 65% (27972599808/42949672960), sparse 48% (20909854720), duration 332, read/write 101/0 MB/s
INFO: status: 66% (28347531264/42949672960), sparse 49% (21284786176), duration 336, read/write 93/0 MB/s
INFO: status: 67% (28849733632/42949672960), sparse 50% (21786988544), duration 341, read/write 100/0 MB/s
INFO: status: 68% (29239934976/42949672960), sparse 51% (22177189888), duration 345, read/write 97/0 MB/s
INFO: status: 69% (29646651392/42949672960), sparse 52% (22583906304), duration 349, read/write 101/0 MB/s
INFO: status: 70% (30120148992/42949672960), sparse 53% (23057399808), duration 354, read/write 94/0 MB/s
INFO: status: 71% (30593187840/42949672960), sparse 54% (23530438656), duration 359, read/write 94/0 MB/s
INFO: status: 72% (31016615936/42949672960), sparse 55% (23953866752), duration 363, read/write 105/0 MB/s
INFO: status: 73% (31421431808/42949672960), sparse 56% (24358682624), duration 367, read/write 101/0 MB/s
INFO: status: 74% (31783976960/42949672960), sparse 57% (24721227776), duration 371, read/write 90/0 MB/s
INFO: status: 75% (32321568768/42949672960), sparse 58% (25258811392), duration 376, read/write 107/0 MB/s
INFO: status: 76% (32696958976/42949672960), sparse 59% (25634185216), duration 380, read/write 93/0 MB/s
INFO: status: 77% (33075625984/42949672960), sparse 60% (26012852224), duration 384, read/write 94/0 MB/s
INFO: status: 78% (33597161472/42949672960), sparse 61% (26534387712), duration 389, read/write 104/0 MB/s
INFO: status: 79% (33971044352/42949672960), sparse 62% (26908270592), duration 393, read/write 93/0 MB/s
INFO: status: 80% (34371338240/42949672960), sparse 63% (27308560384), duration 397, read/write 100/0 MB/s
INFO: status: 81% (34864234496/42949672960), sparse 64% (27801456640), duration 402, read/write 98/0 MB/s
INFO: status: 82% (35233005568/42949672960), sparse 65% (28170227712), duration 406, read/write 92/0 MB/s
INFO: status: 83% (35698704384/42949672960), sparse 66% (28635926528), duration 411, read/write 93/0 MB/s
INFO: status: 84% (36118331392/42949672960), sparse 67% (29055553536), duration 415, read/write 104/0 MB/s
INFO: status: 85% (36514365440/42949672960), sparse 68% (29451579392), duration 419, read/write 99/0 MB/s
INFO: status: 86% (36983603200/42949672960), sparse 69% (29920817152), duration 424, read/write 93/0 MB/s
INFO: status: 87% (37402640384/42949672960), sparse 70% (30339854336), duration 429, read/write 83/0 MB/s
INFO: status: 88% (37842386944/42949672960), sparse 71% (30779600896), duration 434, read/write 87/0 MB/s
INFO: status: 89% (38265290752/42949672960), sparse 72% (31202504704), duration 439, read/write 84/0 MB/s
INFO: status: 90% (38737149952/42949672960), sparse 73% (31674359808), duration 444, read/write 94/0 MB/s
INFO: status: 91% (39126368256/42949672960), sparse 74% (32063578112), duration 448, read/write 97/0 MB/s
INFO: status: 92% (39608647680/42949672960), sparse 75% (32545857536), duration 453, read/write 96/0 MB/s
INFO: status: 93% (40037777408/42949672960), sparse 76% (32974987264), duration 458, read/write 85/0 MB/s
INFO: status: 94% (40403533824/42949672960), sparse 77% (33340743680), duration 462, read/write 91/0 MB/s
INFO: status: 95% (40862482432/42949672960), sparse 78% (33799692288), duration 468, read/write 76/0 MB/s
INFO: status: 96% (41242722304/42949672960), sparse 79% (34179932160), duration 472, read/write 95/0 MB/s
INFO: status: 97% (41675849728/42949672960), sparse 80% (34613059584), duration 477, read/write 86/0 MB/s
INFO: status: 98% (42149347328/42949672960), sparse 81% (35086557184), duration 482, read/write 94/0 MB/s
INFO: status: 99% (42595778560/42949672960), sparse 82% (35532988416), duration 487, read/write 89/0 MB/s
INFO: status: 100% (42949672960/42949672960), sparse 83% (35886882816), duration 491, read/write 88/0 MB/s
INFO: transferred 42949 MB in 491 seconds (87 MB/s)
INFO: archive file size: 2.95GB
INFO: Finished Backup of VM 200 (00:08:14)
INFO: Backup job finished successfully
TASK OK
 
The QNAB is connected by Gigabit. That's true, but it only has to write about 3 GB. We're more likely to suspect a reading speed problem on the ISCSI storage.
We have already run a backup with local storage as a counter sample. Again, the reading speed of the ISCSI target was not greater than 100mb / s.
We tested the read speed with dd and reached a block size of 4 MB approx. 300 - 500 MB / s. Also we could then packed (pigz) within 110 sec. Write on the NFS.backup.jpg
 
according to your logs the Backup to the QNAP runs with 87MB/s (read MByte/s)

INFO: transferred 42949 MB in 491 seconds (87 MB/s)

this is merely maxing out the Gigabit connection. Please don't intermix Mb/s and MB/s, 87 MByte/s is nearly 700 Mbit/s
 
Our log shows (iftop -B) that the bandwidth to qnap is about 30-50MB / s. The backup is also only 3GB and only the 3GB (TX total) are transmitted.
The read rate to the iscsi target is almost constant at 100MB / s. We suspect here thus a bottleneck in the reading rate which seems to be limited by software.

one of our other proxmox host has the vm's hard drives on its local raid5 LVM memory. saves the same vm every night on the same qnap and achieves the following data:
Code:
[/INDENT]
200: Mar 02 02:45:46 INFO: Starting Backup of VM 200 (qemu)
200: Mar 02 02:45:46 INFO: status = stopped
200: Mar 02 02:45:46 INFO: update VM 200: -lock backup
200: Mar 02 02:45:46 INFO: backup mode: stop
200: Mar 02 02:45:46 INFO: ionice priority: 7
200: Mar 02 02:45:46 INFO: VM Name: StarNet-Config
200: Mar 02 02:45:46 INFO: include disk 'virtio0' 'local:200/vm-200-disk-1.qcow2' 40G
200: Mar 02 02:45:46 INFO: creating archive '/mnt/pve/KVMSHARE/dump/vzdump-qemu-200-2018_03_02-02_45_46.vma.lzo'
200: Mar 02 02:45:46 INFO: starting kvm to execute backup task
200: Mar 02 02:45:48 INFO: started backup task '88c94944-cbf9-43b9-9755-3e36d28b2cce'
200: Mar 02 02:45:51 INFO: status: 0% (331218944/42949672960), sparse 0% (85323776), duration 3, 110/81 MB/s
200: Mar 02 02:45:54 INFO: status: 1% (573964288/42949672960), sparse 0% (87588864), duration 6, 80/80 MB/s
200: Mar 02 02:45:57 INFO: status: 2% (950272000/42949672960), sparse 0% (92192768), duration 9, 125/123 MB/s
200: Mar 02 02:46:01 INFO: status: 3% (1406402560/42949672960), sparse 0% (101994496), duration 13, 114/111 MB/s
200: Mar 02 02:46:04 INFO: status: 4% (1787887616/42949672960), sparse 0% (103743488), duration 16, 127/126 MB/s
200: Mar 02 02:46:12 INFO: status: 5% (2151153664/42949672960), sparse 0% (105242624), duration 24, 45/45 MB/s
200: Mar 02 02:46:21 INFO: status: 6% (2617966592/42949672960), sparse 0% (158285824), duration 33, 51/45 MB/s
200: Mar 02 02:46:37 INFO: status: 7% (3132096512/42949672960), sparse 0% (177922048), duration 49, 32/30 MB/s
200: Mar 02 02:46:43 INFO: status: 8% (3527409664/42949672960), sparse 0% (195719168), duration 55, 65/62 MB/s
200: Mar 02 02:46:47 INFO: status: 9% (3874684928/42949672960), sparse 0% (209408000), duration 59, 86/83 MB/s
200: Mar 02 02:46:53 INFO: status: 10% (4450680832/42949672960), sparse 0% (303284224), duration 65, 95/80 MB/s
200: Mar 02 02:46:56 INFO: status: 11% (4770365440/42949672960), sparse 0% (309649408), duration 68, 106/104 MB/s
200: Mar 02 02:46:59 INFO: status: 12% (5186781184/42949672960), sparse 0% (317911040), duration 71, 138/136 MB/s
200: Mar 02 02:47:03 INFO: status: 13% (5621809152/42949672960), sparse 0% (347447296), duration 75, 108/101 MB/s
200: Mar 02 02:47:07 INFO: status: 14% (6089736192/42949672960), sparse 1% (465743872), duration 79, 116/87 MB/s
200: Mar 02 02:47:27 INFO: status: 15% (6447235072/42949672960), sparse 1% (538570752), duration 99, 17/14 MB/s
200: Mar 02 02:47:38 INFO: status: 16% (6875709440/42949672960), sparse 1% (693960704), duration 110, 38/24 MB/s
200: Mar 02 02:47:46 INFO: status: 20% (8730443776/42949672960), sparse 5% (2228670464), duration 118, 231/40 MB/s
200: Mar 02 02:47:49 INFO: status: 25% (10907156480/42949672960), sparse 10% (4305170432), duration 121, 725/33 MB/s
200: Mar 02 02:47:52 INFO: status: 35% (15172960256/42949672960), sparse 19% (8498618368), duration 124, 1421/24 MB/s
200: Mar 02 02:47:57 INFO: status: 40% (17182425088/42949672960), sparse 24% (10309967872), duration 129, 401/39 MB/s
200: Mar 02 02:48:00 INFO: status: 45% (19366543360/42949672960), sparse 28% (12428603392), duration 132, 728/21 MB/s
200: Mar 02 02:48:05 INFO: status: 50% (21616132096/42949672960), sparse 34% (14611398656), duration 137, 449/13 MB/s
200: Mar 02 02:48:08 INFO: status: 86% (37254594560/42949672960), sparse 70% (30197260288), duration 140, 5212/17 MB/s
200: Mar 02 02:48:10 INFO: status: 100% (42949672960/42949672960), sparse 83% (35892334592), duration 142, 2847/0 MB/s
200: Mar 02 02:48:10 INFO: transferred 42949 MB in 142 seconds (302 MB/s)
200: Mar 02 02:48:10 INFO: stopping kvm after backup task
200: Mar 02 02:48:11 INFO: archive file size: 3.62GB
200: Mar 02 02:48:11 INFO: delete old backup '/mnt/pve/KVMSHARE/dump/vzdump-qemu-200-2018_03_01-22_00_03.vma.lzo'
200: Mar 02 02:48:12 INFO: Finished Backup of VM 200 (00:02:26)

we suspect the blocksize (8192byte) from cstream. cstream is used by vzdump to fetch the data from the block devices.​
 
what is the content of /etc/vzdump.conf ? There is an option to limit bandwith there. The very constant rate looks like vzdump is doing bw limiting.

But be aware of one thing:

if you don't limit bandwith, vzdump can fill up network buffers and make the host nearly unresponsive during backup. This behavior shoot out nodes on our cluster with IPMI_WATCHDOG.
 
...
The read rate to the iscsi target is almost constant at 100MB / s. We suspect here thus a bottleneck in the reading rate which seems to be limited by software.
Hi,
sure?! In my opinion fast iScsi-Raids are rare.

Try to "measure" the iscsi-raid-speed, like
Code:
dd if=/dev/iscsi-vg/vm-disk-xxx-1 of=/dev/null bs=64k count=163840
Imho the result must divide by two, because vzdump scan first for zeros (spares file) and after that read the content?!

Blocksize 64k, because spirit wrote, that vzdump use this blocksize. https://forum.proxmox.com/threads/vzdump-slow-on-ceph-images-rbd-export-fast.29777/#post-149727
Perhaps this has changed now?!

Udo
 
Hi, we're currently analyzing a similar issue with a specific 12-Bay Qnap (I guess you meant them?) and noticed a couple of things:
  • Write performance on thoese with Software RAID only has some limited performance
  • RAM should be extended to at least 8GB or 16GB if possible to enhance buffering and kernel responsiveness
  • Performance decreases with larger drives >6TB HDDs, especially on RAID6
  • If you use bonding on the 1Gbit interfaces, make sure the involved switches can handle 802.3ad properly
They're trying to improve some of these things via firmware, but at some point you can't rule out hardware limits via software...
 
hi udo, hi marsian,

i did that test with 64k
Code:
dd if=/dev/VG1/vm-291-disk-1 of=/dev/null bs=64k count=163840
163840+0 Datensätze ein
163840+0 Datensätze aus
10737418240 Bytes (11 GB, 10 GiB) kopiert, 35,4739 s, 303 MB/s

300/2 ==> 150MB/s (+50%)
we did many other tests, too.
the issue with poor bandwidth from ceph is already solved in 5.0 and is been patched for 4.4 too.

if we format the ISCSI Target with ext4 on top of the underlying gpt the read speed of that drive is ten times faster.

dd with block-size of 8MB is right at 10GBe speed
Code:
dd if=/dev/VG1/vm-291-disk-1 of=/dev/null bs=8M  
1771+0 Datensätze ein
1770+0 Datensätze aus
14847836160 Bytes (15 GB, 14 GiB) kopiert, 16,2972 s, 911 MB/s

we already played with jumbo frames, ctream block-size (8Kb,16Kb,32Kb,64Kb) and get slightly better results.
first we get about 85MB/s now about 100 to 120MB/s ... BUT that is far away from that what is needed to finish all backups until the morning comes

it takes about 4 hours to backup all VM's from our single-node(4.4) and it is about 3 times faster than our 'new' two-node cluster (5.0)
12 hours of backup is -NO- option.


@marsian

our single-node(4.4) backups all its vm's to a TVS-471U (4x 6TB Raid 1 0) with an 4x1 bond or (testcase) 2x2 bond with 802.3ad active

the final backup of our benchmarkVM is about 3G and is transfered in about 32 seconds

Code:
dd if=/dev/VG1/vm-291-disk-1 of=/mnt/pve/HAPVE01-Backup-122/test.dump bs=8M count=375
375+0 Datensätze ein
375+0 Datensätze aus
3145728000 Bytes (3,1 GB, 2,9 GiB) kopiert, 31,8774 s, 98,7 MB/s
so write-speed is not our problem

as long as the storage of the VM is an image-file the backup is fine... but if it is a block-device (on ISCSI target) to fulfill all HA preconditions the vzdump backup-routine is not usable in production
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!