Proxmox, vzdump, NFS and Ceph: slow backup

basanisi

Active Member
Apr 15, 2011
40
2
28
Hello everybody,

I have a Proxmox cluster of 3 servers, each with 24 Xeon cores, 4 gigabit NICs, and 64 GB of RAM.

Each server is connected to a Cisco SG500 switch, and the switches are interconnected over 10 Gbit optical links.

We have 3 buildings with 1 Proxmox server in each.

The PVE version is:

proxmox-ve: 5.3-1 (running kernel: 4.15.18-10-pve)
pve-manager: 5.3-8 (running version: 5.3-8/2929af8e)
pve-kernel-4.15: 5.3-1
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.15.18-2-pve: 4.15.18-21
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.15.15-1-pve: 4.15.15-6
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-4-pve: 4.13.13-35
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 12.2.10-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-19
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-36
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2

All servers are configured with 2 LACP bonds, 1 for VMs and 1 for Ceph.

Backups are written over NFS to 2 Synology RackStations in a DRBD cluster.

All VMs are stored on the Proxmox cluster's Ceph storage.

But VM backups with vzdump over NFS are really slow: approximately 33 MB/s for a 10 GB VM disk.

To test LACP, we copied a 10 GB file to NFS with rsync, which gave an average speed of 150 MB/s, versus 33 MB/s with vzdump. We also migrated a VM from one Proxmox server to another at 180 MB/s.

Could you help me, please?
 

fireon

Famous Member
Oct 25, 2010
3,923
333
103
40
Austria/Graz
iteas.at
Only VMs, or also containers? NFSv4 or NFSv3? What is the backup mode? What compression? Logs from one backup?
 

basanisi

Hello,

Thanks for answering my question.

We only use VMs, no containers.

The backup mode selected is snapshot, and the compression is LZO. We tested NFS 3, 4.0 and 4.1 without any noticeable improvement.

All network bonds use MTU 9000, on the Proxmox nodes, the RackStations, and the switches.

This is the backup log of one VM:

INFO: Starting Backup of VM 100 (qemu)
INFO: status = running
INFO: update VM 100: -lock backup
INFO: VM Name: STESUD05
INFO: include disk 'virtio0' 'vm_storage:vm-100-disk-0' 103G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/rackstation_cluster_weekly/dump/vzdump-qemu-100-2019_01_26-13_00_02.vma.lzo'
Qemu Guest Agent is not running - VM 100 qmp command 'guest-ping' failed - got timeout
INFO: started backup task 'fff982e0-0d23-4391-9da2-3245dbbf6a01'
INFO: status: 0% (113246208/110595407872), sparse 0% (5267456), duration 3, read/write 37/35 MB/s
INFO: status: 1% (1119879168/110595407872), sparse 0% (236093440), duration 23, read/write 50/38 MB/s
INFO: status: 2% (2243952640/110595407872), sparse 0% (364531712), duration 64, read/write 27/24 MB/s
INFO: status: 3% (3351248896/110595407872), sparse 0% (380334080), duration 95, read/write 35/35 MB/s
INFO: status: 4% (4437573632/110595407872), sparse 0% (395931648), duration 129, read/write 31/31 MB/s
INFO: status: 5% (5540675584/110595407872), sparse 0% (507707392), duration 156, read/write 40/36 MB/s
INFO: status: 6% (6677331968/110595407872), sparse 0% (682541056), duration 186, read/write 37/32 MB/s
INFO: status: 7% (7767851008/110595407872), sparse 0% (806113280), duration 209, read/write 47/42 MB/s
INFO: status: 8% (8883535872/110595407872), sparse 0% (1061654528), duration 236, read/write 41/31 MB/s
INFO: status: 9% (9974054912/110595407872), sparse 1% (1209540608), duration 259, read/write 47/40 MB/s
INFO: status: 10% (11068768256/110595407872), sparse 1% (1468506112), duration 284, read/write 43/33 MB/s
INFO: status: 11% (12213813248/110595407872), sparse 1% (1628561408), duration 312, read/write 40/35 MB/s
INFO: status: 12% (13312720896/110595407872), sparse 1% (1808351232), duration 343, read/write 35/29 MB/s
INFO: status: 13% (14378074112/110595407872), sparse 1% (2004922368), duration 365, read/write 48/39 MB/s
INFO: status: 14% (15531507712/110595407872), sparse 2% (2224619520), duration 396, read/write 37/30 MB/s
INFO: status: 15% (16610754560/110595407872), sparse 2% (2310459392), duration 424, read/write 38/35 MB/s
INFO: status: 16% (17725128704/110595407872), sparse 2% (2433191936), duration 454, read/write 37/33 MB/s
INFO: status: 17% (18840813568/110595407872), sparse 2% (2640293888), duration 476, read/write 50/41 MB/s
INFO: status: 18% (19918749696/110595407872), sparse 2% (2789687296), duration 502, read/write 41/35 MB/s
INFO: status: 19% (21055406080/110595407872), sparse 2% (2904084480), duration 532, read/write 37/34 MB/s
INFO: status: 20% (22162702336/110595407872), sparse 2% (3002482688), duration 556, read/write 46/42 MB/s
INFO: status: 21% (23265804288/110595407872), sparse 2% (3074723840), duration 587, read/write 35/33 MB/s
INFO: status: 22% (24364711936/110595407872), sparse 2% (3249545216), duration 621, read/write 32/27 MB/s
INFO: status: 23% (25451036672/110595407872), sparse 3% (3353899008), duration 656, read/write 31/28 MB/s
INFO: status: 24% (26575110144/110595407872), sparse 3% (3508682752), duration 690, read/write 33/28 MB/s
INFO: status: 25% (27669823488/110595407872), sparse 3% (3526324224), duration 720, read/write 36/35 MB/s
INFO: status: 26% (28806479872/110595407872), sparse 3% (3984486400), duration 738, read/write 63/37 MB/s
INFO: status: 27% (29871833088/110595407872), sparse 3% (4247502848), duration 765, read/write 39/29 MB/s
INFO: status: 28% (30983323648/110595407872), sparse 3% (4322873344), duration 798, read/write 33/31 MB/s
INFO: status: 29% (32073842688/110595407872), sparse 4% (4525948928), duration 831, read/write 33/26 MB/s
INFO: status: 30% (33193721856/110595407872), sparse 4% (4632596480), duration 868, read/write 30/27 MB/s
INFO: status: 31% (34313601024/110595407872), sparse 4% (4812324864), duration 904, read/write 31/26 MB/s
INFO: status: 32% (35433480192/110595407872), sparse 4% (4954460160), duration 939, read/write 31/27 MB/s
INFO: status: 33% (36543660032/110595407872), sparse 4% (5395038208), duration 964, read/write 44/26 MB/s
INFO: status: 34% (37622906880/110595407872), sparse 5% (5797015552), duration 989, read/write 43/27 MB/s
INFO: status: 35% (38734397440/110595407872), sparse 5% (6264979456), duration 1013, read/write 46/26 MB/s
INFO: status: 36% (39845888000/110595407872), sparse 5% (6298513408), duration 1049, read/write 30/29 MB/s
INFO: status: 37% (40936407040/110595407872), sparse 5% (6422835200), duration 1073, read/write 45/40 MB/s
INFO: status: 38% (42043703296/110595407872), sparse 5% (6525181952), duration 1101, read/write 39/35 MB/s
INFO: status: 39% (43171971072/110595407872), sparse 6% (6652497920), duration 1133, read/write 35/31 MB/s
INFO: status: 40% (44270878720/110595407872), sparse 6% (6786363392), duration 1160, read/write 40/35 MB/s
INFO: status: 41% (45382369280/110595407872), sparse 6% (6952738816), duration 1182, read/write 50/42 MB/s
INFO: status: 42% (46456111104/110595407872), sparse 6% (7035768832), duration 1215, read/write 32/30 MB/s
INFO: status: 43% (47592767488/110595407872), sparse 6% (7095812096), duration 1258, read/write 26/25 MB/s
INFO: status: 44% (48662315008/110595407872), sparse 6% (7173394432), duration 1290, read/write 33/30 MB/s
INFO: status: 45% (49798971392/110595407872), sparse 6% (7253426176), duration 1326, read/write 31/29 MB/s
INFO: status: 46% (50900369408/110595407872), sparse 6% (7320985600), duration 1351, read/write 44/41 MB/s
INFO: status: 47% (51992592384/110595407872), sparse 6% (7433076736), duration 1381, read/write 36/32 MB/s
INFO: status: 48% (53097398272/110595407872), sparse 6% (7542063104), duration 1414, read/write 33/30 MB/s
INFO: status: 49% (54211379200/110595407872), sparse 6% (7574585344), duration 1445, read/write 35/34 MB/s
INFO: status: 50% (55297703936/110595407872), sparse 6% (7579021312), duration 1480, read/write 31/30 MB/s
INFO: status: 51% (56446943232/110595407872), sparse 6% (7646482432), duration 1511, read/write 37/34 MB/s
INFO: status: 52% (57537462272/110595407872), sparse 6% (7733112832), duration 1544, read/write 33/30 MB/s
INFO: status: 53% (58657341440/110595407872), sparse 7% (7848218624), duration 1580, read/write 31/27 MB/s
INFO: status: 54% (59726888960/110595407872), sparse 7% (7969206272), duration 1609, read/write 36/32 MB/s
INFO: status: 55% (60874424320/110595407872), sparse 7% (8127733760), duration 1637, read/write 40/35 MB/s
INFO: status: 56% (61975035904/110595407872), sparse 7% (8178499584), duration 1662, read/write 44/41 MB/s
INFO: status: 57% (63061360640/110595407872), sparse 7% (8250806272), duration 1687, read/write 43/40 MB/s
INFO: status: 58% (64177045504/110595407872), sparse 7% (8298004480), duration 1712, read/write 44/42 MB/s
INFO: status: 59% (65275953152/110595407872), sparse 7% (8570023936), duration 1731, read/write 57/43 MB/s
INFO: status: 60% (66370666496/110595407872), sparse 7% (8595152896), duration 1766, read/write 31/30 MB/s
INFO: status: 61% (67490545664/110595407872), sparse 7% (8741163008), duration 1799, read/write 33/29 MB/s
INFO: status: 62% (68606230528/110595407872), sparse 8% (8867774464), duration 1839, read/write 27/24 MB/s
INFO: status: 63% (69675778048/110595407872), sparse 8% (8987119616), duration 1870, read/write 34/30 MB/s
INFO: status: 64% (70833405952/110595407872), sparse 8% (9136672768), duration 1907, read/write 31/27 MB/s
INFO: status: 65% (71924187136/110595407872), sparse 8% (9246375936), duration 1937, read/write 36/32 MB/s
INFO: status: 66% (73001861120/110595407872), sparse 8% (9386856448), duration 1966, read/write 37/32 MB/s
INFO: status: 67% (74155294720/110595407872), sparse 8% (9463013376), duration 1997, read/write 37/34 MB/s
INFO: status: 68% (75245813760/110595407872), sparse 8% (9663934464), duration 2020, read/write 47/38 MB/s
INFO: status: 69% (76318244864/110595407872), sparse 8% (9696071680), duration 2050, read/write 35/34 MB/s
INFO: status: 70% (77431963648/110595407872), sparse 8% (9754161152), duration 2079, read/write 38/36 MB/s
INFO: status: 71% (78571896832/110595407872), sparse 8% (9877127168), duration 2107, read/write 40/36 MB/s
INFO: status: 72% (79641444352/110595407872), sparse 9% (10191728640), duration 2134, read/write 39/27 MB/s
INFO: status: 73% (80757129216/110595407872), sparse 9% (10392121344), duration 2159, read/write 44/36 MB/s
INFO: status: 74% (81868619776/110595407872), sparse 9% (10463236096), duration 2191, read/write 34/32 MB/s
INFO: status: 75% (82954944512/110595407872), sparse 9% (10519248896), duration 2215, read/write 45/42 MB/s
INFO: status: 76% (84087406592/110595407872), sparse 9% (10620424192), duration 2241, read/write 43/39 MB/s
INFO: status: 77% (85182119936/110595407872), sparse 9% (10685943808), duration 2273, read/write 34/32 MB/s
INFO: status: 78% (86314582016/110595407872), sparse 9% (10799669248), duration 2300, read/write 41/37 MB/s
INFO: status: 79% (87413489664/110595407872), sparse 9% (10972737536), duration 2322, read/write 49/42 MB/s
INFO: status: 80% (88516591616/110595407872), sparse 10% (11092148224), duration 2354, read/write 34/30 MB/s
INFO: status: 81% (89586139136/110595407872), sparse 10% (11239116800), duration 2379, read/write 42/36 MB/s
INFO: status: 82% (90693435392/110595407872), sparse 10% (11337502720), duration 2407, read/write 39/36 MB/s
INFO: status: 83% (91803877376/110595407872), sparse 10% (11424546816), duration 2442, read/write 31/29 MB/s
INFO: status: 84% (92941582336/110595407872), sparse 10% (11529768960), duration 2475, read/write 34/31 MB/s
INFO: status: 85% (94011129856/110595407872), sparse 10% (11579260928), duration 2507, read/write 33/31 MB/s
INFO: status: 86% (95135203328/110595407872), sparse 10% (11709624320), duration 2537, read/write 37/33 MB/s
INFO: status: 87% (96238305280/110595407872), sparse 10% (11816312832), duration 2565, read/write 39/35 MB/s
INFO: status: 88% (97334460416/110595407872), sparse 10% (11906084864), duration 2590, read/write 43/40 MB/s
INFO: status: 89% (98486452224/110595407872), sparse 10% (12129329152), duration 2620, read/write 38/30 MB/s
INFO: status: 90% (99569500160/110595407872), sparse 11% (12222889984), duration 2653, read/write 32/29 MB/s
INFO: status: 91% (100654907392/110595407872), sparse 11% (12286955520), duration 2685, read/write 33/31 MB/s
INFO: status: 92% (101766397952/110595407872), sparse 11% (12386594816), duration 2723, read/write 29/26 MB/s
INFO: status: 93% (102873694208/110595407872), sparse 11% (12515844096), duration 2762, read/write 28/25 MB/s
INFO: status: 94% (103997767680/110595407872), sparse 11% (12704759808), duration 2798, read/write 31/25 MB/s
INFO: status: 95% (105100869632/110595407872), sparse 11% (12832964608), duration 2826, read/write 39/34 MB/s
INFO: status: 96% (106183000064/110595407872), sparse 11% (12877524992), duration 2852, read/write 41/39 MB/s
INFO: status: 97% (107315462144/110595407872), sparse 11% (12918489088), duration 2883, read/write 36/35 MB/s
INFO: status: 98% (108407029760/110595407872), sparse 11% (12940406784), duration 2914, read/write 35/34 MB/s
INFO: status: 99% (109517471744/110595407872), sparse 11% (13029412864), duration 2948, read/write 32/30 MB/s
INFO: status: 100% (110595407872/110595407872), sparse 11% (13038178304), duration 2979, read/write 34/34 MB/s
INFO: transferred 110595 MB in 2979 seconds (37 MB/s)
INFO: archive file size: 47.01GB
INFO: Finished Backup of VM 100 (00:49:46)
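
As a cross-check on the totals in this log, here is a minimal sketch (the function name avg_rate is hypothetical). It recomputes the average read rate from the byte count and duration on the 100% line, then the rate for the non-sparse data only, since sparse blocks are read but never written to the archive.

```shell
#!/bin/sh
# Values copied from the 100% status line of the log above.
total_bytes=110595407872   # full size of the virtual disk
sparse_bytes=13038178304   # zero blocks detected as sparse
duration_s=2979            # elapsed seconds

# Print bytes/seconds as MB/s (10^6 bytes per second), one decimal place.
avg_rate() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / s / 1e6 }'
}

# Average read rate over the whole disk.
avg_rate "$total_bytes" "$duration_s"
# Average rate for the data that was not sparse.
avg_rate "$((total_bytes - sparse_bytes))" "$duration_s"
```

Both figures stay well below the 150 MB/s measured with rsync to the same NFS target, which suggests looking at the read side as well as the write side.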

I really don't understand where the problem is.

Could someone help me?

Thanks
 

basanisi

Hello,

I have found new information. My /etc/pve/storage.cfg is:

nfs: rackstation_cluster
        export /volume2/Backup_VM
        path /mnt/pve/rackstation_cluster
        server 10.165.2.113
        content vztmpl,iso,images,backup,rootdir
        maxfiles 7
        options vers=3,async,tcp

I can unmount it without problems, and Proxmox remounts it normally, but when I check the NFS mount with

nfsstat -m

I see this:

/mnt/pve/rackstation_cluster from 10.165.2.113:/volume2/Backup_VM
Flags: rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.165.2.113,mountvers=3,mountport=892,mountproto=tcp,local_lock=none,addr=10.165.2.113

The NFS mount point is not mounted async as specified in /etc/pve/storage.cfg.

I then verified the Synology /etc/exports; everything is configured with async, no realtime.

How can I force an async mount in /etc/pve/storage.cfg?
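
A note on reading that output: on a Linux client, async is the mount default and is not printed among the mount flags; only an explicit sync would show up there. The async in the Synology /etc/exports is a separate, server-side setting. A minimal sketch (the helper name mount_sync_mode is hypothetical) that checks a flags string like the one printed by nfsstat -m:

```shell
#!/bin/sh
# Print "sync" if the comma-separated option list contains a standalone
# "sync" option, otherwise "async" (the Linux default, which is not listed).
mount_sync_mode() {
    case ",$1," in
        *,sync,*) echo sync ;;
        *)        echo async ;;
    esac
}

# Flags copied from the nfsstat -m output above.
flags='rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.165.2.113,mountvers=3,mountport=892,mountproto=tcp,local_lock=none,addr=10.165.2.113'
mount_sync_mode "$flags"
```

If that reading is right, the missing async flag in the output is expected and is unlikely to explain the slow backups.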
 

alexskysilk

Renowned Member
Oct 16, 2015
804
105
63
Chatsworth, CA
www.skysilk.com
There is no simple answer; you're going to have to test every step individually to determine the bottleneck.

1. Are you limiting vzdump speed?

Assuming you aren't...
2. Create a snapshot, then pipe dd to lz4 and out to local disk. What was the transfer speed?
3. Copy the file to your network target. How fast was that?

What was the CPU utilization at each step? What network traffic was present when sending to the network target, both at the source and destination? How sure are you that you're not bound by the target speed? Are you using the same network links for cluster traffic, Ceph traffic, NFS traffic, and internet traffic?
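
Steps 2 and 3 could be scripted roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the source can be any block device or file that exposes the VM disk, and lz4 is assumed to be installed. Each step is timed separately so that the slow stage stands out.

```shell
#!/bin/sh
# Step 2: read the disk and compress it to a local file; print elapsed seconds.
# $1 = source device or file, $2 = local output file, $3 = compressor (default: lz4)
read_and_compress() {
    start=$(date +%s)
    dd if="$1" bs=4M 2>/dev/null | "${3:-lz4}" > "$2"
    echo "read+compress: $(( $(date +%s) - start )) s"
}

# Step 3: copy the compressed file to the network target; print elapsed seconds.
# $1 = local file, $2 = destination on the network mount
copy_to_target() {
    start=$(date +%s)
    cp "$1" "$2"
    echo "copy: $(( $(date +%s) - start )) s"
}
```

A run would look like `read_and_compress <device> /root/img.lz4` followed by `copy_to_target /root/img.lz4 <nfs-mount>/`, with both placeholders replaced by local paths. Watching CPU and network counters during each call answers the remaining questions.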
 

basanisi

Hello alexskysilk,

First, my vzdump jobs are scheduled at night, when nobody is at work.

My vzdump configuration is untouched:

# vzdump default settings
#tmpdir: DIR
#dumpdir: DIR
#storage: STORAGE_ID
#mode: snapshot|suspend|stop
#bwlimit: 0
#ionice: 1
#lockwait: MINUTES
#stopwait: MINUTES
#size: MB
#stdexcludes: BOOLEAN
#mailto: ADDRESSLIST
#maxfiles: N
#script: FILENAME
#exclude-path: PATHLIST
#pigz: N:

I also created an lz4 of a raw disk; it took 17:39 at an average speed of 67.9 MB/s, and CPU utilisation was at 4-5 % during the working day.

dd if=/dev/pve/vm-106-disk-0 | pv | lz4 > /root/img.lz4
67GiB 0:17:39 [64.7MiB/s] [ <=> ]
140509184+0 records in
140509184+0 records out
71940702208 bytes (72 GB, 67 GiB) copied, 1059.62 s, 67.9 MB/s

Then I rsynced the file to the RackStation:

time rsync -avP --stats img.lz4 /mnt/pve/rackstation_cluster

It took 6 min 10 s to copy the file, at approximately 120 MB/s, during the working day.

sent 44,532,687,584 bytes received 135 bytes 120,196,188.18 bytes/sec
total size is 44,521,817,909 speedup is 1.00
real 6m10.652s
user 5m28.936s
sys 2m7.808s
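
To put the two measurements side by side, here is a minimal sketch (the function name stage_rates and the stage labels are hypothetical). It converts each stage's byte count and elapsed time, taken from the dd and rsync output above, into MB/s and lists the slowest stage first.

```shell
#!/bin/sh
# Read "name bytes seconds" rows on stdin and print "name MB/s"
# (10^6 bytes per second), sorted so the slowest stage comes first.
stage_rates() {
    awk '{ printf "%s %.1f\n", $1, $2 / $3 / 1e6 }' | sort -k2,2n
}

# Byte counts and durations copied from the dd and rsync output above.
stage_rates <<'EOF'
dd_lz4_local 71940702208 1059.62
rsync_to_nfs 44532687584 370.652
EOF
```

In this test the copy to NFS is the faster stage, so the network target does not look like the limiting factor here.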

All 3 servers are configured with 4 gigabit NICs in 2 LACP links: 1 for VMs (servers only: 7 Windows and 16 Linux), internet (only used for updates), NFS and cluster traffic, and 1 for Ceph. All LACP links use MTU 9000 jumbo frames.

Thanks for your answer.

I hope this helps.
 

alexskysilk

basanisi said:
I also created an lz4 of a raw disk; it took 17:39 at an average speed of 67.9 MB/s, and CPU utilisation was at 4-5 % during the working day.

That seems not far off the speeds you were seeing with the backup itself. In my experience vzdump is generally not very fast, but you can probably get closer to the speed above by setting a tmpdir; see https://pve.proxmox.com/pve-docs/vzdump.1.html for more information.

basanisi said:
All 3 servers are configured with 4 gigabit NICs in 2 LACP links: 1 for VMs (servers only: 7 Windows and 16 Linux), internet (only used for updates), NFS and cluster traffic, and 1 for Ceph. All LACP links use MTU 9000 jumbo frames.

I would suggest a different configuration. While LACP is nice, it effectively limits you to 2 interfaces, each with the latency of 1. Consider this approach instead:

1 interface for internet/intranet
1 interface for ceph
1 interface for cluster ring0
1 interface for cluster ring1 (you may want to intermingle ceph public on this interface)

Each interface should be kept on a separate VLAN. Depending on the purpose of this cluster (lab, live production, sandbox), you may also want to get 10 Gb interfaces and set up a mesh network (https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server) so you don't need to buy a switch. You can then move Ceph and/or cluster traffic there.
 

baggar11

Member
Aug 14, 2015
22
1
23
Portland, OR
www.puddlehunters.net
I have actually noticed the same, but in my case the performance dropped after a recent upgrade from v4.4 to v5.3. bwlimit was set to 524288 on v4.4. I have tried setting it to 0 on v5.3 with no change. The processor sits at about 20% throughout the backup process. The network is 10G fiber, backing up to a CIFS share.

Here's a snippet of a vzdump on v4.4.

105: Jan 14 04:00:03 INFO: Starting Backup of VM 105 (qemu)
105: Jan 14 04:00:03 INFO: status = running
105: Jan 14 04:00:05 INFO: update VM 105: -lock backup
105: Jan 14 04:00:05 INFO: VM Name: LibreNMS
105: Jan 14 04:00:05 INFO: include disk 'sata0' 'sdb1:105/vm-105-disk-1.qcow2' 20G
105: Jan 14 04:00:05 INFO: backup mode: snapshot
105: Jan 14 04:00:05 INFO: bandwidth limit: 5242880 KB/s
105: Jan 14 04:00:05 INFO: ionice priority: 7
105: Jan 14 04:00:05 INFO: creating archive '/mnt/backup/dump/vzdump-qemu-105-2019_01_14-04_00_03.vma.lzo'
105: Jan 14 04:00:05 INFO: started backup task '70c5ccc9-920d-42f9-9dee-3ea0e2788582'
105: Jan 14 04:00:08 INFO: status: 1% (380108800/21474836480), sparse 0% (159137792), duration 3, 126/73 MB/s
105: Jan 14 04:00:11 INFO: status: 3% (832438272/21474836480), sparse 1% (408997888), duration 6, 150/67 MB/s
105: Jan 14 04:00:14 INFO: status: 7% (1537343488/21474836480), sparse 4% (884371456), duration 9, 234/76 MB/s
105: Jan 14 04:00:17 INFO: status: 10% (2189426688/21474836480), sparse 6% (1346678784), duration 12, 217/63 MB/s
105: Jan 14 04:00:20 INFO: status: 11% (2527723520/21474836480), sparse 6% (1414516736), duration 15, 112/90 MB/s
105: Jan 14 04:00:23 INFO: status: 13% (2858418176/21474836480), sparse 6% (1415802880), duration 18, 110/109 MB/s
105: Jan 14 04:00:26 INFO: status: 14% (3189833728/21474836480), sparse 6% (1420103680), duration 21, 110/109 MB/s
105: Jan 14 04:00:29 INFO: status: 16% (3542679552/21474836480), sparse 6% (1424891904), duration 24, 117/116 MB/s
105: Jan 14 04:00:32 INFO: status: 18% (3915448320/21474836480), sparse 6% (1430728704), duration 27, 124/122 MB/s
105: Jan 14 04:00:35 INFO: status: 20% (4334747648/21474836480), sparse 6% (1462960128), duration 30, 139/129 MB/s
105: Jan 14 04:00:38 INFO: status: 21% (4717150208/21474836480), sparse 7% (1510473728), duration 33, 127/111 MB/s
105: Jan 14 04:00:41 INFO: status: 23% (5068226560/21474836480), sparse 7% (1517350912), duration 36, 117/114 MB/s
105: Jan 14 04:00:44 INFO: status: 25% (5393743872/21474836480), sparse 7% (1521676288), duration 39, 108/107 MB/s
105: Jan 14 04:00:47 INFO: status: 26% (5785976832/21474836480), sparse 7% (1523625984), duration 42, 130/130 MB/s
105: Jan 14 04:00:50 INFO: status: 28% (6149505024/21474836480), sparse 7% (1646301184), duration 45, 121/80 MB/s
105: Jan 14 04:00:53 INFO: status: 32% (6898974720/21474836480), sparse 10% (2202128384), duration 48, 249/64 MB/s
105: Jan 14 04:00:56 INFO: status: 40% (8723890176/21474836480), sparse 18% (3876777984), duration 51, 608/50 MB/s
105: Jan 14 04:00:59 INFO: status: 42% (9184739328/21474836480), sparse 18% (4051599360), duration 54, 153/95 MB/s
105: Jan 14 04:01:02 INFO: status: 44% (9532669952/21474836480), sparse 18% (4054560768), duration 57, 115/114 MB/s
105: Jan 14 04:01:05 INFO: status: 45% (9859432448/21474836480), sparse 18% (4057165824), duration 60, 108/108 MB/s
105: Jan 14 04:01:08 INFO: status: 47% (10255335424/21474836480), sparse 18% (4057747456), duration 63, 131/131 MB/s
105: Jan 14 04:01:11 INFO: status: 51% (11016863744/21474836480), sparse 21% (4510404608), duration 66, 253/102 MB/s
105: Jan 14 04:01:14 INFO: status: 54% (11623268352/21474836480), sparse 22% (4737974272), duration 69, 202/126 MB/s
105: Jan 14 04:01:17 INFO: status: 68% (14648541184/21474836480), sparse 35% (7630114816), duration 72, 1008/44 MB/s
105: Jan 14 04:01:20 INFO: status: 80% (17382375424/21474836480), sparse 47% (10206633984), duration 75, 911/52 MB/s
105: Jan 14 04:01:23 INFO: status: 94% (20271202304/21474836480), sparse 60% (12966248448), duration 78, 962/43 MB/s
105: Jan 14 04:01:24 INFO: status: 100% (21474836480/21474836480), sparse 65% (14149726208), duration 79, 1203/20 MB/s
105: Jan 14 04:01:24 INFO: transferred 21474 MB in 79 seconds (271 MB/s)
105: Jan 14 04:01:26 INFO: archive file size: 1.27GB
105: Jan 14 04:01:26 INFO: delete old backup '/mnt/backup/dump/vzdump-qemu-105-2019_01_07-04_00_02.vma.lzo'
105: Jan 14 04:01:28 INFO: Finished Backup of VM 105 (00:01:25

Here's a snip from a vzdump backup on v5.3.

105: 2019-01-29 04:00:03 INFO: Starting Backup of VM 105 (qemu)
105: 2019-01-29 04:00:03 INFO: status = running
105: 2019-01-29 04:00:05 INFO: update VM 105: -lock backup
105: 2019-01-29 04:00:05 INFO: VM Name: LibreNMS
105: 2019-01-29 04:00:05 INFO: include disk 'sata0' 'sdb1:105/vm-105-disk-1.qcow2' 20G
105: 2019-01-29 04:00:05 INFO: backup mode: snapshot
105: 2019-01-29 04:00:05 INFO: ionice priority: 7
105: 2019-01-29 04:00:05 INFO: creating archive '/mnt/backup/dump/vzdump-qemu-105-2019_01_29-04_00_03.vma.lzo'
105: 2019-01-29 04:00:05 INFO: started backup task 'abde2d4c-989c-4fba-bd94-3bfe7350ec8c'
105: 2019-01-29 04:00:08 INFO: status: 1% (300285952/21474836480), sparse 0% (198868992), duration 3, read/write 100/33 MB/s
105: 2019-01-29 04:00:11 INFO: status: 2% (494141440/21474836480), sparse 1% (292048896), duration 6, read/write 64/33 MB/s
105: 2019-01-29 04:00:14 INFO: status: 3% (710803456/21474836480), sparse 1% (413138944), duration 9, read/write 72/31 MB/s
105: 2019-01-29 04:00:17 INFO: status: 4% (881852416/21474836480), sparse 2% (483446784), duration 12, read/write 57/33 MB/s
105: 2019-01-29 04:00:20 INFO: status: 5% (1140326400/21474836480), sparse 3% (651141120), duration 15, read/write 86/30 MB/s
105: 2019-01-29 04:00:23 INFO: status: 7% (1505230848/21474836480), sparse 4% (910045184), duration 18, read/write 121/35 MB/s
105: 2019-01-29 04:00:29 INFO: status: 8% (1725693952/21474836480), sparse 4% (915234816), duration 24, read/write 36/35 MB/s
105: 2019-01-29 04:00:32 INFO: status: 9% (1942355968/21474836480), sparse 4% (1034231808), duration 27, read/write 72/32 MB/s
105: 2019-01-29 04:00:37 INFO: status: 10% (2147614720/21474836480), sparse 5% (1085399040), duration 32, read/write 41/30 MB/s
105: 2019-01-29 04:00:40 INFO: status: 11% (2447900672/21474836480), sparse 6% (1291255808), duration 35, read/write 100/31 MB/s
105: 2019-01-29 04:00:43 INFO: status: 12% (2622750720/21474836480), sparse 6% (1306357760), duration 38, read/write 58/53 MB/s
105: 2019-01-29 04:00:46 INFO: status: 13% (2801401856/21474836480), sparse 6% (1306611712), duration 41, read/write 59/59 MB/s
105: 2019-01-29 04:00:50 INFO: status: 14% (3033268224/21474836480), sparse 6% (1307566080), duration 45, read/write 57/57 MB/s
105: 2019-01-29 04:00:54 INFO: status: 15% (3223322624/21474836480), sparse 6% (1316311040), duration 49, read/write 47/45 MB/s
105: 2019-01-29 04:00:58 INFO: status: 16% (3438411776/21474836480), sparse 6% (1319682048), duration 53, read/write 53/52 MB/s
105: 2019-01-29 04:01:03 INFO: status: 17% (3671851008/21474836480), sparse 6% (1321385984), duration 58, read/write 46/46 MB/s
105: 2019-01-29 04:01:07 INFO: status: 18% (3896115200/21474836480), sparse 6% (1326874624), duration 62, read/write 56/54 MB/s
105: 2019-01-29 04:01:11 INFO: status: 19% (4112777216/21474836480), sparse 6% (1331625984), duration 66, read/write 54/52 MB/s
105: 2019-01-29 04:01:16 INFO: status: 20% (4348444672/21474836480), sparse 6% (1331961856), duration 71, read/write 47/47 MB/s
105: 2019-01-29 04:01:19 INFO: status: 21% (4522967040/21474836480), sparse 6% (1332391936), duration 74, read/write 58/58 MB/s
105: 2019-01-29 04:01:22 INFO: status: 22% (4736155648/21474836480), sparse 6% (1369870336), duration 77, read/write 71/58 MB/s
105: 2019-01-29 04:01:26 INFO: status: 23% (4990828544/21474836480), sparse 6% (1420713984), duration 81, read/write 63/50 MB/s
105: 2019-01-29 04:01:29 INFO: status: 24% (5169479680/21474836480), sparse 6% (1420959744), duration 84, read/write 59/59 MB/s
105: 2019-01-29 04:01:34 INFO: status: 25% (5415960576/21474836480), sparse 6% (1421684736), duration 89, read/write 49/49 MB/s
105: 2019-01-29 04:01:37 INFO: status: 26% (5587599360/21474836480), sparse 6% (1421828096), duration 92, read/write 57/57 MB/s
105: 2019-01-29 04:01:41 INFO: status: 27% (5838536704/21474836480), sparse 6% (1422024704), duration 96, read/write 62/62 MB/s
105: 2019-01-29 04:01:45 INFO: status: 28% (6055133184/21474836480), sparse 6% (1454034944), duration 100, read/write 54/46 MB/s
105: 2019-01-29 04:01:48 INFO: status: 31% (6682312704/21474836480), sparse 9% (1961709568), duration 103, read/write 209/39 MB/s
105: 2019-01-29 04:01:51 INFO: status: 32% (6891372544/21474836480), sparse 9% (2060525568), duration 106, read/write 69/36 MB/s
105: 2019-01-29 04:01:56 INFO: status: 35% (7520583680/21474836480), sparse 11% (2499809280), duration 111, read/write 125/37 MB/s
105: 2019-01-29 04:01:59 INFO: status: 41% (8924954624/21474836480), sparse 17% (3784228864), duration 114, read/write 468/39 MB/s
105: 2019-01-29 04:02:02 INFO: status: 42% (9122611200/21474836480), sparse 17% (3839746048), duration 117, read/write 65/47 MB/s
105: 2019-01-29 04:02:05 INFO: status: 43% (9297461248/21474836480), sparse 17% (3858763776), duration 120, read/write 58/51 MB/s
105: 2019-01-29 04:02:08 INFO: status: 44% (9481224192/21474836480), sparse 17% (3859120128), duration 123, read/write 61/61 MB/s
105: 2019-01-29 04:02:11 INFO: status: 45% (9669967872/21474836480), sparse 17% (3859316736), duration 126, read/write 62/62 MB/s
105: 2019-01-29 04:02:16 INFO: status: 46% (9928245248/21474836480), sparse 17% (3862032384), duration 131, read/write 51/51 MB/s
105: 2019-01-29 04:02:19 INFO: status: 47% (10114695168/21474836480), sparse 17% (3862294528), duration 134, read/write 62/62 MB/s
105: 2019-01-29 04:02:22 INFO: status: 48% (10314907648/21474836480), sparse 17% (3862949888), duration 137, read/write 66/66 MB/s
105: 2019-01-29 04:02:25 INFO: status: 50% (10903683072/21474836480), sparse 20% (4314316800), duration 140, read/write 196/45 MB/s
105: 2019-01-29 04:02:28 INFO: status: 51% (11092426752/21474836480), sparse 20% (4314570752), duration 143, read/write 62/62 MB/s
105: 2019-01-29 04:02:31 INFO: status: 52% (11281629184/21474836480), sparse 20% (4314570752), duration 146, read/write 63/63 MB/s
105: 2019-01-29 04:02:34 INFO: status: 60% (12923699200/21474836480), sparse 27% (5820190720), duration 149, read/write 547/45 MB/s
105: 2019-01-29 04:02:37 INFO: status: 65% (14117240832/21474836480), sparse 32% (6915428352), duration 152, read/write 397/32 MB/s
105: 2019-01-29 04:02:40 INFO: status: 73% (15709896704/21474836480), sparse 39% (8412737536), duration 155, read/write 530/31 MB/s
105: 2019-01-29 04:02:43 INFO: status: 87% (18790481920/21474836480), sparse 53% (11424215040), duration 158, read/write 1026/23 MB/s
105: 2019-01-29 04:02:46 INFO: status: 100% (21474836480/21474836480), sparse 65% (14043615232), duration 161, read/write 894/21 MB/s
105: 2019-01-29 04:02:46 INFO: transferred 21474 MB in 161 seconds (133 MB/s)
105: 2019-01-29 04:02:46 INFO: archive file size: 1.32GB
105: 2019-01-29 04:02:46 INFO: delete old backup '/mnt/backup/dump/vzdump-qemu-105-2019_01_22-04_00_03.vma.lzo'
105: 2019-01-29 04:02:48 INFO: Finished Backup of VM 105 (00:02:45)
 

basanisi

alexskysilk said:
That seems not far off the speeds you were seeing with the backup itself. In my experience vzdump is generally not very fast, but you can probably get closer to the speed above by setting a tmpdir; see https://pve.proxmox.com/pve-docs/vzdump.1.html for more information.

I would suggest a different configuration. While LACP is nice, it effectively limits you to 2 interfaces, each with the latency of 1. Consider this approach instead:

1 interface for internet/intranet
1 interface for ceph
1 interface for cluster ring0
1 interface for cluster ring1 (you may want to intermingle ceph public on this interface)

Each interface should be kept on a separate VLAN. Depending on the purpose of this cluster (lab, live production, sandbox), you may also want to get 10 Gb interfaces and set up a mesh network (https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server) so you don't need to buy a switch. You can then move Ceph and/or cluster traffic there.

The Proxmox servers are in production with many users, who use multiple KVM VMs. The only modification I could make easily was to put the Ceph LACP bond on its own VLAN, but vzdump is still slow.

We ran another test: we restored a VM to local_LVM and backed it up to the RackStation. The average speed climbed to 133 MB/s, compared with an average of 33 MB/s for a backup of the same VM stored on Ceph.

So I think Ceph is the culprit, but I don't know where this comes from, because the Ceph OSDs are created on hardware RAID 10 storage.

Do you have an idea where to look?
 

basanisi

baggar11 said:
I have actually noticed the same, but in my case the performance dropped after a recent upgrade from v4.4 to v5.3. bwlimit was set to 524288 on v4.4. I have tried setting it to 0 on v5.3 with no change. The processor sits at about 20% throughout the backup process. The network is 10G fiber, backing up to a CIFS share.

Here's a snippet of a vzdump on v4.4.

Here's a snip from a vzdump backup on v5.3.

I think you have put your finger on it; I think it is a big regression.

Are your VMs KVM with raw disks stored on Ceph?

Thanks for your help.
 

alexskysilk

basanisi said:
We ran another test: we restored a VM to local_LVM and backed it up to the RackStation. The average speed climbed to 133 MB/s, compared with an average of 33 MB/s for a backup of the same VM stored on Ceph.

So I think Ceph is the culprit, but I don't know where this comes from, because the Ceph OSDs are created on hardware RAID 10 storage.

I completely misread your original report of lz4 performance. How fast is creating an lz4 archive from a Ceph snapshot to local disk?
 

basanisi

I made a test backup to the Proxmox local drive, and the results are nearly identical for Ceph to local versus Ceph to RackStation. The average speed is 35 MB/s.

I think the problem comes from the Ceph storage, but I don't know where the problem is.

When I restore a VM to local storage and then back it up to the RackStation, the average speed is nearly 130 MB/s, not 35 MB/s.

I also activated pigz in vzdump.conf; it improved things a little.

How do you back up a VM with lz4?

Thanks for your help.
 

Attachment: log.txt (10.1 KB)

Alwin

Proxmox Retired Staff
Retired Staff
Aug 1, 2017
4,617
457
88
basanisi said:
So I think Ceph is the culprit, but I don't know where this comes from, because the Ceph OSDs are created on hardware RAID 10 storage.

RAID 10 for Ceph OSDs is a performance brake on its own. By design, Ceph is made to withstand disk/node/rack failures (with proper configuration).

basanisi said:
I made a test backup to the Proxmox local drive, and the results are nearly identical for Ceph to local versus Ceph to RackStation. The average speed is 35 MB/s.

Depending on cache and backup mode, the performance will vary. What are the results of a 'rados bench'?
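
For reference, a typical rados bench sequence looks roughly like the sketch below. The pool name mypool is a placeholder, and the function name and the RADOS variable are hypothetical; they exist only so the commands can be dry-run with echo. Note that the write phase puts real load on the cluster, so on a production system it is safer to run it outside working hours.

```shell
#!/bin/sh
# Run a write benchmark, then a sequential read benchmark on the objects
# it left behind, then remove the benchmark objects.
# $1 = pool name, $2 = seconds per phase (default: 60)
# Set RADOS=echo to print the commands instead of running them.
bench_pool() {
    rados_cmd=${RADOS:-rados}
    "$rados_cmd" bench -p "$1" "${2:-60}" write --no-cleanup
    "$rados_cmd" bench -p "$1" "${2:-60}" seq
    "$rados_cmd" -p "$1" cleanup
}

# Dry run: only prints the three command lines.
RADOS=echo bench_pool mypool 60
```

Calling `bench_pool <pool> 60` without RADOS set would run the real benchmark against the named pool.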
 
