VZDump slow on ceph images, RBD export fast

valeech
Well-Known Member
May 4, 2016
Hello,

I have a 3 node cluster running ceph. I have a ceph monitor and 6 OSDs on each node. Each OSD journal is mapped to an NVMe partition on a separate disk. Each node has a dedicated 10G nic for ceph public network and a dedicated nic for ceph cluster network.

On the same LAN as the ceph public network lives a separate NFS server.

All of my VMs have their disks stored in the ceph pool. I want to back up my VMs to the NFS server. When the backup job runs, I get about 30MBps, with or without compression. I have tried both a snapshot backup and a stopped-VM backup; no difference. As I've peeled back the onion to find where the issue is, it looks like it is VZDump.

If I run "rbd export ceph/vm-100-disk-1 /mnt/pve/nfsserver/vm-100-disk-1.raw" I get transfer speeds of 395MBps. This rules out ceph or NFS or the network as being an issue.

Any idea what could be causing VZDump to be so much slower than the rbd export?

I found this thread earlier but wasn't sure if it was related, because the link provided in that thread mentions patches that increased ceph export speeds. I don't have issues with ceph export speeds, only with VZDump, assuming VZDump actually does a ceph export at all.

Thanks!


Current running versions:
# pveversion -v
proxmox-ve: 4.2-48 (running kernel: 4.4.6-1-pve)
pve-manager: 4.2-2 (running version: 4.2-2/725d76f0)
pve-kernel-4.4.6-1-pve: 4.4.6-48
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-39
qemu-server: 4.0-72
pve-firmware: 1.1-8
libpve-common-perl: 4.0-59
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-50
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-14
pve-container: 1.0-62
pve-firewall: 2.0-25
pve-ha-manager: 1.0-28
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve9~jessie
 
Please upgrade to the current version; there were some fixes regarding backups from and restores to ceph that should speed things up.
 
Hi Fabian,

I set up the exact same cluster in the lab running 4.2. Tested an NFS backup: 30MB/s. Tested a ceph export to the same NFS filesystem: 400MB/s.

I upgraded the cluster to the latest 4.3. No change: 30MB/s on backup, 400MB/s on ceph export.
 
Did you stop and start the VM after upgrading to 4.3?
 
The VM I tested was stopped before and after the upgrade. I never started it.

I can try that though...
 
That would make sense. Is there a mechanism to adjust the block size within vzdump?
 
I ended up writing my own backup script to dump the images out of ceph to an NFS share. Let me know if you would like to try it.
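The basic idea is just a loop over rbd export, something like the sketch below (pool name, destination path and the VM ID list are placeholders; snapshotting each image first keeps the export consistent while the VM is running):

#!/bin/bash
# Sketch: dump every ceph-backed disk of the listed VMs to an NFS share via rbd export.
POOL=ceph
DEST=/mnt/pve/nfsserver/rbd-dumps
STAMP=$(date +%Y%m%d)

mkdir -p "$DEST"
for VMID in 100 101 102; do
    for IMG in $(rbd -p "$POOL" ls | grep "^vm-${VMID}-disk-"); do
        rbd snap create "${POOL}/${IMG}@backup-${STAMP}"
        rbd export "${POOL}/${IMG}@backup-${STAMP}" "${DEST}/${IMG}-${STAMP}.raw"
        rbd snap rm "${POOL}/${IMG}@backup-${STAMP}"
    done
done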
This solution is also on our agenda, but handling these dumps is not as convenient as with the VMA files: you can't import them into another PVE machine in an emergency without having a running ceph cluster.

It's a pity to have a fast ceph cluster with SSDs but be stuck with these ultra-slow backups :-(
 
Are there any updates on this? We're facing the same issues, and as robhost mentioned, VMA files are really what's needed.

"Ultra-slow" is definitely the proper description for backups coming off a Ceph backend at this point.
 
Let me raise this topic again, please.

Has anybody managed to run backups faster than 40MB/s?
The fact is, rbd export works about 10x faster than the Proxmox backup tool.
Proxmox is a great open source product, but backups are definitely something that needs to be sorted out.

Spirit's assumption was that the problem is vzdump's 64K reads against the default 4M object size of an RBD image, which produces far too many IOPS during the backup and causes the performance degradation.
VZDump slow on ceph images, RBD export fast

So I ran a test to check this theory.
I created an RBD image with --object-size 64K and attached it to a freshly created VM.
Unfortunately, it made no difference to backup performance.
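For reference, the test image was created along these lines (the image name and size are just what I happened to use; the default object size would be 4M):

# create a test image with 64K objects instead of the default 4M
rbd create ceph/vm-999-disk-1 --size 32G --object-size 64K
# confirm the object size / order that was actually applied
rbd info ceph/vm-999-disk-1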

Any suggestions? Does anybody know what the bottleneck is?

thanks
 
The problem is that the Proxmox backup reads in 64K blocks, so you get a lot more IOPS to ceph (and more CPU usage), plus the network round-trip time for every request.
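
As a rough back-of-the-envelope check (the ~2ms per ceph read round trip below is an assumed figure, not a measurement), a serial reader issuing one request at a time is latency-bound:

# throughput of a serial reader = request size / round-trip time
echo "64K reads: $((64 / 2)) KB per ms = ~32MB/s"     # about what vzdump achieves
echo "4M reads:  $((4096 / 2)) KB per ms = ~2GB/s"    # latency no longer the limit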

I remember somebody did a test with the plain qemu backup (different from the Proxmox custom backup code), where it's possible to tune the block size, and it was really much faster.




"Hi Alexandre,

Am 02.03.2016 um 15:23 schrieb Alexandre DERUMIER:
>>>We're currently implementing our own backup system which uses the
>>>drive-backup qmp routine to dump qcow2. We don't need the config files.
>>>And this works with ceph and 4MB cluster size.
>
> How about speed performance (64K,4MB) and stability vs proxmox vma
> format ?

We (Stefan and I) do not have benchmarks that are fully comparable.

We tested backing up a 100GB virtio drive from ceph to NFS:

- with VMA, default 64kb cluster size and without read/write limits:

| | Snapshot | Suspend | Stop |
|------+----------+---------+-------|
| none | 09:55 | 09:37 | 09:35 |
| LZO | 10:24 | 10:26 | 10:20 |
| GZip | 31:52 | 32:22 | |

- QMP drive-backup (params sync => full, format => qcow2), 4MB cluster
size and 130 mb/s, 215 ops/s read; 90 mb/s, 155 ops/s write limits:

9:15 Minutes.

QMP drive-backup with above limits and default cluster size 64KB took
very long, more than 40 minutes."
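
For anyone who wants to reproduce that kind of test, the drive-backup call can be sent to the VM's QMP socket by hand. A rough sketch (the socket path and drive id follow the usual qemu-server naming for VM 100 with a virtio0 disk, so check yours; poking the QMP socket manually can interfere with what pvedaemon does, so treat it as a lab experiment only):

# open the VM's QMP monitor socket and start a full backup to a qcow2 on the NFS share
socat - UNIX-CONNECT:/var/run/qemu-server/100.qmp <<'EOF'
{ "execute": "qmp_capabilities" }
{ "execute": "drive-backup",
  "arguments": { "device": "drive-virtio0",
                 "sync": "full",
                 "format": "qcow2",
                 "target": "/mnt/pve/nfsserver/vm-100-virtio0.qcow2" } }
EOF

drive-backup returns immediately; the copy itself runs as a background block job that can be watched with query-block-jobs.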
 
What Iva-a-an mentioned is exactly right; this issue NEEDS to be sorted. If Spirit is correct and it's the 64KB block issue in vzdump rather than in QEMU, why hasn't it been fixed yet? The problem has existed for a very long time now, not just a few months. I say this because if it is a Proxmox issue, it is under their direct control, so you would expect it to have been rectified by now.

Thanks Iva-a-an for pushing this topic to the top again; it is definitely of great importance and should be addressed as soon as possible.

We've always been impressed with Proxmox as a platform, and it still serves us and our customers quite well overall, but recently support and software quality haven't been where they should be. By saying this I'm not trying to make anyone angry, just pointing out what we've observed. We've been using Proxmox in production since version 2.0 and would love to see it rise to the status of RHEV or VMware so that more and more admins can see how nicely the platform works (it's nearly a work of art, with the nice web GUI and all), but there are some sticking points that would be great to have addressed, and this is a big one right now.
 
AFAIK VMware cancelled their own backup tools, so Proxmox VE is just not that far behind here; correct me if I am wrong. VMware does not support Ceph in ANY way, so I do not see why VMware is "ahead" here. RHEV does not have a comparable backup solution either. So please post to the point and do not tell the community that VMware or others are so much "ahead".

But yes, the problem is known, though not that simple to fix. We see around 60 MB/s in ceph backup jobs, so it's not as slow as some mention here.
 
Please don't misunderstand. I'm not saying VMware is "ahead" in tech, but rather in acceptance, and I would love to see Proxmox gain serious market acceptance to match the big guys; that's all I was attempting to convey. We continue to use Proxmox because it gives very good performance and stability. We've never been impressed with VMware, and we replaced our XenServer setup with Proxmox, which should give you an idea of how we feel about Proxmox's abilities as a virtual hosting platform.

In the case of Proxmox backups, backups have always been one of the features considered and needed by those who use it, so it's not unreasonable to expect an existing feature to be a top priority no matter what the competitors do.

With that said, the platform overall does what's expected and does it very well, so a tremendous "THANKS!!!" is deserved, and you have it from our part.

We had no idea that it was a "known problem but not simple to fix", as what's seen in the forums doesn't convey that, so thank you for making that clear. 60 MB/s is slow when you are doing many backups per night, and it's a top speed that isn't maintained during an actual backup; 30 MB/s is more like the average. Backups can't complete by morning because of that slowness, and we are using 10Gb networking. From local storage they move at 300 to 500 MB/s, which means backups from local storage easily complete during the night and aren't a problem for the customers first thing in the morning. But almost none of our VMs are on local storage at this point; they are on Ceph, which has brought huge performance and flexibility to our customer base, but it means that backing up those dozens of VMs often can't finish by morning as it should (and no, we aren't backing that many up per night).

This is the point I was trying to make, not that VMware is "ahead". Anybody in the community looking at this thread can now see this clarification and understand that in no way am I saying VMware is superior; if it were, we would be using it instead of Proxmox, and that's simply not the case. No offense intended. What would be appreciated is an explanation of what has been done and is being done to correct the slow backups from the Ceph backend. We humbly realize we don't know how to fix the issue ourselves, but we would appreciate an equally humble recognition on the part of the developers: don't take offense at observations, but realize that we all want the same thing, to make Proxmox the best it can be.
 
That would be very nice indeed if that's the case. Of course, that's still some time off, since Luminous hasn't yet been released (it's currently at 12.0.3 and needs to reach 12.2 before release), and my guess is that Proxmox 5.0 could very well be released without Luminous, meaning you'd still be dealing with Jewel and whatever is still "holding up the wheels of progress". It would be cool to get a boost just from upgrading, but the underlying problem still needs to be addressed. Notice that the first post in this thread is from May 2016, so the issue is now over a year old.

Like I said in a previous post, I humbly admit that I don't know how to fix the problem myself (if I did, I'd more than happily donate my time to repairing it and getting it committed into the Proxmox git repo), but I would like to see someone from Proxmox outline what's going on, what's been done so far, and what the plan of attack is to get this issue under control.

Tom, Fabian, and any other Proxmox developer: no one is asking for miracles, just a bit of transparency and hope for an upcoming resolution to allay fears. Many of us are using this in production and supporting customers on it, so we need some idea of when to expect fixes for the sake of the businesses we run. We stepped up and bought Enterprise repo subscriptions because we feel Proxmox deserves monetary support for the hard work they are doing, so please understand, we are not unappreciative, just a bit lost as to what's going to happen. Our backup situation continues to cause us apprehension, since everything that affects our customers ends up giving us a "black eye" with them; they have no idea where the issue stems from and trust us to make everything work.
 
