VZDump slow on ceph images, RBD export fast

My bad ... I just said that current Luminous release is 12.0.3 and it's not, it's at RC status and is at 12.1.0 ... sorry ... maybe Proxmox 5.0 and Luminous CAN come out together after all ...
 
Like I said in a previous post, I humbly admit that I don't know how to fix the problem myself (if I did I'd more than happily donate my time to repairing it and getting it committed into Proxmox GIT repo) but would like to see someone from Proxmox outline what's going on and what's been done so far and what the plan of attack is to get this issue under control.

You are invited to just read what's going on in the project. You write that you do not know how to fix it, but you just assume it is easy for others?

All our developers communicate on a publicly available mailing list, all the code written is available in our git, and planning and future plans are discussed in the open.

Here are some links for you, if you follow all this you will never miss anything:

https://pve.proxmox.com/wiki/Developer_Documentation
https://bugzilla.proxmox.com
https://git.proxmox.com
 
Not at all ... I am not assuming that it's easy for others but not for me

Perhaps I'm incorrect but it seems you're getting offended by my comments as your responses seem defensive ... perhaps it's a language barrier issue ... please don't take what I'm saying as accusatory or flippant or unconcerned about whether or not it's difficult for you guys to fix things

I'm more than willing to read the links you put above, great ... no problem there
 
So at this point, I'm subscribed to pve-devel so thanks for that. The bugzilla doesn't seem to show any mention of the vzdump ceph slowness bug. As far as the GIT repo, I have looked at that regularly over the past couple of years and it's essentially a different view of pve-devel mailing list.

Anyway, we are clearly not the only ones that would like a status on this bug. We aren't saying it's easy to fix but it has existed for a very long time too so easy or not, the question remains ... what has been accomplished in more than a year of this bug's existence? We've seen our backup speeds increase from 6-9MB/s to 30-40MB/s which is a big improvement so, THANKS! ... however, to get it to 300-400MB/s what's needed? What's blocking? What has been found in the last year?
 
So, clearly, the issue is not so simple.
What options then do we have for backups?
Most likely some people have already found a way to get their backups done in time?

Currently, I'm looking at the rbd tools and their options. I saw that ceph supports asynchronous replication, but it requires another ceph cluster, which we don't have at the moment.
Another option is to use rbd snapshot and rbd export. But what about consistency? What are the risks of using rbd export instead of the Proxmox backup tool?
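
For anyone weighing the snapshot-plus-export route, here is a minimal sketch of the command sequence, built in Python so it can be inspected before anything is run. The pool name "rbd", the snapshot name "backup" and the destination path are hypothetical placeholders, not values from this thread. Note that a snapshot taken this way is only as consistent as the guest's on-disk state at that moment, and that it covers the disk image only, not the VM config.

```python
import shlex


def rbd_export_commands(pool, image, snap, dest):
    """Return the rbd commands (as argument lists) for a snapshot-based export.

    Nothing is executed here; the caller decides whether to run them.
    """
    spec = f"{pool}/{image}@{snap}"
    return [
        ["rbd", "snap", "create", spec],  # point-in-time snapshot of the image
        ["rbd", "export", spec, dest],    # copy the snapshot contents to a file
        ["rbd", "snap", "rm", spec],      # drop the snapshot afterwards
    ]


if __name__ == "__main__":
    # Hypothetical pool, snapshot name and destination path.
    for cmd in rbd_export_commands(
        "rbd", "vm-103-disk-1", "backup", "/mnt/backup/vm-103-disk-1.raw"
    ):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them, or to pass each list to subprocess.run once you are happy with them.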
 
There are others in this forum that have posted that they do their backups straight from Ceph and that it works, BUT ... that does not bring the VM config file and so on, so for disaster recovery it would be super problematic. No, Proxmox was set up from the beginning to do its own backups, this is clearly a bug that needs to be fixed and should be.

As far as the software from the folks at enterpriseve.com, I can say we haven't tried it, no, but that takes me back to what I just said. This is a feature of Proxmox with a bug and needs to be fixed, not "gotten around", by being forced to buy additional software to do something that Proxmox already is supposed to do but just happens that it is not doing it well enough right now under these particular circumstances (backups from Ceph only).

We need to stay on top of this as to status. I've been watching the developers mailing list for weeks now and there hasn't been a single mention of this backup bug. All I've seen is feature additions and fixes to features recently added or fixes to the GUI but no mention of the Ceph backup slowness problem.

This problem isn't just about getting backups done within a reasonable amount of time. It's also that Ceph has its own maintenance in the form of PG scrubs and deep scrubs, and if those are scheduled at night, as the backups are, they only add further slowdowns or problems for customers. The backup slowness does not seem to be Ceph's fault, but it does add to the slowness issues that Ceph already contends with internally.

As a question to the Proxmox guys, not a complaint, why is Perl used so much instead of a simpler language like Python? Tom gave me a hard time, though I wasn't criticizing anyone, when I said I'm not a programmer so this is not solvable by me, as if I was trying to say it was easy for the developers. If the issue is hard for the Devs too, why not use a language that's more parsable so that when these sorts of bugs come up it's easier to track down the cause? I'm not a programmer but have looked at a lot of open source code ... C, C++, Perl, Python, Java ... and without being a programmer it's quite obvious to me that Perl is horrifically complicated to understand and read in comparison to some of the others. I know it's like the "Swiss Army knife" of languages but that's like having a guy that's a "Jack of all trades and Master of none" ... does everything good but does nothing exceptionally ... again, this is not a complaint or a criticism, but rather an honest observation and would love to know the reasoning behind using Perl over other more simple or more parsable languages.
 
I want to make something clear here ... before we were users of Proxmox, we were users of Citrix XenServer. Why did we get off of XenServer? A couple of reasons ... XenServer had no built-in way to create VM backups and XenServer performed very poorly on disk IO. As you can see, our number 1 reason for moving to Proxmox was Proxmox provided excellent backups.

Why am I mentioning this? Proxmox needs to understand that the VM backups that have been a feature in Proxmox since the beginning set them apart from the many other solutions, making them more desirable. So, please, put some time into fixing the backup speed from Ceph as an old and counted on feature in Proxmox that I am sure we are not the only ones to appreciate. And appreciate that feature we do ... it is why we became subscribers, in order to give financial backup to the Proxmox team to help make sure they could continue to support the excellent feature set found in Proxmox currently ... NOT new features but the excellent features already found in Proxmox.
 
this is not a fast and easy kind of problem:
  • the current backup code including the output format is tightly coupled to the qemu code base, keeping it current with qemu changes is already quite some work
  • the backup code needs to be stable and rock solid, and we try to keep compatibility for a long time
we are currently working on tackling the first issue, by factoring out the format related code into an external library which should make follow-up updates easier (like switching some of the qemu integration to potentially better performing variants) and reduce the maintenance burden.
poking us every few weeks on why there hasn't been visible progress takes valuable time away from development (the same people replying here on the forum are the ones who also implement the features you want, please keep that in mind!)
 
Fabian, thank you. I apologize for the "poking" as you put it. This is the first answer I've seen that nicely describes the difficulty you guys are having with getting this issue resolved. If my words here seemed excessive or lacking in tact at times, again I apologize. I am a back office admin that's facing real world issues because of the slowness of these backups on Ceph and getting frustrated because of it so if my words reflect that frustration, it wasn't on purpose.

I just needed a reasonable answer on this and you've given exactly that. You're a gentleman and a scholar. :)

Thanks again and keep up the excellent work.
 
I ran an experiment on my infrastructure to see how backup performance differs, and somehow it's sad to see how much would be possible...

Always the same VM (just different IDs):

Backup VM on ceph to NFS:
Code:
INFO: include disk 'virtio0' 'ceph-vm:vm-103-disk-1' 32G
INFO: creating archive '/mnt/pve/Backup/dump/vzdump-qemu-103-2017_08_09-15_08_25.vma.lzo'
INFO: transferred 34359 MB in 268 seconds (128 MB/s)
INFO: archive file size: 2.35GB
INFO: Finished Backup of VM 103 (00:04:30)

Backup VM on local-zfs to NFS:
Code:
INFO: include disk 'virtio0' 'local-zfs:vm-901-disk-1' 32G
INFO: creating archive '/mnt/pve/Backup/dump/vzdump-qemu-901-2017_08_09-15_13_16.vma.lzo'
INFO: transferred 34359 MB in 88 seconds (390 MB/s)
INFO: archive file size: 2.35GB
INFO: Finished Backup of VM 901 (00:01:38)

Backup VM on local-ssd to NFS:
Code:
INFO: include disk 'virtio0' 'SSD:902/vm-902-disk-1.raw' 32G
INFO: creating archive '/mnt/pve/Backup/dump/vzdump-qemu-902-2017_08_09-15_15_29.vma.lzo'
INFO: transferred 34359 MB in 52 seconds (660 MB/s)
INFO: archive file size: 2.35GB
INFO: Finished Backup of VM 902 (00:00:58)

Hopefully this might be solved in the future and we all can laugh about this time :D
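
To put the three runs side by side, here is a small sketch that recomputes the rates from the summary lines quoted above and expresses each as a multiple of the ceph run. The regular expression is only an assumption about the line format as it appears in these logs, not an official vzdump format specification.

```python
import re

# Matches the summary line as it appears in the logs above.
SUMMARY = re.compile(r"transferred (\d+) MB in (\d+) seconds")


def rate_mb_s(line):
    """Return the average transfer rate in MB/s for one vzdump summary line."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError(f"not a summary line: {line!r}")
    megabytes, seconds = int(match.group(1)), int(match.group(2))
    return megabytes / seconds


summary_lines = {
    "ceph": "INFO: transferred 34359 MB in 268 seconds (128 MB/s)",
    "local-zfs": "INFO: transferred 34359 MB in 88 seconds (390 MB/s)",
    "local-ssd": "INFO: transferred 34359 MB in 52 seconds (660 MB/s)",
}

rates = {name: rate_mb_s(line) for name, line in summary_lines.items()}

for name, rate in rates.items():
    # Compare every source storage against the ceph run.
    print(f"{name}: {rate:.1f} MB/s, {rate / rates['ceph']:.1f}x the ceph rate")
```

Same VM, same NFS target, so the difference in these numbers comes down to the source storage.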
 
@fips

What kind of NAS are you using? I only get about a third of what you are getting to your NFS (you showed 128 MB/s)

All those throughputs have to be on 10Gb ethernet, but is the NAS a generic piece of hardware like a Dell with standard Linux on it, or is it a proprietary NAS solution?

Yes, hopefully you are right, that once this is fixed it'll be history to laugh at
 
FreeNAS 11 with 10x 2TB SATA disks, connected via 10GbE.
Basic NFS setup, no tuning, no hacks.
Today I added an SSD ZIL and an SSD L2ARC to my FreeNAS ZFS pool, let's see if it gets better.

To be honest, the test VM shown above is quite compressible: as you can see, the VM disk is 32G and the archive file 2.35G.
During the night I backed up my Exchange VM (550G of VM disks, 200G compressed) and the throughput was 35MB/s...
 
After adding a ZIL and an L2ARC to the NFS server, I can say that it doesn't bring any advantage for backups with vzdump.
In the end it took the same amount of time to back up all my VMs and CTs.
 
Let me bring this topic up again, please.

Has anybody managed to run a backup faster than 40MB/s?

Yes. As a joke:

INFO: starting new backup job: vzdump 109 110 111 123 --mailto gosha --compress lzo --storage BARC --mailnotification always --mode snapshot --quiet 1
INFO: skip external VMs: 109, 111
INFO: Starting Backup of VM 110 (qemu)
INFO: status = running
INFO: update VM 110: -lock backup
INFO: VM Name: eqmz
INFO: include disk 'scsi0' 'ceph_stor:vm-110-disk-1' 100G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/BARC/dump/vzdump-qemu-110-2017_08_10-18_45_05.vma.lzo'
INFO: started backup task '9d8f8d6f-1311-4135-8cc8-c045ce922db9'
INFO: status: 0% (130351104/107374182400), sparse 0% (6275072), duration 3, 43/41 MB/s
INFO: status: 1% (1085407232/107374182400), sparse 0% (715509760), duration 13, 95/24 MB/s
INFO: status: 2% (2334588928/107374182400), sparse 1% (1963966464), duration 17, 312/0 MB/s
INFO: status: 3% (3280797696/107374182400), sparse 2% (2910171136), duration 20, 315/0 MB/s
INFO: status: 4% (4414308352/107374182400), sparse 3% (4043542528), duration 24, 283/0 MB/s
INFO: status: 5% (5392302080/107374182400), sparse 3% (4271185920), duration 45, 46/35 MB/s
INFO: status: 6% (6445793280/107374182400), sparse 4% (4300845056), duration 73, 37/36 MB/s
INFO: status: 7% (7544766464/107374182400), sparse 4% (4352024576), duration 100, 40/38 MB/s
INFO: status: 8% (8622309376/107374182400), sparse 4% (4373495808), duration 128, 38/37 MB/s
INFO: status: 9% (9667018752/107374182400), sparse 4% (4416679936), duration 155, 38/37 MB/s
INFO: status: 10% (10740432896/107374182400), sparse 4% (4494766080), duration 182, 39/36 MB/s
.....
INFO: status: 90% (96979648512/107374182400), sparse 61% (66357768192), duration 1078, 359/2 MB/s
INFO: status: 91% (98016559104/107374182400), sparse 62% (67394678784), duration 1081, 345/0 MB/s
INFO: status: 92% (98891595776/107374182400), sparse 63% (68269715456), duration 1084, 291/0 MB/s
INFO: status: 93% (100034609152/107374182400), sparse 64% (69399830528), duration 1088, 285/3 MB/s
INFO: status: 94% (101013585920/107374182400), sparse 65% (70378807296), duration 1091, 326/0 MB/s
INFO: status: 95% (102071140352/107374182400), sparse 66% (71428984832), duration 1095, 264/1 MB/s
INFO: status: 96% (103109885952/107374182400), sparse 67% (72467701760), duration 1098, 346/0 MB/s
INFO: status: 97% (104178122752/107374182400), sparse 68% (73512976384), duration 1102, 267/5 MB/s
INFO: status: 98% (105239412736/107374182400), sparse 69% (74574266368), duration 1105, 353/0 MB/s
INFO: status: 99% (106624057344/107374182400), sparse 70% (75951534080), duration 1109, 346/1 MB/s
INFO: status: 100% (107374182400/107374182400), sparse 71% (76701659136), duration 1111, 375/0 MB/s
INFO: transferred 107374 MB in 1111 seconds (96 MB/s)
INFO: archive file size: 11.00GB
INFO: delete old backup '/mnt/pve/BARC/dump/vzdump-qemu-110-2017_08_09-18_45_04.vma.lzo'
INFO: Finished Backup of VM 110 (00:18:36)

transferred 107374 MB in 1111 seconds (96 MB/s)
Good compression :) archive file size: 11.00GB
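
A rough way to see why the 96 MB/s is a "joke": the final status line reports how many bytes were sparse, so the rate for actual data can be estimated by subtracting them. This is only a sketch, and it assumes that the "sparse" figure counts bytes that were detected as empty rather than written, and that MB means 10^6 bytes (which matches 107374 MB for 107374182400 bytes in the log).

```python
import re

# Fields as they appear in the status lines above:
# bytes done / total bytes, sparse bytes, elapsed seconds.
STATUS = re.compile(
    r"status: \d+% \((\d+)/(\d+)\), sparse \d+% \((\d+)\), duration (\d+)"
)


def rates_from_status(line):
    """Return (raw MB/s, non-sparse MB/s) averaged up to this status line."""
    match = STATUS.search(line)
    if match is None:
        raise ValueError(f"not a status line: {line!r}")
    done, _total, sparse, duration = (int(g) for g in match.groups())
    raw = done / duration / 1e6
    # Assumption: sparse bytes were skipped, so only the rest is real data.
    data_only = (done - sparse) / duration / 1e6
    return raw, data_only


final_line = (
    "INFO: status: 100% (107374182400/107374182400), "
    "sparse 71% (76701659136), duration 1111, 375/0 MB/s"
)
raw, data_only = rates_from_status(final_line)
print(f"{raw:.1f} MB/s including sparse regions")
print(f"{data_only:.1f} MB/s for non-sparse data")
```

Under those assumptions the average for real data comes out well below 40 MB/s, in line with the 35-40 MB/s status lines in the non-sparse stretch of the log.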
 
No, that's definitely not the answer that we're looking for. If you read the rest of the thread you'll see that was already mentioned, along with the response to it. Proxmox already does backups, it's a matter of fixing the issues with backing up Ceph images within Proxmox so that it's done at full speed.
 
The problem is the size of the virtual machines, combined with the fact that Proxmox wants to make full backups.
Proxmox should implement APIs for the filesystem and backup, with support for other languages such as Java, Python, C, C++ and C#. External programs get created because the GUI cannot be customized through plugins.
 
Backup from Ceph images has been as fast as from any other storage for quite some time on 5.0; the issue is fixed.