Here is some more information (the bug report is quite long): on recent kernels, it seems the kernel sometimes returns all zeros when a block is read from disk. The bug report states that this happens under memory pressure, but our hosts are not particularly loaded, yet we still see this bug...
We've been hit by this bug: http://tracker.ceph.com/issues/22464 quite a lot of times now. It's definitely not fun always having to recover your database because the VM I/O got stuck.
Would it be possible to include this patch: https://github.com/ceph/ceph/pull/23273 ?
Thanks,
Stefan
I don't think the fix is in 12.2.7 yet. At least the bug and the pull request are still open (the ones I linked, anyway). The other one is about a similar issue, but that was a bug in 12.2.6, which to my knowledge never shipped with Proxmox (direct update from 12.2.5...
We have exactly the same problem. It might be this bug: http://tracker.ceph.com/issues/22464
In our case, the OSD log contains a line like this:
bluestore(/var/lib/ceph/osd/ceph-7) _verify_csum bad crc32c/0x1000 checksum at blob offset 0xe000, got 0x6706be76
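A quick way to see how widespread this is on a node is to grep the OSD logs for the checksum message (a minimal sketch, assuming the default packaged log location /var/log/ceph/ceph-osd.*.log; if I remember correctly, 0x6706be76 is the crc32c of an all-zero 4K block, which would match the "reads return zeros" symptom from the kernel bug report):

# Count _verify_csum errors per OSD log (default log path assumed)
grep -c "_verify_csum bad" /var/log/ceph/ceph-osd.*.log

# Show the most recent occurrences with timestamps
grep "_verify_csum bad" /var/log/ceph/ceph-osd.*.log | tail -n 20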
Zeroing out is OK as a one-time measure, but in general you should be using VirtIO-SCSI with "discard" enabled and run a periodic fstrim inside the guest. That marks unused blocks as free and is much faster. With the dd method, the zeroes are actually written out to disk.
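If it helps, this is roughly how that setup looks (a sketch only; VM ID 100 and the volume name ceph-pool:vm-100-disk-0 are placeholders, adjust to your storage):

# On the PVE host: use the VirtIO-SCSI controller and enable discard on the disk
qm set 100 --scsihw virtio-scsi-pci
qm set 100 --scsi0 ceph-pool:vm-100-disk-0,discard=on

# Inside the guest: trim all mounted filesystems once, then enable the util-linux timer
fstrim -av
systemctl enable --now fstrim.timer

As far as I know, qm set replaces the whole option string for that disk, so keep any other options you already have on it in the same command.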
We already run fstrim daily...
But my system is not slow. All my backups where the cache mode is not directsync run at >50MB/s.
This backup is not "slow"; it stalls. Absolutely no activity and no load for hours. Of course I know that any bottleneck will slow down the backup process. The problem with that theory is that...
Sorry if I can't help directly, but yes, Ceph is slower than expected. You can speed up Ceph by using krbd.
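In case it saves someone a search: krbd is a per-storage flag on the PVE side (a sketch; "ceph-pool" is a placeholder storage ID, and running VMs have to be stopped and started again to pick up the change):

# Switch an existing RBD storage to the kernel RBD client (storage ID is a placeholder)
pvesm set ceph-pool --krbd 1

This should be the same as adding "krbd 1" to the storage entry in /etc/pve/storage.cfg.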
Your numbers are very slow for SSD. We have a production system with SSD and 10GbE, read/write is around 150MB/s, IOPS are around 1000-2000.
Test system is on 1GbE public network, 2GbE...
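One way to sanity-check numbers like these, independent of the guest and its cache mode, is rados bench directly against the pool (a sketch; "testpool" is a placeholder pool name, and --no-cleanup keeps the objects around for the read test):

# 60s of 4M sequential writes with 16 concurrent ops; keep the objects for the read test
rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup

# Sequential reads of the objects written above, then clean up
rados bench -p testpool 60 seq -t 16
rados -p testpool cleanup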
Thank you, Alwin. It's not that I can't read. In fact, I had already read the things you quote and link months ago. Nowhere do they state that directsync does not work with the Proxmox backup process.
As I understand it, PVE has an enhanced version of qemu to implement the backup process as...
I already know a lot about caching, thank you. However, can you explain in detail why directsync does not work with the VMA backup? And I mean in technical terms.
I have changed the disk cache to "writethrough" and for the last few days the daily backup has always run successfully. So either there is a bug in "directsync", or "directsync" should be marked as incompatible with backup.
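For anyone wanting to try the same, the cache mode can also be switched from the CLI (a sketch; VM ID 100 and the volume name are placeholders):

# Set the disk cache mode to writethrough; restart the VM for it to take effect
qm set 100 --scsi0 ceph-pool:vm-100-disk-0,cache=writethrough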
So by "slow", you mean 407 kByte/s slow? Because that's the average speed of this (cancelled) backup job:
If that is the case, then basically you should mark "directsync" as not compatible with backup. Which is unfortunate, because according to our testing, directsync is the fastest...
Are you saying that backups don't work with "directsync"? Because usually the first backup after a fresh VM start has worked; only after a couple of days (or backups?) do the backups eventually slow down. (I let one run over the weekend; it basically only made progress for VM disk writes.)
Stefan
Well, the other 20 or so VMs on the same source storage (Ceph) are backing up just fine to the same target storage (GlusterFS), so I highly doubt that is the source of the problem. Indeed, 5 hosts can simultaneously (thanks to the silly cluster backup mechanism) back up VMs, and they all work fine...
I did a backup of a small VM to local storage. As mentioned above, it finished at normal speed. I also have another VM of about the same size as the one that's having trouble, and it backs up to the same storage without problems (yet). You can see from the logs that this one also...
Started on another host, next day, same problem.
I poked around in the QEMU monitor, and even after cancelling the backup, it shows:
info block-jobs
Type backup, device drive-scsi0: Completed 37059821568 of 150323855360 bytes, speed limit 10485760000 bytes/s
(I had set the speed limit...
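For reference, the relevant HMP commands can be reached through qm monitor (a sketch; VM ID 100 and the drive-scsi0 device name are examples from my setup, and I can't say how cleanly the generic block-job commands behave with PVE's patched backup job):

# On the PVE host, open the HMP monitor of the VM (VM ID is an example)
qm monitor 100

# Inside the monitor: inspect, throttle, or cancel the backup block job
info block-jobs
block_job_set_speed drive-scsi0 10485760000
block_job_cancel drive-scsi0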
Well, after one day and only 10% more progress, I cancelled the backup. I rebooted the VM and took another backup, which finished normally:
Very unsatisfactory; who knows when it'll happen again.
Stefan