Hello
@fabian and thank you for responding.
I only now have time to read through and will try to reply to your messages. It's going to be longish, I apologize in advance.
HTTP/2 as only transport mechanism for the backup and reader sessions is not set in stone. But any replacement/additional transport must have clear benefits without blowing up code complexity. We are aware that HTTP/2 performs (a lot) worse over higher latency connections, because in order to handle multiple streams over a single connection, round trips to handle stream capacity are needed. Why your TLS benchmark speed is so low I cannot say - it might be that your CPU just can't handle the throughput for the chosen cipher/crypto?
You are viewing this problem from a single perspective only - that of higher-latency links - even though I already gave you some other examples previously. Let me repeat them here:
- LACP will never utilize more than a single link's capacity, because there's only one connection / one stream of data going on
- on congested links (think offsite backups), backup traffic will end up worse when competing for bandwidth with other traffic than it would if there were multiple connections in parallel instead of just one
- server network hardware and drivers may also benefit from multiple TCP streams, as they could then be handled by multiple CPUs
All of these points matter (and there are possibly other points I did not identify), and none of these benefits can be realized because of this choice that was made: to always use only one TCP connection for backup traffic. You can't ignore all of these other points and pretend there's a problem only with high-latency links.
On "Why your TLS benchmark speed is so low I cannot say - it might be that your CPU just can't handle the throughput for the chosen cipher/crypto?" - I also don't know why it's slow. I can only tell you what I have observed, and that is:
my CPU has free cycles. And that clearly means there is room for improvement: if my CPUs are not all 100% utilized, then PBS is not using all the available hardware. But it should - or at least we'd love it to, because then things would run faster, and that's the goal, right? So it doesn't matter what CPU type I have or how old it is. If it's idling, then it's not the CPU's age that is at fault; it's that the solution can't utilize the CPU fully. Yes, it will run better on a better CPU - what doesn't? - but that's no solution: it would also run faster on a newer CPU if the software used the CPU's potential better.
CPU age is not a valid answer for poor performance. There are some exceptions around instruction sets - if the CPU doesn't support them - but that doesn't apply here, and it can't apply while the CPU is not fully utilized in any case.
xxHash is not a valid choice for this application - we do actually really need a cryptographic hash, because the digest is not just there to protect against bit rot.
Could you elaborate more on this, please? What else are you protecting against?
If you are aiming for "this way we can guarantee that no bad actor changed the contents of the file", I am willing to argue against that. When someone has access to the files in your backup datastore, you are more or less doomed at that point already - how does a crypto hash help? Trying to protect data at rest against intentional modification with crypto hashes is... very debatable, at a minimum. Especially when it causes high costs (in newer servers and CPU power) for a user. Yes, SHA means that someone with access to chunks will have a harder time corrupting files while retaining the hash than with a non-cryptographic checksum. But don't tell me that's the reason...
I know the benefits though: the speed of calculation is orders of magnitude higher than SHA-256's. Therefore I actually think it's very valid for this application, if not perfect. You say it is not - so please explain the reasoning behind SHA-256, as I must be missing something.
Even if we don't agree on this, I think users should still have the choice to decide on their own. "I'll use the less secure hash because it's 50x faster" is something I might choose to do, given my level of trust in the other protections I put in place for my backup machines - and based on my own risk evaluation. We should all do test restores anyway (including starting the VM/CT) as best practice, right? So... what exactly are we losing with a fast non-crypto hash? What is so important that you won't even consider a non-crypto hash in this case - I wish to know. Sorry, I'm an engineer, so dogma is not an answer I can work with.
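To put rough numbers on the speed argument, here is a minimal throughput sketch. Python's standard library has no xxHash, so `zlib.crc32` stands in as the fast non-cryptographic checksum (real xxHash/XXH3 is typically even faster); the chunk size and round count are only illustrative.

```python
# Rough hashing-throughput comparison on one PBS-sized 4 MiB chunk.
# zlib.crc32 is only a stand-in for a fast non-crypto hash like xxHash.
import hashlib
import time
import zlib

def throughput_mb_s(hash_fn, data, rounds=20):
    """Return hashing throughput in MB/s over `rounds` passes."""
    start = time.perf_counter()
    for _ in range(rounds):
        hash_fn(data)
    elapsed = time.perf_counter() - start
    return (len(data) * rounds / 1e6) / elapsed

data = b"\x5a" * (4 * 1024 * 1024)  # one fake 4 MiB chunk

sha_speed = throughput_mb_s(lambda d: hashlib.sha256(d).digest(), data)
crc_speed = throughput_mb_s(zlib.crc32, data)
print(f"sha256: {sha_speed:8.0f} MB/s")
print(f"crc32 : {crc_speed:8.0f} MB/s")
```

On most machines the non-crypto checksum wins by a wide margin, which is the whole point of the trade-off being discussed.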
Compression or checksum algorithms changes mean breaking existing clients (and servers), so those will only happen if we find something that offers substantial improvements.
No, it doesn't need to break existing clients or servers. I even noted in my previous post an example of how this could be achieved. Without looking at the code, admittedly - but there are many ways backward compatibility can be retained when introducing a new feature (in this case a new hash or compression algorithm for chunks).
With backward compatibility I mean: a newer PBS version will be able to read and work with a datastore that was used by previous versions - not vice versa, obviously. But that's ok; that's pretty standard behavior. In the case where a newer client created a chunk and an older client then tries to restore it - that wouldn't work, true. But you could configure the system NOT to use the new format by default, so one has to turn it on consciously - maybe even require a new datastore, if you wish, to unlock the new features. Then make sure older clients can't use the new datastore and you're set. Nothing breaks. In my opinion at least, that's not prohibitively complicated.
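As a hypothetical illustration of that kind of backward compatibility (the naming scheme and prefix are entirely made up, not PBS's actual chunk format): legacy digests stay bare SHA-256 hex and keep verifying unchanged, while new-format digests carry an explicit algorithm prefix - again with `crc32` standing in for xxHash.

```python
# Made-up digest versioning scheme: bare hex = legacy SHA-256,
# "algo:value" = new format. Old datastores keep working as-is.
import hashlib
import zlib

def make_digest(data: bytes, algo: str = "sha256") -> str:
    if algo == "sha256":
        return hashlib.sha256(data).hexdigest()     # legacy: no prefix
    if algo == "crc32":                             # stand-in for xxHash
        return "crc32:%08x" % zlib.crc32(data)
    raise ValueError(f"unknown digest algo {algo!r}")

def verify(data: bytes, digest: str) -> bool:
    if ":" not in digest:                           # old format => SHA-256
        return hashlib.sha256(data).hexdigest() == digest
    algo, _ = digest.split(":", 1)
    return make_digest(data, algo) == digest
```

An older client simply never sees (or is refused access to) a datastore flagged for the new format, so it can never hit a prefixed digest it doesn't understand.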
I might also note that you will hardly find "something that offers substantial improvements" if you don't agree to try anything. I think this would offer substantial improvements, with this being:
- xxHash for checksums
- multiple TCP streams (they can remain HTTP/2 connections if you wish - just use more of them in parallel)
- an option to disable TLS for transfer (it doesn't have to be complex: use another port for non-TLS API access and put a listener without TLS there; later you could block API routes and selectively leave only those for chunk transfer accessible on the non-TLS API, but for testing... bringing up a non-TLS API endpoint shouldn't be complicated at all)
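For the last point, a minimal sketch of what I mean by a separate non-TLS listener, assuming nothing about the real PBS API - the `/chunk/<digest>` route, the in-memory chunk store and the ephemeral port are all made up for illustration:

```python
# Toy plain-HTTP listener that only exposes a chunk-download route.
# Everything here (route, store) is invented for the sketch.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

CHUNKS = {"abc123": b"fake chunk payload"}   # stand-in chunk store

class ChunkHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # only the chunk-download route is reachable on the non-TLS port
        if not self.path.startswith("/chunk/"):
            self.send_error(404)
            return
        body = CHUNKS.get(self.path.rsplit("/", 1)[1])
        if body is None:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), ChunkHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
print("listening on port", server.server_address[1])
```

The real thing would of course reuse the existing API routing and just skip the TLS wrapping on that socket - this only shows how small the surface of such a test endpoint can be.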
Try it. Then tell me if it qualifies as substantial improvement - or not. Test on some older lab hardware as well.
That's not true at all. For many of the tasks PBS does, spinning rust is orders of magnitudes slower than flash. You might not notice the difference for small incremental backup runs, where very little actual I/O happens, but PBS basically does random I/O with 1-4MB of data when writing/accessing backup contents, and random I/O on tons of small files for things like GC and verify.
I said "in my testing". Restore speed from an NVMe datastore was not any faster than restoring from an HDD datastore. The reason is probably that the HDD datastore is on ZFS, which already had all the metadata from .chunks in RAM. That is roughly the same as ZFS on HDDs with a special vdev on NVMe or SSD (I say roughly the same, as RAM is still faster, but metadata on SSD does help a lot when the metadata cache is empty).
Let's not argue about NVMe vs HDD speeds in general - we all agree on that. But when doing a PBS restore - in my testing - there was no significant difference between the two. Therefore the bottleneck must be somewhere else:
- Network? - 10 Gbps (also in the second try it was 10 Gbps)
- CPU? - not fully utilized
- Disks? - might be the reason... let's try with enterprise NVMe hardware -> same result
And that's why I (re)opened this whole topic.
Once you see what I've seen, explained above, you go to the forums and communicate it. Because it's not normal for software not to utilize hardware fully (or... it is quite normal unfortunately, but that's another topic). With PVE being quite performant even on older machines, Linux-based, OSS-inspired and all - I thought I wouldn't have to explain to anyone that software not utilizing the full hardware potential is something you'd actually want to look into, not try to reject.
I just want to point out: what might seem "obvious" to you might actually be factually wrong (see the hash algorithm point above).
The answer about the hash is pure dogma - no arguments given. You can't use "no argument" as an argument in further conversation - really.
I otherwise agree with you that I can't tell for sure where the problem lies. In past experience it has turned out I have a good nose for this stuff. My nose is currently sniffing around: hashing, TLS and single-connection transfer. In a couple of years perhaps you can tell me how far off I was. At this point neither of us knows. Right now I'm just sniffing and giving my advice. And you're... trying to find ways to reject it, more or less, it would seem.
If things seem suboptimal, it's usually not because we are too lazy to do a simple change, but because fixing them is more complicated than it looks
Yeah, I lead a development team. I hear this a lot. I'll... politely skip the part where I tell you which of the above turns out to be true most of the time - in my experience.
But that's not the point anyway - something else is. Just because things seem difficult at first doesn't mean they actually are difficult to implement, and above all, it absolutely doesn't mean we shouldn't dig there. Quite the opposite. In my company we dig a lot. My team always hates the idea - initially. Always. Then they are super happy a couple of days later about how they solved the big complex issue. It's not always like that, of course, but I can confidently say that more than 50% of the time I get the "are you crazy, this will take us 6 months, no chance", and then usually the same night I receive tons of messages from one of the developers, who finishes the change either that night or within a few days, in total euphoria, with "this is great, this will work awesomely now" - and everyone else suddenly agreeing. The same people who were 100% against even thinking about touching it just 2 days before are in epiphany.

Just saying... Most of the time things look much more difficult than they actually are. And assuming in advance that something is complex and difficult, and spreading that assumption around, blocks developers from even considering changes to the code. And then laziness helps everyone quickly agree on that point. That's actually the biggest single thing that stops progress in existing code. Don't think that way, and don't say it until you know for sure what it actually takes to make the change. I also know which things are complex and which are not. Implementing another hash shouldn't be that complex - once a developer is allowed to dig a bit into the code and see what actually needs to be done. That should come before he is given assumptions, before thinking "it's more complicated than it looks". Because in most cases it's not that complicated after all.
I am not saying this to make you angry - but your CPU is actually rather old, and I suspect, the bottle neck in this case.
As long as the CPU is not 100% utilized, then by definition the CPU cannot be said to be the bottleneck. And since we know the CPU is not fully utilized, I can actually put this here: the CPU is not the bottleneck, so let's move away from that assumption.
I understand that some part of the processing might be single-core bound (although, monitoring CPU usage, I'm not seeing that here) - "hence the CPU was the bottleneck, you see, I told you". But in that case we should investigate ways to enable the processing to utilize multiple cores. That might be achievable without changing much code (not saying it is). But the mere "your CPU is actually rather old", while at the same time my CPU is wasting cycles, well... That's maybe a fine answer for an end customer looking for a quick solution now - go buy new hardware and try, fine. But for an engineer, a developer, someone who's trying to make software perform better with the hardware at its disposal? Not a valid answer. The CPU is not hitting the roof, so... let's try to find the cause and see how we can make better use of the CPU.
For that, we need to identify where most time is spent during processing and what is waiting for what. Profiling the process, measuring the different parts, etc. is needed here. This is where developers would need to jump in.
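As a toy illustration of what "utilize multiple cores" could look like for per-chunk work (the real change would of course be in PBS's Rust code): CPython's hashlib releases the GIL while hashing large buffers, so even a plain thread pool spreads SHA-256 over several cores here.

```python
# Toy sketch: hash many fake chunks in parallel instead of one after
# another. hashlib releases the GIL for large inputs, so the thread
# pool genuinely uses multiple cores for this workload.
import concurrent.futures
import hashlib

def sha256_hex(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()

# eight fake 1 MiB chunks
chunks = [bytes([i]) * (1024 * 1024) for i in range(8)]

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    digests = list(pool.map(sha256_hex, chunks))

print(len(digests), "chunks hashed")
```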
Or we can "blindly" (trusting my nose) try something to see if it helps.
That being said - feedback is always welcome as long as it's brought forward in a constructive manner.
I completely agree. And I hope you are not finding me unconstructive. I do have a rather low tolerance for vague answers, technically incorrect approaches to fixing problems - and BS in general (not saying your posts are BS - they are not, in general).
I can give you some more details why the "TLS benchmark" results are faster than your real world restore "line speed":
The TLS benchmark only uploads the same blob of semi-random data over and over and measure throughput - there is no processing done on either end. An actual restore chunk request has to find and load the chunk from disk, send its contents over the wire, parse, decode and verify the chunk on the client side, write the chunk to the target disk.
Ok, I understand.
I could test with another TLS benchmark just to make sure the numbers agree. The way you describe it, a web server with TLS configured - and me downloading a file from it - should get me roughly the same throughput. I'll test when I have time and report back. This is just to rule out the TLS implementation itself as a potential issue (I don't expect to find one; it's just to be thorough).
While our code tries to do things concurrently, there is always some overhead involved. In particular with restoring VMs, there was some limitation within Qemu that forces us to process one chunk after the other because writing from multiple threads to the same disk wasn't possible - I am not sure whether that has been lifted in the meantime. you could try figuring out whether a plain proxmox-backup-client restore of the big disk in your example is faster
Ok, this could pose some issues when trying to use parallel transfers. But we could still send the chunks in parallel and rearrange them into order at the receiving end (a bit more complicated, but not too much)...
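A small sketch of that rearranging idea, with `fetch_chunk()` standing in for the real network request: transfers complete out of order, but each result is written at its chunk's offset, so the receiving end still produces an ordered image and a single writer (or Qemu) sees consistent data.

```python
# Fetch chunks concurrently, reassemble by offset on the receiving end.
# fetch_chunk() is a stand-in for the real per-chunk network request.
import concurrent.futures
import random
import time

CHUNK = 4                              # tiny chunks, just for the demo
source = b"0123456789abcdefghijklmnopqrstuvwxyz"
chunks = [source[i:i + CHUNK] for i in range(0, len(source), CHUNK)]

def fetch_chunk(idx: int):
    time.sleep(random.uniform(0, 0.01))   # simulate network jitter
    return idx, chunks[idx]

target = bytearray(len(source))
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(fetch_chunk, i) for i in range(len(chunks))]
    for fut in concurrent.futures.as_completed(futures):
        idx, data = fut.result()          # arrives in arbitrary order
        target[idx * CHUNK:idx * CHUNK + len(data)] = data  # ordered write

print(bytes(target) == source)
```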
There are still potential gains with faster hashing and without TLS, if you would be willing to try with such code.
Kind regards
(edited some typos and grammar mistakes to make the text easier to read)