Abysmally slow restore from backup

My PBS, running in a container on PVE (Ryzen 5950X), gives a TLS benchmark of 348 MB/s (28%) with 1 core, 726 MB/s (59%) with 2 cores and 839 MB/s (68%) with 3 cores or more. I guess there is some parallelization going on.
That is very strange. I don't see any CPU usage at all on any of my CPUs, and if I test a VM with 8 CPUs vs. the PVE host it is running on with 32, there is no noticeable difference. I'm not testing TLS - that is limited by the network speed to the PBS - yet everything else (also) stays the same. I would have expected it to scale somehow.
 
Hello @fabian and thank you for responding.
I only now have time to read through and will try to reply to your messages. It's going to be longish, I apologize in advance.

HTTP/2 as the only transport mechanism for the backup and reader sessions is not set in stone. But any replacement/additional transport must have clear benefits without blowing up code complexity. We are aware that HTTP/2 performs (a lot) worse over higher latency connections, because in order to handle multiple streams over a single connection, round trips to handle stream capacity are needed. Why your TLS benchmark speed is so low I cannot say - it might be that your CPU just can't handle the throughput for the chosen cipher/crypto?
You are only viewing this problem from a single perspective, that of higher-latency links, even though I already gave you some other examples previously. Let me repeat them here:
- LACP will never utilize more than a single link's capacity, because there's only one connection / one stream of data going on
- on congested links (think offsite backups), backup traffic will end up worse when competing for bandwidth with other traffic than it would if there were more connections in parallel instead of just 1
- also, server network hardware and drivers may well benefit from multiple TCP streams, as they could then be handled by multiple CPUs

All of these points matter (and there are possibly other points I did not identify) and none of them can benefit, because of this choice that was made - to always use only 1 TCP connection for backup traffic. You can't ignore all of these other points and pretend the problem exists only on high-latency links.

On "Why your TLS benchmark speed is so low I cannot say - it might be that your CPU just can't handle the throughput for the chosen cipher/crypto?" I also don't know why it's slow. I can only tell you that what I have observed and that is: my CPU has free cycles. And that clearly means there is room for improvement. Because if my CPUs are not all 100% utilized, what that means is that PBS is not using all the available hardware. But it should. Or, we'd love that it does. Because then things would run faster - and that's the goal, right ? So, it doesn't matter what CPU type I have and how old it is. If it's idling, then it's not the "CPU-s age" that is at fault, it's the solution can't utilize the CPU fully. Yes it will run better on a better CPU, what doesn't, but that's no solution - it can also run faster on a newer CPU if software was using the CPU potential better.

CPU age is not a valid answer for poor performance. There are some exceptions around instruction sets, if the CPU doesn't support them, but that doesn't apply here - and it shouldn't apply while the CPU is not fully utilized in any case.

xxHash is not a valid choice for this application - we do actually really need a cryptographic hash, because the digest is not just there to protect against bit rot.
Could you elaborate on this please? What else are you protecting?
If you are aiming for "this way we can guarantee that no bad actor changed the contents of the file", I am willing to argue against that. When someone has access to the files in your backup datastore you are more or less doomed at that point already, so how does a crypto hash help? Trying to protect against intentional modification of data at rest with crypto hashes is... very debatable, at a minimum. Especially when/if it causes high costs (in newer servers and CPU power) for a user. Yes, SHA means that someone with access to chunks will have a harder time corrupting files while retaining the hash than with a non-cryptographic checksum. But don't tell me that's the reason...

I know the benefits though: the speed of calculation is orders of magnitude higher than SHA. Therefore I actually think it's very valid for this application, if not perfect. You say it is not - so please explain the reason behind SHA-256, as I must be missing something.

Even if we don't agree on this, I think users should still have a choice to decide on their own. "I'll use the less secure hash because it's 50x faster" is something I might choose to do, given my level of trust in the other protections I put in place on my backup machines - and based on my own risk evaluation. We should all do test restores anyway (that includes starting the VM/CT) as best practice, right? So... what exactly are we losing with a fast non-crypto hasher? What is so important that you won't even consider a non-crypto hash in this case - I wish to know. Sorry, I'm an engineer, so dogma is not an answer I can work with.
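To put a rough number on that speed claim, here is a minimal Rust sketch (not PBS code) that times SHA-256 against xxHash3-128 over a 4 MiB buffer, roughly one chunk's worth of data. It assumes the sha2 and xxhash-rust crates; the actual ratio depends on the CPU and on whether SHA extensions are available.

```rust
// Rough single-core digest throughput comparison, SHA-256 vs xxHash3-128.
// Assumed Cargo.toml deps: sha2 = "0.10", xxhash-rust = { version = "0.8", features = ["xxh3"] }
use sha2::{Digest, Sha256};
use std::time::Instant;
use xxhash_rust::xxh3::xxh3_128;

fn main() {
    let chunk = vec![0xABu8; 4 * 1024 * 1024]; // one 4 MiB chunk-sized buffer
    let iterations: usize = 256; // hash ~1 GiB per algorithm

    let start = Instant::now();
    for _ in 0..iterations {
        let mut hasher = Sha256::new();
        hasher.update(&chunk);
        std::hint::black_box(hasher.finalize());
    }
    let sha_secs = start.elapsed().as_secs_f64();

    let start = Instant::now();
    for _ in 0..iterations {
        std::hint::black_box(xxh3_128(&chunk));
    }
    let xxh_secs = start.elapsed().as_secs_f64();

    let gib = (iterations * chunk.len()) as f64 / (1 << 30) as f64;
    println!("SHA-256:     {:.2} GiB/s", gib / sha_secs);
    println!("xxHash3-128: {:.2} GiB/s", gib / xxh_secs);
}
```

On most x86 CPUs the non-cryptographic hash comes out roughly one to two orders of magnitude faster per core - that per-core gap is what's being argued about here.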

Compression or checksum algorithm changes mean breaking existing clients (and servers), so those will only happen if we find something that offers substantial improvements.
No, it doesn't need to break existing clients or servers. I even noted in my previous post an example of how this could be achieved. Without looking at the code, really - but there are many ways backward compatibility can be retained when introducing a new feature (in this case, a new hash or compression algorithm for chunks).

By backward compatibility I mean: a newer PBS version will be able to read and work with a datastore that was used with previous versions - not vice versa, obviously. But that's OK, that's pretty standard behavior. In a case where a newer client created a chunk and an older client then tries to restore it - that wouldn't work, true. But you could configure the system NOT to use the new format by default, so one has to turn it on consciously - maybe even make a new datastore, if you wish, to unlock the new features. Then make sure older clients can't use the new datastore and you're set. Nothing breaks then. In my opinion at least, that's not prohibitively complicated.
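Purely to illustrate that opt-in idea (none of these types or fields exist in PBS; this is a hypothetical sketch, not its actual code), the mechanism could be as small as a per-datastore format marker that older clients are refused on, instead of silently misreading chunks:

```rust
// Hypothetical sketch of an opt-in, per-datastore digest setting.
// None of these names exist in PBS; this only illustrates the compatibility idea.

#[derive(Clone, Copy)]
enum ChunkDigest {
    Sha256,    // current format, default for new and existing datastores
    XxHash128, // hypothetical opt-in format
}

struct DatastoreFormat {
    version: u32,        // bumped only when an incompatible feature is enabled
    digest: ChunkDigest, // recorded in the datastore config at creation time
}

fn digest_name(d: ChunkDigest) -> &'static str {
    match d {
        ChunkDigest::Sha256 => "sha256",
        ChunkDigest::XxHash128 => "xxhash3-128 (hypothetical)",
    }
}

/// A client advertises the highest format version it understands; the server
/// rejects the session instead of letting an old client misinterpret chunks.
fn check_client_compat(store: &DatastoreFormat, client_max_version: u32) -> Result<(), String> {
    if store.version > client_max_version {
        return Err(format!(
            "datastore uses format v{} but client only supports up to v{}",
            store.version, client_max_version
        ));
    }
    Ok(())
}

fn main() {
    // Default store keeps the old format, so every existing client keeps working.
    let legacy = DatastoreFormat { version: 1, digest: ChunkDigest::Sha256 };
    // A new store where the admin consciously enabled the new digest.
    let opt_in = DatastoreFormat { version: 2, digest: ChunkDigest::XxHash128 };

    assert!(check_client_compat(&legacy, 1).is_ok());
    assert!(check_client_compat(&opt_in, 1).is_err()); // old client is refused, nothing breaks silently
    assert!(check_client_compat(&opt_in, 2).is_ok());
    println!("legacy: {}, opt-in: {}", digest_name(legacy.digest), digest_name(opt_in.digest));
}
```

The key point is that the default stays the old format, so nothing changes for existing setups unless an admin consciously creates (or converts) a datastore with the new one.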

I might also note that you will hardly find "something that offers substantial improvements" if you don't agree to try anything. I think this would offer substantial improvements. With this being:
- xxHash for checksums
- multiple TCP streams (they can remain HTTP/2 connections if you wish, just use more of them in parallel)
- an option to disable TLS for transfer (it doesn't have to be complex: just use another port for non-TLS API access and put a listener without TLS there; later you could block API routes and selectively leave only those for chunk transfer accessible on the non-TLS API, but for testing... it shouldn't be complicated at all to bring up a non-TLS API endpoint)
Try it. Then tell me if it qualifies as substantial improvement - or not. Test on some older lab hardware as well.

That's not true at all. For many of the tasks PBS does, spinning rust is orders of magnitude slower than flash. You might not notice the difference for small incremental backup runs, where very little actual I/O happens, but PBS basically does random I/O with 1-4MB of data when writing/accessing backup contents, and random I/O on tons of small files for things like GC and verify.
I said "in my testing". Restore speed from NVMe datastore was not any faster than restoring from HDD datastore. Reason for this is probably that HDD datastore is on ZFS that had all metadata from .chunks already in RAM. That is roughly the same as having ZFS with HDD-s + special vdev on NVMe or SSD (I say roughly the same, as RAM is still faster, but metadata on SSD does help a lot when metadata cache is empty).

Let's not argue about NVMe vs HDD speeds in general, we all agree on that. But when doing a PBS restore - in my testing - there was no significant difference between the two. Therefore the bottleneck must be somewhere else:
- Network? - 10 Gbps (alas, in the second try it was 10 Gbps)
- CPU? - not fully utilized
- Disks? - might be the reason... let's try with enterprise NVMe hardware -> same result

And that's why I (re)opened this whole topic.

Once you see what I've seen, explained above, you go to the forums and communicate it. Because it's not normal for software not to utilize hardware fully (or... it is quite normal unfortunately, but that's another topic). With PVE being quite performant even on older machines, Linux-based, OSS-inspired and all - I thought I wouldn't have to explain to anyone that software which is not utilizing the full hardware potential might be something you'd actually want to look into, not try to reject.

I just want to point out - what might seem "obvious" to you might actually be factually wrong (see the hashsum algorithm point above).
The answer about the hashsum is pure dogma - no arguments given. You can't use a "no argument" as an argument in further conversation - really.

I otherwise agree with you that I can't tell for sure where the problem lies. In past experience it turned out I do have a good nose for this stuff. My nose is currently sniffing around: hashing, TLS and single-connection transfer. In a couple of years perhaps you can tell me how far off I was. At this point neither of us knows. Right now I'm just sniffing and giving my advice. And you're... trying to find ways to reject it, more or less, it would seem.

If things seem suboptimal, it's usually not because we are too lazy to do a simple change, but because fixing them is more complicated than it looks ;)
Yeah, I lead a development team. I hear this a lot. I'll... politely skip the part where I tell you which of the above turns out to be true most of the time - in my experience ;)
But that's not the point anyway, something else is. Just because things seem difficult at first doesn't mean they actually are difficult to implement, and above all, it absolutely doesn't mean we shouldn't dig there. Quite the opposite. In my company we dig a lot. My team always hates the idea - initially. Always. Then they are super happy a couple of days later about how they solved the big complex issue. It's not always like that, of course, but I can confidently say that more than 50% of the time I get "are you crazy, this will take us 6 months, no chance", then usually the same night I receive tons of messages from one of the developers, who then also finishes the change either the same night or within a few days, in total euphoria, with "this is great, this will work awesomely now" - and everyone else suddenly agreeing. The same people that were 100% against even thinking about touching it, just 2 days before, are in epiphany.

Just saying... Most of the time things look much more difficult than they actually are. And assuming in advance that something is complex and difficult, and spreading that assumption around, blocks developers from even considering changes to the code. And then there's laziness, which helps a lot to quickly agree on that point. That's actually the biggest single thing that stops progress in existing code. Don't think that way, and don't say it, until you know for sure what it actually takes to make the change. Also, I know which things are complex and which are not. Implementing another hash shouldn't be that complex -> after a developer is allowed to dig a bit into the code and see what actually needs to be done. That should happen before he is given assumptions, before thinking "it's more complicated than it looks". Because in most cases it's not that complicated after all.

I am not saying this to make you angry - but your CPU is actually rather old and, I suspect, the bottleneck in this case.
As long as the CPU is not 100% utilized then, by definition, the CPU cannot be said to be the bottleneck.
And since we know the CPU is not fully utilized, I can actually state it here: the CPU is not the bottleneck, so let's move away from that assumption.

I understand that some part of the processing might be single-core bound (although, monitoring CPU usage, I'm not seeing that here) - in which case, fine, the CPU was the bottleneck after all. But in that case we should investigate ways to enable the processing to utilize multiple cores. That might be achievable without changing much code (not saying it is). But the mere "your CPU is actually rather old", while at the same time my CPU is wasting cycles, well... That's maybe a fine answer for an end customer looking for a quick solution - now. Go buy new hardware and try. Fine. But for an engineer, a developer, someone who's trying to make software perform better with the hardware at its disposal? Not a valid answer. The CPU is not hitting the roof, so... let's try to find the cause and see how we can make better use of the CPU.
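Just to make the "use more cores" part concrete (this is not PBS code - the Chunk type and function here are invented for the example, and PBS may well already parallelize some of this internally), a work-stealing pool like rayon makes spreading per-chunk digest checks across all cores nearly a one-liner:

```rust
// Illustration only: verify many chunks on all available cores with rayon.
// Assumed Cargo.toml deps: rayon = "1", sha2 = "0.10". The Chunk layout here
// is made up for the example and is not PBS's actual on-disk format.
use rayon::prelude::*;
use sha2::{Digest, Sha256};

struct Chunk {
    data: Vec<u8>,
    expected_digest: [u8; 32],
}

/// Returns the number of corrupt chunks; the hashing work is spread over all cores.
fn verify_all(chunks: &[Chunk]) -> usize {
    chunks
        .par_iter() // rayon splits the slice across its worker thread pool
        .filter(|chunk| Sha256::digest(&chunk.data).as_slice() != &chunk.expected_digest[..])
        .count()
}

fn main() {
    let data = vec![1u8; 4 * 1024 * 1024];
    let digest: [u8; 32] = Sha256::digest(&data).as_slice().try_into().unwrap();
    let chunks: Vec<Chunk> = (0..64)
        .map(|_| Chunk { data: data.clone(), expected_digest: digest })
        .collect();
    println!("corrupt chunks: {}", verify_all(&chunks));
}
```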

For that we need to identify where most of the time is spent during processing and what is waiting for what. Here, profiling the process, measuring different parts, etc. is needed. This is where developers would need to jump in.
Or we can "blindly" (trusting my nose) try something and see if it helps.

That being said - feedback is always welcome as long as it's brought forward in a constructive manner.
I completely agree. And I hope you don't find me unconstructive. I do have a rather low tolerance for vague answers, a technically incorrect approach to fixing problems - and BS in general (not saying your posts are BS, they are not, in general).

I can give you some more details why the "TLS benchmark" results are faster than your real world restore "line speed":

The TLS benchmark only uploads the same blob of semi-random data over and over and measures throughput - there is no processing done on either end. An actual restore chunk request has to find and load the chunk from disk, send its contents over the wire, then parse, decode and verify the chunk on the client side, and write the chunk to the target disk.
Ok, I understand.
I could test with another TLS benchmark just to make sure the numbers agree. The way you describe it, it seems that a web server with TLS configured, with me downloading a file from it, should get me roughly the same result in terms of GB/s. I'll try to test when I have time and report back. This is just to rule out the TLS implementation itself as a potential issue (I don't expect to find an issue, this is just to be thorough).

While our code tries to do things concurrently, there is always some overhead involved. In particular with restoring VMs, there was some limitation within Qemu that forces us to process one chunk after the other, because writing from multiple threads to the same disk wasn't possible - I am not sure whether that has been lifted in the meantime. You could try figuring out whether a plain proxmox-backup-client restore of the big disk in your example is faster.
OK, this could pose some issues if you try to use parallel transfers. But we could still fetch chunks in parallel and rearrange them at the receiving end (a bit more complicated, but not too much)...
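A minimal sketch of that idea, purely as an illustration (standard-library Rust only, nothing to do with PBS's or QEMU's actual code): several workers fetch chunks concurrently, while a single writer buffers out-of-order arrivals and writes strictly in offset order, which is what a one-writer-per-disk constraint would require.

```rust
// Illustration only: download chunks with several workers, but write them out
// strictly in order through a single writer (mimicking a one-writer constraint).
use std::collections::BTreeMap;
use std::sync::mpsc;
use std::thread;

/// Stand-in for "fetch chunk <index> over the network".
fn fetch_chunk(index: usize) -> Vec<u8> {
    vec![index as u8; 4] // pretend payload
}

fn main() {
    const CHUNKS: usize = 16;
    const WORKERS: usize = 4;

    let (tx, rx) = mpsc::channel::<(usize, Vec<u8>)>();

    // Each worker fetches every WORKERS-th chunk, so fetches overlap in time.
    let mut handles = Vec::new();
    for worker in 0..WORKERS {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            for index in (worker..CHUNKS).step_by(WORKERS) {
                tx.send((index, fetch_chunk(index))).unwrap();
            }
        }));
    }
    drop(tx); // close the channel once all workers are done

    // Single writer: buffer out-of-order arrivals, emit strictly by offset.
    let mut pending = BTreeMap::new();
    let mut next = 0usize;
    for (index, data) in rx {
        pending.insert(index, data);
        while let Some(data) = pending.remove(&next) {
            println!("writing chunk {} ({} bytes) at its offset", next, data.len());
            next += 1;
        }
    }

    for handle in handles {
        handle.join().unwrap();
    }
    assert_eq!(next, CHUNKS);
}
```

The same pattern works with async tasks instead of threads; the essential part is only the small reordering buffer in front of the single writer.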

There are still potential gains with faster hashing and without TLS, if you would be willing to try such code.

Kind regards

(edited some typos and grammar mistakes to make the text easier to read)
 
Replying to my own message here...
please explain what is the reason behind SHA-256 as I must be missing something.
Actually there is another point - collision resistance. If you are using SHA-256 as a chunk identifier, then I'm not sure how safe xxHash-128 is here; I don't know about its collision resistance.

Perhaps it could be used in conjunction with some faster crypto hash (like MD5, which is not safe enough on its own) to make it collision-safe overall, but I'm guessing here. One would need to test whether it's overall more performant in such a combination, and then someone smarter than me would have to decide on its collision resistance.
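For what it's worth, the usual birthday-bound estimate for accidental collisions (it says nothing about deliberately constructed collisions, which is exactly where a non-cryptographic hash gives no guarantees) looks like this:

```latex
% Probability of at least one accidental digest collision among n chunks
% with a d-bit digest (birthday bound):
p_{\text{collision}} \approx \frac{n^2}{2^{d+1}}
% Example: n = 2^{30} chunks (about 4 PiB of 4 MiB chunks)
%   d = 128 (e.g. xxHash3-128): p \approx 2^{60-129} = 2^{-69}
%   d = 256 (SHA-256):          p \approx 2^{60-257} = 2^{-197}
```

Both numbers are negligible for accidental collisions; the property that matters for deduplication is that SHA-256 also makes it computationally infeasible to construct two different chunks with the same digest, which a non-cryptographic hash does not attempt to provide.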

Didn't think of this at first. You could have noted it, though, so I could have skipped that chapter ;)
Is that the big reason for using SHA-256?

Kind regards
 
Aahhh... of course it is. Yeah, I completely forgot about the whole dedupe stuff (which is brilliant btw).
Well, now I have to apologize for being too pushy with xxHash, and I do take that back. I won't delete those parts of my post, as that wouldn't be fair.

OK. Perhaps we can do something with the other points -> not excluding the "faster hash" argument completely right away -> it's just that now it's going to be much more difficult for me to push it, since collision resistance obviously plays an extremely important part in the deduplication logic - and another hash could easily break the statistical collision guarantees... so yeah, I understand the why now.

But the TLS and multiple-TCP-connection arguments are still valid - hopefully we can talk about those.

Kind regards
 
You are only viewing this problem from a single perspective, that of higher-latency links, even though I already gave you some other examples previously. Let me repeat them here:
- LACP will never utilize more than a single link's capacity, because there's only one connection / one stream of data going on
- on congested links (think offsite backups), backup traffic will end up worse when competing for bandwidth with other traffic than it would if there were more connections in parallel instead of just 1
- also, server network hardware and drivers may well benefit from multiple TCP streams, as they could then be handled by multiple CPUs

nobody is saying we won't ever implement some sort of multi-connection transport - but there is considerable overhead in doing so (both code complexity and actual resources consumed), so it's not a panacea either.

All of these points matter (and there are possibly other points I did not identify) and none of them can benefit, because of this choice that was made - to always use only 1 TCP connection for backup traffic. You can't ignore all of these other points and pretend the problem exists only on high-latency links.

I am not saying it's only an issue with high latency links, just that that is the most prominent one that repeatedly comes up. I fully agree that there are setups out there that would have better performance with data flowing over multiple actual TCP connections. there are also setups where that isn't the case (i.e., a busy PBS server handling multiple clients will run into contention issues).

On "Why your TLS benchmark speed is so low I cannot say - it might be that your CPU just can't handle the throughput for the chosen cipher/crypto?" I also don't know why it's slow. I can only tell you that what I have observed and that is: my CPU has free cycles. And that clearly means there is room for improvement. Because if my CPUs are not all 100% utilized, what that means is that PBS is not using all the available hardware. But it should. Or, we'd love that it does. Because then things would run faster - and that's the goal, right ? So, it doesn't matter what CPU type I have and how old it is. If it's idling, then it's not the "CPU-s age" that is at fault, it's the solution can't utilize the CPU fully. Yes it will run better on a better CPU, what doesn't, but that's no solution - it can also run faster on a newer CPU if software was using the CPU potential better.

CPU age is not a valid answer for poor performance. There are some exceptions around instruction sets, if the CPU doesn't support them, but that doesn't apply here - and it shouldn't apply while the CPU is not fully utilized in any case.

that's unfortunately not true for crypto in general, which tends to be highly optimized code - depending on the implementation, a CPU can be maxed out and not be 100% busy (because of things like vector instructions). it would require a detailed perf analysis of your specific system to show where the actual bottleneck is coming from.

Could you elaborate on this please? What else are you protecting?
If you are aiming for "this way we can guarantee that no bad actor changed the contents of the file", I am willing to argue against that. When someone has access to the files in your backup datastore you are more or less doomed at that point already, so how does a crypto hash help? Trying to protect against intentional modification of data at rest with crypto hashes is... very debatable, at a minimum. Especially when/if it causes high costs (in newer servers and CPU power) for a user. Yes, SHA means that someone with access to chunks will have a harder time corrupting files while retaining the hash than with a non-cryptographic checksum. But don't tell me that's the reason...

I know the benefits though: the speed of calculation is orders of magnitude higher than SHA. Therefore I actually think it's very valid for this application, if not perfect. You say it is not - so please explain the reason behind SHA-256, as I must be missing something.

Even if we don't agree on this, I think users should still have a choice to decide on their own. "I'll use the less secure hash because it's 50x faster" is something I might choose to do, given my level of trust in the other protections I put in place on my backup machines - and based on my own risk evaluation. We should all do test restores anyway (that includes starting the VM/CT) as best practice, right? So... what exactly are we losing with a fast non-crypto hasher? What is so important that you won't even consider a non-crypto hash in this case - I wish to know. Sorry, I'm an engineer, so dogma is not an answer I can work with.

that is covered below :)

No, it doesn't need to break existing clients or servers. I even noted in my previous post an example of how this could be achieved. Without looking at the code, really - but there are many ways backward compatibility can be retained when introducing a new feature (in this case, a new hash or compression algorithm for chunks).

By backward compatibility I mean: a newer PBS version will be able to read and work with a datastore that was used with previous versions - not vice versa, obviously. But that's OK, that's pretty standard behavior. In a case where a newer client created a chunk and an older client then tries to restore it - that wouldn't work, true. But you could configure the system NOT to use the new format by default, so one has to turn it on consciously - maybe even make a new datastore, if you wish, to unlock the new features. Then make sure older clients can't use the new datastore and you're set. Nothing breaks then. In my opinion at least, that's not prohibitively complicated.

we actually do try very hard with PBS to have compatibility in both directions where possible, and that includes things like being able to sync from a newer server to an older one and vice-versa.

for example, the recently introduced change detection mechanism for host type backups (split pxar archives) solely broke browsing those backup contents using the API/GUI, while retaining full compat for everything else including sync, verify and co.

that doesn't mean we don't ever do format changes that break this, but
- we don't do so lightly, there needs to be a *massive* benefit
- we usually make such features opt-in for quite a while before making them the default, to allow most involved systems to gain support in the meantime while still giving users the choice to adopt it faster

I might also note that you will hardly find "something that offers substantial improvements" if you don't agree to try anything. I think this would offer substantial improvements. With this being:
- xxHash for checksums
- multiple TCP streams (they can remain HTTP/2 connections if you wish, just use more of them in parallel)
- an option to disable TLS for transfer (it doesn't have to be complex: just use another port for non-TLS API access and put a listener without TLS there; later you could block API routes and selectively leave only those for chunk transfer accessible on the non-TLS API, but for testing... it shouldn't be complicated at all to bring up a non-TLS API endpoint)

the last point I am not sure we will ever implement - not because I don't agree that it might give a performance boost, but because offering such potential footguns that can have catastrophic consequences is something we don't like to do. similarly, ssh doesn't have a built-in option to disable encryption, our APIs are not available over non-TLS (except for a redirect to the TLS one), and so forth. it's the year 2024, transports without encryption are dangerous (and yes, this includes local networks, unless we are talking about direct PTP between two highly secured, trusted endpoints). if I could travel back in time, I'd also not implement migration_mode insecure for PVE ;) but that is my personal opinion, there might be other developers with a different view point.

Try it. Then tell me if it qualifies as substantial improvement - or not. Test on some older lab hardware as well.

I'll repeat this once more - trying different transport mechanisms is something we are planning to do, this includes:
- Quic
- multi-stream
- RusTLS vs the currently used OpenSSL
- anything else that looks promising when we sit down and do this experiment

I said "in my testing". Restore speed from NVMe datastore was not any faster than restoring from HDD datastore. Reason for this is probably that HDD datastore is on ZFS that had all metadata from .chunks already in RAM. That is roughly the same as having ZFS with HDD-s + special vdev on NVMe or SSD (I say roughly the same, as RAM is still faster, but metadata on SSD does help a lot when metadata cache is empty).

Let's not argue about NVMe vs HDD speeds in general, we all agree on that. But when doing a PBS restore - in my testing - there was no significant difference between the two. Therefore the bottleneck must be somewhere else:
- Network? - 10 Gbps (alas, in the second try it was 10 Gbps)
- CPU? - not fully utilized

I am fairly certain it's this one, but see above ;)

- Disks? - might be the reason... let's try with enterprise NVMe hardware -> same result

And that's why I (re)opened this whole topic.

Once you see what I've seen, explained above, you go to the forums and communicate it. Because it's not normal for software not to utilize hardware fully (or... it is quite normal unfortunately, but that's another topic). With PVE being quite performant even on older machines, Linux-based, OSS-inspired and all - I thought I wouldn't have to explain to anyone that software which is not utilizing the full hardware potential might be something you'd actually want to look into, not try to reject.

there are areas where we know there is more that can be done, but most of the low-hanging fruit is already done. and what hurts on one system works well on another and vice-versa; providing the right number and kind of knobs, without providing too many footguns, is not an easy endeavour ;)

I otherwise agree with you that I can't tell for sure where the problem lies. In past experience it turned out I do have a good nose for this stuff. My nose is currently sniffing around: hashing, TLS and single-connection transfer. In a couple of years perhaps you can tell me how far off I was. At this point neither of us knows. Right now I'm just sniffing and giving my advice. And you're... trying to find ways to reject it, more or less, it would seem.


Yeah, I lead a development team. I hear this a lot. I'll... politely skip the part where I tell you which of the above turns out to be true most of the time - in my experience ;)
But that's not the point anyway, something else is. Just because things seem difficult at first doesn't mean they actually are difficult to implement, and above all, it absolutely doesn't mean we shouldn't dig there. Quite the opposite. In my company we dig a lot. My team always hates the idea - initially. Always. Then they are super happy a couple of days later about how they solved the big complex issue. It's not always like that, of course, but I can confidently say that more than 50% of the time I get "are you crazy, this will take us 6 months, no chance", then usually the same night I receive tons of messages from one of the developers, who then also finishes the change either the same night or within a few days, in total euphoria, with "this is great, this will work awesomely now" - and everyone else suddenly agreeing. The same people that were 100% against even thinking about touching it, just 2 days before, are in epiphany.

Just saying... Most of the time things look much more difficult than they actually are. And assuming in advance that something is complex and difficult, and spreading that assumption around, blocks developers from even considering changes to the code. And then there's laziness, which helps a lot to quickly agree on that point. That's actually the biggest single thing that stops progress in existing code. Don't think that way, and don't say it, until you know for sure what it actually takes to make the change. Also, I know which things are complex and which are not. Implementing another hash shouldn't be that complex -> after a developer is allowed to dig a bit into the code and see what actually needs to be done. That should happen before he is given assumptions, before thinking "it's more complicated than it looks". Because in most cases it's not that complicated after all.



As long as the CPU is not 100% utilized then, by definition, the CPU cannot be said to be the bottleneck.
And since we know the CPU is not fully utilized, I can actually state it here: the CPU is not the bottleneck, so let's move away from that assumption.

it would be interesting to get perf data for both ends during a TLS benchmark, it might show some optimization potential for systems like yours. if you want to try that, I can provide more guidance on how to get that running.

Aahhh... of course it is. Yeah, I completely forgot about the whole dedupe stuff (which is brilliant btw).
Well, now I have to apologize for being too pushy with xxHash, and I do take that back. I won't delete those parts of my post, as that wouldn't be fair.

OK. Perhaps we can do something with the other points -> not excluding the "faster hash" argument completely right away -> it's just that now it's going to be much more difficult for me to push it, since collision resistance obviously plays an extremely important part in the deduplication logic - and another hash could easily break the statistical collision guarantees... so yeah, I understand the why now.

yeah - we rely on the properties of sha256, but that maybe could be documented more prominently :) switching to a new scheme with similar properties but better performance would in principle be doable - but it would require either breaking the deduplication between old and new, or some semi-expensive compatibility mechanism.. you already found the corresponding bug tracker entry ;)
 
Hi fabian,
Thank you for your response. I can agree with most of what you replied... except this part:

the last point I am not sure we will ever implement - not because I don't agree that it might give a performance boost, but because offering such potential footguns that can have catastrophic consequences is something we don't like to do. similarly, ssh doesn't have a built-in option to disable encryption, our APIs are not available over non-TLS (except for a redirect to the TLS one), and so forth. it's the year 2024, transports without encryption are dangerous (and yes, this includes local networks, unless we are talking about direct PTP between two highly secured, trusted endpoints). if I could travel back in time, I'd also not implement migration_mode insecure for PVE ;) but that is my personal opinion, there might be other developers with a different view point.

This holds true... for public web sites. It doesn't really translate to server environments, which are normally isolated with VLANs for different traffic types.

Footguns?? Man, we're Linux users... rm -fR? dd? hdparm? zfs destroy? Those won't even ask for confirmation.

ssh? Yes. But that's a secure alternative to telnet (it actually has "secure" in its name and is there to provide a secure layer on top of pre-existing insecure stuff). Using ssh as an example here is like saying that TLS is not offered without encryption :) c'mon

Anyway, with ssh in the system, that same system is in no way stopping you from using telnet, if you want to. Or FTP, or whatever. Btw, NFS and SMB are also not encrypted by default, as far as I know - and we surely can't say these are not used - in non-isolated environments either. And what about iSCSI? Would you then ban iSCSI and NFS in server environments, by the same argument? I wonder how many users would support such a decision.
Perhaps we should use AES-256 on SATA as well (I mean, there's eSATA), on USB for sure, on SCSI, on everything. How is that different? Hell, we need it on monitors as well - they use a cable. Really, I read recently that someone successfully sniffed the image from the EM emissions of a monitor cable, so... that's being used in the wild! That should get encrypted immediately - with no option to disable it! Who cares how much it costs - buy more hardware, you need to stay safe. I can go further with this nonsense... but I think you get the point.

"it's the year 2024, transports without encryption are dangerous" -> this is waaaaay too general, this can't/shouldn't be applied universally. Honestly I'm pretty shocked by such a broad statement.

Besides, as you mentioned, you already have it for migration traffic. I guess because someone was pushing it hard enough and you had to let go. Which is a good thing. And it's not the default, so you can sleep calmly at night. You do also support NFS and iSCSI in Proxmox, right? And that is for access to production block devices, right? And... where's the footgun protection here, that we desperately need to survive in 2024? How do you defend that? In what way is safety here less important (as it's allowed without encryption) than in backup transfers? Why do you think backup transfer links would be any less protected than iSCSI or NFS links - in any serious production setup, where it actually matters?

Anyway, my take here is: you can protect users by using safe defaults, if you want to - not by taking away options.

Hopefully, more users can chime in on this one.

Kind regards
 
