Docker support in Proxmox

Can you elaborate? I thought QEMU and therefore PVE is able to use NUMA.
This is exactly what we want:
Code:
If you enable this feature, your system will try to arrange the resources such that a VM does have all its vCPUs on the same physical socket

Socket = NUMA node

The issue is that this doesn't work at all. Not even a bit.
And it's absolutely critical on single-socket Milan/Genoa/Bergamo, because they use 4 CCD chiplets on a single CPU.
You have to split those 4 chiplets into 4 NUMA nodes, because the communication between chiplets is slow as hell.
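If you want to see how the chiplets are actually exposed to the host, something like this shows the NUMA layout (assuming the BIOS is set to NPS4 or "L3 cache as NUMA domain"; the exact option names vary by board vendor):

Code:
# how many NUMA nodes the host exposes, and which CPUs and memory belong to each
numactl --hardware

# per-CPU view: logical CPU, its NUMA node, socket and core
lscpu --extended=CPU,NODE,SOCKET,CORE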

So you assume, according to the wiki, that all vCPUs of one VM should run on the same chiplet (or NUMA node).

But that's not the case: on the Proxmox host itself, vCPUs are just threads.
And those threads are not only spread randomly across all cores (ignoring NUMA completely); it's even worse, because each thread additionally gets rotated to other physical cores every 1-2 seconds.
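You can watch this happening yourself on the host. A rough sketch (VMID 100 and the pidfile path are just examples, adjust to your setup):

Code:
# PID of the QEMU process for VM 100
PID=$(cat /run/qemu-server/100.pid)

# show each vCPU thread (usually named "CPU n/KVM") and the physical CPU
# it currently runs on (PSR column), refreshed every second
watch -n1 "ps -T -o tid,psr,comm -p $PID | grep KVM"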

I didn't test on real dual-socket Intel systems, but I'd bet that the situation there is the same.
I don't have Intel systems where every single bit of performance matters.
But I do have 2 maxed-out Genoa servers where every bit of performance matters.

You can fix this yourself with CPU pinning, configured manually for each VM.
But with a lot of VMs, and especially if you move them between hosts, CPU pinning is simply impossible to maintain.
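For a single VM it looks roughly like this with taskset (purely an example; VMID 100 and the core range 0-7 are placeholders for one CCD, and if I remember right, newer PVE versions also have an affinity setting per VM that does essentially the same):

Code:
# pin every thread of VM 100 to cores 0-7 (one CCD / NUMA node)
PID=$(cat /run/qemu-server/100.pid)
for TID in $(ls /proc/$PID/task); do
    taskset -cp 0-7 "$TID"
done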

The performance impact on some tasks is between 40 and 300%, measurable even with simple tools like iperf.

Cheers
 
Maybe we should split this off into a new thread, this is much more interesting than another homelabber chiming in to ask for a new Docker GUI.

So you assume, according to the wiki, that all vCPUs of one VM should run on the same chiplet (or NUMA node).
No, I assume that they run on the NUMA node whose memory they use, to reduce inter-NUMA-node communication.

The issue is that this doesn't work at all. Not even a bit.
Where is your proof? You just provide anecdotal evidence, which is totally useless to interpret.

Trying to understand what you mean, I inspected my dual-socket Intel machines, and numastat shows this:

Code:
$ numastat
                           node0           node1
numa_hit            180134743078    129405155945
numa_miss             2028661746       461859704
numa_foreign           461859704      2028661746

Which shows that node0 has a miss ratio of 1.1% and node1 0.35%, both of which are far from "not even a bit". This is the worst numastat I found; others have even lower misses.
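(The ratio here is simply numa_miss / (numa_hit + numa_miss) per node; if you want to compute it directly from the numastat output, a one-liner like this should do it:)

Code:
numastat | awk '/numa_hit/  {h0=$2; h1=$3}
                /numa_miss/ {m0=$2; m1=$3}
                END {printf "node0: %.2f%%  node1: %.2f%%\n", 100*m0/(h0+m0), 100*m1/(h1+m1)}'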

I don't have any AMD machine at hand right now, so how does it look on your machine? Have you configured NUMA for EACH VM?
 
Which shows that node0 has a miss ratio of 1.1% and node1 0.35%, both of which are far from "not even a bit".
In the context of NUMA (Non-Uniform Memory Access) configurations, it's crucial to understand that the significance extends beyond just memory; the L3 cache plays a pivotal role. On Genoa platforms, L3 caches are distributed across NUMA nodes, with each cache supporting eight threads. This distribution is similar to how memory is handled, but with a key distinction:
  • Both L3 cache and memory are managed similarly by the Linux operating system. However, L3 cache operates significantly faster than memory DIMMs.
An important point to note is that misses in the L3 cache are not recorded by tools like numastat.

Data sharing between threads in a multithreaded application typically involves the L3 cache before accessing the memory:
  • If the data is relatively small and can be completely contained within the L3 cache, it remains there, allowing other threads on the same CPU immediate access.
  • However, if the data needs to be accessed by a thread on a different chiplet that does not share the same L3 cache, the data must traverse through the memory system.
This behavior underscores the performance implications in NUMA systems where data locality can significantly impact application performance.
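If you want to see which cores actually share an L3, the kernel exposes that directly (index3 is normally the L3 on these systems; check the level file next to it if unsure):

Code:
# logical CPUs that share an L3 cache with CPU 0
cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list

# table view: the CACHE column groups the cores per shared cache / CCD
lscpu --extended=CPU,NODE,CORE,CACHE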

And yes, this was so long ago that my previous posts about Intel systems are wrong.
I just remembered that on Intel systems you aren't affected as much, if at all, due to the monolithic design of the CPU.
There is indeed a difference: I did tests in May on my Intel dual-socket systems, and the impact of CPU pinning was almost none, while on Genoa I'm getting around twice the performance in some applications.

TBH, those tests were so long ago that I barely remember the details. But I can say for sure that this will get solved as more and more people/companies switch to chiplet-design CPUs.

BTW, I even tested Ryzen CPUs (they actually have 2 chiplets) and, for whatever reason, they show no performance benefit from CPU pinning either (like the Intel servers). The Ryzen CPUs have no NUMA, but with CPU pinning you don't need NUMA to test the impact.
Only the Genoa servers show a huge impact. They are not slow without pinning, still a lot faster than Intel, but exactly on par with Ryzen:
A VM with 4 cores on a Genoa 9374F (without pinning) vs. a Ryzen 5800X is almost equal performance-wise.
With pinning, the Genoa is almost twice as fast.

https://forum.proxmox.com/threads/iperf3-speed-same-node-vs-2-nodes-found-a-bug.146805/
Check my latest posts in that thread.
There are your proofs and whatever you want.
 
There are your proofs and whatever you want.
Thank you very much for the detailed explanation. I wasn't aware of the cache situation, which is completely plausible.

I just read up on the topic, but I have no AMD system accessible, so do you have time to check this out, or have you already checked it? It's fairly old, yet AFAIK not automatically set on PVE. The actual commit has moved and is now available here.
 
I conducted a series of experiments with different NUMA layouts and benchmarked the memory, and sadly you're right that QEMU (in its current configuration) is not able to allocate memory or CPU threads on their respective NUMA nodes, which leads to the problems you described. I really wonder why that is.
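For anyone who wants to reproduce a simplified version of this on the host: compare bandwidth with memory local to the benchmarked node against memory forced onto the other node (mbw and the 512 MiB size are just examples, any memory benchmark works):

Code:
# local: run on node 0, allocate on node 0
numactl --cpunodebind=0 --membind=0 mbw 512

# remote: run on node 0, force the allocation onto node 1
numactl --cpunodebind=0 --membind=1 mbw 512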
 
You can fix this yourself with CPU pinning, configured manually for each VM.
But with a lot of VMs, and especially if you move them between hosts, CPU pinning is simply impossible to maintain.
You will not solve the memory NUMA allocation with that, only the cache allocation. I just tested it with the mbw benchmark, and on the hypervisor the QEMU process got memory from both nodes (I have two). CPU pinning will give better performance, yet as you already stated, not that much on Intel. How much depends on how the memory ends up distributed over the NUMA nodes; pinning to all the wrong CPUs makes the problem significantly worse, up to 2.5x slower. That is worse than the default behaviour of cycling the threads around, so it may just be an extreme corner case.
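You can check the per-node distribution of a running VM's memory directly on the host (VMID 100 is just an example):

Code:
# per-NUMA-node memory breakdown of the QEMU process for VM 100
numastat -p $(cat /run/qemu-server/100.pid)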

I also noticed that on the host, anonymous hugepages are allocated for the VM.
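(You can see those per process as well; a quick check, again with VMID 100 as a placeholder:)

Code:
# transparent (anonymous) hugepages currently backing the VM process
grep AnonHugePages /proc/$(cat /run/qemu-server/100.pid)/smaps_rollup

# system-wide THP policy
cat /sys/kernel/mm/transparent_hugepage/enabled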
 
You will not solve the memory NUMA allocation
That is already available in the configuration file, yet not via the GUI and not automatically. I played around with it in this thread. It seems to work, and I am really interested in seeing whether it would be a solution for you and whether it is faster (and easier to set up than just running taskset).
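Roughly, what I mean looks like this in /etc/pve/qemu-server/<vmid>.conf (the core count, memory size and host node are only example values, syntax as documented for numa[n] in qm.conf; the same can be set with qm set <vmid> --numa0 ...):

Code:
cores: 8
memory: 8192
numa: 1
numa0: cpus=0-7,hostnodes=0,memory=8192,policy=bind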
 
Thanks for the interesting discussion on the NUMA issue. I was completely unaware of that, and just learned quite a lot.

It would be great if this entire discussion could be broken out of this thread and into its own. It's not really a Docker issue at all. I wonder if a mod could do that?

Getting this back on topic ...

Yes, I'm only a home server user, so I'm blissfully unaware of how Proxmox VE is used in commercial production infrastructure, but so far I've not seen a clear articulation of why Docker needs to be in Proxmox itself, rather than being managed through at least one VM or LXC container. Creating Docker Swarm or Kubernetes clusters in VM/LXC environments seems to be well-documented in the homelab space (e.g., it's all over my YouTube feeds, complicating otherwise simple projects ;) ), which makes me think it must be even more standardized in commercial production environments, where it needs to be fast, reproducible, and reliable.

Docker networking, Docker storage, and even its nomenclature for managing containers (start/stop vs. up/down) are nothing like VM and LXC management, which mostly share the same storage, networking, and general management (start, stop, restart, etc.) paradigm. So the existing shared LXC/VM UI principles couldn't just be adapted without serious redesign.

You'd need a completely separate UI for managing Docker's various components, and then if you do that, why aren't you implementing containerd more broadly? What about podman?

Well-developed, robust, and powerful GUI management tools for Docker/containerd/Kubernetes/Podman already exist. LXCs now support Docker well, so that's an option if your hardware and use case need it.

And then there's the exponential growth in support requests that Docker-in-the-GUI would generate from people trying to figure out how to make Docker work, who would then come to Proxmox for support because, hey, Docker is in there.

Also: Proxmox would be responsible for monitoring Docker's release schedule and doing regression and other testing to push updates to the Docker that ships with Proxmox. Odds are that the Docker shipped with Proxmox would never be the latest one, so people would try to install the latest one anyway, break it, and come here for help.

I'd much rather see Proxmox's dev team be allowed to focus on refining what's already there and implementing new features and continuing to surface existing features that only exist in config files into the GUI.
 