Greetings,
I decided to do some UMA and NUMA testing to optimize my VM performance. It's not exactly scientific, because the "load" is Plex, forced to transcode high-definition video. Plex is installed as a container on Rockstor, and Rockstor itself runs in a VM. The server is a dual-socket Xeon, 4c/8t per socket, with 24 GB of RAM attached to each CPU. I looked at two things: CPU load after transcoding for a while, and the output of numastat. The VM has 10 GB of RAM and all of the CPUs assigned to it, and the CPU type is passed to the VM as "host".
Tested two cases:
1) In the BIOS, Node Interleaving set to Enabled (UMA mode) and NUMA disabled on the VM
2) Node Interleaving set to Disabled (NUMA mode) and NUMA enabled on the VM (rough config sketch below)
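For reference, "NUMA enabled on the VM" just means the NUMA option on the VM itself. Here is a rough sketch of how case 2 gets set up from the Proxmox shell; <vmid> and the memory/CPU splits are placeholders rather than my exact values, and the numa0/numa1 binding lines are optional (I have not used them for this test):
Code:
# sketch only - <vmid> and values are placeholders, not copied from my box
qm set <vmid> --cpu host --sockets 2 --cores 8 --memory 10240
qm set <vmid> --numa 1
# optionally, pin guest NUMA nodes to host nodes (qemu-server "numaN" syntax):
qm set <vmid> --numa0 cpus=0-7,hostnodes=0,memory=5120,policy=bind
qm set <vmid> --numa1 cpus=8-15,hostnodes=1,memory=5120,policy=bind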
The results were weird. Case 1 seems to perform slightly better: lower CPU usage and smaller spikes. I won't post numastat for this case, since everything is seen as one node and no numa_miss or numa_foreign is registered. From everything I've read, I expected case 2 to be the better performer, especially since Linux (both Proxmox and Rockstor are Linux-based) is a NUMA-aware OS, but instead average CPU usage seemed higher, though with fewer spikes. What is even more interesting is the numastat output.
Here is the numastat output on the Proxmox "node" (the host):
Code:
root@vm:~# numastat
                           node0           node1
numa_hit                 3170118         2300968
numa_miss                      0               0
numa_foreign                   0               0
interleave_hit             29895           30151
local_node               3169777         2269656
other_node                   341           31312
Here is numastat from Rockstor itself:
Code:
[georgi@rockstor ~]$ numastat
                           node0           node1
numa_hit                 6823544         8317971
numa_miss                      0          955161
numa_foreign              955161               0
interleave_hit             10848           10632
local_node               6815319         8314150
other_node                  8225          958982
So there is a conflict: Proxmox says there are no misses, but Rockstor, inside the VM, says there are a lot of them. So I checked how the nodes were organized:
Proxmox:
Code:
root@vm:~# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14
node 0 size: 24189 MB
node 0 free: 13467 MB
node 1 cpus: 1 3 5 7 9 11 13 15
node 1 size: 24092 MB
node 1 free: 15352 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
Rockstor:
Code:
[georgi@rockstor ~]$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 4959 MB
node 0 free: 1262 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 5037 MB
node 1 free: 713 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
You might notice that the CPU grouping per node is different in Proxmox and Rockstor, which is what I think the problem is. Even when the CPU type is passed as "host", the topology isn't presented accurately: Rockstor thinks it has 8 cores per socket with 1 thread per core, while the correct description is 4 cores per socket with 2 threads per core. I'm thinking this is why Rockstor lays out its nodes all wrong and performance is worse.
Can I get around that somehow?!
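In case it helps frame the question, here is what I'm thinking of trying next. The lscpu line is just to confirm what the guest actually sees; the args line is a guess based on plain QEMU's -smp syntax, and I'm not sure how it interacts with the -smp that Proxmox already generates from sockets/cores, so treat it as untested:
Code:
# inside the guest: confirm how the virtual topology is presented
lscpu | grep -E 'Socket|Core|Thread|NUMA'

# on the Proxmox host: try forcing an explicit sockets/cores/threads layout
# via raw QEMU arguments (QEMU -smp syntax; untested alongside Proxmox's own -smp)
qm set <vmid> --args '-smp 16,sockets=2,cores=4,threads=2'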