Poor Memory Performance in Host and KVM Guests

syadnom
This is a bit of an extension of this thread:
https://forum.proxmox.com/threads/kvm-performance-issues.42635/#post-204952

*but*
My identification of the issue in that thread was misguided.

What I've found is that memory performance on the host is just 277MB/s, as measured with the sysbench --test=memory benchmark. I thought maybe there was a problem with the hardware, but running memtest86+ I get 3346MB/s.

This is a huge performance issue, and it's on a production box. Unfortunately, I didn't notice the problem until the machine was fully installed and basic functionality had been tested.

I've tried the boot option pti=off, which was suggested on IRC. That made no difference.
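For the record, I set pti=off via GRUB, roughly like this (assuming the stock GRUB setup; the exact kernel command line here is illustrative, not copied from my config):
Code:
# /etc/default/grub -- append pti=off to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet pti=off"

# regenerate the GRUB config and reboot for it to take effect
update-grub
reboot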

Enter LXC.

If I spin up an LXC container and run the benchmark within, I get somewhat better numbers. Not full speed, but up in the 550MB/s range.
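(For completeness, I ran the same benchmark inside the container via pct exec; the CT ID 100 is just a placeholder:)
Code:
# sysbench must be installed inside the container first
pct exec 100 -- sysbench --test=memory run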

An alternative benchmark I'm using is compiling Asterisk. In KVM, which shows the same ~277MB/s memory throughput, the compile takes 25 minutes. In LXC it's about 8 minutes. This hardware should do it in about 3.5 minutes.
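The compile test is nothing fancy, just a timed build, roughly along these lines (paths and job count are illustrative, not my exact invocation):
Code:
# from an unpacked Asterisk source tree
cd asterisk-*/
./configure
time make -j"$(nproc)"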

The memtest86+ run virtually rules out a bad module in the system. This is ECC RAM and there are no reported ECC errors in the kernel logs, so that's a double confirmation.
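(If anyone wants to check the same thing on their own box, corrected/uncorrected ECC errors surface through the EDAC subsystem, e.g.:)
Code:
# look for EDAC/ECC messages in the kernel log
dmesg | grep -iE 'edac|ecc'

# or, if the edac-utils package is installed:
edac-util -v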

I'm dying here with this Pentium 3-era memory throughput; any help would be really, really appreciated.

Thanks.
 
Honestly, that seems like a really big difference, too big...

Did you run the exact same test with the same tool (ideally the same version) on host and guest? I.e. in a Debian 9 VM and on PVE 5 (which is based on Debian 9):
Code:
sysbench --test=memory --memory-block-size=1M --memory-total-size=10G run

(comparing the results of two different tools does not tell you anything)

The difference should be more like <10%, not >1000%...

Compiling is also heavily CPU- and IO-bound, and less memory-bound, so maybe you should look in that direction rather than at memory...
(just a reasonable guess to make sense of your experience... it's simply too big a difference)
 
The memtest86+ run virtually rules out a bad module in the system.
How many passes did you let it run? I would recommend at least 2 to be sure.
 
One pass should be more than enough to surface performance issues. I'm not seeing malfunctions from bad bits here.

It took 2 hours 50 minutes for 1 pass on 24GB of ECC RAM. I think that's high, but the only system I could compare it to has DDR4; that one took 44 minutes for 24GB of ECC.
 
Hi,
what settings do you use in the BIOS for ECC (there are different modes)?
What are your power-saving settings in the BIOS?

Do you have mixed RAM (like 2R*4 + 1R*8)?

What does the output of the following look like?
Code:
dmidecode -t memory
A short test on a freshly installed 5.1, without any updates, on a Dell R620 (DDR3) shows much better values:
Code:
sysbench --test=memory --memory-block-size=1M --memory-total-size=10G run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1024K

Memory transfer size: 10240M

Memory operations type: write
Memory scope type: global
Threads started!
Done.

Operations performed: 10240 ( 9556.59 ops/sec)

10240.00 MB transferred (9556.59 MB/sec)


Test execution summary:
    total time:                          1.0715s
    total number of events:              10240
    total time taken by event execution: 1.0704
    per-request statistics:
         min:                                  0.10ms
         avg:                                  0.10ms
         max:                                  0.67ms
         approx.  95 percentile:               0.10ms

Threads fairness:
    events (avg/stddev):           10240.0000/0.00
    execution time (avg/stddev):   1.0704/0.00
Udo
 
No ECC options in the BIOS.
BIOS power saving is set to 'maximum performance'.
DIMMs are 6x 4GB DDR2-667.
dmidecode:
Code:
Handle 0x1100, DMI type 17, 23 bytes
Memory Device
    Array Handle: 0x1000
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 4096 MB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 1A
    Bank Locator: Not Specified
    Type: DDR2
    Type Detail: Synchronous
    Speed: 667 MHz

The only differences between the six entries are the Locator lines, as expected.
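(A quick way to eyeball all six entries at once, if anyone wants to repeat this:)
Code:
dmidecode -t memory | grep -E 'Locator|Size|Speed|Type:'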


In your test above, you are using 1M blocks. Try the same with 1k.
I can get decent numbers on 1M blocks, but that's not a common workload. Once I drop to 1k it's incredibly slow. Other systems I have still push >5200MB/s with 1k blocks.
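That is, the same command pattern as above, just with 1K blocks:
Code:
sysbench --test=memory --memory-block-size=1K --memory-total-size=10G run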
 
examples:

sys1 (the slow one):
Code:
Doing memory operations speed test
Memory block size: 1K

Memory transfer size: 10240M

Memory operations type: write
Memory scope type: global
Threads started!
Done.

Operations performed: 10485760 (191647.02 ops/sec)

10240.00 MB transferred (187.16 MB/sec)


Test execution summary:
    total time:                          54.7139s
    total number of events:              10485760
    total time taken by event execution: 41.4370
    per-request statistics:
         min:                                  0.00ms
         avg:                                  0.00ms
         max:                                  1.08ms
         approx.  95 percentile:               0.00ms

Threads fairness:
    events (avg/stddev):           10485760.0000/0.00
    execution time (avg/stddev):   41.4370/0.00

real    0m54.718s
user    0m15.540s
sys     0m39.172s


sys2 (the less powerful machine):
Code:
Threads started!

Total operations: 10485760 (5416619.58 per second)

10240.00 MiB transferred (5289.67 MiB/sec)


General statistics:
    total time:                          1.9345s
    total number of events:              10485760

Latency (ms):
         min:                                  0.00
         avg:                                  0.00
         max:                                  0.01
         95th percentile:                      0.00
         sum:                                860.01

Threads fairness:
    events (avg/stddev):           10485760.0000/0.00
    execution time (avg/stddev):   0.8600/0.00

real    0m1.940s
user    0m1.936s
sys     0m0.004s



Just look at the difference in execution times here: nearly 55 seconds to run the test on sys1 versus under 2 seconds on sys2.
 
No ECC options in the BIOS
Are you sure?
Any BIOS updates?
In your test above, you are using 1M blocks. Try the same with 1k.
I can get decent numbers on 1M blocks, but that's not a common workload. Once I drop to 1k it's incredibly slow. Other systems I have still push >5200MB/s with 1k blocks.
And why didn't you write in the first post that you were testing with 1k blocks?
And why do you compare this with memtest, which doesn't use 1k blocks?
Normally you should post your exact test command so that other people can see which values they get.

With 1k it's slower here too (about a third of the throughput remains):
Code:
10240.00 MB transferred (3089.75 MB/sec)
But your system has DDR2...
You compare the memory test between two systems, but I assume they don't have the same CPU/memory controller... so it's not comparable.

No ECC option in the BIOS plus this kind of performance looks to me like crappy hardware. Or a memory mismatch in type/bank... or both.

Udo
 
Yes, I stayed up super late last night to check the BIOS after hours. It's up to date and there are no knobs for memory. The BIOS is 'I20', which is HP's current listed maximum for this BL260c G5...

What is curious is that this seems to be a throughput limit between the RAM and the CPU. If I scale the threads up with --num-threads=8 I do get more speed, but the peak with 8 threads is still only 552MB/s, which doesn't compare to my other DDR2 system at 7276MB/s with 8 threads.
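(That's the same 1K-block run as before, just with more threads, i.e.:)
Code:
sysbench --test=memory --memory-block-size=1K --memory-total-size=10G --num-threads=8 run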

It's as if the FSB is cranked down :/ dmidecode says it's at 1333 MHz, which matches the chip's specs (Xeon E5420).

dmidecode --type=memory says these are 667 MHz DDR2, but I'm not sure whether that's reading the actual running speed or just the rated speed stored on the module.
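(To pull out just the speed fields; as far as I can tell, dmidecode of this vintage only reports the SPD-rated speed, while newer versions may add a separate configured-speed line:)
Code:
dmidecode --type=memory | grep -i speed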

Code:
# lshw -short -C memory
H/W path      Device   Class    Description
====================================================
/0/0                   memory   64KiB BIOS
/0/400/710             memory   128KiB L1 cache
/0/400/720             memory   12MiB L2 cache
/0/406/716             memory   128KiB L1 cache
/0/406/726             memory   12MiB L2 cache
/0/1000                memory   24GiB System Memory
/0/1000/0              memory   4GiB DIMM DDR2 Synchronous 667 MHz (1.5 ns)
/0/1000/1              memory   4GiB DIMM DDR2 Synchronous 667 MHz (1.5 ns)
/0/1000/2              memory   4GiB DIMM DDR2 Synchronous 667 MHz (1.5 ns)
/0/1000/3              memory   4GiB DIMM DDR2 Synchronous 667 MHz (1.5 ns)
/0/1000/4              memory   4GiB DIMM DDR2 Synchronous 667 MHz (1.5 ns)
/0/1000/5              memory   4GiB DIMM DDR2 Synchronous 667 MHz (1.5 ns)


So I'm stumped. Unless some guru who has been through this mess before shows up here, I'm screwed.
 
