RAM latency/write issues

harmonyp

Member
Nov 26, 2020
A few days ago I noticed that my server was getting drastically slower. Memory benchmarks show huge latency and throughput problems. I have been running

Code:
sudo memtester 51900 5

overnight and it has only just finished loop 1 of 5. When I first tested with 1GB, I noticed it took significantly longer (2-3x) than on other servers.

What other tests can I run to diagnose what the problem is? I don't want to have the server offline for days for memtest86 to fully run unless there is no alternative.


sudo lshw -short -C memory
Code:
H/W path            Device      Class          Description
==========================================================
/0/0                            memory         64KiB BIOS
/0/20                           memory         1TiB System Memory
/0/20/0                         memory         [empty]
/0/20/1                         memory         128GiB DIMM DDR4 Synchronous LRDIMM 2933 MHz (0.3 ns)
/0/20/2                         memory         [empty]
/0/20/3                         memory         128GiB DIMM DDR4 Synchronous LRDIMM 2933 MHz (0.3 ns)
/0/20/4                         memory         [empty]
/0/20/5                         memory         128GiB DIMM DDR4 Synchronous LRDIMM 2933 MHz (0.3 ns)
/0/20/6                         memory         [empty]
/0/20/7                         memory         128GiB DIMM DDR4 Synchronous LRDIMM 2933 MHz (0.3 ns)
/0/20/8                         memory         [empty]
/0/20/9                         memory         128GiB DIMM DDR4 Synchronous LRDIMM 2933 MHz (0.3 ns)
/0/20/a                         memory         [empty]
/0/20/b                         memory         128GiB DIMM DDR4 Synchronous LRDIMM 2933 MHz (0.3 ns)
/0/20/c                         memory         [empty]
/0/20/d                         memory         128GiB DIMM DDR4 Synchronous LRDIMM 2933 MHz (0.3 ns)
/0/20/e                         memory         [empty]
/0/20/f                         memory         128GiB DIMM DDR4 Synchronous LRDIMM 2933 MHz (0.3 ns)
/0/23                           memory         3MiB L1 cache
/0/24                           memory         24MiB L2 cache
/0/25                           memory         256MiB L3 cache

sysbench --test=memory --memory-block-size=4G --memory-total-size=32G run
Code:
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Running memory speed test with the following options:
  block size: 4194304KiB
  total size: 32768MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 2 (    0.14 per second)

8192.00 MiB transferred (584.82 MiB/sec)

General statistics:
    total time:                          14.0036s
    total number of events:              2

Latency (ms):
         min:                                 6983.15
         avg:                                 7001.23
         max:                                 7019.32
         95th percentile:                     6960.17
         sum:                                14002.46

Threads fairness:
    events (avg/stddev):           2.0000/0.00
    execution time (avg/stddev):   14.0025/0.00

sysbench --test=memory --memory-block-size=1K --memory-total-size=100G --num-threads=1 run
Code:
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Running memory speed test with the following options:
  block size: 1KiB
  total size: 102400MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 40415255 (4040170.13 per second)

39468.02 MiB transferred (3945.48 MiB/sec)

General statistics:
    total time:                          10.0004s
    total number of events:              40415255

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    4.03
         95th percentile:                        0.00
         sum:                                 3813.57

Threads fairness:
    events (avg/stddev):           40415255.0000/0.00
    execution time (avg/stddev):   3.8136/0.00

sysbench --test=memory --memory-block-size=1M --memory-total-size=10G run
Code:
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Running memory speed test with the following options:
  block size: 1024KiB
  total size: 10240MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 10240 ( 9741.62 per second)

10240.00 MiB transferred (9741.62 MiB/sec)

General statistics:
    total time:                          1.0482s
    total number of events:              10240

Latency (ms):
         min:                                    0.05
         avg:                                    0.10
         max:                                    1.69
         95th percentile:                        0.27
         sum:                                 1034.31

Threads fairness:
    events (avg/stddev):           10240.0000/0.00
    execution time (avg/stddev):   1.0343/0.00

dmidecode -t memory - https://pastebin.com/45tfdbek
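
To make runs like the three above easier to compare side by side, here is a minimal sketch that repeats the write test over several block sizes and prints one throughput figure per size. It uses the `sysbench memory ... run` form that the deprecation warning in the output points to, with `--threads` in place of `--num-threads`. The function names `extract_mib_per_sec` and `sweep_block_sizes`, and the block sizes in the example, are my own choices for illustration and not something sysbench provides.

```shell
#!/bin/sh
# Pull the throughput figure out of sysbench's summary line, e.g.
#   "10240.00 MiB transferred (9741.62 MiB/sec)"  ->  9741.62
extract_mib_per_sec() {
    sed -n 's/.*transferred (\([0-9.]*\) MiB\/sec).*/\1/p'
}

# Run the single-threaded write test once per block size (hypothetical helper).
# The total size is fixed at 10G for every run; a slow run may still be cut
# short by sysbench's time limit, as the 1K run above was at 10.0004s.
sweep_block_sizes() {
    for bs in "$@"; do
        rate=$(sysbench memory --memory-block-size="$bs" \
                   --memory-total-size=10G --threads=1 run | extract_mib_per_sec)
        printf '%s\t%s MiB/sec\n' "$bs" "$rate"
    done
}

# Example: sweep_block_sizes 1K 64K 1M 64M
```

Running the same sweep on a known-good server with similar hardware would give a baseline to compare against, which is more telling than the absolute numbers alone.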
 
My guess is that your DRAM chips are failing but ECC is able to correct the errors, which makes memory access (very) slow. There may be a clue in the system logs, but sometimes EDAC does not report anything to the OS because the platform already handles the errors transparently (platform-first versus OS-first error handling).
I would suggest removing your DIMMs one at a time until the problem goes away, then replacing the DIMM you removed last. Hopefully it is only one of them.
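
For the log side of this, here is a minimal sketch of where to look on Linux. It assumes the kernel's EDAC driver is loaded for this platform, which may not be the case, and firmware can still hide corrected errors from the OS as described above. The `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/` are the kernel's corrected and uncorrected error counters per memory controller. The function name `edac_counts` and its optional directory argument are my own additions for illustration.

```shell
#!/bin/sh
# Print corrected (ce) and uncorrected (ue) error counts per memory controller.
# $1 is the EDAC sysfs directory; it defaults to the standard Linux location.
edac_counts() {
    base=${1:-/sys/devices/system/edac/mc}
    found=0
    for mc in "$base"/mc*; do
        # Skip anything that is not a controller directory with counters.
        [ -r "$mc/ce_count" ] || continue
        found=1
        printf '%s ce=%s ue=%s\n' "${mc##*/}" \
            "$(cat "$mc/ce_count")" "$(cat "$mc/ue_count")"
    done
    [ "$found" -eq 1 ] || echo "no EDAC memory controllers found under $base"
}

# Usage:
#   edac_counts
# Kernel messages about memory errors, if any were logged:
#   sudo dmesg | grep -i -E 'edac|mce|hardware error'
```

A non-zero `ce` count that keeps growing would fit the ECC-correction theory. All zeros do not rule it out if the platform handles the errors in firmware.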
 
Unfortunately this is a server I rent, so getting them to remove and test each of the DIMMs is not something I think can happen.

Where are the error logs, and what do they look like? And is there any alternative testing I can do?