[BUG?] ZFS data corruption on Proxmox 4

My PC has not always had all 4 RAM slots filled, and I have never had any problem with RAM.

If you have a problem combining RAM, maybe:

1. One of the RAM sticks is damaged
2. They just can't work together

I have played with memtest on one PC. It was hard to detect the damaged RAM: the 2 sticks could work separately, but not together.

You misunderstood my post. I don't have damaged RAM. I don't have a problem combining RAM. I don't need advice regarding RAM.

I have found that the ZFS checksum error only happens when 4 RAM modules are installed. I have tried MANY memory modules. It doesn't matter what kind, speed, or size: 4 identical modules as well as differing pairs cause the error, on two different motherboards.

But if I take two out, there is no error. I wanted to ask you to repeat this experiment in your setup and report your findings: does your system also NOT produce the checksum errors when only 2 memory modules are installed? Please test if you can.
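
For reference, a minimal sketch of how to check for the errors after each DIMM change (the pool name rpool is just an example here; substitute your own):

Code:
# reset the error counters left over from earlier runs (pool name is an example)
root@proxmox:~# zpool clear rpool

# re-read and verify every block in the pool
root@proxmox:~# zpool scrub rpool

# when the scrub finishes, check the CKSUM column and the error summary
root@proxmox:~# zpool status -v rpool

Run it once with 4 DIMMs installed and once with 2, and compare the CKSUM counts.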
 
"For some architectures, such as x86, Linux will "hide" any node representing a
physical cell that has no memory attached, and reassign any CPUs attached to
that cell to a node representing a cell that does have memory. Thus, on
these architectures, one cannot assume that all CPUs that Linux associates with
a given node will see the same local memory access times and bandwidth.

In addition, for some architectures, again x86 is an example, Linux supports
the emulation of additional nodes. For NUMA emulation, linux will carve up
the existing nodes--or the system memory for non-NUMA platforms--into multiple
nodes. Each emulated node will manage a fraction of the underlying cells'
physical memory. NUMA emulation is useful for testing NUMA kernel and
application features on non-NUMA platforms, and as a sort of memory resource
management mechanism when used together with cpusets.
[see Documentation/cgroups/cpusets.txt]"
https://www.kernel.org/doc/Documentation/vm/numa

By default, Debian, Red Hat, and Ubuntu kernels activate NUMA emulation on non-NUMA platforms.
 
By default, Debian, Red Hat, and Ubuntu kernels activate NUMA emulation on non-NUMA platforms.

Doesn't seem to be the case on Proxmox 4.

Code:
root@proxmox:~# numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7
cpubind: 0
nodebind: 0
membind: 0

root@proxmox:~# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 31666 MB
node 0 free: 28561 MB
node distances:
node   0
  0:  10

root@proxmox:~# grep -i numa /var/log/dmesg

root@proxmox:~# grep -i numa /var/log/messages
Nov 29 21:02:21 proxmox3 kernel: [    0.000000] No NUMA configuration found
Nov 29 21:07:39 proxmox3 kernel: [    0.000000] No NUMA configuration found

root@proxmox:~# cat /sys/devices/system/node/node*/numa*
numa_hit 826920214
numa_miss 0
numa_foreign 0
interleave_hit 32960
local_node 826920214
other_node 0
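
For what it's worth, if anyone wants to test whether forcing NUMA emulation changes the behaviour: x86 kernels built with CONFIG_NUMA_EMU accept a numa=fake= boot option (I have not checked whether the Proxmox kernel is built with that option, so treat this as a sketch):

Code:
# add numa=fake=2 to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub,
# e.g. GRUB_CMDLINE_LINUX_DEFAULT="quiet numa=fake=2"
root@proxmox:~# nano /etc/default/grub
root@proxmox:~# update-grub
root@proxmox:~# reboot

After the reboot, numactl --hardware should report 2 nodes instead of 1.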
 
You misunderstood my post. I don't have damaged RAM. I don't have a problem combining RAM. I don't need advice regarding RAM.

I have found that the ZFS checksum error only happens when 4 RAM modules are installed. I have tried MANY memory modules. It doesn't matter what kind, speed, or size: 4 identical modules as well as differing pairs cause the error, on two different motherboards.

But if I take two out, there is no error. I wanted to ask you to repeat this experiment in your setup and report your findings: does your system also NOT produce the checksum errors when only 2 memory modules are installed? Please test if you can.

My understanding is that you are blaming ZFS because it works with 2 RAM sticks, but not with four. So the constant is ZFS and the variables are RAM sticks and motherboard.
Is that correct?
 
My understanding is that you are blaming ZFS because it works with 2 RAM sticks, but not with four. So the constant is ZFS and the variables are RAM sticks and motherboard.
Is that correct?

Blaming would be a strong word, but since I first encountered these errors, I have replaced every single piece of hardware in the machine (apart from the hard drives), and the errors remained. The only connection between the hardware and the errors was the number of DIMMs installed.

Please see this post for more:
https://github.com/zfsonlinux/zfs/issues/3990#issuecomment-161129120

We would need Nemesiz to validate this connection to the number of DIMMs on his home setup.
 
It's pretty easy to decide if ZoL is the issue or not. Install SmartOS (if you still want virtualization) or any other Illumos-based distribution (e.g. OmniOS).
Restore the data and play with it.

Please note that SmartOS disables C-States, so, if it works, it may be a lead (did you try that on your current setup?).
 
I did some tests. My desktop PC uses an ADATA 16GB USB stick, and I ran a ZFS scrub with 4 RAM sticks (6 GB). No problem. After removing 2 RAM sticks, the ZFS scrub again showed no errors. And then strange things started to happen.

Funny story.

After I put the RAM sticks back into the PC, programs started to segfault. Let's do a memtest. Running, running... at around the 4000 MB point, a red line. Did the RAM stick become broken? Removing the second stick and running the test again. Looks OK. Let's test every stick separately. None of them is broken. Putting them all back into the PC. Memtest... $^^#@! - Qbert inside the PC

qbert_in_ram.jpg

I think I shorted something. Now it's 3 AM, let's go to bed.

Morning. Dismantling the PC. A good time to clean out the dust. Continuing the RAM test without the graphics adapter and with the motherboard outside the case. Looks like the RAM works normally again.

ram_test_ok.jpg

Putting everything back into the case and enjoying a working PC.

P.S. My collection of RAM :)

ram_sticks.jpg
 
It's pretty easy to decide if ZoL is the issue or not. Install SmartOS (if you still want virtualization) or any other Illumos-based distribution (e.g. OmniOS).
Restore the data and play with it.

Please note that SmartOS disables C-States, so, if it works, it may be a lead (did you try that on your current setup?).

The problem with testing a different distribution is twofold:
- it introduces too many variables (different version of ZoL, different kernel & drivers, etc.)
- it defeats the purpose of debugging an issue in Proxmox (also, I have to use Proxmox, not something else)

However, I have tried your idea regarding disabling C-states: I put the intel_idle.max_cstate=0 kernel option into GRUB, and verified with i7z that the CPU did not go below C1 at all. Unfortunately, the checksum errors still appear.
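
For anyone repeating this without i7z, the kernel's cpuidle sysfs interface offers a quick sanity check of which idle driver and C-states are in use (a rough check, not a measurement like i7z):

Code:
# which cpuidle driver is active; after intel_idle.max_cstate=0 the
# intel_idle driver should no longer be in charge
root@proxmox:~# cat /sys/devices/system/cpu/cpuidle/current_driver

# the idle states the driver exposes for CPU 0
root@proxmox:~# cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name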
 