Patch x570/Ryzen EDAC support into pve 6.1?

@aaron: Actually, hearsay might not be the best source:

I have gotten an official statement from AMD themselves (rather than from some site that may or may not truthfully depict ASRock Rack, which is possibly deploying a smokescreen) stating that AM4 should support both ECC and ECC error reporting.
And that includes my setup.
And now I have another user in a forum with a contradicting statement ;)

Thanks for investigating this further!
 
And now I have another user in a forum with a contradicting statement ;)

Thanks for investigating this further!

I am not sure what I am contradicting. I think that only statements from AMD can be used as a baseline. All other sources should be treated with skeptical respect.

So if there is a post in this thread stating that AMD says it is not supported then I will apologize for missing that.

kind regards
 
I am not sure what I am contradicting. I think that only statements from AMD can be used as a baseline. All other sources should be treated with skeptical respect.

So if there is a post in this thread stating that AMD says it is not supported then I will apologize for missing that.

kind regards
It was a bit tongue in cheek after your comment regarding "some site which could also be a smokescreen". The thread that I linked to contained the info a person got from ASRock Rack support. I have to assume that this person had no intention of lying about it. The same goes for your information: I have to believe that you have no bad intentions and that the information you provide is genuine. Because, as we all know, on the internet no one knows you're a dog ;)

Sincerely thank you for further investigating this :)
 
FWIW, there's an opt-in 5.4 kernel available to test. I'd guess that it's at least worth a try:
Code:
apt update
apt install pve-kernel-5.4

(You can always boot older kernels if you run into more trouble)
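
After rebooting, it may be worth confirming that the machine actually came up on the 5.4 kernel before looking for EDAC messages. A minimal sketch follows; the helper name `kernel_at_least` is made up for this example, and it relies on `sort -V` from GNU coreutils:

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if kernel release $1 (e.g. "5.4.24-1-pve")
# is at least the major.minor version $2 (e.g. "5.4").
kernel_at_least() {
    release_mm=$(printf '%s\n' "$1" | cut -d. -f1,2)
    # sort -V orders version strings; the smaller one ends up first.
    lowest=$(printf '%s\n%s\n' "$2" "$release_mm" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

if kernel_at_least "$(uname -r)" "5.4"; then
    echo "running kernel $(uname -r) is 5.4 or newer"
else
    echo "still on $(uname -r); select the 5.4 entry in the boot menu"
fi
```

If the second message shows up, the new kernel is installed but an older one was booted.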
 
I haven't had time for more testing yet, the next step would be finding a linux distro that should support ECC on Ryzen and boot from that. I'll see if I can give it a go this weekend.
OK, so I will have another go at this, this time avoiding ASRock.
Mobo: Biostar x570gta
CPU: AMD Ryzen 9 3950x

@EricD Can you please advise me on what you would have done if you had found the time? For example, which OS should I try, and which command should I run on it?

Kind regards
 
The mobo does not seem to be the cause.

It is more likely the AMD hardware on it.

Just booted up:

Mobo: Biostar x570gta
CPU: AMD Ryzen 9 3950x
Mem: CT16G4WFD8266

and I get the same errors as I reported earlier.

Is anyone still interested in this thread? I can for a short while longer provide detailed info before I have to send the hardware back.
 
Not yet, sorry about that. What held me back is that I was not sure whether that was a supported scenario or not.
But you're right, I am troubleshooting now and am willing to go the distance if I have enough time.

Regarding time: I can hopefully try today. The implications of the pandemic around us have slowed me down to near 0. Not physically, but taking care of my family and my regular work is proving hard.

Anyway I am happy to learn you guys are still interested.

BTW, is there an official Proxmox stance on AMD?

My other post regarding passthrough using AMD onboard components is receiving no love ;(
https://forum.proxmox.com/threads/pci-passthrough-onboard-sound.66342/#post-298038
I would settle for a statement along the lines of: the topic is so low on the priority list that it may well never get an answer from the Proxmox team. That would help me and future users a lot as well.
 
Not yet, sorry about that. What held me back is that I was not sure whether that was a supported scenario or not.

If I thought that there were some problems with the 5.4 kernel, I'd have mentioned them.

BTW, is there an official Proxmox stance on AMD?

No, we have no official stance. We have multiple machines here, both AMD and Intel, and try them all.
We also have people testing pass-through quite a bit, currently on the previous generations of the zen3 arch.
The general issue with all of those HW vendors is that they often get their HW support into the kernel a bit late, if at all.
It has gotten better but is still an issue: if you buy HW whose underlying technology is a new generation that has not been out for a while, you will run into issues more often. If you want to address that, complain to the HW vendors and vote with your wallet.
Pass-through is additionally more fragile as it's more complicated; some HW/firmware combinations make it work better, some worse. If you search the forum here you will find quite a few success stories but also quite a few headache threads. It's hit or miss.

A stance is of no use if one does not test the newer kernel we propose.
 
and then there is light at the end of the tunnel:

mobo: biostar x570gta
cpu: amd ryzen 9 3950x
memory: 1 x Crucial 16GB ECC (CT16G4WFD8266)

root@pve:~# pveversion
pve-manager/6.1-8/806edfe1 (running kernel: 5.4.24-1-pve)
root@pve:~# dmesg | grep EDAC
[ 0.351030] EDAC MC: Ver: 3.0.0
[ 12.198200] EDAC amd64: Node 0: DRAM ECC enabled.
[ 12.198201] EDAC amd64: F17h_M70h detected (node 0).
[ 12.198245] EDAC MC: UMC0 chip selects:
[ 12.198246] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 12.198246] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 12.198249] EDAC MC: UMC1 chip selects:
[ 12.198249] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 12.198250] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 12.198250] EDAC amd64: using x16 syndromes.
[ 12.198250] EDAC amd64: MCT channel count: 1
[ 12.198278] EDAC MC0: Giving out device to module amd64_edac controller F17h_M70h: DEV 0000:00:18.3 (INTERRUPT)
[ 12.198283] EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.0 (POLLED)
[ 12.198283] AMD64 EDAC driver v3.5.0

Does this output suggest working ECC and working ECC error reporting?

Anyway, I am going to put in the remaining 3 modules now and report back.
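
For anyone comparing their own boot log against the output above, the per-node ECC line can be pulled out with a small filter. This is only a sketch: the function name `edac_ecc_status` is made up, and it assumes the amd64_edac message keeps the "Node N: DRAM ECC enabled." wording shown in this log (with "disabled" in the other case).

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log text on stdin and prints one line
# per memory node, e.g. "node 0: enabled", based on amd64_edac messages.
edac_ecc_status() {
    sed -n 's/.*EDAC amd64: Node \([0-9][0-9]*\): DRAM ECC \([a-z]*\)\..*/node \1: \2/p'
}

# Typical use on a live system (may need root):
#   dmesg | edac_ecc_status
```

An "enabled" line only says the memory controller has ECC switched on; whether errors are actually reported is a separate question that needs a real or injected error to answer.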
 
sweet, still no EDAC issues reported after inserting the rest of the modules.


mobo: biostar x570gta
cpu: amd ryzen 9 3950x
memory: 4 x Crucial 16GB ECC (CT16G4WFD8266) (64GB total)

The only thing I still can't get my head around is why the different tools:
dmidecode (Linux)
wmic (Windows)

all report an incorrect total data width (124 instead of 144).
It was the same with the previous memory, which I have already sent back.
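
To make the width comparison easier to repeat, the two relevant fields can be extracted from the `dmidecode -t memory` output. A rough sketch, with a made-up function name `dimm_widths`; it assumes the usual "Total Width: N bits" and "Data Width: N bits" lines and skips slots that report "Unknown":

```shell
#!/bin/sh
# Hypothetical helper: reads `dmidecode -t memory` text on stdin and prints
# "total data check_bits" for each populated memory device, one per line.
dimm_widths() {
    awk -F': *' '
        /^[[:space:]]*Total Width:/ { total = $2 + 0 }
        /^[[:space:]]*Data Width:/  {
            # "Unknown" converts to 0, so empty slots are skipped.
            if (total > 0) print total, $2 + 0, total - ($2 + 0)
            total = 0
        }
    '
}

# Typical use on a live system (needs root):
#   dmidecode -t memory | dimm_widths
```

In SMBIOS the total width is meant to include any error-correction bits, so a non-zero third column suggests that the firmware describes the module as having check bits. The values come from the BIOS tables, not from the kernel, so an odd number here points at the firmware.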
 
Is there anything I can help the Proxmox team with regarding any of these issues while I still can? I may well send this setup back because of the total width reporting. So Proxmox is doing great on this issue, but the BIOS developers for both the Biostar x570gta and the ASRock X570 Creator might have to step up.
 
My humble apologies for the way I behaved.

I am happy to have been able to confirm that the Proxmox team's suggestion works in my scenario. @ErikD, if you are still around: just try it. It might help you too, even in a production environment.

However, I am going to send all the components back because of the ECC total width issue.

Unless I am mistaken, a total width of 124 should not be reported for a 16GB ECC unbuffered module. I am expecting a total width of 144.

Please, everyone, let me be clear: this has nothing to do with the Proxmox software. This is a hardware issue that I will continue to dig into further.
 
I am happy to report that I was able to assess that ECC error reporting (and also correction (but this is not something the software is responsible for)) is working.

The method I used was to take the inner wires of some electrical cable and stick them into a memory slot holding an 8GB ECC UDIMM (bought as a potential sacrifice to make sure things are working). The potential loss of hardware was well worth the investment to aid in this quest.

Success on the first try. I saw errors being corrected using Proxmox 6.1 with a 5.4 kernel and also memtest86 pro 8.4 rc 2 build 1001.
The board used was an ASRock Rack X470D4U (BIOS 3.30, Platform First Error Handling = disabled)

[Attached image: 20200419_185745 downsized.jpg]
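
For reference, corrected and uncorrected error counts can also be read from the EDAC sysfs counters instead of scanning the kernel log. A minimal sketch, assuming the standard `/sys/devices/system/edac/mc` layout; the function name `edac_counts` is made up for this example:

```shell
#!/bin/sh
# Hypothetical helper: prints corrected (ce) and uncorrected (ue) error
# counts per memory controller. $1 is the EDAC sysfs directory and
# defaults to the usual kernel location.
edac_counts() {
    base=${1:-/sys/devices/system/edac/mc}
    for mc in "$base"/mc*; do
        # Skip when no controller is registered (glob did not match).
        [ -r "$mc/ce_count" ] || continue
        echo "$(basename "$mc"): ce=$(cat "$mc/ce_count") ue=$(cat "$mc/ue_count")"
    done
}

edac_counts
```

A `ce` value that rises while a fault is being provoked would match what is described above; no output at all means no EDAC memory controller is registered.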
 
I am happy to report that I was able to assess that ECC error reporting (and also correction (but this is not something the software is responsible for)) is working.

Thank you very much for your test. You could even have broken the CPU.

That's good news. Do you mind sharing it on level1tech?
 
Thank you, MasterPhi. Can you please explain more about the breaking of the CPU?

I don't mind sharing on level1tech. You are free to do so. Just let me know where you did it so I can read the responses.

Or would you like me to share on level1tech? If so, can you please advise on which thread or forum section?
 
