[SOLVED] segfaults on Proxmox 6.2 VE with AMD Ryzen

May 16, 2020
Antwerp, Belgium
commandline.be
Dear,

I am new to Proxmox VE 6.2 and the experience has not been great, at all. This system is an AMD Ryzen 1700X, which I typically have zero issues with. For some reason I get nothing but issues when trying out virtualisation software (Proxmox, XCP-ng).

With Proxmox I have seen the strangest behaviour ever for a Linux system: from packages not installing correctly during installation, to uploads suddenly not working (after a reinstall), to VMs not shutting down.

I am at the point where I have downgraded the BIOS (since it was running a beta BIOS) and reinstalled Proxmox. Now it is worse.

apt upgrade resulted in an upgrade of apt and libc. Now the system throws segfaults on apt and the entire system slows to a crawl.

Not sure what to do next, really. I am at a loss to explain why this is so hard and riddled with bugs. Both Debian and QEMU work fine when I use them manually.

= = = segfaults = = =

apt download net-tools (because regular apt install not working due to broken package)
dpkg -i net-tools

sudden notification of show_signal_msg: 6 callbacks suppressed
perl[3858]: segfault at 55730bde788e ip 000055730aa39605 sp 00007ffe9d3911a8 error 6 in perl[55730a96a000+15d000]
followed by a pveproxy worker segfault repeating three times

repeat of dpkg -i net-tools does not repeat segfault

systemctl restart pveproxy
causes a dump of BUG: Bad page map in process pveproxy worker pte:80000007fc874867 pmd:786c78067
followed by a long listing of debugging information
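
For anyone reading a line like the perl one above: the fields can be decoded mechanically. The sketch below is only an illustration, not a diagnosis. It assumes the usual x86 kernel message layout (hex fields, "error" being the page-fault error code in hex) and the standard x86 meaning of the low error-code bits. The function name decode_segfault is made up for this post.

```python
import re

# Layout of the x86 kernel message for a user-space segfault, e.g.
# "perl[3858]: segfault at ADDR ip IP sp SP error CODE in OBJ[BASE+SIZE]"
SEGFAULT_RE = re.compile(
    r"(?P<comm>[^\[\s]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_segfault(line):
    """Return a dict describing a kernel segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)  # the kernel prints the error code in hex
    info = {
        "process": m["comm"],
        "pid": int(m["pid"]),
        "fault_address": int(m["addr"], 16),
        # x86 page-fault error code bits (assumed standard meaning):
        "page_present": bool(err & 0x1),       # 0 = page was not present
        "access": "write" if err & 0x2 else "read",
        "mode": "user" if err & 0x4 else "kernel",
        "instruction_fetch": bool(err & 0x10),
    }
    if m["base"] is not None:
        info["object"] = m["obj"]
        # Offset of the faulting instruction inside the mapped object.
        info["offset_in_object"] = int(m["ip"], 16) - int(m["base"], 16)
    return info
```

Fed the perl line from this post, it reports a user-mode write to a page that was not present, at offset 0xcf605 inside the perl binary. That describes what the faulting instruction did, not why, so it does not by itself separate a software bug from bad RAM.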
 
That's strange.

Proxmox is really Debian 10, but the kernel is the Ubuntu 5.4 kernel (plus maybe ZFS, if you use it).

Can you try with the latest Ubuntu to see if it's a kernel bug?

Do you have any logs (kern.log, dmesg)?
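
If the logs are long, a small filter helps pull out the interesting lines before posting them. This is a minimal sketch that only looks for the two message types mentioned in this thread ("segfault at" and "BUG: Bad page map"). The function name find_kernel_faults is hypothetical, and other kernel messages may be just as relevant.

```python
import re

# Only the two message types reported in this thread; extend as needed.
FAULT_PATTERNS = re.compile(r"segfault at|BUG: Bad page map")


def find_kernel_faults(lines):
    """Return the log lines that mention a segfault or a bad page map."""
    return [line.rstrip("\n") for line in lines if FAULT_PATTERNS.search(line)]
```

It takes any iterable of lines, so an open file handle on a saved copy of the dmesg output or kern.log works as input.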
 
I can confirm that we also have issues with nested virtualization and recent AMD CPUs. Some VMs randomly get stuck (kernel panic) since we moved from Intel to AMD. And yes, we shut down and restarted all VMs because of the physical CPU architecture change.
 
FYI: this system is an AMD 1700X with an X370 chipset running the latest motherboard BIOS. The system has 3 dedicated SSDs with ZFS RAID-0 on them. I have used this system with various Linux distros in the past without many issues.

I cannot pinpoint the issue as specific to Proxmox. The Proxmox install appears 'alive', as issues appear almost randomly and differ between installations. Even during installation there were notable issues with packages not installing.

Sorry to say, I do not consider this release of Proxmox stable at this point. Segfaults typically occur because of bugs and compile-time parameters. There appears to be a real pandemic of AMD+Linux issues. I tried XCP-ng, which simply will not boot on an AMD-based system.
 
There have been quite a few reports of faulty or incompatible memory on such AMD Ryzen boxes recently.

One of my workstations was also affected, and changing the modules fixed it.
 
Just a shot in the dark, are you sure the ISO is ok and not corrupted (sha256sum -c ...)?
I'm going to build a Ryzen 5 3600 soon, really hope there are no issues (and/or RAM compatibility problems)!
My current homelab Proxmox server has worked flawlessly for years on an AMD FX-6300 Six-Core Processor.
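
For the ISO check, sha256sum -c is the simplest route. Where that tool is not at hand (e.g. on the machine used to write the USB stick), the same comparison can be sketched in Python. The function names below are hypothetical. The expected hash has to be the one published for the ISO and is not filled in here.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so a large ISO never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iso_matches(path, expected_hex):
    """Compare the file's SHA-256 against the published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A mismatch means the download or the copy is damaged. A match only rules out a corrupt image, not RAM or BIOS problems on the target machine.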
 
This machine is an MSI X370 with 32GB of AMD-specific Flare-X memory. I have not had any issues before that I could not trace back. If I were new to Linux or computing I'd surrender to the arguments here. For now, I do not. Segfaults are almost always caused by software, especially when the segfault recurs in specific applications such as apt and pveproxy.

Over the past years I have noticed a serious decline in open source software usability in terms of platform compatibility and resilience, typically because of upstream 'decisions'. I hope this is not the root cause once again. Look around and you will find many issues with distros, services and applications not supporting AMD CPUs and/or GPUs. There have been better times, to be honest. I hope to start providing some support if I find time to work through this.

Unless AMD CPUs are rigged with some hardware backdoor, I do not think these issues are hardware related at all; they are more likely caused by compilation flags and optimizations.

I will check by removing the extra 16GB I added two weeks ago.
 
I've also found something about BIOS settings that could be worth a shot (not specific to Proxmox, but it seems important):
https://forums.unraid.net/bug-reports/prereleases/670-rc1-system-hard-lock-r354/?tab=comments
and by googling around you can probably find even more.
On my old motherboard I always disabled e.g. the C-state stuff, did you too?
For example, on a Supermicro I had something like this:
Code:
Advanced → Advanced Power Management Config
Power technology: Custom
Energy performance tuning: Disabled
    BIAS setting: Power
Efficient turbo: Enable
CPU P state control: Disable
CPU HWPM state control:    → Enable CPU HWPM: Disable
                → Enable CPU Autonomous Cstate: Disable
CPU C state control:        → Package C state limit: C6 (Retention state)
                → CPU C3 report: Disable
                → CPU C6 report: Disable
                → Enhanced Halt State (C1E): Disable
T State → ACPI T-States: Enabled, Throttling by OS
 

To be honest, I am a bit red in the cheeks writing this. As I struggled, I slowly recalled that I had had such an experience before, and it was the BIOS settings and nothing else.

So yes, I am aware of this, kind of. I currently boot with a similar configuration. The system appears to run stable now. Here are the settings I think are relevant. I share them as they were the values I last changed in the MSI BIOS.

boot mode select: legacy+uefi
ErP ready: enabled
Core performance boost: disabled
> CPU Features (here typically set to enabled by me instead of auto, just works, better)
Global C-state control: disabled
Power Supply Idle control: Auto
P-State Adjustment : Pstate 0
Precision Boost Overdrive: AMD Default
Mode0: enabled


Things to know: I abandoned Proxmox as the experience drove me crazy. However, similar segfault issues appeared with XCP-ng, Arch Linux and Devuan. I consider this more or less solved and will be having a second look at Proxmox now that I have Arch Linux working reliably. I expect Proxmox to now run and install flawlessly.
 
So, I worked with the AMD guys a while ago when this was a new issue with a newly released AGESA version...

The big thing that can cause instability is if your ATX power supply goes flakey when there is a small load on one of the supply rails. The BIOS option for "Power Supply Idle control" is to compensate for this by changing some of the power source options...

If your PSU is unstable, changing "Power Supply Idle control" from AUTO to "Typical Idle current" will cause the VRMs to take more power from one of the rails that normally becomes flakey under low power consumption.

With this option set, you can safely enable C-states and the performance boosts. Disabling those was incorrectly credited with fixing this problem.

My power supply is one of the ones that goes flaky in this scenario. I can't remember if it was the 12V or the 5V rail that caused the problem, but once the option was set, I no longer got hard resets / hangs when using my 1700X and AB350 chipset mainboard.
 
Worth noting: I recently upgraded this machine to 32GB. Running memtest86 showed plenty of RAM errors. Next I found out how to get rid of them.

Reseat the new DDR4 DIMMs from 2020 on the same channel, and the 2017 ones on the other channel. Reset the BIOS to defaults. And probably most importantly, set the XMP profile to 2900MHz instead of 3200MHz. ZERO ERRORS since. Time to buy an extra fan to point at the memory banks.
 
That'd do it as well... Just make sure you keep them in pairs and keep the timing / speeds matched to the slowest ones in the group.

I have a Ryzen 2700X that I use for my desktop. It has 32GB of 3200MHz RAM and seems to have issues if I set it to 3200MHz. Setting it to 3133MHz seems to be fine...
 
Truly perplexed that it was that much of an issue. It slowly dawned on me that I had similar issues with 2x8GB, which went away after a BIOS upgrade and an extra fan.

Now I ponder what I'll do. Motherboard upgrade to X570, newer CPU (4xxxx ?), 32GB should be fine I guess, right :D
 
