Proxmox 7.0.2 installation freeze on Dell R7525

gera9010

New Member
Oct 19, 2021
Hi,

I can't launch the installation of Proxmox 7.0.2 on
a Dell R7525 server with the latest generation of AMD Epyc CPUs
(but it works without problems with CentOS 8).

The system freezes at launch with the message
`Loading initial RAM-Disk`

If I set the BIOS to Legacy instead of UEFI I get a black screen.

I tried to apply what is suggested at the end of this thread,
but without success.
https://forum.proxmox.com/threads/p...ot-fails-on-dell-r6525-epyc-7543-milan.89499/

Has anyone had the same installation problems
with Proxmox 7.0.2 and found the cause?

Thank You.
 
hmm - I would try to download the ISO again (and verify its checksum), and put it on a different USB drive (or, if you're installing via iDRAC, try booting from a physical USB drive instead).
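
For the checksum step, a minimal shell sketch could look like this. It assumes GNU coreutils' sha256sum is available and that the expected SHA-256 digest is copied by hand from the official download page; the function name verify_iso and the file name in the usage comment are placeholders, not something from this thread.

```shell
# verify_iso FILE EXPECTED_SHA256
# Returns 0 when the SHA-256 digest of FILE matches EXPECTED_SHA256.
verify_iso() {
    iso="$1"
    # normalise the expected digest to lowercase, as sha256sum prints lowercase hex
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    # sha256sum prints "<digest>  <filename>"; keep only the digest
    actual=$(sha256sum "$iso" | awk '{print $1}')
    [ "$actual" = "$expected" ]
}

# Example call (file name and digest are placeholders):
#   verify_iso proxmox-ve.iso "<digest from the download page>" && echo "checksum OK"
```

A mismatch only says the download or the copy is damaged; a match does not rule out a bad USB stick, which is why trying a different drive is still worthwhile.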

Additionally, make sure you have the latest firmware installed for all components of the system.

If this does not help - start the ISO - go to the Debug mode entry - hit 'e' to edit the grub commandline - and remove 'quiet' from there - this should cause at least some messages to appear on the console.

If we don't find the cause - the following alternatives are also possible:
* try installing PVE 6.4 and then upgrade to 7.0 (with a fresh system this should be done in 10-15 minutes)
* you can also install PVE on top of debian bullseye: https://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_11_Bullseye

I hope this helps!
 
Same issue here, but you're wrong, the server doesn't freeze: the installer process seems to keep running, just without video output (not on iDRAC either), so you can't go any further and complete the installation.
We have many Dell servers here, R440s and this new R7525; both have a Matrox G200 VGA card. There is no issue on the R440s, we're running Proxmox VE 6 and 7 on them.
I cloned a Proxmox 7 install from one of our R440s and copied it to our R7525 and... still no video after the kernel and initrd are loaded (kernel start), but Proxmox VE is running well. The web interface and SSH are accessible.
I've tried many things from grub, with different gfxpayload values, still no video. With gfxpayload=text you can see the kernel's startup output for a few seconds, but Proxmox quickly changes the console resolution and the video disappears again.
I've tried to unload and reload the mgag200 driver module on the R7525 running Proxmox 7, but I get this error logged:
# rmmod mgag200
# modprobe mgag200 modeset=0
modprobe: ERROR: could not insert 'mgag200': Invalid argument
# modprobe mgag200 modeset=1
seems OK, but:
[ 7269.867716] mgag200 0000:62:00.0: vgaarb: deactivate vga console
[ 7269.867876] mgag200 0000:62:00.0: [drm] *ERROR* can't reserve VRAM

I've tried different Linux installers on this R7525 server:
-CentOS8 (kernel 4.18.0-348) -> OK
-Debian11 (kernel 5.10.0-10) -> NOK
-SysRescueCD8.0.3 (kernel 5.10.34-1) -> OK
-Ubuntu 20.04 (kernel 5.11.0-27) -> NOK
-Ubuntu 21.10 (kernel 5.13.0-19) -> NOK
-Proxmox7.1 (kernel 5.13.19-2) -> NOK
-CentOS9 (kernel 5.14.0-43) -> OK
-SysRescueCD9.0.0 (kernel 5.15.14-1) -> NOK

And finally, we tested starting Proxmox 7 with the SysRescueCD 8.0.3 kernel/initrd and it works on this server... with video output!!!

So there's a (hardware?) problem with mostly Debian/Ubuntu kernels.
A support ticket was opened today and I'm waiting for news from Dell.
Maybe Proxmox staff can help with this case?
 
@Stoiko Ivanov :
-Editing the grub kernel parameters (removing quiet) does not help because video output stops as soon as the kernel starts.
-I've tried Proxmox 6 too and had the same issue: no video as soon as the kernel is running.

Regards,

Herve

 
A support ticket was opened today and I'm waiting for news from Dell.
Maybe Proxmox staff can help with this case?
Not really, I'm afraid. We know of systems using the mgag200 module that work just fine, so this seems a bit more specific to the vendor. Note that this may potentially be related to userspace (i.e., X11 or Mesa), not directly to the kernel module.

It may be worth checking whether there are BIOS and firmware updates for that system.

Next Proxmox 7.2 (kernel 5.15) doesn't solve the issue, still have no video on DELL R7525.
How did you test that? We do not have any installer available with that, especially as 7.2 is far from being released yet.

@Stoiko Ivanov :
-Installing Proxmox on top of a Debian 11 Bullseye install is not possible because Debian 11 has the same issue.
You can use the text-based installer mode (i.e., not the graphical mode) for the Debian installation; that should need no graphics...
 
@t.lamprecht :

Hi Thomas,

Issue specific to the vendor... Not really: how do you explain that some kernels (CentOS, SysRescueCD) run just fine? The problem appears as soon as the kernel starts; there is no kernel output even after removing the "quiet" kernel option, so we are far, far away from the X11 layer (which Proxmox VE doesn't use, as it stays in text mode)! The problem is inside the kernel or in the mgag200.ko module.
The latest BIOS update has been applied, without success.
How did I test Proxmox 7.2's kernel? Remember your post: https://forum.proxmox.com/threads/opt-in-linux-kernel-5-15-for-proxmox-ve-7-x-available.100936/
Installing Proxmox VE or Debian 11 fails as they use the same affected kernel (for the installer or for the installed OS): no video output once the kernel starts (grub is fine, but there is no more output after "loading kernel and initramfs", even when choosing the text mode install). The only way I succeeded in installing PVE was by cloning another running PVE server, changing its IP and hostname, doing some cleanup and hardware config adjustments, and rebooting. There is still no video output, but Proxmox VE runs fine through the web interface.
Our other Dell servers (R440) have the same Matrox G200 video card and have no problem with the PVE or Debian installer.
 
Here is how to use SysRescueCD's kernel instead of Proxmox VE's and recover display output:

-Plug the SysRescueCD v8.0.3 USB key into the R7525 server (UEFI mode, Secure Boot disabled) and boot from it (press F11 to select the boot device).
-At the grub stage, press "e" to edit the boot script and add "init=/bin/sh" at the end of the kernel line, so that SysRescueCD itself does not boot and you get a root shell inside a working kernel instead.
-Mount the PVE root LVM volume on /mnt using "mount /dev/mapper/pve-root /mnt" (this takes some time as it discovers the LVM volumes).
-chroot into the Proxmox VE volume and exec systemd using "exec chroot /mnt /lib/systemd/systemd" and VOILA! You're running Proxmox VE with another, working kernel, and video is now working :)
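
The last two steps can be sketched as a small shell function. The two commands and the default paths are the ones quoted in the steps above; the function name and the DRY_RUN switch are only illustrative additions so the sequence can be previewed without touching anything.

```shell
# boot_pve_with_rescue_kernel [ROOT_DEVICE] [MOUNTPOINT]
# With DRY_RUN=1 (the default) the commands are only printed.
# With DRY_RUN=0 they are executed; this must be run as root from the
# init=/bin/sh shell, because "exec" replaces that shell with systemd.
boot_pve_with_rescue_kernel() {
    root_dev="${1:-/dev/mapper/pve-root}"
    mnt="${2:-/mnt}"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "mount $root_dev $mnt"
        echo "exec chroot $mnt /lib/systemd/systemd"
    else
        mount "$root_dev" "$mnt" || return 1
        # the init=/bin/sh shell is PID 1, so exec hands PID 1 over to systemd
        exec chroot "$mnt" /lib/systemd/systemd
    fi
}
```

The exec matters: systemd expects to run as PID 1, and the shell started with init=/bin/sh is PID 1. If /dev/mapper/pve-root does not show up, activating the volume group first with "vgchange -ay" might be needed; that is an assumption, the steps above did not require it.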

Thomas, as you can see, the only thing that changed is the kernel used: no hardware change, no special config, just the kernel...
 
issue specific to vendor... Not really, how can you explain that some kernels (CentOS, SysRescueCD) are running just fine ?
To clarify: the issue is specific to a kernel and/or userspace version in combination with the vendor. That can mean it works by luck with other versions. As said, it's hard to tell, but as we have working mgag200 systems we can assert that this is not a general issue.

The problem is inside the kernel or in mgag200.ko module.
How can you assert that if that kernel and also that exact module works just fine with other such HW?
I know that we have a 5.15 kernel available as opt-in for already working systems ;) That still does not explain how you tested it if you could not install PVE... Also, no PVE 7.2 has been released in any form yet, so one just cannot have tested it at the time of writing. That's why I asked what you actually tested, and how.

that ProxmoxVE doesn't use as it stays in text mode
Now I'm confused again by this post: is there some text, i.e., do you see anything at all after the grub step, or nothing?

Installing ProxmoxVE or Debian 11 fails as they use the same affected kernel (for the installer or for the installed OS),
They do not use the same kernel, Debian 11 uses the 5.10 based one, PVE 7.1 ISO the 5.13 one, different packaging too.
But yes, they may have the same issue.
The only way I succeed installing PVE is by cloning another running PVE server, change its IP, hostname, some cleaning and hardware config adjustments and reboot.
Ah, ok, that'd probably answer the question of how you tested the 5.15 kernel :)

Could you see the Proxmox VE installer acquiring a DHCP lease on your router/DHCP server even though the screen stays dark?
Asking to see if the installer itself actually boots up fine.
 
I think that the PVE installer continues to run after the video freezes. If I remember correctly, the PVE installer asks which network card to use (there are 4 network cards in that server), so it may not send a DHCP request before that, and I cannot see any screen to select a NIC. One thing that makes me believe the kernel is alive is that the Num Lock and Caps Lock LEDs are functional; I'm able to toggle them. CTRL-ALT-DEL works too and reboots the server. If the kernel had crashed, the system would be frozen, no LEDs would toggle, and CTRL-ALT-DEL would not work.

Also, the PVE installer's kernel and the installed PVE kernel are the same and both have the same issue. Cloning a PVE install from a Dell R440 to a Dell R7525 shows that PVE v7 runs fine on that R7525: the web interface and SSH are running and the installation is fully functional, but... without video (no console, no login prompt, nothing after the last message "loading kernel and initramfs", which is where the kernel starts)!
I'm convinced that there is a hardware difference between the Matrox G200 in the R440 (working) and in this R7525 (not working); OK, we agree! But some kernels can cope with that difference and some can't!? Debian's kernels (and so Proxmox and Ubuntu) have issues with this difference; CentOS (maybe Red Hat) doesn't. That's a fact.
Proxmox compiles its own kernel, based on Debian's kernel config, so the video issue with the Dell R7525 is present.
If your kernel were based on Red Hat's kernel (same kernel config) or SysRescueCD v8's (as I've tested), there would be no video issue. Proxmox v7 run with SysRescueCD's 5.10 kernel works fine without any video issue.

I'm afraid it will be a nightmare to convince Proxmox and/or Dell to react on this case... Yes, it's surely a hardware "problem" or "difference", but that problem has already been solved in some Linux OSes (CentOS / Red Hat for example, in their old 4.18 or new 5.14 kernels).

Waiting for Dell support now...
 
@t.lamprecht :
@gera9010 :

Hi everyone,

Dell asked me to do a live test on our R7525 so they could see the issue for themselves, and they saw it (no video in the Ubuntu Server LTS 20.04 installer), so... they are now looking for an R7525 in their labs to reproduce the issue again... time is passing and no one is seriously working on this issue.
They even told me that our R7525 was not bought with a pre-installed OS, so they are in best-effort mode, bla bla bla...

So, as always, you have to do the work yourself!

OK, with the cloned Proxmox VE install on the R7525, dmesg shows a lot of memory address allocation errors for PCIe devices at kernel start. That's probably why the graphics card stops working. One interesting kernel message suggested using pci=realloc=off in case of problems...
So let's restart with this kernel parameter and... it works!!! No more video issue; there are still memory range allocation errors, but fewer of them.
OK, here's the clue to the issue.
I did many BIOS config changes + reboots around this, more specifically in Config->BIOS parameters->Integrated Devices.
I was searching for a BIOS config that works without specifying the pci=realloc=off kernel parameter.
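
To spot this kind of message on an affected host, the kernel log can be filtered with a small helper. This is only a sketch: the function name is made up, and the BAR patterns are an assumption about how 5.x kernels word their PCI resource allocation failures. Only the "can't reserve VRAM" text is taken from the log posted earlier in this thread.

```shell
# pci_alloc_errors
# Reads kernel log lines on stdin and prints those that hint at PCI
# resource allocation problems (patterns are assumptions, adjust as needed).
pci_alloc_errors() {
    grep -E "BAR [0-9]+: (no space for|failed to assign)|can't reserve VRAM|pci=realloc"
}

# Typical use on the server, as root:
#   dmesg | pci_alloc_errors
# Count the matches to compare a boot with and without pci=realloc=off:
#   dmesg | pci_alloc_errors | grep -c .
```

Comparing the count between two boots gives a rough way to check the "fewer errors" observation without reading the whole log.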

Here are the BIOS parameters related to our issue and their default (factory) values:
-PCIe Preferred IO Bus = Enabled
-PCIe Preferred IO Bus value = 65535 (error, value must be 0-255)
-Enhanced Preferred IO = Enabled
-SR-IOV Global Enable = Disabled

I don't know why it has this bad value of 65535!? Changing it to 0 + reboot or 64 + reboot made no difference, still the video issue.
Setting the PCIe Preferred* options to Disabled + reboot: no change, still the video issue.
When enabling SR-IOV (which is what we want on a virtualization platform anyway), the video issue was partially gone: video was OK on iDRAC but there was no sync on the physical VGA monitor (a Full HD monitor that complains it cannot sync with the current video format)! So I looked for an old 4:3 VGA monitor and bingo, I have video on iDRAC and on the physical VGA display!!

One more test, with these values (up to that point, the PCIe Preferred options were disabled):
-PCIe Preferred IO Bus = Enabled
-PCIe Preferred IO Bus value = 0
-Enhanced Preferred IO = Enabled
-SR-IOV Global Enable = Enabled
I am still able to boot the cloned Proxmox VE on an old VGA monitor, and even with SR-IOV disabled!? WTF!?
Searching again for some magical kernel parameters, I found that with these I was able to get a display on a recent monitor:
-nomodeset
-mgag200.modeset=0

Anyway, it seems that there is something weird around these 4 BIOS parameters on a Dell R7525!!

Last test: install a fresh Proxmox VE 7 from a USB key with these BIOS values and kernel parameters (add them to the kernel line in grub: press 'e' at the grub menu, go to the 'kernel' line, which in GRUB 2 starts with 'linux', and add the 3 parameters at the end of the line):

- PCIe Preferred IO Bus = Enabled
- PCIe Preferred IO Bus value = 0
- Enhanced Preferred IO = Enabled
- SR-IOV Global Enable = Enabled

- pci=realloc=off
- nomodeset
- mgag200.modeset=0

IT WORKS NOW!!! The installer boots up, the install process succeeds, and after the final reboot there are no more issues (and dmesg seems fine)!
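
Parameters typed at the grub prompt only apply to that one boot. If they turn out to be needed on the installed system as well, they have to be made persistent. A minimal sketch for a GRUB-booted Debian/Proxmox system follows; the function name is hypothetical, and it assumes the file has the usual GRUB_CMDLINE_LINUX_DEFAULT="..." line in double quotes.

```shell
# add_kernel_params FILE PARAM...
# Appends each PARAM to the GRUB_CMDLINE_LINUX_DEFAULT="..." line in FILE,
# skipping parameters that are already there. Other lines are left untouched.
add_kernel_params() {
    file="$1"
    shift
    # current value between the double quotes (empty if the line is missing)
    new=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file")
    for p in "$@"; do
        case " $new " in
            *" $p "*) ;;                  # already present, nothing to do
            *) new="${new:+$new }$p" ;;   # append, separated by one space
        esac
    done
    tmp=$(mktemp)
    # '|' is the sed delimiter; the parameters here contain no '|', '&' or '\'
    sed "s|^GRUB_CMDLINE_LINUX_DEFAULT=\".*\"\$|GRUB_CMDLINE_LINUX_DEFAULT=\"$new\"|" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    rm -f "$tmp"
}

# On the installed system (as root), something like:
#   add_kernel_params /etc/default/grub pci=realloc=off nomodeset mgag200.modeset=0
#   update-grub
```

Proxmox VE installs that boot through systemd-boot (for example ZFS root on UEFI) keep the kernel command line in /etc/kernel/cmdline instead and apply it with "proxmox-boot-tool refresh", so this sketch would not apply there as written.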

NB: Dell is still trying to find a R7525 in their labs...
 
To be more precise,

nomodeset and mgag200.modeset=0 are for solving video sync on the physical VGA display. I did a lot of tests + reboots using the iDRAC remote display and only noticed late that the physical video was not working while the iDRAC video was OK thanks to the pci=realloc=off kernel parameter.
So there are two problems on that server: the PCI realloc function that messes up the memory reservation of PCIe devices, and the video sync problem on Full HD monitors.
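
Since the two problems are addressed by different parameters, it can help to check which of them the running kernel was actually booted with. The sketch below reads /proc/cmdline, which Linux provides; the function names are made up for illustration.

```shell
# has_kernel_param PARAM [CMDLINE_FILE]
# Returns 0 if PARAM appears as a whole word on the kernel command line.
has_kernel_param() {
    param="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # one parameter per line, then look for an exact, literal match
    tr ' ' '\n' < "$cmdline_file" | grep -Fxq -- "$param"
}

# report_params [CMDLINE_FILE]
# Prints one line per parameter discussed in this thread.
report_params() {
    for p in pci=realloc=off nomodeset mgag200.modeset=0; do
        if has_kernel_param "$p" "${1:-/proc/cmdline}"; then
            echo "$p: present"
        else
            echo "$p: missing"
        fi
    done
}

# On the server: report_params
```

If iDRAC video works but the physical monitor does not, a "missing" next to nomodeset or mgag200.modeset=0 would match the situation described above.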
 
@hboterman

I recently bought an R7525 for my company and had the exact same problem. No chance of booting a 7.2 or 6.4 USB stick or DVD. With 7.2 I even got a hardware error message in the Lifecycle log (A fatal error was detected on a component at bus 99 device 0 function 0.).

I am so glad and thankful that you figured out this solution and posted it here.

I would like to add that I only had to change those 4 BIOS settings; I changed nothing in grub.

With those settings I was able to get video in iDRAC as well as on the local screen (ATEN KVMP-Switch CL5708M-ATA, 1280x1024 max. resolution).

Everything works fine, the test system has been online for 6 days and I have received no more error messages so far.

 
@StephanS I had the exact same issue with an R7525. After trying Proxmox 7.3, 7.0, 6.4, Ubuntu Server, Debian and all the kernel parameters from above in Proxmox with no luck, I only changed:

PCIe Preferred IO Bus value = 255
to
PCIe Preferred IO Bus value = 0

SR-IOV Global Enable was initially set to Enabled, then switched to Disabled (default value for my servers), and it is still working.

Thanks @StephanS and @hboterman

Regards
Vlad
 
Sorry for reviving this thread more than a year after the last post! I have exactly the same error with my Dell R7525 and I was able to boot most OS installers using the above settings. However, the problem still comes back to haunt me: after the server has been up and running for 1-5 days, it completely freezes with the same error message in the log, "A fatal error was detected on a component at bus 99 device 0 function 0".

The iDRAC video still shows the last displayed image but is frozen. I tried this with Proxmox, VMware 7.xx and XCP-ng; the installer runs but the OS ends up freezing after a while. System BIOS, drivers and iDRAC are up to date.

I was hoping someone has survived through that and could tell me how!

The R7525 is still a current server but I can't find any info on this specific error except this thread!

Thanks in advance.

JF.
 
Hello, sorry for reviving this thread, but we're in the process of buying not one but three of these servers and I'm wondering whether the problem still persists in version 8.2-1.
 
