All broken because I removed my graphics card

Ozarc

New Member
Jun 4, 2022
I must say, I'm very unimpressed by how quickly Proxmox breaks for the smallest reason.
I really thought it would be more capable and stable than this after so many years in production. It's now causing me massive headaches, all because I unplugged and re-plugged a graphics card?

It really cannot be this fragile.

I've been building PCs for over two decades, so I know my way around this. I needed the Nvidia 1070 Ti from my dedicated Proxmox PC in another PC for a short while as I traveled. So, with the machine powered down, I removed the GPU and put it in the other PC. When done, I simply plugged it back into my Proxmox machine. Technically the Proxmox machine didn't even know the GPU had been removed.

This caused massive, endless issues, so much so that I cannot even find the Proxmox machine on the network anymore. It has gone from bad to worse within hours, and I can't understand how this system is so fragile. Going from a cleanly running system with 10 VMs for over a year to not being able to find it on the network, all because I re-plugged a GPU, is mind blowing.

The first problem it presented was that normally it would boot and stay on the screen saying

loading initial ramdisk

until I fired up a VM.

Now it shows that message for a few seconds, but then transitions to this:

Found volume group "pve" using metadata type lvm2
3 logical volume(s) in volume group "pve" now active

I found this odd, as I had never seen it before.

I then tried to boot up one of my Windows VMs; it simply refused to show on screen. I then tried Ubuntu, same issue. I tried removing the PCI device in the Hardware section and re-adding it, still nothing. I then set Display to Default instead of none to try running it through VNC, and that still didn't work. I checked all my settings.
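For reference, the GUI steps above correspond roughly to the following on the shell (a generic sketch; 101 stands in for the VM ID):

qm config 101                  # show the full VM config, including the hostpci and vga/display lines
qm set 101 --vga std           # temporarily switch to a standard emulated display so the VNC console works
qm set 101 --delete hostpci0   # temporarily drop the passed-through GPU entirely for testing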

I tried my TrueNAS SCALE VM, which still worked via its own IP address, so the issue clearly only affected anything that was trying to use the GPU.

After a day of struggling, I then got a message saying that it cannot capture any more logs as my hard disk is full. Again, I found this odd, because it has never had issues. I deleted some ISO files to free up space, but saw the freed space quickly filling up with log data again while TrueNAS SCALE and a VM were running. I checked this, and it had endless messages saying:

pve kernel: vfio-pci 0000:0b:00.0: BAR 1: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
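If anyone else hits that exact message: it usually means something on the host has already claimed that memory window, so vfio-pci can't reserve it for the VM. Two generic checks (0b:00.0 and d0000000 are just the values from my log, adjust for yours):

lspci -k -s 0b:00.0           # shows which kernel driver is currently bound to the GPU
grep -i d0000000 /proc/iomem  # shows what currently owns the memory range the BAR error complains about

If a boot framebuffer (e.g. an entry like BOOTFB or a framebuffer driver) shows up as the owner, that lines up with the simplefb issue mentioned further down the thread.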

I then tried to shut them down and refresh the GUI, and now it refuses to connect at all, with the error "The connection has timed out".

I then used PuTTY to access the machine, which worked. I ran ls -allh /var/log/sys*, which output:
-rw-r----- 1 root adm 23G Jun 4 11:51 /var/log/syslog
-rw-r----- 1 root adm 300K Jun 3 13:12 /var/log/syslog.1
-rw-r----- 1 root adm 14K May 15 00:01 /var/log/syslog.2.gz
-rw-r----- 1 root adm 66K May 13 17:47 /var/log/syslog.3.gz
-rw-r----- 1 root adm 162K Apr 30 09:54 /var/log/syslog.4.gz

23 GB of log data. So I then ran rm /var/log/syslog to remove it and free up space. Now I can't even access it via PuTTY, and my router no longer sees it on the LAN. Absolute mess!
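A side note for anyone in the same situation: simply deleting the live syslog file does not give the space back, because rsyslog keeps the deleted file open and keeps writing to it. A safer sketch would be something like:

truncate -s 0 /var/log/syslog      # empty the file in place instead of deleting it
systemctl restart rsyslog          # or restart the logger if the file was already removed, so it reopens a fresh one
journalctl --vacuum-size=200M      # also trim the systemd journal if that is what is filling up
df -h /                            # confirm the space actually came back

That obviously does not fix whatever is spamming the log in the first place, it just stops the disk from staying full.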

I cannot believe that my entire Proxmox server is inaccessible just because I pulled and re-plugged my GPU while it was off. It seriously has to be built better than this. Everything was 100% before I did this, so nothing changed other than that.
 
So, I am now able to access the GUI again.

My LAN cable goes into the Realtek LAN port on my motherboard, but I also bought an ASUS 10Gb Intel PCIe NIC that I have in one of the slower bottom PCIe slots for future use. I removed it to see if it was causing conflicts with the GPU. Since then I couldn't connect to the GUI or the PC at all, even though there was no LAN cable connected to that card; all LAN traffic runs through the other port on the motherboard, including the cable. How is that possible?
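For reference, one generic check for this kind of symptom (the interface and bridge names below are just examples): adding or removing a PCIe card can shift PCI addresses, which can rename the network interfaces, and then the bridge in /etc/network/interfaces points at a name that no longer exists, so the host drops off the network even though the cable and port never changed.

ip -br link                                             # list the interface names the kernel sees right now
grep -n 'iface\|bridge-ports' /etc/network/interfaces   # compare against what vmbr0 expects
# if the names no longer match, fix /etc/network/interfaces and apply it:
ifreload -a                                             # ifupdown2; alternatively reboot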

I also noticed that my GPU (0b:00.0) used to be in IOMMU group 27, but after the re-plug it has moved to group 28. The 10Gb NIC is now in group 27. So this must be causing a conflict somehow.

How can I keep the Intel NIC separate so it isn't trying to access the GPU's memory?
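For reference, the current grouping can be dumped with a small loop like this (a generic sketch; it just prints every PCI device together with its IOMMU group so you can see what actually ended up grouped with what after the re-seat):

for d in /sys/kernel/iommu_groups/*/devices/*; do
    g=$(basename "$(dirname "$(dirname "$d")")")   # group number taken from the path
    printf 'group %s: ' "$g"
    lspci -nns "$(basename "$d")"                  # the device sitting at that address
done | sort -V

Devices that share a group have to be handed to vfio together, so the interesting question is whether the GPU and the NIC merely swapped group numbers or actually landed in the same group.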
 
So, after running proxmox-boot-tool kernel pin 5.13.19-6-pve in the shell, the problem is fixed, as it rolled the kernel back to 5.13 from 5.15.
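For anyone wanting to do the same, the rough sequence is (the exact version string depends on which kernels are installed):

proxmox-boot-tool kernel list                 # see which kernels are installed and whether one is pinned
proxmox-boot-tool kernel pin 5.13.19-6-pve    # boot the older 5.13 kernel by default from now on
reboot
# and once a fixed 5.15 kernel is out:
proxmox-boot-tool kernel unpin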

I also edited my GRUB config by typing nano /etc/default/grub in the shell and changing the line to GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt textonly video=efifb:off video=vesafb:off video=simplefb:off video=astdrmfb" instead of the one I had, in case that helped with the resolution.
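One note in case anyone copies this: editing /etc/default/grub on its own does nothing until the boot configuration is regenerated. Roughly:

update-grub                   # regenerate the GRUB config so the new GRUB_CMDLINE_LINUX_DEFAULT is actually used
# on installs that boot via systemd-boot (e.g. ZFS root on UEFI) the options live in /etc/kernel/cmdline instead; after editing that file run:
proxmox-boot-tool refresh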

Either way, mine was working fine before this update, like many others in the link above, so there really needs to be a proper fix in 5.15, and it needs to be properly tested before releasing it. I lost two days of work because of this.
 
Do you always read the release notes before manually upgrading, and do you reboot your server right after a kernel upgrade so you can test the new kernel, instead of being surprised that the server no longer boots when you reboot it weeks or months later?
The 7.2 release notes warned about "known issues" with PCIe on the day of release: https://pve.proxmox.com/wiki/Roadmap#Proxmox_VE_7.2

  • PCI(e) pass through related:
    • Systems passing through a GPU may be affected by the switch to the SYS_FB (system frame buffer) KConfig build options using the simplefb module as driver in the new default 5.15 based kernel. The sys-fb allows taking over the FB from the firmware/earlier boot stages. Note that Proxmox VE uses the legacy simplefb driver over the modern simpledrm one due to regressions and issues we encountered on testing with the latter. Most of those issues are already fixed in newer kernels and Proxmox VE may try to switch to the modern, DRM based FB driver once it moves to 5.17, or newer, as its default kernel. If your system is configured to pass through the (i)GPU, and you had to avoid the host kernel claiming the device, you may now need to also add video=simplefb:off to the kernel boot command line.
    • Setups using vendor-reset for PCIe pass through need to adapt to changes of the new default 5.15 based kernel; for details see this issue. They must run the command echo 'device_specific' > /sys/bus/pci/devices/<PCI-ID>/reset_method before the VM is started. This can be automated by using a systemd service or an on-boot cron script. Alternatively, one can also use a VM hook script with the pre-start hook.
  • intel_iommu now defaults to on. The kernel config of the new 5.15 series enables the intel_iommu parameter by default - this can cause problems with older hardware (issues were reported with e.g. HP DL380 g8 servers, and Dell R610 servers - so hardware older than 10 years)
And you should always disable PCI passthrough or the autostart of your VMs before removing or adding PCIe cards, so that the wrong device doesn't get passed through to a VM.
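For the vendor-reset point quoted above, a minimal sketch of the hook-script variant could look like this (the snippet path, VM ID 101 and PCI address 0000:0b:00.0 are placeholders, and it assumes the snippets content type is enabled on the local storage):

cat > /var/lib/vz/snippets/reset-method.sh <<'EOF'
#!/bin/bash
# Proxmox calls hook scripts as: <script> <vmid> <phase>
vmid="$1"; phase="$2"
if [ "$phase" = "pre-start" ]; then
    # set the reset method for the passed-through device before the VM starts
    echo device_specific > /sys/bus/pci/devices/0000:0b:00.0/reset_method
fi
EOF
chmod +x /var/lib/vz/snippets/reset-method.sh
qm set 101 --hookscript local:snippets/reset-method.sh

This only matters for setups that actually use vendor-reset, as described in the release note.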
 
