[ISSUE] VM is stuck in starting state. Only force power off by the button works

andrei indreies

Nov 2, 2023
I have a proxmox setup with:
AMD ryzen 5 4650g pro
b550m ds3h
32GB ram ECC micron 3200
2x 16tb exos x16
2x ssd 1TB 2.5inch samsung 870 evo
Asm1166.

My system has been running a TrueNAS Core VM with my ASM1166 passed through for 3 months. Then I tried to create an Ubuntu Server VM and pass through my iGPU, without success, and now my TrueNAS VM is stuck in the starting state.

my config
Code:
  GNU nano 7.2                                              /etc/pve/qemu-server/100.conf                                                     
balloon: 0
bios: seabios
boot: order=ide0;net0
cores: 4
cpu: host
efidisk0: local-lvm:vm-100-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:05:00.0
ide0: local-lvm:vm-100-disk-1,discard=on,size=32G,ssd=1
localtime: 1
machine: q35
memory: 16385
meta: creation-qemu=9.0.2,ctime=1734878992
name: Truenas-core
net0: virtio=BC:24:11:B5:A2:FF,bridge=vmbr0
net1: virtio=BC:24:11:B1:3D:CA,bridge=vmbr1,mtu=9000
numa: 0
ostype: other
scsihw: virtio-scsi-pci
smbios1: uuid=6f14563f-6e8d-4f93-a22f-eb062a98e630
sockets: 1
vmgenid: 2f447000-33aa-40f9-ac0a-ff4490fb8aa5

Code:
05:00.0 0106: 1b21:1166 (rev 02) (prog-if 01 [AHCI 1.0])
        Subsystem: 1b21:2116
        Flags: bus master, fast devsel, latency 0, IRQ 38, IOMMU group 9
        Memory at fcf82000 (32-bit, non-prefetchable) [size=8K]
        Memory at fcf80000 (32-bit, non-prefetchable) [size=8K]
        Expansion ROM at fcf00000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [80] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [130] Secondary PCI Express
        Kernel driver in use: ahci
        Kernel modules: ahci

I tried to unlock the VM and searched for logs to understand why it is stuck. Only powering it off from the button works.
 

Attachments

  • Screenshot 2025-03-26 at 09.43.51.png
  • Screenshot 2025-03-26 at 09.44.01.png
Hello andrei indreies! Maybe you are already aware, but there's a chapter in the PVE documentation about PCI(e) passthrough and also a wiki page with even more details. Could you please provide us with the output of the following commands from the host:
  1. lspci -nnk
  2. dmesg
  3. journalctl --boot
  4. pvesh get /nodes/{nodename}/hardware/pci --pci-class-blacklist ""
  5. dmidecode -t bios
Also, please note the VM config recommendations for the best PCI(e) passthrough compatibility - you might want to use OVMF instead of SeaBIOS if that is possible with your iGPU.
 
Hello and thanks a lot for your quick reply. I attached all the requested logs in separate files.
Meanwhile I changed the boot from legacy/bios to EFI.

EXTRA: It would be really great if you could also help me understand how I can pass through the Radeon Vega iGPU of my 4650G Pro. I followed all the steps from the documentation. I don't need display output; I have the reset bug. This Ubuntu Server VM is the control plane of my k3s cluster, which consists of 2x RPi plus this machine, and I need the GPU for transcoding only. I first tried with LXC and it worked, but I needed rook-ceph instead of NFS, and it creates a lot of disks that I can't mount as easily as I can inside a VM. In the VM, however, I have problems with my GPU. If it is not possible, I will buy an Arc A310.
 

First of all, some of the outputs you provided are not complete, e.g. the output of dmesg does not show the beginning. In the future, I would recommend using dmesg > dmesg.txt to avoid such issues. However, for now, please provide us with the following (shorter) outputs:
  1. dmesg | grep -e DMAR -e IOMMU -e AMD-Vi
  2. lsmod | grep vfio
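To avoid truncated output, the commands requested in this thread can be redirected to files in one go. The following is only a minimal sketch, not a Proxmox tool: the function name collect_diag, the helper save, the default directory and the file names are made up for illustration, and it assumes the node name equals the host name.

```shell
# collect_diag [OUTDIR]: write each diagnostic to its own file in OUTDIR.
collect_diag() {
    local outdir="${1:-./pve-diag}"
    mkdir -p "$outdir" || return 1

    # save NAME CMD...: store stdout and stderr of CMD in NAME.txt.
    # A missing or failing command still leaves a file with the error message.
    save() {
        local name="$1"
        shift
        "$@" > "$outdir/$name.txt" 2>&1 || true
    }

    save lspci     lspci -nnk
    save dmesg     dmesg
    save journal   journalctl --boot --no-pager
    save pci       pvesh get "/nodes/$(hostname)/hardware/pci" --pci-class-blacklist ""
    save dmidecode dmidecode -t bios
    save lsmod     lsmod
    # Short IOMMU summary taken from the saved dmesg output.
    save iommu     grep -e DMAR -e IOMMU -e AMD-Vi "$outdir/dmesg.txt"
}
```

Run as root on the host, for example collect_diag /root/pve-diag, and attach the resulting files.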

Secondly, I see that the iGPU uses the amdgpu kernel module, meaning that the iGPU will be used by the host instead of the guest VM. As you are probably aware, this is not possible, as either the host, or a VM, can use the GPU, but not both at once (note: this is possible with LXC containers, which explains why it worked). In other words, you will need to blacklist the amdgpu driver on the host (and update your initramfs afterwards) to be able to pass it through, as explained in the guides I linked to. Obviously, this means you won't have any video output on the host anymore.
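As a rough sketch of what the guides describe, a blacklist entry could look like this. The file name blacklist.conf is just a common choice (an assumption here); any .conf file in /etc/modprobe.d/ should be picked up.

```
# /etc/modprobe.d/blacklist.conf
blacklist amdgpu
```

Afterwards, run update-initramfs -u -k all and reboot the host so that the change takes effect.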

I don't need display output I have the reset bug
Do you mean the AMD GPU reset bug?
 
amdgpu and radeon are already on the blacklist.

Code:
06:00.0 0300: 1002:1636 (rev d9) (prog-if 00 [VGA controller])
        Subsystem: 1458:d000
        Flags: fast devsel, IRQ 31, IOMMU group 10
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Memory at e0000000 (64-bit, prefetchable) [size=2M]
        I/O ports at e000 [size=256]
        Memory at fcb00000 (32-bit, non-prefetchable) [size=512K]
        Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [64] Express Legacy Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable- Count=1/4 Maskable- 64bit+
        Capabilities: [c0] MSI-X: Enable- Count=4 Masked-
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [270] Secondary PCI Express
        Capabilities: [2a0] Access Control Services
        Capabilities: [2b0] Address Translation Service (ATS)
        Capabilities: [2c0] Page Request Interface (PRI)
        Capabilities: [2d0] Process Address Space ID (PASID)
        Capabilities: [400] Data Link Feature <?>
        Capabilities: [410] Physical Layer 16.0 GT/s <?>
        Capabilities: [440] Lane Margining at the Receiver <?>
        Kernel driver in use: vfio-pci
        Kernel modules: amdgpu
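To re-check after a reboot which driver actually holds the device, the relevant line can be pulled out of the lspci output. A minimal sketch; the function name driver_in_use is made up for illustration, and 06:00.0 is the iGPU address from the output above.

```shell
# driver_in_use: read `lspci -k` style output on stdin and print the
# value of the first "Kernel driver in use" line.
driver_in_use() {
    awk -F': ' '/Kernel driver in use/ { print $2; exit }'
}

# Usage on the host:
#   lspci -nnk -s 06:00.0 | driver_in_use
```

For passthrough this should print vfio-pci rather than amdgpu, as it does in the output above.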

Code:
root@drewspace:~# lsmod | grep vfio
vfio_pci               16384  1
vfio_pci_core          86016  1 vfio_pci
irqbypass              12288  3 vfio_pci_core,kvm
vfio_iommu_type1       49152  1
vfio                   65536  8 vfio_pci_core,vfio_iommu_type1,vfio_pci
iommufd                94208  1 vfio

I managed to boot into TrueNAS somehow after 30 restarts and a kernel update, but I'm 100% sure that if I stop it right now, it won't boot again.


Code:
root@drewspace:~# dmesg |  grep -e DMAR -e IOMMU -e AMD-Vi
[    0.165868] AMD-Vi: Using global IVHD EFR:0x206d73ef22254ade, EFR2:0x0
[    0.448512] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[    0.449613] AMD-Vi: Extended features (0x206d73ef22254ade, 0x0): PPR X2APIC NX GT IA GA PC GA_vAPIC
[    0.449623] AMD-Vi: Interrupt remapping enabled
[    0.449624] AMD-Vi: X2APIC enabled
[    0.533543] AMD-Vi: Virtual APIC enabled
[    0.536356] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
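The output above indicates that the IOMMU is active. How the devices are grouped can be read from sysfs; the lspci outputs in this thread report group 9 for the ASM1166 and group 10 for the iGPU. A minimal sketch for listing all groups follows. The function name list_iommu_groups is made up for illustration, and the optional argument only exists so the function can be pointed at a different directory.

```shell
# list_iommu_groups [ROOT]: print one "group N: PCI-address" line per device.
list_iommu_groups() {
    local root="${1:-/sys/kernel/iommu_groups}"
    local dev group
    for dev in "$root"/*/devices/*; do
        # Skip the unexpanded pattern when no groups exist.
        [ -e "$dev" ] || continue
        group="${dev%/devices/*}"   # strip "/devices/<address>"
        group="${group##*/}"        # keep only the group number
        printf 'group %s: %s\n' "$group" "${dev##*/}"
    done
}
```

A device can only be passed through cleanly if everything else in its group goes to the same VM or is not used by the host.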


Yes. But nothing I found works for me; I can't find any vbios that works with my Ubuntu VM.
 

Well, I think I managed to make it work. There is no video output from HDMI, but Jellyfin on my Kubernetes cluster runs really well with VAAPI. I get some frame drops after a while, which I will investigate.
For the record, I used the vbios from here: https://gist.github.com/c4software/c824f11ac55ebe8cf01df040a4cb58b2
I don't use the ACS override (pcie_acs_override) since it somehow blocks my ASM1166 AHCI controller. I only use iommu=pt and that's it.
Blacklisted amdgpu and radeon.
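For anyone trying to reproduce this, the setup described above might look roughly like the fragments below. This is a sketch under assumptions, not the exact configuration used here: it assumes the host boots via GRUB, and both <vmid> and the ROM file name vbios_igpu.bin are placeholders. 06:00.0 is the iGPU address from the lspci output earlier in the thread.

```
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt"

# /etc/pve/qemu-server/<vmid>.conf
hostpci0: 0000:06:00.0,romfile=vbios_igpu.bin
```

The romfile path is relative to /usr/share/kvm/. After editing /etc/default/grub, run update-grub and reboot; hosts that boot via systemd-boot use /etc/kernel/cmdline and proxmox-boot-tool refresh instead.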

Screenshot 2025-03-27 at 09.18.10.png

Code:
drew@thor-node:~$ vainfo
error: can't connect to X server!
libva info: VA-API version 1.20.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/radeonsi_drv_video.so
libva info: Found init function __vaDriverInit_1_20
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.20 (libva 2.12.0)
vainfo: Driver version: Mesa Gallium driver 24.2.8-1ubuntu1~24.04.1 for AMD Radeon Graphics (radeonsi, renoir, LLVM 19.1.1, DRM 3.59, 6.8.0-55-generic)
vainfo: Supported profile and entrypoints
      VAProfileMPEG2Simple            :    VAEntrypointVLD
      VAProfileMPEG2Main              :    VAEntrypointVLD
      VAProfileVC1Simple              :    VAEntrypointVLD
      VAProfileVC1Main                :    VAEntrypointVLD
      VAProfileVC1Advanced            :    VAEntrypointVLD
      VAProfileH264ConstrainedBaseline:    VAEntrypointVLD
      VAProfileH264ConstrainedBaseline:    VAEntrypointEncSlice
      VAProfileH264Main               :    VAEntrypointVLD
      VAProfileH264Main               :    VAEntrypointEncSlice
      VAProfileH264High               :    VAEntrypointVLD
      VAProfileH264High               :    VAEntrypointEncSlice
      VAProfileHEVCMain               :    VAEntrypointVLD
      VAProfileHEVCMain               :    VAEntrypointEncSlice
      VAProfileHEVCMain10             :    VAEntrypointVLD
      VAProfileHEVCMain10             :    VAEntrypointEncSlice
      VAProfileJPEGBaseline           :    VAEntrypointVLD
      VAProfileVP9Profile0            :    VAEntrypointVLD
      VAProfileVP9Profile2            :    VAEntrypointVLD
      VAProfileNone                   :    VAEntrypointVideoP
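Since the GPU is only needed for transcoding, the interesting entries are the ones with an encode entrypoint. A minimal sketch for filtering them out of the vainfo output; the function name va_encode_profiles is made up for illustration.

```shell
# va_encode_profiles: read vainfo output on stdin and print every profile
# that offers the VAEntrypointEncSlice (hardware encode) entrypoint.
va_encode_profiles() {
    # Some profile names have the colon attached, so strip it if present.
    awk '/VAEntrypointEncSlice/ { sub(/:$/, "", $1); print $1 }'
}

# Usage inside the VM:
#   vainfo 2>/dev/null | va_encode_profiles
```

With the output above this lists the H.264 and HEVC profiles, which matches what is needed for VAAPI transcoding.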
 