Hi Everyone!
I decided to write a tutorial on PCI/GPU passthrough for HPE ML/DL servers, because much of the information you can find is old, misleading, or does not work.
Here you can find a working method.
1., Requirements:
Code:
-HPE ML/DL series server
-PCI-E card / GPU card
-iLO access
2., BIOS Settings:
Code:
F9 BIOS (RBSU), then press Ctrl+A (a hidden "Service Options" menu will appear at the bottom):
- "System Options" > "Intel(R) VT-d" > "Enabled"
- "Advanced Options" > "Video Options" > "Embedded video primary, optional video secondary"
- "Advanced Options" > "Remote Graphics Mode" > "Enabled"
- "Service Options" > "Processor Power and Utilization Monitoring" > "Enabled"
- "Service Options" > "Shared Memory Communication" > "Enabled"
- "Service Options" > "PCI Express 64bit BAR Support" > "Enabled"
The video settings matter: they stop the BIOS from initializing (grabbing) the GPU card, so the host keeps using the integrated onboard GPU and the card is left untouched.
If the BIOS does initialize the GPU, you may need a "vbios" dump from your GPU card so the VM can start/restart the card correctly (some settings are applied only during that initialization and cannot be changed later, which is why the "vbios" would be needed). With the settings above we avoid this, so no vbios is needed.
3., Proxmox host kernel settings:
Code:
/etc/default/grub
# INTEL processor:
GRUB_CMDLINE_LINUX=" intel_iommu=on iommu=pt initcall_blacklist=sysfb_init"
# AMD processor:
GRUB_CMDLINE_LINUX=" amd_iommu=on iommu=pt initcall_blacklist=sysfb_init"
Code:
/etc/modprobe.d/kvm.conf
options kvm ignore_msrs=1 report_ignored_msrs=0
Code:
/etc/modprobe.d/nvidia.conf
blacklist nvidiafb
blacklist nouveau
blacklist nvidia
blacklist nvidia_drm
Code:
/etc/modprobe.d/radeon.conf
blacklist radeon
blacklist amdgpu
Code:
/etc/modules
vfio
vfio_iommu_type1
vfio_pci
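Before rebuilding the initramfs it can help to confirm that all three VFIO modules really are listed. This is only a minimal sketch: `missing_vfio_modules` is a helper name made up for this post, and it assumes each module sits alone on its own line with no trailing spaces or comments.

```shell
# Print every required VFIO module that is NOT listed in the given file.
# Hypothetical helper; assumes one bare module name per line.
# Usage: missing_vfio_modules /etc/modules
missing_vfio_modules() {
    file=$1
    for mod in vfio vfio_iommu_type1 vfio_pci; do
        # -x matches the whole line only, so "vfio" does not match "vfio_pci"
        if ! grep -qx "$mod" "$file"; then
            echo "$mod"
        fi
    done
}
```

If `missing_vfio_modules /etc/modules` prints nothing, every module is present.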
Code:
$> update-initramfs -u -k all
$> update-grub
Reboot the Machine
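After the reboot you can confirm that the kernel really booted with the new parameters. The sketch below is an illustration, not part of the original procedure: `has_kernel_param` is a hypothetical helper that looks for an exact word in a kernel command line string.

```shell
# Succeed if the given kernel command line contains the exact parameter.
# Hypothetical helper for illustration.
# Usage: has_kernel_param "$(cat /proc/cmdline)" iommu=pt
has_kernel_param() {
    cmdline=$1
    param=$2
    # Intentionally unquoted: split the command line into words
    for word in $cmdline; do
        if [ "$word" = "$param" ]; then
            return 0
        fi
    done
    return 1
}
```

For example, `has_kernel_param "$(cat /proc/cmdline)" intel_iommu=on && echo ok` should print `ok` on an Intel host set up as above.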
4., HPE IOMMU configuration:
After the restart we need to adjust the IOMMU config. For this we need the "hp-scripting-tools" package from the "http://downloads.linux.hpe.com/SDR/repo/stk/" website.
We don't need to add the repo; just download the latest available version. The XML file is needed too.
Code:
$> wget "http://downloads.linux.hpe.com/SDR/repo/stk/Debian/pool/non-free/hp-scripting-tools_11.60-20_amd64.deb"
$> wget -O conrep_rmrds.xml "https://downloads.hpe.com/pub/softlib2/software1/pubsw-linux/p1472592088/v95853/conrep_rmrds.xml"
$> dpkg -i hp-scripting-tools_11.60-20_amd64.deb
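A quick way to confirm the install worked is to check that the `conrep` command is now on the PATH. This is just a small sketch; `require_cmd` is a hypothetical helper name, not something shipped with hp-scripting-tools.

```shell
# Succeed if the named command can be found, otherwise print a note to stderr.
# Hypothetical helper for illustration.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing command: $1" >&2
    return 1
}
```

Run `require_cmd conrep` after the `dpkg -i` step; no output means the tool was found.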
We need the physical slot number of the slot where the PCI-E card is located:
Code:
(Example: AMD GPU card)
$> lspci -nnk | grep 'AMD'
0c:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi ...
0d:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi ...
0e:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi ...
0e:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] ...
$> lspci -s 0c:00.0 -vvv | grep 'Physical Slot'
Physical Slot: 3
$> lspci -s 0d:00.0 -vvv | grep 'Physical Slot'
$> lspci -s 0e:00.0 -vvv | grep 'Physical Slot'
$> lspci -s 0e:00.1 -vvv | grep 'Physical Slot'
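In the listing above only the bridge 0c:00.0 reports a slot, so every device of the card has to be queried. The repeated grep calls can be wrapped in a small filter. This is a sketch under the assumption that `lspci -vvv` prints the slot as `Physical Slot: <number>`, as shown above; `physical_slot` is a made-up helper name.

```shell
# Read `lspci -vvv` text on stdin and print the physical slot number, if any.
# Hypothetical helper; prints nothing when the device reports no slot.
physical_slot() {
    sed -n 's/^[[:space:]]*Physical Slot:[[:space:]]*\([0-9][0-9-]*\).*/\1/p' | head -n 1
}

# Usage (addresses are the ones from the example above):
#   for dev in 0c:00.0 0d:00.0 0e:00.0 0e:00.1; do
#       echo "$dev -> $(lspci -s "$dev" -vvv | physical_slot)"
#   done
```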
Create a file named "exclude.dat" with the following content, replacing X in "RMRDS_SlotX" with the card's 'Physical Slot' number (here "RMRDS_Slot3"):
Code:
<Conrep> <Section name="RMRDS_Slot3" helptext=".">Endpoints_Excluded</Section> </Conrep>
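If you have to do this for several slots, the file content can be generated instead of typed. A minimal sketch, where `make_exclude` is a hypothetical helper that only fills the slot number into the exact line shown above:

```shell
# Print the conrep exclusion XML for one physical slot number.
# Hypothetical helper; the XML is the same line as in the post.
make_exclude() {
    slot=$1
    printf '<Conrep> <Section name="RMRDS_Slot%s" helptext=".">Endpoints_Excluded</Section> </Conrep>\n' "$slot"
}

# Usage: make_exclude 3 > exclude.dat
```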
Apply the config
Code:
$> conrep -l -x conrep_rmrds.xml -f exclude.dat
Query the status
Code:
$> conrep -s -x conrep_rmrds.xml -f verify.dat
Check that the config was applied:
Code:
$> grep -i excluded verify.dat
<Section name="RMRDS_Slot3" helptext=".">Endpoints_Excluded</Section>
( If you see the line above, you are okay )
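The same check can be scripted per slot, which is handy when more than one card is configured. This is only a sketch: `slot_is_excluded` is a hypothetical helper, and it assumes the dump contains the section line in the same form as shown above.

```shell
# Succeed if the conrep dump marks the given slot as excluded.
# Hypothetical helper; the closing quote after the number keeps Slot3 from matching Slot30.
# Usage: slot_is_excluded verify.dat 3
slot_is_excluded() {
    file=$1
    slot=$2
    grep -q "name=\"RMRDS_Slot${slot}\"[^>]*>Endpoints_Excluded<" "$file"
}
```

For example: `slot_is_excluded verify.dat 3 && echo "slot 3 is excluded"`.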
If you have more PCI/GPU cards that you want to pass through, repeat the "conrep" process for each of them.
Reboot the Machine
You can check that the IOMMU is working:
Code:
$> journalctl -xb 0 | grep -ie DMAR -ie IOMMU -ie VFIO
If you see something like this, then it is working:
Code:
kernel: DMAR: IOMMU enabled
kernel: DMAR: Host address width 46
kernel: DMAR: DRHD base: 0x000000fabfe000 flags: 0x0
kernel: DMAR: dmar0: reg_base_addr fabfe000 ver 1:0 cap d2078c106f0466 ecap f020de
kernel: DMAR: DRHD base: 0x000000f4ffe000 flags: 0x1
kernel: DMAR: dmar1: reg_base_addr f4ffe000 ver 1:0 cap d2078c106f0466 ecap f020de
kernel: DMAR: RMRR base: 0x000000bdffd000 end: 0x000000bdffffff
kernel: DMAR: RMRR base: 0x000000bdff6000 end: 0x000000bdffcfff
kernel: DMAR: RMRR base: 0x000000bdf83000 end: 0x000000bdf84fff
kernel: DMAR: RMRR base: 0x000000bdf7f000 end: 0x000000bdf82fff
kernel: DMAR: RMRR base: 0x000000bdf6f000 end: 0x000000bdf7efff
kernel: DMAR: RMRR base: 0x000000bdf6e000 end: 0x000000bdf6efff
kernel: DMAR: RMRR base: 0x000000000f4000 end: 0x000000000f4fff
kernel: DMAR: RMRR base: 0x000000000e8000 end: 0x000000000e8fff
kernel: DMAR: [Firmware Bug]: No firmware reserved region can cover this RMRR [0x00000000000e8000-0x00000000000e8fff], contact BIOS vendor for fixes
kernel: DMAR: [Firmware Bug]: Your BIOS is broken; bad RMRR [0x00000000000e8000-0x00000000000e8fff]
kernel: DMAR: RMRR base: 0x000000bddde000 end: 0x000000bdddefff
kernel: DMAR: ATSR flags: 0x0
kernel: DMAR-IR: IOAPIC id 10 under DRHD base 0xfabfe000 IOMMU 0
kernel: DMAR-IR: IOAPIC id 8 under DRHD base 0xf4ffe000 IOMMU 1
kernel: DMAR-IR: IOAPIC id 0 under DRHD base 0xf4ffe000 IOMMU 1
kernel: DMAR-IR: HPET id 0 under DRHD base 0xf4ffe000
kernel: DMAR-IR: x2apic is disabled because BIOS sets x2apic opt out bit.
kernel: DMAR-IR: Use 'intremap=no_x2apic_optout' to override the BIOS setting.
kernel: DMAR-IR: Enabled IRQ remapping in xapic mode
kernel: iommu: Default domain type: Passthrough (set via kernel command line)
kernel: DMAR: No SATC found
kernel: DMAR: dmar0: Using Queued invalidation
kernel: DMAR: dmar1: Using Queued invalidation
kernel: pci 0000:40:00.0: Adding to iommu group 0
kernel: pci 0000:40:01.0: Adding to iommu group 1
kernel: pci 0000:40:01.1: Adding to iommu group 2
kernel: pci 0000:40:02.0: Adding to iommu group 3
kernel: pci 0000:40:02.1: Adding to iommu group 4
kernel: pci 0000:40:02.2: Adding to iommu group 5
kernel: pci 0000:40:02.3: Adding to iommu group 6
kernel: pci 0000:40:03.0: Adding to iommu group 7
kernel: pci 0000:40:03.1: Adding to iommu group 8
kernel: pci 0000:40:03.2: Adding to iommu group 9
kernel: pci 0000:40:03.3: Adding to iommu group 10
kernel: pci 0000:40:04.0: Adding to iommu group 11
kernel: pci 0000:40:04.1: Adding to iommu group 12
kernel: pci 0000:40:04.2: Adding to iommu group 13
kernel: pci 0000:40:04.3: Adding to iommu group 14
kernel: pci 0000:40:04.4: Adding to iommu group 15
kernel: pci 0000:40:04.5: Adding to iommu group 16
kernel: pci 0000:40:04.6: Adding to iommu group 17
kernel: pci 0000:40:04.7: Adding to iommu group 18
kernel: pci 0000:41:00.0: Adding to iommu group 19
kernel: pci 0000:47:00.0: Adding to iommu group 20
kernel: pci 0000:47:00.1: Adding to iommu group 20
kernel: pci 0000:00:00.0: Adding to iommu group 21
kernel: pci 0000:00:01.0: Adding to iommu group 22
kernel: pci 0000:00:01.1: Adding to iommu group 23
kernel: pci 0000:00:02.0: Adding to iommu group 24
kernel: pci 0000:00:02.1: Adding to iommu group 25
kernel: pci 0000:00:02.2: Adding to iommu group 26
kernel: pci 0000:00:02.3: Adding to iommu group 27
kernel: pci 0000:00:03.0: Adding to iommu group 28
kernel: pci 0000:00:03.1: Adding to iommu group 29
kernel: pci 0000:00:03.2: Adding to iommu group 30
kernel: pci 0000:00:03.3: Adding to iommu group 31
kernel: pci 0000:00:04.0: Adding to iommu group 32
kernel: pci 0000:00:04.1: Adding to iommu group 33
kernel: pci 0000:00:04.2: Adding to iommu group 34
kernel: pci 0000:00:04.3: Adding to iommu group 35
kernel: pci 0000:00:04.4: Adding to iommu group 36
kernel: pci 0000:00:04.5: Adding to iommu group 37
kernel: pci 0000:00:04.6: Adding to iommu group 38
kernel: pci 0000:00:04.7: Adding to iommu group 39
kernel: pci 0000:00:05.0: Adding to iommu group 40
kernel: pci 0000:00:05.2: Adding to iommu group 41
kernel: pci 0000:00:05.4: Adding to iommu group 42
kernel: pci 0000:00:11.0: Adding to iommu group 43
kernel: pci 0000:00:1a.0: Adding to iommu group 44
kernel: pci 0000:00:1c.0: Adding to iommu group 45
kernel: pci 0000:00:1c.4: Adding to iommu group 46
kernel: pci 0000:00:1c.7: Adding to iommu group 47
kernel: pci 0000:00:1d.0: Adding to iommu group 48
kernel: pci 0000:00:1e.0: Adding to iommu group 49
kernel: pci 0000:00:1f.0: Adding to iommu group 50
kernel: pci 0000:00:1f.2: Adding to iommu group 50
kernel: pci 0000:0f:00.0: Adding to iommu group 51
kernel: pci 0000:0f:00.1: Adding to iommu group 51
kernel: pci 0000:0f:00.2: Adding to iommu group 51
kernel: pci 0000:0f:00.3: Adding to iommu group 51
kernel: pci 0000:04:00.0: Adding to iommu group 52
kernel: pci 0000:05:04.0: Adding to iommu group 53
kernel: pci 0000:05:05.0: Adding to iommu group 54
kernel: pci 0000:05:08.0: Adding to iommu group 55
kernel: pci 0000:07:00.0: Adding to iommu group 56
kernel: pci 0000:08:00.0: Adding to iommu group 57
kernel: pci 0000:03:00.0: Adding to iommu group 58
kernel: pci 0000:0c:00.0: Adding to iommu group 59
kernel: pci 0000:0d:00.0: Adding to iommu group 60
kernel: pci 0000:0e:00.0: Adding to iommu group 61
kernel: pci 0000:0e:00.1: Adding to iommu group 62
kernel: pci 0000:02:00.0: Adding to iommu group 63
kernel: pci 0000:02:00.1: Adding to iommu group 63
kernel: pci 0000:02:00.2: Adding to iommu group 63
kernel: pci 0000:02:00.3: Adding to iommu group 63
kernel: pci 0000:01:00.0: Adding to iommu group 64
kernel: pci 0000:01:00.1: Adding to iommu group 64
kernel: pci 0000:01:00.2: Adding to iommu group 64
kernel: pci 0000:01:00.4: Adding to iommu group 64
kernel: pci 0000:40:05.0: Adding to iommu group 65
kernel: pci 0000:40:05.2: Adding to iommu group 66
kernel: pci 0000:40:05.4: Adding to iommu group 67
kernel: pci 0000:3f:08.0: Adding to iommu group 68
kernel: pci 0000:3f:08.2: Adding to iommu group 68
kernel: pci 0000:3f:08.6: Adding to iommu group 69
kernel: pci 0000:3f:09.0: Adding to iommu group 70
kernel: pci 0000:3f:09.2: Adding to iommu group 70
kernel: pci 0000:3f:09.6: Adding to iommu group 71
kernel: pci 0000:3f:0a.0: Adding to iommu group 72
kernel: pci 0000:3f:0a.1: Adding to iommu group 72
kernel: pci 0000:3f:0a.2: Adding to iommu group 72
kernel: pci 0000:3f:0a.3: Adding to iommu group 72
kernel: pci 0000:3f:0b.0: Adding to iommu group 73
kernel: pci 0000:3f:0b.3: Adding to iommu group 73
kernel: pci 0000:3f:0c.0: Adding to iommu group 74
kernel: pci 0000:3f:0c.1: Adding to iommu group 74
kernel: pci 0000:3f:0c.2: Adding to iommu group 74
kernel: pci 0000:3f:0c.3: Adding to iommu group 74
kernel: pci 0000:3f:0c.4: Adding to iommu group 74
kernel: pci 0000:3f:0d.0: Adding to iommu group 75
kernel: pci 0000:3f:0d.1: Adding to iommu group 75
kernel: pci 0000:3f:0d.2: Adding to iommu group 75
kernel: pci 0000:3f:0d.3: Adding to iommu group 75
kernel: pci 0000:3f:0d.4: Adding to iommu group 75
kernel: pci 0000:3f:0e.0: Adding to iommu group 76
kernel: pci 0000:3f:0e.1: Adding to iommu group 76
kernel: pci 0000:3f:0f.0: Adding to iommu group 77
kernel: pci 0000:3f:0f.1: Adding to iommu group 78
kernel: pci 0000:3f:0f.2: Adding to iommu group 79
kernel: pci 0000:3f:0f.3: Adding to iommu group 80
kernel: pci 0000:3f:0f.4: Adding to iommu group 81
kernel: pci 0000:3f:0f.5: Adding to iommu group 82
kernel: pci 0000:3f:10.0: Adding to iommu group 83
kernel: pci 0000:3f:10.1: Adding to iommu group 84
kernel: pci 0000:3f:10.2: Adding to iommu group 85
kernel: pci 0000:3f:10.3: Adding to iommu group 86
kernel: pci 0000:3f:10.4: Adding to iommu group 87
kernel: pci 0000:3f:10.5: Adding to iommu group 88
kernel: pci 0000:3f:10.6: Adding to iommu group 89
kernel: pci 0000:3f:10.7: Adding to iommu group 90
kernel: pci 0000:3f:13.0: Adding to iommu group 91
kernel: pci 0000:3f:13.1: Adding to iommu group 91
kernel: pci 0000:3f:13.4: Adding to iommu group 91
kernel: pci 0000:3f:13.5: Adding to iommu group 91
kernel: pci 0000:3f:16.0: Adding to iommu group 92
kernel: pci 0000:3f:16.1: Adding to iommu group 92
kernel: pci 0000:3f:16.2: Adding to iommu group 92
kernel: pci 0000:5f:08.0: Adding to iommu group 93
kernel: pci 0000:5f:08.2: Adding to iommu group 93
kernel: pci 0000:5f:08.6: Adding to iommu group 94
kernel: pci 0000:5f:09.0: Adding to iommu group 95
kernel: pci 0000:5f:09.2: Adding to iommu group 95
kernel: pci 0000:5f:09.6: Adding to iommu group 96
kernel: pci 0000:5f:0a.0: Adding to iommu group 97
kernel: pci 0000:5f:0a.1: Adding to iommu group 97
kernel: pci 0000:5f:0a.2: Adding to iommu group 97
kernel: pci 0000:5f:0a.3: Adding to iommu group 97
kernel: pci 0000:5f:0b.0: Adding to iommu group 98
kernel: pci 0000:5f:0b.3: Adding to iommu group 98
kernel: pci 0000:5f:0c.0: Adding to iommu group 99
kernel: pci 0000:5f:0c.1: Adding to iommu group 99
kernel: pci 0000:5f:0c.2: Adding to iommu group 99
kernel: pci 0000:5f:0c.3: Adding to iommu group 99
kernel: pci 0000:5f:0c.4: Adding to iommu group 99
kernel: pci 0000:5f:0d.0: Adding to iommu group 100
kernel: pci 0000:5f:0d.1: Adding to iommu group 100
kernel: pci 0000:5f:0d.2: Adding to iommu group 100
kernel: pci 0000:5f:0d.3: Adding to iommu group 100
kernel: pci 0000:5f:0d.4: Adding to iommu group 100
kernel: pci 0000:5f:0e.0: Adding to iommu group 101
kernel: pci 0000:5f:0e.1: Adding to iommu group 101
kernel: pci 0000:5f:0f.0: Adding to iommu group 102
kernel: pci 0000:5f:0f.1: Adding to iommu group 103
kernel: pci 0000:5f:0f.2: Adding to iommu group 104
kernel: pci 0000:5f:0f.3: Adding to iommu group 105
kernel: pci 0000:5f:0f.4: Adding to iommu group 106
kernel: pci 0000:5f:0f.5: Adding to iommu group 107
kernel: pci 0000:5f:10.0: Adding to iommu group 108
kernel: pci 0000:5f:10.1: Adding to iommu group 109
kernel: pci 0000:5f:10.2: Adding to iommu group 110
kernel: pci 0000:5f:10.3: Adding to iommu group 111
kernel: pci 0000:5f:10.4: Adding to iommu group 112
kernel: pci 0000:5f:10.5: Adding to iommu group 113
kernel: pci 0000:5f:10.6: Adding to iommu group 114
kernel: pci 0000:5f:10.7: Adding to iommu group 115
kernel: pci 0000:5f:13.0: Adding to iommu group 116
kernel: pci 0000:5f:13.1: Adding to iommu group 116
kernel: pci 0000:5f:13.4: Adding to iommu group 116
kernel: pci 0000:5f:13.5: Adding to iommu group 116
kernel: pci 0000:5f:16.0: Adding to iommu group 117
kernel: pci 0000:5f:16.1: Adding to iommu group 117
kernel: pci 0000:5f:16.2: Adding to iommu group 117
kernel: DMAR: Intel(R) Virtualization Technology for Directed I/O
kernel: DMAR: DRHD: handling fault status reg 2
kernel: DMAR: [INTR-REMAP] Request device [01:00.0] fault index 0x67 [fault reason 0x26] Blocked an interrupt request due to source-id verification failure
kernel: VFIO - User Level meta-driver version: 0.3
kernel: vfio-pci 0000:0e:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
kernel: vfio-pci 0000:0e:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
kernel: vfio-pci 0000:0e:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
kernel: vfio-pci 0000:0e:00.0: enabling device (0040 -> 0043)
kernel: vfio-pci 0000:0e:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
kernel: vfio-pci 0000:0e:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
kernel: vfio-pci 0000:0e:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
kernel: vfio-pci 0000:0e:00.0: vfio_ecap_init: hiding ecap 0x27@0x450
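The group assignments from the log above can also be read back from sysfs at any time, without digging through the journal. A minimal sketch, assuming the usual `/sys/kernel/iommu_groups/<group>/devices/<address>` layout; `list_iommu_groups` is a made-up helper name.

```shell
# Print "group <n>: <pci address>" for every device under the given root.
# Hypothetical helper; defaults to the standard sysfs location.
list_iommu_groups() {
    root=${1:-/sys/kernel/iommu_groups}
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue      # skip the literal pattern when nothing matches
        group=${dev#"$root"/}          # strip the root prefix
        group=${group%%/*}             # keep only the group number
        echo "group $group: ${dev##*/}"
    done
}
```

For passthrough, the GPU (here 0e:00.0 and 0e:00.1) should not share a group with unrelated devices.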
Some lines in this output are commonly misread as errors:
Code:
kernel: DMAR: [Firmware Bug]: No firmware reserved region can cover this RMRR [0x00000000000e8000-0x00000000000e8fff], contact BIOS vendor for fixes
kernel: DMAR: [Firmware Bug]: Your BIOS is broken; bad RMRR [0x00000000000e8000-0x00000000000e8fff]
....
kernel: DMAR-IR: x2apic is disabled because BIOS sets x2apic opt out bit.
kernel: DMAR-IR: Use 'intremap=no_x2apic_optout' to override the BIOS setting.
This is not a bug or a broken IOMMU: it appears because we enabled only the "Slot3 PCI-E" for IOMMU with the "hp-scripting-tools" conrep command. Just ignore these fake errors.
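If these lines get in the way while reading the log, they can be filtered out. This is a small sketch only; `drop_known_noise` is a hypothetical helper, and the two patterns are simply taken from the messages quoted above.

```shell
# Drop the "Firmware Bug" and x2apic messages from a log stream on stdin.
# Hypothetical helper; it hides the lines, it does not change anything on the host.
drop_known_noise() {
    grep -v -e 'Firmware Bug' -e 'x2apic'
}

# Usage: journalctl -xb 0 | grep -ie DMAR -ie IOMMU -ie VFIO | drop_known_noise
```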
5., Configure the VM:
The last step is to add the PCI/GPU card to the VM config; you can use the WebGUI for this.
Code:
Add PCI-Device
(*) Raw Device []
[X] Primary GPU
[X] All Functions
[X] PCI-Express
[X] ROM-BAR
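The same can be done from the shell with `qm set`. The sketch below builds the option string that, as far as I know, matches the checkboxes above (`x-vga` for Primary GPU, `pcie` for PCI-Express, `rombar` for ROM-BAR, and an address without the function suffix for All Functions). `hostpci_value` is a hypothetical helper, the VM ID 100 is only an example, and the PCI-Express option assumes the VM uses the q35 machine type.

```shell
# Build the hostpci option string for a PCI address such as 0e:00.
# Hypothetical helper; leaving out the ".0" function passes all functions of the card.
hostpci_value() {
    addr=$1
    echo "${addr},pcie=1,x-vga=1,rombar=1"
}

# Usage (VM ID 100 is an example, use your own):
#   qm set 100 -hostpci0 "$(hostpci_value 0e:00)"
```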