HW Info:
Manager: pve-manager/7.4-17/513c62be
Kernel: Linux 5.15.126-1-pve #1 SMP PVE 5.15.126-1 (2023-10-03T17:24Z)
CPU: AMD Ryzen 9 5900X 12-Core Processor
Mobo: MSI Meg X570 Unify
My goal is to run a virtual gaming machine in Proxmox with a dedicated GPU and NVMe drive for performance. I understand the limitations and view this mostly as a learning project.
I currently have my GPU passed through to my VM. It seems to be working fine and even survives a full restart of the node, which was previously an issue.
I previously had a machine working with both GPU and NVMe passthrough, but it broke about a year ago, so I am starting the build over now that I have some time.
Current issue:
I can see the drive in my Windows 11 install, but it appears to be write-protected. I can copy files from it, but I can't create, edit, or delete anything on it. I have tested this with live CDs, attempted to image an OS onto the disk (both Windows and Linux), etc. There is already a Windows 10 install on the drive from before that should be intact, but it can't boot (probably because it can't write to the page file or something similar). I haven't attempted to image the drive outside of the VM environment, so I acknowledge that the issue could actually be with the NVMe itself (I really hope not).
If anyone has any tests or configuration I haven't tried in the last week that would be appreciated. Also, if anything is missing from above or if any other info could be helpful please let me know. I tried to get what I thought was relevant but I likely missed something I am less familiar with.
Thanks in advance.
SOLVED:
Unfortunately, it does look like a HW failure on the drive. I will need to put in for a warranty replacement.
I found the issue by running the command:
smartctl -a /dev/nvme0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- available spare has fallen below threshold
- media has been placed in read only mode
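The two reasons smartctl lists above correspond to bits in the NVMe Critical Warning field, which `smartctl -a` also prints on its own "Critical Warning:" line. As a rough sketch, assuming the standard NVMe bit layout (the helper name `decode_nvme_warning` is made up for illustration, and the raw value is not shown in this post), that field could be decoded like this:

```shell
# Hypothetical helper: decode an NVMe "Critical Warning" value (e.g. 0x09)
# into the conditions it flags. Bit meanings assume the standard NVMe
# SMART / Health log layout.
decode_nvme_warning() {
    w=$(( $1 ))
    if [ $(( w & 1 )) -ne 0 ]; then echo "available spare below threshold"; fi
    if [ $(( w & 2 )) -ne 0 ]; then echo "temperature outside threshold"; fi
    if [ $(( w & 4 )) -ne 0 ]; then echo "reliability degraded"; fi
    if [ $(( w & 8 )) -ne 0 ]; then echo "media placed in read-only mode"; fi
    if [ $(( w & 16 )) -ne 0 ]; then echo "volatile memory backup failed"; fi
    return 0
}

# Example: decode_nvme_warning 0x09
# would print the same two conditions smartctl reported above.
```

A value of 0x00 prints nothing, which is what a healthy drive should report.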
Important files:
Code:
cat /etc/modprobe.d/blacklist.conf
blacklist nouveau
blacklist nvidia
blacklist snd_hda_intel
cat /etc/modprobe.d/kvm.conf
options kvm ignore_msrs=1 report_ignored_msrs=0
cat /etc/modprobe.d/pve-blacklist.conf
# This file contains a list of modules which are not supported by Proxmox VE
# nvidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb
cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:1e84,10de:10f8,10de:1ad8,10de:1ad9 disable_vga=1
cat /etc/default/grub
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt pci=noats initcall_blacklist=sysfb_init"
GRUB_CMDLINE_LINUX=""
The PCI IDs in vfio.conf belong to the four functions of my GPU (2d:00.0 through 2d:00.3).
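For reference, one way to build that `ids=` list without copying each value by hand. This is only a sketch: `pci_ids` is a made-up helper name, and it assumes `lspci -nn` output where the vendor:device pair appears in square brackets on each line.

```shell
# Hypothetical helper: read `lspci -nn` lines on stdin and print the
# vendor:device IDs as a comma-separated list suitable for vfio-pci ids=.
# The 4-digit class code (e.g. [0300]) has no colon, so it is not matched.
pci_ids() {
    grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' | tr -d '[]' | paste -sd, -
}

# Example (all functions of the device in slot 2d:00):
#   lspci -nn -s 2d:00 | pci_ids
```

Feed it plain `lspci -nn` output rather than `lspci -nnk`, since the `-k` form adds Subsystem lines that carry their own bracketed IDs.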
Additional commands:
Code:
dmesg | grep -e DMAR -e IOMMU -e AMD-Vi
[ 0.147022] AMD-Vi: ivrs, add hid:AMDI0020, uid:\_SB.FUR0, rdevid:160
[ 0.147023] AMD-Vi: ivrs, add hid:AMDI0020, uid:\_SB.FUR1, rdevid:160
[ 0.147024] AMD-Vi: ivrs, add hid:AMDI0020, uid:\_SB.FUR2, rdevid:160
[ 0.147025] AMD-Vi: ivrs, add hid:AMDI0020, uid:\_SB.FUR3, rdevid:160
[ 0.708148] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[ 0.709312] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
[ 0.709313] AMD-Vi: Extended features (0x58f77ef22294a5a): PPR NX GT IA PC GA_vAPIC
[ 0.709315] AMD-Vi: Interrupt remapping enabled
[ 0.721412] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
find /sys/kernel/iommu_groups/ -type l
/sys/kernel/iommu_groups/17/devices/0000:21:06.0
/sys/kernel/iommu_groups/7/devices/0000:00:07.0
/sys/kernel/iommu_groups/25/devices/0000:2d:00.2
/sys/kernel/iommu_groups/25/devices/0000:2d:00.0
/sys/kernel/iommu_groups/25/devices/0000:2d:00.3
/sys/kernel/iommu_groups/25/devices/0000:2d:00.1
/sys/kernel/iommu_groups/15/devices/0000:21:01.0
/sys/kernel/iommu_groups/5/devices/0000:00:04.0
/sys/kernel/iommu_groups/23/devices/0000:27:00.0
/sys/kernel/iommu_groups/13/devices/0000:20:00.0
/sys/kernel/iommu_groups/3/devices/0000:00:03.0
/sys/kernel/iommu_groups/21/devices/0000:22:00.0
/sys/kernel/iommu_groups/11/devices/0000:00:14.3
/sys/kernel/iommu_groups/11/devices/0000:00:14.0
/sys/kernel/iommu_groups/1/devices/0000:00:01.2
/sys/kernel/iommu_groups/28/devices/0000:2f:00.3
/sys/kernel/iommu_groups/18/devices/0000:2a:00.3
/sys/kernel/iommu_groups/18/devices/0000:2a:00.1
/sys/kernel/iommu_groups/18/devices/0000:21:08.0
/sys/kernel/iommu_groups/18/devices/0000:2a:00.0
/sys/kernel/iommu_groups/8/devices/0000:00:07.1
/sys/kernel/iommu_groups/26/devices/0000:2e:00.0
/sys/kernel/iommu_groups/16/devices/0000:21:05.0
/sys/kernel/iommu_groups/6/devices/0000:00:05.0
/sys/kernel/iommu_groups/24/devices/0000:28:00.0
/sys/kernel/iommu_groups/14/devices/0000:21:00.0
/sys/kernel/iommu_groups/4/devices/0000:00:03.1
/sys/kernel/iommu_groups/22/devices/0000:23:00.0
/sys/kernel/iommu_groups/12/devices/0000:00:18.3
/sys/kernel/iommu_groups/12/devices/0000:00:18.1
/sys/kernel/iommu_groups/12/devices/0000:00:18.6
/sys/kernel/iommu_groups/12/devices/0000:00:18.4
/sys/kernel/iommu_groups/12/devices/0000:00:18.2
/sys/kernel/iommu_groups/12/devices/0000:00:18.0
/sys/kernel/iommu_groups/12/devices/0000:00:18.7
/sys/kernel/iommu_groups/12/devices/0000:00:18.5
/sys/kernel/iommu_groups/2/devices/0000:00:02.0
/sys/kernel/iommu_groups/20/devices/0000:21:0a.0
/sys/kernel/iommu_groups/20/devices/0000:2c:00.0
/sys/kernel/iommu_groups/10/devices/0000:00:08.1
/sys/kernel/iommu_groups/29/devices/0000:2f:00.4
/sys/kernel/iommu_groups/0/devices/0000:00:01.0
/sys/kernel/iommu_groups/19/devices/0000:21:09.0
/sys/kernel/iommu_groups/19/devices/0000:2b:00.0
/sys/kernel/iommu_groups/9/devices/0000:00:08.0
/sys/kernel/iommu_groups/27/devices/0000:2f:00.0
lspci -nnk
23:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a801]
Kernel driver in use: vfio-pci
Kernel modules: nvme
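The raw `find` output above is hard to read because the groups come out unordered. A small sketch for sorting it by group number (`iommu_groups_sorted` is a made-up name; it only assumes paths of the form shown above):

```shell
# Hypothetical helper: read /sys/kernel/iommu_groups/<n>/devices/<addr>
# paths on stdin and print "group <n>: <addr>", sorted by group number
# and then by device address.
iommu_groups_sorted() {
    awk -F/ '{ print $5, $7 }' \
        | sort -k1,1n -k2,2 \
        | awk '{ printf "group %s: %s\n", $1, $2 }'
}

# Example:
#   find /sys/kernel/iommu_groups/ -type l | iommu_groups_sorted
```

Sorted this way it is easier to see that the NVMe controller (0000:23:00.0) sits alone in group 22 and the four GPU functions share group 25, so the grouping itself does not look like the problem here.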
I previously added a few additional options to the above configuration files but have since rolled them back. These configs would have been:
Code:
cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:1e84,10de:10f8,10de:1ad8,10de:1ad9,144d:a801,144d:a808 disable_vga=1
cat /etc/modprobe.d/kvm.conf
options kvm ignore_msrs=1 report_ignored_msrs=0
options vfio_iommu_type1 allow_unsafe_interrupts=1
Current VM config (I recently added the CPU flags; they neither help nor hurt):
Code:
cat /etc/pve/nodes/homelan/qemu-server/105.conf
acpi: 1
agent: 1
balloon: 0
bios: ovmf
boot: order=ide2;sata0
cores: 6
cpu: host,flags=+ibpb;+virt-ssbd;+amd-ssbd;+amd-no-ssb;+pdpe1gb;+aes
efidisk0: local-lvm:vm-105-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:2d:00.0,pcie=1,x-vga=1
hostpci1: 0000:2d:00.1,pcie=1
hostpci2: 0000:2d:00.2,pcie=1
hostpci3: 0000:2d:00.3,pcie=1
hostpci4: 0000:23:00,pcie=1
hotplug: disk,network,usb,memory,cpu
ide2: none,media=cdrom
kvm: 1
machine: pc-q35-7.2
memory: 16384
meta: creation-qemu=7.2.0,ctime=1699598302
name: GamePC
net0: e1000=AA:D3:43:F6:ED:1D,bridge=vmbr0,firewall=1
numa: 1
onboot: 1
ostype: win11
sata0: LANCache:vm-105-disk-0,size=100G
scsihw: virtio-scsi-pci
smbios1: uuid=d6773b8c-5995-49a1-8c96-5423bf9a15ca
sockets: 1
startup: up=30
tags: gaming
tpmstate0: local-lvm:vm-105-disk-1,size=4M,version=v2.0
vga: none
vmgenid: f360b48a-1fdd-4333-893d-4274228ab6a9
vmstatestorage: LANCache
All of the above output was captured while the NVMe was attached to VM 105, which was running.
EDIT:
Further testing. Mounted the drive in Proxmox host using:
mount /dev/nvme0n1p3 /mnt/temp/
and this also comes out as read-only. So the problem may be at a lower level than just the passthrough. Hmm.
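A quick way to see whether the kernel itself has flagged the disk read-only is the `ro` attribute under `/sys/block`. The sketch below is an assumption-laden helper (`block_ro_state` is a made-up name, and the second argument exists only so the sysfs root can be swapped out). A failing drive may still show read-write here and simply reject writes at the device level, so treat it as one data point alongside smartctl.

```shell
# Hypothetical helper: report whether the kernel marks a block device
# read-only, based on /sys/block/<dev>/ro (1 = read-only, 0 = read-write).
# Usage: block_ro_state nvme0n1 [sysfs_root]
block_ro_state() {
    dev=$1
    sysfs=${2:-/sys}
    ro_file="$sysfs/block/$dev/ro"
    if [ ! -r "$ro_file" ]; then
        echo "unknown"
        return 1
    fi
    if [ "$(cat "$ro_file")" = "1" ]; then
        echo "read-only"
    else
        echo "read-write"
    fi
}

# Example: block_ro_state nvme0n1
```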
Checked the temperature and the NVMe is sitting at a comfortable 50 °C ± 10, so that isn't the issue.