Hi all
I was just updating my homelab/homeserver (HP ProLiant MicroServer Gen8) with a fair number of updates, as I hadn’t gotten around to updating it for a while (bad me, I know). I’ve been running it for a while now (since roughly 6.x, I think) and have been loving learning and using Proxmox.
I noticed something didn’t seem to update properly while doing that, but I stupidly assumed it would sort itself out after a reboot... Now the web GUI is not accessible and the server doesn’t even seem to request an IP address or register with my router. I get these two errors in the console at boot:
Code:
DMAR: DRHD: handling fault status reg 2
DMAR: [INTR-REMAP] Request device [01:00.0] fault index 15 [fault reason 38] Blocked an interrupt request due to source-id verification failure
along with a few disk errors and a failure to import my ZFS pool.
The kernel panics and the server doesn’t get past the first couple of boot steps when I try 5.15.108-1-pve or 5.13.19-6-pve. It does boot with 5.11.22-4-pve and 5.13.19-3-pve, but in both cases it is incredibly slow in iLO (even typing), shows the same DMAR errors and has no network. I assume I was running a 5.11.x kernel before this, as it’s the oldest one listed in GRUB, but I’m not sure.
Some time ago I made some attempts at PCIe passthrough, but I (now!) know there’s an issue with IOMMU and RMRR on the Gen8 MicroServer. When I first tried to enable it, it almost brought the server down completely, so I disabled it in GRUB and it hasn’t caused a problem since.
I’ve been troubleshooting and searching the Proxmox forums a lot, and have come up with 4 possible issues so far:
- Network interface not getting an IP (it still doesn’t when I try to bring the interface up)
- The apt/dpkg packages which failed to install during the updates
- Full system drive?
- IOMMU/RMRR issue
Looking at the apt/dpkg issue, I've grabbed these three (pveversion shows that the vast majority of pve packages are now "not correctly installed"). I've attached them: OCRed text from the screenshots for pveversion and history.log, and a long screenshot for term.log.
Code:
/var/log/apt/history.log
/var/log/apt/term.log
pveversion -v
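To pull just the affected package names out of the OCRed pveversion -v text, a minimal sketch like this could help. It assumes the text was saved to a file (the name pveversion.txt in the usage comment is hypothetical) and that each broken entry reads "<package>: not correctly installed"; OCR mistakes may still need a manual check.

```python
import re
import sys

# Assumption: broken entries in `pveversion -v` output look like
#   "pve-manager: not correctly installed (running version: ...)"
BROKEN = re.compile(r"^\s*([A-Za-z0-9.+-]+):\s*not correctly installed")


def broken_packages(text: str) -> list[str]:
    """Return the package names reported as 'not correctly installed'."""
    names = []
    for line in text.splitlines():
        match = BROKEN.match(line)
        if match:
            names.append(match.group(1))
    return names


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical filename): python3 broken_pkgs.py pveversion.txt
    with open(sys.argv[1], encoding="utf-8") as handle:
        for name in broken_packages(handle.read()):
            print(name)
```

A shorter list like this is easier to compare against the packages named in history.log and term.log.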
I've also looked at the following, but I'm not sure whether they're useful, so I haven't attached them yet:
General info (from a recent boot, now that the server doesn't actually boot):
Code:
dmesg
Checking whether it's a disk issue or the disks being full (they aren't full):
Code:
lsblk
df -h
In relation to the networking (I’ve tried to bring the interfaces up, but it never gets an IP, and my vmbr0 seems to have vanished):
Code:
ip addr
/etc/network/interfaces
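To tell "vmbr0 is missing from the config" apart from "vmbr0 is configured but never comes up", a minimal sketch could list the bridge stanzas that /etc/network/interfaces actually defines. It assumes the usual ifupdown/ifupdown2 syntax (iface stanzas, bridge-ports or the older bridge_ports option) and only reads the file; the addresses in the test sample are hypothetical.

```python
import os

# Keywords that start a new top-level stanza in /etc/network/interfaces;
# any of them ends the iface block currently being read.
STANZA_KEYWORDS = {"iface", "auto", "allow-hotplug", "source", "source-directory", "mapping"}


def bridge_ports(text: str) -> dict[str, str]:
    """Return {bridge: ports} for every 'iface vmbr...' stanza in the file."""
    bridges: dict[str, str] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # comments must be on their own line
            continue
        words = line.split()
        if words[0] in STANZA_KEYWORDS:
            current = None
            if words[0] == "iface" and len(words) > 1 and words[1].startswith("vmbr"):
                current = words[1]
                bridges[current] = ""  # bridge declared, ports not seen yet
        elif current and words[0] in ("bridge-ports", "bridge_ports"):
            bridges[current] = " ".join(words[1:])
    return bridges


if __name__ == "__main__" and os.path.exists("/etc/network/interfaces"):
    with open("/etc/network/interfaces", encoding="utf-8") as handle:
        print(bridge_ports(handle.read()) or "no vmbr stanza found")
```

If vmbr0 shows up here but not in ip addr, the config itself seems to have survived and the failure is more likely in bringing the bridge up than in the file.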
Thinking it might be an IOMMU/RMRR issue:
Code:
/etc/default/grub
cat /proc/cmdline
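Since IOMMU was disabled via GRUB, one quick check is whether that setting is actually on the command line of whichever kernel booted. This is a minimal sketch that assumes the relevant parameters are intel_iommu, iommu and intremap (standard Linux boot parameters; the DMAR message mentions INTR-REMAP, but which of these matters on the Gen8 is not confirmed here).

```python
import os

# IOMMU / interrupt-remapping related Linux boot parameters to look for.
IOMMU_KEYS = ("intel_iommu", "iommu", "intremap")


def iommu_params(cmdline: str) -> dict[str, str]:
    """Map each IOMMU-related parameter on a kernel command line to its value."""
    found = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key in IOMMU_KEYS:
            found[key] = value  # keep the last occurrence if a key is repeated
    return found


def read_running_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the currently running kernel was booted with."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().strip()


if __name__ == "__main__" and os.path.exists("/proc/cmdline"):
    print(iommu_params(read_running_cmdline()))
```

If intel_iommu=off is in /etc/default/grub but missing here, one possible explanation is that the bootloader config was not regenerated (update-grub on a GRUB-booted install).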
I do have backups of some of the data, but as is often the case (yes, I have learned my lesson), they are not as up to date as I would like.
Annoyingly, I have also been tinkering with a small cluster on another set of devices, as part of learning about what I was intending to replace the Gen8 MicroServer with. If I’d just finished that off and migrated over first, I would have avoided this mess: I could have updated one node, seen it was a problem and not worried... *sigh*
I’d love to get this back up and running, if only to rescue my data, which is on a ZFS pool passed through as a disk to an OMV VM. Any help is very much appreciated: I’m a little broken after digging around so much, and I don’t want to break things further by running commands in a situation that feels quite delicate!
Thanks