Oh dear, oh dear, oh dear - Assistance required for noob

SpongeRob

I've purchased an old HP ML330 G6 for my father; I have exactly the same server myself. I assumed it would be easy to set up, since my home server is the same (or so I thought.. famous last words!). Unfortunately the reality is completely different, and I'm afraid I don't have enough knowledge to understand the problems I'm facing. I also have to head home tomorrow (25th), so I will be helping my father remotely from then on, and anything that can be done today will be greatly appreciated. My thanks in advance.

I've already had to swap the motherboard in my father's server because of a random problem that kept occurring at different intervals and put all the fans into fail-safe mode (i.e. full speed). I spent a few days hardware troubleshooting, getting down to one stick of RAM, no expansion cards and one SATA DVD drive, booting into Memtest from a boot CD. Although I never established the exact cause, the motherboard swap seemed to rectify the issue, so the server is now at least stable enough for me to install Proxmox.

I had purchased (but removed while hardware troubleshooting) a Mellanox ConnectX-2 10GbE dual-port NIC. One of the error messages I was getting previously was an 'NMI PCI Error' (NMI = Non-Maskable Interrupt, PCI = Peripheral Component Interconnect): 'NMI: PCI system error (SERR) for reason a1 on CPU 0, Kernel is Dazed and Confused'. That is why I removed the dual NIC, and the server was behaving well (..until today). The HBA card (which I feel may be the culprit) is an LSI 9200-8i - MPT2BIOS 7-29.01.00, which is a different card to my own, so our systems aren't exactly the same. Maybe I'm just unlucky in terms of hardware, but I'd like to find out, just to be certain.
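
A quick way to see whether the NMI/SERR message (or other hardware-level complaints) is still being logged is to filter the kernel log for it. The sketch below is only a starting point: the function name `filter_hw_errors` and the keyword list are illustrative choices, not something Proxmox provides.

```shell
# Keep only kernel log lines that usually point at hardware trouble.
# The keyword list is an assumption; extend it to match your own logs.
# mpt2sas/mpt3sas are the kernel driver names used for LSI SAS2 HBAs.
filter_hw_errors() {
    grep -i -E 'NMI|SERR|Machine Check|Hardware Error|mpt2sas|mpt3sas'
}
```

It reads from standard input, so it can be fed from the running kernel log (`dmesg | filter_hw_errors`) or from the kernel messages of the previous boot (`journalctl -k -b -1 --no-pager | filter_hw_errors`). An empty result only means that none of the keywords were logged, not that the hardware is healthy.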

My home server is running an 80GB Linux Mint VM with a second media drive defined in /etc/fstab. For my father's server, I initially used the same 80GB Mint partition, then expanded the disk in Proxmox before expanding it inside the VM with fdisk. I obviously did something wrong: although the third (ext4) partition was correctly expanded, the system still reported little free space. So I decided to cut my losses and simply reinstalled the Mint VM.
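
One possible explanation, though it is only a guess, is that fdisk enlarged the partition while the ext4 filesystem inside it kept its old size; the filesystem has to be grown separately with `resize2fs`. A minimal sketch, to be run as root inside the VM. The function name and the example device `/dev/sda3` are hypothetical, so check the real device name with `lsblk` first.

```shell
# Grow an ext4 filesystem to fill its (already enlarged) partition,
# then show the resulting free space. Run as root inside the VM.
grow_ext4_to_partition() {
    part=$1              # partition holding the filesystem, e.g. /dev/sda3 (hypothetical)
    mountpoint=${2:-/}   # where it is mounted; defaults to the root filesystem
    # With no size argument, resize2fs expands the filesystem to the partition size.
    resize2fs "$part" || return 1
    df -h "$mountpoint"
}
```

For example, `grow_ext4_to_partition /dev/sda3 /` would grow the root filesystem and print the new `df -h` figures for comparison with the old ones.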

Interestingly enough, I created one large 7108GB partition for the VM, then let Mint do its recommended HDD install with default options, excluding LVM. After the reboot I was getting the error 'attempt to read or write outside of disk 'hd0'' and thought maybe I'd tried to use too much HDD space in my VM, so I reinstalled it using 7TB and got exactly the same error again. A little more research suggested that the drive size is too big, and maybe I should have copied my own install and kept the 80GB partition + data drive. However, I stumbled upon a quick fix: changing the BIOS option for the VM to UEFI, adding an EFI disk to the VM and reinstalling the GRUB boot loader. After that Mint was running correctly (possibly..)
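
For reference, the UEFI switch described above can also be made from the Proxmox host shell with `qm set`. This is a sketch only: the function name and the VM ID `100` are hypothetical, `ML330-ZFS` is the ZFS storage named in the storage.cfg in this post, and the GRUB reinstall inside the guest still has to be done separately.

```shell
# Switch a VM from SeaBIOS to OVMF (UEFI) and give it an EFI vars disk.
# Run on the Proxmox host as root while the VM is shut down.
switch_vm_to_uefi() {
    vmid=$1      # numeric VM ID, e.g. 100 (hypothetical)
    storage=$2   # storage that will hold the small EFI disk, e.g. ML330-ZFS
    qm set "$vmid" --bios ovmf || return 1
    # "<storage>:1" asks Proxmox to allocate a new volume on that storage.
    qm set "$vmid" --efidisk0 "${storage}:1,efitype=4m"
}
```

Called as `switch_vm_to_uefi 100 ML330-ZFS`, this should leave the VM with the same two settings that the GUI route produces (BIOS set to OVMF plus an EFI disk).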

Now I seem to be getting random Proxmox disconnects, and the server restarts itself randomly. I've tried to diagnose this, but if I'm honest I don't really know what I'm doing, hence my cry for help! I have tried the following:

(i) Disabling C-States in the HP BIOS.
(ii) Adding intel_iommu=off into boot options, updating Grub, restarting.
(iii) Doing some live error discovery using journalctl -p err -f

Errors include:

ACPI: SPCR: [Firmware Bug] : Unexpected SPCR Access Width - Ignored; I don't think this is relevant, as it happens on my own server and doesn't seem to impact anything.
ERST: Failed to get Error Log Address Range - Tried adding acpi=off to the boot options, but the system then fails to boot.
Handle_request_update: Could not read RRD file. - Tried wiping and restarting the RRD cache (rm -r /var/lib/rrdcached/db & systemctl restart rrdcached.service)
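
Because `journalctl -f` only shows messages while the session stays connected, the last lines written before a crash are easier to find by reading the previous boot's journal afterwards. A minimal sketch, assuming the journal is persistent (i.e. `/var/log/journal` exists); the function name is illustrative.

```shell
# Print the final lines of the journal from the boot before the current one,
# i.e. whatever was logged just before the last restart.
previous_boot_tail() {
    lines=${1:-50}   # how many trailing lines to show
    # -b -1 selects the previous boot; --no-pager prints straight to the terminal
    journalctl -b -1 -n "$lines" --no-pager
}
```

`journalctl --list-boots` shows which boots are recorded. If the previous boot's log simply stops with no error before the restart, that tends to point towards a hardware or power problem rather than software, although it is not proof.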

However, that's me all out of options. Like I say, I don't really know what I'm doing (outside of Googling for solutions to whatever errors I see), so I'm hoping somebody will kindly take me under their wing and guide me, please.

To recap, my exact problem is that Proxmox is restarting on its own; sometimes I hear a tiny 'squeak' out of the HDDs just as it's crashing..

Proxmox kernel version is 6.8.4-2

Other information can be found here:

Code:
pveversion -v
--------------
proxmox-ve: 8.2.0 (running kernel: 6.8.4-2-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.4-2
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.1
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.0-1
proxmox-backup-file-restore: 3.2.0-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.1
pve-cluster: 8.0.6
pve-container: 5.0.10
pve-docs: 8.2.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.5
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-5
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2


cat /etc/pve/storage.cfg
-----------------------------
dir: local
        path /var/lib/vz
        content backup,vztmpl
        shared 0

lvmthin: local-lvm
        thinpool data
        vgname pve
        content images,rootdir

zfspool: ML330-ZFS
        pool ML330-ZFS
        content images,rootdir
        mountpoint /ML330-ZFS
        nodes ML330

cifs: ProxBackupi5
        path /mnt/pve/ProxBackupi5
        server 192.168.100.51
        share PROXMOX
        content iso,backup,images
        prune-backups keep-all=1
        username Derek


lsblk
------
NAME               MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda                  8:0    0 37.3G  0 disk
├─sda1               8:1    0 1007K  0 part
├─sda2               8:2    0  512M  0 part
└─sda3               8:3    0 36.8G  0 part
  ├─pve-swap       252:0    0  4.5G  0 lvm  [SWAP]
  ├─pve-root       252:1    0 16.1G  0 lvm  /
  ├─pve-data_tmeta 252:2    0    1G  0 lvm
  │ └─pve-data     252:4    0  9.6G  0 lvm
  └─pve-data_tdata 252:3    0  9.6G  0 lvm
    └─pve-data     252:4    0  9.6G  0 lvm
sdb                  8:16   0  2.7T  0 disk
├─sdb1               8:17   0  2.7T  0 part
└─sdb9               8:25   0    8M  0 part
sdc                  8:32   0  2.7T  0 disk
├─sdc1               8:33   0  2.7T  0 part
└─sdc9               8:41   0    8M  0 part
sdd                  8:48   0  2.7T  0 disk
├─sdd1               8:49   0  2.7T  0 part
└─sdd9               8:57   0    8M  0 part
sde                  8:64   0  2.7T  0 disk
├─sde1               8:65   0  2.7T  0 part
└─sde9               8:73   0    8M  0 part
sr0                 11:0    1 1024M  0 rom
zd0                230:0    0  6.8T  0 disk
├─zd0p1            230:1    0    1M  0 part
├─zd0p2            230:2    0  513M  0 part
└─zd0p3            230:3    0  6.8T  0 part
zd16               230:16   0    1M  0 disk
zd32               230:32   0    5G  0 disk
└─zd32p1           230:33   0    5G  0 part
zd48               230:48   0   12G  0 disk
├─zd48p1           230:49   0    1M  0 part
├─zd48p2           230:50   0  1.8G  0 part
└─zd48p3           230:51   0 10.2G  0 part
 
Boot up with a live Linux CD and let it idle for a bit, maybe open some things that use system resources, like a bunch of browser tabs, and see if you get the same problem. It sounds like it could be a hardware issue, and this is one way to confirm it or rule it out.

If you're actually hearing weird noises from the HDDs, it could also be a drive that is about to fail. In that case the live CD test might not show any failure, because it doesn't touch the hard drives. Check the SMART data to see if there are any errors.

Ex:

Code:
smartctl -a /dev/sda
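
Going by the lsblk output in the first post, `/dev/sda` is the small boot disk, while the four 2.7T disks are `/dev/sdb` to `/dev/sde`, so it is worth checking all of them. A minimal sketch that prints just the overall health verdict per disk; the function name is illustrative, and it needs root plus the smartmontools package.

```shell
# Print the SMART overall-health verdict for each disk passed as an argument.
smart_health_summary() {
    for dev in "$@"; do
        # -H asks only for the overall health self-assessment.
        # SATA drives report "overall-health ... PASSED"; SAS drives report
        # "SMART Health Status: OK", so both spellings are matched.
        verdict=$(smartctl -H "$dev" 2>/dev/null \
            | grep -E 'overall-health|SMART Health Status' || true)
        printf '%s: %s\n' "$dev" "${verdict:-no SMART verdict returned}"
    done
}
```

For example: `smart_health_summary /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde`. A PASSED verdict is only a coarse check, so for any disk that looks suspicious the full `smartctl -a` output (reallocated and pending sector counts in particular) is still worth reading.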
 
