I've purchased an old HP ML330 G6 for my father; I have exactly the same server myself. I assumed it would be easy to set up, since my home server is the same model (famous last words!). Unfortunately the reality is completely different, and I'm afraid I don't have enough knowledge to understand the problems I'm facing. I also have to head home tomorrow (25th), so I will be helping my father remotely from then on, and anything that can be done today would be greatly appreciated. My thanks in advance.
I've already had to swap the motherboard in my father's server because of a random problem that kept occurring at irregular intervals and put all the fans into fail-safe mode (i.e. full speed). I spent a few days hardware troubleshooting and got down to one stick of RAM, no expansion cards and one SATA DVD drive, booting into Memtest from a boot CD. Although I never established the exact cause, the motherboard swap seemed to rectify the issue, so the server is now at least stable enough for me to install Proxmox.
I had purchased a Mellanox ConnectX-2 dual-port 10GbE NIC, but removed it while hardware troubleshooting. One of the error messages I was getting previously was an 'NMI PCI Error', i.e. a Non-Maskable Interrupt (NMI) or Peripheral Component Interconnect (PCI) error: 'NMI: PCI system error (SERR) for reason a1 on CPU 0, Kernel is Dazed and Confused'. That is why I removed the dual NIC, and the server was behaving well (..until today). The HBA card, which I feel may be the culprit, is an LSI 9200-8i - MPT2BIOS 7-29.01.00. It is a different card to my own, so our systems aren't exactly the same. Maybe I'm just unlucky in terms of hardware, but I'd like to find out, just to be certain.
My home server is running an 80GB Linux Mint VM with a second media drive defined in /etc/fstab. For my father's server I initially used the same 80GB Mint partition, then expanded the disk in Proxmox before expanding the partition with fdisk. I obviously did something wrong, because although the third (EXT4) partition was correctly expanded, the system still reported very little free space. So I decided to cut my losses and reinstalled the Mint VM.
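Looking back, one possibility (an assumption on my part, not something I confirmed at the time) is that growing the partition is not enough on its own: an ext4 filesystem keeps its old size until the filesystem itself is resized. A minimal sketch of the sequence, assuming the root filesystem is ext4 on /dev/sda3 inside the VM (hypothetical device names) and that growpart from cloud-guest-utils is installed. It only prints the commands unless RUN=1 is set:

```shell
#!/bin/sh
# Hypothetical device names; check with lsblk inside the VM first.
DISK=/dev/sda
PART=3

# Print each command instead of running it, unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

run growpart "$DISK" "$PART"      # grow partition 3 to fill the enlarged disk
run resize2fs "${DISK}${PART}"    # grow the ext4 filesystem to fill the partition
run df -h /                       # check the free space the system now reports
```

Running it as root with RUN=1 would actually apply the changes; without it the script is just a checklist.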
Interestingly enough, I created one large 7108GB partition for the VM, then let Mint do its recommended HDD install with the default options, excluding LVM. After the reboot I got the error 'attempt to read or write outside of disk 'hd0''. I thought maybe I'd tried to use too much HDD space in my VM, so I reinstalled it using 7TB and got exactly the same error again. After a little more research it appears that the drive size is too big, and maybe I should have copied my own install and kept the 80GB partition + data drive. However, I stumbled upon a quick fix: changing the BIOS option for the VM to UEFI, adding an EFI drive to the VM and reinstalling the GRUB boot loader. After that Mint was running correctly (possibly..).
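One commonly cited explanation for a "too big" boot disk under legacy BIOS is a 32-bit sector address limit, which with 512-byte sectors works out to 2 TiB. I have not confirmed that this is what GRUB was hitting here, so treat the following as a sanity check of the numbers only. The sector size, the 32-bit limit and reading the 7108GB figure as GiB are all assumptions:

```shell
#!/bin/sh
# Assumptions: 512-byte logical sectors, 32-bit sector addressing.
sector_bytes=512
max_sectors=4294967296                      # 2^32 addressable sectors
limit_bytes=$((sector_bytes * max_sectors))
limit_gib=$((limit_bytes / 1024 / 1024 / 1024))

disk_gib=7108                               # size given to the VM disk

echo "32-bit LBA limit: ${limit_gib} GiB"
if [ "$disk_gib" -gt "$limit_gib" ]; then
    echo "disk (${disk_gib} GiB) extends past the limit"
else
    echo "disk (${disk_gib} GiB) fits within the limit"
fi
```

If that is the mechanism, it would also fit with the 80GB install on my own server never showing the error.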
Now I seem to be getting random Proxmox disconnects, and the server restarts itself at random. Frustrated, I've tried to diagnose that, but if I'm honest I don't really know what I'm doing, hence my cry for help! I have tried the following:
(i) Disabling C-States in the HP BIOS.
(ii) Adding intel_iommu=off to the boot options, updating GRUB, restarting.
(iii) Doing some live error discovery using journalctl -p err -f
Errors include:
ACPI: SPCR: [Firmware Bug] : Unexpected SPCR Access Width - Ignored; I don't think this is relevant, as it happens on my own server and doesn't seem to impact anything.
ERST: Failed to get Error Log Address Range - Tried adding acpi=off to the boot options, but then the system fails to boot.
Handle_request_update: Could not read RRD file. - Tried wiping and restarting the RRD cache (rm -r /var/lib/rrdcached/db & systemctl restart rrdcached.service)
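Because journalctl -f only follows the running system, whatever was logged just before a crash would be in the previous boot's journal rather than the live view. A minimal sketch of how that could be pulled, assuming the journal on this host is persistent (an assumption; if it is not, only the current boot will be listed). The variable names are just for illustration:

```shell
#!/bin/sh
PREV_BOOT=-1      # -1 = the boot before the current one
TAIL_LINES=50     # how many trailing lines to show

if command -v journalctl >/dev/null 2>&1; then
    # List the boots the journal knows about.
    journalctl --list-boots --no-pager || true
    # Show the last warnings and errors logged before the previous boot ended.
    journalctl -b "$PREV_BOOT" -p warning --no-pager | tail -n "$TAIL_LINES" || true
else
    echo "journalctl not found"
fi
```

The priority is set to warning rather than err so that messages leading up to a reset are less likely to be filtered out.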
However, that's me all out of options. Like I say, I don't really know what I'm doing (outside of Googling for solutions to whatever errors I see), so I'm hoping somebody will kindly take me under their wing and guide me, please.
To recap, my exact problem is that Proxmox is restarting on its own. Sometimes I hear a tiny 'squeak' out of the HDDs just as it's crashing.
Proxmox kernel version is 6.8.4-2
Other information can be found here:
Code:
pveversion -v
--------------
proxmox-ve: 8.2.0 (running kernel: 6.8.4-2-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.4-2
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.1
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.0-1
proxmox-backup-file-restore: 3.2.0-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.1
pve-cluster: 8.0.6
pve-container: 5.0.10
pve-docs: 8.2.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.5
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-5
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2
cat /etc/pve/storage.cfg
-----------------------------
dir: local
        path /var/lib/vz
        content backup,vztmpl
        shared 0

lvmthin: local-lvm
        thinpool data
        vgname pve
        content images,rootdir

zfspool: ML330-ZFS
        pool ML330-ZFS
        content images,rootdir
        mountpoint /ML330-ZFS
        nodes ML330

cifs: ProxBackupi5
        path /mnt/pve/ProxBackupi5
        server 192.168.100.51
        share PROXMOX
        content iso,backup,images
        prune-backups keep-all=1
        username Derek

lsblk
------
NAME               MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda                  8:0    0 37.3G  0 disk
├─sda1               8:1    0 1007K  0 part
├─sda2               8:2    0  512M  0 part
└─sda3               8:3    0 36.8G  0 part
  ├─pve-swap       252:0    0  4.5G  0 lvm  [SWAP]
  ├─pve-root       252:1    0 16.1G  0 lvm  /
  ├─pve-data_tmeta 252:2    0    1G  0 lvm
  │ └─pve-data     252:4    0  9.6G  0 lvm
  └─pve-data_tdata 252:3    0  9.6G  0 lvm
    └─pve-data     252:4    0  9.6G  0 lvm
sdb                  8:16   0  2.7T  0 disk
├─sdb1               8:17   0  2.7T  0 part
└─sdb9               8:25   0    8M  0 part
sdc                  8:32   0  2.7T  0 disk
├─sdc1               8:33   0  2.7T  0 part
└─sdc9               8:41   0    8M  0 part
sdd                  8:48   0  2.7T  0 disk
├─sdd1               8:49   0  2.7T  0 part
└─sdd9               8:57   0    8M  0 part
sde                  8:64   0  2.7T  0 disk
├─sde1               8:65   0  2.7T  0 part
└─sde9               8:73   0    8M  0 part
sr0                 11:0    1 1024M  0 rom
zd0                230:0    0  6.8T  0 disk
├─zd0p1            230:1    0    1M  0 part
├─zd0p2            230:2    0  513M  0 part
└─zd0p3            230:3    0  6.8T  0 part
zd16               230:16   0    1M  0 disk
zd32               230:32   0    5G  0 disk
└─zd32p1           230:33   0    5G  0 part
zd48               230:48   0   12G  0 disk
├─zd48p1           230:49   0    1M  0 part
├─zd48p2           230:50   0  1.8G  0 part
└─zd48p3           230:51   0 10.2G  0 part