Unstable Proxmox Installation on DELL PowerEdge R420

felix.hix

Member
Nov 11, 2020
Hi all,

I wanted to move my virtual machines from a DELL R220 II running Proxmox 4.4.1 (which was really stable over the years) to a more powerful DELL PowerEdge R420 with the newest Proxmox release (6.2-1), but I'm facing a quite unstable system (which is really annoying).

Don't be confused by the timestamps - I did all of this at the end of the year and didn't find time until now to prepare this posting.

The hardware is the following:
  • 2x Intel Xeon E5-2430 (2 sockets)
  • 12x 8GB DDR3-1333 ECC (HMT31GR7EFR4A-H9) - all slots populated
  • DELL PERC H710 (IT mode)
  • ZFS pool1 (ZFS mirror / RAID1): 2x Intel SC2KB480G8 480GB enterprise SSD (bought new 2 weeks ago) <- the Proxmox base system is installed here (local-zfs) - see the pool/SMART check below
  • ZFS pool2 (also mirrored, currently not in use): 2x Toshiba AL13SXB3 CLAR300 0B08 300GB SAS 2.5" (SMART ok)
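
For reference, this is roughly how I check the pools and the disks behind them (the device name is only an example and has to be adjusted to the actual disks):

Code:
# show all pools, their mirror layout and any read/write/checksum errors
zpool status -v
# SMART health of one of the SSDs (replace /dev/sda with the actual device)
smartctl -a /dev/sda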

The machine shows the following symptoms:
- Sometimes random freezes and reboots (see the kernel log check below)
- "Permission denied / invalid PVE Ticket (401)" errors
- The web UI often hangs
- Backup jobs crash
- Random data corruption in virtual machines (e.g. during Windows Update or in other situations)
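
Because of the random freezes I also want to grep the kernel log for machine check or other hardware error reports; a rough sketch of what I have in mind:

Code:
# look for machine check exceptions and hardware error reports in the current boot's kernel log
journalctl -k | grep -iE 'mce|machine check|hardware error'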

The BIOS is the newest available from DELL, v2.9.0 [01/09/2020] (the CPU microcode is also the latest: 0x718).
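
For anyone who wants to double-check this on their own box: the running microcode revision can be read from /proc/cpuinfo or from the boot log (a quick sketch, nothing DELL-specific):

Code:
# microcode revision the CPUs are currently running
grep -m1 microcode /proc/cpuinfo
# what the kernel reported about microcode at boot
journalctl -k | grep -i microcode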

I checked the RAM for faulty sticks with memtest86 (ran the full test with all 14 stages twice - took about one week :D) but no issues were found.
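
Since the RAM is ECC, I also want to keep an eye on the corrected/uncorrected error counters the kernel exposes via EDAC. Something like this should work, assuming the sb_edac driver for these Xeons is loaded (otherwise the sysfs directory stays empty):

Code:
# load the EDAC driver for Sandy Bridge-EP/EN Xeons (usually happens automatically)
modprobe sb_edac
# corrected (ce) and uncorrected (ue) memory error counters per memory controller
grep . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count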

I also tried updating to the latest PVE, see below (this was at the end of last year):

Code:
root@pve:~# pveversion -v
proxmox-ve: 6.2-2 (running kernel: 5.4.73-1-pve)
pve-manager: 6.2-15 (running version: 6.2-15/48bd51b6)
pve-kernel-5.4: 6.3-1
pve-kernel-helper: 6.3-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-4
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-10
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.1-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.3-10
pve-cluster: 6.2-1
pve-container: 3.2-3
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-6
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-20
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

Below are some errors I found in the journal and syslog:
Code:
root@pve:~# grep -i -e fail -e error -e corrupt /var/log/syslog
....
Nov 25 18:01:01 pve kernel: [143945.876806] traps: server[25251] general protection fault ip:7f1ed85df3e0 sp:7f1ecfffb9c0 error:0 in libc-2.28.so[7f1ed8594000+148000]
Nov 25 18:01:01 pve pvesr[27106]: ipcc_send_rec[1] failed: Transport endpoint is not connected
Nov 25 18:01:01 pve pvesr[27106]: ipcc_send_rec[2] failed: Connection refused
Nov 25 18:01:01 pve systemd[1]: pve-cluster.service: Failed with result 'signal'.
Nov 25 18:01:01 pve pvesr[27106]: ipcc_send_rec[3] failed: Connection refused
Nov 25 18:01:01 pve systemd[1]: pvesr.service: Failed with result 'exit-code'.
Nov 25 18:01:01 pve systemd[1]: Failed to start Proxmox VE replication runner.
Nov 25 18:03:34 pve kernel: [144098.794397] #PF: error_code(0x0000) - not-present page
Nov 25 18:05:34 pve kernel: [144219.029716] #PF: error_code(0x0000) - not-present page
...
Nov 25 20:06:43 pve kernel: [151487.894399] #PF: error_code(0x0000) - not-present page
Nov 25 21:38:21 pve qm[48292]: VM 105 qmp command failed - VM 105 not running
Nov 25 21:38:21 pve pvedaemon[48290]: Failed to run vncproxy.
Nov 25 21:38:21 pve pvedaemon[2637]: <root@pam> end task UPID:pve:0000BCA2:00EF89EE:5FBEC0BC:vncproxy:105:root@pam: Failed to run vncproxy.
Nov 25 21:38:25 pve kernel: [156990.405493] #PF: error_code(0x0000) - not-present page
...
Nov 25 21:39:59 pve systemd[1]: lxc.service: Control process exited, code=exited, status=1/FAILURE
Nov 25 21:39:59 pve systemd[1]: lxc.service: Failed with result 'exit-code'.
Nov 25 21:39:59 pve fusermount[2649]: /bin/fusermount: failed to unmount /var/lib/lxcfs: Invalid argument
Nov 25 21:43:58 pve kernel: [    1.492203] ERST: Error Record Serialization Table (ERST) support is initialized.
Nov 25 21:43:58 pve kernel: [    1.729252] RAS: Correctable Errors collector initialized.
Nov 25 21:43:58 pve kernel: [   17.727915] ACPI Error: No handler for Region [SYSI] ((____ptrval____)) [IPMI] (20190816/evregion-132)
Nov 25 21:43:58 pve kernel: [   17.738989] ACPI Error: Region IPMI (ID=7) has no handler (20190816/exfldio-265)
Nov 25 21:43:58 pve kernel: [   17.767064] ACPI Error: Aborting method \_SB.PMI0._GHL due to previous error (AE_NOT_EXIST) (20190816/psparse-531)
Nov 25 21:43:58 pve kernel: [   17.767833] ACPI Error: Aborting method \_SB.PMI0._PMC due to previous error (AE_NOT_EXIST) (20190816/psparse-531)
Nov 25 21:43:58 pve kernel: [   17.790545] ACPI Error: AE_NOT_EXIST, Evaluating _PMC (20190816/power_meter-743)


-- Logs begin at Wed 2020-11-25 21:43:55 CET, end at Wed 2020-11-25 22:37:02 CET. --
Nov 25 21:43:55 pve kernel: ACPI: SPCR: Unexpected SPCR Access Width.  Defaulting to byte size
Nov 25 21:43:56 pve kernel: mpt2sas_cm0: overriding NVDATA EEDPTagMode setting
Nov 25 21:43:56 pve kernel: ACPI Error: No handler for Region [SYSI] ((____ptrval____)) [IPMI] (20190816/evregion-132)
Nov 25 21:43:56 pve kernel: ACPI Error: Region IPMI (ID=7) has no handler (20190816/exfldio-265)
Nov 25 21:43:56 pve kernel: ACPI Error: Aborting method \_SB.PMI0._GHL due to previous error (AE_NOT_EXIST) (20190816/psparse-531)
Nov 25 21:43:56 pve kernel: ACPI Error: Aborting method \_SB.PMI0._PMC due to previous error (AE_NOT_EXIST) (20190816/psparse-531)
Nov 25 21:43:56 pve kernel: ACPI Error: AE_NOT_EXIST, Evaluating _PMC (20190816/power_meter-743)
Nov 25 21:44:00 pve iscsid[2289]: iSCSI daemon with pid=2290 started!
Nov 25 21:49:06 pve QEMU[8603]: kvm: terminating on signal 15 from pid 2205 (/usr/sbin/qmeventd)
Nov 25 21:49:09 pve qm[14164]: VM 104 qmp command failed - VM 104 not running
Nov 25 21:49:09 pve pvedaemon[14162]: Failed to run vncproxy.
Nov 25 21:49:09 pve pvedaemon[2585]: <root@pam> end task UPID:pve:00003752:000080C7:5FBEC344:vncproxy:104:root@pam: Failed to run vncproxy.
Nov 25 21:52:54 pve QEMU[17614]: kvm: terminating on signal 15 from pid 2205 (/usr/sbin/qmeventd)
Nov 25 22:01:59 pve kernel: BUG: unable to handle page fault for address: 0000000192f66b3c
Nov 25 22:01:59 pve kernel: #PF: supervisor read access in kernel mode
Nov 25 22:01:59 pve kernel: #PF: error_code(0x0000) - not-present page
Nov 25 22:27:01 pve kernel: BUG: unable to handle page fault for address: 0000000192f66b3c
Nov 25 22:27:01 pve kernel: #PF: supervisor read access in kernel mode
Nov 25 22:27:01 pve kernel: #PF: error_code(0x0000) - not-present page
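
Since the box sometimes resets on its own, I also want to make sure the last kernel messages before a crash are kept. If I understand it correctly, enabling a persistent journal makes the previous boot readable afterwards - roughly like this (a sketch, not yet verified on this host):

Code:
# make the systemd journal persistent across reboots
mkdir -p /var/log/journal
systemctl restart systemd-journald
# after the next crash/reboot: list boots and show warnings/errors from the previous boot's kernel log
journalctl --list-boots
journalctl -k -b -1 -p warning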
Do you have any ideas why the host is so unstable? Maybe trying an older kernel or something? (I am really getting desperate.)

Thanks in advance
Cheers
 
The general protection fault and the ACPI errors might point to a hardware issue - things I would try:
* upgrade to the latest available BIOS/firmware for all components in the system (a quick way to verify the installed versions is sketched below)
* run memtest extensively
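
To verify which BIOS and BMC/iDRAC firmware is actually installed, something like the following should work (just a sketch - ipmitool may need to be installed and the IPMI kernel modules loaded first):

Code:
# BIOS version and release date as reported by the SMBIOS tables
dmidecode -s bios-version
dmidecode -s bios-release-date
# firmware revision of the BMC (iDRAC on DELL machines)
ipmitool mc info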

I hope this helps!
 
