Hi all,
I wanted to switch my virtual machines from a DELL R220 II with Proxmox 4.4.1 (which was really stable over the years) to a more powerful DELL PowerEdge R420 with the newest Proxmox release (6.2-1), but I'm facing a quite unstable system (which is really annoying).
Don't be confused by the timestamps - I did all of this at the end of the year and only now found the time to prepare this posting.
The hardware is the following:
- 2x Intel Xeon E5-2430 (2 sockets)
- 12x 8GB DDR3 1333 ECC (HMT31GR7EFR4A-H9) - all slots populated
- DELL PERC H710 (IT mode)
- ZFS pool 1 (mirrored): 2x Intel SC2KB480G8 480GB Enterprise SSD (bought new 2 weeks ago) <- the Proxmox base system is installed here (local-zfs)
- ZFS pool 2 (mirrored, currently not in use): 2x Toshiba AL13SXB3 CLAR300 0B08 300GB SAS 2.5" (SMART ok)
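For reference, the "SMART ok" and mirror notes above are based on checks roughly along these lines (device names are placeholders and need adjusting):
Code:
# Overall health of both ZFS mirrors
zpool status -v

# SMART health of the SSDs and SAS disks (replace /dev/sdX with the real devices)
smartctl -a /dev/sdX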
The machine shows the following symptoms:
- Occasional random freezes and reboots
- "Permission denied" / invalid PVE ticket (401) errors
- The web UI often hangs (quick checks for this are sketched right after this list)
- Backup jobs crash
- Random data corruption inside virtual machines (e.g. during Windows Update or in other situations)
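When the UI hangs or the 401 ticket errors appear, I assume the first things worth checking are the core PVE services and the cluster filesystem - a minimal sketch of what I have in mind (just the commands, not output from my box):
Code:
# State of the core PVE services (pmxcfs crashing would explain the 401 tickets)
systemctl status pve-cluster pvedaemon pveproxy pvestatd

# Recent pve-cluster / pvedaemon / pveproxy messages since the last boot
journalctl -b -u pve-cluster -u pvedaemon -u pveproxy

# A wrong system clock also invalidates PVE tickets
timedatectl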
The BIOS is the newest available from DELL, v2.9.0 [01/09/2020], and the CPU microcode is also the latest (0x718).
I checked the RAM for faulty sticks with memtest86 (ran the full test with all 14 stages twice, which took about one week), but no issues were found.
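Since memtest86 came back clean, I guess the next step on the hardware side is watching for machine-check / ECC events from within Linux - a rough sketch (rasdaemon is an extra Debian package, not installed by default):
Code:
# Microcode revision as seen by the running kernel
grep -m1 microcode /proc/cpuinfo
journalctl -k | grep -i microcode

# Any machine-check or EDAC (ECC) events logged since boot?
journalctl -k | grep -i -e mce -e edac -e "hardware error"

# Optional: collect correctable/uncorrectable error counters over time
apt install rasdaemon
systemctl enable --now rasdaemon
ras-mc-ctl --summary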
I also tried updating to the latest PVE, see below (this was at the end of last year):
Code:
root@pve:~# pveversion -v
proxmox-ve: 6.2-2 (running kernel: 5.4.73-1-pve)
pve-manager: 6.2-15 (running version: 6.2-15/48bd51b6)
pve-kernel-5.4: 6.3-1
pve-kernel-helper: 6.3-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-4
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-10
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.1-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.3-10
pve-cluster: 6.2-1
pve-container: 3.2-3
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-6
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-20
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
Below are some errors I found in the journal and syslog:
Code:
root@pve:~# grep -i -e fail -e error -e corrupt /var/log/syslog
....
Nov 25 18:01:01 pve kernel: [143945.876806] traps: server[25251] general protection fault ip:7f1ed85df3e0 sp:7f1ecfffb9c0 error:0 in libc-2.28.so[7f1ed8594000+148000]
Nov 25 18:01:01 pve pvesr[27106]: ipcc_send_rec[1] failed: Transport endpoint is not connected
Nov 25 18:01:01 pve pvesr[27106]: ipcc_send_rec[2] failed: Connection refused
Nov 25 18:01:01 pve systemd[1]: pve-cluster.service: Failed with result 'signal'.
Nov 25 18:01:01 pve pvesr[27106]: ipcc_send_rec[3] failed: Connection refused
Nov 25 18:01:01 pve systemd[1]: pvesr.service: Failed with result 'exit-code'.
Nov 25 18:01:01 pve systemd[1]: Failed to start Proxmox VE replication runner.
Nov 25 18:03:34 pve kernel: [144098.794397] #PF: error_code(0x0000) - not-present page
Nov 25 18:05:34 pve kernel: [144219.029716] #PF: error_code(0x0000) - not-present page
...
Nov 25 20:06:43 pve kernel: [151487.894399] #PF: error_code(0x0000) - not-present page
Nov 25 21:38:21 pve qm[48292]: VM 105 qmp command failed - VM 105 not running
Nov 25 21:38:21 pve pvedaemon[48290]: Failed to run vncproxy.
Nov 25 21:38:21 pve pvedaemon[2637]: <root@pam> end task UPID:pve:0000BCA2:00EF89EE:5FBEC0BC:vncproxy:105:root@pam: Failed to run vncproxy.
Nov 25 21:38:25 pve kernel: [156990.405493] #PF: error_code(0x0000) - not-present page
...
Nov 25 21:39:59 pve systemd[1]: lxc.service: Control process exited, code=exited, status=1/FAILURE
Nov 25 21:39:59 pve systemd[1]: lxc.service: Failed with result 'exit-code'.
Nov 25 21:39:59 pve fusermount[2649]: /bin/fusermount: failed to unmount /var/lib/lxcfs: Invalid argument
Nov 25 21:43:58 pve kernel: [ 1.492203] ERST: Error Record Serialization Table (ERST) support is initialized.
Nov 25 21:43:58 pve kernel: [ 1.729252] RAS: Correctable Errors collector initialized.
Nov 25 21:43:58 pve kernel: [ 17.727915] ACPI Error: No handler for Region [SYSI] ((____ptrval____)) [IPMI] (20190816/evregion-132)
Nov 25 21:43:58 pve kernel: [ 17.738989] ACPI Error: Region IPMI (ID=7) has no handler (20190816/exfldio-265)
Nov 25 21:43:58 pve kernel: [ 17.767064] ACPI Error: Aborting method \_SB.PMI0._GHL due to previous error (AE_NOT_EXIST) (20190816/psparse-531)
Nov 25 21:43:58 pve kernel: [ 17.767833] ACPI Error: Aborting method \_SB.PMI0._PMC due to previous error (AE_NOT_EXIST) (20190816/psparse-531)
Nov 25 21:43:58 pve kernel: [ 17.790545] ACPI Error: AE_NOT_EXIST, Evaluating _PMC (20190816/power_meter-743)
-- Logs begin at Wed 2020-11-25 21:43:55 CET, end at Wed 2020-11-25 22:37:02 CET. --
Nov 25 21:43:55 pve kernel: ACPI: SPCR: Unexpected SPCR Access Width. Defaulting to byte size
Nov 25 21:43:56 pve kernel: mpt2sas_cm0: overriding NVDATA EEDPTagMode setting
Nov 25 21:43:56 pve kernel: ACPI Error: No handler for Region [SYSI] ((____ptrval____)) [IPMI] (20190816/evregion-132)
Nov 25 21:43:56 pve kernel: ACPI Error: Region IPMI (ID=7) has no handler (20190816/exfldio-265)
Nov 25 21:43:56 pve kernel: ACPI Error: Aborting method \_SB.PMI0._GHL due to previous error (AE_NOT_EXIST) (20190816/psparse-531)
Nov 25 21:43:56 pve kernel: ACPI Error: Aborting method \_SB.PMI0._PMC due to previous error (AE_NOT_EXIST) (20190816/psparse-531)
Nov 25 21:43:56 pve kernel: ACPI Error: AE_NOT_EXIST, Evaluating _PMC (20190816/power_meter-743)
Nov 25 21:44:00 pve iscsid[2289]: iSCSI daemon with pid=2290 started!
Nov 25 21:49:06 pve QEMU[8603]: kvm: terminating on signal 15 from pid 2205 (/usr/sbin/qmeventd)
Nov 25 21:49:09 pve qm[14164]: VM 104 qmp command failed - VM 104 not running
Nov 25 21:49:09 pve pvedaemon[14162]: Failed to run vncproxy.
Nov 25 21:49:09 pve pvedaemon[2585]: <root@pam> end task UPID:pve:00003752:000080C7:5FBEC344:vncproxy:104:root@pam: Failed to run vncproxy.
Nov 25 21:52:54 pve QEMU[17614]: kvm: terminating on signal 15 from pid 2205 (/usr/sbin/qmeventd)
Nov 25 22:01:59 pve kernel: BUG: unable to handle page fault for address: 0000000192f66b3c
Nov 25 22:01:59 pve kernel: #PF: supervisor read access in kernel mode
Nov 25 22:01:59 pve kernel: #PF: error_code(0x0000) - not-present page
Nov 25 22:27:01 pve kernel: BUG: unable to handle page fault for address: 0000000192f66b3c
Nov 25 22:27:01 pve kernel: #PF: supervisor read access in kernel mode
Nov 25 22:27:01 pve kernel: #PF: error_code(0x0000) - not-present page
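Given the general protection faults and kernel page faults above, one thing I'm considering is falling back to the older 5.4.34 kernel that is still installed (see the pveversion output). A minimal sketch of how I'd pin it, assuming the box boots via legacy GRUB (on a UEFI install with ZFS root the systemd-boot menu would be used instead), with the menu entry names taken from the grep output rather than from this example:
Code:
# List the GRUB menu entries that are actually present
grep -E "menuentry '" /boot/grub/grub.cfg | cut -d"'" -f2

# Then pin the older kernel in /etc/default/grub, e.g. something like
#   GRUB_DEFAULT="Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux 5.4.34-1-pve"
# (exact titles must match the grep output above)
update-grub
reboot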
Do you have any ideas why the host is so unstable? Maybe trying an older kernel (roughly as sketched above) or something else? (I am really at my wits' end.)
Thanks in advance
Cheers