Proxmox: many errors on HP DL380 G9

Di. · Nov 22, 2022

I don't understand what's going on; I am getting a lot of errors.

Server: HP DL380 G9

pveversion -v

proxmox-ve: 7.2-1 (running kernel: 5.15.30-2-pve)
pve-manager: 7.2-3 (running version: 7.2-3/c743d6c1)
pve-kernel-helper: 7.2-2
pve-kernel-5.15: 7.2-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-8
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-6
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.2-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.1.8-1
proxmox-backup-file-restore: 2.1.8-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-10
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-1
pve-ha-manager: 3.3-4
pve-i18n: 2.7-1
pve-qemu-kvm: 6.2.0-5
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1

shanreich · Nov 22, 2022

This is most likely a hardware issue. Did you look at the SMART values of your disk? There might be other reasons, but this is the most likely one. I would check the hard drive immediately and, in any case, make a backup ASAP while you still can.
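For reference, a minimal sketch of how the SMART values could be read with smartctl from the smartmontools package. The helper names `smart_health` and `smart_attr_raw` are made up for illustration, and `/dev/sda` in the comments is only a placeholder device, not one taken from this thread. The helpers only parse text, so they are safe to try on saved output.

```shell
#!/bin/sh
# Sketch only: these helpers parse smartctl output; they never touch a disk.
# Assumed usage (device name is a placeholder):
#   smartctl -H /dev/sda | smart_health
#   smartctl -A /dev/sda | smart_attr_raw UDMA_CRC_Error_Count

# Read `smartctl -H` output on stdin and print the overall verdict
# (for example PASSED or OK), or "unknown" if no verdict line is present.
smart_health() {
    awk -F': *' '
        /SMART overall-health self-assessment test result/ || /SMART Health Status/ {
            print $2
            found = 1
            exit
        }
        END { if (!found) print "unknown" }'
}

# Read `smartctl -A` output for an ATA/SATA disk on stdin and print the
# raw value (10th column) of the attribute named in $1.
smart_attr_raw() {
    awk -v name="$1" '$2 == name { print $10; exit }'
}
```

Attributes worth a look on SATA disks are usually Reallocated_Sector_Ct, Current_Pending_Sector and UDMA_CRC_Error_Count; a growing CRC count tends to point at cabling or the backplane rather than the disk itself. Disks that sit behind a Smart Array controller may need an extra device type option such as `-d cciss,N`.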

Di. · Nov 22, 2022

shanreich said:
This is most likely a hardware issue. Did you look at the SMART values of your disk? There might be other reasons, but this is the most likely one. I would check the hard drive immediately and, in any case, make a backup ASAP while you still can.

SMART is OK.

Di. · Nov 22, 2022

What is the best way to check the hard drive?

shanreich · Nov 22, 2022

You can check your NVMe drive with the following command (you might need to install the package nvme-cli):

Code:

nvme --smart-log /dev/nvme0n1

Can you post the output? It looks like the disk is relatively new, so the new disk itself might be the issue; some disks are broken from the start due to manufacturing defects. That can never be ruled out.

Have you run fsck on your filesystem? That probably only fixes the symptoms, though, not the cause of your issues. If the disk itself is faulty, filesystem errors will just pop up again after the repair.

Di. · Nov 22, 2022

root@pvehp:~# nvme --smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0
temperature : 40 C
available_spare : 100%
available_spare_threshold : 1%
percentage_used : 0%
endurance group critical warning summary: 0
data_units_read : 2,529,739
data_units_written : 2,468,813
host_read_commands : 10,737,307
host_write_commands : 34,909,949
controller_busy_time : 28
power_cycles : 72
power_on_hours : 1,592
unsafe_shutdowns : 63
media_errors : 0
num_err_log_entries : 67
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 50 C
Temperature Sensor 2 : 51 C
Temperature Sensor 3 : 52 C
Temperature Sensor 4 : 53 C
Temperature Sensor 5 : 54 C
Temperature Sensor 6 : 55 C
Temperature Sensor 7 : 56 C
Temperature Sensor 8 : 57 C
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0

shanreich · Nov 22, 2022

According to this, you have 63 unsafe shutdowns:

Code:

 unsafe_shutdowns : 63

Sudden power loss can definitely damage the files or the filesystem on your disk, so you might want to look into what is causing these shutdowns. That is probably the cause of your filesystem corruption. To repair the damage, you can use fsck.

There are also several error log entries (num_err_log_entries is 67); you can look into them with:
nvme error-log /dev/nvme0n1
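To keep an eye on these counters over time, something like the following sketch could help. It only parses the text that the smart-log command prints, in the `name : value` layout shown in the output above; the function names `smart_field` and `summarize_smart_log` are hypothetical helpers, not part of nvme-cli.

```shell
#!/bin/sh
# Sketch only: parses smart-log text from stdin, never touches the device.
# Assumed usage:
#   nvme smart-log /dev/nvme0n1 | summarize_smart_log

# Print the value of the field named in $1.
# Thousands separators such as "1,592" are removed so the value can be compared.
smart_field() {
    awk -F':' -v key="$1" '
        {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim the field name
            if (name == key) {
                val = $2
                gsub(/[ ,]/, "", val)           # drop spaces and separators
                print val
                exit
            }
        }'
}

# Print the counters discussed in this thread, one key=value pair per line.
summarize_smart_log() {
    log=$(cat)
    for key in critical_warning media_errors unsafe_shutdowns num_err_log_entries; do
        printf '%s=%s\n' "$key" "$(printf '%s\n' "$log" | smart_field "$key")"
    done
}
```

Running it before and after a reboot shows whether unsafe_shutdowns or num_err_log_entries keep climbing, which would suggest the power loss problem is still ongoing.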

Di. · Nov 22, 2022

At the moment, the virtual machine works on the NVMe drive, but not on the other two disks.
I can't back up or restore VMs on the 8 TB SSD (Samsung_SSD_870_QVO_8TB) or the 16 TB HDD (ST16000NM001G-2KK103).

n'gen'f · Nov 23, 2022

Connect to iLO and check the status of the server hardware.

ITT · Nov 23, 2022

Looks like a RAM or mainboard failure.

Di. · Nov 23, 2022

dfgh said:
Connect to iLO and check the status of the server hardware.

Everything is fine; there are no errors.

n'gen'f · Nov 23, 2022

Are the disks connected to the RAID controller? Is the battery pack installed or not? What does the config show?

Code:

ssacli ctrl slot=0 show config
ssacli ctrl slot=0 show
