Proxmox: many errors on HP DL380 G9

Di.

I don't understand what's going on. The server is throwing a lot of errors (see the attached screenshots).

Server: HP DL380 G9

pveversion -v output:
proxmox-ve: 7.2-1 (running kernel: 5.15.30-2-pve)
pve-manager: 7.2-3 (running version: 7.2-3/c743d6c1)
pve-kernel-helper: 7.2-2
pve-kernel-5.15: 7.2-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-8
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-6
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.2-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.1.8-1
proxmox-backup-file-restore: 2.1.8-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-10
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-1
pve-ha-manager: 3.3-4
pve-i18n: 2.7-1
pve-qemu-kvm: 6.2.0-5
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
 

Attachments

  • Снимок экрана 2022-11-22 в 13.27.01.png
  • Снимок экрана 2022-11-22 в 10.59.16.png
  • Снимок экрана 2022-11-22 в 11.37.56.png
  • Снимок экрана 2022-11-22 в 11.01.08.png
  • Снимок экрана 2022-11-22 в 02.50.17.png
  • 11.png
This is most likely a hardware issue. Did you look at the SMART values of your disk? There might be other causes, but this is the most likely one. I would check the hard drive immediately and, in any case, get a backup ASAP while you still can.
 
SMART is OK.
 

Attachments

  • Снимок экрана 2022-11-22 в 13.59.45.png
You can check your NVMe disk with the following command (you might need to install the nvme-cli package first):
Code:
nvme --smart-log /dev/nvme0n1

Can you post the output? It looks like the disk is relatively new, so the disk itself might be at fault. Some disks are defective from the start due to manufacturing issues, and you can never rule that out.

Have you run fsck on your filesystem? That would probably only fix the symptoms though, not the cause of your issues. If the disk is faulty, filesystem errors will just pop up again after the repair.
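
If you do try fsck, a cautious first pass is a read-only check on a filesystem that is not mounted. Below is a minimal sketch of that idea, not a tested procedure for your setup. The helper names is_mounted and safe_fsck_cmd are made up for illustration, /dev/sdb1 is only a placeholder device, and the mount check is a naive exact match against /proc/mounts, so it will miss aliases such as /dev/mapper names. The -n option is handed through to the filesystem-specific checker; for checkers such as e2fsck it means "open read-only and make no changes".

```shell
#!/bin/sh
# Hypothetical helper: succeed if device $1 is listed as a mount source
# in the mounts table $2 (defaults to /proc/mounts).
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Hypothetical wrapper: print the read-only check command for an
# unmounted device, or refuse if the device is currently mounted.
# It only prints the command; it does not run fsck itself.
safe_fsck_cmd() {
    if is_mounted "$1" "$2"; then
        echo "refusing: $1 is mounted" >&2
        return 1
    fi
    echo "fsck -n $1"
}

# Usage (placeholder device name, adjust to your system):
#   safe_fsck_cmd /dev/sdb1
```

Printing the command instead of running it keeps the sketch harmless; review it first, and only move on to an actual repair run once you have a backup.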
 
root@pvehp:~# nvme --smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0
temperature : 40 C
available_spare : 100%
available_spare_threshold : 1%
percentage_used : 0%
endurance group critical warning summary: 0
data_units_read : 2,529,739
data_units_written : 2,468,813
host_read_commands : 10,737,307
host_write_commands : 34,909,949
controller_busy_time : 28
power_cycles : 72
power_on_hours : 1,592
unsafe_shutdowns : 63
media_errors : 0
num_err_log_entries : 67
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 50 C
Temperature Sensor 2 : 51 C
Temperature Sensor 3 : 52 C
Temperature Sensor 4 : 53 C
Temperature Sensor 5 : 54 C
Temperature Sensor 6 : 55 C
Temperature Sensor 7 : 56 C
Temperature Sensor 8 : 57 C
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
 
According to this you have 63 unsafe shutdowns (out of 72 power cycles):
Code:
 unsafe_shutdowns : 63

Sudden power loss can definitely damage the files and the filesystem on your disk, so you might want to look into what is causing these shutdowns. That is probably the cause of your filesystem corruption. To repair the damage, you can use fsck.


There are also 67 error log entries (num_err_log_entries). You can look into them with:
Code:
nvme error-log /dev/nvme0n1
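
If you want to keep an eye on these counters over time, here is a rough sketch that filters the smart-log text for the fields discussed in this thread. The function name check_nvme_smart and the WARN/NOTE labels are my own invention, not part of nvme-cli, and the parsing assumes the "name : value" layout shown in the output above.

```shell
#!/bin/sh
# Hypothetical helper: read `nvme smart-log` text on stdin and print the
# counters from this thread whenever they are non-zero.
check_nvme_smart() {
    awk -F':' '
        {
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim the field name
            gsub(/[ \t,%]/, "", val)           # drop spaces, thousands separators, percent signs
        }
        key == "critical_warning"    && val + 0 > 0 { print "WARN critical_warning=" val }
        key == "media_errors"        && val + 0 > 0 { print "WARN media_errors=" val }
        key == "unsafe_shutdowns"    && val + 0 > 0 { print "NOTE unsafe_shutdowns=" val }
        key == "num_err_log_entries" && val + 0 > 0 { print "NOTE num_err_log_entries=" val }
    '
}

# Usage on a live system (needs nvme-cli and root):
#   nvme smart-log /dev/nvme0n1 | check_nvme_smart
```

For the output posted above this would report unsafe_shutdowns and num_err_log_entries and stay silent about critical_warning and media_errors, which are both 0.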
 
At the moment, the virtual machine works on the NVMe disk, but not on the other two disks.
I can't back up or restore VMs on the 8 TB SSD (Samsung_SSD_870_QVO_8TB) or the 16 TB HDD (ST16000NM001G-2KK103).
 

Are the disks connected to the RAID controller? Is the battery pack installed or not? What does the controller configuration show?
Code:
ssacli ctrl slot=0 show config 
ssacli ctrl slot=0 show
 
