Installation only partially functional after hard disk error

maturos

Member
Apr 26, 2022
Hi,
I have a server with a Raid 5/6 configuration. After a routine physical cleanup, two of my disks failed and the RAID controller automatically started rebuilding onto a spare drive. Raid 5/6 can handle two failed drives, so I was not worried. After the first reboot I could still reach the web interface. To verify that the disks are irreparable, I rebooted the server.

After the reboot I saw the message that the web interface should be reachable etc., but it wasn't. Instead I got about 7 errors, repeated every 5 seconds. The errors looked like
Code:
ext4-fs error ... ext4_lookup:1785 ... rs:main Q:Reg: iget: checksum invalid
or
Code:
... pmxcfs: iget: checksum invalid
I was wondering why I get these errors now when the first reboot was successful, so I sent an ACPI signal to shut the server down. Nothing happened, so I forced an iLO cold reboot.
After this third reboot I got a prompt like the one described here: https://forum.proxmox.com/threads/why-do-i-get-pve-root-needs-manual-fsck-at-initramfs.104829/ I ran
Code:
fsck -c /dev/mapper/pve-root
and confirmed everything with yes. After that I exited the shell and the boot continued. The "ext4-fs error ... ext4_lookup:1785 ... checksum invalid" messages are gone now, but I still have no web interface.
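For anyone wanting to rehearse that repair step safely: the same check can be run against a throwaway ext4 image instead of the real root volume. This is only a sketch of the workflow; the image path is made up, and on a real system fsck must never be run against a mounted filesystem.

```shell
# Build a throwaway ext4 image and check it, as a safe stand-in for
# /dev/mapper/pve-root. (Never fsck a mounted filesystem!)
truncate -s 64M /tmp/demo.img
mkfs.ext4 -q -F /tmp/demo.img        # -F: target is a regular file, not a device
# -f forces a check even though a fresh fs is marked clean;
# -y answers "yes" to every repair prompt, like confirming manually
fsck.ext4 -f -y /tmp/demo.img && echo "filesystem clean"
```

The `-y` flag does the same thing as answering every prompt with yes by hand, which is convenient but also means fsck will take every repair action it proposes without review.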
I was able to log in via the iLO remote console and checked the web server. It seems to be listening only on IPv6, which I can't explain (config broken but still readable?). Other services are running and reachable over IPv4 (e.g. the HP Smart Storage Administrator, here on port 2381):
[screenshot attached]
I am really worried now about what I can do and what I shouldn't do, so I want to ask you for tips on how to solve this problem without making it worse :/...

Any ideas?

Edit: This is what I get right before the login prompt:
[screenshot attached]
 
Quote:
"I have a server with a Raid 5/6 configuration. [...] To verify that the disks are irreparable, I rebooted the server."
Is it RAID5 or RAID6? Is it BTRFS or ZFS, which can detect data corruption, or a software or hardware RAID that cannot? Your files are now at risk from any further error, since there is no more redundancy.
Quote:
"Instead I got about 7 errors, repeated every 5 seconds [...] iget: checksum invalid"
Maybe you had silent data corruption, or three failures hitting a particular file? Or an additional error, now that you have no redundancy anymore.
Quote:
"I was wondering why I get errors now when the first reboot was successful [...] I forced an iLO cold reboot."
Maybe it's an additional failure, or maybe it's the RAM instead of the disks? Your Proxmox installation is most likely corrupted.
Quote:
"After this third reboot I got a prompt [...] I ran fsck -c /dev/mapper/pve-root [...] but I still have no web interface."
This does not repair data corruption at the file level; it just does a best-effort repair of the filesystem structure (directories etc.). Better to reinstall Proxmox (on known-good drives).
 
Hi, I am using a hardware RAID 6 with plain ext4.
I have the following mounts on other partitions:
Code:
# fdisk -l
Disk /dev/sda: 9.1 TiB, 10001826537472 bytes, 19534817456 sectors
Disk model: LOGICAL VOLUME
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 262144 bytes / 1310720 bytes
Disklabel type: gpt
Disk identifier: <disk id removed>

Device           Start         End     Sectors  Size Type
/dev/sda1           34        2047        2014 1007K BIOS boot
/dev/sda2         2048     1050623     1048576  512M EFI System
/dev/sda3      1050624    62425087    61374464 29.3G Linux LVM
/dev/sda4     62425600   188254719   125829120   60G Linux filesystem
/dev/sda5    188254720 11907004719 11718750000  5.5T Linux LVM
/dev/sda6  11907005440 13786053119  1879047680  896G Linux filesystem

# cat /etc/fstab
/dev/sda4 /media/ISO ext4 defaults 0 0
/dev/sda5 /media/disks ext4 defaults 0 0
/dev/sda6 /media/backups ext4 defaults 0 0

# mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,relatime)
udev on /dev type devtmpfs (rw,nosuid,relatime,size=20533584k,nr_inodes=5133396,mode=755,inode64)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,noexec,relatime,size=4113312k,mode=755,inode64)
/dev/mapper/pve-root on / type ext4 (rw,relatime,errors=remount-ro,stripe=320)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,inode64)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k,inode64)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=22260)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
sunrpc on /run/rpc_pipefs type rpc_pipefs (rw,relatime)
/dev/sda6 on /media/backups type ext4 (rw,relatime,stripe=320)
/dev/sda5 on /media/disks type ext4 (rw,relatime,stripe=320)
/dev/sda4 on /media/ISO type ext4 (rw,relatime,stripe=320)
/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=4113308k,nr_inodes=1028327,mode=700,inode64)

If I now want to reinstall Proxmox, I would use a separate disk. What do I need in order to carry over the configuration from the old installation?
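As a starting point, the files usually worth carrying over are the pmxcfs tree (/etc/pve and its backing database) and the network config. The sketch below stages a dummy copy of those standard Proxmox VE paths so the archiving step can be rehearsed; on the real system, the source would instead be the old root filesystem mounted (ideally read-only) somewhere like /mnt/oldroot, which is an assumption here, not a command to copy verbatim.

```shell
# Stage a dummy tree mimicking the standard PVE config locations,
# then archive it the way one would archive the real old root.
SRC=$(mktemp -d)
mkdir -p "$SRC/etc/pve" "$SRC/etc/network" "$SRC/var/lib/pve-cluster"
: > "$SRC/etc/pve/storage.cfg"             # storage definitions
: > "$SRC/etc/network/interfaces"          # network setup
: > "$SRC/var/lib/pve-cluster/config.db"   # the pmxcfs database itself
tar czf /tmp/pve-config-backup.tar.gz -C "$SRC" \
    etc/pve etc/network/interfaces var/lib/pve-cluster/config.db
tar tzf /tmp/pve-config-backup.tar.gz      # list what was captured
```

On the real machine you would point `tar -C` at the mountpoint of the old disk instead of the staged `$SRC` directory.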
 
So I was able to get the web server running on IPv4 by creating a new /var/log/pveproxy directory. Now I see that Proxmox has lost all of its VMs.
[screenshot attached]
 
So I found out that all my config is stored in the config.db SQLite database. I opened a copy and was faced with this:
[screenshot attached]
On another installation, the config files of the VMs are visible here.
Am I right in thinking that my config is finally lost, and that Proxmox keeps no backup of config.db?
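For others hitting this: a copy of config.db can be poked at directly with the sqlite3 CLI. As far as I know, pmxcfs keeps one row per /etc/pve entry in a table named `tree`, with the filename in `name` and the file body in `data` (treat that schema as an assumption and inspect your own copy first). The toy database built below just mimics that layout so the query can be demonstrated; never open the live file while pmxcfs is running.

```shell
# Build a toy database mimicking pmxcfs' assumed "tree" layout,
# then query it the way one would query a copied config.db.
DB=/tmp/config-copy.db
sqlite3 "$DB" "CREATE TABLE tree (inode INTEGER PRIMARY KEY, parent INTEGER,
               version INTEGER, writer INTEGER, mtime INTEGER, type INTEGER,
               name TEXT, data BLOB);"
sqlite3 "$DB" "INSERT INTO tree (name, data) VALUES ('100.conf', 'memory: 2048');"
# List anything that looks like a VM config file
sqlite3 "$DB" "SELECT name FROM tree WHERE name LIKE '%.conf';"
```

If rows with `.conf` names survive in your copy, `SELECT data FROM tree WHERE name = '...';` may let you recover individual VM configs even without a working /etc/pve.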
 
Quote:
"So I found out that all my config is stored in the config.db SQLite database."
Proxmox does indeed store its configuration in a database and makes it accessible via /etc/pve.
Quote:
"On another installation, the config files of the VMs are visible here. Am I right in thinking that my config is finally lost, and that Proxmox keeps no backup of config.db?"
Probably. You could try recreating the VM configurations from memory and attaching the existing virtual drives, but I would assume that the virtual drives are also corrupted already.
 
Quote:
"Proxmox does indeed store its configuration in a database and makes it accessible via /etc/pve."
Man am I unlucky...
Quote:
"You could try recreating the VM configurations from memory and attaching the existing virtual drives, but I would assume that the virtual drives are also corrupted already."
Yes, that's also my assumption. But since the partition holding the VM data was not being written to, I think I have a chance of fixing them offline.
 
So I managed to recover all qcow2 disk images from the sda5 partition and imported them into a new Proxmox installation. The first two machines booted successfully. I'll continue with the rest and think about a new backup strategy :)
 
