[SOLVED] Proxmox host fails to boot, missing boot sector (no GRUB, nothing)

mauro2306

Active Member
May 17, 2020
Hello everyone,

After years of service, my Proxmox host decided to give up on me, right at the moment when I was planning to replace it with a new host (not brand new, but with new storage) and while I was preparing that new server, building the RAID and so on.

The failing host is a DL380 G8, dual Xeon E5-2430L, 64 GB of RAM, with a P822 RAID card and 2 LUNs configured:
20x 900GB SAS 10K rpm Seagate disks (with 2 hot spares configured) in RAID 10 (root LUN, containing the LVM local and local-thin storages where VMs, ISO images and backups are stored by default)
24x 2TB SAS 7.2K rpm Seagate disks (with 2 hot spares) in RAID 6 for pure data storage

running Proxmox 7.4, updated recently (one week ago or so).
I was running benchmarks on the new system to find the best configuration for my RAID 10 of SSDs, setting up the controller cache, the physical device cache and so on, and comparing against 4 benchmarks I ran on my old host, just to see how the new system would perform.
I ran these commands, one for reads and one for writes, with 4K and 1M block sizes:

https://pve.proxmox.com/wiki/Benchmarking_Storage
fio --ioengine=libaio --direct=1 --sync=1 --rw=read --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name seq_read --filename=/dev/sda
fio --ioengine=libaio --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name seq_read --filename=/dev/sda
fio --ioengine=libaio --direct=1 --sync=1 --rw=read --bs=1M --numjobs=1 --iodepth=1 --runtime=60 --time_based --name seq_read --filename=/dev/sda
fio --ioengine=libaio --direct=1 --sync=1 --rw=write --bs=1M --numjobs=1 --iodepth=1 --runtime=60 --time_based --name seq_read --filename=/dev/sda

Then I left my old host as it was, and when I finally found the best settings on the new host, I decided to move forward with the final installation and migrate my VMs there.
I wanted to reboot the new host, but selected the wrong tab in my terminal and restarted my old server instead. Not a big deal, you would think... After 10 minutes of waiting it was still not reachable, so I went to check. The server would not boot any longer; POST says no boot drive detected.
I went into the hardware RAID controller to check: both volumes are fine, no RAID issues whatsoever detected, and all my disks are still online. I tried to set the logical boot volume on the first LUN again, without success.

Controller Status OK
Serial Number PDVTF0BRH8Y1RD
Model Smart Array P822 Controller
Firmware Version 8.32
Controller Type HPE Smart Array
Cache Module Status OK
Cache Module Serial Number PBKUD0BRH8V4FT
Cache Module Memory

Logical Drive 01

Status OK
Capacity 5029 GiB
Fault Tolerance RAID 1/RAID 1+0
Logical Drive Type Data LUN
Encryption Status Not Encrypted

Logical Drive 02

Status OK
Capacity 37259 GiB
Fault Tolerance RAID 6
Logical Drive Type Data LUN
Encryption Status Not Encrypted

If I start the Proxmox USB installer again, the boot LVM is missing, not detected at all:

1690227577291.png

The volume detected is the second one, the data storage.
As the boot volume is not detected at all, I don't even know where to start.
I tried to run testdisk, but I am not sure how relevant this tool is for getting the partitions back in this situation; what I can say is that it does see LVM2 volumes on my /dev/sda:

1690227727710.png

1690227811878.png

From now on, I would like to ask for any advice before proceeding with anything in testdisk. The backups of all my VMs were on the same host; I know that is pretty stupid, but that is why the new host was coming, so the old one could be recycled as a Proxmox Backup Server. I am running this stuff at home, so it is not like I have the budget to buy this kind of hardware every day, and I was already using my old host for pretty much everything. Thanks for the help.
 
Hi there, just to keep you informed: I am currently copying the disk to another location and will try to recover from there using testdisk.
I installed Proxmox on a similar volume and will try to mimic its partition scheme using testdisk, and we will see.
The clone I am making uses "dd", to make sure I get an exact, byte-wise mirror copy.
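For reference, this is roughly the command I am using; just a sketch, the device names are from my setup (/dev/sdX is a placeholder for the spare destination disk), so double check yours before running anything:

Code:
# byte-wise clone of the damaged boot LUN onto a spare disk of at least the same size
# /dev/sda = source (broken Proxmox boot volume), /dev/sdX = destination - adjust to your devices!
dd if=/dev/sda of=/dev/sdX bs=1M conv=noerror,sync status=progress
# conv=noerror,sync keeps going past read errors and pads the affected blocks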
 
just to make sure - the fio commands you posted, you ran them on your production system? And /dev/sda is a disk that is in use by that system? If so, you overwrote at least some of your data..

Depending on the storage type you use (file or block based), you need a test file or a block device to test against. Make sure that you will not destroy any data if you perform a write test; it is best to double or triple check the devices and files.
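for example, a write test can target a throw-away file on an existing filesystem instead of a raw device; a minimal sketch (the path, size and job name are just placeholders):

Code:
# benchmark a dedicated test file - nothing existing gets overwritten
fio --ioengine=libaio --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 \
    --runtime=60 --time_based --size=4G --name=seq_write --filename=/root/fio-testfile
# remove the test file afterwards
rm /root/fio-testfile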
 
do you remember what you stored on the first (5TB) logical disk and what on the second big one? at least the VM volume vm-101-disk-0 seems to still be there..
 
just to make sure - the fio commands you posted, you ran them on your production system? And /dev/sda is a disk that is in use by that system? If so, you overwrote at least some of your data..
Yes it was, and now that you point it out, I realize the big mistake I made following that guide: https://pve.proxmox.com/wiki/Benchmarking_Storage and not paying attention that the destination wasn't a file but the whole disk. And I ran the write command once for each block size, so that's it.
OK, so I will definitely have some data loss, for sure, but maybe I can still retrieve my VMs?
On /dev/sda it was a regular Proxmox installation, with 3 partitions:
/dev/sda1 for Bios Boot
/dev/sda2 for EFI system
/dev/sda3 for Linux LVM

Within the LVM, 3 logical volumes: swap, data and root (I assume root is for the filesystem and ISO images as well as backups, and the data one is for the VMs themselves).

What do you think about it ?

The vm-101-disk-0 is on the other RAID volume; I only had my pure data storage there, not the VMs.
 
at this point, you probably want to make a full, byte-wise copy of /dev/sda and attempt recovery using that copy. Depending on how important your VMs are/were, you might want to get professional data recovery help. You might be able to recover the disk and LVM layout, and with any luck, some of the LVs. It's impossible to tell without trying how much of your disk got overwritten; since you used a time-based fio command, this very much depends on the speed of your disks.
 
Off the top of my head, around 60 to 70 GB per run, because the write speed was around 1000 MB/s and the runtime 60 seconds.
My root partition was 2 TB, so I guess that would mean my data volume should be safe (in theory, of course).
I definitely do not have the budget for professional help, so my question would be: how would you start to recreate the partition scheme the way it was? Using testdisk or something else?
 
the most important part is to create a backup of the full disk as it is now, then you can try various recovery options but can always roll back to the current, pre-recovery state.

yes, things like testdisk are an option.
 
the most important part is to create a backup of the full disk as it is now, then you can try various recovery options but can always roll back to the current, pre-recovery state.

yes, things like testdisk are an option.
The backup using "dd" is in progress right now; once it is finished (another couple of hours), I will use testdisk to check whether both disks show a similar scheme (to make sure the dd went well), and then I will proceed with recovery on the destination disk (to keep the source untouched).
The point is, I roughly know the partition scheme my disk should have, but I don't know how best to recreate it.
 
well, roughly is not really good enough ;) the partitions you can create manually with something like gdisk. depending on which PVE version you installed with, and whether you used the default LVM-thin+ext4 setup or not, the layout might be re-creatable.
 
Just to show you what I mean: I have another similar server where I created the same RAID volume as on the old server, with the same number of disks and therefore the same size. So basically, I know how the Proxmox partitions should look on the disk, and even the LVM inside. I was using the default setup values besides the root size set to 2 TB, all ext4.
 

Attachments

  • IMG_20230726_221411.jpg
    IMG_20230726_221411.jpg
    970.3 KB · Views: 2
  • IMG_20230726_221515.jpg
    IMG_20230726_221515.jpg
    675.6 KB · Views: 1
  • IMG_20230726_221359.jpg
    IMG_20230726_221359.jpg
    974 KB · Views: 2
if you are sure that the setup is the same (note, you also need the same ISO version, because there were changes in the past with alignment, size of EFI partitions, and so on), you can use gdisk to transfer the partition table to your copy of the broken disk.
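one possible way to do the transfer with sgdisk (part of the gdisk package), assuming the reference disk really has the identical layout - this is just a sketch, only run it against the copy and double check the device names (/dev/sdR and /dev/sdC are placeholders here):

Code:
# /dev/sdR = reference disk with the known-good layout, /dev/sdC = copy of the broken disk
sgdisk --backup=/root/parttable.bak /dev/sdR        # save the reference partition table
sgdisk --load-backup=/root/parttable.bak /dev/sdC   # write it onto the copy
sgdisk -G /dev/sdC                                  # give the copy new random disk/partition GUIDs
partprobe /dev/sdC                                  # make the kernel re-read the partition table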
 
if you are sure that the setup is the same (note, you also need the same ISO version, because there were changes in the past with alignment, size of EFI partitions, and so on), you can use gdisk to transfer the partition table to your copy of the broken disk.
I reinstalled the same 7.4 version that was in place. How does that sound to you? Proceed with gdisk then?
 
Oh wait, look what I found: this is how my system looked before I made my mistake.
 

Attachments

  • Sans titre.png
    Sans titre.png
    22 KB · Views: 7
I am starting to think that I won't be able to get my data back (the VMs at least), so I think I will start deploying the new server and that's it.
The only question I have then is about the other logical volume with my data: how would you move it safely to the new storage? Would you plug the drive bays into the new server with the same hardware RAID controller, add the storage to the new Proxmox and then move it to the new storage, or would you have another suggestion for me?

Edit:
I was wondering if, for more safety, I shouldn't Clonezilla the working data disk to the new server over the network (I know 40 TB will take a long time, but I can use 10Gb cards) and then use this method:
https://superuser.com/questions/116617/how-to-mount-an-lvm-volume
What do you think about it?

Edit 2: OK, Clonezilla did not want to clone that volume (invalid GPT signature error message).
I tried with dd, but after a good start at 500 MB/s it slowed down to 65 MB/s for a reason I don't understand.
I ran the commands like this, the first on the destination server, the second on the source:
nc -l -p 9901 | dd of=/dev/sdb bs=1M status=progress
dd if=/dev/sdb bs=1M status=progress | nc 192.168.0.99 9901

Very strange performance, and I am running with 10Gb NICs.
If I use fio with a block size of 1M on the destination server, I get a steady 1500 MB/s for more than a minute (so the cache can be excluded, I have 4 GB on the hardware controller).

I don't know what the problem can be.
 
any method to transfer the contents of block devices should be fine - dd is one of them (although yes, physically transferring the controller+disks will likely be faster if that is an option). I can't really tell why the transfer starts stalling.
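if you want a sanity check that the copy is identical once it finishes, comparing checksums of both block devices is one option (it requires a full read of each device; the sketch below assumes both devices are /dev/sdb as in your commands):

Code:
# on the source host
sha256sum /dev/sdb
# on the destination host - the two hashes must match
sha256sum /dev/sdb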
 
Hi there,

I am coming back with my issue, sorry to bother you. I finally gave up on hardware RAID and went with ZFS, turning both RAID controllers into HBA mode. I don't know what the problem was, probably the consumer drives, but the performance was terrible on my new server, especially for writes.
Now with ZFS, I have decent bandwidth in read and write, both with 1M and 4K block sizes, at least on par with the previous system using enterprise-grade drives.
I plugged the 2 external bays from the old system (2 HP MSA60), which hold a VM disk with my important data, into my new server. The RAID controller correctly recognized the RAID 6, and the volume is visible in my new Proxmox environment on the command line, but not in the GUI (expected, I guess).
My question is: how do I safely import the LVM from another Proxmox instance into this new one? Here is what I currently have in the GUI and from the different CLI commands:

Code:
root@proxmox:~# vgscan
  Found volume group "Data-VM" using metadata type lvm2
root@proxmox:~# lvscan
  ACTIVE            '/dev/Data-VM/vm-101-disk-0' [33.37 TiB] inherit
root@proxmox:~# lvs
  LV            VG      Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  vm-101-disk-0 Data-VM -wi-a----- 33.37t                                                  
root@proxmox:~# vgs
  VG      #PV #LV #SN Attr   VSize   VFree
  Data-VM   1   1   0 wz--n- <36.39t <3.02t


fdisk -l

Disk /dev/sdy: 36.39 TiB, 40007305879552 bytes, 78139269296 sectors
Disk model: LOGICAL VOLUME 
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 262144 bytes / 5242880 bytes


Disk /dev/mapper/Data--VM-vm--101--disk--0: 33.37 TiB, 36690831867904 bytes, 71661780992 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 262144 bytes / 5242880 bytes
Disklabel type: gpt
Disk identifier: B09118C0-AA36-45AA-8513-9E7AA4A7C1E3

Device                                      Start         End     Sectors  Size Type
/dev/mapper/Data--VM-vm--101--disk--0-part1  2048 71661778943 71661776896 33.4T Linux filesystem

Thanks again for the help, I would really appreciate getting that volume back. And if I can just mount it to grab a few very important folders from it, that also works for me. My end goal is to copy the VM disk from that volume to the new one, and once I am sure all data is correctly copied, safely remove the 2 external bays.
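In case it helps, this is what I was thinking of trying first to at least grab the few important folders, strictly read-only and based on what fdisk already shows me (please tell me if this is a bad idea):

Code:
# activate the volume group (lvscan says it is already ACTIVE, so this may be a no-op)
vgchange -ay Data-VM
# if the -part1 mapping below does not exist yet, kpartx -av /dev/Data-VM/vm-101-disk-0 should create it
mkdir -p /mnt/datavm
# mount the partition inside the VM disk read-only, just to copy data off
mount -o ro /dev/mapper/Data--VM-vm--101--disk--0-part1 /mnt/datavm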
 

Attachments

  • Screenshot_20230810_103842.png
    Screenshot_20230810_103842.png
    107 KB · Views: 2
  • Screenshot_20230810_103802.png
    Screenshot_20230810_103802.png
    47.2 KB · Views: 2
  • Screenshot_20230810_103732.png
    Screenshot_20230810_103732.png
    44.5 KB · Views: 2
OK, I found this thread: https://www.reddit.com/r/Proxmox/comments/w8o7va/import_old_lvm_storage_drive_to_new_proxmox/ and followed the procedure. It worked, I can see my data, so glad :) For anyone who might stumble onto this thread and is missing the last part: as long as you can see your LVM with lvscan and vgscan, you can go to the Datacenter => Storage menu in Proxmox and add an LVM storage; specifying the same ID as before worked for me like a charm.
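If I read the documentation correctly, the CLI equivalent should be something like this; I have not tested this exact command myself, and the storage ID and content types are just what I used (adapt them to your setup):

Code:
# register the existing volume group as an LVM storage named "Data-VM"
pvesm add lvm Data-VM --vgname Data-VM --content images,rootdir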
Thank you all for your great help. Now I just have to rebuild my system, but with the data back it should be easier :)
 

Attachments

  • Screenshot_20230810_152603.png
    Screenshot_20230810_152603.png
    86.7 KB · Views: 3
Hi everyone, I am coming back to you because the situation has evolved a bit. I was planning to go with ZFS because the performance was better overall and I wouldn't have had to configure my disks in RAID 10 to get something decent, RAIDZ3 would have done, but I still have strange behaviour and things I can't figure out: https://forum.proxmox.com/threads/i...s-with-import-by-scanning.132446/#post-584409
So I decided to go back to hardware RAID and will configure my volumes in RAID 10. I will lose some capacity, but at least the performance was decent and I have a bit more knowledge in that area.
My only question, for the future: if I want to replace those 5 TB 2.5" SATA consumer hard drives with something more professional, would one of these be a good choice? Is this the kind of enterprise-class SSD you would have recommended from the beginning? https://download.semiconductor.sams...sheet/Samsung_SSD_PM893_Data_Sheet_Rev1.0.pdf
https://advdownload.advantech.com/p...96FD25-ST19T-M53P_Datasheet20200120180650.pdf
https://www.kingston.com/en/ssd/ser...&form factor=2.5"&use=servers and datacenters

Thanks for the feedback.
 
