[SOLVED] NVME disk "Available Spare" problem.

So you can roughly estimate when it will complete:
Code:
Time_Taken_So_Far/(Amount_Copied_So_Far/Total_Size) = Expected_Total_Time
In my experience, most of the zero-data comes later in the dd read, so it's usually quicker than the estimate above. However, your read error(s) may well throw the estimate off.
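For example, plugging made-up numbers into that formula (say 40 minutes elapsed with 206 GB of the 500 GB copied):
Code:
# hypothetical figures only - substitute your own readings
echo "40 / (206/500)" | bc -l    # ~97 minutes expected in total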
 
Eh. It's at 206 GB. I just hope I'll have luck booting it again, so I can copy /etc and a few other directories to a safe location and write down the network configuration. If I need to set it up again, I have no clue what I did back in 2021 :eek:
 
Well .... here is where you hope you have full & restorable backups of all VMs & LXCs. You should also at least have docs/notes on your PVE host setup - so that you can reconfigure.

Please note that AFAIK the actual reading of the NVMe in dd should not itself cause further cell degradation - it's the writing to the cells that does this.
 
Yes, the VM backups are on the disc that dd is writing to now. Only the PVE config is not backed up - completely my fault. More than half is done (297 GB). Copying is faster now and the errors are gone - some 10 minutes without it reporting anything but progress.
 
The copy is over. The drive is now, after copying, seriously damaged (the BIOS now warns to back up and replace it):

[Screenshot: SMART data for the failing drive]

I will first check whether the backup file extracts without errors. Then I will write it to the new disc and try booting my computer from it first. Only then will I take the server out of the rack and swap the drives. I will also make screenshots of the network config and the /etc/ of the PVE host.
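For the "extracts without errors" check, gzip can test the archive in place without writing anything out (the path below is just an example):
Code:
# verify the compressed dd image is readable end-to-end
gzip -t /mnt/your_old_NVMe_dd_image.img.gz && echo "archive OK"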
 
I transferred the disc backup file and the /etc and /var dirs to an external disc. Now I will do the same with the VM backup files (*.vma.zst) - those are the VM backup files, aren't they? My VM store is on a ZFS volume.

And please, what is the command to restore the created gzip to the new NVMe drive?

I plan to do that with a USB-to-PCIe bridge (the new 970 EVO Plus will be attached via USB) on the server (I can free one USB slot by unplugging the UPS USB cable and plugging this one in), then boot to Linux (from the SystemRescue ISO) via the BIOS, mount the drive with the gzip file on it, and restore it to the new NVMe attached via the USB-to-PCIe bridge. I hope it will recognize the new NVMe drive; otherwise I do not know how to restore the gzip to the new drive from Windows 10.

Thank you!
 
And please, what is the command to restore the created gzip to the new NVMe drive?
If you used the above dd structure I mentioned, then:

Code:
lsblk
#find new NVMe USB-attached identifier

gunzip -c /mnt/your_old_NVMe_dd_image.img.gz | dd of=/dev/new_NVMe_identifier

Best of luck.

Edit: Remember the new NVMe must be at least as big as the old one. If it's bigger, you can enlarge the partition & filesystems once everything is working fine.
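A rough sketch of that enlarging step, assuming an ext4 root on partition 3 of the new disk (LVM or ZFS layouts need their own tools instead):
Code:
# example only - adjust the device/partition numbers to your actual layout
growpart /dev/nvme0n1 3      # grow partition 3 into the free space (cloud-guest-utils)
resize2fs /dev/nvme0n1p3     # then grow the ext4 filesystem to fill it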
 
The drive is now, after copying, seriously damaged (the BIOS now warns to back up and replace it):
So you see, read-only operations can damage a drive if the drive is in rough condition in the first place :D
Read-only operations are safe only when the drive is in good condition.
 
You should also at least have docs/notes on your PVE host setup - so that you can reconfigure.
Yeah, take notes o_O
That's our last line of defence for ESXi server configuration backups too.

Back in the days of 10K RPM HDDs + hardware RAID controllers, we ran ESXi on SD cards (so we could save an extra 2.5" bay for an HDD).
SD cards need to be replaced regularly, and if we were low on luck some would break with zero indication. Then we had to replace the SD card, reinstall ESXi and re-configure the server WITH NOTES so it could rejoin the cluster.

We even take notes with pencils, so nobody can replace our ink with the kind that disappears after some time.
I'm serious, very serious.
 
The dd copy operation ended with:

[Screenshot: dd exits with a "No space left on device" error]

How can there be no space left? It's a 500 GB drive (a 970 EVO Plus 500 GB, just like the original 970 EVO 500 GB).
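One way to rule out a small size mismatch between the two drives is to compare their exact byte counts (the device names below are just examples):
Code:
# drives with the same marketing capacity can still differ by a few sectors
blockdev --getsize64 /dev/nvme0n1   # old 970 EVO
blockdev --getsize64 /dev/sde       # new 970 EVO Plus behind the USB bridge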

If I try to boot from it on the server (with the original NVMe drive still in) while it is still attached to the USB-PCIe bridge and recognized as /dev/sde (not /dev/nvme0n1 - the BIOS allows doing that), it returns:

[Screenshot: boot error message]

Also, if I try to boot from the original disc, it complains about some duplicate stuff and does not boot. When I pull out the new disc (from USB), booting from the original drive works as usual. Tomorrow I will swap the discs, so the new one will be /dev/nvme0n1, and try to boot from it. Today it is not possible to do that, unfortunately. Any hints, or is this game over for me?

Thank you.
 
Also, if I try to boot from the original disc, it complains about some duplicate stuff and does not boot.
You should take out the old disk.
Since you dd'ed the whole disk, the two disks now have exactly the same content, including the GUID Partition Table. Connecting both disks to the same server could be problematic for many BIOS implementations, as GUIDs are meant to be, well, globally unique.
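If you ever do need both the original and the clone attached at the same time, one option is to give the clone fresh GPT identifiers (sgdisk is part of the gdisk package; the device name below is just an example, and this only fixes GPT GUIDs, not e.g. duplicate ZFS labels):
Code:
# write new random disk and partition GUIDs to the cloned disk only
sgdisk --randomize-guids /dev/sde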
 
It's OK now. I took out the old disk, and booting from the new one produced the same error. So I installed PVE again, same version (6.4), updated it, imported the ZFS pool and slowly began configuring it back (I had made a copy of the /etc/, /var/ and /root/ folders and screenshotted all the PVE settings). The server is operative, the VMs work, and the disk is now at 100%:

[Screenshots: SMART/health readings for the new drive]
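For anyone following along, the ZFS pool import step mentioned above is roughly (the pool name below is just a placeholder):
Code:
zpool import          # list pools found on the attached disks
zpool import -f tank  # import by name; -f because the pool was last used by the old install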


But I will replace that 970 EVO with a better one. This was an almost-heart-attack situation for me, but now I'm OK. Going out for a beer now :) Tomorrow I need to install a few additional packages (for UPS monitoring), the certificate, disable unneeded services and maybe do some other minor configs. Time will tell what is missing...

THANKS TO EVERYONE WHO HELPED HERE! CHEERS!
 
Cheers mate, it's good you got your PVE back online.

The heart-attack situation you just had is just part of a server maintainer's regular agenda. We see this kind of shit not quite day to day, but definitely as business as usual (the broken SD card story, and ESXi being so stable that it does not even need a functional system disk to keep running).

Possible data corruption on multiple drives scattered across multiple layers of storage servers... that's something we call a heart attack.

But luckily enough, I haven't seen that happen with my own eyes.
 
UPDATE - Disc is definitely broken
Is that the original one or the new one?

The image you show looks like the Samsung Magician software, which is a Windows application. Windows won't natively recognize a Linux FS. So where do you see that the "Disc is definitely broken"?
 
The disc in the picture is the old one; the picture was sent to me by the HW reseller. That disc (the original 970 EVO) is now officially dead, and I will get a new one.
Magician is a Windows app, yes; it was used just to confirm that the BIOS and other S.M.A.R.T. readings are valid - that disc is at EOL. After three and a half years... I have now bought an NVMe cooler for the new disc, but I'm not sure it will fit. That Supermicro MB is very tiny/small; everything on it is quite cramped. We'll see... I need to lower the temperature of that disc from the current 40 °C. I somehow still think that temperature was the cause of the premature malfunction.
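If nvme-cli happens to be installed on the PVE host, the controller's own temperature and spare readings (the very values this thread is about) can be checked directly; the device name below is just an example:
Code:
# composite temperature, available spare and wear estimate straight from the drive
nvme smart-log /dev/nvme0 | grep -Ei 'temperature|available_spare|percentage_used'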
 
I need to lower the temperature of that disc from the current 40 °C
AFAIK 40 °C is perfectly normal for an NVMe; if anything it's probably below average. (Your original image of SMART data shows a mere 38 °C, if you trust those sensors!)

In general I don't think heat is the dominant factor in disk degradation; in my opinion, disk usage is probably the number one factor, followed closely by manufacturing quality/flaws in production. Power issues (spikes + general fluctuations) come third.

As you've been told previously, if your data matters, go enterprise. All the rest is pretty much the same.
 
