[SOLVED] NVME disk "Available Spare" problem.

So you can roughly estimate when it will complete:
Code:
Time_Taken_So_Far/(Amount_Copied_So_Far/Total_Size) = Expected_Total_Time
In my experience, most of the zero-data comes later in the dd read, so it's usually quicker than the estimate above. However, your read error(s) may well throw the estimate off.
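For example, plugging made-up numbers into that formula (say 40 minutes elapsed with 206 GB of the 500 GB copied):
Code:
# hypothetical figures only - substitute your own readings
echo "40 / (206/500)" | bc -l    # ~97 minutes expected in total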
 
Eh. It's at 206 GB. I just hope I'll have luck booting it again, so I can copy /etc and a few other directories to a safe location and write down the network configuration. If I need to set it up again, I have no clue what I did back in 2021 :eek:
 
Well .... here is where you hope you have full & restorable backups of all VMs & LXCs. You should also at least have docs/notes on your PVE host setup - so that you can reconfigure.

Please note that AFAIK the actual reading of the NVMe in dd should not itself cause further cell degradation - it's the writing to the cells that does this.
 
Yes, the VM backups are on the disc that dd is writing to now. Only the PVE config is not backed up - completely my fault. More than half is done (297 GB). Copying is faster now and the errors are gone - some 10 minutes without it reporting anything but progress.
 
The copy is over. The drive is now, after copying, seriously damaged (the BIOS now warns to back up and replace it):

[Screenshot: SMART data for the failing drive]

I will first check whether the backup file extracts without errors. Then I will write it to the new disc and try booting my computer from it first. Only then will I take the server out of the rack and swap the drives. I will also make screenshots of the network config and the /etc/ of the PVE host.
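For the "extracts without errors" check, gzip can test the archive in place without writing anything out (the path below is just an example):
Code:
# verify the compressed dd image is readable end-to-end
gzip -t /mnt/your_old_NVMe_dd_image.img.gz && echo "archive OK"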
 
I transferred the disc backup file and the /etc and /var dirs to an external disc. Now I will do the same with the VM backup files (*.vma.zst) - those are the VM backup files, aren't they? My VM store is on a ZFS volume.

And please, what is the command to restore the created gzip to the new NVMe drive?

I plan to do that with a USB-to-PCIe bridge (the new 970 EVO Plus will be attached via USB) on the server (I can free one USB slot by unplugging the UPS USB cable and plugging this one in), then boot to Linux (from the SystemRescue ISO) via the BIOS, mount the drive with the gzip file on it, and restore it to the new NVMe attached via the USB-to-PCIe bridge. I hope it will recognize the new NVMe drive; otherwise I do not know how to restore the gzip to the new drive from Windows 10.

Thank you!
 
And please, what is the command to restore the created gzip to the new NVMe drive?
If you used the above dd structure I mentioned, then:

Code:
lsblk
#find new NVMe USB-attached identifier

gunzip -c /mnt/your_old_NVMe_dd_image.img.gz | dd of=/dev/new_NVMe_identifier

Best of luck.

Edit: Remember the new NVMe must be at least as big as the old one. If it's bigger, you can enlarge the partition & filesystems once everything is working fine.
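A rough sketch of that enlarging step, assuming an ext4 root on partition 3 of the new disk (LVM or ZFS layouts need their own tools instead):
Code:
# example only - adjust the device/partition numbers to your actual layout
growpart /dev/nvme0n1 3      # grow partition 3 into the free space (cloud-guest-utils)
resize2fs /dev/nvme0n1p3     # then grow the ext4 filesystem to fill it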
 
The drive is now, after copying, seriously damaged (the BIOS now warns to back up and replace it):
So you see, read-only operations can damage a drive if the drive is in rough condition in the first place :D
Read-only operations are safe only when the drive is in good condition.
 
You should also at least have docs/notes on your PVE host setup - so that you can reconfigure.
Yeah, take notes o_O
That's our last line of defence for ESXi server configuration backups too.

Back in the days of 10K RPM HDDs + hardware RAID controllers, we ran ESXi on SD cards (so we could save an extra 2.5" bay for an HDD).
SD cards need to be replaced regularly, and if we were low on luck some would break with zero indication. Then we had to replace the SD card, reinstall ESXi and re-configure the server WITH NOTES so it could rejoin the cluster.

We even take notes with pencils, so nobody can replace our ink with the kind that disappears after some time.
I'm serious, very serious.
 
The dd copy operation ended with:

[Screenshot: dd exits with a "No space left on device" error]

How can there be no space left? It's a 500 GB drive (a 970 EVO Plus 500 GB, just like the original 970 EVO 500 GB).
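One way to rule out a small size mismatch between the two drives is to compare their exact byte counts (the device names below are just examples):
Code:
# drives with the same marketing capacity can still differ by a few sectors
blockdev --getsize64 /dev/nvme0n1   # old 970 EVO
blockdev --getsize64 /dev/sde       # new 970 EVO Plus behind the USB bridge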

If I try to boot from it on the server (with the original NVMe drive still in) while it is still attached to the USB-PCIe bridge and recognized as /dev/sde (not /dev/nvme0n1 - the BIOS allows doing that), it returns:

[Screenshot: boot error message]

Also, if I try to boot from the original disc, it complains about some duplicate stuff and does not boot. When I pull out the new disc (from USB), booting from the original drive works as usual. Tomorrow I will swap the discs, so the new one will be /dev/nvme0n1, and try to boot from it. Today it is not possible to do that, unfortunately. Any hints, or is this game over for me?

Thank you.
 
Also, if I try to boot from the original disc, it complains about some duplicate stuff and does not boot.
You should take out the old disk.
Since you dd'ed the whole disk, the two disks now have exactly the same content, including the GUID Partition Table. Connecting both disks to the same server could be problematic for many BIOS implementations, as GUIDs are meant to be, well, globally unique.
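If you ever do need both the original and the clone attached at the same time, one option is to give the clone fresh GPT identifiers (sgdisk is part of the gdisk package; the device name below is just an example, and this only fixes GPT GUIDs, not e.g. duplicate ZFS labels):
Code:
# write new random disk and partition GUIDs to the cloned disk only
sgdisk --randomize-guids /dev/sde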
 
It's OK now. I took out the old disk, and booting from the new one produced the same error. So I installed PVE again, same version (6.4), updated it, imported the ZFS pool and slowly began configuring it back (I had made a copy of the /etc/, /var/ and /root/ folders and screenshotted all the PVE settings). The server is operative, the VMs work, and the disk is now at 100%:

[Screenshots: SMART/health readings for the new drive]
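For anyone following along, the ZFS pool import step mentioned above is roughly (the pool name below is just a placeholder):
Code:
zpool import          # list pools found on the attached disks
zpool import -f tank  # import by name; -f because the pool was last used by the old install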


But I will replace that 970 EVO with a better one. This was an almost-heart-attack situation for me, but now I'm OK. Going out for a beer now :) Tomorrow I need to install a few additional packages (for UPS monitoring), the certificate, disable unneeded services and maybe do some other minor configs. Time will tell what is missing...

THANKS TO EVERYONE WHO HELPED HERE! CHEERS!
 
Cheers mate, it's good you got your PVE back online.

The heart-attack situation you just had is just part of a server maintainer's regular agenda. We see this kind of shit not quite day to day, but definitely as business as usual (the broken SD card story, and ESXi being so stable that it does not even need a functional system disk to keep running).

Possible data corruption on multiple drives scattered across multiple layers of storage servers... that's something we call a heart attack.

But luckily enough, I haven't seen that happen with my own eyes.
 
UPDATE - Disc is definitely broken
Is that the original one or the new one?

The image you show looks like the Samsung Magician software, which is a Windows application. Windows won't natively recognize a Linux FS. So where do you see that the "Disc is definitely broken"?
 
The disc in the picture is the old one; the picture was sent to me by the HW reseller. That disc (the original 970 EVO) is now officially dead, and I will get a new one.
Magician is a Windows app, yes; it was used just to confirm that the BIOS and other S.M.A.R.T. readings are valid - that disc is at EOL. After three and a half years... I have now bought an NVMe cooler for the new disc, but I'm not sure it will fit. That Supermicro MB is very tiny/small; everything on it is quite cramped. We'll see... I need to lower the temperature of that disc from the current 40 °C. I somehow still think that temperature was the cause of the premature malfunction.
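If nvme-cli happens to be installed on the PVE host, the controller's own temperature and spare readings (the very values this thread is about) can be checked directly; the device name below is just an example:
Code:
# composite temperature, available spare and wear estimate straight from the drive
nvme smart-log /dev/nvme0 | grep -Ei 'temperature|available_spare|percentage_used'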
 
I need to lower the temperature of that disc from the current 40 °C
AFAIK 40 °C is perfectly normal for an NVMe; if anything it's probably below average. (Your original image of SMART data shows a mere 38 °C, if you trust those sensors!)

In general I don't think heat is the dominant factor in disk degradation; in my opinion, disk usage is probably the number one factor, followed closely by manufacturing quality/flaws in production. Power issues (spikes + general fluctuations) come third.

As you've been told previously, if your data matters, go enterprise. All the rest is pretty much the same.
 
