Failing SSD - Migration strategy

killer_instinct

New Member
Jun 4, 2024
Hello,

I run PVE 8.2.2, and it seems my SSD is failing. I have 3 VMs and 2 containers stored on this SSD.
I have purchased a new one. What would be the best strategy to migrate those VMs and containers?
  1. Clonezilla?
  2. Backup & restore?
  3. Storage replication?
  4. Any other?
Thank you in advance.
 
How's your storage currently structured, and do you have any free NVMe or SATA ports?

If you have a free NVMe or SATA port, you could probably add the new drive to the existing system and (depending on the file system) add it as a mirror of your existing drive. Once the mirroring is complete (i.e. all data copied), you could "break" the mirror and remove the failing drive.

Might still be doable if you only have a USB port free, but USB storage kind of makes me nervous personally unless there's no other choice. ;)



Btw, why do you reckon your existing SSD is failing? Is something throwing errors and telling you that directly, or is it a conclusion from some kind of weird behaviour, or something else? :)
 
Like what justinclift said,

If you had access to a portable drive you could add it as storage and move the disks over. Then import the disk into "new" VMs on your new installation.
[Image: MoveStorage.png]

If you want to export the entire VM, you could create a backup of each VM and CT to the portable storage, then restore them once you have a new Proxmox installation on your new drive.

If you had access to some network storage that you could mount, you could also push the backups there :)
 
Backup + Restore is probably the "best" although it may not be the fastest. If the disk is failing, you cannot depend on data integrity for the existing virtual disks.

As long as the backups are good you should be ok, just don't delete anything on the failing drive or throw it out right away.
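
For the backup and restore route, a minimal dry-run sketch might look like the following. It only prints the commands so they can be reviewed before anything is run. The storage IDs `usbbackup` and `newssd` are placeholders that do not come from this thread, and the guest IDs and archive paths are whatever applies on your system; `vzdump`, `qmrestore` and `pct restore` are the standard PVE command-line tools.

```shell
#!/bin/sh
# Dry-run sketch: prints commands instead of executing them.
BACKUP_STORAGE="${BACKUP_STORAGE:-usbbackup}"   # hypothetical storage ID for the backups
TARGET_STORAGE="${TARGET_STORAGE:-newssd}"      # hypothetical storage ID on the new SSD

backup_cmd() {      # $1 = guest ID (vzdump handles both VMs and containers)
    echo "vzdump $1 --storage $BACKUP_STORAGE --mode stop --compress zstd"
}
restore_vm_cmd() {  # $1 = backup archive, $2 = VMID
    echo "qmrestore $1 $2 --storage $TARGET_STORAGE"
}
restore_ct_cmd() {  # $1 = backup archive, $2 = CTID
    echo "pct restore $2 $1 --storage $TARGET_STORAGE"
}

# Print one backup command for every guest ID given on the command line.
for id in "$@"; do
    backup_cmd "$id"
done
```

Stop mode is used here on the assumption that a consistent copy matters more than uptime while the disk is degrading. `qm list` and `pct list` show which IDs are VMs and which are containers.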
 
Thank you very much for your replies!

My current setup is the following:
I run proxmox on a UDOO X86 Ultra: https://shop.udoo.org/en/udoo-x86-ii-ultra-warehouse.html

So the SSD is attached to it via cable: https://shop.udoo.org/en/sata-data-and-power-cables.html

Unfortunately there is no other SATA port, so if I go with the mirror configuration, I have to use USB... :(

On top of that, the new drive is +1TB (old is 1TB and the new one is 2TB) so I wonder if this mirror function will work...

On the other hand, backup and restore might take ages. Especially my Splunk installation has a lot of data!...
 
Let's assume the following scenario:
Failing SSD disk=>SATA
New SSD mounted as USB=>USB

All ext4 formatted and LVM types.

1. Move all VM/CT disks from SATA to USB.
2. Unplug SATA
3. Put USB to SATA's place
4. Change all VM/CT configurations to point to the new SATA

Shall it work?
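
Step 1 of the scenario above could be sketched roughly as follows, again as a dry run that only prints commands. `newssd` is a hypothetical storage ID for the USB-attached SSD, and the disk keys (`scsi0`, `rootfs`) are examples that have to be read from each guest's own configuration. `--delete 0` keeps the source volume in place, which fits the advice not to delete anything on the failing drive yet.

```shell
#!/bin/sh
# Dry-run sketch: prints one move command per guest disk.
TARGET="${TARGET:-newssd}"   # hypothetical storage ID for the new SSD

move_vm_disk() {    # $1 = VMID, $2 = disk key from the VM config (e.g. scsi0)
    echo "qm disk move $1 $2 $TARGET --delete 0"
}
move_ct_volume() {  # $1 = CTID, $2 = volume key from the CT config (e.g. rootfs)
    echo "pct move-volume $1 $2 $TARGET --delete 0"
}

# Example calls with made-up IDs and keys:
move_vm_disk 100 scsi0
move_ct_volume 102 rootfs
```

As far as I know, LVM locates its physical volumes by on-disk metadata rather than by the port they are attached to, so step 4 may turn out to be unnecessary as long as the storage ID stays the same. Treat that as an assumption to verify, not a guarantee.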
 
There are SMART errors about bad blocks, and the file system switches to read-only mode from time to time...
Ahhh. The replication-related file error isn't something I've seen before, but those SMART lines don't look all that problematic.

Would you be ok to run lsblk -o name,maj:min,type,tran,rota,path,label,size,phy-sec and paste the output here? That'll show us how you have your storage structured, and reduce a bunch of back-and-forth questions. :D

Also, the output from smartctl -a /path/to/your/storage/device would be useful too, as that'll show the current values for all of the smart readings so we can make some educated guesses from there. :)
 
I am not that familiar with the UDOO.

Where is proxmox installed? On the failing SSD or some sort of SD card/NVME drive?
It is installed in its internal storage drive that it comes with (32 GB if I remember correctly).

Code:
root@pve:~# lsblk -o name,maj:min,type,tran,rota,path,label,size,phy-sec
NAME                      MAJ:MIN TYPE TRAN   ROTA PATH                              LABEL   SIZE PHY-SEC
sda                         8:0   disk sata      0 /dev/sda                                894.3G     512
└─sda1                      8:1   part           0 /dev/sda1                               894.3G     512
  ├─tank-vm--101--disk--0 252:0   lvm            0 /dev/mapper/tank-vm--101--disk--0          32G     512
  ├─tank-vm--103--disk--0 252:1   lvm            0 /dev/mapper/tank-vm--103--disk--0         100G     512
  ├─tank-vm--100--disk--1 252:2   lvm            0 /dev/mapper/tank-vm--100--disk--1         450G     512
  ├─tank-vm--102--disk--0 252:3   lvm            0 /dev/mapper/tank-vm--102--disk--0           8G     512
  ├─tank-vm--104--disk--0 252:4   lvm            0 /dev/mapper/tank-vm--104--disk--0           4M     512
  └─tank-vm--104--disk--1 252:5   lvm            0 /dev/mapper/tank-vm--104--disk--1          32G     512
mmcblk0                   179:0   disk           0 /dev/mmcblk0                             29.1G     512
├─mmcblk0p1               179:1   part           0 /dev/mmcblk0p1                           21.2G     512
├─mmcblk0p2               179:2   part           0 /dev/mmcblk0p2                              1K     512
└─mmcblk0p5               179:5   part           0 /dev/mmcblk0p5                            7.9G     512
mmcblk0boot0              179:8   disk           0 /dev/mmcblk0boot0                           4M     512
mmcblk0boot1              179:16  disk           0 /dev/mmcblk0boot1                           4M     512
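
Adding up the logical volumes in the listing above gives 32G + 100G + 450G + 8G + 32G (plus 4M), so roughly 622G that any copy or backup has to read at most. A small sketch for getting that total directly is shown here; it assumes `lsblk` is asked for byte values in list form, and the helper name `sum_lvm_gib` is made up for illustration.

```shell
#!/bin/sh
# Sum the sizes of all rows of type "lvm" and print the total in GiB.
# Expects "TYPE SIZE" pairs on stdin, with SIZE in bytes.
sum_lvm_gib() {
    awk '$1 == "lvm" { total += $2 } END { printf "%.1f\n", total / (1024 ^ 3) }'
}

# Usage on the host (not executed here):
#   lsblk -b -n -l -o TYPE,SIZE /dev/sda | sum_lvm_gib
```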

Code:
root@pve:~# smartctl -a /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.4-3-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     OCZ/Toshiba Trion SSDs
Device Model:     OCZ-TRION150
Serial Number:    Y59B40BEK1HU
LU WWN Device Id: 5 e83a97 200331428
Firmware Version: SAFZ12.2
User Capacity:    960,197,124,096 bytes [960 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Jun  7 09:27:15 2024 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 112) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (   30) seconds.
Offline data collection
capabilities:                    (0x79) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (   2) minutes.
Conveyance self-test routine
recommended polling time:        (   3) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       57144
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       3748
167 SSD_Protect_Mode        0x0022   100   100   000    Old_age   Always       -       0
168 SATA_PHY_Error_Count    0x0012   100   100   000    Old_age   Always       -       1780
169 Bad_Block_Count         0x0003   000   000   010    Pre-fail  Always   FAILING_NOW 0
173 Erase_Count             0x0012   071   071   000    Old_age   Always       -       0
192 Unexpect_Power_Loss_Ct  0x0012   100   100   000    Old_age   Always       -       327
194 Temperature_Celsius     0x0023   059   043   020    Pre-fail  Always       -       41 (Min/Max 8/57)
241 Host_Writes             0x0032   100   100   000    Old_age   Always       -       5790694

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       00%     57081         6787002
# 2  Extended offline    Completed: read failure       00%     57081         6787001
# 3  Extended offline    Completed: read failure       00%     57081         6787000
# 4  Extended offline    Completed: read failure       00%     57081         6786999
# 5  Extended offline    Completed: read failure       00%     57081         6786998
# 6  Extended offline    Completed: read failure       00%     56717         6786997
# 7  Short offline       Completed: read failure       00%     56717         6786996

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
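
To pick the problem attributes out of a long report like the one above, a small filter can help. This is only a sketch: it assumes the ten-column attribute table layout shown here, and the function name `failing_attrs` is made up. It prints every attribute whose WHEN_FAILED column is not "-".

```shell
#!/bin/sh
# Print ID, name and WHEN_FAILED for every SMART attribute flagged as failing.
# Expects the attribute table from smartctl on stdin.
failing_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" { print $1, $2, $9 }'
}

# Usage on the host (not executed here):
#   smartctl -A /dev/sda | failing_attrs
```

Fed the table above, the only row it reports is 169 Bad_Block_Count with FAILING_NOW.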
 
the new drive is +1TB (old is 1TB and the new one is 2TB) so I wonder if this mirror function will work...
Most of the live mirroring approaches people around here will recommend won't have a problem with that. You should be able to get the mirroring bit done to ensure you don't lose data, then adjust the relevant partition (or equivalent) size so the extra space can be used.

I personally tend to use ZFS for everything, and could give you detailed instructions if that's what you were using at the moment. But... you're using LVM. Nothing wrong with that, it's just I'm less familiar with it due to not using it daily.

One of the others who've already responded should be able to help you out though. :)
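
For LVM, the closest equivalent to the mirror-then-break idea is probably `pvmove`, which moves all extents from one physical volume to another while the volume group stays online. The sketch below is a dry run that only prints the commands. The volume group name `tank` and the old PV `/dev/sda1` are taken from the lsblk output earlier in the thread; `/dev/sdb1` is an assumed name for a partition on the new SSD.

```shell
#!/bin/sh
# Dry-run sketch: print the LVM commands for moving a VG onto a new disk.
plan_pvmove() {   # $1 = volume group, $2 = old PV, $3 = new PV
    echo "pvcreate $3"       # initialise the new partition as a physical volume
    echo "vgextend $1 $3"    # add it to the volume group
    echo "pvmove $2 $3"      # move all extents off the old PV
    echo "vgreduce $1 $2"    # drop the old PV from the volume group
}

plan_pvmove tank /dev/sda1 /dev/sdb1
```

One caveat, stated as an assumption: `pvmove` expects the source to be readable, so on a disk with read failures it may stop partway, in which case a backup or a rescue-style clone is the safer route.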
 
@killer_instinct All that aside, what make and model is your new storage?

That'll directly have an impact on how fast the copying goes. If it's a consumer grade SSD it'll be a lot slower than something more enterprise oriented such as the Samsung PM893 series.
 
"FAILING_NOW". Yeah, that's not good. :eek:

It always amazes me to see things like that, yet have SMART also say stuff like this at the same time:

SMART overall-health self-assessment test result: PASSED

That's like super conflicting information. :(
Yes, this is very frustrating.

This is the new SSD drive: WD Red SA500 NAS SATA SSD 2.5”/7mm Cased

https://www.westerndigital.com//en-gb/products/internal-drives/wd-red-sata-2-5-ssd?sku=WDS200T2R0A
 
Hi,

As your SSD is in rather bad condition, the best solution (in my opinion) is this:

- Clonezilla
- any external USB/NAS/etc.

Clonezilla must be run using ddrescue (block device mode with the recovery option). That way, if ddrescue cannot read a block, it will write zeroes for that block and continue! You can then restore the Clonezilla image onto a new SSD (even one with a larger capacity than the source SSD). Every hour that your system runs on this bad SSD increases the chance of losing all of your data.

Good luck/Bafta !
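
If ddrescue is run directly instead of through Clonezilla, a common two-pass pattern is sketched below as a dry run that only prints the commands. `/dev/sda` is the failing disk from the thread; `/dev/sdb` and the map file path are assumptions. The map file should live on a healthy disk, and the guests should be shut down first.

```shell
#!/bin/sh
# Dry-run sketch: print a two-pass GNU ddrescue plan.
plan_ddrescue() {   # $1 = source disk, $2 = destination disk, $3 = map file
    # Pass 1: copy everything that reads cleanly and skip the slow scraping phase.
    echo "ddrescue -f -n $1 $2 $3"
    # Pass 2: return to the bad areas with direct disc access and up to 3 retries.
    echo "ddrescue -f -d -r3 $1 $2 $3"
}

plan_ddrescue /dev/sda /dev/sdb /root/sda-rescue.map
```

Because both passes share the same map file, the second pass only revisits areas that the first could not read. After cloning to the larger disk, the partition would still need to be grown and `pvresize` run on it before LVM can use the extra space.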
 