[SOLVED] Proper way of migrating to new boot drive

Macsloverd

New Member
Dec 23, 2024
-------------- UPDATE --------------

Summary:

If anyone runs into this problem, please follow @leesteken's method. To lay it out step by step:

1. my VMs on the failing node were linked clones, so first back up the VMs and restore them one by one. This makes them "standalone" (see the command sketch after this list);
2. migrate all the VMs to another node;
3. remove all backups, shared storage, replication jobs, etc. related to the failing node from the cluster;
4. remove the failing node from the cluster;
5. since I was doing this through iDRAC, I had no way to physically remove or disable the failing HDD (the HBA controller is not from Dell), so I had to boot into a live CD first and delete all partitions from the failing HDD to prevent unexpected mounting (also sketched below);
6. boot from the Proxmox installation ISO and do a fresh install with a new host name (important!);
7. after the successful installation of Proxmox, the first boot detected the ZFS pool but was unable to mount it;
8. in the Proxmox web GUI, the ZFS pool did not show up under "Disks -> ZFS";
9. since I had already migrated all VMs, I lost interest in recovering the pool. I wiped all 8 SAS HDDs and created a new pool with the ORIGINAL NAME;
10. copy the "Join Information" from the cluster and paste it into the freshly installed node;
11. configure the replication jobs, backups, etc. back to the original settings;
12. migrate the VMs back;
13. Done!
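
For reference, steps 1, 2 and 5 looked roughly like this on my side (VM ID 100, backup storage "local", node name and /dev/sdX are placeholders, not my real values; double-check device names before wiping anything):

Code:
# step 1: back up a linked clone (with the VM shut down), then restore it
# over the same VMID; the restored copy is a standalone VM
vzdump 100 --storage local --mode snapshot --compress zstd
qmrestore /var/lib/vz/dump/vzdump-qemu-100-<timestamp>.vma.zst 100 --force

# step 2: move the now-standalone VM to another node
qm migrate 100 <other-node> --online --with-local-disks

# step 5 (from a live CD): clear the partition table of the failing boot HDD
# so it cannot be mounted by accident later
wipefs -a /dev/sdX
sgdisk --zap-all /dev/sdX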

And finally, thanks to @leesteken's help, I was able to resolve this situation with a lot less trouble.

--------------------------------------------------------------------------------

Dear All,

I have a delicate situation and have searched everywhere I can think of, but to no avail. The time has come to post a thread and ask for help. Thanks in advance.

Here is my setup:
(Production server, requires minimal downtime, member of a cluster of 3)
PVE 8.2.2
1 SATA HDD for PVE/boot, nothing on it but PVE itself;
8 SAS HDDs in a ZFS RAIDZ pool + 2 SATA SSDs as log and cache devices, used only for VMs;

The problem:
The boot drive is now failing. It is still running, but random things are happening. I need to replace the boot drive with a SATA SSD that is smaller than the original HDD.

1. I can't use dd or Clonezilla; the original HDD refuses to read at a certain point (bad blocks);
2. I can't just copy the whole system at the file level to the new SSD, since PVE sits on LVM (default PVE installation);
3. I need to keep all configurations and VMs, which means that whatever I do, I must be able to restore the current setup without losing any cluster or VM data;
4. I am able to dd the boot partitions (sda1, sda2) to the new SSD and copy /etc and /var/lib/pve-cluster/config.db to a backup (see the sketch below);
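
For item 4, this is roughly what I have been able to do so far (the partition numbers and the backup mount point /mnt/backup are just examples from my setup):

Code:
# image the still-readable boot/EFI partitions to files on another disk
dd if=/dev/sda1 of=/mnt/backup/sda1.img bs=4M conv=noerror,sync status=progress
dd if=/dev/sda2 of=/mnt/backup/sda2.img bs=4M conv=noerror,sync status=progress

# keep a copy of the host configuration and the cluster config database
cp -a /etc /mnt/backup/etc-backup
cp /var/lib/pve-cluster/config.db /mnt/backup/config.db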

Question:
What is the best way to replace the original HDD with the new, 40 GB smaller SSD without losing any configuration? I can put in additional drives, since there are empty slots; I can shut down the server for an hour or so; and I can do file copies (for now), although I don't know whether I can still get a full copy off the drive.

I am desperate and running out of time. Please, any help would be appreciated.
 
Don't try to duplicate data from the old drive: it might already have (silent) data corruption, and you are wearing it out faster.
It's easy to make mistakes in each of the following steps, so I advise you to have backups of everything, and I take no responsibility for my mistakes:
Make a copy of all the files in /etc/ (including /etc/pve/) for later reference (to a USB memory stick). Unfortunately, this needs to be done from a running Proxmox, otherwise /etc/pve/ is empty. These files are human readable, so you can spot corruption if it has happened.
Install a fresh Proxmox on the new drive (make sure not to wipe/overwrite the existing drives) and configure it like you did before. Don't copy the old files over the new files; just manually configure the storage based on the previous configuration.
After that you can copy the VM/CT configuration files from /etc/pve/qemu-server/ and /etc/pve/lxc/ to restore your VMs/CTs (something like the sketch below).
There are probably other things, like backup schedules and firewall configurations, that you also need to (manually) configure again.
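
Roughly something like this (untested on my side; /mnt/usb is just an example mount point, and only copy config files for VMs whose disks actually still exist):

Code:
# on the old, still running node: save the configuration for reference
# (/etc/pve only has content while pve-cluster is running)
cp -a /etc /mnt/usb/etc-backup

# later, on the freshly installed node: bring back the VM/CT definitions
cp /mnt/usb/etc-backup/pve/qemu-server/*.conf /etc/pve/qemu-server/
cp /mnt/usb/etc-backup/pve/lxc/*.conf /etc/pve/lxc/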
 
Thank you for your quick reply!

The PVE is still running, and I have copied the entire /etc to a backup.

I can do a fresh install on the new SSD; I am only not so sure about:
1. how to "re-attach" the ZFS pool;
2. how to restore/import the VMs back into the new PVE install;
3. how to restore the node's "member identity" in the cluster;
 
The PVE is still running, and I have copied the entire /etc to a backup.
Make sure to check that /etc/pve/ is there as well and is not empty.
I can do a fresh install on the new SSD; I am only not so sure about:
1. how to "re-attach" the ZFS pool;
Configure it as a storage, like you did the first time after creating it.
2. how to restore/import the VMs back into the new PVE install;
Copy the configuration files (as I said before) or restore from backup.
3. how to restore the node's "member identity" in the cluster;
Good question. Sorry, I did not realize you were running a cluster, and running it in production.

Please forget everything I said. Migrate the VMs/CTs to another node (or forget about them and restore them from backup). Remove the node from the cluster. Wipe the disks. Install a fresh Proxmox (and recreate and set up any storage) and add the new node (with a new name and new IP address) to the cluster. Just like you would do when a node goes dead. This is the safest and supported way to do it.
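
In terms of commands it would be something along these lines (untested from my side; "oldnode", /dev/sdX and the IP are placeholders, and read the linked admin guide sections before running anything):

Code:
# on a node that stays in the cluster: check membership, then remove the old node
pvecm nodes
pvecm delnode oldnode

# wipe the old disks (double-check the device names!), e.g. from a live environment
wipefs -a /dev/sdX

# on the freshly installed node (new name, new IP): join the cluster and verify
pvecm add <IP-of-an-existing-cluster-node>
pvecm status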
 
You are a great man!

So, if I understand you correctly, I should just:
1. Migrate all VMs to another node;
2. Fresh-install a new PVE on this server;
3. Re-attach the ZFS pool or, if that's not possible, just break it and redo everything (rough sketch below);
4. Add this server to the cluster;
5. Migrate the VMs back to this node.

Am I missing anything?
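
For step 3 specifically, what I have in mind is roughly the following (the pool name "vmpool" and the disk paths are placeholders for my real ones):

Code:
# try to re-attach the existing pool first
zpool import            # list pools the new install can see
zpool import -f vmpool  # import it under the original name
zpool status vmpool

# if that fails: wipe the 8 SAS disks and rebuild the pool with the ORIGINAL name
zpool create -f vmpool raidz /dev/disk/by-id/<disk1> ... /dev/disk/by-id/<disk8>
zpool add vmpool log /dev/disk/by-id/<ssd1>
zpool add vmpool cache /dev/disk/by-id/<ssd2>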
 
So, if I understand you correctly, I should just:
1. Migrate all VMs to another node;
Remove the node from the cluster (while it is still running): https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node
2. Fresh-install a new PVE on this server;
Make sure not to reuse the name and IP address.
3. Re-attach the ZFS pool or, if that's not possible, just break it and redo everything;
I think the information about the storage is part of the cluster. Maybe selecting the new node for the existing storage might be enough (see the note below). I'm also assuming the nodes in your cluster are similar and you don't need to do anything special for this one.
4. Add this server to the cluster;
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_join_node_to_cluster
5. Migrate the VMs back to this node.

Am I missing anything?
Did you not do this before (as practice) for when a node dies (which will happen eventually)?
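
Regarding the storage: the definitions live in the cluster-wide /etc/pve/storage.cfg, so the entry for your ZFS pool should still exist after the node is replaced. Something like this might be all that's needed (the storage ID "vm-storage" and the node names are made up, use your own):

Code:
# list the storage definitions the cluster already knows about
pvesm status
cat /etc/pve/storage.cfg

# restrict an existing storage entry to the nodes that actually have the pool
pvesm set vm-storage --nodes node1,node2,newnode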
 
I am currently running these steps, fingers crossed...

No, I have never done this before. And I don't know whether that is lucky or unlucky: in all my time working with Proxmox, I have never run into a situation that required "replacing" a node. A power-down where migration kicked in automatically is the closest I have ever come to this.

Thank you for your help! I will report back on the outcome.