How to migrate 500 VMs from VMware to Proxmox, with as little downtime as possible

lumix1991

New Member
Jan 9, 2024
Hi There,

We're currently running a VMware cluster with vSAN storage, with about 500 production VMs on it. We have a new Proxmox cluster with Ceph storage where we want to migrate to. I've interconnected both clusters with a 10Gbit network.

I am looking for a procedure to migrate these VMs over to Proxmox with as little downtime as possible.

I've tried the ovftool (Virtual-to-Virtual, V2V) method described here:
https://pve.proxmox.com/wiki/Migration_of_servers_to_Proxmox_VE?ref=blog.galt.me

Here I run ovftool on one Proxmox node, which connects to a VMware node to pull the data; after that I run the qm importovf command to import it into Proxmox. I've automated these steps with Ansible so I can run a single playbook, enter the source VM name, and it will automatically stop the source VM, migrate it over to Proxmox, re-attach the networks, and boot it up.
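For reference, the manual steps such a playbook wraps would look roughly like this (host names, VMID 120, the "ceph-vm" storage, and the bridge name are placeholders, not the actual setup):

Code:
# On a Proxmox node: pull the powered-off VM from the ESXi host as an OVF
ovftool vi://root@<esxi-host>/<vmname> /tmp/export/
# Import the OVF into Proxmox, writing the disks to the target storage
qm importovf 120 /tmp/export/<vmname>/<vmname>.ovf ceph-vm
# Re-attach the network and boot
qm set 120 --net0 virtio,bridge=vmbr0
qm start 120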

This is working perfectly, but...

The only downside is that the whole process takes a long time during which the VM has to be powered off. I've tested this on several of our VMs, varying in disk size (from 40GB to 1TB), and the overall speed I'm seeing for the ovftool export is around 2GB per minute (over the 10Gbit network between a Proxmox and a VMware node). So a VM of 1TB takes up to 8.5 hours to finish the ovftool export job. After that, the qm importovf command has to run to import it into Proxmox/Ceph, which also takes a few hours and results in 10+ hours of downtime.

We're running a hosting business and not all VMs are running in an HA cluster, so this kind of downtime is unacceptable.
Maybe we can optimize the above speeds/times with faster storage or a faster network, but I think it will still take hours of downtime in all cases.

The total used space on the VMware cluster is around 40TB, so at 2GB/minute... this will take about 340 hours.
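As a quick sanity check on that number (assuming the ~2GB/minute export rate holds across the board):

Code:
# 40TB total at ~2GB per minute, back-of-the-envelope:
echo "$(( 40 * 1024 / 2 / 60 )) hours"   # prints "341 hours" of raw transfer time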



Another approach I've thought about (sketched after this list) is:

- Set up a NAS server with fast storage that can run qemu-img convert, and mount it on both VMware and Proxmox via NFS
- Live-migrate the VM on VMware to this NFS storage
- Shut down the VM on VMware
- On Proxmox: run ovftool with the --nodisks flag, so it only exports the VM's configuration
- On the NAS server: run qemu-img convert to turn the VMDK files into raw/qcow2 (this runs locally on the NAS, because the VMDKs are already there, so it should be fast)
- On Proxmox: import the OVF, attach the disks from the NFS share (which are on the NAS server and already converted)
- Boot up the VM on Proxmox
- When it's working properly, use the "Move Storage" disk action to migrate the disk from the NFS share to Ceph.
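A minimal sketch of that flow, assuming placeholder names (VMID 120, an "nfs-stage" storage for the share, "ceph-vm" for the Ceph pool) and PVE's images/<vmid>/ naming layout on the NFS storage:

Code:
# On the NAS, after the VM was storage-vMotioned to NFS and shut down:
qemu-img convert -p -f vmdk -O raw \
    /export/vms/<vmname>/<vmname>.vmdk /export/images/120/vm-120-disk-0.raw
# On Proxmox: import the config-only OVF, then attach the converted disk
qm importovf 120 /tmp/<vmname>/<vmname>.ovf nfs-stage
qm set 120 --scsi0 nfs-stage:120/vm-120-disk-0.raw
qm start 120
# Once it runs cleanly, live-move the disk to Ceph and drop the NFS copy
qm disk move 120 scsi0 ceph-vm --delete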

I haven't tried the above approach yet, but I think it can work and can be automated with Ansible as well (as far as I can tell right now).

Maybe there is another (faster) solution to do this? Does anyone have experience with this kind of migration?

Thanks!
 
If you move the VM to NFS in VMware, do you get an .ovf file or the .vmdk directly?

I don't normally work with VMware, so I have to ask a few questions.
 
Personally, I think your best bet is probably to configure Proxmox to read and write your current vmdk files directly. It's kind of an icky solution, because qcow2 is a superior format, but technically you can add vmdk format disks to your virtual machines in Proxmox. Where are you storing all your VMware VMs now?
 
Where are you storing all your VMware VMs now?
We're currently running a VMware cluster with vSAN storage

Because of this
but technically you can add vmdk format disks to your virtual machines in Proxmox.
i ask this
If you move the VM to NFS in VMware, do you get an .ovf file or the .vmdk directly?
So far it reads as if he only gets an .ovf file from the vSAN, but since the .vmdk is wrapped inside it, he would first have to extract that file, which he already described in his procedure:
- On the NAS server: run qemu-img convert to turn the VMDK files into raw/qcow2 (this runs locally on the NAS, because the VMDKs are already there, so it should be fast)
- On Proxmox: import the OVF, attach the disks from the NFS share (which are on the NAS server and already converted)
 
Since OP's current storage is vSAN, it's only directly accessible to the ESXi servers in that cluster. So OP would need to move VMs (one by one or in batches) from vSAN to some other storage. The only storage that fits "accessible by both ESXi and PVE" is, in fact, NFS. The problem with any migration is the cutover moment: data access needs to be frozen and the clients redirected. Since the target of the migration is a single large file, one can't really do effective incremental catch-up copies.
I think OP's goal would be to find the fastest possible storage to move to (likely not NFS) so the conversion process takes as little time as possible, probably dedicating a compute host to execute the conversion.

I would do it in phases: do a test run, ensure the VM starts on an isolated network, and make sure all the i's are dotted and t's are crossed. Then do the production cutover. There is likely no way to avoid asking for at least some downtime.



 
If you move the VM to NFS in VMware, do you get an .ovf file or the .vmdk directly?

I don't normally work with VMware, so I have to ask a few questions.
VMware only uses .vmdk files; the .ovf file is only for export purposes.
 
Personally, I think your best bet is probably to configure Proxmox to read and write your current vmdk files directly. It's kind of an icky solution, because qcow2 is a superior format, but technically you can add vmdk format disks to your virtual machines in Proxmox. Where are you storing all your VMware VMs now?
I've thought about this, but I didn't know it was possible with Proxmox. It's something to consider, but performance-wise we'd like to migrate to qcow.

It's not that we can't afford any downtime at all, but 10-12 hours for a single VM is too much. If we can narrow this down to 1-2 hours per VM max, that would be great! All migrations will be done nightly within maintenance windows.
 
Is it all Linux?
If yes, you can migrate with nearly zero downtime by just using rsync and some scripting.

We once migrated from Xen to KVM that way. It takes time, but no downtime.
Yes, mostly Linux (Debian/CentOS). I think we have around 20 Windows VMs running; the rest is Linux.
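A minimal sketch of that rsync approach for a Linux guest (hostname and mount point are placeholders; it assumes a pre-built target VM on Proxmox with its root filesystem mounted at /mnt/target-root):

Code:
# Bulk pass while the source VM is still running
rsync -aAXH --numeric-ids \
    --exclude=/dev --exclude=/proc --exclude=/sys --exclude=/run --exclude=/tmp \
    root@<source-vm>:/ /mnt/target-root/
# Quiesce services on the source, then a short final delta pass before cutover
rsync -aAXH --numeric-ids --delete \
    --exclude=/dev --exclude=/proc --exclude=/sys --exclude=/run --exclude=/tmp \
    root@<source-vm>:/ /mnt/target-root/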
 
I've thought about this, but I didn't know it was possible with Proxmox. It's something to consider, but performance-wise we'd like to migrate to qcow.
Keep in mind: QCOW2 is fine and nice, but it's a file-based format, so you cannot use Ceph RBD directly (it's block storage); CephFS is the corresponding file-based storage if you want to explore QCOW2 on Ceph.

The upgrade path as I see it (in addition to what has already been said): use the vmdk directly from NFS in PVE, check that the VM works, and then online-migrate the disk from vmdk to Ceph as you laid out in your first post. The difference is "only" that you don't need to import or convert the disk offline on the PVE end. As long as you can migrate it online on the VMware side and online on the PVE side, you may only have the downtime of stopping and starting. At least in theory this sounds perfect (I don't have enough experience with vSAN to comment on the VMware side).
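On the PVE side that path could look like this (the VMID, storage names, and the assumption that the .vmdk has been renamed to PVE's vm-<vmid>-disk-N convention are all placeholders):

Code:
# Attach the existing VMDK from the shared NFS storage and boot
qm set 120 --scsi0 nfs-stage:120/vm-120-disk-0.vmdk
qm start 120
# After verifying the VM, move the disk to Ceph while it keeps running
qm disk move 120 scsi0 ceph-vm --delete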
 
Keep in mind: QCOW2 is fine and nice, but it's a file-based format, so you cannot use Ceph RBD directly (it's block storage) [...]
You're right about RBD; there's no need for qcow (and the conversion to it) then.

This week I will prep a virtual test setup for the NFS method I described earlier and test whether it works as I (we) expect. If it works properly, I'll deploy a physical server to function as our NFS NAS with a fast storage backend (NVMe), so that won't be the bottleneck, and try it all out.

I'll let you guys know the results ;-)

Thank you for your help/tips already!
 
VMware only uses .vmdk files; the .ovf file is only for export purposes.

Then it's perfect. You move the storage to the NFS share, stop the VM in VMware, start it in Proxmox VE, and then move the storage in the background. However, it is important that the VM is created directly with a SCSI device and the discard flag, otherwise you could run out of storage space in the Ceph pool.
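For example, something like this (VMID and storage name assumed):

Code:
# Attach via virtio-scsi with discard, so freed blocks are released back to Ceph
qm set 120 --scsihw virtio-scsi-pci
qm set 120 --scsi0 ceph-vm:vm-120-disk-0,discard=on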
 
A better option would be to use vSphere Replication (Site Recovery Manager) and save the VMs to your NFS share (or to any location). It maintains the same VM file structure as production, just in a powered-off state and without being added to inventory. You can set up incremental replication on the group of VMs you want to migrate that night, and set a recovery point objective as low as 5 minutes. That way you don't have to touch production. Site Recovery Manager is an easy install and easy to use. It's one feature we will hate to lose.

We are in the same boat, though. I would be interested in your Ansible playbook if you wouldn't mind sharing.

Thanks,

Hatter
 
Maybe there is another (faster) solution to do this? Does anyone have experience with this kind of migration?


Yes, I had the same problem with the ovftool export. For me, using sshfs + eatmydata heavily improved speed. You just need to enable SSH on the ESXi host and then:

Code:
$ ssh root@<proxmoxnode>
# sshfs root@<esxinode>:/vmfs/volumes /mnt/tmp
# cd /mnt/tmp/<datastore>/<vmname>/
# eatmydata ovftool --parallelThreads=4 <vmname>.vmx /tmp/cv-stepenitz-1
# qm importovf <new vmid> /tmp/<vmname>/<vmname>.ovf cephrbd

I guess the parameters for sshfs could be adjusted, too.
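If you want to experiment with the sshfs side, options along these lines are worth a try (standard ssh/FUSE options; the actual gains are an assumption and will vary):

Code:
# Cheaper cipher, no ssh compression, bigger read size for sequential I/O
sshfs root@<esxinode>:/vmfs/volumes /mnt/tmp \
    -o Ciphers=aes128-ctr -o Compression=no -o max_read=131072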

"eatmydata" is a tool to disable "sync" for ovftool which dramatically speeded this up, too.
 
I can't speak directly for vSAN datastores, but I imagine they are still symlinked on each ESXi host in /vmfs/volumes, and from there the individual VM directories are organized the same way as on a local VMFS datastore.

On my standalone ESXi hosts, I would mount the VMFS datastores on my PVE host(s) using SSHFS. You can create a snapshot of your VMs in vSphere and run qemu-img convert on the now read-only parent VMDK to raw, qcow2, etc., or the qm importdisk function can convert directly to RBD.

So I assume you will clone the parent VMDK to qcow2 on some other storage. Create a PVE VM of equivalent specs to your vSphere VM, copy the SMBIOS UUID if needed, power it up, and give it a temporary IP address that is different from the prod vSphere VM.

Remove VMware Tools, install qemu-guest-agent, get all of your paravirtual hardware working, and synchronize the prod data one final time at the file/db/application layer as needed. If the qemu-img convert executes during a time of negligible activity, you can skip this.

Then shut down the vSphere VM and fix the IP address of the PVE VM taking its place.

edit: I see you already have Ceph set up. So the command would be qm importdisk $newVMID $vmdkPath $poolName, and that will automatically create vm-ID-disk-0, disk-1, etc.
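Concretely, that might look like this (VMID, path, and pool name are placeholders):

Code:
# Convert and import the VMDK straight into the Ceph pool in one step
qm importdisk 120 /mnt/esxi/<datastore>/<vmname>/<vmname>.vmdk ceph-vm
# The imported volume appears as "unused0"; attach it and make it bootable
qm set 120 --scsi0 ceph-vm:vm-120-disk-0 --boot order=scsi0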

I have a PowerShell script that goes through an ESXi host, parses the VMX files, creates an equivalent PVE VM, and then converts each VMDK to RBD. No need for any intermediate virtual disk file storage; convert and import are combined in one step.
 
"qm importdisk" is how I migrated a large number of VMs from oVirt to PVE, after adding a shared NFS server between oVirt and PVE. Made the process very simple, and had no issues with the import process to the PVE storage side regardless of the underlying storage solution.
 
I've automated these steps with Ansible so I can run a single playbook, enter the source VM name, and it will automatically stop the source VM, migrate it over to Proxmox, re-attach the networks, and boot it up. [...]
I'd love to see your Ansible playbook for this if you wouldn't mind sharing ;)
 
Yes, I had the same problem with the ovftool export. For me, using sshfs + eatmydata heavily improved speed. [...]

I tried that and got this error:
Code:
# ~/ovftool/ovftool /mnt/esxi/<source-vm>.vmx /rpool/exported_OVFs/
Opening VMX source: <source-vm>.vmx
Error: 1634992128
Completed with errors

I can read the .vmx with cat/less.

If I use ~/ovftool/ovftool --parallelThreads=32 vi://<user>@<vcenter>?moref=vim.VirtualMachine:vm-123456 /rpool/exported_OVFs/, I get only ~32MByte/s over 1Gbit Ethernet.
With iperf3 the speed is ~120MByte/s.
 
I tried that and got this error:
Error: 1634992128
Completed with errors
The ovftool is developed and supported by VMware/Broadcom:
Code:
ovftool --version
VMware ovftool 4.4.3 (build-18663434)

You may want to post on their forum.

Good luck


 
