Import from ESXi Extremely Slow

@wbumiller Sorry to ping you, do you know if this is a known limitation with the dd command in qemu-img? Maxing out throughput at 75MB/s on NVMe storage.
 
Ah, never mind, I didn't read the documentation. I added bs=16M and got a much better result: 973 MB/s.

By default it's doing 512-byte chunks, which is kind of wild. I'll see if I can make progress on the stream import. :)

Edit:


Default:
Command:
qemu-img dd -f raw -O raw osize=32212254720 if=/root/test-netcat-plaindd.raw of=/root/test-local-qemu-dd.raw

Timing
- Total Elapsed Time: 12 minutes 22.28 seconds
- User CPU Time: 283.83 seconds
- System CPU Time: 467.52 seconds
- Total CPU Time: 751.35 seconds
- CPU Usage: 101% (slightly over one CPU core)

Performance
- Data Transferred: ~32.2 GB
- Throughput: ~43.4 MB/s
- Maximum Memory Used: ~32 MB

I/O Statistics
- File System Reads: 18,259,176 operations
- File System Writes: 62,945,008 operations
- Context Switches: 251,830,712 voluntary / 14,882 involuntary
- Major Page Faults: 0 (no faults requiring disk I/O; all data stayed in memory)



With bs=16M:
Command: qemu-img dd -f raw -O raw bs=16M osize=32212254720 if=/root/test-netcat-plaindd.raw of=/root/test-local-qemu-dd.raw

Timing
- Total Elapsed Time: 33.09 seconds
- User CPU Time: 0.06 seconds
- System CPU Time: 23.81 seconds
- CPU Usage: 72%

Performance
- Data Transferred: ~32.2 GB
- Throughput: ~973 MB/s
- Maximum Memory Used: ~47 MB

I/O Statistics
- File System Reads: 16,802,016 operations
- File System Writes: 63,029,640 operations
- Context Switches: 12,759 voluntary / 632 involuntary
- Major Page Faults: 19 (vs 0 before)

Comparison: Default vs bs=16M

- Total Elapsed Time: 12 min 22.28 s -> 33.09 s
- Throughput: ~43.4 MB/s -> ~973 MB/s
- Total CPU Time: 751.35 s -> ~23.9 s
- Maximum Memory Used: ~32 MB -> ~47 MB
- Voluntary Context Switches: 251,830,712 -> 12,759

So you know, only about 19,000x fewer context switches. Lol.
 
So as we can see ... switching hypervisor technology (ESXi, vSphere, PVE, KVM, ...) or even hypervisor or storage hardware is a trivial amount of work when the whole set of VM/LXC images lives on (HA) NFS storage (you normally get full bandwidth with nothing more than cp commands or just another mount), while everything else more or less ends in a migration nightmare. Hardware gets exchanged roughly every 5 years, while a hypervisor switch happens maybe on a >10-year rhythm, and normally it happens on a schedule, unless something breaks and the hardware must be replaced immediately and unprepared; whether that is stress-free or stressful depends on the "optimal" design you decided on.
 
Appreciate it!

We have a few 10TB+ VMs to move and they wouldn't be done within a week of starting them lol

I learned on the last large cutover that the current tool slows down over time after the first few TB. It went from roughly 30 minutes per 60 GB to about 60 minutes per 60 GB. I was praying it would be done before Monday morning, and luckily it finished at 10 PM on Sunday. :oops:

I'll keep cracking away at it and see if I can come up with something realistic. I really need to just set up another 25GbE host so I don't have to bum others to test lol.
 
Why don't you use a shared storage method, such as NFS? You will need to tune NFS a bit, but we managed to migrate VMs of 8-9 TB overnight with this method.

Let me share some figures from our migrations.

Native ESXi importer: about 110-130 MB/s, which is fine for our standard VM size (about 30 min), but anything above 500 GB becomes too slow.

NFS import: We have established two NFS servers, one physical box with a RAID controller and SATA SSDs and one NFS server as a VM on Ceph (NVMe disks). The first step is to storage-migrate the VMs to NFS; you can do it live.

By tuning NFS (I do not recall which options we used), ESXi was copying data at about 1.1 GB/s to the NFS server on Ceph and about 700 MB/s to the physical NFS server. At that stage you need to power off the VM and start the import on PVE. qm disk import was running at about 330 MB/s per disk, so if you have multiple disks you can achieve quite a good transfer rate. The NFS server on Ceph also did much better at this import step, especially when importing several disks at the same time, but I do not recall the actual numbers.
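For reference, the PVE side of that NFS workflow is roughly the following (a sketch only; the storage ID, VMID, export path, and volume names are placeholders, and the exact options depend on your setup):

Bash:
# Add the NFS export that the VMs were storage-migrated to as a PVE storage
pvesm add nfs esxi-nfs --server 192.168.0.200 --export /export/migration --content images

# Import each vmdk into the target storage of an existing VMID (the disk lands as "unused" on the VM)
qm disk import 120 /mnt/pve/esxi-nfs/images/myvm/myvm-flat.vmdk local-lvm --format raw

# Attach the imported disk; the actual volume name may differ depending on the storage type
qm set 120 --scsi0 local-lvm:vm-120-disk-0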

We have not tried running the vmdk in place and importing it while it is running.

I do not want to discourage this discussion about improving the performance; I just want to share a "workaround" if someone is struggling with import speed.

I wish the native import speed would improve, as the NFS import is more complicated and we still need to migrate about 110 VMs with 50 TB of data.
 
Why don't you use a shared storage method, such as NFS? You will need to tune NFS a bit, but we managed to migrate VMs of 8-9 TB overnight with this method.

...

I do not want to discourage this discussion about improving the performance; I just want to share a "workaround" if someone is struggling with import speed.

...

Oh, we can absolutely use a different method, it's just annoying that the built-in tool is so slow lol. I've already verified that the NFS middle jump, attaching the iSCSI target directly, and just copying the data straight over to the new storage and then converting it all work. It's just that the built-in tool is lackluster in transfer speed. That seems to be more of a byproduct of it being as easy to use as possible and not requiring anything to be done on the ESXi side.
 
Why don't you use a shared storage method, such as NFS? You will need to tune NFS a bit, but we managed to migrate VMs of 8-9 TB overnight with this method.

Let me share some figures from our migrations.

Native ESXi importer: about 110-130 MB/s, which is fine for our standard VM size (about 30 min), but anything above 500 GB becomes too slow.

NFS import: We have established two NFS servers, one physical box with a RAID controller and SATA SSDs and one NFS server as a VM on Ceph (NVMe disks). The first step is to storage-migrate the VMs to NFS; you can do it live.

By tuning NFS (I do not recall which options we used), ESXi was copying data at about 1.1 GB/s to the NFS server on Ceph and about 700 MB/s to the physical NFS server. At that stage you need to power off the VM and start the import on PVE. qm disk import was running at about 330 MB/s per disk, so if you have multiple disks you can achieve quite a good transfer rate. The NFS server on Ceph also did much better at this import step, especially when importing several disks at the same time, but I do not recall the actual numbers.

We have not tried running the vmdk in place and importing it while it is running.

I do not want to discourage this discussion about improving the performance; I just want to share a "workaround" if someone is struggling with import speed.

I wish the native import speed would improve, as the NFS import is more complicated and we still need to migrate about 110 VMs with 50 TB of data.

There are also some additional niceties afforded by the import tool. It's definitely possible to migrate using shared storage of one kind or another but it involves more steps, both on the ESX side in the form of moving the VMs around to get them onto the shared storage and on the Proxmox side in the form of needing to construct the VM manually and having to import the disks on a per-disk basis rather than importing the whole VM in one step. We're also exploring other migration options, but if it's possible to achieve reasonably performant results using the importer I think that's a nicer path than doing what the importer does for us by hand.

Other options exist and if that is your preferred method then power to you, but we're trying to determine if this tool can be modified to achieve acceptable performance.
 
Okay... progress. I've got the file transfer and conversion working as expected as long as you want the output format to be raw. For our purposes that's okay; we are using iSCSI, so no big deal.

However, I'm having trouble getting the qcow2 pipeline working. qemu-img seems to get to the sparse area of the qcow2, decide it's "ALL DONE", and kill the task, so you end up with 30 GB of zeros and a small Windows boot sector. So, trying to make it not do that.
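One workaround I'm looking at (just a sketch, not what the tool does today; host, port, and paths are made up) is to land the stream as a sparse raw file first and let qemu-img do the zero detection in a separate local step, instead of converting to qcow2 inside the pipe:

Bash:
# Receiver on the PVE side; nc flag syntax differs between netcat variants.
# conv=sparse makes dd seek over blocks of zeros instead of writing them out.
nc -l -p 7000 | dd of=/nvme-storage/import/win-test.raw bs=16M conv=sparse status=progress

# Once the raw file is complete, convert it locally; qemu-img skips zero clusters on its own.
qemu-img convert -p -f raw -O qcow2 /nvme-storage/import/win-test.raw /nvme-storage/import/win-test.qcow2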
 
Today I will be setting up a test environment with some much faster storage and networking. I'll be benchmarking the pure HTTP/2.0 method as well as the new netcat+dd method, and netcat+pigz+dd. The hardware should be closer to what you would typically see in an enterprise environment with a plethora of cores.
 
Success!

Okay, lots of progress and testing to confirm.

Long story short, with a single netcat pipe we can hit around 500 MiB/s. You can parallelize it into two streams and it hits almost 1 GiB/s. I've tested with dd, tar, and plain netcat for the pipe into netcat, and they all hit 500 MiB/s per stream, so I'm not sure if that's just a limit of ESXi or what.
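For anyone who wants to reproduce the single-stream numbers, the basic shape of the pipe is roughly this (IPs, ports, and paths are placeholders, and the netcat flag syntax differs between the ESXi nc and the Linux variants):

Bash:
# On the PVE node: listen and write the incoming disk image
nc -l -p 7000 > /nvme-storage/import/disk-flat.raw

# On the ESXi host (e.g. started over SSH): read the flat vmdk in large blocks and push it
dd if=/vmfs/volumes/datastore1/myvm/myvm-flat.vmdk bs=16M | nc 192.168.0.150 7000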

Here is the LONG testing showing more accurate results.

https://github.com/PwrBank/pve-esxi-import-tools/blob/netcat-dd/netcat-dd-testing-results.md

So at minimum, an almost 5x increase in speed, and hopefully it could scale to over 10x. However, it adds a lot of complexity to manage splitting a file into multiple streams AND multiple imports from VMware at the same time.
 
This looks awesome! I'll give this a test tomorrow and post the results.

I'll work on getting the tool to automatically handle this type of import process to see how feasible it would be in a production environment. It looks like it will scale up with multiple streams, so I can have dd read the first 1/4, a second process doing the next 1/4, another for the next 1/4... and so on. That would in theory hit my 2 GiB/s limit on the SSD. The tricky part is having the tool keep track of the streams and ports. But maybe the complexity is worth the 5x, 10x, 15x, or 20x increase in speed...
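Here's a rough sketch of what that split could look like with plain dd skip/seek (block counts, ports, and paths are purely illustrative; the real tool would have to compute the offsets from the disk size, handle the remainder, and verify the result):

Bash:
# Example: a 32 GiB disk split into 4 ranges of 512 blocks each at bs=16M (4 x 512 x 16 MiB = 32 GiB)
BS=16M; BLOCKS=512

### On the PVE node: one receiver per stream, each writing its quarter into the same raw file
for i in 0 1 2 3; do
  nc -l -p $((7000 + i)) | dd of=/nvme-storage/import/disk.raw bs=$BS seek=$((i * BLOCKS)) conv=notrunc &
done

### On the ESXi host: one sender per stream, each reading its quarter of the flat vmdk
for i in 0 1 2 3; do
  dd if=/vmfs/volumes/datastore1/myvm/myvm-flat.vmdk bs=$BS skip=$((i * BLOCKS)) count=$BLOCKS \
    | nc 192.168.0.150 $((7000 + i)) &
done
wait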
 
Hm, netcat isn't encrypted though, or am I missing something?

Correct, in testing the encryption process is a pretty big bottleneck. I can try it again with the beefier hardware, but the main issue is that VMware has hampered the speed at which SSH can transfer files.
 
Correct, in testing the encryption process is a pretty big bottleneck. I can try it again with the beefier hardware, but the main issue is that VMware has hampered the speed at which SSH can transfer files.
Well, I prefer a slow transfer to one without encryption, but to each their own. Imho it would be quite a bad idea to integrate a "fast migration procedure" into PVE if it needs clear-text transfer over the network.
 
While I agree security is a good concern, in this case, is it? At least in our environment (and I know not everyone does it this way), we have all of our servers on the same L2 network dedicated to VM storage traffic, so none of it ever leaves the switch stack to be intercepted.
 
While I agree security is a good concern, in this case, is it? At least in our environment (and I know not everyone does it this way), we have all of our servers on the same L2 network dedicated to VM storage traffic, so none of it ever leaves the switch stack to be intercepted.

I tend to agree; if you're doing these transfers over an open network speed probably isn't your primary concern anyway. It would also theoretically be pretty straightforward to inject an encryption/decryption step into the pipeline before/after netcat, but at the moment the primary concern is getting closer to a theoretical maximum of throughput to demonstrate that it's possible, then we can start carving performance off again with features.
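If encryption does become a requirement, one option would be a symmetric cipher dropped into the pipe on both ends, which should cost far less than SSH (a sketch only; key handling and cipher choice need real thought, the key file path is made up, and I'm not certain the ESXi-side openssl is new enough for these flags):

Bash:
# Sender side: encrypt the stream before it hits netcat
dd if=/vmfs/volumes/datastore1/myvm/myvm-flat.vmdk bs=16M \
  | openssl enc -aes-256-ctr -pbkdf2 -pass file:/tmp/migration.key \
  | nc 192.168.0.150 7000

# Receiver side: decrypt before writing to disk
nc -l -p 7000 \
  | openssl enc -d -aes-256-ctr -pbkdf2 -pass file:/tmp/migration.key \
  | dd of=/nvme-storage/import/disk.raw bs=16M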
 
@spencerh
Do you mind doing a short test for me? In my testing I have run into something interesting, but I need to verify whether it's a hardware/config issue or something with ESXi itself.

When running a single-threaded iperf3 test from Proxmox to ESXi, it gets ~25 Gbps as expected, but the other way around it hits about 16 Gbps. After updating the Intel E810 driver it now hits ~20 Gbps, but still not the max. Just making sure it's not affecting the transmit limit. Considering there seems to be some type of 500 MB/s limit in ESXi on a TCP stream, I don't think it'll matter, but just being sure.

I looked around the internet and there's lots of talk about people hitting some type of 500 MB/s limit in ESXi. It seems a little too "clean" of a number, as it's suspiciously 4x 1GbE. If that can be solved or at least increased, it would make a single-stream import even faster.
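One quick way to check whether that ~500 MB/s is a per-TCP-stream cap rather than a link or driver limit is to compare a single-stream iperf3 run against a parallel one (host and port are examples):

Bash:
# Single stream
iperf3 -c 192.168.0.123 -p 8000 -t 10

# Four parallel streams; if it's a per-stream cap, the combined rate should scale well past it
iperf3 -c 192.168.0.123 -p 8000 -t 10 -P 4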

Update on the code: I'm re-implementing the change on a fresh copy of the 1.0.1 version of the FUSE application. It's much cleaner code to work on, and I'm making sure it's done correctly, with all of the checks and management needed so this works as intended every time.
 
@spencerh
Do you mind doing a short test for me? In my testing I have run into something interesting, but I need to verify whether it's a hardware/config issue or something with ESXi itself.

When running a single-threaded iperf3 test from Proxmox to ESXi, it gets ~25 Gbps as expected, but the other way around it hits about 16 Gbps. After updating the Intel E810 driver it now hits ~20 Gbps, but still not the max. Just making sure it's not affecting the transmit limit. Considering there seems to be some type of 500 MB/s limit in ESXi on a TCP stream, I don't think it'll matter, but just being sure.

I looked around the internet and there's lots of talk about people hitting some type of 500 MB/s limit in ESXi. It seems a little too "clean" of a number, as it's suspiciously 4x 1GbE. If that can be solved or at least increased, it would make a single-stream import even faster.

Update on the code: I'm re-implementing the change on a fresh copy of the 1.0.1 version of the FUSE application. It's much cleaner code to work on, and I'm making sure it's done correctly, with all of the checks and management needed so this works as intended every time.

Server running on ESX, client running on Proxmox:
Code:
[root@my-esx:~] /usr/lib/vmware/vsan/bin/iperf3.copy -s -B 192.168.0.123 -p 8000
-----------------------------------------------------------
Server listening on 8000 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.0.150, port 53776
[  5] local 192.168.0.123 port 8000 connected to 192.168.0.150 port 53790
iperf3: getsockopt - Function not implemented
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   872 MBytes  7.30 Gbits/sec
iperf3: getsockopt - Function not implemented
[  5]   1.00-2.00   sec  1.07 GBytes  9.24 Gbits/sec
iperf3: getsockopt - Function not implemented
[  5]   2.00-3.00   sec  1.08 GBytes  9.24 Gbits/sec
iperf3: getsockopt - Function not implemented
[  5]   3.00-4.00   sec  1.07 GBytes  9.24 Gbits/sec
iperf3: getsockopt - Function not implemented
[  5]   4.00-5.00   sec  1.07 GBytes  9.18 Gbits/sec
iperf3: getsockopt - Function not implemented
[  5]   5.00-6.00   sec  1.09 GBytes  9.35 Gbits/sec
iperf3: getsockopt - Function not implemented
[  5]   6.00-7.00   sec  1.09 GBytes  9.35 Gbits/sec
iperf3: getsockopt - Function not implemented
[  5]   7.00-8.00   sec  1.09 GBytes  9.35 Gbits/sec
iperf3: getsockopt - Function not implemented
[  5]   8.00-9.00   sec  1.09 GBytes  9.35 Gbits/sec
iperf3: getsockopt - Function not implemented
[  5]   9.00-10.00  sec  1.09 GBytes  9.35 Gbits/sec
iperf3: getsockopt - Function not implemented
[  5]  10.00-10.01  sec  5.25 MBytes  8.99 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.01  sec  10.6 GBytes  9.10 Gbits/sec                  receiver

Code:
root@pve:~# iperf3 -c 192.168.0.123 -t 10 -i 5 -f g -p 8000
Connecting to host 192.168.0.123, port 8000
[  5] local 192.168.0.150 port 53790 connected to 192.168.0.123 port 8000
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-5.01   sec  5.15 GBytes  8.85 Gbits/sec  1289   1.49 MBytes
[  5]   5.01-10.01  sec  5.44 GBytes  9.35 Gbits/sec    0   2.00 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.01  sec  10.6 GBytes  9.10 Gbits/sec  1289            sender
[  5]   0.00-10.01  sec  10.6 GBytes  9.10 Gbits/sec                  receiver

iperf Done.

Server running on Proxmox, client running on ESX:
Code:
root@pve:~# iperf3 -s -B 192.168.0.150 -p 8000
-----------------------------------------------------------
Server listening on 8000 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.0.123, port 55395
[  5] local 192.168.0.150 port 8000 connected to 192.168.0.123 port 53091
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.07 GBytes  9.21 Gbits/sec
[  5]   1.00-2.00   sec  1.09 GBytes  9.34 Gbits/sec
[  5]   2.00-3.00   sec  1.09 GBytes  9.34 Gbits/sec
[  5]   3.00-4.00   sec  1.08 GBytes  9.31 Gbits/sec
[  5]   4.00-5.00   sec  1.09 GBytes  9.34 Gbits/sec
[  5]   5.00-6.00   sec  1.09 GBytes  9.34 Gbits/sec
[  5]   6.00-7.00   sec  1.09 GBytes  9.34 Gbits/sec
[  5]   7.00-8.00   sec  1.09 GBytes  9.34 Gbits/sec
[  5]   8.00-9.00   sec  1.09 GBytes  9.34 Gbits/sec
[  5]   9.00-10.00  sec  1.08 GBytes  9.31 Gbits/sec
[  5]  10.00-10.01  sec  4.62 MBytes  9.36 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.01  sec  10.9 GBytes  9.32 Gbits/sec                  receiver

Code:
[root@my-esx:~] /usr/lib/vmware/vsan/bin/iperf3.copy -c 192.168.0.150 -t 10 -i 5 -f g -p 8000
Connecting to host 192.168.0.150, port 8000
[  5] local 192.168.0.123 port 53091 connected to 192.168.0.150 port 8000
iperf3: getsockopt - Function not implemented
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-5.00   sec  5.42 GBytes  9.32 Gbits/sec  4236316672   0.00 Bytes
iperf3: getsockopt - Function not implemented
[  5]   5.00-10.00  sec  5.43 GBytes  9.33 Gbits/sec  58650624   0.00 Bytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  10.9 GBytes  9.32 Gbits/sec    0             sender
[  5]   0.00-10.01  sec  10.9 GBytes  9.32 Gbits/sec                  receiver

iperf Done.

TL;DR - I'm seeing basically symmetrical performance regardless of which machine is running the server.

On the topic of the code, is the netcat-dd branch the code you did the testing with? I was going to try to compile it and do some testing in the meantime but I wanted to make sure I was looking at the right thing.
 
@spencerh Yeah, the netcat-dd branch is the last working one. I just uploaded a few fixes to it based on the new codebase I'm working on.

Bash:
./target/release/esxi-folder-fuse --test-fuse --use-fuse-streaming --esxi-host 10.20.30.40 --esxi-disk /vmfs/volumes/nvme-storage/windows-test/windows-test.vmdk --dest /nvme-storage/esxi-import-test/windows-test.qcow2 --dst-format qcow2 --block-size 4M

That's an example of using the FUSE-based netcat import. You will want to make sure SSH keys are set up so Proxmox can talk to ESXi to establish the netcat tunnels. This does not have compression added at the moment, but it is about 50% faster than the stock codebase. Compression makes it about 80% faster than that on large VMs with lots of sparse and compressible data.
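For the key setup, something like this should do it (the authorized_keys location below is where ESXi keeps root's keys as far as I know; double check on your version, the IP is a placeholder, and older ESXi releases may only accept RSA keys):

Bash:
# On the PVE node: generate a key if one doesn't exist yet
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""

# Push the public key to the ESXi host (ESXi stores root's authorized keys under /etc/ssh/keys-root/)
cat ~/.ssh/id_ed25519.pub | ssh root@10.20.30.40 'cat >> /etc/ssh/keys-root/authorized_keys'

# Verify passwordless login works before running the import
ssh -o BatchMode=yes root@10.20.30.40 'echo ok'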

I'm working on a cleaned up codebase with more logging, better error handling, compression, and hopefully improving throughput even more.