I have a server that is not part of my cluster, and I'd like to move one of its VMs to the shared storage on my cluster.
As the VM is on ZFS storage, and I don't have a ZFS pool of sufficient size on my cluster, I've set up temporary storage on one of the nodes. One of the virtual disks is 0.5 TB and the other is 2.4 TB.
On Friday night I begin by transferring the large disk to my temporary storage:
Code:
dd if=/dev/zvol/vm-110-disk-2 bs=1M status=progress | ssh root@destination-server "dd of=/local-vmdata/temp/images/310/vm-310-disk-2.raw bs=1M"
This works fine. I nearly saturate my GbE connection, and early Saturday morning, after about 7 hours, the transfer completes.
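For what it's worth, the copy can be double-checked by comparing checksums on both ends; it's slow on an image this size, but straightforward:

Code:
# on the source server: checksum the zvol that was read
sha256sum /dev/zvol/vm-110-disk-2
# on the destination node: checksum the copied raw image; the hashes should match
sha256sum /local-vmdata/temp/images/310/vm-310-disk-2.raw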
I set up the vmid.conf file and move the disk in the PM GUI.
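For reference, the disk line in /etc/pve/qemu-server/310.conf just points the VM at the raw image on the temp storage, roughly like this (the storage ID "temp" and the SCSI bus number here are placeholders, not necessarily what the setup actually uses):

Code:
# excerpt from /etc/pve/qemu-server/310.conf; storage ID and bus are placeholders
# size comes from the byte total later reported by the task log (2250 GiB)
scsi1: temp:310/vm-310-disk-2.raw,size=2250G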
At first, everything appears to be working fine. I appear to be transferring at about 60 MB/s, which is slower than the first move, but not terrible.
The problem is that the move operation stays at 0%, and it stays there for 12 hours. After 12 hours my bandwidth usage drops off significantly, to about 15 MB/s, but I finally start seeing progress on the move, and it advances faster than the bandwidth usage suggests. I start to hope that, after about 18 hours total, the disk will be moved by late Saturday night or early Sunday morning.
At about 23% the move failed:
Code:
transferred: 555661393920 bytes remaining: 1860257710080 bytes total: 2415919104000 bytes progression: 23.00 %
qemu-img: error while writing sector 1090387709: Input/output error
/dev/mapper/3600140529fca684d358ad4823daeabd7: read failed after 0 of 4096 at 0: Input/output error
/dev/mapper/3600140529fca684d358ad4823daeabd7: read failed after 0 of 4096 at 3995833663488: Input/output error
/dev/mapper/3600140529fca684d358ad4823daeabd7: read failed after 0 of 4096 at 3995833720832: Input/output error
/dev/mapper/3600140529fca684d358ad4823daeabd7: read failed after 0 of 4096 at 4096: Input/output error
/dev/HA/vm-106-disk-1: read failed after 0 of 4096 at 0: Input/output error
/dev/HA/vm-106-disk-1: read failed after 0 of 4096 at 17179803648: Input/output error
/dev/HA/vm-106-disk-1: read failed after 0 of 4096 at 17179860992: Input/output error
/dev/HA/vm-106-disk-1: read failed after 0 of 4096 at 4096: Input/output error
/dev/HA/vm-310-disk-0: read failed after 0 of 4096 at 0: Input/output error
/dev/HA/vm-310-disk-0: read failed after 0 of 4096 at 2415919038464: Input/output error
/dev/HA/vm-310-disk-0: read failed after 0 of 4096 at 2415919095808: Input/output error
/dev/HA/vm-310-disk-0: read failed after 0 of 4096 at 4096: Input/output error
/dev/mapper/3600140529fca684d358ad4823daeabd7: read failed after 0 of 4096 at 0: Input/output error
/dev/mapper/3600140529fca684d358ad4823daeabd7: read failed after 0 of 4096 at 3995833663488: Input/output error
/dev/mapper/3600140529fca684d358ad4823daeabd7: read failed after 0 of 4096 at 3995833720832: Input/output error
/dev/mapper/3600140529fca684d358ad4823daeabd7: read failed after 0 of 4096 at 4096: Input/output error
/dev/HA/vm-106-disk-1: read failed after 0 of 4096 at 0: Input/output error
/dev/HA/vm-106-disk-1: read failed after 0 of 4096 at 17179803648: Input/output error
/dev/HA/vm-106-disk-1: read failed after 0 of 4096 at 17179860992: Input/output error
/dev/HA/vm-106-disk-1: read failed after 0 of 4096 at 4096: Input/output error
/dev/HA/vm-310-disk-0: read failed after 0 of 4096 at 0: Input/output error
/dev/HA/vm-310-disk-0: read failed after 0 of 4096 at 2415919038464: Input/output error
/dev/HA/vm-310-disk-0: read failed after 0 of 4096 at 2415919095808: Input/output error
/dev/HA/vm-310-disk-0: read failed after 0 of 4096 at 4096: Input/output error
/dev/mapper/3600140529fca684d358ad4823daeabd7: read failed after 0 of 4096 at 0: Input/output error
/dev/mapper/3600140529fca684d358ad4823daeabd7: read failed after 0 of 4096 at 3995833663488: Input/output error
/dev/mapper/3600140529fca684d358ad4823daeabd7: read failed after 0 of 4096 at 3995833720832: Input/output error
/dev/mapper/3600140529fca684d358ad4823daeabd7: read failed after 0 of 4096 at 4096: Input/output error
/dev/HA/vm-106-disk-1: read failed after 0 of 4096 at 0: Input/output error
/dev/HA/vm-106-disk-1: read failed after 0 of 4096 at 17179803648: Input/output error
/dev/HA/vm-106-disk-1: read failed after 0 of 4096 at 17179860992: Input/output error
/dev/HA/vm-106-disk-1: read failed after 0 of 4096 at 4096: Input/output error
/dev/HA/vm-310-disk-0: read failed after 0 of 4096 at 0: Input/output error
/dev/HA/vm-310-disk-0: read failed after 0 of 4096 at 2415919038464: Input/output error
/dev/HA/vm-310-disk-0: read failed after 0 of 4096 at 2415919095808: Input/output error
/dev/HA/vm-310-disk-0: read failed after 0 of 4096 at 4096: Input/output error
Logical volume "vm-310-disk-0" successfully removed
TASK ERROR: storage migration failed: copy failed: command '/usr/bin/qemu-img convert -p -n -f raw -O raw /local-vmdata/temp/images/310/vm-310-disk-2.raw /dev/HA/vm-310-disk-0' failed: exit code 1
The NAS device reports no problems with the shared storage, so I assume it was some network issue that caused the failure.
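Before retrying, I want to see whether the node itself logged anything at the moment of the failure; the standard kernel logs should show any block, iSCSI, or multipath path errors:

Code:
# look for block/multipath/iSCSI errors in the kernel log around the failure
dmesg -T | grep -iE 'i/o error|multipath|iscsi'
journalctl -k --since "12 hours ago" | grep -iE 'error|fail'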
Now I'm not sure how to proceed. I can see in iotop while the transfer is running, and in the GUI after it fails, that PM is transferring the disk by running:
Code:
/usr/bin/qemu-img convert -p -n -f raw -O raw /local-vmdata/temp/images/310/vm-310-disk-2.raw /dev/HA/vm-310-disk-0
This is taking far too long for the 60 MB/s it is consuming. At that rate the 2.4 TB disk should be done in roughly 11 hours, but progress doesn't move from 0% to 1% until about 12 hours in, at which point it appears to advance at a normal rate until the failure.
Here's a screen grab from PM that illustrates the point:
The netin shows the disk being moved into the cluster's temp storage; the netout shows my attempt to move it from the temp storage to the shared storage.
My idea now is to use dd instead of the GUI to transfer the raw disk image to the shared storage. Is there any reason I should not do this:
Code:
dd if=/local-vmdata/temp/images/310/vm-310-disk-2.raw of=/dev/HA/vm-310-disk-0 bs=1M status=progress
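If plain dd gives me trouble, one variant I might try is direct I/O, so the node's page cache isn't buffering gigabytes of dirty data against the LVM device (these are just standard dd flags, untested here):

Code:
dd if=/local-vmdata/temp/images/310/vm-310-disk-2.raw of=/dev/HA/vm-310-disk-0 bs=1M oflag=direct conv=fsync status=progress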
Thanks in advance.
EDIT 1: Just thinking out loud here... alternatively, I could create another empty disk of the same size first, and dd from one to the other. Perhaps that's a better way.
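Concretely, that sketch would look something like the following; the storage ID "HA" (matching the VG name) and the use of pvesm rather than the GUI are assumptions on my part, and the size has to match the source image exactly (2250 GiB, per the byte total in the failed task log):

Code:
# allocate an empty 2250 GiB volume for VM 310 on the shared LVM storage (storage ID is a guess)
pvesm alloc HA 310 vm-310-disk-0 2250G
# activate it if the device node doesn't appear automatically
lvchange -ay HA/vm-310-disk-0
# copy the raw image straight onto the logical volume
dd if=/local-vmdata/temp/images/310/vm-310-disk-2.raw of=/dev/HA/vm-310-disk-0 bs=1M status=progress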
EDIT 2: Moving the small 0.5 TB disk to a different volume on the same NAS has now also failed. This worked last weekend, though I seem to recall it taking two tries that time. Again, the NAS reports the disks are healthy, but the transfer failed nonetheless:
Code:
transferred: 209487029862 bytes remaining: 327383882138 bytes total: 536870912000 bytes progression: 39.02 %
qemu-img: error while writing sector 418815901: Input/output error
Logical volume "vm-310-disk-0" successfully removed
TASK ERROR: storage migration failed: copy failed: command '/usr/bin/qemu-img convert -p -n -f raw -O raw /local-vmdata/temp/images/310/vm-310-disk-1.raw /dev/HA_2/vm-310-disk-0' failed: exit code 1
I have two dedicated SAN switches, connected to bonded interfaces on the servers and the NAS, so I'm not sure what the cause of this can be. As I'm running out of time, I'm going to try my idea from EDIT 1: create a drive and copy onto it with dd. Each failure seems to occur after enough bandwidth has been used that I have to assume the whole disk was already transferred and the phase-2 procedure was underway (see image above).
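Before kicking that off, a few quick checks on the SAN paths themselves seem worthwhile (the bond interface name is an assumption):

Code:
# state of the bonded SAN interfaces (bond0 is a guess at the name)
cat /proc/net/bonding/bond0
# multipath view of the shared LUN; every path should be active/ready
multipath -ll
# LVM's view of the shared volume group
pvs && lvs HA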
EDIT 3a: So far so good. I started with the small disk, and it's running close to the maximum speed of the link:
Code:
5799673856 bytes (5.8 GB, 5.4 GiB) copied, 53.0072 s, 109 MB/s
... vs 40-60 MB/s using the GUI by way of qemu-img convert.
EDIT 3b: Success:
Code:
536835260416 bytes (537 GB, 500 GiB) copied, 4985.01 s, 108 MB/s
512000+0 records in
512000+0 records out
536870912000 bytes (537 GB, 500 GiB) copied, 5146.66 s, 104 MB/s
... now I'm gonna try the big one.
EDIT 3c: The large disk transferred successfully as well. I was then able to boot the VM. I'd like to mark this as solved, but I'm not sure why the move operation through the GUI takes so long or fails.