Online migration / disk move problem

Idaho

Member
Aug 27, 2013
I'm trying to migrate VM storage to Linstor SDS and am running into some odd trouble. All nodes are running PVE 7.1:

pve-manager/7.1-5/6fe299a0 (running kernel: 5.13.19-1-pve)

Linstor storage is, for now, on one host only. When I create a new VM on Linstor it works. When I try to migrate a VM from another host (and another storage) to Linstor, it fails:

2021-11-22 13:06:53 starting migration of VM 116 to node 'proxmox-ve3' (192.168.8.203)
2021-11-22 13:06:53 found local disk 'local-lvm:vm-116-disk-0' (in current VM config)
2021-11-22 13:06:53 starting VM 116 on remote node 'proxmox-ve3'
2021-11-22 13:07:01 volume 'local-lvm:vm-116-disk-0' is 'linstor-local:vm-116-disk-1' on the target
2021-11-22 13:07:01 start remote tunnel
2021-11-22 13:07:03 ssh tunnel ver 1
2021-11-22 13:07:03 starting storage migration
2021-11-22 13:07:03 scsi1: start migration to nbd:unix:/run/qemu-server/116_nbd.migrate:exportname=drive-scsi1
drive mirror is starting for drive-scsi1 with bandwidth limit: 51200 KB/s
drive-scsi1: Cancelling block job
drive-scsi1: Done.
2021-11-22 13:07:03 ERROR: online migrate failure - block job (mirror) error: drive-scsi1: 'mirror' has been cancelled
2021-11-22 13:07:03 aborting phase 2 - cleanup resources
2021-11-22 13:07:03 migrate_cancel
2021-11-22 13:07:08 ERROR: migration finished with problems (duration 00:00:16)
TASK ERROR: migration problems

Linstor volumes are created during the migration and there are no errors in its logs. I don't know why Proxmox is cancelling the job.

When I try to move a disk from NFS to Linstor (online), it also fails:

create full clone of drive scsi0 (nfs-backup:129/vm-129-disk-0.qcow2)

NOTICE
Trying to create diskful resource (vm-129-disk-1) on (proxmox-ve3).
drive mirror is starting for drive-scsi0 with bandwidth limit: 51200 KB/s
drive-scsi0: Cancelling block job
drive-scsi0: Done.
TASK ERROR: storage migration failed: block job (mirror) error: drive-scsi0: 'mirror' has been cancelled


To move storage to Linstor I first have to move it to NFS (online), shut the VM down, and then move the storage offline to Linstor. The bizarre thing is that once I've done this, I can move that particular VM's storage from Linstor to NFS online and from NFS to Linstor online. I can also migrate the VM online, from Linstor, directly to another node and another storage without problems.

I've set up a test cluster to reproduce this problem and couldn't: online migration to Linstor storage just worked. I don't know why it isn't working on the main cluster. Any hints on how to debug it?

storage.cfg:

dir: local
path /var/lib/vz
content vztmpl,iso,backup

lvmthin: local-lvm
thinpool data
vgname pve
content rootdir,images
nodes proxmox-ve5,proxmox-ve4

drbd: linstor-local
content images,rootdir
controller 192.168.8.203
resourcegroup linstor-local
preferlocal yes
nodes proxmox-ve3

zfspool: local-zfs
pool rpool/data
content images,rootdir
nodes proxmox-ve0,proxmox-ve1,proxmox-ve2
sparse 1

nfs: nfs-backup
export /data/nfs
path /mnt/pve/nfs-backup
server backup2
content rootdir,backup,images,iso,vztmpl
options vers=3
 

Fabian_E

Proxmox Staff Member
Staff member
Aug 1, 2019
Hi,
sounds like it could have the same root cause as the issue reported here. (Ping @fabian )
 

Idaho

Member
Aug 27, 2013
Maybe it's indeed something with VM disk size. More info:

I have two VMs that are identical in terms of size; both have a 32 GiB disk. The one that was not migrated to Linstor looks like this:

scsi0: local-lvm:vm-131-disk-0,cache=writeback,size=32G

The one that was migrated (offline, via NFS) to Linstor, which originally had size=32G, looks like this on Linstor:

scsi0: linstor-local:vm-125-disk-1,cache=writeback,size=33555416K

33555416 KiB is 32.0009384 GiB, slightly larger than 32 GiB.
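The overhead can be checked with a bit of shell arithmetic (the sizes are the ones from the configs above):

```shell
# Difference between the Linstor-reported size (33555416 KiB) and exactly 32 GiB.
kib=33555416
exact_kib=$((32 * 1024 * 1024))      # 32 GiB expressed in KiB = 33554432
echo "overhead: $((kib - exact_kib)) KiB"   # prints "overhead: 984 KiB"
```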

dmesg from the target node during a failed migration of a VM between nodes (from thin LVM to Linstor): https://pastebin.com/rN5ZQ8vN
This VM has two disks:

scsi0: local-lvm:vm-132-disk-0,cache=writeback,size=16M
scsi1: local-lvm:vm-132-disk-1,cache=writeback,size=10244M

This VM was at some point on Linstor, because its size is 10244M instead of 10G (originally it was 10G).
 

danadm

New Member
Mar 27, 2021
I have the same issue. The qcow2 files on an NFS share were created under Proxmox 5.1, and now I would like to migrate them to Proxmox 7.1. I attached the NFS storage and started the VMs, and they work fine, but live disk migration to a DRBD volume isn't possible: no error is shown, the job is just cancelled. Newly created DRBD volumes can be migrated to the NFS share and back without any problem.
 

selektor

Member
Sep 24, 2019
Gentlemen, does LINSTOR DRBD work with Proxmox 7.1? Somewhere I read that LINSTOR supports Proxmox up to 7.0 at most.
 

danadm

New Member
Mar 27, 2021
Gentlemen, does LINSTOR DRBD work with Proxmox 7.1? Somewhere I read that LINSTOR supports Proxmox up to 7.0 at most.
7.1-8 works fine; I upgraded two clusters to this version today. Only the migration from NFS/qcow2 to DRBD still fails.
 
Mar 22, 2018
This really is a shame. I can't even migrate to DRBD offline because of the inconsistency in size after migration:

On ZFS:
root@dc2:~# blockdev --getsize64 /dev/sda
2151677952

After moving to DRBD:
root@dc2:~# blockdev --getsize64 /dev/sda
2155372544 (+3694592 bytes = 3608K)

Back to ZFS:
root@dc2:~# blockdev --getsize64 /dev/sda
2155872256 (+499712 bytes = 488K; I guess it is just rounded up to 4M)
Total roundtrip size diff: 4194304 bytes = 4M
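For reference, the per-step growth can be re-derived from the three `blockdev` readings above:

```shell
# Sizes reported by blockdev --getsize64 at each step of the roundtrip.
zfs=2151677952    # original, on ZFS
drbd=2155372544   # after moving to DRBD
zfs2=2155872256   # after moving back to ZFS
echo "ZFS->DRBD:  $(( (drbd - zfs)  / 1024 )) KiB"        # 3608 KiB
echo "DRBD->ZFS:  $(( (zfs2 - drbd) / 1024 )) KiB"        # 488 KiB
echo "roundtrip:  $(( (zfs2 - zfs)  / 1024 / 1024 )) MiB" # 4 MiB
```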


It's disappointing because, for example, the backup GPT is aligned to the end of the device, so it seems to become unusable if the primary GPT fails after the device has grown. That's just one example. But think carefully before using this in production.
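The GPT concern follows directly from the numbers above: the backup GPT header lives in the device's last sector, so after the size change it is no longer where a recovery tool would look for it:

```shell
# Where the old backup GPT header ends up after the device grows.
old=2151677952; new=2155872256; sector=512
echo "old backup GPT is now $(( (new - old) / sector )) sectors short of the end"
# prints 8192 sectors (the 4 MiB roundtrip growth divided by the sector size)
```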
 
Mar 22, 2018
Thanks for the link, Fabian,
Currently we can live with offline migration, but what has really blown my mind is that the migration process alters the data! I mean the disk size.

If I understand correctly, this was done deliberately for the debugging process, wasn't it?
So it is not related to DRBD and I should not worry about it, right?

In case it is PVE-related, when do you think it will be possible to move a disk without altering it?

Thank you
 

fabian

Proxmox Staff Member
Staff member
Jan 7, 2016
no, it's basically because different storages have different ways of accounting overhead and/or rounding up sizes to their own granularity. it's not intentional on our side, but hard to fix (properly). we ask the source storage how big the disk is, and then tell the target storage to allocate a disk with that size, but the target storage gives us a bigger one, and Qemu doesn't like that for obvious reasons ;)
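As a rough illustration of that accounting (the 4 MiB extent size and 1 MiB metadata figure below are made up for the example, not DRBD's real numbers), rounding a request up to the target's allocation granularity can already reproduce a 10244M result like the one seen earlier in the thread for a 10G disk:

```shell
# Round a size up to the storage's allocation granularity (all values in MiB).
round_up() { echo $(( (($1 + $2 - 1) / $2) * $2 )); }

req=$((10 * 1024))   # requested: 10 GiB = 10240 MiB
meta=1               # hypothetical metadata overhead
gran=4               # hypothetical extent size
echo "allocated: $(round_up $((req + meta)) $gran) MiB"   # prints "allocated: 10244 MiB"
```

Because the mirror target ends up larger than the source, the sizes no longer match, which fits the cancelled drive-mirror jobs in the logs at the top of the thread.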
 
Mar 22, 2018
Thank you for the clarification, Fabian.
Unfortunately, we can't use DRBD until the size issue is fixed. I hope it will be fixed soon.
Best regards, Albert
 
