Moving disk from NFS to Ceph hangs

Important note: the disks should be connected to an HBA.
The disks are connected directly to the board (i.e. the onboard SATA controller).

You can impose a bandwidth limit for migration, to reduce the load on the storage.
Well, this worked! I limited the transfer speed via:
Code:
/etc/pve/datacenter.cfg:
bwlimit: move=51200
with the effect that: 1. the "disk move" works without hanging; 2. I can run two transfers in parallel (my assumption: this works because it stays a little below the pool's maximum bandwidth); 3. adding a third transfer again produces a "hang".
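
For reference, the bwlimit values are in KiB/s, so 51200 is roughly 50 MiB/s per job (two parallel jobs then add up to about 100 MiB/s). If you only want to throttle a single move instead of setting a cluster-wide default, qemu-server also accepts a per-job limit (a sketch; VM ID, disk and storage name are placeholders, and the flag's availability depends on your qemu-server version, see `qm help move_disk`):
Code:
# Limit only this one disk move to ~50 MiB/s (value in KiB/s)
qm move_disk <vmid> <disk> <target-storage> --bwlimit 51200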

@Alwin, thanks a lot for your advice and for helping me learn Ceph.

Next steps for my config:
change my single-HDD (8 TB) setup into a config with more HDDs (well, I found 6 x 2 TB HDDs ... may be a start ...)
I will add a comment here with the outcome ...
 
Hi All,

I am also experiencing a similar problem. I have just attempted to move a disk from an NFS datastore to the Ceph datastore.

We have a cluster of four Dell R440s connected with 10 Gbps switching.
Each host has one Xeon, 128 GB RAM, one 960 GB SAS SSD for the WAL, and eight 2 TB SATA disks for OSDs.

Code:
root@pm01:/var/log# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.6-pve1
ceph-fuse: 14.2.6-pve1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12+deb10u1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-3
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1

The log for the transfer looks as follows; once it got to this stage, I left it for two hours before stopping the job.


Code:
create full clone of drive scsi0 (SRV-PROD-NFS00:122/vm-122-disk-0.qcow2)
drive mirror is starting for drive-scsi0
drive-scsi0: transferred: 11927552 bytes remaining: 260086890496 bytes total: 260098818048 bytes progression: 0.00 % busy: 1 ready: 0
drive-scsi0: transferred: 42008576 bytes remaining: 260056809472 bytes total: 260098818048 bytes progression: 0.02 % busy: 1 ready: 0
drive-scsi0: transferred: 76611584 bytes remaining: 260022206464 bytes total: 260098818048 bytes progression: 0.03 % busy: 1 ready: 0
drive-scsi0: transferred: 110166016 bytes remaining: 259988652032 bytes total: 260098818048 bytes progression: 0.04 % busy: 1 ready: 0
drive-scsi0: transferred: 160497664 bytes remaining: 259938320384 bytes total: 260098818048 bytes progression: 0.06 % busy: 1 ready: 0
drive-scsi0: transferred: 194052096 bytes remaining: 259904765952 bytes total: 260098818048 bytes progression: 0.07 % busy: 1 ready: 0
drive-scsi0: transferred: 246218752 bytes remaining: 259852599296 bytes total: 260098818048 bytes progression: 0.09 % busy: 1 ready: 0
drive-scsi0: transferred: 272629760 bytes remaining: 259826188288 bytes total: 260098818048 bytes progression: 0.10 % busy: 1 ready: 0
drive-scsi0: transferred: 291110912 bytes remaining: 259807707136 bytes total: 260098818048 bytes progression: 0.11 % busy: 1 ready: 0
drive-scsi0: transferred: 306380800 bytes remaining: 259792437248 bytes total: 260098818048 bytes progression: 0.12 % busy: 1 ready: 0
drive-scsi0: transferred: 366870528 bytes remaining: 259731947520 bytes total: 260098818048 bytes progression: 0.14 % busy: 1 ready: 0
drive-scsi0: transferred: 408944640 bytes remaining: 259689873408 bytes total: 260098818048 bytes progression: 0.16 % busy: 1 ready: 0
drive-scsi0: transferred: 452263936 bytes remaining: 259646554112 bytes total: 260098818048 bytes progression: 0.17 % busy: 1 ready: 0
drive-scsi0: transferred: 472580096 bytes remaining: 259626237952 bytes total: 260098818048 bytes progression: 0.18 % busy: 1 ready: 0
drive-scsi0: transferred: 520552448 bytes remaining: 259578265600 bytes total: 260098818048 bytes progression: 0.20 % busy: 1 ready: 0
drive-scsi0: transferred: 559939584 bytes remaining: 259538878464 bytes total: 260098818048 bytes progression: 0.22 % busy: 1 ready: 0
drive-scsi0: transferred: 591069184 bytes remaining: 259507748864 bytes total: 260098818048 bytes progression: 0.23 % busy: 1 ready: 0
drive-scsi0: transferred: 612171776 bytes remaining: 259486646272 bytes total: 260098818048 bytes progression: 0.24 % busy: 1 ready: 0
drive-scsi0: transferred: 663945216 bytes remaining: 259434872832 bytes total: 260098818048 bytes progression: 0.26 % busy: 1 ready: 0
drive-scsi0: transferred: 713752576 bytes remaining: 259385065472 bytes total: 260098818048 bytes progression: 0.27 % busy: 1 ready: 0
drive-scsi0: transferred: 751173632 bytes remaining: 259347644416 bytes total: 260098818048 bytes progression: 0.29 % busy: 1 ready: 0
drive-scsi0: transferred: 778698752 bytes remaining: 259320119296 bytes total: 260098818048 bytes progression: 0.30 % busy: 1 ready: 0
drive-scsi0: transferred: 799997952 bytes remaining: 259298820096 bytes total: 260098818048 bytes progression: 0.31 % busy: 1 ready: 0
drive-scsi0: transferred: 834338816 bytes remaining: 259264479232 bytes total: 260098818048 bytes progression: 0.32 % busy: 1 ready: 0
drive-scsi0: transferred: 855310336 bytes remaining: 259243507712 bytes total: 260098818048 bytes progression: 0.33 % busy: 1 ready: 0
drive-scsi0: transferred: 906231808 bytes remaining: 259192586240 bytes total: 260098818048 bytes progression: 0.35 % busy: 1 ready: 0
drive-scsi0: transferred: 920125440 bytes remaining: 259178692608 bytes total: 260098818048 bytes progression: 0.35 % busy: 1 ready: 0
drive-scsi0: transferred: 938213376 bytes remaining: 259160604672 bytes total: 260098818048 bytes progression: 0.36 % busy: 1 ready: 0
drive-scsi0: transferred: 988020736 bytes remaining: 259110797312 bytes total: 260098818048 bytes progression: 0.38 % busy: 1 ready: 0
drive-scsi0: transferred: 1002373120 bytes remaining: 259082158080 bytes total: 260084531200 bytes progression: 0.39 % busy: 1 ready: 0
drive-scsi0: transferred: 1007616000 bytes remaining: 259076915200 bytes total: 260084531200 bytes progression: 0.39 % busy: 1 ready: 0
drive-scsi0: transferred: 1007616000 bytes remaining: 259076915200 bytes total: 260084531200 bytes progression: 0.39 % busy: 1 ready: 0
drive-scsi0: transferred: 1007616000 bytes remaining: 259076915200 bytes total: 260084531200 bytes progression: 0.39 % busy: 1 ready: 0
drive-scsi0: transferred: 1007616000 bytes remaining: 259076915200 bytes total: 260084531200 bytes progression: 0.39 % busy: 1 ready: 0
drive-scsi0: transferred: 1007616000 bytes remaining: 259076915200 bytes total: 260084531200 bytes progression: 0.39 % busy: 1 ready: 0
drive-scsi0: transferred: 1007616000 bytes remaining: 259076915200 bytes total: 260084531200 bytes progression: 0.39 % busy: 1 ready: 0
drive-scsi0: transferred: 1007616000 bytes remaining: 259076915200 bytes total: 260084531200 bytes progression: 0.39 % busy: 1 ready: 0
drive-scsi0: transferred: 1007616000 bytes remaining: 259076915200 bytes total: 260084531200 bytes progression: 0.39 % busy: 1 ready: 0
drive-scsi0: transferred: 1007616000 bytes remaining: 259076915200 bytes total: 260084531200 bytes progression: 0.39 % busy: 1 ready: 0
...

drive-scsi0: transferred: 1007616000 bytes remaining: 259076915200 bytes total: 260084531200 bytes progression: 0.39 % busy: 1 ready: 0

drive-scsi0: Cancelling block job


I have checked the logs and only found the following:

Code:
root@pm01:/var/log# cat syslog | grep 1611238
Mar 18 13:57:12 pm01 pvedaemon[2865]: worker 1611238 started
Mar 18 14:12:29 pm01 pvedaemon[1611238]: user config - ignore invalid priviledge 'SDN.Allocate'
Mar 18 14:12:29 pm01 pvedaemon[1611238]: user config - ignore invalid priviledge 'SDN.Audit'
Mar 18 16:31:22 pm01 pvedaemon[1611238]: <root@pam> move disk VM 122: move --disk scsi0 --storage ceph-vm
Mar 18 16:31:22 pm01 pvedaemon[1611238]: <root@pam> starting task UPID:pm01:00193E8C:0BE97E4D:5E724CDA:qmmove:122:root@pam:
Mar 18 16:57:08 pm01 pvedaemon[1611238]: worker exit
Mar 18 16:57:08 pm01 pvedaemon[2865]: worker 1611238 finished

root@pm01:/var/log# cat syslog | grep qmmove
Mar 18 16:31:22 pm01 pvedaemon[1611238]: <root@pam> starting task UPID:pm01:00193E8C:0BE97E4D:5E724CDA:qmmove:122:root@pam:
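
For completeness, a few other places that can be checked on the PVE/Ceph side (a sketch; paths assume a stock Proxmox VE 6 / Ceph Nautilus install, and the UPID/time window should be adjusted to the move job in question):
Code:
# Locate the full task log of the move job (task logs are kept under
# /var/log/pve/tasks/ and the file names contain the UPID)
find /var/log/pve/tasks -name '*qmmove:122*'

# pvedaemon/pvestatd messages around the time of the move
journalctl -u pvedaemon -u pvestatd --since "2020-03-18 16:00" --until "2020-03-18 17:00"

# Ceph cluster state and slow/blocked requests while the job appears stuck
ceph -s
ceph health detail
grep -i "slow\|error" /var/log/ceph/ceph-osd.*.log | tail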

I ran the benchmarks as advised by Alwin:

Code:
rados -p ceph-vm bench 100 write -b 4M -t 16 --no-cleanup


Total time run:         100.583
Total writes made:      12220
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     485.964
Stddev Bandwidth:       67.3887
Max bandwidth (MB/sec): 588
Min bandwidth (MB/sec): 208
Average IOPS:           121
Stddev IOPS:            16.8472
Max IOPS:               147
Min IOPS:               52
Average Latency(s):     0.131534
Stddev Latency(s):      0.107733
Max latency(s):         1.67823
Min latency(s):         0.0432464

rados -p ceph-vm bench 600 seq -t 16

Total time run:       54.8427
Total reads made:     12220
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   891.276
Average IOPS:         222
Stddev IOPS:          18.8719
Max IOPS:             258
Min IOPS:             178
Average Latency(s):   0.071086
Max latency(s):       0.674741
Min latency(s):       0.0175976
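
One follow-up on the write benchmark: since it was run with --no-cleanup (so that the seq read test has objects to read back), the benchmark objects remain in the pool and can be removed once testing is done (pool name as used above):
Code:
# Remove the objects left behind by `rados bench ... --no-cleanup`
rados -p ceph-vm cleanup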


I read the post above; they advised that limiting the bandwidth of the transfer worked for one or two parallel transfers, but not with more simultaneous jobs.
Why would this be?

Any other ideas of what I can check? The rados benchmarks suggest it should be able to transfer this 250 GB disk in a fairly reasonable time.
 
Any other ideas of what I can check? The rados benchmarks suggest it should be able to transfer this 250 GB disk in a fairly reasonable time.
Check the source; there may be something in its logs.
 
Hi Alwin,

Thanks for the response. As you can see, I did check the syslog and kernel log, searching for the various process IDs above, "ceph", and "qmmove". I also checked /var/log/ceph/ceph.log but couldn't find anything specific. Could you point me to some of the other logs I could look at, and any specific phrases to grep for related to disk moves, which might help me pinpoint the issue?

Thanks,

Duncan
 
Thanks for the response. As you can see, I did check the syslog and kernel log, searching for the various process IDs above,
By "the source" I meant your NFS storage.
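
If enabling logging on the NFS server is not practical, the client side can still be observed from the PVE node while a move is running (a sketch; `nfsiostat` ships with Debian's nfs-common package, and the mount point assumes the default /mnt/pve/<storage-id> layout):
Code:
# Per-mount NFS throughput and RTT from the client side, sampled every 5 seconds
nfsiostat 5 /mnt/pve/SRV-PROD-NFS00

# Raw per-operation NFS client counters for that mount
grep -A 30 SRV-PROD-NFS00 /proc/self/mountstats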
 
Hi Alwin,
I have checked the NFS server and there is no logging turned on by default. We would need to restart the NFS daemon to turn logging on, which would obviously cause a bit of an issue for the VMs currently using it. We may turn logging on in a maintenance window, but it was only a temporary setup to get our VMs off vSphere and into Proxmox.
I guess we will never know why it didn't transfer.
Thanks for your help.
Duncan
 
