Moving disk from NFS to Ceph hangs

danielb

Hi.

I'm playing with PVE 6 and have 2 storages: one NFS and one Ceph (external, not managed by PVE). Both are working fine, and I can move a disk from Ceph to NFS without any issue. But not the other way around: from NFS to Ceph, the transfer starts and then hangs around 3% indefinitely without progress.
Moving the disk while the VM is down works as expected; only live moves are affected.
Is this a known issue?
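For reference, the move can also be triggered from the CLI; a minimal sketch with placeholder values (VM ID and target storage are not from my setup):

Code:
# sketch: live-move a disk via the CLI instead of the GUI
qm move_disk <vmid> scsi0 <target-storage>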
 
Is this a known issue?
No.

You need to provide log messages and more details about your setup, else no one will really be able to help you.
 
Nothing fancy: PVE 6 with the latest updates.

Code:
root@pvo2:~# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-3
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1
root@pvo2:~#

It's a 2-node cluster (just a test setup). I created 2 storages (external RBD and NFS):

Code:
rbd: ceph-fws
        content images
        krbd 0
        monhost 10.97.131.239,10.99.191.247,10.99.187.221
        pool pvefws
        username pvefws

nfs: nfs-fws
        export /zpool-127103/storage
        path /mnt/pve/nfs-fws
        server 10.24.70.15
        content images

Both are provided by OVH, so I do not have all the details about their setup. For Ceph, the link speed is 1 Gbps; for NFS, it's 2 Gbps. MTU 1500. Both use the same interface on my servers.


But, while it was 100% reproducible yesterday, it seems to be working now. I need to investigate what changes could have impacted this. Anyway, this answers my original question: no, it's not a known issue, and it is (or was) most likely something specific to my setup.
 
Both are provided by OVH, so I do not have all the details about their setup.
I thought so, as it is unusual to have such different IPs for the MONs. Is this one of their products where you can rent your own pool? If so, those are (or were, at least during my tests) virtual Ceph clusters hosted on their OpenStack instance, and that could have all sorts of implications.

For a hosted setup with two nodes, where you don't intend to scale out, ZFS with storage replication might be a more suitable choice.
https://pve.proxmox.com/pve-docs/chapter-pvesr.html
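A minimal sketch of what that looks like, assuming a guest with ID 100 and a second node called pve2 (both made up here):

Code:
# create a replication job for VM 100 to node pve2, running every 15 minutes,
# limited to 50 MB/s (values are only examples)
pvesr create-local-job 100-0 pve2 --schedule "*/15" --rate 50
# check the state of all replication jobs
pvesr status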
 
Is this one of their products where you can rent your own pool? If so, those are (or were, at least during my tests) virtual Ceph clusters hosted on their OpenStack instance, and that could have all sorts of implications.
Yes, I'm testing their "Ceph-as-a-service" offer to evaluate its performance.
For a hosted setup with two nodes, where you don't intend to scale out, ZFS with storage replication might be a more suitable choice.
https://pve.proxmox.com/pve-docs/chapter-pvesr.html
The two-node setup is just for prototyping. If the tests and stability are good enough, the plan is to scale to at least 5 or 6 nodes, so ZFS replication is not really an option for me.
 
The two-node setup is just for prototyping. If the tests and stability are good enough, the plan is to scale to at least 5 or 6 nodes, so ZFS replication is not really an option for me.
Understood. But I want to point out (maybe you don't know) that it can sync to more than one host.

Yes, I'm testing their "Ceph-as-a-service" offer to evaluate its performance.
Their Ceph nodes had UUIDs similar to OpenStack images (that's how I figured they are hosted as VMs). And the performance was very erratic. I hope I am wrong and they have a dedicated pool on physical hardware for you.
 
I've reproduced it once again (after several successful live disk moves back and forth between Ceph and NFS). The task starts and then just hangs indefinitely at 3.17% without any further progress:
Code:
create full clone of drive scsi0 (nfs-fws:103/vm-103-disk-0.qcow2)
drive mirror is starting for drive-scsi0
drive-scsi0: transferred: 511311872 bytes remaining: 33835384832 bytes total: 34346696704 bytes progression: 1.49 % busy: 1 ready: 0 
drive-scsi0: transferred: 1087963136 bytes remaining: 33258733568 bytes total: 34346696704 bytes progression: 3.17 % busy: 1 ready: 0 
drive-scsi0: transferred: 1087963136 bytes remaining: 33258733568 bytes total: 34346696704 bytes progression: 3.17 % busy: 1 ready: 0 
drive-scsi0: transferred: 1087963136 bytes remaining: 33258733568 bytes total: 34346696704 bytes progression: 3.17 % busy: 1 ready: 0 
drive-scsi0: transferred: 1087963136 bytes remaining: 33258733568 bytes total: 34346696704 bytes progression: 3.17 % busy: 1 ready: 0 
drive-scsi0: transferred: 1087963136 bytes remaining: 33258733568 bytes total: 34346696704 bytes progression: 3.17 % busy: 1 ready: 0 
drive-scsi0: transferred: 1087963136 bytes remaining: 33258733568 bytes total: 34346696704 bytes progression: 3.17 % busy: 1 ready: 0 
[...]

Nothing particular in the syslog:

Code:
Aug 03 15:06:21 pvo1 pvedaemon[2028]: <root@pam> move disk VM 103: move --disk scsi0 --storage ceph-fws
Aug 03 15:06:21 pvo1 pvedaemon[2028]: <root@pam> starting task UPID:pvo1:0000081A:010AD209:5D4586CD:qmmove:103:root@pam:

Both Ceph and NFS are working fine: no error messages in the logs, no network hiccups, no problems with other VMs on either storage. It looks like QEMU isn't sending any data to mirror the drive.
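One more thing I can check while it hangs (a sketch, assuming the VM ID 103 from the task log) is what QEMU itself reports about the block job:

Code:
# open the QEMU human monitor of VM 103 and query the running block jobs
qm monitor 103
# then, at the qm> prompt:
info block-jobs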
 
You could try to run a rados bench on the pool and see if it keeps a steady speed.
Code:
rados -p <pool> bench 600 write -b 4M -t 16 --no-cleanup
rados -p <pool> bench 600 read -t 16
 
Here are the results (I changed the second bench from read to seq, as read is not valid):

Code:
Total time run:         600.668960
Total writes made:      15200
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     101.22
Stddev Bandwidth:       40.9338
Max bandwidth (MB/sec): 192
Min bandwidth (MB/sec): 0
Average IOPS:           25
Stddev IOPS:            10
Max IOPS:               48
Min IOPS:               0
Average Latency(s):     0.632279
Stddev Latency(s):      0.530079
Max latency(s):         4.13034
Min latency(s):         0.0158482
Code:
Total time run:       220.506369
Total reads made:     15200
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   275.729
Average IOPS:         68
Stddev IOPS:          25
Max IOPS:             136
Min IOPS:             10
Average Latency(s):   0.231426
Max latency(s):       2.77963
Min latency(s):       0.00613687
 
Here are the results (I changed the second bench from read to seq, as read is not valid):
My bad.

Min bandwidth (MB/sec): 0
What do the status lines look like? I mean those that show the current progress every second.

Max latency(s): 4.13034
This doesn't look good. It would fit my theory that the storage bandwidth and latency are not stable, and hence the storage is unreliable.
 
Hit the same problem here, from a local XFS partition to Ceph.
The drive_mirror block job hangs indefinitely.
Also, block_job_cancel has no effect.

# info block-jobs
Type mirror, device drive-virtio1: Completed 2681274368 of 8253472768 bytes, speed limit 0 bytes/s

# block_job_cancel -f drive-virtio1

# info block-jobs
Type mirror, device drive-virtio1: Completed 2681274368 of 8253472768 bytes, speed limit 0 bytes/s

I need to force-stop the KVM process (SIGKILL) in order to be able to unlock the situation.
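A rough sketch of that last resort, assuming a hypothetical VM ID of 103 (the pid file path is the standard Proxmox one):

Code:
# last resort: kill the VM's KVM process and clear the leftover lock
kill -9 $(cat /var/run/qemu-server/103.pid)
qm unlock 103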
 
Jumping in with the same issue, but in an HCI environment.

My scenario is moving from an NFS-based storage (a separate storage system) to an HCI Proxmox 6 three-node system.

Some VMs migrated successfully to Ceph storage, but some are "hanging" as described.
(Just to note: I get the same issue when I try to move a disk from local storage, so it may not be an NFS-related issue.)

Good news: no corrupted VM so far ;-) ... the worst case I had was restarting the node that tried to migrate the VM, so that I could remove the incomplete disk from Ceph storage.

Can I provide some helpful information to resolve the issue?

What do the status lines look like? I mean those that show the current progress every second.
To answer your question: there is a status update every second, just showing no progress. The "hanging point" is not always the same (when trying more than once).

One additional piece of info: once a VM disk could be moved to Ceph storage, it also works a 2nd, 3rd, 4th time.
 
@patrice damezin and @akxx, as said to the OP, benchmarks are the key. Your Ceph clusters seem to be too slow to get the block migration done in a reasonable time.

Code:
rados -p <pool> bench 600 write -b 4M -t 16 --no-cleanup
rados -p <pool> bench 600 read -t 16
# clean rados bench objects
rados -p <pool> bench cleanup
 
@Alwin, thanks for your answer!
Here are my "counters" (mixed HDD/NVMe, 1 HDD + 1 NVMe per node):

Code:
write:
Total time run:         600.229
Total writes made:      18625
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     124.119
Stddev Bandwidth:       9.49174
Max bandwidth (MB/sec): 148
Min bandwidth (MB/sec): 48
Average IOPS:           31
Stddev IOPS:            2.37294
Max IOPS:               37
Min IOPS:               12
Average Latency(s):     0.515609
Stddev Latency(s):      0.157552
Max latency(s):         1.76039
Min latency(s):         0.204772

seq:
Total time run:       164.19
Total reads made:     18625
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   453.744
Average IOPS:         113
Stddev IOPS:          7.23857
Max IOPS:             128
Min IOPS:             90
Average Latency(s):   0.140409
Max latency(s):       1.00043
Min latency(s):       0.0163517

rand:
Total time run:       600.205
Total reads made:     68361
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   455.584
Average IOPS:         113
Stddev IOPS:          6.60818
Max IOPS:             129
Min IOPS:             80
Average Latency(s):   0.139845
Max latency(s):       0.940519
Min latency(s):       0.00286263

HM ... although I do not like to say it, this system was doing well with VMware vSAN for a few weeks. OK, the performance is not very high
... but everything worked stably.
HM2 ... there was one change in the config (VMware does not support meshed networking, so I was using a 10G switch; I have no access to it now).
HM3 ... I configured the "broadcast" meshed solution ... do you see a benefit in reconfiguring it as a "routed" meshed solution?
 
HM ... although I do not like to say it, this system was doing well with VMware vSAN for a few weeks. OK, the performance is not very high
... but everything worked stably.
This is subjective. ;)

HM3 ... I configured the "broadcast" meshed solution ... do you see a benefit in reconfiguring it as a "routed" meshed solution?
Yes, the bandwidth is halved with broadcast, as the objects are sent twice.
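For reference, a routed full mesh roughly looks like this in /etc/network/interfaces on one node. This is only a sketch with made-up interface names and addresses; see the Proxmox wiki article on full mesh networks for the authoritative setup:

Code:
# node 1 (10.15.15.50) with direct links to node 2 (.51) and node 3 (.52)
auto ens19
iface ens19 inet static
        address 10.15.15.50/24
        # direct link to node 2
        post-up ip route add 10.15.15.51/32 dev ens19

auto ens20
iface ens20 inet static
        address 10.15.15.50/24
        # direct link to node 3
        post-up ip route add 10.15.15.52/32 dev ens20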

Here are my "counters" (mixed HDD/NVMe, 1 HDD + 1 NVMe per node):
Two pools, or is the NVMe a DB/WAL device for the OSDs? And how many OSDs does the cluster have (ceph osd crush tree --show-shadow)?
 
Two pools, or is the NVMe a DB/WAL device for the OSDs? And how many OSDs does the cluster have (ceph osd crush tree --show-shadow)?

My wish was to configure the NVMe as DB/WAL.
(I am new to Ceph, so it may not be configured right. I used the GUI for configuration. Rough overview:
3 nodes, Ceph public/private on 10G, full mesh in broadcast mode,
on each node: monitor, manager, 1 OSD (HDD, 8 TB), DB on NVMe (1 TB). WAL??? (I did not configure it, so hopefully it is on the NVMe.)
One pool (3/2, 128 PGs).)

Code:
ID CLASS WEIGHT   TYPE NAME           
-2   hdd 24.01529 root default~hdd    
-8   hdd  8.00510     host pve1~hdd
2   hdd  8.00510         osd.2       
-6   hdd  8.00510     host pve2~hdd
1   hdd  8.00510         osd.1       
-4   hdd  8.00510     host pve3~hdd
0   hdd  8.00510         osd.0       
-1       24.01529 root default        
-7        8.00510     host pve1  
2   hdd  8.00510         osd.2       
-5        8.00510     host pve2  
1   hdd  8.00510         osd.1       
-3        8.00510     host pve3  
0   hdd  8.00510         osd.0
 
(I am new to Ceph, so it may not be configured right. I used the GUI for configuration. Rough overview:
That worked out. ;) The tree would otherwise show NVMe devices.

on each node: monitor, manager, 1 OSD (HDD, 8 TB), DB on NVMe (1 TB). WAL??? (I did not configure it, so hopefully it is on the NVMe.)
You need more disks for performance; otherwise, the performance of 124 MB/s is good. You will only get the write speed of one disk (+ NVMe), as the pool has 3 copies. The latency deviation is more of an issue (the smaller, the better). As said: more disks, more speed. Depending on the IOPS of your NVMe, you can connect multiple OSDs to the NVMe.

WAL is the write-ahead log and lives with the DB (RocksDB), if not specified separately.
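Two commands that may help here; a sketch with placeholder device paths, assuming the OSDs were created with pveceph:

Code:
# show how the existing OSDs are laid out on this node (which device holds block and which holds the DB)
ceph-volume lvm list
# sketch: add a further OSD whose RocksDB goes onto the shared NVMe (paths are placeholders)
pveceph osd create /dev/sdc --db_dev /dev/nvme0n1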
 
Yes, the bandwidth is halved with broadcast, as the objects are sent twice.

So I changed to a routed full mesh (well, I had some nice learnings along the way, e.g. how to shut down/restart a Ceph cluster ;-) ...).

BUT the main issue still persists: moving VM disks to Ceph storage is still not working for some VMs!
Same issue: after starting well, no progress (e.g. stuck at 12.31%, updated every second or so), no "traffic" in the background, even overnight.

@Alwin, can you guide me to better troubleshooting information (i.e. logs)?

Just for completeness, the new benchmarks (maybe slightly better than in broadcast mode):
Code:
write:
Total time run:         600.346
Total writes made:      19110
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     127.327
Stddev Bandwidth:       8.34949
Max bandwidth (MB/sec): 156
Min bandwidth (MB/sec): 96
Average IOPS:           31
Stddev IOPS:            2.08737
Max IOPS:               39
Min IOPS:               24
Average Latency(s):     0.502639
Stddev Latency(s):      0.152562
Max latency(s):         1.71553
Min latency(s):         0.229114

seq:
Total time run:       160.345
Total reads made:     19110
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   476.723
Average IOPS:         119
Stddev IOPS:          6.08173
Max IOPS:             131
Min IOPS:             99
Average Latency(s):   0.133819
Max latency(s):       0.901709
Min latency(s):       0.0153605

rand:
Total time run:       600.297
Total reads made:     74680
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   497.621
Average IOPS:         124
Stddev IOPS:          7.02153
Max IOPS:             141
Min IOPS:             99
Average Latency(s):   0.128186
Max latency(s):       0.978023
Min latency(s):       0.00244035
 
Bandwidth (MB/sec): 127.327
It will not get better without more disks (OSDs).

Max latency(s): 1.71553 Min latency(s): 0.229114
A very big difference; this is most likely caused by saturation of the disks. Important note: the disks should be connected to an HBA.

BUT the main issue still persists: moving VM disks to Ceph storage is still not working for some VMs!
Same issue: after starting well, no progress (e.g. stuck at 12.31%, updated every second or so), no "traffic" in the background, even overnight.
You can impose a bandwidth limit for the migration to reduce the load on the storage.
https://pve.proxmox.com/pve-docs/datacenter.cfg.5.html
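A minimal sketch of such a limit in /etc/pve/datacenter.cfg (the value is only an example, given in KiB/s):

Code:
# cap "move disk" jobs at roughly 50 MB/s
bwlimit: move=51200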

@Alwin, can you guide me to better troubleshooting information (i.e. logs)?
All logs are found under /var/log. Go through the kernel log (dmesg), the syslog and the individual Ceph logs. This may be overwhelming at first, but with a little bit of grep you can make sense of the data. ;)
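A few starting points; a sketch only, the patterns need to be adapted to your VM and disk names:

Code:
# kernel messages about hung tasks or storage trouble
dmesg -T | grep -iE 'hung|blocked|nfs|rbd'
# the disk move task and its drive in the syslog
grep -E 'qmmove|drive-scsi0' /var/log/syslog
# errors and slow request warnings in the Ceph logs on the node
grep -iE 'error|slow' /var/log/ceph/*.log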
 
