Moving disk from NFS to Ceph hangs

danielb · Aug 1, 2019

Hi.

I'm playing with PVE 6 and have 2 storages: one NFS and one Ceph (external, not managed by PVE). Both are working fine, and I can move a disk from Ceph to NFS without any issue. But not the other way around: from NFS to Ceph, the transfer starts and then hangs at around 3% indefinitely without progress.
Moving the disk while the VM is down works as expected, only live move is affected.
Is this a known issue ?

Alwin · Aug 2, 2019

danielb said:
Is this a known issue ?

No.

You need to provide log messages and more details about your setup. Otherwise no one will really be able to help you.

danielb · Aug 2, 2019

Nothing fancy. PVE6 with latest updates.

Code:

root@pvo2:~# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-3
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1
root@pvo2:~#

It's a 2-node cluster (just a test setup). I created 2 storages (external RBD and NFS):

Code:

rbd: ceph-fws
        content images
        krbd 0
        monhost 10.97.131.239,10.99.191.247,10.99.187.221
        pool pvefws
        username pvefws

nfs: nfs-fws
        export /zpool-127103/storage
        path /mnt/pve/nfs-fws
        server 10.24.70.15
        content images

Both are provided by OVH, so I do not have all the details about their setup. For Ceph, the link speed is 1Gbps; for NFS, it's 2Gbps. MTU 1500. Both use the same interface on my servers.

But, while it was 100% reproducible yesterday, it seems to be working now. I need to investigate what changes could impact this. Anyway, this answers my original question: no, it's not a known issue, and it is (or was) most likely something specific to my setup.

Alwin · Aug 2, 2019

danielb said:
Both are provided by OVH, so I do not have all the details about their setup.

I thought so, as it is unusual to have such different IPs for the MONs. Is this one of their products, where you can rent your own pool? If so, those are (or were at least during my tests) a virtual Ceph cluster hosted on their OpenStack instance, and that could have all sorts of implications.

For a hosted setup with two nodes, where you don't intend to scale out, ZFS with storage replication might be a more suitable choice.
https://pve.proxmox.com/pve-docs/chapter-pvesr.html

danielb · Aug 2, 2019

Alwin said:
Is this one of their products, where you can rent your own pool? If so, those are (or were at least during my tests) a virtual Ceph cluster hosted on their OpenStack instance, and that could have all sorts of implications.

Yes, I'm testing their "Ceph-as-a-service" offer to evaluate its performance.

Alwin said:
For a hosted setup with two nodes, where you don't intend to scale out, ZFS with storage replication might be a more suitable choice.
https://pve.proxmox.com/pve-docs/chapter-pvesr.html

The two-node setup is just for prototyping. If tests and stability are good enough, the plan is to scale to at least 5 or 6 nodes, so ZFS replication is not really an option for me.

Alwin · Aug 2, 2019

danielb said:
The two-node setup is just for prototyping. If tests and stability are good enough, the plan is to scale to at least 5 or 6 nodes, so ZFS replication is not really an option for me.

Understood. But I want to point out (in case you don't know) that it can sync to more than one host.

danielb said:
Yes, I'm testing their "Ceph-as-a-service" offer to evaluate its performance.

Their Ceph nodes had UUIDs similar to OpenStack's images (that's how I figured out they are hosted VMs). And the performance was very erratic. I hope I am wrong and they have a dedicated pool with physical hardware for you.

danielb · Aug 3, 2019

I've reproduced it once again (after several successful live disk moves back and forth between Ceph and NFS). The task starts and then just hangs indefinitely at 3.17% without any further progress :

Code:

create full clone of drive scsi0 (nfs-fws:103/vm-103-disk-0.qcow2)
drive mirror is starting for drive-scsi0
drive-scsi0: transferred: 511311872 bytes remaining: 33835384832 bytes total: 34346696704 bytes progression: 1.49 % busy: 1 ready: 0 
drive-scsi0: transferred: 1087963136 bytes remaining: 33258733568 bytes total: 34346696704 bytes progression: 3.17 % busy: 1 ready: 0 
drive-scsi0: transferred: 1087963136 bytes remaining: 33258733568 bytes total: 34346696704 bytes progression: 3.17 % busy: 1 ready: 0 
drive-scsi0: transferred: 1087963136 bytes remaining: 33258733568 bytes total: 34346696704 bytes progression: 3.17 % busy: 1 ready: 0 
drive-scsi0: transferred: 1087963136 bytes remaining: 33258733568 bytes total: 34346696704 bytes progression: 3.17 % busy: 1 ready: 0 
drive-scsi0: transferred: 1087963136 bytes remaining: 33258733568 bytes total: 34346696704 bytes progression: 3.17 % busy: 1 ready: 0 
drive-scsi0: transferred: 1087963136 bytes remaining: 33258733568 bytes total: 34346696704 bytes progression: 3.17 % busy: 1 ready: 0 
[...]

Nothing in particular in syslog:

Code:

Aug 03 15:06:21 pvo1 pvedaemon[2028]: <root@pam> move disk VM 103: move --disk scsi0 --storage ceph-fws
Aug 03 15:06:21 pvo1 pvedaemon[2028]: <root@pam> starting task UPID:pvo1:0000081A:010AD209:5D4586CD:qmmove:103:root@pam:

Both Ceph and NFS are working fine: no error message in the logs, no network hiccup, no problem with other VMs on either storage. It looks like qemu isn't sending any data to mirror the drive.
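
One way to confirm such a stall from the task output, rather than by eye, is to count how many consecutive status lines repeat the same "transferred" value. This is only a minimal sketch: mirror_stall_count is a hypothetical helper name, and it assumes the status lines keep the "transferred: <bytes> bytes" layout shown above.

```shell
#!/bin/sh
# mirror_stall_count (hypothetical helper): reads drive-mirror status lines
# on stdin and prints how many of the most recent lines in a row repeated
# the previous line's "transferred" byte count (0 = still progressing).
mirror_stall_count() {
    awk '
        /transferred:/ {
            # Find the number that follows the "transferred:" token.
            for (i = 1; i < NF; i++) {
                if ($i == "transferred:") {
                    cur = $(i + 1)
                }
            }
            if (seen && cur == prev) {
                same++
            } else {
                same = 0
            }
            prev = cur
            seen = 1
        }
        END { print same + 0 }
    '
}

# Example with three status lines; the last one repeats the previous byte count.
printf '%s\n' \
    'drive-scsi0: transferred: 511311872 bytes remaining: 33835384832 bytes total: 34346696704 bytes progression: 1.49 % busy: 1 ready: 0' \
    'drive-scsi0: transferred: 1087963136 bytes remaining: 33258733568 bytes total: 34346696704 bytes progression: 3.17 % busy: 1 ready: 0' \
    'drive-scsi0: transferred: 1087963136 bytes remaining: 33258733568 bytes total: 34346696704 bytes progression: 3.17 % busy: 1 ready: 0' \
    | mirror_stall_count
```

A count that keeps growing while the VM is otherwise healthy matches the hang described here; a count that resets to 0 means the mirror is merely slow.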

Alwin · Aug 3, 2019

You could try to run a rados bench on the pool and see if it keeps a steady speed.

Code:

rados -p <pool> bench 600 write -b 4M -t 16 --no-cleanup
rados -p <pool> bench 600 read -t 16

danielb · Aug 5, 2019

Here are the results (I changed the second bench from read to seq, as read is not valid):

Code:

Total time run:         600.668960
Total writes made:      15200
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     101.22
Stddev Bandwidth:       40.9338
Max bandwidth (MB/sec): 192
Min bandwidth (MB/sec): 0
Average IOPS:           25
Stddev IOPS:            10
Max IOPS:               48
Min IOPS:               0
Average Latency(s):     0.632279
Stddev Latency(s):      0.530079
Max latency(s):         4.13034
Min latency(s):         0.0158482

Code:

Total time run:       220.506369
Total reads made:     15200
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   275.729
Average IOPS:         68
Stddev IOPS:          25
Max IOPS:             136
Min IOPS:             10
Average Latency(s):   0.231426
Max latency(s):       2.77963
Min latency(s):       0.00613687

Alwin · Aug 5, 2019

danielb said:
Here are the results (I changed the second bench from read to seq, as read is not valid):

My bad.

danielb said:
Min bandwidth (MB/sec): 0

What do the status lines look like? I mean those that show the current progress every second.

danielb said:
Max latency(s): 4.13034

This doesn't look good. It would fit my theory that the storage bandwidth and latency are not stable, which makes the storage unreliable.
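
To put a number on "not stable", the standard deviation can be expressed as a percentage of the average. The sketch below does that for the write benchmark summary posted above; spread_pct is a hypothetical helper name, and the thread itself does not define what spread would be acceptable.

```shell
#!/bin/sh
# spread_pct (hypothetical helper): prints the standard deviation as a
# percentage of the mean. $1 = average, $2 = standard deviation (same unit).
spread_pct() {
    awk -v avg="$1" -v sd="$2" 'BEGIN { printf "%.1f\n", 100 * sd / avg }'
}

# Values copied from the write benchmark summary above.
spread_pct 101.22 40.9338      # bandwidth (MB/sec), prints 40.4
spread_pct 0.632279 0.530079   # latency (s), prints 83.8
```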

patrice damezin · Nov 15, 2019

Hit the same problem here, moving from a local XFS partition to Ceph.
The drive_mirror block job hangs indefinitely.
Also, block_job_cancel has no effect.

# info block-jobs
Type mirror, device drive-virtio1: Completed 2681274368 of 8253472768 bytes, speed limit 0 bytes/s

# block_job_cancel -f drive-virtio1

# info block-jobs
Type mirror, device drive-virtio1: Completed 2681274368 of 8253472768 bytes, speed limit 0 bytes/s

I need to force-stop the KVM process (SIGKILL) to resolve the situation.

akxx · Nov 22, 2019

Jumping in with the same issue, but in an HCI environment.

My scenario is moving from NFS-based storage (a separate storage system) to an HCI Proxmox 6 three-node system.

Some VMs migrated successfully to Ceph storage, but some are "hanging" as described.
(Just to note: I see the same issue when I try to move a disk from local storage, so it may not be an NFS-related issue.)

Good news: no corrupted VMs so far ;-) The worst case I had was restarting the node that tried to migrate the VM, so that I could remove the incomplete disk from Ceph storage.

Can I provide some helpful information to resolve the issue?

Alwin said:
What do the status lines look like? I mean those that show the current progress every second.

To answer your question: there is a status update every second, just showing no progress. The "hanging point" is not always the same (when trying more than once).

One additional piece of info: once a VM disk could be moved to Ceph storage, it also works a 2nd, 3rd, 4th time.

Alwin · Nov 25, 2019

@patrice damezin and @akxx, as said to the OP, benchmarks are the key. Your Ceph clusters seem to be too slow to get the block migration done in a reasonable time.

Code:

rados -p <pool> bench 600 write -b 4M -t 16 --no-cleanup
rados -p <pool> bench 600 seq -t 16
# clean rados bench objects
rados -p <pool> cleanup

danielb · Nov 26, 2019

I still think there's a bug. A slow Ceph cluster should still be usable and not hang like this.

akxx · Nov 26, 2019

@Alwin thanks for your answer!
Here are my "counters" (mixed HDD/NVMe, 1 HDD + 1 NVMe):

Code:

write:
Total time run:         600.229
Total writes made:      18625
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     124.119
Stddev Bandwidth:       9.49174
Max bandwidth (MB/sec): 148
Min bandwidth (MB/sec): 48
Average IOPS:           31
Stddev IOPS:            2.37294
Max IOPS:               37
Min IOPS:               12
Average Latency(s):     0.515609
Stddev Latency(s):      0.157552
Max latency(s):         1.76039
Min latency(s):         0.204772

seq:
Total time run:       164.19
Total reads made:     18625
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   453.744
Average IOPS:         113
Stddev IOPS:          7.23857
Max IOPS:             128
Min IOPS:             90
Average Latency(s):   0.140409
Max latency(s):       1.00043
Min latency(s):       0.0163517

rand:
Total time run:       600.205
Total reads made:     68361
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   455.584
Average IOPS:         113
Stddev IOPS:          6.60818
Max IOPS:             129
Min IOPS:             80
Average Latency(s):   0.139845
Max latency(s):       0.940519
Min latency(s):       0.00286263

HM ... although I do not like to say it ... this system was doing well with VMware vSAN for a few weeks. OK, performance is not very high,
but everything worked stably.
HM2 ... there was one change in the config (VMware does not support meshed networking, so I was using a 10G switch; I have no access to it now).
HM3 ... I configured the "broadcast" meshed solution. Do you see a benefit in reconfiguring it as a "routed" meshed solution?

Alwin · Nov 27, 2019

akxx said:
HM ... although I do not like to say it ... this system was doing well with VMware vSAN for a few weeks. OK, performance is not very high,
but everything worked stably.

This is subjective.

akxx said:
HM3 ... I configured the "broadcast" meshed solution. Do you see a benefit in reconfiguring it as a "routed" meshed solution?

Yes, the bandwidth is halved with broadcast, as the objects are sent twice.

akxx said:
Here are my "counters" (mixed HDD/NVMe, 1 HDD + 1 NVMe):

Two pools, or is the NVMe a DB/WAL device for the OSD? And how many OSDs does the cluster have (ceph osd crush tree --show-shadow)?

akxx · Nov 27, 2019

Alwin said:
Two pools, or is the NVMe a DB/WAL device for the OSD? And how many OSDs does the cluster have (ceph osd crush tree --show-shadow)?

My wish was to configure the NVMe as DB/WAL.
(I am new to Ceph, so it may not be configured right. I used the GUI for configuration. Rough overview:
3 nodes, Ceph public/private network on 10G, full mesh in broadcast mode,
on each node: monitor, manager, 1 OSD (HDD, 8T), DB on NVMe (1T) .. WAL ??? (I did not configure it, so hopefully it is on the NVMe)
one pool (3/2, PG128))

Code:

ID CLASS WEIGHT   TYPE NAME
-2   hdd 24.01529 root default~hdd
-8   hdd  8.00510     host pve1~hdd
 2   hdd  8.00510         osd.2
-6   hdd  8.00510     host pve2~hdd
 1   hdd  8.00510         osd.1
-4   hdd  8.00510     host pve3~hdd
 0   hdd  8.00510         osd.0
-1       24.01529 root default
-7        8.00510     host pve1
 2   hdd  8.00510         osd.2
-5        8.00510     host pve2
 1   hdd  8.00510         osd.1
-3        8.00510     host pve3
 0   hdd  8.00510         osd.0

Alwin · Nov 27, 2019

akxx said:
(I am new to Ceph, so it may not be configured right. I used the GUI for configuration. Rough overview:

That worked out.

The tree would otherwise show nvme devices.

akxx said:
on each node: monitor, manager, 1 OSD (HDD, 8T), DB on NVMe (1T) .. WAL ??? (I did not configure it, so hopefully it is on the NVMe)

You need more disks for performance. Otherwise the performance of 124MB/sec is good. You will only get the write speed of one disk (+NVMe), as the pool has 3x copies. The latency deviation is more of an issue (the smaller the better). As said: more disks, more speed. Depending on the IOPS of your NVMe, you can attach multiple OSDs to it.
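
The "write speed of one disk" remark follows from simple arithmetic: with 3 OSDs and 3 copies, every client write lands on every disk. The sketch below is a back-of-the-envelope approximation only (it ignores network, DB/WAL and other overheads and is not a Ceph formula); est_write_mbs is a hypothetical helper, and the measured 124 MB/s is used as the per-disk figure purely for illustration.

```shell
#!/bin/sh
# est_write_mbs (hypothetical helper): rough aggregate client write rate.
# $1 = number of OSDs, $2 = per-disk write rate in MB/s, $3 = pool size (copies).
est_write_mbs() {
    echo $(( $1 * $2 / $3 ))
}

est_write_mbs 3 124 3   # current layout: 3 OSDs, 3 copies -> 124
est_write_mbs 6 124 3   # illustration only: doubling the OSD count -> 248
```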

WAL is the write-ahead log and lives with the DB (RocksDB), if not specified separately.

akxx · Nov 27, 2019

Alwin said:
Yes, the bandwidth is halved with broadcast, as the objects are sent twice.

So I changed to routed full mesh (well, I had some nice learnings on the way, e.g. how to shut down/restart a Ceph cluster ;-) ...).

BUT main issue still persists: moving VM to ceph storage still not working for some VMs!!!
same issue: after starting well, no progress (ie. 12,31%, updated every second-ish). no "traffic" in background . even "over night".

@Alwin can you guide me to find better troubleshooting information (i.e. logs)?

Just for completeness, the new benchmarks (maybe slightly better than in broadcast mode):

Code:

Total time run:         600.346
Total writes made:      19110
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     127.327
Stddev Bandwidth:       8.34949
Max bandwidth (MB/sec): 156
Min bandwidth (MB/sec): 96
Average IOPS:           31
Stddev IOPS:            2.08737
Max IOPS:               39
Min IOPS:               24
Average Latency(s):     0.502639
Stddev Latency(s):      0.152562
Max latency(s):         1.71553
Min latency(s):         0.229114

seq:
Total time run:       160.345
Total reads made:     19110
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   476.723
Average IOPS:         119
Stddev IOPS:          6.08173
Max IOPS:             131
Min IOPS:             99
Average Latency(s):   0.133819
Max latency(s):       0.901709
Min latency(s):       0.0153605

rand:
Total time run:       600.297
Total reads made:     74680
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   497.621
Average IOPS:         124
Stddev IOPS:          7.02153
Max IOPS:             141
Min IOPS:             99
Average Latency(s):   0.128186
Max latency(s):       0.978023
Min latency(s):       0.00244035

Alwin · Nov 27, 2019

akxx said:
Bandwidth (MB/sec): 127.327

It will not get better, without more disks (OSDs).

akxx said:
Max latency(s): 1.71553 Min latency(s): 0.229114

That is a very big difference; it is most likely caused by the saturation of the disks. Important note: the disks should be connected to an HBA.

akxx said:
BUT main issue still persists: moving VM to ceph storage still not working for some VMs!!!
same issue: after starting well, no progress (ie. 12,31%, updated every second-ish). no "traffic" in background . even "over night".

You can impose a bandwidth limit for migration, to reduce the load on the storage.
https://pve.proxmox.com/pve-docs/datacenter.cfg.5.html
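
As a hedged illustration of what such a limit could look like: the bwlimit option in /etc/pve/datacenter.cfg takes values in KiB/s, and move is the key that applies to disk moves. The 50 MiB/s figure below is an arbitrary example, not a recommendation from this thread, and mib_to_kib is a hypothetical helper; check the linked man page for the syntax supported by your version.

```shell
#!/bin/sh
# mib_to_kib (hypothetical helper): convert MiB/s to the KiB/s unit used by bwlimit.
mib_to_kib() {
    echo $(( $1 * 1024 ))
}

# Print a candidate line for /etc/pve/datacenter.cfg (this does not modify any file).
printf 'bwlimit: move=%s\n' "$(mib_to_kib 50)"
```

This prints "bwlimit: move=51200", which would still have to be added to the file by hand.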

akxx said:
@Alwin can you guide me to find better troubleshooting information (i.e. logs)?

All logs are found under /var/log. Go through the kern.log (dmesg), the syslog and the individual logs of Ceph. This may be overwhelming at first, but with a little bit of grep you can make sense of the data.
