disk move really slow

RobFantini

Lately we have seen that disk moves from Ceph to ZFS are very slow.

I just tested moving a 12GB KVM disk.

The KVM was stopped prior to the move.

The move started at 20:14. I am posting progress as it occurs.
20:24 13%
20:38 14%
20:43 14%
It is going very slowly.

Most of the time, iotop does not show any move activity.
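
For anyone who wants to see where the data is (or isn't) flowing while a move like this runs, something along these lines should show activity on both sides (pool and device names will differ per setup):

Code:
# ZFS side: per-vdev throughput, refreshed every second
zpool iostat -v 1

# all block devices with extended stats (from the sysstat package)
iostat -x 1

# Ceph side: cluster-wide client I/O, updated continuously
ceph -w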

log:
Code:
less UPID:pve3:002F43FB:019B5E32:5BF9F759:qmmove:2165:root*

create full clone of drive scsi0 (ceph_vm:vm-2165-disk-1)

transferred: 0 bytes remaining: 12884901888 bytes total: 12884901888 bytes progression: 0.00 %
transferred: 130137509 bytes remaining: 12754764379 bytes total: 12884901888 bytes progression: 1.01 %
transferred: 260275018 bytes remaining: 12624626870 bytes total: 12884901888 bytes progression: 2.02 %
transferred: 390412527 bytes remaining: 12494489361 bytes total: 12884901888 bytes progression: 3.03 %
transferred: 520550036 bytes remaining: 12364351852 bytes total: 12884901888 bytes progression: 4.04 %
transferred: 650687545 bytes remaining: 12234214343 bytes total: 12884901888 bytes progression: 5.05 %
transferred: 779536564 bytes remaining: 12105365324 bytes total: 12884901888 bytes progression: 6.05 %
transferred: 909674073 bytes remaining: 11975227815 bytes total: 12884901888 bytes progression: 7.06 %
transferred: 1039811582 bytes remaining: 11845090306 bytes total: 12884901888 bytes progression: 8.07 %
transferred: 1169949091 bytes remaining: 11714952797 bytes total: 12884901888 bytes progression: 9.08 %
transferred: 1300086600 bytes remaining: 11584815288 bytes total: 12884901888 bytes progression: 10.09 %
transferred: 1430224109 bytes remaining: 11454677779 bytes total: 12884901888 bytes progression: 11.10 %
transferred: 1560361618 bytes remaining: 11324540270 bytes total: 12884901888 bytes progression: 12.11 %
transferred: 1690499127 bytes remaining: 11194402761 bytes total: 12884901888 bytes progression: 13.12 %
transferred: 1820636636 bytes remaining: 11064265252 bytes total: 12884901888 bytes progression: 14.13 %

The zpool is a mirror of two good data-center-grade Intel SSDs:
Code:
Model Family:     Intel 730 and DC S35x0/3610/3700 Series SSDs
Device Model:     INTEL SSDSC2BP480G4
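
A quick fio job like the one below could sanity-check the sequential write speed of that SSD mirror on its own. The /rpool/fio-test path is just an example, and O_DIRECT is left out because ZFS 0.7 does not support it:

Code:
# sequential 4M writes to a 4G test file on the zpool, fsync at the end
fio --name=zpool-write --filename=/rpool/fio-test --size=4G \
    --bs=4M --rw=write --ioengine=psync --end_fsync=1

# remove the test file afterwards
rm /rpool/fio-test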

The Ceph network is 10G.

The PVE host has 2 CPUs, 40 cores, and 200GB+ of memory.



ceph -s :
Code:
  cluster:
    id:     220b9a53-4556-48e3-a73c-28deff665e45
    health: HEALTH_WARN
           noout flag(s) set

  services:
    mon: 3 daemons, quorum pve3,sys8,pve10
    mgr: pve3(active), standbys: sys8, pve10
    osd: 65 osds: 65 up, 65 in
        flags noout

  data:
    pools:   2 pools, 1088 pgs
    objects: 105.93k objects, 373GiB
    usage:   1.10TiB used, 24.4TiB / 25.5TiB avail
    pgs:     1088 active+clean
  io:
    client:   194KiB/s wr, 0op/s rd, 35op/s wr

I'll let the move continue and post the result.

The move finished at 23:57. That's 3+ hours to move 12GB; something is wrong with our configuration, or there is a bug.

Does anyone have a suggestion for getting disk moves back to normal speed?
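
To narrow down whether the slow side is reading from Ceph or writing to ZFS, a rados bench against the VM pool might help; the pool name below is guessed from the storage ID (ceph_vm) and may need adjusting:

Code:
# 30 seconds of 4M object writes, keeping the objects for the read test
rados bench -p ceph_vm 30 write --no-cleanup

# sequential read of the objects just written
rados bench -p ceph_vm 30 seq

# remove the benchmark objects
rados -p ceph_vm cleanup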
 
Code:
# pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.18-8-pve)                                               
pve-manager: 5.2-10 (running version: 5.2-10/6f892b40)                                         
pve-kernel-4.15: 5.2-11                                                                                 
pve-kernel-4.15.18-8-pve: 4.15.18-28                                                                             
pve-kernel-4.15.18-7-pve: 4.15.18-27                                                                             
ceph: 12.2.8-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-41
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-3
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-29
pve-docs: 5.2-9
pve-firewall: 3.0-14
pve-firmware: 2.0-6
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
pve-zsync: 1.7-1
qemu-server: 5.0-38
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve2~bpo1

ceph.conf:
Code:
[global]
         auth client required = none
         auth cluster required = none
         auth service required = none
         cluster network = 10.11.12.0/24
         debug asok = 0/0
         debug auth = 0/0
         debug buffer = 0/0
         debug client = 0/0
         debug context = 0/0
         debug crush = 0/0
         debug filer = 0/0
         debug filestore = 0/0
         debug finisher = 0/0
         debug heartbeatmap = 0/0
         debug journal = 0/0
         debug journaler = 0/0
         debug lockdep = 0/0
         debug mds = 0/0
         debug mds balancer = 0/0
         debug mds locker = 0/0
         debug mds log = 0/0
         debug mds log expire = 0/0
         debug mds migrator = 0/0
         debug mon = 0/0
         debug monc = 0/0
         debug ms = 0/0
         debug objclass = 0/0
         debug objectcacher = 0/0
         debug objecter = 0/0
         debug optracker = 0/0
         debug osd = 0/0
         debug paxos = 0/0
 
         debug perfcounter = 0/0
         debug rados = 0/0
         debug rbd = 0/0
         debug rgw = 0/0
         debug throttle = 0/0
         debug timer = 0/0
         debug tp = 0/0
         fsid = xx
         keyring = /etc/pve/priv/$cluster.$name.keyring
         mon allow pool delete = true
         osd deep scrub interval = 1209600
         osd journal size = 5120
         osd pool default min size = 2
         osd pool default size = 3
         osd scrub begin hour = 21
         osd scrub end hour = 6
         osd scrub sleep = 0.1
         public network = 10.11.12.0/24

[osd]
         keyring = /var/lib/ceph/osd/ceph-$id/keyring
         osd max backfills = 1
         osd recovery max active = 1

         #
         # 2018-11-21  see https://forum.proxmox.com/threads/ceph-slow-requests-are-blocked.48955/#post-229061
         # disable throttling
         #
         objecter_inflight_ops = 10240
         objecter_inflight_op_bytes = 1048576000
         ms_dispatch_throttle_bytes = 1048576000
         osd_client_message_size_cap = 0
         osd_client_message_cap = 0
         osd_enable_op_tracker = false

[mon.pve3]
         host = pve3
         mon addr = 10.11.12.3:6789

[mon.sys8]
         host = sys8
         mon addr = 10.11.12.8:6789

[mon.pve10]
         host = pve10
         mon addr = 10.11.12.10:6789
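
One thing worth double-checking is whether those [osd] overrides are actually active on the running daemons; the admin socket on an OSD host should show the effective values (osd.0 is just an example):

Code:
# effective values of the throttling-related options on osd.0
ceph daemon osd.0 config show | grep -E 'objecter_inflight|message_cap|message_size_cap|op_tracker'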
 
I just did an LXC move from ZFS to Ceph, and will next move the same LXC back to ZFS.

I was checking the log during the process.
Code:
/dev/rbd1
mke2fs 1.43.4 (31-Jan-2017)
Discarding device blocks: done                           
Creating filesystem with 1048576 4k blocks and 262144 inodes
Filesystem UUID: db048e5a-7ac5-4921-958f-f9fd5fcc7aba
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736

Allocating group tables: done                           
Writing inode tables: done                           
Creating journal (16384 blocks): done
Multiple mount protection is enabled with update interval 5 seconds.
Writing superblocks and filesystem accounting information: done

Number of files: 145,126 (reg: 102,269, dir: 22,983, link: 19,814, dev: 2, special: 58)
Number of created files: 145,124 (reg: 102,269, dir: 22,981, link: 19,814, dev: 2, special: 58)
Number of deleted files: 0
Number of regular files transferred: 30,029
Total file size: 3,600,828,164 bytes
Total transferred file size: 1,657,949,168 bytes
Literal data: 1,657,949,168 bytes
Matched data: 0 bytes
File list size: 2,948,767
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 1,671,368,068
Total bytes received: 6,830,918

sent 1,671,368,068 bytes  received 6,830,918 bytes  7,187,147.69 bytes/sec
total size is 3,600,828,164  speedup is 2.15
TASK OK

After the transfer was done, when the log wrote the 'total size ...' line, it took around 10 more minutes for 'TASK OK' to appear in the log.

During those 10 minutes, the PVE screen showed the task still in progress AND the root disk still on ZFS.
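
One way to check what it is doing during that gap would be to see whether the old ZFS volume is still being cleaned up while the task shows as in progress, e.g.:

Code:
# look for a removal of the source volume still running
ps aux | grep -E 'zfs destroy|rbd rm' | grep -v grep

# ZFS-side activity while the task appears stuck
zpool iostat 1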
 
Moving from Ceph to ZFS was fast / normal speed:
it started at 9:02 and ended at 9:03, including showing as done on the PVE screen.

Code:
Number of files: 145,125 (reg: 102,269, dir: 22,983, link: 19,814, dev: 2, special: 57)
Number of created files: 145,124 (reg: 102,269, dir: 22,982, link: 19,814, dev: 2, special: 57)
Number of deleted files: 0
Number of regular files transferred: 30,029
Total file size: 3,600,870,523 bytes
Total transferred file size: 1,657,991,527 bytes
Literal data: 1,657,991,527 bytes
Matched data: 0 bytes
File list size: 2,817,580
File list generation time: 0.012 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 1,671,415,307
Total bytes received: 6,830,971

sent 1,671,415,307 bytes  received 6,830,971 bytes  30,238,671.68 bytes/sec
total size is 3,600,870,523  speedup is 2.15

I'll repeat the move to Ceph, then back to ZFS.
 
ZFS to Ceph: 4 minutes.

Ceph to ZFS: 4 minutes. Note that the actual transfer took about a minute; the task was stuck for 3.5 minutes at 'Removing image: 97% complete...'.

So the results are not consistent.

This could be network related. We are changing from Intel 10G to Mellanox ConnectX-4 Ethernet this week.
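
Before swapping the NICs it may be worth confirming what the current 10G link actually delivers between two of the nodes, e.g. with iperf3 over the ceph network addresses used above:

Code:
# on one node, e.g. pve10
iperf3 -s

# on another node, e.g. pve3, test towards pve10 over the 10.11.12.0/24 network
iperf3 -c 10.11.12.10 -t 30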
 
From a Linux desktop VM running on Ceph, I ran a disk benchmark.

It got stuck a third of the way through.

Attached is dmesg output from that time.

Here is the top part of the file:
Code:
[  187.471084] sd 0:0:0:1: [sdb] tag#10 abort
[  363.593385] INFO: task kworker/u8:0:6 blocked for more than 120 seconds.
[  363.593391]       Not tainted 4.9.0-8-amd64 #1 Debian 4.9.130-2
[  363.593393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.593395] kworker/u8:0    D    0     6      2 0x00000000
[  363.593408] Workqueue: scsi_tmf_0 scmd_eh_abort_handler [scsi_mod]
[  363.593410]  ffff94346eb27a40 0000000000000000 ffff943473762e40 ffff943485d98980
[  363.593412]  ffff9434737961c0 ffffac0a818c7cb0 ffffffffbc414219 0000000000000001
[  363.593414]  00ffac0a818c7d28 ffff943485d98980 0000001301321092 ffff943473762e40
[  363.593417] Call Trace:
[  363.593424]  [<ffffffffbc414219>] ? __schedule+0x239/0x6f0
[  363.593426]  [<ffffffffbc414702>] ? schedule+0x32/0x80
[  363.593427]  [<ffffffffbc417a8d>] ? schedule_timeout+0x1dd/0x380
[  363.593429]  [<ffffffffbc415131>] ? wait_for_completion+0xf1/0x130
[  363.593432]  [<ffffffffbbea59a0>] ? wake_up_q+0x70/0x70
[  363.593434]  [<ffffffffc031f954>] ? virtscsi_tmf.constprop.12+0xa4/0x110 [virtio_scsi]
[  363.593438]  [<ffffffffc02e44c2>] ? scmd_eh_abort_handler+0x72/0x270 [scsi_mod]
[  363.593440]  [<ffffffffbbe9421a>] ? process_one_work+0x18a/0x420
[  363.593441]  [<ffffffffbbe944fd>] ? worker_thread+0x4d/0x490
[  363.593443]  [<ffffffffbbe944b0>] ? process_one_work+0x420/0x420
[  363.593444]  [<ffffffffbbe9a569>] ? kthread+0xd9/0xf0
[  363.593447]  [<ffffffffbc4190a4>] ? __switch_to_asm+0x34/0x70
[  363.593448]  [<ffffffffbbe9a490>] ? kthread_park+0x60/0x60
[  363.593449]  [<ffffffffbc419137>] ? ret_from_fork+0x57/0x70
[  363.593472] INFO: task gnome-disks:1393 blocked for more than 120 seconds.
[  363.593473]       Not tainted 4.9.0-8-amd64 #1 Debian 4.9.130-2
[  363.593474] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.593475] gnome-disks     D    0  1393   1274 0x00000000
[  363.593477]  ffff94346e0a59c0 0000000000000000 ffff94346fe93080 ffff943485d18980
[  363.593478]  ffff943473796f40 ffffac0a82cdba90 ffffffffbc414219 0000000000000000
[  363.593480]  00ff94346ec53850 ffff943485d18980 ffff94346e996210 ffff94346fe93080
[  363.593481] Call Trace:
 

I moved the disk to ZFS [it took 10 minutes for a 1GB disk].

From ZFS, the benchmark worked: 2 GB/s read, 208 MB/s write.

Then, when done, I moved the disk back to Ceph. The entire process took less than 15 seconds, more like 10.
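
For repeat tests it might be easier to drive the move from the CLI and time the whole task, including the final cleanup. The VMID and storage IDs below are just examples from this thread (2165, ceph_vm) plus an assumed local-zfs storage name:

Code:
# move scsi0 of VM 2165 to the ZFS storage and delete the source disk afterwards
time qm move_disk 2165 scsi0 local-zfs --delete

# move it back to the Ceph storage
time qm move_disk 2165 scsi0 ceph_vm --delete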
 
This sounds exactly like the behaviour I've been seeing. It started when I started using ZFS. I gave up.
 
I should add: I've moved back to EXT4, and the difference so far seems to be quite large. I would not be surprised at all to find out there is some kind of bug, either in the code or perhaps just in a default config somewhere, that's causing this.
 
So with ZFS you had the issue. Interesting... Is there a thread where you already posted info on your setup? I'd like to look for similarities. Thank you for the responses.