Ceph Network Speeds?

devinacosta

Active Member
Aug 3, 2017
I am running a dedicated 3-node Ceph cluster whose nodes are members of my normal Proxmox cluster; I just don't put any VMs on them. However, even on a 10 Gb network my performance seems to be suffering badly. The Ceph cluster has 15 disks, a combination of SSDs and spinning disks. I still don't get good network performance, and backups remain rather slow.

INFO: starting new backup job: vzdump 145 --remove 0 --compress lzo --mode snapshot --node virt01 --storage nfs-vprotect
INFO: Starting Backup of VM 145 (qemu)
INFO: status = running
INFO: update VM 145: -lock backup
INFO: VM Name: site.com
INFO: include disk 'virtio0' 'rbd_vm:vm-145-disk-1' 20G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/nfs-vprotect/dump/vzdump-qemu-145-2018_02_20-00_52_47.vma.lzo'
INFO: started backup task '48647b7f-7071-478f-b539-2489a5fcc588'
INFO: status: 0% (67108864/21474836480), sparse 0% (1007616), duration 3, read/write 22/22 MB/s
INFO: status: 1% (268435456/21474836480), sparse 0% (92766208), duration 19, read/write 12/6 MB/s
INFO: status: 2% (549453824/21474836480), sparse 1% (358797312), duration 22, read/write 93/4 MB/s
INFO: status: 8% (1849688064/21474836480), sparse 7% (1650491392), duration 25, read/write 433/2 MB/s
INFO: status: 14% (3116367872/21474836480), sparse 13% (2912976896), duration 28, read/write 422/1 MB/s
INFO: status: 15% (3300917248/21474836480), sparse 14% (3032576000), duration 31, read/write 61/21 MB/s
INFO: status: 16% (3456106496/21474836480), sparse 14% (3033239552), duration 42, read/write 14/14 MB/s
INFO: status: 17% (3657433088/21474836480), sparse 14% (3033575424), duration 46, read/write 50/50 MB/s
INFO: status: 18% (3892314112/21474836480), sparse 14% (3039993856), duration 61, read/write 15/15 MB/s
INFO: status: 19% (4089446400/21474836480), sparse 14% (3113443328), duration 67, read/write 32/20 MB/s
INFO: status: 20% (4295229440/21474836480), sparse 14% (3181645824), duration 72, read/write 41/27 MB/s
INFO: status: 21% (4534042624/21474836480), sparse 15% (3293413376), duration 77, read/write 47/25 MB/s
INFO: status: 23% (5125439488/21474836480), sparse 17% (3852619776), duration 81, read/write 147/8 MB/s
INFO: status: 33% (7176454144/21474836480), sparse 27% (5903634432), duration 84, read/write 683/0 MB/s
INFO: status: 36% (7923040256/21474836480), sparse 30% (6578941952), duration 87, read/write 248/23 MB/s
INFO: status: 37% (8019509248/21474836480), sparse 30% (6579134464), duration 90, read/write 32/32 MB/s
INFO: status: 38% (8160935936/21474836480), sparse 30% (6583504896), duration 95, read/write 28/27 MB/s
INFO: status: 39% (8376025088/21474836480), sparse 30% (6586396672), duration 111, read/write 13/13 MB/s
INFO: status: 41% (8921284608/21474836480), sparse 32% (7079772160), duration 116, read/write 109/10 MB/s
INFO: status: 49% (10703863808/21474836480), sparse 41% (8862351360), duration 119, read/write 594/0 MB/s
INFO: status: 57% (12423528448/21474836480), sparse 49% (10569449472), duration 122, read/write 573/4 MB/s
INFO: status: 58% (12482248704/21474836480), sparse 49% (10570231808), duration 125, read/write 19/19 MB/s
INFO: status: 59% (12708741120/21474836480), sparse 49% (10572955648), duration 139, read/write 16/15 MB/s
INFO: status: 60% (12914262016/21474836480), sparse 49% (10596102144), duration 150, read/write 18/16 MB/s
INFO: status: 63% (13585350656/21474836480), sparse 51% (11133800448), duration 157, read/write 95/19 MB/s
INFO: status: 72% (15573450752/21474836480), sparse 61% (13121900544), duration 160, read/write 662/0 MB/s
INFO: status: 78% (16793993216/21474836480), sparse 66% (14342443008), duration 163, read/write 406/0 MB/s
INFO: status: 79% (17024679936/21474836480), sparse 67% (14498418688), duration 166, read/write 76/24 MB/s
INFO: status: 80% (17179869184/21474836480), sparse 67% (14514917376), duration 178, read/write 12/11 MB/s
INFO: status: 81% (17397972992/21474836480), sparse 67% (14525210624), duration 190, read/write 18/17 MB/s
INFO: status: 82% (17616076800/21474836480), sparse 67% (14534901760), duration 206, read/write 13/13 MB/s
INFO: status: 84% (18056478720/21474836480), sparse 69% (14838226944), duration 216, read/write 44/13 MB/s
INFO: status: 94% (20367540224/21474836480), sparse 79% (17149288448), duration 219, read/write 770/0 MB/s
INFO: status: 100% (21474836480/21474836480), sparse 85% (18256584704), duration 221, read/write 553/0 MB/s
INFO: transferred 21474 MB in 221 seconds (97 MB/s)
INFO: archive file size: 1.39GB
INFO: Finished Backup of VM 145 (00:03:44)
INFO: Backup job finished successfully
TASK OK

I have added the following to my /etc/sysctl.conf:
vm.swappiness = 0
# allow testing with buffers up to 64MB
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
# increase Linux autotuning TCP buffer limit to 32MB
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
# recommended default congestion control is htcp
net.ipv4.tcp_congestion_control=htcp
# recommended for hosts with jumbo frames enabled
net.ipv4.tcp_mtu_probing=1
# recommended for CentOS7/Debian8 hosts
net.core.default_qdisc = fq
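
Settings in /etc/sysctl.conf only take effect once they are loaded, so it can be worth checking that the running kernel actually uses them. The following is a minimal sketch, assuming a Linux host where each key maps to a file under /proc/sys; the helper names `parse_sysctl` and `read_live` are made up for this illustration, and only three of the settings above are included as sample input.

```python
from pathlib import Path


def parse_sysctl(text):
    """Parse sysctl.conf-style text into {key: value}, skipping comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Collapse whitespace so "4096  87380" and "4096\t87380" compare equal.
        settings[key.strip()] = " ".join(value.split())
    return settings


def read_live(key, root="/proc/sys"):
    """Return the running kernel's value for key, or None if unavailable."""
    path = Path(root, *key.split("."))
    try:
        return " ".join(path.read_text().split())
    except OSError:
        return None


# Sample input: a subset of the settings from the post above.
wanted = parse_sysctl("""
# increase Linux autotuning TCP buffer limit to 32MB
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_congestion_control=htcp
net.core.default_qdisc = fq
""")

for key, value in wanted.items():
    live = read_live(key)
    status = "ok" if live == value else f"differs (live: {live})"
    print(f"{key} = {value}: {status}")
```

Any line reported as "differs" means the file and the running kernel disagree for that key, for example because the file was not reloaded.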

Any recommendations on what I should check or look into?
 
I have a similar issue here (without Ceph involved) between our PVE host and a FreeNAS NFS share (1 Gb connection):

Code:
INFO: starting new backup job: vzdump 104 --mailto admin@imedos.de --compress lzo --mailnotification failure --mode snapshot --storage NFSFreeNAS --quiet 1
INFO: Starting Backup of VM 104 (qemu)
INFO: status = running
INFO: update VM 104: -lock backup
INFO: VM Name: sv-ts.imedos.de
INFO: include disk 'scsi0' 'local-zfs-vms:vm-104-disk-1' 4024G
INFO: include disk 'virtio0' 'local-zfs-vms:vm-104-disk-2' 2T
INFO: include disk 'efidisk0' 'local-zfs-vms:vm-104-disk-3' 128K
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: snapshots found (not included into backup)
INFO: creating archive '/mnt/pve/NFSFreeNAS/dump/vzdump-qemu-104-2018_02_27-19_00_02.vma.lzo'
ERROR: VM 104 qmp command 'guest-fsfreeze-freeze' failed - got timeout
INFO: started backup task '85089741-095c-469a-8afd-a3aa8a1fcbb3'
INFO: status: 0% (266076160/6519760486400), sparse 0% (141086720), duration 3, read/write 88/41 MB/s
INFO: status: 1% (65364951040/6519760486400), sparse 0% (31379828736), duration 2909, read/write 22/11 MB/s
INFO: status: 2% (130441019392/6519760486400), sparse 1% (86762864640), duration 6228, read/write 19/2 MB/s
INFO: status: 3% (195660414976/6519760486400), sparse 1% (122006528000), duration 9764, read/write 18/8 MB/s
INFO: status: 4% (260985716736/6519760486400), sparse 2% (143553929216), duration 14769, read/write 13/8 MB/s
INFO: status: 5% (325988384768/6519760486400), sparse 2% (185033981952), duration 18618, read/write 16/6 MB/s
INFO: status: 6% (391352090624/6519760486400), sparse 3% (240760508416), duration 20680, read/write 31/4 MB/s
INFO: status: 7% (456691482624/6519760486400), sparse 4% (301469855744), duration 21348, read/write 97/6 MB/s
INFO: status: 8% (521608101888/6519760486400), sparse 5% (341577641984), duration 24079, read/write 23/9 MB/s
INFO: status: 9% (586992320512/6519760486400), sparse 6% (406958870528), duration 24350, read/write 241/0 MB/s
INFO: status: 10% (651991777280/6519760486400), sparse 7% (471957086208), duration 25456, read/write 58/0 MB/s
INFO: status: 11% (717237518336/6519760486400), sparse 8% (537202384896), duration 25721, read/write 246/0 MB/s
INFO: status: 12% (782666629120/6519760486400), sparse 9% (602630946816), duration 25943, read/write 294/0 MB/s
INFO: status: 13% (847817801728/6519760486400), sparse 10% (667781226496), duration 26147, read/write 319/0 MB/s
INFO: status: 14% (912877748224/6519760486400), sparse 11% (732839333888), duration 26647, read/write 130/0 MB/s
INFO: status: 15% (978179260416/6519760486400), sparse 12% (798139838464), duration 27218, read/write 114/0 MB/s
INFO: status: 16% (1043165151232/6519760486400), sparse 13% (863001923584), duration 27481, read/write 247/0 MB/s
INFO: status: 17% (1108447199232/6519760486400), sparse 14% (928275030016), duration 27722, read/write 270/0 MB/s
INFO: status: 18% (1173605842944/6519760486400), sparse 15% (993430626304), duration 27947, read/write 289/0 MB/s
INFO: status: 19% (1238876291072/6519760486400), sparse 16% (1058699649024), duration 28146, read/write 327/0 MB/s
INFO: status: 20% (1304082382848/6519760486400), sparse 17% (1123892649984), duration 28343, read/write 330/0 MB/s
INFO: status: 21% (1369291161600/6519760486400), sparse 18% (1189100740608), duration 28535, read/write 339/0 MB/s
INFO: status: 22% (1434406617088/6519760486400), sparse 19% (1254164766720), duration 28749, read/write 304/0 MB/s
INFO: status: 23% (1499561721856/6519760486400), sparse 20% (1316474826752), duration 29133, read/write 169/7 MB/s
INFO: status: 24% (1565048438784/6519760486400), sparse 21% (1381714604032), duration 29468, read/write 195/0 MB/s
INFO: status: 25% (1630081253376/6519760486400), sparse 22% (1446134788096), duration 29725, read/write 253/2 MB/s
INFO: status: 26% (1695195529216/6519760486400), sparse 23% (1503014875136), duration 30708, read/write 66/8 MB/s
INFO: status: 27% (1760335560704/6519760486400), sparse 24% (1568154906624), duration 30990, read/write 230/0 MB/s
INFO: status: 28% (1825840496640/6519760486400), sparse 25% (1633659842560), duration 31199, read/write 313/0 MB/s
INFO: status: 29% (1890905030656/6519760486400), sparse 26% (1698724376576), duration 31403, read/write 318/0 MB/s
INFO: status: 30% (1956173119488/6519760486400), sparse 27% (1763992465408), duration 31604, read/write 324/0 MB/s
INFO: status: 31% (2021264916480/6519760486400), sparse 28% (1829084262400), duration 31807, read/write 320/0 MB/s
INFO: status: 32% (2086432866304/6519760486400), sparse 29% (1894252212224), duration 32012, read/write 317/0 MB/s
INFO: status: 33% (2151784185856/6519760486400), sparse 30% (1959603531776), duration 32220, read/write 314/0 MB/s
INFO: status: 34% (2216898199552/6519760486400), sparse 31% (2024717545472), duration 32431, read/write 308/0 MB/s
INFO: status: 35% (2282061496320/6519760486400), sparse 32% (2089880842240), duration 32632, read/write 324/0 MB/s
INFO: status: 36% (2347324997632/6519760486400), sparse 33% (2155144343552), duration 32837, read/write 318/0 MB/s
INFO: status: 37% (2412609142784/6519760486400), sparse 34% (2220428488704), duration 33047, read/write 310/0 MB/s
INFO: status: 38% (2477597982720/6519760486400), sparse 35% (2285417328640), duration 33255, read/write 312/0 MB/s
INFO: status: 39% (2542908211200/6519760486400), sparse 36% (2350727557120), duration 33463, read/write 313/0 MB/s
INFO: status: 40% (2608110698496/6519760486400), sparse 37% (2415930044416), duration 33664, read/write 324/0 MB/s
INFO: status: 41% (2673340776448/6519760486400), sparse 38% (2481160122368), duration 33868, read/write 319/0 MB/s
INFO: status: 42% (2738369462272/6519760486400), sparse 39% (2546188808192), duration 34081, read/write 305/0 MB/s
INFO: status: 43% (2803759513600/6519760486400), sparse 40% (2611578859520), duration 34292, read/write 309/0 MB/s
INFO: status: 44% (2868729937920/6519760486400), sparse 41% (2676549283840), duration 34495, read/write 320/0 MB/s
INFO: status: 45% (2934177595392/6519760486400), sparse 42% (2741996941312), duration 34702, read/write 316/0 MB/s
INFO: status: 46% (2999431069696/6519760486400), sparse 43% (2807250415616), duration 34901, read/write 327/0 MB/s
INFO: status: 47% (3064596201472/6519760486400), sparse 44% (2872415547392), duration 35105, read/write 319/0 MB/s
INFO: status: 48% (3129725091840/6519760486400), sparse 45% (2937544437760), duration 35310, read/write 317/0 MB/s
INFO: status: 49% (3194802536448/6519760486400), sparse 46% (3002621882368), duration 35516, read/write 315/0 MB/s
INFO: status: 50% (3259907768320/6519760486400), sparse 47% (3067727101952), duration 35719, read/write 320/0 MB/s
INFO: status: 51% (3325383016448/6519760486400), sparse 48% (3133202350080), duration 35923, read/write 320/0 MB/s
INFO: status: 52% (3390415699968/6519760486400), sparse 49% (3198235033600), duration 36121, read/write 328/0 MB/s
INFO: status: 53% (3455606063104/6519760486400), sparse 50% (3263425396736), duration 36330, read/write 311/0 MB/s
INFO: status: 54% (3520946307072/6519760486400), sparse 51% (3328765640704), duration 36536, read/write 317/0 MB/s
INFO: status: 55% (3586092171264/6519760486400), sparse 52% (3393911504896), duration 36738, read/write 322/0 MB/s
INFO: status: 56% (3651306061824/6519760486400), sparse 53% (3459125395456), duration 36940, read/write 322/0 MB/s
INFO: status: 57% (3716322230272/6519760486400), sparse 54% (3524141563904), duration 37141, read/write 323/0 MB/s
INFO: status: 58% (3781662212096/6519760486400), sparse 55% (3589481545728), duration 37348, read/write 315/0 MB/s
INFO: status: 59% (3846694830080/6519760486400), sparse 56% (3654514163712), duration 37548, read/write 325/0 MB/s
INFO: status: 60% (3912090124288/6519760486400), sparse 57% (3719909457920), duration 37756, read/write 314/0 MB/s
INFO: status: 61% (3977285992448/6519760486400), sparse 58% (3785105326080), duration 37957, read/write 324/0 MB/s
INFO: status: 62% (4042395156480/6519760486400), sparse 59% (3850214490112), duration 38162, read/write 317/0 MB/s
INFO: status: 63% (4107522146304/6519760486400), sparse 60% (3915341479936), duration 38367, read/write 317/0 MB/s
INFO: status: 64% (4172677513216/6519760486400), sparse 61% (3980496846848), duration 38572, read/write 317/0 MB/s
INFO: status: 65% (4238146928640/6519760486400), sparse 62% (4045966262272), duration 38779, read/write 316/0 MB/s
INFO: status: 66% (4303070232576/6519760486400), sparse 63% (4110889566208), duration 38977, read/write 327/0 MB/s
INFO: status: 67% (4368240082944/6519760486400), sparse 63% (4128875204608), duration 49275, read/write 6/4 MB/s
INFO: status: 68% (4433440145408/6519760486400), sparse 63% (4129040670720), duration 62887, read/write 4/4 MB/s
INFO: status: 69% (4498636144640/6519760486400), sparse 63% (4129392726016), duration 75856, read/write 5/4 MB/s
INFO: status: 70% (4563834765312/6519760486400), sparse 63% (4129394454528), duration 89924, read/write 4/4 MB/s
INFO: status: 71% (4629032271872/6519760486400), sparse 63% (4129394507776), duration 106484, read/write 3/3 MB/s
INFO: status: 72% (4694231875584/6519760486400), sparse 63% (4129435127808), duration 120721, read/write 4/4 MB/s
INFO: status: 73% (4759426695168/6519760486400), sparse 63% (4129435299840), duration 134890, read/write 4/4 MB/s

Interestingly, the backup slows down every time at about 60%. I measured the native NFS speed between the Proxmox host and the FreeNAS NFS share at between 40 and 90 MB/sec during working hours, depending on the current network load.

I see nearly the same behavior between this PVE host and another NFS share on a different Debian host (also 1 Gb Ethernet). So I guess the issue can only be solved on the PVE side.

So the question is: what makes the VM backup drop to around 4 MB/sec, and how can this be fixed?

My Zpool config:
Code:
 zpool status
  pool: rz2pool
 state: ONLINE
  scan: scrub repaired 0B in 169h30m with 0 errors on Sun Feb 18 01:54:30 2018
config:

        NAME                                                    STATE     READ WRITE CKSUM
        rz2pool                                                 ONLINE       0     0     0
          raidz2-0                                              ONLINE       0     0     0
            scsi-35000cca01a834670                              ONLINE       0     0     0
            scsi-35000cca01a791e24                              ONLINE       0     0     0
            scsi-35000cca01a769f54                              ONLINE       0     0     0
            scsi-35000cca01a83cd30                              ONLINE       0     0     0
          raidz2-1                                              ONLINE       0     0     0
            scsi-35000cca01a841db0                              ONLINE       0     0     0
            scsi-35000cca01a832f30                              ONLINE       0     0     0
            scsi-35000cca01a69f22c                              ONLINE       0     0     0
            scsi-35000c50041ca84ab                              ONLINE       0     0     0
        logs
          mirror-2                                              ONLINE       0     0     0
            nvme-INTEL_MEMPEK1W016GA_PHBT713102PJ016D           ONLINE       0     0     0
            nvme-INTEL_MEMPEK1W016GA_PHBT713102C5016D           ONLINE       0     0     0
        cache
          nvme-Samsung_SSD_960_EVO_250GB_S3ESNX0J315970N-part1  ONLINE       0     0     0
          nvme-Samsung_SSD_960_EVO_250GB_S3ESNX0J315973V-part1  ONLINE       0     0     0

pveperf:
Code:
pveperf /rz2pool/vm/
CPU BOGOMIPS:      165985.60
REGEX/SECOND:      947299
HD SIZE:           945.54 GB (rz2pool/vm)
FSYNCS/SECOND:     4151.47
 
INFO: transferred 21474 MB in 221 seconds (97 MB/s)
@Gerald Floßmann & @devinacosta, the MB/s figure (see the quote above) is the important number if you want to know the overall throughput. The numbers at the end of every line show how much was read/written since the previous update of the output. As your files are sparse, there is not much to read or write from them in between updates.
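
To make the difference between those numbers concrete, here is a rough sketch that recomputes them from a few status lines copied from the first log. The regular expression and the function names are my own for this illustration, not part of vzdump, and 1 MB is assumed to be 10^6 bytes, which matches the "21474 MB" summary line.

```python
import re

STATUS_RE = re.compile(
    r"status: \d+% \((\d+)/\d+\), sparse \d+% \((\d+)\), duration (\d+),"
)


def parse_status(lines):
    """Return (done_bytes, sparse_bytes, seconds) for each status line."""
    samples = []
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            samples.append(tuple(int(g) for g in m.groups()))
    return samples


def interval_rates(samples):
    """Read/write MB/s between consecutive updates, truncated to whole MB/s."""
    rates = []
    for (d0, s0, t0), (d1, s1, t1) in zip(samples, samples[1:]):
        read = (d1 - d0) / (t1 - t0) / 1e6
        # Sparse (all-zero) regions are read but never written to the archive.
        write = ((d1 - d0) - (s1 - s0)) / (t1 - t0) / 1e6
        rates.append((int(read), int(write)))
    return rates


def summarize(samples):
    """Whole-job throughput, with and without the sparse regions."""
    done, sparse, seconds = samples[-1]
    return {
        "overall_mb_s": done / seconds / 1e6,
        "non_sparse_mb_s": (done - sparse) / seconds / 1e6,
    }


log = [
    "INFO: status: 0% (67108864/21474836480), sparse 0% (1007616), duration 3, read/write 22/22 MB/s",
    "INFO: status: 1% (268435456/21474836480), sparse 0% (92766208), duration 19, read/write 12/6 MB/s",
    "INFO: status: 100% (21474836480/21474836480), sparse 85% (18256584704), duration 221, read/write 553/0 MB/s",
]
samples = parse_status(log)
first_interval = interval_rates(samples[:2])
totals = summarize(samples)
print(first_interval, totals)
```

The first interval reproduces the "12/6" shown in the log, and the overall figure reproduces the 97 MB/s summary. The non-sparse figure, roughly 14.6 MB/s, is the rate at which real data was actually read and written during this job.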

And for max throughput on 1GbE: 1 Gb/s ~ 120 MB/s (theoretical)
https://kb.netapp.com/app/answers/answer_view/a_id/1003832
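
For reference, the ~120 MB/s figure can be sanity-checked with a little arithmetic. This sketch assumes a standard 1500-byte MTU, IPv4 and TCP headers without options, and the usual Ethernet framing overhead; it ignores NFS/RPC overhead, so real transfers will land somewhat lower.

```python
RAW_OVERHEAD = 8 + 14 + 4 + 12   # preamble+SFD, Ethernet header, FCS, inter-frame gap
IP_TCP_HEADERS = 20 + 20         # IPv4 + TCP, no options (assumption)


def max_tcp_payload_mb_s(link_bits_per_s, mtu=1500):
    """Upper bound on TCP payload throughput in MB/s (1 MB = 10**6 bytes)."""
    raw_bytes_per_s = link_bits_per_s / 8
    payload = mtu - IP_TCP_HEADERS      # useful bytes per frame
    on_wire = mtu + RAW_OVERHEAD        # bytes each frame occupies on the link
    return raw_bytes_per_s * payload / on_wire / 1e6


gbe = max_tcp_payload_mb_s(1_000_000_000)
ten_gbe = max_tcp_payload_mb_s(10_000_000_000)
print(f"1 GbE:  {gbe:.1f} MB/s")
print(f"10 GbE: {ten_gbe:.1f} MB/s")
```

Under these assumptions 1 GbE tops out at about 118.7 MB/s of payload, so the 40 to 90 MB/sec measured over NFS during working hours is already a large fraction of what the link can carry.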