PBS backup pool performance issues

andy77

Hi @all,

I have a problem with the performance of the pools on my newly set up PBS. We have two clusters with 20 nodes each, and each cluster backs up to the PBS on its own pool of 12 disks. The performance of the ZFS pools is quite bad. In our initial test with a Windows software mirror we got quite high write performance, around 2000 MB/s. Now, with the ZFS pools, we only get around 100 MB/s per pool on real backup tasks, which is not enough to finish backups without "timeouts".

Here is the HW Config of the server:

16-core, 1.8 GHz Intel Scalable processor
64 GB RAM
3 Supermicro HBA controllers connecting 24 Toshiba disks

2 pools, each:
raidz2 - 12x Toshiba 14 TB 7200 rpm enterprise disks


Any idea what we can do? Should we add a "special device" or a "ZIL"... What makes sense?
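For illustration only (not something settled in this thread): adding a mirrored "special device" for metadata could look roughly like the sketch below. The NVMe device paths are placeholders, and note that a special vdev cannot be removed again from a pool whose data vdevs are raidz.

Code:
# hypothetical sketch: attach a mirrored special vdev to the pool "storage1"
zpool add storage1 special mirror /dev/disk/by-id/nvme-SSD_A /dev/disk/by-id/nvme-SSD_B
# optionally also store very small blocks (not the 1-4 MiB data chunks) on it
zfs set special_small_blocks=4K storage1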


Here is the benchmark of our pools:
Code:
root@br1pxbck1:~# proxmox-backup-client benchmark --repository dc1c01n
Uploaded 871 chunks in 5 seconds.
Time per request: 5785 microseconds.
TLS speed: 724.94 MB/s
SHA256 speed: 231.14 MB/s
Compression speed: 356.38 MB/s
Decompress speed: 642.15 MB/s
AES256/GCM speed: 1328.53 MB/s
Verify speed: 172.60 MB/s
┌───────────────────────────────────┬────────────────────┐
│ Name                              │ Value              │
╞═══════════════════════════════════╪════════════════════╡
│ TLS (maximal backup upload speed) │ 724.94 MB/s (59%)  │
├───────────────────────────────────┼────────────────────┤
│ SHA256 checksum computation speed │ 231.14 MB/s (11%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 compression speed    │ 356.38 MB/s (47%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 decompression speed  │ 642.15 MB/s (54%)  │
├───────────────────────────────────┼────────────────────┤
│ Chunk verification speed          │ 172.60 MB/s (23%)  │
├───────────────────────────────────┼────────────────────┤
│ AES256 GCM encryption speed       │ 1328.53 MB/s (36%) │
└───────────────────────────────────┴────────────────────┘
root@br1pxbck1:~# proxmox-backup-client benchmark --repository dc1c02n
Uploaded 858 chunks in 5 seconds.
Time per request: 5871 microseconds.
TLS speed: 714.32 MB/s
SHA256 speed: 233.66 MB/s
Compression speed: 369.69 MB/s
Decompress speed: 684.01 MB/s
AES256/GCM speed: 1348.39 MB/s
Verify speed: 173.01 MB/s
┌───────────────────────────────────┬────────────────────┐
│ Name                              │ Value              │
╞═══════════════════════════════════╪════════════════════╡
│ TLS (maximal backup upload speed) │ 714.32 MB/s (58%)  │
├───────────────────────────────────┼────────────────────┤
│ SHA256 checksum computation speed │ 233.66 MB/s (12%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 compression speed    │ 369.69 MB/s (49%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 decompression speed  │ 684.01 MB/s (57%)  │
├───────────────────────────────────┼────────────────────┤
│ Chunk verification speed          │ 173.01 MB/s (23%)  │
├───────────────────────────────────┼────────────────────┤
│ AES256 GCM encryption speed       │ 1348.39 MB/s (37%) │
└───────────────────────────────────┴────────────────────┘
 
the benchmark does not test actual disk performance, just network/crypto/hashing/compression. your hashing performance is very bad, so you won't ever see speeds above 170-200MB/s. your zfs pools are also using raidz, which adds an IOPS bottleneck as well (raidz means you are at the level of a single disk IOPS-wise, and PBS backups store lots of small chunks in the 1-4MB range)
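A hedged way to see why the chunk size matters is to look at the datastore's chunk store directly (assuming the datastore lives under /mnt/datastore/storage1, as in the fio runs later in the thread):

Code:
# number of chunk files in the PBS chunk store
find /mnt/datastore/storage1/.chunks -type f | wc -l
# total size on disk; total size / file count gives a rough average chunk size
du -sh /mnt/datastore/storage1/.chunks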
 
Hi Fabian,

thanks for your reply. Why do you think the hashing performance is bad?
What would be your suggestion? What should I change?
 
there is something wrong. zfs is not that bad. what is the performance of a single hdd?
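A hedged way to measure that would be a read-only fio run against one of the raw disks (/dev/sdX is a placeholder):

Code:
# sequential read throughput of a single disk; --readonly prevents any writes
fio --name=single-disk-read --filename=/dev/sdX --rw=read --bs=1M --direct=1 \
    --runtime=30 --time_based --readonly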
 
andy77 said:
Hi Fabian,
thanks for your reply. Why do you think the hashing performance is bad?
What would be your suggestion? What should I change?
usually it's the CPU ;) you didn't give a concrete model..
 
that one is not a properly released CPU (it's an engineering sample) - no idea what kind of performance it should have.
 
@fabian: the HDDs support 4Kn sectors. maybe you can take advantage of that with zfs.

@andy: is it already in use? is it such a model?
 
@floh8 we already use ashift=12, which I think is correct for 4K sectors

Code:
rpool     ashift    12      local
storage1  ashift    12      local
storage2  ashift    12      local
 
Yes, I know. It was lying around here, so we built the PBS with it. So you think the CPU is our main bottleneck?
the hashing / verification benchmark only taxes the CPU, and is the lowest throughput in the synthetic benchmarks the client runs. it's possible your network or disks are slower of course (that is not covered by the benchmarks, or at least not how you ran them - the network part would be covered if you run the benchmark on your PVE node against the PBS datastore).
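For reference, the repository syntax for such a network-covering run is <user>@<realm>@<pbs-address>:<datastore> (placeholders); a later post in this thread shows a real invocation:

Code:
proxmox-backup-client benchmark --repository <user>@pbs@<pbs-address>:<datastore>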
 
test "write"

Code:
root@br1pxbck1:~# fio --rw=write --name=/mnt/datastore/storage1/fiotest --size=4G   
/mnt/datastore/storage1/fiotest: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=435MiB/s][w=111k IOPS][eta 00m:00s]
/mnt/datastore/storage1/fiotest: (groupid=0, jobs=1): err= 0: pid=2810859: Tue Aug 10 13:53:38 2021
  write: IOPS=110k, BW=428MiB/s (449MB/s)(4096MiB/9565msec); 0 zone resets
    clat (usec): min=6, max=9382, avg= 8.68, stdev=13.81
     lat (usec): min=6, max=9382, avg= 8.75, stdev=13.81
    clat percentiles (usec):
     |  1.00th=[    7],  5.00th=[    7], 10.00th=[    7], 20.00th=[    7],
     | 30.00th=[    7], 40.00th=[    7], 50.00th=[    7], 60.00th=[    7],
     | 70.00th=[    7], 80.00th=[    7], 90.00th=[    8], 95.00th=[   10],
     | 99.00th=[   64], 99.50th=[   65], 99.90th=[   74], 99.95th=[  123],
     | 99.99th=[  190]
   bw (  KiB/s): min=397144, max=460376, per=100.00%, avg=439360.84, stdev=17203.91, samples=19
   iops        : min=99286, max=115094, avg=109840.21, stdev=4300.98, samples=19
  lat (usec)   : 10=95.13%, 20=1.88%, 50=0.01%, 100=2.91%, 250=0.07%
  lat (usec)   : 500=0.01%, 750=0.01%
  lat (msec)   : 10=0.01%
  cpu          : usr=8.69%, sys=91.04%, ctx=95, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1048576,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=428MiB/s (449MB/s), 428MiB/s-428MiB/s (449MB/s-449MB/s), io=4096MiB (4295MB), run=9565-9565msec


test "randwrite"

Code:
root@br1pxbck1:~# fio --rw=randwrite --name=/mnt/datastore/storage1/fiotest --size=4G --direct=1
/mnt/datastore/storage1/fiotest: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [w(1)][96.6%][w=136MiB/s][w=34.7k IOPS][eta 00m:02s]
/mnt/datastore/storage1/fiotest: (groupid=0, jobs=1): err= 0: pid=1848925: Tue Aug 10 13:41:43 2021
  write: IOPS=18.8k, BW=73.5MiB/s (77.1MB/s)(4096MiB/55692msec); 0 zone resets
    clat (usec): min=6, max=24233, avg=51.96, stdev=107.07
     lat (usec): min=6, max=24233, avg=52.07, stdev=107.08
    clat percentiles (usec):
     |  1.00th=[    9],  5.00th=[    9], 10.00th=[    9], 20.00th=[   11],
     | 30.00th=[   12], 40.00th=[   19], 50.00th=[   64], 60.00th=[   65],
     | 70.00th=[   67], 80.00th=[   72], 90.00th=[   93], 95.00th=[  121],
     | 99.00th=[  172], 99.50th=[  184], 99.90th=[  578], 99.95th=[ 1369],
     | 99.99th=[ 4752]
   bw (  KiB/s): min=54888, max=219528, per=99.60%, avg=75013.87, stdev=19741.31, samples=111
   iops        : min=13722, max=54882, avg=18753.45, stdev=4935.32, samples=111
  lat (usec)   : 10=16.95%, 20=23.77%, 50=2.11%, 100=48.75%, 250=8.26%
  lat (usec)   : 500=0.06%, 750=0.02%, 1000=0.02%
  lat (msec)   : 2=0.03%, 4=0.02%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=2.75%, sys=92.92%, ctx=7069, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1048576,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=73.5MiB/s (77.1MB/s), 73.5MiB/s-73.5MiB/s (77.1MB/s-77.1MB/s), io=4096MiB (4295MB), run=55692-55692msec
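Note that these fio runs use 4k blocks, and the first one (without --direct) largely measures the ARC/cache rather than the disks. A sketch of a test closer to PBS's 1-4 MiB chunk writes might look like this (parameters are illustrative only):

Code:
# hypothetical: sequential writes in 4 MiB blocks, flushed at the end,
# to better approximate PBS chunk writes than the 4k tests above
fio --name=/mnt/datastore/storage1/fiotest-4m --rw=write --bs=4M --size=8G \
    --ioengine=psync --direct=1 --numjobs=4 --group_reporting --end_fsync=1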
 
you wrote that you use 2 clusters.
when are the backup tasks scheduled? do they run in parallel?

run a write test from one of the PVE hosts to the PBS - e.g. via scp
 
Via SCP we get around 370 MB/s from a PVE host to one of the storage pools.
An iperf3 test reaches around 9.7 Gbit/s, which is the full speed of the 10 Gbit network adapters.

Running a backup task of one VM from a PVE host to the PBS results in about 100 MB/s.
 
And here is the benchmark from the PVE host to the PBS:
Code:
root@DC1C01N02:~# proxmox-backup-client benchmark --repository test@pbs@10.10.1.101:dc1c01d
Are you sure you want to continue connecting? (y/n): y
Uploaded 724 chunks in 5 seconds.
Time per request: 6940 microseconds.
TLS speed: 604.29 MB/s
SHA256 speed: 567.79 MB/s
Compression speed: 840.67 MB/s
Decompress speed: 1298.42 MB/s
AES256/GCM speed: 3221.27 MB/s
Verify speed: 393.63 MB/s
┌───────────────────────────────────┬─────────────────────┐
│ Name                              │ Value               │
╞═══════════════════════════════════╪═════════════════════╡
│ TLS (maximal backup upload speed) │ 604.29 MB/s (49%)   │
├───────────────────────────────────┼─────────────────────┤
│ SHA256 checksum computation speed │ 567.79 MB/s (28%)   │
├───────────────────────────────────┼─────────────────────┤
│ ZStd level 1 compression speed    │ 840.67 MB/s (112%)  │
├───────────────────────────────────┼─────────────────────┤
│ ZStd level 1 decompression speed  │ 1298.42 MB/s (108%) │
├───────────────────────────────────┼─────────────────────┤
│ Chunk verification speed          │ 393.63 MB/s (52%)   │
├───────────────────────────────────┼─────────────────────┤
│ AES256 GCM encryption speed       │ 3221.27 MB/s (88%)  │
└───────────────────────────────────┴─────────────────────┘
 
i think the basic configuration for network and storage looks good.
i'm not able to interpret the proxmox benchmark.
i think the proxmox staff should say what to test next.
 
Code:
Task viewer: VM/CT 250 - Backup
INFO: starting new backup job: vzdump 250 --node DC1C01N02 --remove 0 --mode snapshot --storage BR1PXBCK1
INFO: Starting Backup of VM 250 (qemu)
INFO: Backup started at 2021-08-10 10:19:28
INFO: status = running
INFO: VM Name: test25
INFO: include disk 'scsi0' 'local-zfs:vm-250-disk-0' 50G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/250/2021-08-10T08:19:28Z'
INFO: skipping guest-agent 'fs-freeze', agent configured but not running?
INFO: started backup task '2427070e-60ab-40b8-b838-2da17609282c'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: OK (2.3 GiB of 50.0 GiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 2.3 GiB dirty of 50.0 GiB total
INFO: 26% (640.0 MiB of 2.3 GiB) in 3s, read: 213.3 MiB/s, write: 205.3 MiB/s
INFO: 38% (924.0 MiB of 2.3 GiB) in 6s, read: 94.7 MiB/s, write: 93.3 MiB/s
INFO: 50% (1.2 GiB of 2.3 GiB) in 9s, read: 90.7 MiB/s, write: 85.3 MiB/s
INFO: 60% (1.4 GiB of 2.3 GiB) in 12s, read: 77.3 MiB/s, write: 74.7 MiB/s
INFO: 73% (1.7 GiB of 2.3 GiB) in 15s, read: 106.7 MiB/s, write: 100.0 MiB/s
INFO: 84% (2.0 GiB of 2.3 GiB) in 18s, read: 84.0 MiB/s, write: 73.3 MiB/s
INFO: 97% (2.3 GiB of 2.3 GiB) in 21s, read: 104.0 MiB/s, write: 97.3 MiB/s
INFO: 100% (2.3 GiB of 2.3 GiB) in 24s, read: 20.0 MiB/s, write: 20.0 MiB/s
INFO: backup was done incrementally, reused 47.80 GiB (95%)
INFO: transferred 2.32 GiB in 26 seconds (91.2 MiB/s)
INFO: Finished Backup of VM 250 (00:00:29)
INFO: Backup finished at 2021-08-10 10:19:57
INFO: Backup job finished successfully
TASK OK
 
