PBS backing up to RBD ceph volume- Alternatives?

subookiesix · Apr 20, 2021

Hi,

I'm a first time poster, looking for some advice / guidance from the community on my issue.

We are experiencing intermittent terrible backup performance and our third party ceph consultants have been unable to identify the root cause of the issue.
When the backups are working fine, we can use PBS to back up an entire node of VMs in around 2 to 3 hours. When the performance takes a hit, it can take 18 hours to back up a node.

We have a 25-node production Proxmox cluster running around 375 VMs split across 24 of the nodes, with the remaining node set up as a PBS node:
CPU(s) 48 x AMD EPYC 7401P 24-Core Processor (1 Socket)
256GB RAM
2x 32GB Supermicro in ZFS RAID1 for OS
8x 1.92TB Samsung PM883 2.5" Enterprise SSD, SATA 3.3 (6Gb/s), TLC 3D/V-NAND, 550MB/s Read, 520MB/s Write, 98k/25k IOPS
NIC1: 1Gb (primary interface)
NIC2: 40Gb (backup interface)

We also have a 3-node Ceph cluster, connected to the 40Gb network, which we are using as backup storage. Here is the spec of the backup nodes:
Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
128GB RAM
12x 14TB 7.2K, 6Gbps, Toshiba MG07ACA14TA

On our ceph cluster we have set up two RBD pools:
backup-drives
backup-pbs

The backup-drives pool provides each of the ~375 VMs with a backup mount point, allowing each client to keep its own local backups.
The backup-pbs pool is exposed to our PBS node and is used for image backups in the event of a catastrophic failure.

Our ceph consultants have spent considerable time investigating to no avail. I'm wondering about cutting our losses and seeing what the alternatives out there are and what you can suggest in terms of re-architecting the existing hardware into something more performant.
After 5 drive failures and extended periods of poor performance we are currently thinking of moving away from ceph entirely and using standalone backup nodes, though we are open to other suggestions/recommendations.

Kindest Regards,
S

aaron · Apr 26, 2021

Is PBS one VM in the PVE cluster with a large disk on the external backup-pbs pool?
The recommended way is to use a large dedicated machine for PBS. Otherwise the PBS machine receives backups over the network and also has to store them over the network, which adds bandwidth usage and latency.

subookiesix said:
When the performance takes a hit, it can take 18 hours to back up a node.

Are those backups done incrementally on the PVE side? You can check the backup logs to see if you find lines like the following:

Code:

INFO: using fast incremental mode (dirty-bitmap), 8.1 GiB dirty of 60.0 GiB total

If not, or if almost all of the disk is considered dirty, this can increase the backup time considerably, because large parts of the VM disk, or all of it, need to be read.
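To get an overview across many backup logs, the dirty share can be pulled out of that log line with a few lines of Python. This is only a sketch: the regular expression is modelled on the single example line above, and the function name and the set of size units are assumptions, not something taken from the PVE sources.

```python
import re

# Modelled on the example log line quoted above:
# "INFO: using fast incremental mode (dirty-bitmap), 8.1 GiB dirty of 60.0 GiB total"
# Assumption: sizes are always printed with one of the binary units listed below.
DIRTY_RE = re.compile(
    r"dirty-bitmap\), ([\d.]+) (KiB|MiB|GiB|TiB) dirty of ([\d.]+) (KiB|MiB|GiB|TiB) total"
)
UNIT = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}


def dirty_fraction(line):
    """Return dirty/total as a float, or None if this is not a dirty-bitmap line."""
    m = DIRTY_RE.search(line)
    if m is None:
        return None
    dirty = float(m.group(1)) * UNIT[m.group(2)]
    total = float(m.group(3)) * UNIT[m.group(4)]
    return dirty / total if total else None
```

A VM whose log has no such line (the function returns None for every line), or whose fraction is close to 1.0, is effectively being read in full.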

Other than that, the usual suspects for performance problems include packet loss on the network.

Having some kind of performance monitoring that stores these metrics can help as well to narrow down the cause.
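As a rough illustration of what such a metric could look like, the sketch below parses the Linux `/proc/net/dev` counters and computes how much the error and drop counts grew between two samples. The function names are made up for this example, and these counters only show errors and drops seen by the local interface, so they are a hint rather than proof of packet loss along the whole path.

```python
def parse_net_dev(text):
    """Parse the contents of /proc/net/dev into per-interface error/drop counters."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = [int(x) for x in data.split()]
        if len(fields) < 12:
            continue
        # Column order: 8 receive fields (bytes, packets, errs, drop, ...)
        # followed by the transmit fields (bytes, packets, errs, drop, ...).
        counters[name.strip()] = {
            "rx_errs": fields[2],
            "rx_drop": fields[3],
            "tx_errs": fields[10],
            "tx_drop": fields[11],
        }
    return counters


def counter_delta(before, after):
    """Per-interface increase of each counter between two samples."""
    return {
        iface: {key: after[iface][key] - before[iface][key] for key in after[iface]}
        for iface in after
        if iface in before
    }
```

Reading `/proc/net/dev` once before and once after a backup run and passing both results to `counter_delta` gives a number that can be logged next to the backup duration.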

subookiesix · Apr 27, 2021

Hi Aaron,

Thanks for your response.

Our PBS install is on a dedicated machine and is not a VM - sorry this was not clearer in my initial description.

The backups that took 18 hours were a mixture of full and incremental. However, even when only incremental backups are performed, performance is still sometimes very poor.

Thank you for the suggestion of performance monitoring. This is something we are going to review, as we obviously need some definitive metrics to quantify the performance degradation. Do you have any suggestions or advice regarding software choices for performance monitoring? Ideally this would be something free.

Kindest Regards,
S

aaron · Apr 27, 2021

subookiesix said:
Do you have any suggestions or advice regarding software choices for performance monitoring? Ideally this would be something free.

There are quite a few open source options available, for example Checkmk, Zabbix, or InfluxDB 2 with Telegraf.

wigor · Apr 27, 2021

Hey,

I don't think the Ceph people will have missed this, but given that it's sometimes good and sometimes bad: have you checked that no scrubbing is active during the bad times?
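One way to check this over a longer period is to record how many placement groups are scrubbing at the start of each backup. The sketch below counts them from the parsed output of `ceph status --format json`. It assumes that output has a `pgmap.pgs_by_state` list whose entries carry `state_name` and `count`; please verify that against your Ceph release. The function name is made up for this example.

```python
def scrubbing_pgs(status):
    """Count PGs in a scrubbing state from parsed `ceph status --format json` output.

    Assumption: `status["pgmap"]["pgs_by_state"]` is a list of dicts such as
    {"state_name": "active+clean+scrubbing+deep", "count": 2}.
    """
    counts = {"scrubbing": 0, "deep": 0}
    for entry in status.get("pgmap", {}).get("pgs_by_state", []):
        flags = entry.get("state_name", "").split("+")
        if "scrubbing" in flags:
            counts["scrubbing"] += entry.get("count", 0)
            if "deep" in flags:  # deep scrubs are a subset of all scrubs
                counts["deep"] += entry.get("count", 0)
    return counts
```

Logging this result together with each backup's start time and duration makes it possible to see afterwards whether the slow runs line up with scrubbing.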

subookiesix · Apr 27, 2021

Hey Wigor,

Thanks for the suggestion.

The poor performance does not seem to correlate with scrubbing or deep scrubbing. Sometimes it's slow during scrubbing, sometimes it's fine during scrubbing.

Apologies, I should have been clearer in my initial post.

Kindest Regards,
S
