Hi,
I'm a first-time poster looking for some advice and guidance from the community on my issue.
We are experiencing intermittent periods of terrible backup performance, and our third-party Ceph consultants have been unable to identify the root cause.
When backups are working fine, we can use PBS to back up an entire node of VMs in around 2-3 hours. When performance takes a hit, it can take 18 hours to back up a node.
We have a 25-node production Proxmox cluster running around 375 VMs split across 24 of the nodes, with the remaining node set up as a PBS node. Each node has the following spec:
CPU(s) 48 x AMD EPYC 7401P 24-Core Processor (1 Socket)
256GB RAM
2x 32GB Supermicro in ZFS RAID1 for OS
8x 1.92TB Samsung PM883 2.5" Enterprise SSD, SATA 3.3 (6Gb/s), TLC 3D/V-NAND, 550MB/s Read, 520MB/s Write, 98k/25k IOPS
NIC1: 1Gb (primary interface)
NIC2: 40Gb (backup interface)
We also have a 3-node Ceph cluster, connected to the 40Gb network, which we use as backup storage. Here is the spec of each backup node:
Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
128GB RAM
12x 14TB 7.2K RPM, 6Gb/s, Toshiba MG07ACA14TA
On our Ceph cluster we have set up two RBD pools:
backup-drives
backup-pbs
The backup-drives pool provides each of the ~375 VMs with a backup mount point, allowing each client to keep its own local backups.
The backup-pbs pool is exposed to our PBS node and is used for image backups in the event of a catastrophic failure.
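To make the layout concrete, the pools look roughly like this (a simplified sketch only; the PG counts, sizes and image names below are illustrative placeholders, not our exact configuration):

    # create and initialise the two RBD pools
    ceph osd pool create backup-drives 128
    ceph osd pool create backup-pbs 128
    rbd pool init backup-drives
    rbd pool init backup-pbs

    # backup-drives: one image per VM, mapped into the guest as its backup mount point
    rbd create backup-drives/vm-101-backup --size 500G

    # backup-pbs: a single large image mapped on the PBS node and used for its datastore
    rbd create backup-pbs/pbs-datastore --size 100T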
Our Ceph consultants have spent considerable time investigating, to no avail. I'm wondering about cutting our losses: what alternatives are out there, and what would you suggest for re-architecting the existing hardware into something more performant?
After 5 drive failures and extended periods of poor performance, we are currently thinking of moving away from Ceph entirely and using standalone backup nodes (see the sketch below), though we are open to other suggestions/recommendations.
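To clarify what I mean by standalone backup nodes: something along the lines of the 12 HDDs in a local ZFS pool on each node, with a PBS datastore directly on top (a rough sketch only; the pool layout, device names and datastore name are placeholders, not a finished design):

    # local ZFS pool from the 12 HDDs (layout and device names illustrative)
    zpool create -o ashift=12 backup-tank raidz2 /dev/sd[b-m]

    # PBS datastore on the local pool
    zfs create backup-tank/pbs
    proxmox-backup-manager datastore create local-backups /backup-tank/pbs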
Kindest Regards,
S