We have a large CephFS volume that we need to back up. It contains 100.000.000 files, with a daily delta of about 50.000 files (400 GB).
For the time being, this file system is mounted directly on the PBS host using the CephFS kernel driver:
mount -t ceph ip.srv.1,ip.srv.2,ip.srv.3,ip.srv.4:/ /mnt/mycephfs -o name=myname,secret=xxxxxxxx
We then launch the backup script directly on the PBS host and are able to back up the file system with PBS.
To simulate a real-life scenario, I have set up a file generation script for testing on one of the VMs that is also connected to the CephFS. Each run creates 50.000 files: 65% small files (10 to 100 KB) and 35% big files (1 to 50 MB).
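For anyone who wants to reproduce the test, here is a minimal sketch of such a generator in Python. It is not the actual script we run; the function name `generate_batch`, the file naming scheme and the use of random content are assumptions made for illustration. Only the 65/35 split and the size ranges come from the description above.

```python
import os
import random
from pathlib import Path

KB = 1024
MB = 1024 * KB


def generate_batch(target_dir, count=50_000, small_ratio=0.65, seed=None,
                   small_range=(10 * KB, 100 * KB),
                   big_range=(1 * MB, 50 * MB)):
    """Create `count` files in `target_dir` and return the sizes written.

    Hypothetical helper: use one directory per batch so that file names
    from different batches do not collide.
    """
    rng = random.Random(seed)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    sizes = []
    for i in range(count):
        # Pick the small or big size range according to the requested ratio.
        low, high = small_range if rng.random() < small_ratio else big_range
        size = rng.randint(low, high)
        # Random bytes, so compression and deduplication cannot flatter the test.
        (target / f"file_{i:08d}.bin").write_bytes(os.urandom(size))
        sizes.append(size)
    return sizes
```

For example, `generate_batch("/mnt/mycephfs/batch_0010")` would write one batch of 50.000 files (the directory name is only an example). With these ranges the average file size comes out at roughly 9 MB, which is in the same range as the 400 GB per 50.000 files of the real daily delta.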
So far we have about 450.000 files, which is still far from the 100.000.000 files.
We then trigger the backup and collect performance data (if someone is interested, I can share it with the forum).
I am trying to see whether we can guarantee the performance of such a backup, i.e. whether backup time scales linearly with the number of files and the backup won't take months to complete.
The current process is the following:
launch the file generation script >>> generate 50.000 files on CephFS >>> launch the backup script from PBS >>> wait for the backup to finish >>> (back to the first step)
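The loop above could be automated roughly as follows. This is only a sketch under assumptions: `run_cycles` and `pbs_backup` are hypothetical names, our real backup script is not shown here, and the archive name and repository string are placeholders. The `proxmox-backup-client backup <archive>.pxar:<path> --repository <repo>` form is the standard client invocation.

```python
import subprocess
import time


def pbs_backup(repository, source="/mnt/mycephfs", archive="cephfs.pxar"):
    """Run one file-level backup; `repository` is a placeholder you must supply."""
    subprocess.run(
        ["proxmox-backup-client", "backup", f"{archive}:{source}",
         "--repository", repository],
        check=True,
    )


def run_cycles(generate, backup, cycles, files_per_batch=50_000, existing_files=0):
    """Repeat generate -> backup and return one timing record per cycle.

    `generate(n)` creates batch number n, `backup()` runs the backup.
    """
    records = []
    total_files = existing_files
    for n in range(1, cycles + 1):
        generate(n)
        total_files += files_per_batch
        start = time.monotonic()
        backup()
        records.append({
            "cycle": n,
            "total_files": total_files,
            "backup_seconds": time.monotonic() - start,
        })
    return records
```

Passing the two steps in as callables keeps the timing logic separate from the commands themselves, so the same loop works whether the generator runs locally or is triggered on another VM.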
I am trying to evaluate the following:
- Is it realistic to back up 100.000.000 files using PBS file-level backup?
- Is mounting the CephFS inside PBS a setup that can be used without problems?
- What would you advise to speed up such a backup (if anything)?
- Most of the time seems to be spent calculating the index of files.
- We are seeing the backup time increase with every new batch of 50.000 files.
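To check whether the increase per batch is actually linear, one option is a simple least-squares fit of backup duration against total file count. The sketch below is an assumption about how the collected data could be analysed, not something we have run on our numbers; `fit_line`, `r_squared` and `predict_seconds` are illustrative names.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination; values near 1.0 mean the line fits well."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / ss_tot


def predict_seconds(total_files, slope, intercept):
    """Extrapolate the fitted line to a given file count."""
    return slope * total_files + intercept
```

Here `xs` would be the total file count at each backup and `ys` the measured backup duration in seconds. If the residuals grow systematically with file count, the growth is worse than linear and the extrapolation will be too optimistic. Either way, predicting 100.000.000 files from measurements at around 450.000 files is a very long extrapolation and should be treated with caution.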