Tapespeed

FelixJ

I'm coming from vSphere + Veeam B&R.
PVE is working like a charm, as does PBS.
So far I can see 2 general disadvantages which I want to address in the hope that they can be fixed in the near future:

  1. As I migrated the vSphere setup 1:1 to PVE and use the dirty-bitmap feature plus the deduplication of PBS, I would have expected a total disk size for the backups close to what I had in Veeam B&R. There, I kept 30 daily snapshots, and it took 2.4TB of disk space. Now it currently takes 3.8TB (and not all VMs have been migrated for 30 days yet, so it will grow a little more as the snapshots add up). Is there any parameter I can tune to reduce the consumed space?
  2. How fast will the tape writing be? As the tape backup is at present only considered a tech preview, I still use a self-written script, which invokes tar to write everything underneath my backup repository path to tape; that takes 30 and more hours, because reading lots of those tiny chunk files from the filesystem takes so much time. Veeam B&R creates a big archive-type container file containing all snapshots from one backup job. I did some read testing to compare the speed of .chunks and my VBK files (prior to each test I cleared the caches with echo 3 > /proc/sys/vm/drop_caches to ensure an accurate measurement):
    1. Tarring a 1.1 TB VBK file runs at around 280 MB/s (tar status=progress), while
    2. tarring the .chunks folder runs at around 65 MB/s.
So not only do my backups consume noticeably more space on disk (3.8 TB vs. 2.4 TB), which is generally not my problem, I have plenty of room ;-), but due to the increased size as well as the slow read speed caused by those small chunks, my backups now take 30 hours (without validating the content (tar -tvf)) instead of an average of 14 hours (incl. validation).
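For reference, the numbers above were taken roughly like this (a minimal sketch only; the VBK path is just a placeholder from my setup, and piping through pv is simply one way to watch the throughput):
Bash:
# drop page/dentry/inode caches so the read test is not served from RAM
sync; echo 3 > /proc/sys/vm/drop_caches

# sequential read of a single big ~1.1 TB Veeam VBK file
tar -cf - /path/to/veeam/backup.vbk | pv > /dev/null

# same test over the PBS chunk store (millions of small files)
sync; echo 3 > /proc/sys/vm/drop_caches
tar -cf - /var/lib/vz/backup/pbs/.chunks | pv > /dev/null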
Can those problems be addressed in the near future?
regards,
Felix
 
Hi,

As I migrated the vSphere setup 1:1 to PVE and use the dirty-bitmap feature plus the deduplication of PBS, I would have expected a total disk size for the backups close to what I had in Veeam B&R. There, I kept 30 daily snapshots, and it took 2.4TB of disk space. Now it currently takes 3.8TB (and not all VMs have been migrated for 30 days yet, so it will grow a little more as the snapshots add up). Is there any parameter I can tune to reduce the consumed space?
There's no "save 30% space" knob; if there was, we'd turn it on by default. Compression, and how well the actual data compresses, could explain some of that difference; that is not something that is exposed as a setting.

How fast will the tape-writing be?

What LTO version do you use?

As the tape backup is at present only considered a tech preview, I still use a self-written script, which invokes tar to write everything underneath my backup repository path to tape
That will normally always be at least slightly slower than what the integrated tape solution can do.

Depending on your backup strategy you can optimize this even further with PBS. One could, for example, configure the media pool so that the tape is only switched once a week; then only newer backup indexes and newly referenced chunks actually need to be written out after the first job has gone through. See the docs for some more details:
https://pbs.proxmox.com/docs/tape-backup.html#media-pools
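A rough sketch of what such a pool could look like on the CLI (option names taken from my reading of the media-pool docs above and not verified here, so please double-check with proxmox-tape pool create --help on your version):
Bash:
# hypothetical example: a pool that only starts a new media set
# every Saturday morning; during the week, jobs into this pool only
# append new indexes and newly referenced chunks to the current set
proxmox-tape pool create weekly --allocation 'sat 02:00' --retention '8weeks'

# manual run of a tape backup of one datastore into that pool
# (datastore, pool and drive names are placeholders)
proxmox-tape backup mydatastore weekly --drive mydrive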

Can those problems be addressed in the near future?

What exactly are the problems now?
We cannot "fix" how tar works (although you could try telling it to iterate over files sorted by inode), from what I can tell you have not actually tried the PBS tape backup yet, and we will naturally continue to evaluate performance and space optimizations.
 
Hi Thomas!

Okay, so what are the problems?
First, I think that swapping a well and fast working HDD RAID for SSDs is not a solution but a workaround.
Secondly, I think that to mitigate the seek operations during I/O, one could simply create a filesystem inside a top-level container of whatever sort (Linux offers a variety of them), which, to make it easy and keep the current software design, would be mounted into the datastore directory for I/O operations. For tape backups the container would then have to be unmounted, or at least remounted read-only. That way, one could tar it away without worrying about the seek times caused by the underlying hardware.
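To make the idea a bit more concrete, something along these lines (a sketch only; the container path and size are made up, the datastore path is the one from my setup):
Bash:
# create a big sparse container file and put a filesystem into it
truncate -s 6T /var/lib/vz/backup/pbs-container.img
mkfs.ext4 -F /var/lib/vz/backup/pbs-container.img

# mount it where the PBS datastore lives, so PBS keeps working unchanged
mount -o loop /var/lib/vz/backup/pbs-container.img /var/lib/vz/backup/pbs

# before the tape run: remount read-only (or unmount entirely)
mount -o remount,ro /var/lib/vz/backup/pbs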

Regarding the consumed disk space, my question is simple: if I have 1 TB of production VM disks in vSphere, which after migration is also 1 TB of production data in PVE, why does the space required for a 30-day snapshot backup come to 2.4 TB in one case and 3.8 TB in the other? This makes little sense to me, unless the dirty-bitmap function, the deduplication or the compression algorithms are not (yet) working as efficiently as hoped.

Please note, this is not criticism; on the contrary, I am trying to help improve the product by raising questions and bringing in my experience and ideas.

I hope you understand what I mean,
Felix
 
First, I think that swapping a well and fast working HDD RAID for SSDs is not a solution but a workaround.

That's more something for new setups. See here for some price calculations and the rationale why an SSD-only setup makes sense nowadays, especially in the enterprise area and at least for up to 30 - 40 TiB of storage needs (more works too, but may need other considerations; IMO 100+ TiB SSDs are closer than 100+ TiB HDDs):
https://forum.proxmox.com/threads/garbage-collector-too-slow.86726/#post-381027

SSDs are not a workaround; they provide a fundamentally better experience due to random IOPS having a roughly constant cost, mostly independent of access order, plus lower power usage and higher reliability (no moving parts is always better).

Secondly, I think that to mitigate the seek operations during I/O, one could simply create a filesystem inside a top-level container of whatever sort (Linux offers a variety of them), which, to make it easy and keep the current software design, would be mounted into the datastore directory for I/O operations. For tape backups the container would then have to be unmounted, or at least remounted read-only. That way, one could tar it away without worrying about the seek times caused by the underlying hardware.
The .chunks store already is on a file system, so I don't understand how that design could magically do away with the underlying physical limitations of spinners, whose 10 - 20 ms seek times will never really improve.
It sounds rather like that would add another layer of indirection, further away from the actual data layout on disk, so even more random I/O is done, which is what "breaks" speed on spinners. If one wanted to do away with layers, one could create a special CAS (content-addressable storage) FS directly in the kernel, but in the end even there you'd have those problems (and possibly many more; kernel/FS programming isn't exactly a piece of cake, and even experienced FS shops got it wrong a lot, so...)

Regarding the consumed disk space, my question is simple: if I have 1 TB of production VM disks in vSphere, which after migration is also 1 TB of production data in PVE, why does the space required for a 30-day snapshot backup come to 2.4 TB in one case and 3.8 TB in the other? This makes little sense to me, unless the dirty-bitmap function, the deduplication or the compression algorithms are not (yet) working as efficiently as hoped.

I hope for more efficiency too, and as said, we're permanently evaluating that point.
Without checking the actual setup, both VM- and Veeam-wise, it's hard to determine a reason for this specific issue, so I cannot give you the specific answer you are hoping for here, I'm afraid.
 
Regarding the consumed disk space, my question is simple: if I have 1 TB of production VM disks in vSphere, which after migration is also 1 TB of production data in PVE, why does the space required for a 30-day snapshot backup come to 2.4 TB in one case and 3.8 TB in the other? This makes little sense to me, unless the dirty-bitmap function, the deduplication or the compression algorithms are not (yet) working as efficiently as hoped.
Are you testing with the most recent version?
 
That's more something for new setups. See here for some price calculations and the rationale why an SSD-only setup makes sense nowadays, especially in the enterprise area and at least for up to 30 - 40 TiB of storage needs (more works too, but may need other considerations; IMO 100+ TiB SSDs are closer than 100+ TiB HDDs):
https://forum.proxmox.com/threads/garbage-collector-too-slow.86726/#post-381027

SSDs are not a workaround; they provide a fundamentally better experience due to random IOPS having a roughly constant cost, mostly independent of access order, plus lower power usage and higher reliability (no moving parts is always better).
OK, I take that as it is - I'm used to working with what I have, and that's an HDD RAID that reads with dd at up to 300 MB/s, which is, from my point of view, good enough at this point.
Still, I can think about SSDs.
The .chunks store already is on a file system, so I don't understand how that design could magically do away with the underlying physical limitations of spinners, whose 10 - 20 ms seek times will never really improve.
It sounds rather like that would add another layer of indirection, further away from the actual data layout on disk, so even more random I/O is done, which is what "breaks" speed on spinners. If one wanted to do away with layers, one could create a special CAS (content-addressable storage) FS directly in the kernel, but in the end even there you'd have those problems (and possibly many more; kernel/FS programming isn't exactly a piece of cake, and even experienced FS shops got it wrong a lot, so...)
My thinking is like this: if I can write a big (1 TB) file at 280 MB/s to tape from the same filesystem that hosts the .chunks folder (this might be information I have not given you yet), the idea would be that if the .chunks folder were in reality a mounted filesystem container, I could simply take that container and write it at 280 MB/s to tape.
There might be a slight disadvantage while the actual VM backup runs, I'll give you that; it could have a negative impact on performance, but that also depends on what the filesystem container would be. For example, maybe a dedicated LVM volume would be sufficient; then one could simply dd the LV to tape. Of course, the recovery process is less "nice" than having real files on the tape - on the other hand, a tape is per design, due to its streaming nature, a slow medium, and its primary purpose is cold backups.
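Again only a sketch of what I mean (the VG/LV and tape device names are examples):
Bash:
# dedicated LV as the datastore instead of a plain directory
lvcreate -L 6T -n pbs-datastore vg0
mkfs.ext4 /dev/vg0/pbs-datastore
mount /dev/vg0/pbs-datastore /var/lib/vz/backup/pbs

# cold tape run: unmount, then stream the whole block device to tape
umount /var/lib/vz/backup/pbs
dd if=/dev/vg0/pbs-datastore of=/dev/st0 bs=1M status=progress
Note that dd would also write all the unused space of the LV to tape, which is one more reason the restore side of this is less "nice".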


I hope for more efficiency too, and as said, we're permanently evaluating that point.
Without checking the actual setup, both VM- and Veeam-wise, it's hard to determine a reason for this specific issue, so I cannot give you the specific answer you are hoping for here, I'm afraid.
However, I think there is still some room for improvement!
Maybe you can consider my thoughts in one of your development meetings...
 
the idea would be that if the .chunks folder were in reality a mounted filesystem container, I could simply take that container and write it at 280 MB/s to tape.
Our tape backup is fully deduplicated, so we normally only back up a few chunks (not the whole chunk store) to a tape. Also, we want to be able to select single VMs for backup...
However, I think there is still some room for improvement!
Maybe you can consider my thoughts in one of your development meetings...
What chunk size do you use for the Veeam backup? Can you please try using 4 MB and compare the space usage?
 
Where do you want me to change the chunk size, in Veeam or in PBS?
I have no idea where to change it in either product.
regards,
Felix
 
Hi Thomas,

There's no "save 30% space" knob; if there was, we'd turn it on by default. Compression, and how well the actual data compresses, could explain some of that difference; that is not something that is exposed as a setting.

I have a 1-bit compression algorithm in place here which compresses ANY kind of data into ONE bit.

Works like a charm. The only tiny thing is: I haven't figured out a way to actually restore the data yet :)

SCNR

Tobias
 
Hi Thomas,



I have a 1-bit compression algorithm in place here which compresses ANY kind of data into ONE bit.

Works like a charm. The only tiny thing is: I haven't figured out a way to actually restore the data yet :)

SCNR

Tobias
this is rather cynical, isn't it? It won't help address the actual problem nor improve the product, but well... sometimes cynicism is the only way to cope with things you cannot change anyway!
 
I think it's a joke, and surely not meant in any harmful way - at least that's how I took it. :)

OK, I take that as it is - I'm used to working with what I have, and that's an HDD RAID that reads with dd at up to 300 MB/s, which is, from my point of view, good enough at this point.
Still, I can think about SSDs.

Sure, I can totally understand that using existing HW is cheaper and also less wasteful than just throwing it out, so this was mostly meant for any new setups. And yeah, bandwidth-wise those spinners aren't that bad, but random I/O, which is what seek time limits so much, is just way worse than on any flash storage. That's why I'd prefer QLC-based SSDs over 10k spinners even if the latter had slightly better bandwidth, as long-running storage with higher fragmentation, or things like content-addressable storage, makes sequential I/O almost impossible after a while, so those systems can profit a lot from fast seek times.

My thinking is like this: if I can write a big (1 TB) file at 280 MB/s to tape from the same filesystem that hosts the .chunks folder (this might be information I have not given you yet), the idea would be that if the .chunks folder were in reality a mounted filesystem container, I could simply take that container and write it at 280 MB/s to tape.
There might be a slight disadvantage while the actual VM backup runs, I'll give you that; it could have a negative impact on performance, but that also depends on what the filesystem container would be. For example, maybe a dedicated LVM volume would be sufficient; then one could simply dd the LV to tape. Of course, the recovery process is less "nice" than having real files on the tape - on the other hand, a tape is per design, due to its streaming nature, a slow medium, and its primary purpose is cold backups.

Hmm, OK, then I got you slightly wrong - sorry for that. But the issue with your approach is that it basically undoes the deduplication effort, or at least makes it necessary to read everything and diff between those images. Things like the aforementioned smarter media-set tape allocation scheduling could not be done any more. So it's really a trade-off, and we would like to keep the deduplication semantics for tape too; it makes the design more consistent and enables features which otherwise would not be doable (at least not without doing more I/O).

However, I think there is still some room for improvement!
Maybe you can consider my thoughts in one of your development meetings...
We sure do; we actively discuss those things. Not everything reaches the forum as is, but we certainly don't just brush off any input we get. One idea, which is independent of the layout and which I already suggested to you for the tar command, is reading files sorted by inode. That gives better performance, as inode ordering and the physical data location on spinning disks correlate.
 
I think it's a joke, and surely not meant in any harmful way - at least that's how I took it. :)
It's all good! I understood! I've been working in this business long enough!
Sure, I can totally understand that using existing HW is cheaper and also less wasteful than just throwing it out, so this was mostly meant for any new setups. And yeah, bandwidth-wise those spinners aren't that bad, but random I/O, which is what seek time limits so much, is just way worse than on any flash storage. That's why I'd prefer QLC-based SSDs over 10k spinners even if the latter had slightly better bandwidth, as long-running storage with higher fragmentation, or things like content-addressable storage, makes sequential I/O almost impossible after a while, so those systems can profit a lot from fast seek times.
I have some super-duper Samsung enterprise-class 4 TB SSDs lying around as spares for my Ceph, which I want to "abuse" to test the difference, but that will take some time! When I migrated from vSphere + Veeam, I thought about a lot of issues that might come up, but I never thought that the tape runtime could be more than 24 hrs... which, for right now anyway, breaks my backup concept.
So I am also busy finalizing the migration!

Then I saw that the built-in tape backup might be usable in the near future anyway - unfortunately, as you can read in my other post, my tape drive is not properly recognized; a patch is pending and I am waiting for it to land in the stable branch, so I can use the deduplicated chunk store and hopefully improve the backup time!
For now I cannot back up the chunk store; it's nearly 4 TB and will keep growing for 14 more days, until all VMs have 31 restore points on disk. Further growth through user data creation is minimal.

Another question related to that: will it be possible, with the PBS tape backup, to write content outside of the chunk store to tape - like other local or remotely mounted filesystems?

Hmm, OK, then I got you slightly wrong - sorry for that. But the issue with your approach is that it basically undoes the deduplication effort, or at least makes it necessary to read everything and diff between those images. Things like the aforementioned smarter media-set tape allocation scheduling could not be done any more. So it's really a trade-off, and we would like to keep the deduplication semantics for tape too; it makes the design more consistent and enables features which otherwise would not be doable (at least not without doing more I/O).
Yeah, it would break the PBS concept completely; however, it doesn't have to be either one or the other, it could be both. The top-level container would simply allow a fast read on "spinners" outside of PBS - for example, transferring the whole chunk store to another set of disks would be more efficient that way.
We sure do; we actively discuss those things. Not everything reaches the forum as is, but we certainly don't just brush off any input we get. One idea, which is independent of the layout and which I already suggested to you for the tar command, is reading files sorted by inode. That gives better performance, as inode ordering and the physical data location on spinning disks correlate.
I tried that. It gives slightly better performance. Without sorting I got a read performance of about 65 MB/s.
With either of these methods
Bash:
ls -U -i /var/lib/vz/backup/pbs/.chunks/| sort -k1,1 -n | cut -d' ' -f2- > ~/clist; tar -cf /dev/st0 -T ~/clist
or
Bash:
tar --sort=inode -cf /dev/st0  /var/lib/vz/backup/pbs/.chunks/
I get around 80 MB/s, which is definitely an improvement, but far from what the tape drive could write.
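A variant of the first pipeline that I have not tried yet, which keeps full paths and stops tar from recursing a second time, could look like this (an untested sketch, same datastore path as above):
Bash:
# list every chunk file with its inode number, sort by inode,
# then feed the resulting path list to tar via stdin
find /var/lib/vz/backup/pbs/.chunks -type f -printf '%i\t%p\n' \
  | sort -n \
  | cut -f2- \
  | tar --no-recursion -cf /dev/st0 -T -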
 
