We have two 11TB VMs (among many other smaller VMs) backing up to a PBS server.
Admittedly, the hardware PBS is running on is not ideal. It's HDD-based: twenty-four Seagate ST10000NM0086 10TB SATA drives in a RAIDZ2 datastore. There are no SSD-based ZFS "special devices", and we have noticed that Proxmox's pleas to use SSDs when possible are well-founded.
All that aside, since I'm not sure it's related, my question is this:
Is there some type of verification process or other task that PBS performs for each snapshot before writing it to tape? We have noticed that for each snapshot of these two large (again, 11TB) VMs, the PBS server stops writing to tape, the tape device (a Quantum SuperLoader 3) goes completely idle, and the PBS server does nothing but access the disks for 2-3 hours before writing resumes. The process then repeats when the next snapshot of one of these large VMs is written. Obviously, this can add many hours or even days to a tape job when anything beyond the latest snapshots is written.
Code:
2022-02-14T06:14:44-08:00: percentage done: 33.65% (63/190 groups, 13/14 snapshots in group #64)
2022-02-14T06:14:44-08:00: backup snapshot vm/162/2022-02-04T02:25:39Z
2022-02-14T06:14:55-08:00: wrote 487 chunks (1097.60 MB at 174.89 MB/s)
2022-02-14T06:14:59-08:00: end backup pbs1-primary:vm/162/2022-02-04T02:25:39Z
2022-02-14T06:14:59-08:00: percentage done: 33.68% (64/190 groups)
2022-02-14T06:14:59-08:00: backup snapshot vm/163/2021-11-06T17:18:10Z
2022-02-14T07:39:39-08:00: wrote 5011 chunks (4296.02 MB at 0.85 MB/s)
2022-02-14T07:40:10-08:00: wrote 1847 chunks (4296.02 MB at 140.13 MB/s)
2022-02-14T07:40:43-08:00: wrote 1810 chunks (4295.49 MB at 135.61 MB/s)
2022-02-14T07:41:21-08:00: wrote 1989 chunks (4297.06 MB at 124.38 MB/s)
2022-02-14T07:41:55-08:00: wrote 1824 chunks (4296.54 MB at 136.88 MB/s)
2022-02-14T07:42:31-08:00: wrote 1885 chunks (4297.85 MB at 129.07 MB/s)
2022-02-14T07:43:04-08:00: wrote 1793 chunks (4298.90 MB at 136.89 MB/s)
That's the prior VM finishing, then the first snapshot of one of our large VMs beginning to be written. There is a delay from 06:14:59, when the snapshot job starts, to 07:39:39, when the first chunks are written to tape. That's a delay of ~85 minutes.
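As a sanity check on those numbers, the 0.85 MB/s reported on the first "wrote" line looks like the batch size divided by the entire wait, which would mean the rate includes the stall. The quick Python check below uses only the timestamps and size from the log above. How PBS actually computes that rate is my assumption, not something I have confirmed.

```python
from datetime import datetime

# Timestamps copied from the task log above.
snapshot_start = datetime.fromisoformat("2022-02-14T06:14:59-08:00")
first_write = datetime.fromisoformat("2022-02-14T07:39:39-08:00")

stall_seconds = (first_write - snapshot_start).total_seconds()
stall_minutes = stall_seconds / 60

# Size of the first batch as reported in the log (MB).
first_batch_mb = 4296.02
implied_rate = first_batch_mb / stall_seconds

print(f"stall: {stall_minutes:.1f} minutes")
print(f"implied rate: {implied_rate:.2f} MB/s")
```

If the assumption holds, the tape itself is not slow during that batch. The time is spent before the first chunk is handed to the drive.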
Its next snapshot:
Code:
2022-02-14T18:55:56-08:00: wrote 2470 chunks (4297.59 MB at 111.09 MB/s)
2022-02-14T18:56:35-08:00: wrote 2032 chunks (4297.59 MB at 127.29 MB/s)
2022-02-14T18:56:41-08:00: wrote 316 chunks (696.25 MB at 178.56 MB/s)
2022-02-14T18:56:47-08:00: end backup pbs1-primary:vm/163/2021-11-06T17:18:10Z
2022-02-14T18:56:47-08:00: percentage done: 33.82% (64/190 groups, 1/4 snapshots in group #65)
2022-02-14T18:56:47-08:00: backup snapshot vm/163/2021-12-02T12:00:02Z
2022-02-14T21:47:04-08:00: wrote 2783 chunks (4298.64 MB at 0.42 MB/s)
2022-02-14T21:47:46-08:00: wrote 2045 chunks (4298.90 MB at 109.25 MB/s)
2022-02-14T21:48:30-08:00: wrote 2071 chunks (4295.75 MB at 105.60 MB/s)
The prior snapshot finished at 18:56:47, and the job to write the same VM's next snapshot to tape begins at the same timestamp (18:56:47). But writing doesn't actually begin until 21:47:04. That's a delay of ~170 minutes, approaching three hours! And again, while this is happening, only disk IO is taking place, with no tape drive activity.
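To find these stalls without reading the whole task log by eye, a small script can flag large gaps between consecutive log lines. This is a minimal sketch, assuming every line starts with an ISO 8601 timestamp followed by ": " as in the excerpts above. The `find_gaps` helper and the 10-minute threshold are hypothetical choices of mine, not anything PBS provides.

```python
from datetime import datetime, timedelta

def find_gaps(lines, threshold=timedelta(minutes=10)):
    """Yield (gap, previous_line, line) for consecutive timestamped
    lines that are further apart than `threshold`."""
    prev_ts, prev_line = None, None
    for line in lines:
        stamp, sep, _ = line.partition(": ")
        if not sep:
            continue  # no "timestamp: message" separator on this line
        try:
            ts = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # line does not start with a parseable timestamp
        if prev_ts is not None and ts - prev_ts > threshold:
            yield ts - prev_ts, prev_line, line
        prev_ts, prev_line = ts, line

# Lines copied from the second log excerpt above.
log = """\
2022-02-14T18:56:47-08:00: end backup pbs1-primary:vm/163/2021-11-06T17:18:10Z
2022-02-14T18:56:47-08:00: percentage done: 33.82% (64/190 groups, 1/4 snapshots in group #65)
2022-02-14T18:56:47-08:00: backup snapshot vm/163/2021-12-02T12:00:02Z
2022-02-14T21:47:04-08:00: wrote 2783 chunks (4298.64 MB at 0.42 MB/s)
2022-02-14T21:47:46-08:00: wrote 2045 chunks (4298.90 MB at 109.25 MB/s)
""".splitlines()

gaps = list(find_gaps(log))
for gap, before, after in gaps:
    print(f"{gap} idle between:\n  {before}\n  {after}")
```

Run against a full task log, this lists every snapshot where the tape sat idle and for how long, which makes it easier to see whether the stalls only affect the two large VMs.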
So, to restate the question: Is this normal? Is there some type of verification or preparation task that runs on the snapshot before it is written to tape and could account for this? Any advice on speeding it up? We anticipate changing from RAIDZ2 to a RAID10-based datastore once our snapshots are all on tape, so that might help. But we'd like to better understand exactly what is happening during these long pauses in the middle of a tape job.
Thanks!
--Brian