PBS Tape Backup job just stops in the middle - no error

Since the last two PBS updates, the tape job crashes randomly partway through, without any report or error message, as if something were killing the job.

Has anyone else had a similar issue?


Version:
proxmox-backup-server 3.4.1-1
proxmox-kernel-6.8.12-11-pve-signed 6.8.12-11
 
which version previously worked? is there anything visible in the logs at all? what about monitoring?
 
Hi @fabian,

It worked in 8.3.x, but the write speed was around 30Mb. Now it is almost 10x faster, but that is not the issue.

My main problem is that there is nothing in the logs: nothing in dmesg, syslog, or the job log (at some point the entries simply stop, and the last entry still looks normal). Even in monitoring I cannot find anything useful.

I now see that even a verification job was killed with "Unknown" status, but not at the same time, so I cannot say there is a correlation.

I will test whether things improve if I make sure the tape backup does not run at the same time as the verification job (this will take a few days).
 

Attachments

  • Screenshot from 2025-06-30 15-36-30.png
were there any service stops during that time period that would explain that? how is PBS deployed - bare metal, or as a VM? could you provide journal output covering the day of the tape job?
 
I have now tested the case where only the tape backup was running, with no verification job at the same time. It looks much better: the job did not crash, but it also did not complete. At least this time it ended with an error, which looks like a tape timeout. It did not wait long enough for the tape to rewind. As a result, the tape is only unloaded later.

Code:
2025-07-07T20:54:06+02:00: end backup PBS2-STORE:"vm/111111060/2025-07-06T21:55:24Z"
2025-07-07T20:54:06+02:00: percentage done: 100.00% (124/124 groups)
2025-07-07T20:54:08+02:00: append media catalog
2025-07-07T20:54:08+02:00: rewind media
2025-07-07T21:01:08+02:00: queued notification (id=91f2c4ce-299a-49e5-9751-f72dcfb95bd4)
2025-07-07T21:01:08+02:00: TASK ERROR: unload drive failed - scsi command failed: transport error
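
For reference, the wait between the "rewind media" entry and the error can be read off the two timestamps. A minimal Python sketch, assuming the log line format shown above (the helper name `seconds_between` is made up for illustration and is not part of PBS):

```python
from datetime import datetime


def seconds_between(first_line: str, second_line: str) -> float:
    """Return the elapsed seconds between two task-log lines.

    Assumes each line starts with a 25-character ISO 8601 timestamp
    (e.g. 2025-07-07T20:54:08+02:00) followed by ': '.
    """
    def stamp(line: str) -> datetime:
        # Parse only the leading timestamp, including its UTC offset.
        return datetime.fromisoformat(line[:25])

    return (stamp(second_line) - stamp(first_line)).total_seconds()


# The two lines are copied from the task log above.
rewind = "2025-07-07T20:54:08+02:00: rewind media"
error = ("2025-07-07T21:01:08+02:00: TASK ERROR: unload drive failed - "
         "scsi command failed: transport error")

gap = seconds_between(rewind, error)
print(gap)  # 420.0
```

The gap is 420 seconds, exactly seven minutes. That would be consistent with a fixed timeout, although the log alone does not confirm it.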

I will keep testing without verification running in parallel for a few weeks, and later try testing with parallel jobs.

I will post an update if the tests turn up any new information.