ERROR: interrupted by signal

Hmm, I had a quick glance at the code, but the SIGTERM to the backup job should only be sent if a shutdown request is issued to pvescheduler. Not sure why this would be triggered in your case.

I will add another observation that may help in investigating the problem. We also noticed that if you disable all the jobs in the /etc/pve/jobs.cfg file and start the backup that constantly fails with "error: interrupted by signal", the backup for this machine completes without the error. However, if you then re-enable all the jobs in this file and start the backup again, for example a day later, the backup fails with the error again. If you disable the jobs once more and repeat the backup, it completes normally. Here is the output of the ls command for this file:
-rw-r----- 1 root www-data 144K Jun 25 08:15 /etc/pve/jobs.cfg
Can you also share the /etc/pve/jobs.cfg?
 
Oh, so you have set up an individual backup job for each VM... that will most likely lead to all sorts of locking and timeout issues, as each vzdump job will try to get the global lock. And if it cannot acquire the lock in time, the job will fail. I would recommend grouping the backups for VMs that go to the same storage at similar backup times into one job, so they are executed sequentially without overlap. Further, if necessary, you can adjust the lockwait option for vzdump, which changes how long a vzdump job will wait for the lock before timing out. But that is more of a last resort; restructuring the backup jobs first is recommended.
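To find candidates for grouping, something like the following rough Python sketch might help. It assumes jobs.cfg uses the usual section layout (a `vzdump: <id>` header followed by indented `key value` lines); the job IDs, storage names, and schedules in the sample are made up for illustration, so please check against your real file before relying on it.

```python
from collections import defaultdict

# Hypothetical sample in the assumed jobs.cfg layout (not taken from a real system).
SAMPLE_JOBS_CFG = """\
vzdump: backup-vm100
\tschedule 21:00
\tstorage pbs1
\tvmid 100
\tenabled 1

vzdump: backup-vm101
\tschedule 21:00
\tstorage pbs1
\tvmid 101
\tenabled 1

vzdump: backup-vm102
\tschedule sat 03:00
\tstorage nfs-backup
\tvmid 102
\tenabled 1
"""


def parse_jobs(text):
    """Parse 'type: id' sections with indented 'key value' lines into dicts."""
    jobs = []
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            continue  # blank line between sections
        if not raw[0].isspace():
            # Unindented line starts a new section: "vzdump: backup-vm100"
            job_type, _, job_id = raw.partition(":")
            current = {"type": job_type.strip(), "id": job_id.strip()}
            jobs.append(current)
        elif current is not None:
            # Indented line is a property of the current section.
            key, _, value = raw.strip().partition(" ")
            current[key] = value.strip()
    return jobs


def group_vzdump_jobs(jobs):
    """Map (storage, schedule) to the list of VMIDs backed up by enabled jobs."""
    groups = defaultdict(list)
    for job in jobs:
        if job["type"] != "vzdump" or job.get("enabled", "1") == "0":
            continue
        vmids = [v.strip() for v in job.get("vmid", "").split(",") if v.strip()]
        groups[(job.get("storage"), job.get("schedule"))].extend(vmids)
    return dict(groups)


groups = group_vzdump_jobs(parse_jobs(SAMPLE_JOBS_CFG))
for (storage, schedule), vmids in groups.items():
    print(f"storage={storage} schedule={schedule} vmids={','.join(vmids)}")
```

Every key that lists more than one VMID is a set of per-VM jobs that could be merged into a single job for that storage and schedule.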
 
You have correctly identified the key points. Indeed, grouping backup tasks and avoiding the concurrent execution of backup processes can significantly reduce the likelihood of the issues described.

However, in practice, such an approach is not always achievable. Users typically prefer to define their own backup schedules and retention policies, and restricting this flexibility may not be the most appropriate solution.

The backup schedule configuration file (jobs.cfg) is relatively small. According to debugging results, the pvescheduler process handles this file every 60 seconds, with execution time not exceeding 3 to 5 seconds. No global locks have been observed that could result in timeout due to blocking.

Nevertheless, in a number of cases, most commonly during the parallel execution of multiple jobs, a SIGTERM signal has been sent to an active vzdump process even though it was running correctly. As you have seen in the debug logs, this behavior is clearly observable. In my view, sending such a signal should be considered an exceptional event and must be governed by well-defined logic within the codebase.

Given this, it seems more appropriate to focus on identifying the specific conditions under which pvescheduler decides to send a SIGTERM signal, and to explore possible ways to adjust its behavior in order to prevent such incidents. You may consider escalating this issue to the development team, as it is highly likely that other users of your product are experiencing similar problems.
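As a side note on narrowing down who sends the signal: on Linux, the kernel records the sender's PID and UID with each signal. The small Python sketch below only illustrates that mechanism; it is not how vzdump or pvescheduler handle signals, and the self-sent SIGTERM merely stands in for the unknown sender.

```python
import os
import signal


def sigterm_sender_demo(timeout_s=2.0):
    """Send SIGTERM to ourselves and report the (pid, uid) the kernel recorded."""
    # Block SIGTERM so it stays pending instead of terminating the process.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
    try:
        # Stand-in for the unknown sender: signal our own process.
        os.kill(os.getpid(), signal.SIGTERM)
        # Consume the pending signal together with its siginfo details.
        info = signal.sigtimedwait({signal.SIGTERM}, timeout_s)
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    if info is None:
        return None  # nothing arrived within the timeout
    return info.si_pid, info.si_uid


sender = sigterm_sender_demo()
print("SIGTERM sender (pid, uid):", sender)
```

The same sender information is what a system-level tracing tool would report for a real vzdump process, which would show whether the SIGTERM actually originates from pvescheduler or from something else.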
 
Hello Chris,
Is there any news regarding the transfer of this issue to the development department?
It seems to me that what is written above makes sense.
 
Parallel backup and locking is already tracked in this issue. Please subscribe to it to get status updates on progress: https://bugzilla.proxmox.com/show_bug.cgi?id=3347
As far as I can see, this discussion focuses on enabling parallel execution of backup jobs, which is somewhat different from the issue we are experiencing.


In our case, the backup process terminates unexpectedly, and we do not fully understand the cause or what triggers this behavior.
In my opinion, this issue requires a separate investigation.


Could you open a ticket based on this discussion, or would you prefer that I open one myself?