Problem with worker VM start initiated by Veeam on PVE

Jan 29, 2025
Hi,

I've been struggling for a few months now with a problem backing up Proxmox VE using Veeam Backup & Replication Community Edition. The problem seems to be related to the start of the Veeam worker VM, which is initiated by the Veeam Backup & Replication solution.
I believe the problem may have started after upgrading Proxmox VE from 8.2 to 8.3, but I'm not entirely sure, as the problem isn't there permanently.
There is a daily backup job. It can work several days in a row without any problem. Then, one day, I get a backup-failed message, and when I check the PVE web GUI, I can see a still-running task for starting VM 101, which is the Veeam worker VM. The task may run for tens of hours without actually starting the VM. In the VM overview, the VM remains in a stopped state. Subsequent Veeam backups will fail until the task is stopped manually. Once stopped manually, the output in the task window is as follows:
Code:
generating cloud-init ISO
malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/share/perl5/PVE/Tools.pm line 1073, <GEN12937> chunk 1.
TASK ERROR: start failed: interrupted by unexpected signal

And in the system log:
Code:
pvedaemon[1335357]: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/share/perl5/PVE/Tools.pm line 1073, <GEN19863> chunk 1.
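
From what I can tell, that wording is exactly what Perl's JSON decoder emits when it is handed an empty string, i.e. pvedaemon seems to receive an empty reply where it expects JSON while generating the cloud-init data. Just to illustrate the failure mode (a Python parallel for illustration only, not the actual PVE code path):
Code:
import json

# Decoding an empty string fails "at character 0", analogous to the Perl
# error above: the parser never sees any JSON at all.
try:
    json.loads("")
except json.JSONDecodeError as e:
    print(e)  # Expecting value: line 1 column 1 (char 0)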

If I test the worker VM in Veeam, the test succeeds every time (each time, the worker VM is started successfully, tested and then stopped). But scheduled and manual backups fail in roughly 20 to 30 % of all cases. The PVE server is far from running into its resource limits (average CPU <10 %, max CPU <30 %, RAM and disk usage <50 %).
Regarding the VM resource allocations for the Veeam and worker VMs, the recommendations have been applied.

To overcome this situation I have tried various steps:
  • Updating newly released PVE/Debian packages several times. Right now I'm on PVE 8.3.3 with all available updates from the enterprise repository applied as of today.
  • Rebooting the whole PVE server.
  • Updating Veeam Backup & Replication Community Edition to the latest available version, 12.3.0.310, dated December 3, 2024 (so released after PVE 8.3).
  • Deleting the worker VM and letting Veeam recreate it with the latest Veeam version.
  • Updating Windows 11 Pro from 23H2 to 24H2 (the OS the Veeam software is installed on).
I don't know what else to look for. To me it seems like some sort of incompatibility between Veeam and Proxmox at the point where the worker VM is started, but I'm surprised that I wasn't able to find anyone else with the same issue.
Any help is very much appreciated, and I'm happy to provide any output that may help to find the issue.

Just in case the question arises as to why we don't use Proxmox Backup Server: according to the best practices described for PBS, our network infrastructure doesn't meet several of the criteria. I tested it, and unfortunately it was much too slow.
 
We are currently facing something like this, but in our case the RAM consumption of the server escalates to 98 % even though all the VMs on the host together don't even reach 70 %.
The task can be stopped manually, but we have to restart the server in order to normalize the consumption.
 
Thanks for your reply.
I haven't noticed excessive RAM consumption so far - I'll have a closer look next time the VM start process hangs (last night it worked fine).
At least looking at the maximum memory usage from last week, I don't see any excessive peaks.
 

Attachments

  • Screenshot 2025-02-11 083246.png
So it hung again yesterday, and as you can see in the day (max) view, there is no change in RAM usage.
The VM 101 start task had been running for more than 13 hours. I manually stopped the task and received the error as shown.
 

Attachments

  • Screenshot 2025-02-17 113855.png
  • Screenshot 2025-02-17 113935.png
  • Screenshot 2025-02-17 113951.png
I had this too and decided that Veeam is not ready for production, not even close. The deeper I looked into this, the more red flags appeared.
We as the Proxmox VE community cannot help here. Please contact Veeam's support forums for issues with Veeam.
 
I had the same issue with Veeam this morning. It adds to my frustration with the fact that Veeam needs SSH root access to PVE with a password (private keys are not possible) and root access to the API. We've been happy with Veeam for VMware, but it's not as smooth with PVE.

I opened a case with Veeam Support today; let's see what they find out.
 
We are also evaluating Proxmox with Veeam as a replacement for our VMware + Veeam stack.
Unfortunately, we have the same problem: the worker VM sometimes does not start and the start task gets stuck.
When it is stopped, the same error as in the original post is logged.
I've also removed the cloud-init drive/image, but this does not seem to help.
If there are any ideas or tests to try, I'm happy to run them in our lab and give feedback.
 
So, I did exactly as you mentioned: I updated Proxmox and Veeam to the latest available versions, but without success.
 
I had the same issue with Veeam this morning. It adds to my frustration with the fact that Veeam needs SSH root access to PVE with a password (private keys are not possible) and root access to the API. We've been happy with Veeam for VMware, but it's not as smooth with PVE.

I opened a case with Veeam Support today; let's see what they find out.
Did they ever find any solution or help?

I had opened a support ticket for this a while ago, but their support staff was so bad for Proxmox that I had to close it early, after two weeks of emails and logs, because the support agent assigned to me was completely lost and couldn't understand basic information. I want to open another one but feel like it will be more time wasted. I have just been stopping the worker agent manually every day when it gets stuck; there is no pattern, everything is on the latest version, etc.

Mind you, I never get the JSON error; it just gets stuck at "generating cloud-init ISO" and hangs for hours or days until I hit Stop manually.
 
I had a case open a while ago too, but it was closed after just a few days because of "no resources". I tried to escalate it, but the escalation team basically said the same. I asked why they couldn't just keep it open until someone had the time to process it; the answer was again similar. In short: if you use the Community Edition, don't expect any form of support. They didn't even accept it as a bug report...
 
Thanks for the info. We also have a paid support contract with them. I did end up opening another ticket for the issue today and got their first tier-1 response on how to stop and restart the agent. The problem is that it happens multiple times a week and on vastly different hardware configurations, so I'll keep pushing them to fix the main problem, which is that the worker fails to start far too often.

Honestly, we might need a script or something that says: if the worker agent has been stalled for X amount of time, terminate the task, and hopefully it will launch successfully on the next incremental.
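Something like that could run from cron on the PVE host itself. Here is a minimal sketch of the idea in Python; the node name, worker VMID and timeout are assumptions for my setup, adjust them to yours:
Code:
#!/usr/bin/env python3
"""Stop a Veeam worker start task that has been stuck for too long.

Sketch only: NODE, WORKER_VMID and MAX_AGE are assumptions for my setup.
Meant to run periodically (e.g. from cron) on the PVE host itself.
"""
import json
import subprocess
import time

NODE = "pve"          # assumption: your PVE node name
WORKER_VMID = "101"   # assumption: the Veeam worker's VMID
MAX_AGE = 30 * 60     # 30 minutes; a healthy worker start takes seconds

# List the tasks currently running on this node as JSON.
out = subprocess.check_output(
    ["pvesh", "get", f"/nodes/{NODE}/tasks",
     "--source", "active", "--output-format", "json"])

for task in json.loads(out):
    # Only consider VM start tasks for the worker VM.
    if task.get("type") != "qmstart" or str(task.get("id")) != WORKER_VMID:
        continue
    age = time.time() - task["starttime"]
    if age > MAX_AGE:
        # DELETE on the task path stops the task, same as the Stop
        # button in the GUI.
        subprocess.run(
            ["pvesh", "delete", f"/nodes/{NODE}/tasks/{task['upid']}"],
            check=True)
        print(f"stopped stuck qmstart task {task['upid']} after {age:.0f}s")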
 
Investigating the same problem, I found that my worker VMs are created with 6 sockets (1 core each), while editing the processor settings in the GUI says the maximum number is 4 sockets.
I changed it to 1 socket (6 cores); after initiating a Veeam backup job, the worker powers on, but the processor setting returns to 6 sockets.
Maybe that causes the problem?

Edit:
The Veeam advanced settings for the worker have 6 vCPUs by default.
I'm changing it to 4 vCPUs to see if the problem persists.
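If you want to check what the worker was actually created with without clicking through the GUI, the CPU topology can be read from the VM config over the API. A quick sketch (node name and VMID are again assumptions for my setup):
Code:
import json
import subprocess

NODE, VMID = "pve", "101"  # assumptions: adjust to your node and worker VMID

# Read the VM configuration; "sockets" and "cores" default to 1 when unset.
cfg = json.loads(subprocess.check_output(
    ["pvesh", "get", f"/nodes/{NODE}/qemu/{VMID}/config",
     "--output-format", "json"]))
print("sockets:", cfg.get("sockets", 1), "cores:", cfg.get("cores", 1))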
 

Attachments

  • veeamworker.png
  • veeamworkerCPU.png
I got info from Veeam support: we can actually leave the Veeam worker in an idle/powered-on state.

In C:\Program Files\Veeam\Plugins\PVE\Service\, edit appsettings.json.
Under "Workers", change KeepTurnedOn from false to true:
"KeepTurnedOn": true

Save and reboot the server or the Veeam PVE service.
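For illustration, after the change the relevant part of appsettings.json looks something like this (any other keys in the file and in the "Workers" section are omitted here):
Code:
{
  "Workers": {
    "KeepTurnedOn": true
  }
}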

I am going to try this out for a while and see how it does. The worker looks to barely use any CPU resources while idle; it will eat up some RAM, but that's less risky than having it not power up correctly and missing checkpoints on VMs.

I did test what happens if you reboot the Veeam server with the worker powered on: the Veeam worker stays powered on, and once a backup is started it resets the worker; you can see the uptime clock reset.

Seems good so far; I'll let it run for the rest of the week and see if it's more stable.
 