[SOLVED] Pvescheduler is dead and won't start.

Guillaume Soucy

Hello,

I noticed that my automated backups weren't going through. After some reading, it seems to be related to pvescheduler, which wasn't running. The problem is that I can't start the service with:

Code:
systemctl start pvescheduler.service

Using the WebGUI doesn't work either. It loads forever:

Screenshot at 2023-01-10 03-20-24.png

It can hang for hours. Is there a way to start it without rebooting the host?

I'm running PVE 7.2-14.

Thank you,

Guillaume
 
Hi,
you can check the status of the pvescheduler service with systemctl status pvescheduler.service.
Also check the logs for the service with journalctl -b -u pvescheduler.service.
 
Good morning,

journalctl -b -u pvescheduler.service returns:

Code:
journalctl -b -u pvescheduler.service
-- Journal begins at Fri 2022-04-15 12:31:55 EDT, ends at Tue 2023-01-10 04:54:52 EST. --
-- No entries --

and systemctl status pvescheduler.service reports the service as dead:

Code:
systemctl status pvescheduler.service
● pvescheduler.service - Proxmox VE scheduler
     Loaded: loaded (/lib/systemd/system/pvescheduler.service; enabled; vendor preset: enabled)
     Active: inactive (dead)

Guillaume
 
Are the PVE services/targets that pvescheduler.service depends on all active? Check with systemctl status pve-storage.target pve-cluster.service pve-guests.service.
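
If you want a quick overview rather than three full status outputs, something like the following might help. It is only a sketch: check_units and the SYSTEMCTL override variable are names made up for this example, and it relies on systemctl is-active printing the unit state.

```shell
#!/bin/sh
# Print every listed unit that is not in the "active" state.
# SYSTEMCTL is a hypothetical override hook for dry runs against a stub;
# by default the real systemctl is used.
check_units() {
    for unit in "$@"; do
        # is-active prints the state (active, activating, inactive, failed, ...)
        # and exits non-zero for anything other than "active".
        state=$("${SYSTEMCTL:-systemctl}" is-active "$unit" 2>/dev/null) || :
        if [ "$state" != "active" ]; then
            printf '%s: %s\n' "$unit" "${state:-unknown}"
        fi
    done
}

# Example (on the PVE host):
#   check_units pve-storage.target pve-cluster.service pve-guests.service
```

No output means all listed units are active; a line such as "pve-guests.service: activating" points at the unit that is holding things up.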
 
pve-guests.service doesn't seem to be running:

Code:
● pve-guests.service - PVE guests
     Loaded: loaded (/lib/systemd/system/pve-guests.service; enabled; vendor preset: enabled)
     Active: activating (start) since Thu 2023-01-05 15:48:51 EST; 4 days ago
    Process: 1144 ExecStartPre=/usr/share/pve-manager/helpers/pve-startall-delay (code=exited, status=0/SUCCESS)
   Main PID: 1145 (pvesh)
      Tasks: 2 (limit: 38338)
     Memory: 116.5M
        CPU: 999ms
     CGroup: /system.slice/pve-guests.service
             └─1145 /usr/bin/perl /usr/bin/pvesh --nooutput create /nodes/localhost/startall
 
Okay, this might be the reason why the pvescheduler service will not start, since it is waiting for its dependency. It seems pve-guests.service is hanging during startup: pve-startall-delay has exited, but the pvesh startall call (PID 1145) is still running.

Do you see any errors related to pve-guests.service in the journal? Run journalctl -b -u pve-guests.service.
 
No, I see no errors. One thing, though: when booting the host I had to stop the VMs from starting; they normally start by themselves at 60-second intervals. Could the issue be caused by my interrupting the VM start sequence?

Guillaume
 
Yes, this most likely is related... Can you provide:
  • A complete journal since boot, journalctl -b
  • Output of ps auxwf
  • A strace for the hanging pvesh command, strace -yyttT -f -s 512 -p 1145
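
The three outputs above could be gathered in one go with something like this. It is a rough sketch, not an official tool: the function name collect_diag, the default directory /tmp/pve-diag, and the 30-second limit are my own choices, and the strace call is wrapped in timeout because strace -p otherwise keeps running until interrupted. It needs to be run as root.

```shell
#!/bin/sh
# Collect the requested outputs into one directory.
# Arguments: <pid of hanging process> [output dir] [seconds to trace]
collect_diag() {
    pid=$1
    out=${2:-/tmp/pve-diag}
    secs=${3:-30}
    mkdir -p "$out" || return 1

    # Complete journal since boot.
    journalctl -b --no-pager > "$out/journal.txt" 2>&1 || :
    # Process list as a tree, so parent/child relationships are visible.
    ps auxwf > "$out/ps.txt" 2>&1 || :
    # Trace the hanging process for a bounded time, then detach.
    timeout "$secs" strace -yyttT -f -s 512 -p "$pid" \
        > "$out/strace.txt" 2>&1 || :

    ls "$out"
}

# Example (on the PVE host, for the hanging pvesh process):
#   collect_diag 1145
```

The resulting files can then be attached to the thread.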
 
Thanks, although it seems you accidentally linked to the ps output twice. Could you fix the link to the journal output?
 
Okay,
so interrupting the VM start after boot seems to have produced a zombie process.
Try kill -9 1145 1146, then check the status of your services again with systemctl status pve-guests pvescheduler.
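
A slightly gentler variant of the kill above, in case it is useful: send SIGTERM first and only fall back to SIGKILL if the process is still there after a grace period. This is a sketch under my own assumptions (the function name kill_escalate and the 5-second default are made up), and it handles one PID per call. Note that a process stuck in uninterruptible sleep (state D in ps) will not react even to SIGKILL.

```shell
#!/bin/sh
# Send SIGTERM first, wait up to <grace> seconds, then fall back to SIGKILL.
# Arguments: <pid> [grace seconds, default 5]
kill_escalate() {
    pid=$1
    grace=${2:-5}
    # Refuse empty, non-numeric, or special PIDs (0 would signal the whole
    # process group, 1 is init).
    [ "$pid" -gt 1 ] 2>/dev/null || return 1
    kill -TERM "$pid" 2>/dev/null || return 0      # gone already, or not permitted
    i=0
    while [ "$i" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0     # process has exited
        sleep 1
        i=$((i + 1))
    done
    kill -KILL "$pid" 2>/dev/null || :
}

# Example (on the PVE host), using the unit's main PID instead of a
# hard-coded number:
#   kill_escalate "$(systemctl show -p MainPID --value pve-guests.service)"
```

Afterwards, systemctl status pve-guests pvescheduler shows whether the units recovered.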
 
Yes, it works now; the backup process kicked in by itself.

Also, I had to stop a backup process on another host, and now that host's WebGUI seems to hang. How can I kill the backup process completely? I ran
Code:
kill PID
but the process seems to be stuck.

Thank you,

Guillaume
 
What do you mean exactly? Did you try to kill the backup job by sending a SIGTERM to the process?

To stop a task from the CLI, you can use pvesh: pvesh delete /nodes/{node}/tasks/{upid}. You can find the task's UPID in the task list from pvesh get /cluster/tasks (an earlier version of this post suggested pvesh get /nodes/{node}/tasks).

What errors are you seeing in the WebUI? Is it timing out?
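
Since the node name is also the second field of the UPID, the stop command can be assembled from the UPID alone, which avoids typos in the path. A small sketch: task_stop_cmd is a made-up helper, the UPID in the example is invented for illustration, and the function only prints the command so it can be reviewed before running it.

```shell
#!/bin/sh
# Build the pvesh command that stops a task, taking the node name
# from the UPID itself. Prints the command; does not run it.
task_stop_cmd() {
    upid=$1
    node=${upid#UPID:}      # strip the "UPID:" prefix
    node=${node%%:*}        # keep everything up to the next colon
    printf 'pvesh delete /nodes/%s/tasks/%s\n' "$node" "$upid"
}

# Invented example UPID, not from a real system:
task_stop_cmd 'UPID:node1:0000ABCD:00001234:63000000:vzdump:100:root@pam:'
# prints: pvesh delete /nodes/node1/tasks/UPID:node1:0000ABCD:00001234:63000000:vzdump:100:root@pam:
```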
 
Yes, I get timeouts.

pvesh get /nodes/{node}/tasks

The command seems to return no UPID. I've attached the output to this message.

Thank you,

Guillaume
 

Your output is cut off, but that does not matter because I was wrong: /nodes/{node}/tasks only includes finished tasks. pvesh get /cluster/tasks --noborder should also include running tasks. You could also filter the output by VMID and for vzdump using grep.

Regarding the issue with the WebUI, can you check for errors which might give a clue on what is not working? Is the pveproxy.service active?
 
The task still seems to be running:

Code:
            │ pve-004-dc │ 3566291 │ 339398990 │ 1673514005 │ vzdump    │ UPID:pve-004-dc:00366AD3:143AD14E:63BFCC15:vzdump::root@pam:            │ root@pam │ 1673533773 │ unexpected status

Code:
pvesh delete /nodes/pve-004-dc/tasks/UPID:pve-004-dc:00366AD3:143AD14E:63BFCC15:vzdump::root@pam:
This does not stop it.

As for the WebGUI, only some parts are not working, but I think it's related to the backup task that is stuck with an unexpected status. If we manage to kill that backup task, the GUI will probably get back to normal.
 
As far as I can see from the output you posted, the task has an endtime timestamp, so although it has an unexpected status, the task seems to have terminated.

What makes you believe that this task is still running? Also, what exactly is not working in the WebUI? Please provide more information.
 
I thought it was still running because it still shows up in the WebUI.

Screenshot at 2023-01-12 11-23-59.png

I also tried to stop the backup process from the WebUI, but some parts aren't loading:

Screenshot at 2023-01-12 11-24-22.png
This leads to a communication failure. The WebUI is not down, though; I can access some other parts.
Screenshot at 2023-01-12 11-24-54.png

For the backup task this is what I've got:

Screenshot at 2023-01-12 11-24-42.png
The output table is empty.

Normally I fix this with a reboot, but there should be a better way to do so.

Thank you very much for your continued help.

Guillaume
 
The task you tried to stop with the pvesh command is not the one which is stuck (at least the starttime timestamps don't match).

Check the tasks in the cluster with pvesh get /cluster/tasks --output-format json-pretty

Regarding your WebUI, it seems that there is an issue with fetching the status of one of your VMs (the one with the blocked backup, I assume). Is this VM still running? Maybe that's why your backup job ended up in this state: because the VM is acting up. Check the journal for errors.
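
For comparing start times as mentioned above, it can help to decode the UPID itself: the fields after the node name are the worker PID, the process start time, and the task start time, all in hexadecimal. The sketch below assumes that field layout (it matches the pid, pstart and starttime columns of the table row posted earlier in this thread); decode_upid is just a name chosen for the example.

```shell
#!/bin/sh
# Decode the colon-separated fields of a task UPID.
# pid, pstart and starttime are stored as hexadecimal.
# The body runs in a subshell so the IFS change does not leak.
decode_upid() (
    IFS=:
    set -f              # no globbing while splitting
    set -- $1           # $1=UPID $2=node $3=pid $4=pstart $5=starttime $6=type
    printf 'node=%s pid=%d pstart=%d starttime=%d type=%s\n' \
        "$2" "$((0x$3))" "$((0x$4))" "$((0x$5))" "$6"
)

decode_upid 'UPID:pve-004-dc:00366AD3:143AD14E:63BFCC15:vzdump::root@pam:'
# prints: node=pve-004-dc pid=3566291 pstart=339398990 starttime=1673514005 type=vzdump
```

The decoded starttime is a Unix timestamp, so with GNU date it can be turned into a readable time via date -d @1673514005 and compared with the start time of the task shown in the WebUI.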
 
