Backup job is stuck and I cannot stop it or even kill it

cym

Hello to all,

Since yesterday, January 9, a backup job has been stuck and I cannot stop it or even kill it.
The backup job contains 3 containers and 1 VM. The job started at 03:00:04.
In the back office, when I open 'Task viewer: Backup Job' and press the 'Stop' button, nothing happens.

In the console, when I check the vzdump process:
Bash:
# ps awxf | grep vzdump

I get the process ID :
Bash:
2444287 ?        Ds     0:00 task UPID:server_name:00254BFF:06D36C31:63BB7524:vzdump::root@pam:

and when I try to kill this process
Bash:
# kill -9 2444287

nothing happens, the process is still there.

I cannot see anything special in /var/log/syslog around the time the job was launched:
Bash:
# grep 'Jan  9 03:' /var/log/syslog

What is the best option if I want to avoid shutting down the whole node and/or the LXC containers?
I use Proxmox VE 7.3-4, fully updated/upgraded.

Many thanks in advance for your help ;-)
 
Just had this happen. Ideally you run

vzdump -stop
But that didn't stop it for me. Neither did kill, at least not immediately.

kill -9 <process id>

worked, but it took about 2 minutes for the backup to die.
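
For anyone scripting this, the escalation described above (ask politely first, force only after a grace period) could look roughly like the sketch below. `stop_with_grace` is a made-up helper name, not a Proxmox command, and the 120-second default is only an assumption based on the roughly 2 minutes mentioned above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Proxmox VE): send SIGTERM, wait up to
# $2 seconds (default 120) for the process to exit, then send SIGKILL.
stop_with_grace() {
    local pid=$1 timeout=${2:-120} waited=0
    # If the signal cannot be sent, the process is already gone.
    kill -TERM "$pid" 2>/dev/null || return 0
    # kill -0 only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            kill -KILL "$pid" 2>/dev/null
            break
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

It would be called as `stop_with_grace "$pid" 120`. A process that is blocked in uninterruptible sleep will not react to either signal until the blocking I/O returns, so this only helps with tasks that are slow to exit, not with ones that are truly stuck.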
 
Many thanks Lymond, but it didn't work:

```
vzdump -stop
stopping backup process 2444287 failed
```
What are the other options ?
Cheers,
Cyril
 
Same kind of issue here: I launched a backup (4 TB disk, but only 576 GB of actual content) and after 50 minutes the backup task disappeared from the task list at the bottom,
but it is still running. What should I do now?
Code:
root@lab:~# ps awxf | grep vzdump
1787358 pts/0    S+     0:00                      \_ grep vzdump
1693255 ?        Ss     0:09 task UPID:lab666:0019D647:23DC0012:63DF68C9:vzdump:101:root@pam:
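
The listing above also contains the grep process itself, and the interesting part is the third column (the process state). A small sketch that pulls out just the PID and state of each vzdump task; `vzdump_states` is a hypothetical helper name, and it assumes the `ps awxf` column order shown above (PID, TTY, STAT, TIME, COMMAND).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ps awxf`-style lines on stdin and print
# "PID STATE" for every vzdump task, skipping the grep process itself.
vzdump_states() {
    awk '/vzdump/ && !/grep/ { print $1, $3 }'
}
```

Used as `ps awxf | vzdump_states`. A state starting with D or S is what decides whether killing the task can work at all.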
 
Has this ever been solved?
 
Hi,
I don't think that both situations are the same here: Note that the process state in the original post was Ds, which indicates that the process was in an uninterruptible sleep (waiting for IO). This state cannot be (as the term already suggests) interrupted by signals, so reboot or waiting for IO to complete (if it ever will) are the only ways out.

The Ss state, however, is an interruptible sleep and can therefore be interrupted, e.g. by sending the KILL signal.

For details please see the PROCESS STATE CODES in man ps.

What is the exact issue? Is the backup task hanging?
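
The distinction between the two states can be checked quickly from the shell. This is only a sketch: `describe_state` and `list_blocked` are hypothetical helper names, and the descriptions are shortened from the PROCESS STATE CODES section of man ps.

```shell
#!/usr/bin/env bash
# Hypothetical helper: explain a ps STAT value based on its first letter.
describe_state() {
    case "$1" in
        D*) echo "uninterruptible sleep: signals are not acted on until the I/O returns" ;;
        S*) echo "interruptible sleep: can be woken by signals such as KILL" ;;
        R*) echo "running or runnable" ;;
        Z*) echo "zombie: already dead, waiting to be reaped by its parent" ;;
        T*) echo "stopped" ;;
        *)  echo "other state, see PROCESS STATE CODES in man ps" ;;
    esac
}

# List every process currently in state D, together with the kernel
# function it is sleeping in (WCHAN column). The header line is kept.
list_blocked() {
    ps -eo pid,stat,wchan:32,args | awk 'NR == 1 || $2 ~ /^D/'
}
```

`describe_state Ds` reports the uninterruptible case from the original post, `describe_state Ss` the interruptible one. `list_blocked` can hint at what a stuck task is waiting for, for example an NFS or CIFS call.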
 
Yes, the backup task hangs and does not stop.

Code:
root@AH-PVE-03:~# ps awxf | grep vzdump
1165356 pts/0    S+     0:00              \_ grep vzdump
3370202 ?        Ds     0:01 task UPID:AH-PVE-03:00336CDA:0805D6B4:64F00232:vzdump::root@pam:
 
Ds unfortunately means you will have to reboot, as the process is not killable in this case. Did you back up to a remote share and lose the connection?
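
One way to test whether a share still answers, without tying up the current shell on it, is to probe it with a time limit. A rough sketch, assuming GNU coreutils `timeout` and `stat` are available; `mount_responds` is a hypothetical helper name and the 5-second default is arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the path answers a stat call within
# $2 seconds (default 5). Any non-zero status means the path is missing
# or the probe had to be killed because it did not return in time.
mount_responds() {
    local path=$1 limit=${2:-5}
    timeout -s KILL "$limit" stat -- "$path" >/dev/null 2>&1
}
```

For example `mount_responds /mnt/pve/backup 5 || echo "share not answering"`, where the mount path is only an example. If the probe itself ends up in a sleep that even SIGKILL cannot interrupt, the function will block as well, so this is a quick diagnostic rather than a guarantee.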
 
Yes, something like that. I experimented with having a guest share an RDX drive over SMB as a backup target for 3 PVE nodes in a cluster.
Well, it turned out that this gets disconnected under heavy write load. I have already installed a PBS on site, but I am still struggling with this "hanging" backup job.

OK, a reboot should fix it; we will schedule this.
 
I have a similar (I think) issue. Often, when backing up an LXC that uses an NFS mount point passed through from the PVE host, the backup job gets stuck at "INFO: create storage snapshot 'vzdump'" and the entire node ends up with a question mark on it:

1696238299815.png

It seems Proxmox does not handle mount points very well, as this happens almost every other week.

The lovely thing is even killing the backup job leaves my node in the above state.
 
Same issue here. At backup time one of the servers died. The thing is, this server was the one that shared its storage via NFS, and the backups were being written to that NFS share (mounted with the "hard" option).
Now we have 2 backup processes in the GUI log: one was running on the server that died, the other on one of the remaining servers.
The dead server's backup job has no data, and its stop button is greyed out in the GUI.
The live server's job does show data, frozen at the stage it had reached when the NFS storage crapped out.
I unlocked the VM that was on the live server and unmounted the NFS share with the -l (lazy) option, but the process is still in the Ds state and just hangs.
So, are there no solutions other than a reboot? Is there any safe way to mount an NFS share without the "hard" option?
 
No, unfortunately you will have to reboot in order to get rid of the processes stuck in uninterruptible sleep state.

It is possible to set the soft mount option in the storage config; however, it should only be used with read-only NFS shares, as otherwise you risk data corruption. See https://pve.proxmox.com/wiki/Storage:_NFS
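
To see which mode an existing NFS mount actually uses, the options column of /proc/mounts can be checked. A minimal sketch, assuming the usual Linux behaviour that an NFS mount without an explicit soft or softerr option is a hard mount; `nfs_mount_modes` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<mountpoint> <mode>" for every NFS mount
# found in a /proc/mounts-style file (default /proc/mounts, "-" = stdin).
nfs_mount_modes() {
    awk '$3 == "nfs" || $3 == "nfs4" {
        mode = "hard"                       # assumed default when not listed
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "soft" || opts[i] == "softerr")
                mode = opts[i]
        print $2, mode
    }' "${1:-/proc/mounts}"
}
```

Running `nfs_mount_modes` on a node lists each NFS mount point with its mode, which shows at a glance which shares can leave a writer in uninterruptible sleep when the server goes away.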
 
Is your NFS server back online? Sometimes it just works after rebooting the NFS server so that the NFS is working again.
I am afraid that since he unmounted the share and the processes still hold and are blocking on the file handles created on the previous mount, this will not recover the processes in uninterruptible sleep state.

As stated in the man page for umount:
Code:
       -l, --lazy
           Lazy unmount. Detach the filesystem from the file hierarchy now, and clean up all references to this filesystem as soon as it is not
           busy anymore.

           A system reboot would be expected in near future if you’re going to use this option for network filesystem or local filesystem with
           submounts. The recommended use-case for umount -l is to prevent hangs on shutdown due to an unreachable network share where a normal
           umount will hang due to a downed server or a network partition. Remounts of the share will not be possible
 
Yes, I know. I just wanted to be sure that it had at least been tried. NFS is still a nightmare with respect to these outages. I had this kind of error just this week, stopped myself from running the lazy umount and just waited ... it was worth it.
 
Is your NFS server back online? Sometimes it just works after rebooting the NFS server so that the NFS is working again.
According to the iLO IML logs, the hosting server has thrown a CPU error. It cannot be powered on from iLO in this state, and nobody is on site until Sunday to try a hard reset.
 
Luckily we have the resources to move the running VMs to other servers in the cluster. I think it still needs some quorum trickery to prevent loss, as it is a 4-node cluster.
 
If you just reboot one node (a hard reset is best, because during a normal reboot the hanging process still cannot be killed), it should not be a problem; 3 out of 4 is still quorate.
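
The arithmetic behind this: a cluster stays quorate as long as a strict majority of votes is present, i.e. floor(n/2) + 1. A tiny sketch, assuming one vote per node (an assumption, since the vote configuration is not shown in this thread); both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Votes needed for a majority, assuming one vote per node.
quorum_needed() { echo $(( $1 / 2 + 1 )); }

# Succeeds if the cluster stays quorate with $2 of $1 nodes online.
is_quorate() { [ "$2" -ge "$(quorum_needed "$1")" ]; }
```

With 4 nodes, `quorum_needed 4` gives 3, so `is_quorate 4 3` succeeds while `is_quorate 4 2` fails: one node can be reset safely, a second simultaneous outage would cost quorum. The real values for a cluster are shown by `pvecm status`.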
 
And regarding the phantom process that we see in the GUI, which belonged to the server that is in error:
what can be done about that? It is not present on any physical server; can it be removed somehow?
 
It should switch to failed if the hanging process is resolved, so after the reboot, it should be OK.
 
