Backup job is stuck and I cannot stop it or even kill it

cym · Jan 10, 2023

Hello to all,

Since yesterday, January 9, a backup job has been stuck and I cannot stop it or even kill it.
The backup job contains 3 containers and 1 VM. The job started at 03:00:04.
In the web interface, when I open 'Task viewer: Backup Job' and press the 'Stop' button, nothing happens.

In the console when I check the vzdump process

Bash:

# ps awxf | grep vzdump

I get the process ID :

Bash:

2444287 ?        Ds     0:00 task UPID:server_name:00254BFF:06D36C31:63BB7524:vzdump::root@pam:

and when I try to kill this process

Bash:

# kill -9 2444287

nothing happens, the process is still there.
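
As a side note, the UPID shown in the ps output above can be decoded to confirm which task the stuck process belongs to. The sketch below assumes the usual UPID layout (node, PID, process start, task start time, type, ID, user, with the three numeric fields in hexadecimal); `decode_upid` is a hypothetical helper name, not a Proxmox command.

```bash
# decode_upid UPID
# Prints the node, PID, start time (epoch seconds) and task type encoded in a UPID.
# Assumed layout: UPID:node:pid:pstart:starttime:type:id:user:  (pid/pstart/starttime in hex)
decode_upid() {
    local prefix node pid_hex pstart_hex start_hex type rest
    IFS=: read -r prefix node pid_hex pstart_hex start_hex type rest <<EOF
$1
EOF
    # printf converts the hex fields to decimal
    echo "node=$node pid=$(printf '%d' "0x$pid_hex") start_epoch=$(printf '%d' "0x$start_hex") type=$type"
}

decode_upid 'UPID:server_name:00254BFF:06D36C31:63BB7524:vzdump::root@pam:'
# -> node=server_name pid=2444287 start_epoch=1673229604 type=vzdump
```

The decoded PID matches the one reported by ps, and the start time (2023-01-09 02:00:04 UTC) corresponds to the 03:00:04 job start if the node runs on UTC+1, which is an assumption here.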

I cannot see anything special in /var/log/syslog around the time the job was launched :

Bash:

# cat syslog | grep 'Jan  9 03:'

What is the best option, ideally without shutting down the whole node and/or the LXC containers?
I use Proxmox VE 7.3-4, fully updated/upgraded.

Many thanks in advance for your help ;-)

Lymond · Jan 10, 2023

Just had this happen. Ideally you run

vzdump -stop
But that didn't stop it for me. Neither did kill, at least not immediately.

kill -9 <process id>

worked, but it took about 2 minutes for the backup to die.
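
Since the process may take a while to disappear after the signal is sent, a small wait loop can save some manual checking. This is only a sketch: `kill_and_wait` is a hypothetical helper, and the 180-second default timeout is an arbitrary assumption based on the roughly 2 minutes mentioned above.

```bash
# kill_and_wait PID [TIMEOUT_SECONDS]
# Sends SIGKILL, then polls once per second until the PID is gone or the timeout expires.
kill_and_wait() {
    local pid=$1 timeout=${2:-180} waited=0
    kill -KILL "$pid" 2>/dev/null
    # kill -0 sends no signal; it only tests whether the PID still exists
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "PID $pid still present after ${timeout}s; check its state with: ps -o stat= -p $pid" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "PID $pid is gone after ${waited}s"
}

# Usage: kill_and_wait <process id> 180
```

If the loop times out, the process state reported by ps is worth a look before trying anything else.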

cym · Jan 11, 2023

Many thanks Lymond, but it didn't work:

```
vzdump -stop
stopping backup process 2444287 failed
```
What are the other options ?
Cheers,
Cyril

ledufakademy · Feb 5, 2023

Same kind of issue here: I launched a backup (4 TB disk, but only 576 GB of actual content), and after 50 minutes the backup task disappeared from the bottom task tab,
but it is still running. What should I do now?

Code:

root@lab:~# ps awxf | grep vzdump
1787358 pts/0    S+     0:00                      \_ grep vzdump
1693255 ?        Ss     0:09 task UPID:lab666:0019D647:23DC0012:63DF68C9:vzdump:101:root@pam:

itNGO · Aug 31, 2023

Has this ever been solved?

Chris · Aug 31, 2023

Hi,
I don't think that both situations are the same here: Note that the process state in the original post was Ds, which indicates that the process was in an uninterruptible sleep (waiting for IO). This state cannot be (as the term already suggests) interrupted by signals, so reboot or waiting for IO to complete (if it ever will) are the only ways out.

The Ss state, however, is an interruptible sleep and can therefore be interrupted, e.g. by sending the KILL signal.

For details please see the PROCESS STATE CODES in man ps.
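
To see at a glance which processes are in uninterruptible sleep, the ps output can be filtered on the STAT column. This is a minimal sketch; the helper names `filter_uninterruptible` and `list_uninterruptible` are made up for illustration, and the WCHAN column (the kernel function the process is sleeping in) may be empty or `-` on some systems.

```bash
# Keep the header line plus every row whose STAT column (2nd field) starts with "D".
filter_uninterruptible() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# List PID, state, kernel wait channel and command line, filtered to D-state processes.
list_uninterruptible() {
    ps -eo pid,stat,wchan:32,args | filter_uninterruptible
}

# Usage: list_uninterruptible
```

A vzdump task that shows up in this list will not react to signals until the pending I/O completes.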

What is the exact issue? Is the backup task hanging?

itNGO · Aug 31, 2023

Yes, the backup task hangs and does not stop.

Code:

root@AH-PVE-03:~# ps awxf | grep vzdump
1165356 pts/0    S+     0:00              \_ grep vzdump
3370202 ?        Ds     0:01 task UPID:AH-PVE-03:00336CDA:0805D6B4:64F00232:vzdump::root@pam:

Chris · Aug 31, 2023

Ds unfortunately means you will have to reboot, as the process is not killable in this case. Did you back up to a remote share and lose the connection?

itNGO · Aug 31, 2023

Yes, something like that. We experimented with the idea of having a guest share an RDX drive via SMB as a backup target for 3 PVEs in a cluster.
Well, it turned out that this gets disconnected under heavy write load. We have already installed a PBS onsite, but are still struggling with this "hanging" backup job.

OK, a reboot should fix it; we will schedule this.

timdonovan · Oct 2, 2023

I have a similar (I think) issue. Often, when backing up an LXC that uses an NFS mount point passed through from the PVE host, the backup job gets stuck at "INFO: create storage snapshot 'vzdump'" and the entire node ends up marked with a question mark.

It seems Proxmox does not handle mount points very well, as this happens almost every other week.

The lovely thing is even killing the backup job leaves my node in the above state.

gradinaruvasile · Oct 13, 2023

Same issue here. At backup time one of the servers died. The thing is, this server was the one that shared its storage via NFS, and the backups were being written to that NFS share (mounted with the "hard" option).
Now we have 2 backup processes in the GUI log: one was running on the server that died, the other on one of the remaining servers.
The dead server's backup job has no data, and its stop button is greyed out in the GUI.
The live server's job does show data, frozen at the stage it was at when the NFS storage crapped out.
I unlocked the VM that was on the live server and unmounted the NFS share with the -l (lazy) option, but the process is still in the Ds state and just hangs.
So, there are no solutions other than a reboot? Is there any safe option to mount an NFS share without the "hard" option?

Chris · Oct 13, 2023

No, unfortunately you will have to reboot in order to get rid of the processes stuck in uninterruptible sleep state.

It is possible to set the soft mount option in the config; it should, however, only be used with read-only NFS shares, as otherwise you risk data corruption. See https://pve.proxmox.com/wiki/Storage:_NFS
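
To check how the NFS shares on a node are currently mounted, the mount table can be scanned for the hard/soft option. A minimal sketch, assuming input in /proc/mounts format; `nfs_mount_modes` is a hypothetical helper, and it treats a mount with neither option listed as hard, since that is the NFS client default.

```bash
# nfs_mount_modes: reads lines in /proc/mounts format on stdin and prints
# "<mount point> <hard|soft>" for every nfs/nfs4 entry.
nfs_mount_modes() {
    awk '$3 ~ /^nfs4?$/ {
        mode = "hard"                      # assumed default when "soft" is not listed
        n = split($4, opts, ",")           # 4th field holds the comma-separated options
        for (i = 1; i <= n; i++)
            if (opts[i] == "soft") mode = "soft"
        print $2, mode
    }'
}

# Usage: nfs_mount_modes < /proc/mounts
```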

LnxBil · Oct 13, 2023

Is your NFS server back online? Sometimes it just works after rebooting the NFS server so that the NFS is working again.

Chris · Oct 13, 2023

I am afraid that since he unmounted the share and the processes still hold and are blocking on the file handles created on the previous mount, this will not recover the processes in uninterruptible sleep state.

As stated in the man page for umount:

Code:

       -l, --lazy
           Lazy unmount. Detach the filesystem from the file hierarchy now, and clean up all references to this filesystem as soon as it is not
           busy anymore.

           A system reboot would be expected in near future if you’re going to use this option for network filesystem or local filesystem with
           submounts. The recommended use-case for umount -l is to prevent hangs on shutdown due to an unreachable network share where a normal
           umount will hang due to a downed server or a network partition. Remounts of the share will not be possible

LnxBil · Oct 13, 2023

Yes, I know. I just wanted to be sure that it had at least been tried before. NFS is still a nightmare with respect to these outages. I had this kind of error just this week, stopped myself from running the lazy umount and just waited ... it was worth it.

gradinaruvasile · Oct 13, 2023

LnxBil said:
Is your NFS server back online? Sometimes it just works after rebooting the NFS server so that the NFS is working again.

The hosting server threw a CPU error according to the iLO IML logs. It cannot be powered on from iLO in this state, and nobody is on site until Sunday to try a hard reset.

gradinaruvasile · Oct 13, 2023

LnxBil said:
Yes, I know. I just wanted to be sure that it had at least been tried before. NFS is still a nightmare with respect to these outages. I had this kind of error just this week, stopped myself from running the lazy umount and just waited ... it was worth it.

Luckily we have the resources to move the running VMs to other servers in the cluster. I think it still needs some quorum trickery to avoid losing quorum, as it is a 4-node cluster.

LnxBil · Oct 13, 2023

If you just reboot one node (a hard reset is best, because during a normal reboot the hanging process still cannot be killed), it should not be a problem; 3 out of 4 is still quorate.
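
The quorum arithmetic can be sketched as a simple majority check. This assumes one vote per node and no extra quorum device; `quorum_needed` and `is_quorate` are hypothetical helpers, and `pvecm status` remains the place to see what the cluster itself reports.

```bash
# quorum_needed TOTAL_VOTES: votes required for a majority (floor(total / 2) + 1).
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# is_quorate TOTAL_VOTES ONLINE_VOTES: succeeds if the online votes reach the majority.
is_quorate() {
    [ "$2" -ge "$(quorum_needed "$1")" ]
}

for online in 4 3 2; do
    if is_quorate 4 "$online"; then
        echo "$online of 4 online: quorate"
    else
        echo "$online of 4 online: not quorate"
    fi
done
```

With four votes in total the majority is three, so the check passes with three nodes online and fails with only two.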

gradinaruvasile · Oct 13, 2023

One more thing, regarding the phantom process that we see in the GUI and that belonged to the failed server:
what can be done about it? It is not present on any physical server; can it be removed somehow?

LnxBil · Oct 13, 2023

It should switch to failed if the hanging process is resolved, so after the reboot, it should be OK.
