SMB/CIFS Backup job stuck

peterge

Hey, we have an interesting problem in our Proxmox cluster. Two nodes each have a backup job that backs up to a Windows Veeam server in a remote datacenter every night. On the proxmox4 node the backup runs just fine, but on the proxmox2 node the backup job sometimes gets stuck. It happened at the beginning of February (around the 6th?) and again last night.

journalctl prints these kernel log lines: https://pastebin.com/2rht2iKa (full log)
Code:
...
Feb 22 01:59:40 proxmox2 kernel: INFO: task kworker/29:6:2639382 blocked for more than 362 seconds.
Feb 22 01:59:40 proxmox2 kernel:       Tainted: P           O      5.15.102-1-pve #1
Feb 22 01:59:40 proxmox2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
...
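
To see how often this happens and which tasks are affected, the hung-task reports can be pulled out of the kernel log. This is only a minimal sketch, assuming the message format shown above and a persistent journal; `hung_tasks` is a made-up helper name, not part of any tool:

```shell
# hung_tasks: hypothetical helper. Reads kernel log lines on stdin and prints
# "<date> <time> <task> <seconds>" for every hung-task report it finds.
hung_tasks() {
    sed -n 's/^\([A-Z][a-z]* *[0-9]* [0-9:]*\) .* kernel: INFO: task \(.*\) blocked for more than \([0-9]*\) seconds\.$/\1 \2 \3/p'
}

# Usage on the affected node. "-k" alone only covers the current boot,
# so "-b -1" selects the boot before the forced reboot:
# journalctl -k -b -1 | hung_tasks
```

Comparing the timestamps with the backup window should show whether the hangs only occur while the job is writing to the share.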

The backup job log shows these lines before freezing:
Code:
...
INFO:  89% (223.8 GiB of 251.0 GiB) in 18m 8s, read: 1.3 GiB/s, write: 96.2 MiB/s
INFO:  91% (230.0 GiB of 251.0 GiB) in 18m 11s, read: 2.1 GiB/s, write: 99.8 MiB/s
INFO:  92% (232.2 GiB of 251.0 GiB) in 18m 14s, read: 737.0 MiB/s, write: 167.5 MiB/s
INFO:  93% (233.5 GiB of 251.0 GiB) in 18m 17s, read: 451.9 MiB/s, write: 161.1 MiB/s
INFO:  96% (243.3 GiB of 251.0 GiB) in 18m 21s, read: 2.4 GiB/s, write: 3.0 KiB/s
INFO: 100% (251.0 GiB of 251.0 GiB) in 18m 24s, read: 2.6 GiB/s, write: 0 B/s
INFO: backup is sparse: 54.25 GiB (21%) total zero data
INFO: transferred 251.00 GiB in 1104 seconds (232.8 MiB/s)

Then vzdump enters the Ds state (D = uninterruptible sleep), so the process cannot be killed and the only way to clear the stuck job is to reboot the host:
Code:
root@proxmox2:~# ps -aux | grep vzdump
root      655590  0.0  0.0   6244   648 pts/0    S+   12:03   0:00 grep vzdump
root     2428713  0.0  0.0 350112 110952 ?       Ds   01:30   0:02 task UPID:proxmox2:00250F29:03B03D45:65D6958A:vzdump::root@pam:
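
To check whether other processes are stuck as well and what they are waiting on, the process list can be filtered for the D state. A rough sketch; `dstate_procs` is a hypothetical helper, and `/proc/<pid>/stack` is only readable as root and only if the kernel exposes it:

```shell
# dstate_procs: hypothetical helper. Reads `ps -eo pid,stat,wchan:32,args`
# output on stdin, skips the header line, and prints only processes whose
# STAT column starts with D (uninterruptible sleep).
dstate_procs() {
    awk 'NR > 1 && $2 ~ /^D/'
}

# Usage on the affected node:
# ps -eo pid,stat,wchan:32,args | dstate_procs
# cat /proc/2428713/stack   # kernel stack of the stuck vzdump task
```

The WCHAN column and the kernel stack should indicate whether the task is blocked inside the CIFS code or somewhere else.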

Since we are not running an up-to-date kernel: would upgrading fix this?
Code:
root@proxmox2:~# uname -a
Linux proxmox2 5.15.102-1-pve #1 SMP PVE 5.15.102-1 (2023-03-14T13:48Z) x86_64 GNU/Linux
Code:
root@proxmox4:~# uname -a
Linux proxmox4 5.15.102-1-pve #1 SMP PVE 5.15.102-1 (2023-03-14T13:48Z) x86_64 GNU/Linux

What I don't understand:
Why does this only show up on proxmox2?
Why is proxmox4 unaffected?
Why does it work on some days and fail on others?
Is there another solution besides upgrading?

IMHO this is hard to troubleshoot because the configuration is the same on both nodes, but only one of them runs into this problem every now and then...
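
One way to double-check that both nodes really mount the share the same way is to diff the effective CIFS mount options. This is just a sketch under assumptions: the backup storage is mounted via CIFS on both nodes, root SSH access to both nodes is available, and `cifs_mounts` is a hypothetical helper name:

```shell
# cifs_mounts: hypothetical helper. Reads /proc/mounts-style lines on stdin
# and prints "<mountpoint> <options>" for every CIFS mount, sorted so the
# output of two nodes can be diffed.
cifs_mounts() {
    awk '$3 == "cifs" { print $2, $4 }' | sort
}

# Usage from a host that can reach both nodes (assumed):
# ssh root@proxmox2 cat /proc/mounts | cifs_mounts > /tmp/proxmox2.cifs
# ssh root@proxmox4 cat /proc/mounts | cifs_mounts > /tmp/proxmox4.cifs
# diff /tmp/proxmox2.cifs /tmp/proxmox4.cifs
```

An empty diff would at least rule out differing mount options (for example the negotiated SMB version) as the reason why only one node is affected.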