Hey, we have an interesting problem in our Proxmox cluster. Two nodes have a backup job set up that backs up to a Windows Veeam server in a remote datacenter every night. On the proxmox4 node the backup runs just fine, but on the proxmox2 node the backup job sometimes hangs. It happened at the beginning of February (around the 6th?) and again last night.
journalctl prints these kernel log lines: https://pastebin.com/2rht2iKa (full log)
Code:
...
Feb 22 01:59:40 proxmox2 kernel: INFO: task kworker/29:6:2639382 blocked for more than 362 seconds.
Feb 22 01:59:40 proxmox2 kernel: Tainted: P O 5.15.102-1-pve #1
Feb 22 01:59:40 proxmox2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
...
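To get more context the next time this happens, I'm planning to check the hung-task detector settings that produced the message above (paths are the standard Linux procfs ones; they may be absent on kernels built without CONFIG_DETECT_HUNG_TASK):

```shell
# Current detection window and how many warnings the kernel will still print
cat /proc/sys/kernel/hung_task_timeout_secs 2>/dev/null
cat /proc/sys/kernel/hung_task_warnings 2>/dev/null

# Before the next backup window I'd log every blocked task, not just the
# first few (-1 = unlimited warnings; needs root, hence left commented out)
# echo -1 > /proc/sys/kernel/hung_task_warnings
```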
The Backup Job shows these lines before freezing:
Code:
...
INFO: 89% (223.8 GiB of 251.0 GiB) in 18m 8s, read: 1.3 GiB/s, write: 96.2 MiB/s
INFO: 91% (230.0 GiB of 251.0 GiB) in 18m 11s, read: 2.1 GiB/s, write: 99.8 MiB/s
INFO: 92% (232.2 GiB of 251.0 GiB) in 18m 14s, read: 737.0 MiB/s, write: 167.5 MiB/s
INFO: 93% (233.5 GiB of 251.0 GiB) in 18m 17s, read: 451.9 MiB/s, write: 161.1 MiB/s
INFO: 96% (243.3 GiB of 251.0 GiB) in 18m 21s, read: 2.4 GiB/s, write: 3.0 KiB/s
INFO: 100% (251.0 GiB of 251.0 GiB) in 18m 24s, read: 2.6 GiB/s, write: 0 B/s
INFO: backup is sparse: 54.25 GiB (21%) total zero data
INFO: transferred 251.00 GiB in 1104 seconds (232.8 MiB/s)
Then vzdump enters the D (uninterruptible sleep) state, which means the only way to clear the stuck job is to reboot the host:
Code:
root@proxmox2:~# ps -aux | grep vzdump
root 655590 0.0 0.0 6244 648 pts/0 S+ 12:03 0:00 grep vzdump
root 2428713 0.0 0.0 350112 110952 ? Ds 01:30 0:02 task UPID:proxmox2:00250F29:03B03D45:65D6958A:vzdump::root@pam:
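In the meantime, this is roughly how I've been inspecting the stuck task before rebooting (a sketch; the PID is the vzdump one from the ps output above, and the sysrq trigger assumes kernel.sysrq allows it):

```shell
# Show all tasks in uninterruptible sleep ("D"), plus the kernel function
# they are waiting in (wchan); keep the header line for readability
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'

# Dump the in-kernel stack of the stuck vzdump worker (requires root)
cat /proc/2428713/stack

# Ask the kernel to log every blocked task to the ring buffer
echo w > /proc/sysrq-trigger
dmesg | tail -n 60
```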
Since we are not running an up-to-date kernel, will upgrading fix this?
Code:
root@proxmox2:~# uname -a
Linux proxmox2 5.15.102-1-pve #1 SMP PVE 5.15.102-1 (2023-03-14T13:48Z) x86_64 GNU/Linux
Code:
root@proxmox4:~# uname -a
Linux proxmox4 5.15.102-1-pve #1 SMP PVE 5.15.102-1 (2023-03-14T13:48Z) x86_64 GNU/Linux
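If upgrading is the answer, this is roughly how I'd pull a newer kernel onto proxmox2 first (package name assumed for PVE 7.x; happy to be corrected):

```shell
# Refresh package lists and see which pve-kernel packages have updates
apt update
apt list --upgradable 2>/dev/null | grep pve-kernel

# Install the latest kernel in the 5.15 series (PVE 7.x meta-package)
apt install pve-kernel-5.15

# The new kernel only takes effect after a reboot; confirm afterwards with:
uname -r
```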
But the things I don't understand are:
Why does this only show up on proxmox2?
Why is proxmox4 unaffected?
Why does it work on some days and fail on other days?
Is there another solution besides upgrading?
IMHO this is hard to troubleshoot because the configuration is identical on both nodes, yet only one of them runs into this problem every now and then...