Intermittent backup problems, need some hints on how to debug

Ovidiu

hi there,

1 Proxmox server and 1 PBS with different providers. Both servers are up to date.
Backups worked fine through Tailscale (a WireGuard-based VPN), then went from taking about 10 minutes on average to 5-8 hours with similar backup sizes.
I switched to connecting to the PBS via its public IP instead - same problems. Eventually backups worked fine again, both via public IP and via the VPN.
A few weeks later they stopped working completely, either way.

I tried this from PX to PBS:

proxmox-backup-client benchmark --repository root@pam@PBS-public-IP:datastore-name

That command just freezes - I aborted after 15 minutes. The task shows up on the PBS side, but nothing happens.
I tried a few times consecutively and eventually it worked and gave these numbers:

Code:
Uploaded 10 chunks in 14 seconds.
Time per request: 1432043 microseconds.
TLS speed: 2.93 MB/s
SHA256 speed: 1579.20 MB/s
Compression speed: 590.72 MB/s
Decompress speed: 1048.82 MB/s
AES256/GCM speed: 2739.27 MB/s
Verify speed: 613.92 MB/s
┌───────────────────────────────────┬────────────────────┐
│ Name                              │ Value              │
╞═══════════════════════════════════╪════════════════════╡
│ TLS (maximal backup upload speed) │ 2.93 MB/s (0%)     │
├───────────────────────────────────┼────────────────────┤
│ SHA256 checksum computation speed │ 1579.20 MB/s (78%) │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 compression speed    │ 590.72 MB/s (79%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 decompression speed  │ 1048.82 MB/s (88%) │
├───────────────────────────────────┼────────────────────┤
│ Chunk verification speed          │ 613.92 MB/s (81%)  │
├───────────────────────────────────┼────────────────────┤
│ AES256 GCM encryption speed       │ 2739.27 MB/s (75%) │
└───────────────────────────────────┴────────────────────┘
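(Side note: the TLS figure matches the per-request time above if each request uploads one 4 MiB chunk, which seems to be what the benchmark does. A rough check, assuming 4194304-byte chunks:)

Code:
# 4 MiB per request / ~1.43 s per request ≈ 2.9 MB/s
echo "scale=2; 4194304 / 1.432043 / 1000000" | bc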

Trying a few more times, the benchmark simply freezes again, with nothing happening on the PBS, i.e. no CPU/disk/network load.

Code:
apt-cache policy proxmox-backup-client
proxmox-backup-client:
  Installed: 1.1.6-2
  Candidate: 1.1.6-2

I really need a hint on where to look. Network problems? Origin or target? Proxmox problems? PBS problems? I'm a bit lost right now :-(
 
Aside from it being just *very* slow, you could check what the process is doing.

When the benchmark hangs, run:
Code:
ps ax | grep proxmox-backup-client

This gives you the PID in the first column.
With that PID, run:
Code:
grep -r '' /proc/<PID>/task/*/stack

Edit: the output of the second command might not help though... You could also run it with 'strace', like 'strace -f -yy proxmox-backup-client ...'
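Not part of the original suggestion, but the same two steps could be combined into something like this (assuming only one proxmox-backup-client process is running):

Code:
# grab the PID of the hanging benchmark and dump all of its thread stacks
PID=$(pgrep -f 'proxmox-backup-client benchmark')
cat /proc/$PID/task/*/stack
# the wchan column also shows what each thread is currently blocked on
ps -L -o pid,tid,stat,wchan:32,comm -p $PID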
 
Thanks. Does this help?

Code:
ps ax | grep proxmox-backup-client
30426 pts/2    Sl+    0:00 proxmox-backup-client benchmark --repository root@pam@PBS-public-IP:datastore
30816 pts/3    S+     0:00 grep proxmox-backup-client


grep -r '' /proc/30426/task/*/stack
/proc/30426/task/30426/stack:[<0>] futex_wait_queue_me+0xb9/0x120
/proc/30426/task/30426/stack:[<0>] futex_wait+0x151/0x270
/proc/30426/task/30426/stack:[<0>] do_futex+0x32c/0xb80
/proc/30426/task/30426/stack:[<0>] __x64_sys_futex+0x143/0x180
/proc/30426/task/30426/stack:[<0>] do_syscall_64+0x57/0x190
/proc/30426/task/30426/stack:[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
/proc/30426/task/30431/stack:[<0>] futex_wait_queue_me+0xb9/0x120
/proc/30426/task/30431/stack:[<0>] futex_wait+0x151/0x270
/proc/30426/task/30431/stack:[<0>] do_futex+0x32c/0xb80
/proc/30426/task/30431/stack:[<0>] __x64_sys_futex+0x143/0x180
/proc/30426/task/30431/stack:[<0>] do_syscall_64+0x57/0x190
/proc/30426/task/30431/stack:[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
/proc/30426/task/30432/stack:[<0>] ep_poll+0x40a/0x430
/proc/30426/task/30432/stack:[<0>] do_epoll_wait+0xb7/0xd0
/proc/30426/task/30432/stack:[<0>] __x64_sys_epoll_wait+0x1e/0x30
/proc/30426/task/30432/stack:[<0>] do_syscall_64+0x57/0x190
/proc/30426/task/30432/stack:[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
/proc/30426/task/30433/stack:[<0>] futex_wait_queue_me+0xb9/0x120
/proc/30426/task/30433/stack:[<0>] futex_wait+0x151/0x270
/proc/30426/task/30433/stack:[<0>] do_futex+0x32c/0xb80
/proc/30426/task/30433/stack:[<0>] __x64_sys_futex+0x143/0x180
/proc/30426/task/30433/stack:[<0>] do_syscall_64+0x57/0x190
/proc/30426/task/30433/stack:[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
/proc/30426/task/30434/stack:[<0>] futex_wait_queue_me+0xb9/0x120
/proc/30426/task/30434/stack:[<0>] futex_wait+0x151/0x270
/proc/30426/task/30434/stack:[<0>] do_futex+0x32c/0xb80
/proc/30426/task/30434/stack:[<0>] __x64_sys_futex+0x143/0x180
/proc/30426/task/30434/stack:[<0>] do_syscall_64+0x57/0x190
/proc/30426/task/30434/stack:[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
 
Edit: the output of the second command might not help though... You could also run it with 'strace', like 'strace -f -yy proxmox-backup-client ...'
Shall I pipe it to a log file and attach it here, or what am I looking for?

Something like this (strace writes its trace to stderr, so redirect that):
Code:
strace -f -yy proxmox-backup-client benchmark --repository root@pam@PBS-public-IP:datastore 2> logfile.log
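If the full trace gets too noisy, a variant that only logs network-related syscalls to a file might be easier to read ('-e trace=network' and '-o' are standard strace options):

Code:
strace -f -yy -e trace=network -o benchmark.strace proxmox-backup-client benchmark --repository root@pam@PBS-public-IP:datastore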

Btw, I've just had 5-10 benchmarks work consecutively. I have no idea what is going on, since all I did so far was execute the commands mentioned in this thread :-(
 
Oh, I have news. The latest frozen benchmark just produced an error:

Code:
proxmox-backup-client benchmark --repository root@pam@PBS-public-IP:datastore


Message from syslogd@proxmox-host at May 26 16:33:46 ...
 kernel:[112192.699946] watchdog: BUG: soft lockup - CPU#1 stuck for 25s! [kvm:14958]

Btw, Proxmox is installed on a VPS with nested virtualization enabled, in case it matters. This is just a test case currently, not a mission-critical production server: I needed a few small VMs and the VPS provider did not have any "fitting" options, so I took a bigger one and created 3 small VMs with Proxmox. There is no load; there are basically 3 idle Ubuntu-based VMs running on this Proxmox instance.
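In case the nested-virtualization angle matters, this is roughly how it can be double-checked from inside the Proxmox VPS (the path differs between Intel and AMD hosts):

Code:
lscpu | grep -i virtualization                 # should show VT-x or AMD-V inside the VPS
cat /sys/module/kvm_intel/parameters/nested    # Intel: Y/1 if nested KVM is enabled
cat /sys/module/kvm_amd/parameters/nested      # AMD equivalent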
 
Mhmm, ok. From the strace it looks like the connection works, but very slowly (there is a 'Resource temporarily unavailable' error every few KB of data sent). Maybe the connection/server is simply overloaded? A broken cable/disk?

Is there anything in the syslog/journal on the client and/or server side?
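For example, something along these lines on both machines, adjusted to the time of a hanging benchmark:

Code:
# client (PVE) side: kernel messages (soft lockups, NIC errors) and the general journal
journalctl -k --since "1 hour ago"
journalctl --since "1 hour ago"
# server (PBS) side: the backup proxy's own log
journalctl -u proxmox-backup-proxy --since "1 hour ago"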
 
Talking about resources, there is almost no noticeable usage. I am monitoring both servers' stats with netdata and I see no peak at all during the benchmarks. They are idle.

I checked the logs and didn't find much on either side.

On the target (PBS) I see a few failed benchmarks which say "broken pipe":

Code:
May 26 14:59:20 ict proxmox-backup-proxy[997]: starting new benchmark on datastore 'hyper': "host/benchmark/2021-05-26T12:59:20Z"
May 26 14:59:29 ict proxmox-backup-proxy[997]: UPLOAD END 4194304 bytes
May 26 14:59:30 ict proxmox-backup-proxy[997]: UPLOAD END 4194304 bytes
May 26 14:59:31 ict proxmox-backup-proxy[997]: UPLOAD END 4194304 bytes
May 26 14:59:32 ict proxmox-backup-proxy[997]: UPLOAD END 4194304 bytes
May 26 14:59:33 ict proxmox-backup-proxy[997]: UPLOAD END 4194304 bytes
May 26 14:59:33 ict proxmox-backup-proxy[997]: UPLOAD END 4194304 bytes
May 26 14:59:34 ict proxmox-backup-proxy[997]: UPLOAD END 4194304 bytes
May 26 14:59:34 ict proxmox-backup-proxy[997]: UPLOAD END 4194304 bytes
May 26 14:59:35 ict proxmox-backup-proxy[997]: UPLOAD END 4194304 bytes
May 26 14:59:35 ict proxmox-backup-proxy[997]: benchmark finished successfully
May 26 14:59:35 ict proxmox-backup-proxy[997]: removing backup snapshot "/backup/pbs/hyper/host/benchmark/2021-05-26T12:59:20Z"
May 26 14:59:35 ict proxmox-backup-proxy[997]: TASK OK
May 26 14:59:57 ict proxmox-backup-proxy[997]: starting new benchmark on datastore 'hyper': "host/benchmark/2021-05-26T12:59:56Z"
May 26 15:03:14 ict proxmox-backup-proxy[997]: UPLOAD END 4194304 bytes
May 26 15:06:24 ict proxmox-backup-proxy[997]: UPLOAD END 4194304 bytes
May 26 15:07:28 ict proxmox-backup-proxy[997]: benchmark finished successfully
May 26 15:07:28 ict proxmox-backup-proxy[997]: TASK OK
May 26 15:07:28 ict proxmox-backup-proxy[997]: Upload error: error reading a body from connection: broken pipe
May 26 15:07:28 ict proxmox-backup-proxy[997]: removing backup snapshot "/backup/pbs/hyper/host/benchmark/2021-05-26T12:59:56Z"
May 26 15:07:28 ict proxmox-backup-proxy[997]: starting new benchmark on datastore 'hyper': "host/benchmark/2021-05-26T13:07:27Z"
May 26 15:08:45 ict proxmox-backup-proxy[997]: benchmark finished successfully
May 26 15:08:45 ict proxmox-backup-proxy[997]: Upload error: error reading a body from connection: broken pipe
May 26 15:08:45 ict proxmox-backup-proxy[997]: removing backup snapshot "/backup/pbs/hyper/host/benchmark/2021-05-26T13:07:27Z"
May 26 15:08:45 ict proxmox-backup-proxy[997]: TASK OK
May 26 15:17:01 ict CRON[13604]: pam_unix(cron:session): session opened for user root by (uid=0)
May 26 15:17:01 ict CRON[13605]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 26 15:17:01 ict CRON[13604]: pam_unix(cron:session): session closed for user root
May 26 15:41:06 ict proxmox-backup-proxy[997]: starting new benchmark on datastore 'hyper': "host/benchmark/2021-05-26T13:41:06Z"
May 26 15:41:14 ict proxmox-backup-proxy[997]: UPLOAD END 4194304 bytes
May 26 15:41:16 ict proxmox-backup-proxy[997]: UPLOAD END 4194304 bytes
May 26 15:41:17 ict proxmox-backup-proxy[997]: UPLOAD END 4194304 bytes
May 26 15:41:18 ict proxmox-backup-proxy[997]: UPLOAD END 4194304 bytes
May 26 15:41:19 ict proxmox-backup-proxy[997]: UPLOAD END 4194304 bytes
May 26 15:41:19 ict proxmox-backup-proxy[997]: UPLOAD END 4194304 bytes
May 26 15:41:20 ict proxmox-backup-proxy[997]: UPLOAD END 4194304 bytes
May 26 15:41:20 ict proxmox-backup-proxy[997]: UPLOAD END 4194304 bytes
May 26 15:41:21 ict proxmox-backup-proxy[997]: UPLOAD END 4194304 bytes
May 26 15:41:21 ict proxmox-backup-proxy[997]: benchmark finished successfully
May 26 15:41:21 ict proxmox-backup-proxy[997]: removing backup snapshot "/backup/pbs/hyper/host/benchmark/2021-05-26T13:41:06Z"
May 26 15:41:21 ict proxmox-backup-proxy[997]: TASK OK
May 26 15:41:30 ict proxmox-backup-proxy[997]: starting new benchmark on datastore 'hyper': "host/benchmark/2021-05-26T13:41:30Z"
May 26 15:41:44 ict proxmox-backup-proxy[997]: benchmark finished successfully
May 26 15:41:44 ict proxmox-backup-proxy[997]: Upload error: error reading a body from connection: broken pipe
May 26 15:41:44 ict proxmox-backup-proxy[997]: removing backup snapshot "/backup/pbs/hyper/host/benchmark/2021-05-26T13:41:30Z"
May 26 15:41:44 ict proxmox-backup-proxy[997]: TASK OK

On the source (PX), the only noteworthy entry was the watchdog kernel error I mentioned earlier, but that only occurred once, so it seems unlikely to be related to the frequently failing benchmarks.

I've already run a few mtr tests and see no packet loss. Proxmox hardly uses any resources; like I said, it's an idle test system.
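For reference, runs along these lines (report mode over a few hundred probes, plus a TCP variant against the PBS port 8007, since plain ICMP can be treated differently along the path) should surface any intermittent loss:

Code:
mtr -rwz -c 300 PBS-public-IP
mtr -rwz -c 300 -T -P 8007 PBS-public-IP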

Any other pointers on where to look for the cause of the problem? Since the benchmark works fine from another PX server, my guess is that the problem lies with the server where I am running the benchmarks and not with the PBS system.
 
Any other pointers on where to look for the cause of the problem?
For now it looks like a network issue... What does the network between client and server look like? Anything unusual in between? Maybe something that could interfere with TLS/HTTP2 traffic?
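A quick way to sanity-check the TLS side specifically, as a minimal sketch: this only exercises the handshake and a plain HTTPS request to the PBS web interface on its default port 8007, not the HTTP/2 backup protocol itself:

Code:
openssl s_client -connect PBS-public-IP:8007 </dev/null
curl -vk https://PBS-public-IP:8007/ -o /dev/null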
 
For now it looks like a network issue... What does the network between client and server look like? Anything unusual in between? Maybe something that could interfere with TLS/HTTP2 traffic?
Nothing special. PX is on a VPS from a German provider, PBS is on a KVM VPS from a US provider.
Tried mtr - no issues found.
Tried iperf3 between the two: TCP is fine, but there are plenty of lost packets when using UDP, which is why I switched back from the WireGuard VPN to simply connecting to the public IP of the PBS.
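The iperf3 tests were along these lines (exact rates and durations may have differed):

Code:
# on the PBS side
iperf3 -s
# on the PX side: TCP throughput, then UDP with packet-loss statistics
iperf3 -c PBS-public-IP -t 30
iperf3 -u -c PBS-public-IP -b 50M -t 30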

:-( Ok, thanks for taking the time to look into it. I guess I'll have to do a few more tests from different servers and see whether I can figure out if the problem lies with the PX or PBS servers or their providers.