Hi, I've been having trouble for the past two weeks. I have a cron job that backs up CephFS directly on a PBS node (I've tried running the job from other nodes with the same result). The node has CephFS mounted by Proxmox at /mnt/pve/ceph-fs, and I run a job like this:
Bash:
PBS_REPOSITORY=root@pam\!ceph-backup@x.y.z.w:backup \
PBS_PASSWORD=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
timeout 12h \
proxmox-backup-client backup cephfs.pxar:/mnt/pve/ceph-fs \
2>&1
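
For completeness, the cron entry that wraps it looks roughly like this (a sketch; the path, schedule and script name are examples, not the real ones):

Bash:
# /etc/cron.d/cephfs-backup (sketch: the script contains the
# command shown above and appends its output to a log file)
30 22 * * * root /usr/local/bin/cephfs-backup.sh >> /var/log/cephfs-backup.log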
The backup job seems to hang: once it stalls there is no log output for hours, and the last lines of the job log look fine... it simply stops logging:

Code:
2024-03-03T22:44:33+01:00: POST /dynamic_chunk
2024-03-03T22:44:33+01:00: POST /dynamic_chunk
2024-03-03T22:44:33+01:00: upload_chunk done: 2878875 bytes, a5d7d707915c4562fb55ae8924b1f12431f09af8ae2edd95f5e0bb1ead55d416
2024-03-03T22:44:33+01:00: upload_chunk done: 3038671 bytes, b5af9ef89aea9c2e3ac7337a318972bf9d2360c17702d0790ac13816e04fb492
2024-03-03T22:44:34+01:00: POST /dynamic_chunk
2024-03-03T22:44:34+01:00: upload_chunk done: 4613948 bytes, da4cd3be4de71ca80544f9365bbcef0659e73681f14b1307beb10d8e8b40ecca
2024-03-03T22:44:34+01:00: POST /dynamic_chunk
2024-03-03T22:44:34+01:00: upload_chunk done: 9188955 bytes, 6418997950e24f4ebd40fb1a7707545d5cc917e2722ff7b0323beebad556605d
2024-03-03T22:44:34+01:00: POST /dynamic_chunk
2024-03-03T22:44:34+01:00: upload_chunk done: 12660942 bytes, bcbfade0a11d51863d88182458f35897601e6057512d341cb6c7d8defc5e252b
2024-03-03T22:44:35+01:00: POST /dynamic_chunk

CPU usage and IO delay also make it clear that something goes wrong at the moment the backup hangs: both drop off sharply.
I've tried several things, but none of them changed the hanging behaviour:
- Cleaned up space in the backup datastore (it was 85% full, now it is 73%).
- The datastore is backed by 3 SSD disks in a ZFS-1 pool; all disks show good S.M.A.R.T. values and the ZFS pool status is OK (roughly the checks sketched after this list).
- The system is fully updated on the no-subscription repos.
- I've rebooted the node several times.
- The PBS node is also a PVE node (it's a home lab), but all VMs are shut down (I tried several times with every VM stopped, to make sure CPU, RAM and disk access were 100% available for the CephFS backup).
- No errors on the Ceph side.
- No particularly big changes in CephFS since the last day the backup worked properly...
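
For reference, these are roughly the health checks I ran (a sketch; /dev/sdX is a placeholder for each datastore disk):

Bash:
# ZFS pool health and any read/write/checksum errors
zpool status -v

# per-disk SMART data (repeat for each SSD in the pool)
smartctl -a /dev/sdX

# overall Ceph cluster health and CephFS state
ceph -s
ceph fs status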
Running strace on the backup client PID shows only this (yes, without the closing parenthesis... the call simply never returns):

Code:
futex(0x7d4367cd20f0, FUTEX_WAIT_PRIVATE, 1, NULL
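
In case it matters, this is roughly how I attach (a sketch; the client is multi-threaded, so a bare futex wait on the main thread alone is expected, and following all threads gives a fuller picture):

Bash:
# attach to the running backup client, follow all threads (-f)
# and print timestamps (-tt); per-thread output shows whether
# the other threads are stuck too or still doing work
strace -f -tt -p "$(pidof proxmox-backup-client)"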
Any recommendations?
Thanks in advance