Hi, I've been having trouble for the past two weeks. I have a cron job that backs up CephFS directly on a PBS node (I've tried running the job from other nodes with the same result). The node has CephFS mounted by Proxmox at /mnt/pve/ceph-fs, and I run a job like this:
Bash:
PBS_REPOSITORY=root@pam\!ceph-backup@x.y.z.w:backup \
PBS_PASSWORD=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
timeout 12h \
proxmox-backup-client backup cephfs.pxar:/mnt/pve/ceph-fs \
2>&1
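
For completeness, the cron entry that wraps it looks roughly like this (a sketch; the path, schedule and script name are examples, not the real ones):

Bash:
# /etc/cron.d/cephfs-backup (sketch: the script contains the
# command shown above and appends its output to a log file)
30 22 * * * root /usr/local/bin/cephfs-backup.sh >> /var/log/cephfs-backup.log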
The backup job seems to hang: once it stalls there is no log output for hours, and the last lines of the job log look fine... it simply stops logging:

Code:
2024-03-03T22:44:33+01:00: POST /dynamic_chunk
2024-03-03T22:44:33+01:00: POST /dynamic_chunk
2024-03-03T22:44:33+01:00: upload_chunk done: 2878875 bytes, a5d7d707915c4562fb55ae8924b1f12431f09af8ae2edd95f5e0bb1ead55d416
2024-03-03T22:44:33+01:00: upload_chunk done: 3038671 bytes, b5af9ef89aea9c2e3ac7337a318972bf9d2360c17702d0790ac13816e04fb492
2024-03-03T22:44:34+01:00: POST /dynamic_chunk
2024-03-03T22:44:34+01:00: upload_chunk done: 4613948 bytes, da4cd3be4de71ca80544f9365bbcef0659e73681f14b1307beb10d8e8b40ecca
2024-03-03T22:44:34+01:00: POST /dynamic_chunk
2024-03-03T22:44:34+01:00: upload_chunk done: 9188955 bytes, 6418997950e24f4ebd40fb1a7707545d5cc917e2722ff7b0323beebad556605d
2024-03-03T22:44:34+01:00: POST /dynamic_chunk
2024-03-03T22:44:34+01:00: upload_chunk done: 12660942 bytes, bcbfade0a11d51863d88182458f35897601e6057512d341cb6c7d8defc5e252b
2024-03-03T22:44:35+01:00: POST /dynamic_chunk

CPU usage and IO delay also make it clear that something goes wrong at the moment the backup hangs: both drop off sharply.
I've tried several things, but none of them changed the hanging behaviour:
- Cleaned up space in the backup datastore (it was 85% full, now it is 73%).
- The datastore is backed by 3 SSD disks in a ZFS-1 pool; all disks show good S.M.A.R.T. values and the ZFS pool status is OK (roughly the checks sketched after this list).
- The system is fully updated on the no-subscription repos.
- I've rebooted the node several times.
- The PBS node is also a PVE node (it's a home lab), but all VMs are shut down (I tried several times with every VM stopped, to make sure CPU, RAM and disk access were 100% available for the CephFS backup).
- No errors on the Ceph side.
- No particularly big changes in CephFS since the last day the backup worked properly...
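
For reference, these are roughly the health checks I ran (a sketch; /dev/sdX is a placeholder for each datastore disk):

Bash:
# ZFS pool health and any read/write/checksum errors
zpool status -v

# per-disk SMART data (repeat for each SSD in the pool)
smartctl -a /dev/sdX

# overall Ceph cluster health and CephFS state
ceph -s
ceph fs status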
Running strace on the backup client PID shows only this (yes, without the closing parenthesis... the call simply never returns):

Code:
futex(0x7d4367cd20f0, FUTEX_WAIT_PRIVATE, 1, NULL
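
In case it matters, this is roughly how I attach (a sketch; the client is multi-threaded, so a bare futex wait on the main thread alone is expected, and following all threads gives a fuller picture):

Bash:
# attach to the running backup client, follow all threads (-f)
# and print timestamps (-tt); per-thread output shows whether
# the other threads are stuck too or still doing work
strace -f -tt -p "$(pidof proxmox-backup-client)"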
Any recommendations?
Thanks in advance