Hi,
We are trying the new S3 support for syncing a local datastore to an S3 backend (a Ceph cluster behind RADOS Gateway). The problem is that the S3 endpoint sometimes responds with HTTP 503 "SlowDown", and the PBS S3 client immediately retries that chunk. But when the S3 endpoint responds with multiple SlowDowns in a row, the PBS S3 client does not back off and retry later; it simply gives up on that VM snapshot and continues with the next one.
I tried the new rate-limiting functions (the ones in the web GUI), but they do not solve the issue, because the problem is not traffic volume in MiB. I also tried "put-rate-limit" in the config file. At 1 request per second it works without any 503 SlowDown errors; 3 requests per second also seems to work, but 5 and above cause SlowDown responses, and if the VM snapshot is big (1 TB in our case) the S3 client cannot finish uploading all chunks.
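As I understand it, a fixed "put-rate-limit" behaves roughly like a token bucket: each PUT has to wait for a token that refills at the configured rate. This is just my own sketch to illustrate why a hard cap is so slow, not the actual PBS implementation:

```python
import time

class TokenBucket:
    """Minimal token bucket: allow at most `rate` PUT requests per second."""

    def __init__(self, rate, burst=1):
        self.rate = rate          # tokens added per second
        self.capacity = burst     # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def acquire(self):
        """Block until one token is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for the missing fraction of a token.
            time.sleep((1 - self.tokens) / self.rate)
```

Assuming roughly 4 MiB chunks, a 1 TB snapshot is on the order of 260,000 PUTs, so at 1 request per second the upload alone takes about three days. That is why hard-capping the PUT rate is not workable for us.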
Is there any way to improve this? Hard-limiting PUT requests per second is not an option because it is far too slow. Does the PBS S3 client have some kind of back-off algorithm?
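For clarity, the behaviour I would hope for on 503 SlowDown is exponential back-off with jitter rather than an immediate retry followed by giving up. A minimal sketch (my own illustration; `put_chunk` is a hypothetical upload callable, not a PBS API):

```python
import random
import time

def upload_with_backoff(put_chunk, max_retries=8, base_delay=0.5, cap=60.0):
    """Retry a chunk upload on HTTP 503 with exponential back-off and jitter.

    `put_chunk` is a hypothetical callable returning an HTTP status code.
    """
    for attempt in range(max_retries):
        status = put_chunk()
        if status != 503:
            return status
        # "Full jitter": sleep a random amount up to an exponentially
        # growing ceiling, bounded by `cap`, so retries spread out instead
        # of hammering the endpoint in lockstep.
        delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError("giving up after repeated 503 SlowDown responses")
```

With something like this, the client could absorb short SlowDown bursts from RGW instead of failing the whole snapshot after the first few 503s.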
Regards,
Rahman
Code:
2025-11-28T08:35:59+03:00: Starting datastore sync job '-:default:ulaknetfkm:default:s-4d4c0e22-da65'
2025-11-28T08:35:59+03:00: sync datastore 'ulaknetfkm' from 'default'
2025-11-28T08:35:59+03:00: ----
2025-11-28T08:35:59+03:00: Syncing datastore 'default', namespace 'default' into datastore 'ulaknetfkm', namespace 'default'
2025-11-28T08:35:59+03:00: found 127 groups to sync (out of 127 total)
2025-11-28T08:35:59+03:00: skipped: 1 snapshot(s) (2024-05-04T05:00:08Z) - older than the newest snapshot present on sync target
2025-11-28T08:35:59+03:00: re-sync snapshot vm/103/2024-06-01T05:00:05Z
2025-11-28T08:35:59+03:00: no data changes
2025-11-28T08:35:59+03:00: percentage done: 0.79% (1/127 groups)
2025-11-28T08:35:59+03:00: re-sync snapshot vm/106/2024-01-06T05:00:17Z
2025-11-28T08:35:59+03:00: no data changes
2025-11-28T08:35:59+03:00: percentage done: 1.57% (2/127 groups)
2025-11-28T08:35:59+03:00: skipped: 1 snapshot(s) (2025-10-04T05:00:01Z) - older than the newest snapshot present on sync target
2025-11-28T08:35:59+03:00: re-sync snapshot vm/107/2025-11-01T04:00:04Z
2025-11-28T08:35:59+03:00: no data changes
2025-11-28T08:35:59+03:00: percentage done: 2.36% (3/127 groups)
2025-11-28T08:35:59+03:00: skipped: 1 snapshot(s) (2025-11-16T04:00:04Z) - older than the newest snapshot present on sync target
2025-11-28T08:35:59+03:00: re-sync snapshot vm/109/2025-11-23T04:00:10Z
2025-11-28T08:35:59+03:00: no data changes
2025-11-28T08:35:59+03:00: percentage done: 3.15% (4/127 groups)
2025-11-28T08:35:59+03:00: skipped: 1 snapshot(s) (2025-10-04T05:00:46Z) - older than the newest snapshot present on sync target
2025-11-28T08:35:59+03:00: re-sync snapshot vm/110/2025-11-01T04:00:52Z
2025-11-28T08:35:59+03:00: no data changes
2025-11-28T08:35:59+03:00: percentage done: 3.94% (5/127 groups)
2025-11-28T08:36:00+03:00: skipped: 1 snapshot(s) (2025-10-04T05:00:03Z) - older than the newest snapshot present on sync target
2025-11-28T08:36:00+03:00: re-sync snapshot vm/111/2025-11-01T04:00:01Z
2025-11-28T08:36:00+03:00: no data changes
2025-11-28T08:36:00+03:00: percentage done: 4.72% (6/127 groups)
2025-11-28T08:36:00+03:00: skipped: 1 snapshot(s) (2025-10-04T05:01:03Z) - older than the newest snapshot present on sync target
2025-11-28T08:36:00+03:00: re-sync snapshot vm/112/2025-11-01T04:01:00Z
2025-11-28T08:36:00+03:00: no data changes
2025-11-28T08:36:00+03:00: percentage done: 5.51% (7/127 groups)
2025-11-28T08:36:00+03:00: re-sync snapshot vm/113/2025-10-04T05:01:47Z
2025-11-28T08:36:00+03:00: sync archive drive-scsi0.img.fidx
2025-11-28T09:38:05+03:00: percentage done: 5.91% (7/127 groups, 1/2 snapshots in group #8)
2025-11-28T09:38:05+03:00: sync group vm/113 failed - failed to upload chunk to s3 backend - upload failed: unexpected status code 503 Service Unavailable
2025-11-28T09:38:05+03:00: skipped: 1 snapshot(s) (2025-10-04T05:07:25Z) - older than the newest snapshot present on sync target
2025-11-28T09:38:06+03:00: re-sync snapshot vm/114/2025-11-01T04:07:33Z
2025-11-28T09:38:06+03:00: no data changes
2025-11-28T09:38:06+03:00: percentage done: 7.09% (9/127 groups)
2025-11-28T09:38:06+03:00: skipped: 1 snapshot(s) (2025-11-16T04:00:56Z) - older than the newest snapshot present on sync target
2025-11-28T09:38:06+03:00: re-sync snapshot vm/115/2025-11-23T04:01:30Z
2025-11-28T09:38:06+03:00: no data changes