Proxmox VM process hung tasks during AWS S3 datastore backups

dek
Hello all,

I have noticed occasional issues with Ubuntu VMs running on Proxmox VE 8.4.14 servers. The VMs lag and report operational and I/O issues with disk/filesystem access, resulting in CPU hung tasks during backups to a Proxmox Backup Server 4.1.1 using S3 datastores.

On the VM I see the following errors:

[6635504.997115] INFO: task kworker/u4:0:2245157 blocked for more than 120 seconds.
[6635504.998911] Tainted: G W 5.15.0-161-generic #171-Ubuntu
[6635505.000654] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[6635505.002548] task:kworker/u4:0 state:D stack: 0 pid:2245157 ppid: 2 flags:0x00004000
[6635505.002552] Workqueue: writeback wb_workfn (flush-253:2)
[6635505.002556] Call Trace:
[6635505.002557] <TASK>
[6635505.002558] __schedule+0x24e/0x590
[6635505.002561] schedule+0x69/0x110
[6635505.002562] io_schedule+0x46/0x80
[6635505.002564] ? wbt_cleanup_cb+0x20/0x20
[6635505.002566] rq_qos_wait+0xd0/0x170
[6635505.002568] ? wbt_rqw_done+0x110/0x110
[6635505.002570] ? sysv68_partition+0x280/0x280
[6635505.002572] ? wbt_cleanup_cb+0x20/0x20
[6635505.002574] wbt_wait+0x9f/0xf0
[6635505.002576] __rq_qos_throttle+0x28/0x40
[6635505.002578] blk_mq_submit_bio+0x127/0x610
[6635505.002581] __submit_bio+0x1ee/0x220
[6635505.002584] __submit_bio_noacct+0x85/0x200
[6635505.002586] submit_bio_noacct+0x4e/0x120
[6635505.002588] ? unlock_page_memcg+0x46/0x80
[6635505.002592] ? __test_set_page_writeback+0x75/0x2d0
[6635505.002595] submit_bio+0x4a/0x130
[6635505.002607] iomap_submit_ioend+0x53/0x90
[6635505.002609] iomap_writepage_map+0x1fa/0x370
[6635505.002611] iomap_do_writepage+0x6e/0x110
[6635505.002613] write_cache_pages+0x1a6/0x460
[6635505.002615] ? iomap_writepage_map+0x370/0x370
[6635505.002618] iomap_writepages+0x21/0x40
[6635505.002619] xfs_vm_writepages+0x84/0xc0 [xfs]
[6635505.002679] do_writepages+0xd7/0x200
[6635505.002682] ? check_preempt_curr+0x61/0x70
[6635505.002685] ? ttwu_do_wakeup+0x1c/0x170
[6635505.002687] __writeback_single_inode+0x44/0x290
[6635505.002690] writeback_sb_inodes+0x22a/0x500
[6635505.002692] __writeback_inodes_wb+0x56/0xf0
[6635505.002695] wb_writeback+0x1cc/0x290
[6635505.002697] wb_do_writeback+0x1a0/0x280
[6635505.002699] wb_workfn+0x77/0x260
[6635505.002701] ? psi_task_switch+0xc6/0x220
[6635505.002703] ? raw_spin_rq_unlock+0x10/0x30
[6635505.002705] ? finish_task_switch.isra.0+0x7e/0x280
[6635505.002708] process_one_work+0x22b/0x3d0
[6635505.002710] worker_thread+0x53/0x420
[6635505.002711] ? process_one_work+0x3d0/0x3d0
[6635505.002712] kthread+0x12a/0x150
[6635505.002714] ? set_kthread_struct+0x50/0x50
[6635505.002717] ret_from_fork+0x22/0x30
[6635505.002720] </TASK>
And dmesg shows disk access issues:
[6635676.000802] sd 2:0:0:0: [sda] tag#227 timing out command, waited 180s
[6635676.063181] sd 2:0:0:0: [sda] tag#227 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=399s
[6635676.063185] sd 2:0:0:0: [sda] tag#227 Sense Key : Aborted Command [current]
[6635676.063187] sd 2:0:0:0: [sda] tag#227 Add. Sense: I/O process terminated
[6635676.063196] sd 2:0:0:0: [sda] tag#227 CDB: Write(10) 2a 00 02 21 99 b8 00 00 08 00
[6635676.063197] blk_update_request: I/O error, dev sda, sector 35756472 op 0x1:(WRITE) flags 0x100000 phys_seg 1 prio class 0
[6635676.066252] dm-2: writeback error on inode 166, offset 36864, sector 98744
[6635676.066262] sd 2:0:0:0: [sda] tag#233 timing out command, waited 180s
[6635676.069684] sd 2:0:0:0: [sda] tag#233 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=399s
[6635676.069685] sd 2:0:0:0: [sda] tag#233 Sense Key : Aborted Command [current]
[6635676.069687] sd 2:0:0:0: [sda] tag#233 Add. Sense: I/O process terminated
[6635676.069688] sd 2:0:0:0: [sda] tag#233 CDB: Write(10) 2a 00 02 21 a3 f8 00 00 08 00
[6635676.069689] blk_update_request: I/O error, dev sda, sector 35759096 op 0x1:(WRITE) flags 0x100000 phys_seg 1 prio class 0
[6635676.072322] dm-2: writeback error on inode 166, offset 1380352, sector 101368
[6635676.072326] sd 2:0:0:0: [sda] tag#234 timing out command, waited 180s
[6635676.077039] sd 2:0:0:0: [sda] tag#234 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=399s
[6635676.077041] sd 2:0:0:0: [sda] tag#234 Sense Key : Aborted Command [current]
[6635676.077042] sd 2:0:0:0: [sda] tag#234 Add. Sense: I/O process terminated
[6635676.077044] sd 2:0:0:0: [sda] tag#234 CDB: Write(10) 2a 00 02 21 b0 88 00 00 08 00



In the Proxmox VE backup task window I have to interrupt the job, as you can see below, because the backup gets stuck at 3% and does not progress for some time:


INFO: starting new backup job: vzdump 163 --notification-mode auto --remove 0 --notes-template '{{guestname}}' --mode snapshot --storage s3-store2 --mailto ops@openanswers.co.uk --node hlvbp011
INFO: Starting Backup of VM 163 (qemu)
INFO: Backup started at 2026-01-20 14:02:43
INFO: status = running
INFO: VM Name: TestVM
INFO: include disk 'scsi0' 'VolGroup01:vm-163-disk-0' 40G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/163/2026-01-20T14:02:43Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task 'cc5ced35-2950-4d0d-b952-c8513ce512c8'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: 0% (172.0 MiB of 40.0 GiB) in 3s, read: 57.3 MiB/s, write: 57.3 MiB/s
INFO: 1% (412.0 MiB of 40.0 GiB) in 7s, read: 60.0 MiB/s, write: 60.0 MiB/s
INFO: 2% (824.0 MiB of 40.0 GiB) in 1m 10s, read: 6.5 MiB/s, write: 4.0 MiB/s
INFO: 3% (1.3 GiB of 40.0 GiB) in 1m 48s, read: 12.1 MiB/s, write: 5.9 MiB/s
ERROR: interrupted by signal
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 163 failed - interrupted by signal
INFO: Failed at 2026-01-20 14:09:04
ERROR: Backup job failed - interrupted by signal



The PBS task would be stuck at a 'Caching of chunk' message.

My questions are:

When we see the 'INFO: resuming VM again' message, does that mean a lock on a disk resource was held, or does this relate to the VM lock file in /var/lock/qemu-server?

Are there known issues with Proxmox VE backup processes holding access/locks on disk resources when backups stall? Even the Proxmox VE server itself shows CPU hung task and disk I/O messages (relating to the VM disk image).

I am used to seeing the 'INFO: resuming VM again' message before a backup is started, but is the VM hang related to seeing this message again on failed or interrupted backups?

If a backup is successful we never see the 'resuming VM again' message at the end of the backup task log, for example:

INFO: 99% (40.0 GiB of 40.0 GiB) in 2m 51s, read: 230.7 MiB/s, write: 0 B/s
INFO: 100% (40.0 GiB of 40.0 GiB) in 2m 52s, read: 33.4 MiB/s, write: 8.0 KiB/s
INFO: backup is sparse: 22.28 GiB (55%) total zero data
INFO: transferred 40.00 GiB in 172 seconds (238.1 MiB/s)
INFO: archive file size: 8.66GB
INFO: adding notes to backup
INFO: prune older backups with retention: keep-last=1
INFO: removing backup 'backup:backup/vzdump-qemu-166-2026_01_24-01_00_02.vma.zst'
INFO: pruned 1 backup(s) not covered by keep-retention policy
INFO: Finished Backup of VM 166 (00:02:54)
INFO: Backup finished at 2026-01-31 01:03:02
INFO: Backup job finished successfully
INFO: notified via target `<ops@openanswers.co.uk>`
TASK OK

Is the S3 datastore still under development (technical preview)? Could this have the knock-on effect of stalling a Proxmox VE VM?

Any info, help or hints would be greatly appreciated.
Regards, Dek
 
Hi Chris,

Many thanks for your reply.

When you ask me to share the exact error messages: there aren't any. The backup just stops and I have to stop the backup process because the VM is complaining about I/O, but I will provide some more context about the S3 access.

I suspect that the access issue could be one of our two Internet Service Provider links not being able to reach S3. What I see is that outbound traffic to an S3 host IP stops, PBS then appears to request a different S3 IP address, and the backup can sometimes resume again. Once the S3 connection is re-established I can sometimes see the error messages below in the PBS task. Looking at the output, there is a gap of about 15 minutes before the traffic and messages start again:

2026-01-21T16:54:56+00:00: Upload new chunk d357544a09793ebf1bfe410fbca310b4a6a48f5f287d4d5d8e5a35a803d34a56
2026-01-21T16:54:56+00:00: Upload new chunk f766df2a87cee51350dd285a1599ddd8dcf630335f2053386a1227475f9d1ad0
2026-01-21T16:54:56+00:00: Caching of chunk f766df2a87cee51350dd285a1599ddd8dcf630335f2053386a1227475f9d1ad0
2026-01-21T16:54:56+00:00: Caching of chunk fb6d3c80e273ae675dcfcdc8fe9c892c5a2fdfb380b075bf1032a35f26617cd5
2026-01-21T16:54:56+00:00: Caching of chunk 69d58581bcca03f8834f6d621eda620bd811337385afb85d784a44a1c2b8150b
2026-01-21T16:54:56+00:00: Caching of chunk 3a9470fa8c2cc3f8fa94dad7e148666f9a5ed434923a0bcf9837f22666145a90
2026-01-21T16:54:56+00:00: Caching of chunk 4bdd5a7d0d296604d2378290d1e5dcc5e65b28eb2deacfdebbdd50ea9bdab0cd
2026-01-21T16:54:56+00:00: Caching of chunk c49e94471408bc4645b51eed06a925e0f32548a3bfbf9e984c5e4d265d55fd63
2026-01-21T16:54:56+00:00: Caching of chunk 1473241ce351a430184394bc1ece910437a4b143ef6caff03a915e2cd6cddd3c
2026-01-21T16:54:56+00:00: Caching of chunk 2bca83629747e9fe8776fbcd222de2ca3d42a89c06d5d74b926c6078d1ea5410
2026-01-21T16:54:57+00:00: Caching of chunk 646f55f87235c6670a7b453b18a64412c1db472752dea187d7758f18bcec2209
2026-01-21T16:54:57+00:00: Caching of chunk a1aba463822ff8d525e1f8aa78828c54c4c4b028b5187259f2407668baf786ba
2026-01-21T16:54:57+00:00: Caching of chunk 79f334b483b02b814015fcfc3119d9b6b7f8aed8ea8c155e43aefe5be9b8fea1
2026-01-21T16:54:57+00:00: Caching of chunk 633b475c07865e358ebde027ff3da0d5ec470b945ffaee76fe528efb35bd297b
2026-01-21T16:54:57+00:00: Caching of chunk e85a075d67a5ee3ba2f5eb20fe5501986a962667cde037438f093dec0e690916
2026-01-21T16:54:57+00:00: Caching of chunk d357544a09793ebf1bfe410fbca310b4a6a48f5f287d4d5d8e5a35a803d34a56
2026-01-21T16:54:57+00:00: Caching of chunk dfaf47ae6505b23c6b86b6cc1e5d113dca67c140418ea50dcb5947ee9fe0a301
2026-01-21T16:54:57+00:00: Caching of chunk 42fe8d573d5a7c870d4f8ac29171142a36a341d1d3b93fd9e13ad4a6efa0b7ff
2026-01-21T16:54:57+00:00: Caching of chunk d474ca5f4a28bde32902cb64620b8459fe71ed8a2ec76e376b00cd579bf04fac
2026-01-21T16:54:57+00:00: Caching of chunk e6c3519bbc2a00a89f66a49809e66790932f0bfbb6c9424b27f0c3339bc21faa
2026-01-21T16:54:57+00:00: Caching of chunk 0f96007e14311276dadece58ea51f555237f7c3c1cdf6bb9761c24f9d3b5a608
2026-01-21T16:54:59+00:00: Caching of chunk 34541326e60775cb3e30301466f2a0e282b60b176ebb76c5dd8195df6b5ab69a


<No messages or TCP traffic seen with tcpdump running on the PBS console at this point.> Then a new AWS S3 IP is used and the PBS task window shows the following:


2026-01-21T17:11:01+00:00: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>RequestTimeTooSkewed</Code><Message>The difference between the request time and the current time is too large.</Message><RequestTime>20260121T165455Z</RequestTime><ServerTime>2026-01-21T17:11:02Z</ServerTime><MaxAllowedSkewMilliseconds>900000</MaxAllowedSkewMilliseconds><RequestId>NWKQSQFY5PN91RRT</RequestId><HostId>KCNxiwSuYF9lSHYD0onJzMQA6BHdlva2UL3NHCXd55rQHVF+gL4lD/lSWbKuwbWBUEmspravdiW2EVRAG+dowfiJ4qhofFi2</HostId></Error>
2026-01-21T17:11:02+00:00: Caching of chunk fd15ef79c2745f37be7a3b07b75113dca69868764837185a255255a07cbcfa74
2026-01-21T17:11:05+00:00: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>RequestTimeTooSkewed</Code><Message>The difference between the request time and the current time is too large.</Message><RequestTime>20260121T165455Z</RequestTime><ServerTime>2026-01-21T17:11:06Z</ServerTime><MaxAllowedSkewMilliseconds>900000</MaxAllowedSkewMilliseconds><RequestId>6103GVJX7F833QTR</RequestId><HostId>Keq0bkqwZ42Ib4JaPJyx9j+dDBrCVt5sV7ukP2MSShMNT+eG0C9ynjZ9VR8Kz+N3xnNEW9gezrU=</HostId></Error>
2026-01-21T17:11:05+00:00: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>RequestTimeTooSkewed</Code><Message>The difference between the request time and the current time is too large.</Message><RequestTime>20260121T165456Z</RequestTime><ServerTime>2026-01-21T17:11:06Z</ServerTime><MaxAllowedSkewMilliseconds>900000</MaxAllowedSkewMilliseconds><RequestId>6103XH1QFRYZXCK4</RequestId><HostId>fjj45R+oP+hp2sVVaGyiLPwMsgw4TV5N/68Du7iGQrbHsIwqnDXsVcS6EpzU0iUBPMh96XMSD4M=</HostId></Error>
2026-01-21T17:11:06+00:00: Caching of chunk 31a7bfd2517fd410616f73b89b49b9f80c0bb80208a656c64126bc75dd9bc85e
2026-01-21T17:11:07+00:00: Caching of chunk 9858af89ab9435fcd70aaefc7f46c0d8db9f107e3af5124060f3b9cc8f5d5b35
2026-01-21T17:11:07+00:00: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>RequestTimeTooSkewed</Code><Message>The difference between the request time and the current time is too large.</Message><RequestTime>20260121T165456Z</RequestTime><ServerTime>2026-01-21T17:11:08Z</ServerTime><MaxAllowedSkewMilliseconds>900000</MaxAllowedSkewMilliseconds><RequestId>1YF803HJRN3YGD5R</RequestId><HostId>nM5gd0GZPvBNgZxjmnhzLiNdGGM46zSvObUDLdxce64OnrnNnnrvi9BLwi8kIs4Rh+0Hlcbufb0=</HostId></Error>
2026-01-21T17:11:07+00:00: Caching of chunk d9a98ce51588bb6332bf7055f4ed5d5129527a0bb46f749188f5406958d86e22
2026-01-21T17:11:13+00:00: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>RequestTimeTooSkewed</Code><Message>The difference between the request time and the current time is too large.</Message><RequestTime>20260121T165455Z</RequestTime><ServerTime>2026-01-21T17:11:14Z</ServerTime><MaxAllowedSkewMilliseconds>900000</MaxAllowedSkewMilliseconds><RequestId>ZQ90A54HYFSVNGSD</RequestId><HostId>CwCTtCedYDw44oQ7e2ARiXE2gPOL5gcTSSegnt6vxEExsyDrF7+T0rA/7I0adxJXxy5I9C08dymidmveIPnbZ4oYpu47ElY9</HostId></Error>
2026-01-21T17:11:14+00:00: Caching of chunk f336500109b1af3fe838f78024170600d838b98939a4e14bd33840d7c32461d5
2026-01-21T17:11:26+00:00: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>RequestTimeTooSkewed</Code><Message>The difference between the request time and the current time is too large.</Message><RequestTime>20260121T165455Z</RequestTime><ServerTime>2026-01-21T17:11:27Z</ServerTime><MaxAllowedSkewMilliseconds>900000</MaxAllowedSkewMilliseconds><RequestId>1X2SWA1H3CHM5QFX</RequestId><HostId>C16csJLb2kyGfk4faDrGfJ8uHckSl2vzVOKL5aucNiujMgwc2ovBQiETwJybwM502tsmhb0DYGrXQNTrBXeeet3a63e9+UdN</HostId></Error>
2026-01-21T17:11:27+00:00: Caching of chunk a4d0947214486b8e03d5f93bce7f8318e7c3ddc99cc4d2a7d5e30a18b0283248
2026-01-21T17:11:29+00:00: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>RequestTimeTooSkewed</Code><Message>The difference between the request time and the current time is too large.</Message><RequestTime>20260121T165456Z</RequestTime><ServerTime>2026-01-21T17:11:30Z</ServerTime><MaxAllowedSkewMilliseconds>900000</MaxAllowedSkewMilliseconds><RequestId>JRQ6Q20Q78X3KSN0</RequestId><HostId>mIIGgoQspEkPAjPSWG+AgVq7niXYPLS1PlL4fTfTNCV92nKxILT6wSEFUvl2ptEeuCtOLnG+2lk=</HostId></Error>
2026-01-21T17:11:29+00:00: Caching of chunk 5a97c9f8d6612de3e014784c990f05d03f2a6d4f1a7e8170b670bdba2b9bb6a3
2026-01-21T17:11:30+00:00: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>RequestTimeTooSkewed</Code><Message>The difference between the request time and the current time is too large.</Message><RequestTime>20260121T165456Z</RequestTime><ServerTime>2026-01-21T17:11:31Z</ServerTime><MaxAllowedSkewMilliseconds>900000</MaxAllowedSkewMilliseconds><RequestId>MGXCT6X6975JE6TF</RequestId><HostId>5QoT0pJy3QiJ5KBcYWYr/vVsHqgES/d0f9mjtwfQsAKNjZoRQSTXhJnHeQmibVhmbqYXTZ0U5G7Txv6KQFLsEg7Gh+3Y4poZ</HostId></Error>
2026-01-21T17:11:30+00:00: Caching of chunk 4fe3c458276a52c7723927e96b76588593c046661fb50afd92e54433bc107a58
2026-01-21T17:11:30+00:00: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>RequestTimeTooSkewed</Code><Message>The difference between the request time and the current time is too large.</Message><RequestTime>20260121T165455Z</RequestTime><ServerTime>2026-01-21T17:11:31Z</ServerTime><MaxAllowedSkewMilliseconds>900000</MaxAllowedSkewMilliseconds><RequestId>MGXE9KRNRM3MPM2Z</RequestId><HostId>06HBO6O2mJTy8eScU2xTNkjOJvUJvX/7+CzenE5EiaSZHDIAce5Me11LRs6Ul3+KjsK1I2kVZDU=</HostId></Error>
2026-01-21T17:11:31+00:00: Caching of chunk a0e1fe118e69cf6b299f2f1984c464378ec6d2522581bbb898d1b3350606abf4
2026-01-21T17:11:33+00:00: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>RequestTimeTooSkewed</Code><Message>The difference between the request time and the current time is too large.</Message><RequestTime>20260121T165455Z</RequestTime><ServerTime>2026-01-21T17:11:34Z</ServerTime><MaxAllowedSkewMilliseconds>900000</MaxAllowedSkewMilliseconds><RequestId>YXVP4A1C24MB4AQZ</RequestId><HostId>NFxV5WnTYERb8zUE9Pn9PesZ/RrAkIl/97XHDEUJZ5vMqO7Mo00vNmg6CtHEdLteNjTEJVPIhZ8=</HostId></Error>
2026-01-21T17:11:35+00:00: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>RequestTimeTooSkewed</Code><Message>The difference between the request time and the current time is too large.</Message><RequestTime>20260121T165455Z</RequestTime><ServerTime>2026-01-21T17:11:36Z</ServerTime><MaxAllowedSkewMilliseconds>900000</MaxAllowedSkewMilliseconds><RequestId>S2MR1SQW0Y6V1X4C</RequestId><HostId>ZGO++zfTtuUyEog7tv12Tol3SAJeS8DdyCTxTNmETD9YrX7mZXCLaAncAMYBgQq7e5fadkDqPhzB3Dom7JWGmHCAykB/mqHc</HostId></Error>
2026-01-21T17:11:35+00:00: Caching of chunk 116a7ec835a33aec8b04c3aeb1236247fc3281f82f0bd6d437adae182ef8899f
2026-01-21T17:11:36+00:00: Caching of chunk 15a52a41b45ce2a4a2a1847fd0232845513f8ff0875d12b8c72fa8151c16ea1


The PBS server is running NTP against a local Chrony server, so its clock is correct; the skew may be coming from the timestamps in requests that were generated before the stall but not sent until afterwards.
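
For reference, the clock on the PBS host looks healthy when checked with the usual chrony/systemd tooling, along these lines:

# show whether the system clock is synchronised
timedatectl status
# current offset, stratum and reference source reported by chrony
chronyc tracking
# list the configured time sources and their reachability
chronyc sources -v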

Sometimes we see the traffic hit our firewall and be sent on to AWS S3 but with no reply from the S3 server, or the traffic hits the internal firewall (Fortigate) interface but does not leave the external interface, so I am wondering if the traffic is by then out of context as far as the firewall is concerned and is therefore being dropped.

The internet outbound traffic aspect I can investigate with more tracing (as this only happens when we route the S3 traffic over one of our ISP links and not the other), but the issue I am more concerned with is the VM I/O errors (after the pause), so I will look at the fleecing aspect to see if that fixes the problem. The chain of events seems to start with the data not going to S3, and this then appears to have a knock-on effect of the VM having I/O errors. It can get to the point where the I/O issue forces the unmount of the VM's XFS filesystems, with processes hanging and then dying or being killed because the VM thinks there is a problem with the filesystem. I don't see these issues if the datastore is a local disk resource in PBS; it's only with the S3 datastore upload stalling when being routed via one of our ISP links.
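
As a rough idea of the extra tracing I have in mind (the S3 endpoint name and the IP are placeholders for our setup):

# which IPs the PBS currently resolves for the S3 endpoint
dig +short s3.eu-west-2.amazonaws.com
# watch the outbound HTTPS traffic from the PBS to one of those IPs during a backup
tcpdump -ni any "tcp port 443 and host <s3-ip>"
# then run the equivalent capture/session check on the Fortigate to see where the packets stop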


Thanks again

Dek
 
<Code>RequestTimeTooSkewed</Code><Message>The difference between the request time and the current time is too large.</Message>
Seems to be either a time sync issue or the requests being extremely delayed between being generated and sent on the PBS and being received on the S3 API. But the errors seem to be intermittent, and the backup task itself does not fail because these requests are retried.

The chain of events seems to start with the data not going to S3, and this then appears to have a knock-on effect of the VM having I/O errors. It can get to the point where the I/O issue forces the unmount of the VM's XFS filesystems, with processes hanging and then dying or being killed because the VM thinks there is a problem with the filesystem. I don't see these issues if the datastore is a local disk resource in PBS; it's only with the S3 datastore upload stalling when being routed via one of our ISP links.
Yes, this is expected. The VM backup is done on a copy-on-write principle for consistency, meaning that the blocks the VM writes to while being backed up are uploaded to the PBS first, and only then are the contents for those blocks written to disk. If the chunk upload is delayed because of the S3 backend connection issue, your VM has to wait until the chunk has been uploaded, leading to the observed I/O issues in the VM. Therefore you will either have to use backup fleecing, or use a local datastore as an intermediate target and only sync the backup snapshots to S3 in a second step using a push/pull sync job.
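
Roughly, the two options could look like the sketch below; the storage names are only examples, and the exact syntax should be checked against your installed PVE/PBS versions:

# Option 1: backup fleecing (PVE 8.2+) - blocks the guest dirties during the
# backup go to a fast local fleecing image instead of blocking on the S3 upload
vzdump 163 --storage s3-store2 --mode snapshot --fleecing enabled=1,storage=local-lvm
# or set it node-wide in /etc/vzdump.conf:
# fleecing: enabled=1,storage=local-lvm

# Option 2: back up to a local PBS datastore first, then replicate the snapshots
# to the S3-backed datastore with a sync job (Datastore -> Sync Jobs in the PBS
# GUI, or proxmox-backup-manager sync-job create on the CLI)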
 
Hi Chris,

Many thanks for that explanation; it makes perfect sense of what I was observing. I shall look at fleecing today and report back.

Thanks again

Regards

Dek