sync-level to none

Dec 10, 2022
Hi everyone,

I have a PBS datastore on 30 rotating disks in RAID 10. I see sporadic
problems during the sync phase at the end of a backup; some example
dmesg reports collected over time are at the end of this post.

What do I risk by setting sync-level to none? Is it worth trying?

Thanks

INFO: task tokio-runtime-w:158563 blocked for more than 122 seconds.
Tainted: P O 6.8.4-3-pve #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:tokio-runtime-w state:D stack:0 pid:158563 tgid:13502 ppid:1 flags:0x00000002
Call Trace:
<TASK>
__schedule+0x401/0x15e0
? __queue_delayed_work+0x83/0xf0
schedule+0x33/0x110
wb_wait_for_completion+0x89/0xc0
? __pfx_autoremove_wake_function+0x10/0x10
__writeback_inodes_sb_nr+0x9d/0xd0
writeback_inodes_sb+0x3c/0x60
sync_filesystem+0x3d/0xb0
__x64_sys_syncfs+0x49/0xb0
x64_sys_call+0x12b4/0x24b0
do_syscall_64+0x81/0x170
? do_syscall_64+0x8d/0x170
? putname+0x5b/0x80
? __x64_sys_statx+0x71/0x90
? syscall_exit_to_user_mode+0x86/0x260
? do_syscall_64+0x8d/0x170
? exc_page_fault+0x94/0x1b0
entry_SYSCALL_64_after_hwframe+0x78/0x80
RIP: 0033:0x76120931dc57
RSP: 002b:00007611af9fe258 EFLAGS: 00000202 ORIG_RAX: 0000000000000132
RAX: ffffffffffffffda RBX: 00007611af9fe2a8 RCX: 000076120931dc57
RDX: 000076161115e280 RSI: 0000000000000007 RDI: 000000000000001e
RBP: 000000000000001e R08: 0000000000000007 R09: 0000761170030790
R10: 207f8db92f223b7c R11: 0000000000000202 R12: 0000000000000001
R13: 000076111c015a20 R14: 000000000000000b R15: 0000761170030790
</TASK>
INFO: task tokio-runtime-w:265427 blocked for more than 122 seconds.
Tainted: P O 6.8.4-3-pve #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:tokio-runtime-w state:D stack:0 pid:265427 tgid:13502 ppid:1 flags:0x00000002
Call Trace:
<TASK>
__schedule+0x401/0x15e0
? __queue_delayed_work+0x83/0xf0
schedule+0x33/0x110
wb_wait_for_completion+0x89/0xc0
? __pfx_autoremove_wake_function+0x10/0x10
__writeback_inodes_sb_nr+0x9d/0xd0
writeback_inodes_sb+0x3c/0x60
sync_filesystem+0x3d/0xb0
__x64_sys_syncfs+0x49/0xb0
x64_sys_call+0x12b4/0x24b0
do_syscall_64+0x81/0x170
? do_syscall_64+0x8d/0x170
? syscall_exit_to_user_mode+0x86/0x260
? do_syscall_64+0x8d/0x170
? do_syscall_64+0x8d/0x170
? irqentry_exit+0x43/0x50
? exc_page_fault+0x94/0x1b0
entry_SYSCALL_64_after_hwframe+0x78/0x80
RIP: 0033:0x76120931dc57
RSP: 002b:00007611f75fe258 EFLAGS: 00000202 ORIG_RAX: 0000000000000132
RAX: ffffffffffffffda RBX: 00007611f75fe2a8 RCX: 000076120931dc57
RDX: 0000761615175079 RSI: 0000000000000006 RDI: 000000000000001e
RBP: 000000000000001e R08: 0000000000000007 R09: 0000761174009f90
R10: 207f8db92f223b7c R11: 0000000000000202 R12: 0000000000000001
R13: 000076114c011ec0 R14: 000000000000000b R15: 0000761174009f90
</TASK>
INFO: task tokio-runtime-w:265366 blocked for more than 122 seconds.
Tainted: P O 6.8.4-3-pve #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:tokio-runtime-w state:D stack:0 pid:265366 tgid:13502 ppid:1 flags:0x00000002
Call Trace:
<TASK>
__schedule+0x401/0x15e0
? __queue_delayed_work+0x83/0xf0
schedule+0x33/0x110
wb_wait_for_completion+0x89/0xc0
? __pfx_autoremove_wake_function+0x10/0x10
__writeback_inodes_sb_nr+0x9d/0xd0
writeback_inodes_sb+0x3c/0x60
sync_filesystem+0x3d/0xb0
__x64_sys_syncfs+0x49/0xb0
x64_sys_call+0x12b4/0x24b0
do_syscall_64+0x81/0x170
? do_syscall_64+0x8d/0x170
? syscall_exit_to_user_mode+0x86/0x260
? do_syscall_64+0x8d/0x170
? irqentry_exit+0x43/0x50
? exc_page_fault+0x94/0x1b0
entry_SYSCALL_64_after_hwframe+0x78/0x80
RIP: 0033:0x76120931dc57
RSP: 002b:00007611f49fe258 EFLAGS: 00000202 ORIG_RAX: 0000000000000132
RAX: ffffffffffffffda RBX: 00007611f49fe2a8 RCX: 000076120931dc57
RDX: 00007616251514f6 RSI: 0000000000000005 RDI: 000000000000001e
RBP: 000000000000001e R08: 0000000000000007 R09: 00007611440068c0
R10: 207f8db92f223b7c R11: 0000000000000202 R12: 0000000000000001
R13: 000076114c00e1e0 R14: 000000000000000b R15: 00007611440068c0
</TASK>
perf: interrupt took too long (3163 > 3152), lowering kernel.perf_event_max_sample_rate to 63000
perf: interrupt took too long (3957 > 3953), lowering kernel.perf_event_max_sample_rate to 50000
INFO: task tokio-runtime-w:440481 blocked for more than 122 seconds.
Tainted: P O 6.8.4-3-pve #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:tokio-runtime-w state:D stack:0 pid:440481 tgid:13502 ppid:1 flags:0x00000002
Call Trace:
<TASK>
__schedule+0x401/0x15e0
? __queue_delayed_work+0x83/0xf0
schedule+0x33/0x110
wb_wait_for_completion+0x89/0xc0
? __pfx_autoremove_wake_function+0x10/0x10
__writeback_inodes_sb_nr+0x9d/0xd0
writeback_inodes_sb+0x3c/0x60
sync_filesystem+0x3d/0xb0
__x64_sys_syncfs+0x49/0xb0
x64_sys_call+0x12b4/0x24b0
do_syscall_64+0x81/0x170
? irqentry_exit_to_user_mode+0x7b/0x260
? irqentry_exit+0x43/0x50
? exc_page_fault+0x94/0x1b0
entry_SYSCALL_64_after_hwframe+0x78/0x80
RIP: 0033:0x76120931dc57
RSP: 002b:00007611f4dfe258 EFLAGS: 00000202 ORIG_RAX: 0000000000000132
RAX: ffffffffffffffda RBX: 00007611f4dfe2a8 RCX: 000076120931dc57
RDX: 0000761671154f76 RSI: 0000000000000005 RDI: 000000000000001f
RBP: 000000000000001f R08: 0000000000000007 R09: 0000761110006780
R10: 207f8db92f223b7c R11: 0000000000000202 R12: 0000000000000001
R13: 00007610f802f9a0 R14: 000000000000000b R15: 0000761110006780
</TASK>
 
If you set the sync-level to none, you might end up with broken backups in case of a power loss. It should only affect newly written chunks, so there should be nearly no risk for existing/older backups.

It seems your backup storage is too slow, which is hard to believe with 30 disks in a raid10.
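
For reference, the tuning option can be changed per datastore from the CLI; a minimal sketch, assuming your datastore is named store1 (adjust the name to your setup):

Code:
proxmox-backup-manager datastore update store1 --tuning 'sync-level=none'

Valid values are none, file and filesystem (the default, which matches the syncfs calls in your traces); you can revert the same way if it doesn't help.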
 
Hi @mgabriel.

You're right! I'm going to investigate further. It seems there are issues with
writeback during synchronization, suggesting the storage is the bottleneck.

Here's my setup for reference:

- pool of 30 HDDs (HUH721212ALE600, SATA, 7200 rpm) in an mdadm RAID configuration
- Broadcom/LSI SAS3008 flashed to IT mode (HBA)
- 128 GB of RAM

I'll be running fio and proxmox-backup-client benchmarks as
soon as possible to gather more data.
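
In case it is useful to others, this is roughly what I have in mind (the datastore path and the repository are placeholders for my setup):

Code:
# random 4k writes against the datastore mount, bypassing the page cache
# (this creates test files in the target directory)
fio --name=randwrite --directory=/mnt/datastore/store1 --rw=randwrite \
    --bs=4k --size=4G --numjobs=4 --iodepth=32 --ioengine=libaio \
    --direct=1 --runtime=60 --group_reporting

# PBS client-side benchmark (TLS, chunking, compression and hashing speed)
proxmox-backup-client benchmark --repository root@pam@localhost:store1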

Regarding the "sync-level to none" setting, the machine has redundant
hot-swappable power supplies and is located in a data center with a UPS.
Additionally, if I recall correctly, corrupted chunks should be flagged and
replaced, is that right?


Luca
 
I'd recommend using a ZFS RAID10 pool on top of your HBA rather than an mdadm RAID configuration. Besides the fact that mdadm isn't supported by Proxmox, it also has drawbacks when it comes to ops/sec and other issues. It may still be a valid approach for a file server, but it's not the best choice for a PBS or any other IOPS-heavy workload.
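
To illustrate, a minimal sketch of such a pool, assuming stable /dev/disk/by-id paths (the pool name and disk IDs are placeholders; continue with one mirror pair per line up to all 30 disks):

Code:
# ZFS "RAID10" = a stripe over 2-disk mirrors; ashift=12 for 4K-sector drives
zpool create -o ashift=12 backup \
  mirror /dev/disk/by-id/ata-DISK01 /dev/disk/by-id/ata-DISK02 \
  mirror /dev/disk/by-id/ata-DISK03 /dev/disk/by-id/ata-DISK04

ZFS stripes writes across all mirror vdevs, so every additional pair adds IOPS, which is exactly what a PBS datastore needs.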

Regarding the "sync-level to none" setting, the machine has redundant
hot-swappable power supplies and is located in a data center with a UPS.
Additionally, if I recall correctly, corrupted chunks should be flagged and
replaced, is that right?
That's right. The key is that you need to set up regular verify jobs, prune and garbage collection jobs to find the corrupted chunks. Then, corrupted chunks will be transferred again if they are still available on the source side (e.g. they were not changed since then).
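
On recent PBS versions those jobs can also be created from the CLI; a sketch with example IDs and schedules (the datastore name is a placeholder, and the GUI works just as well):

Code:
proxmox-backup-manager verify-job create verify-daily --store store1 --schedule daily
proxmox-backup-manager prune-job create prune-daily --store store1 \
    --schedule daily --keep-daily 7 --keep-weekly 4
proxmox-backup-manager datastore update store1 --gc-schedule weekly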
 
You're absolutely right! Unfortunately, I don't have enough RAM to create a ZFS pool for all my disks.
On a different note, I'm curious about average benchmark values for fio and proxmox-backup-client. Do you have any details?
 
Unfortunately not, but thanks in advance for sharing yours as soon as you have some :).
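
By the way, regarding the RAM concern: ZFS doesn't strictly need huge amounts of memory. The ARC simply grows to about half the RAM by default, and it can be capped with a module parameter. A sketch, assuming you want to limit it to 16 GiB (tune the value to taste):

Code:
# /etc/modprobe.d/zfs.conf (persists across reboots;
# running update-initramfs -u may be needed on Proxmox)
options zfs zfs_arc_max=17179869184

# apply immediately at runtime, without a reboot
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max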
 
