Proxmox Backup Server

devis

Hello, I am using Proxmox VE with Proxmox Backup Server.
My backups run over an internal 10 gigabit network, and the Proxmox Backup Server uses ZFS with 32 GB of memory allocated to the ARC.
Storage capacity: 47 TB, of which 40 TB is used
RAM: 64 GB
CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Storage disks: 12x 8 TB SATA (Seagate Exos 7E8)
VMs: 821 backup groups, 1681 snapshots

However, I am seeing the following problems:
1. I use a schedule to verify all backups (the job setup is sketched below), but this process is extremely slow and can take up to a week. While it is running, there are communication problems when fetching backups: when I try to load the list of backups for any machine on the hypervisor, I get "Connection timeout (596)" or "Communication failure (0)" errors.
2. While the backup verification job is running, the Proxmox Backup Server web interface also becomes sluggish.

Can you please tell me how to fix these problems?
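For context, this is roughly how the verification schedule can be inspected (just a sketch; the job ID below is made up):
Bash:
# List the configured verification jobs and their schedules
proxmox-backup-manager verify-job list

# Show the settings of a single job (the job ID is only an example)
proxmox-backup-manager verify-job show v-all-weekly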
 
~# pveversion --verbose
proxmox-ve: 7.3-1 (running kernel: 5.15.85-1-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-5.15: 7.4-1
pve-kernel-helper: 7.3-3
pve-kernel-5.15.104-1-pve: 5.15.104-1
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-4
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-1
libpve-rs-perl: 0.7.5
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.1-1
proxmox-backup-file-restore: 2.4.1-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.6.5
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-2
pve-firewall: 4.3-1
pve-firmware: 3.6-4
pve-ha-manager: 3.6.0
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1
~# proxmox-backup-manager versions --verbose
proxmox-backup 2.4-1 running kernel: 5.15.107-2-pve
proxmox-backup-server 2.4.2-2 running version: 2.4.2
pve-kernel-5.15 7.4-4
pve-kernel-5.15.107-2-pve 5.15.107-2
pve-kernel-5.15.74-1-pve 5.15.74-1
ifupdown2 3.1.0-1+pmx4
libjs-extjs 7.0.0-1
proxmox-backup-docs 2.4.2-1
proxmox-backup-client 2.4.2-1
proxmox-mail-forward 0.1.1-1
proxmox-mini-journalreader 1.2-1
proxmox-offline-mirror-helper unknown
proxmox-widget-toolkit 3.7.3
pve-xtermjs 4.16.0-2
smartmontools 7.2-pve3
zfsutils-linux 2.1.11-pve1
 
ZFS Subsystem Report Wed Jul 05 21:10:31 2023
Linux 5.15.107-2-pve 2.1.11-pve1
Machine: bkp5 (x86_64) 2.1.11-pve1

ARC status: HEALTHY
Memory throttle count: 0

ARC size (current): 100.2 % 30.1 GiB
Target size (adaptive): 100.0 % 30.0 GiB
Min size (hard limit): 100.0 % 30.0 GiB
Max size (high water): 1:1 30.0 GiB
Most Frequently Used (MFU) cache size: 61.2 % 7.0 GiB
Most Recently Used (MRU) cache size: 38.8 % 4.4 GiB
Metadata cache size (hard limit): 100.0 % 30.0 GiB
Metadata cache size (current): 94.5 % 28.3 GiB
Dnode cache size (hard limit): 100.0 % 30.0 GiB
Dnode cache size (current): 35.3 % 10.6 GiB

ARC hash breakdown:
Elements max: 3.5M
Elements current: 26.5 % 920.0k
Collisions: 98.1M
Chain max: 7
Chains: 46.8k

ARC misc:
Deleted: 624.1M
Mutex misses: 5.6M
Eviction skips: 99.0M
Eviction skips due to L2 writes: 0
L2 cached evictions: 0 Bytes
L2 eligible evictions: 69.8 TiB
L2 eligible MFU evictions: 2.9 % 2.1 TiB
L2 eligible MRU evictions: 97.1 % 67.7 TiB
L2 ineligible evictions: 4.4 TiB

ARC total accesses (hits + misses): 2.1G
Cache hit ratio: 81.1 % 1.7G
Cache miss ratio: 18.9 % 388.0M
Actual hit ratio (MFU + MRU hits): 81.1 % 1.7G
Data demand efficiency: 48.1 % 378.3M
Data prefetch efficiency: < 0.1 % 89.4M

Cache hits by cache type:
Most frequently used (MFU): 68.1 % 1.1G
Most recently used (MRU): 31.9 % 532.0M
Most frequently used (MFU) ghost: 4.0 % 66.7M
Most recently used (MRU) ghost: 1.3 % 22.4M

Cache hits by data type:
Demand data: 10.9 % 182.1M
Prefetch data: < 0.1 % 7.9k
Demand metadata: 89.1 % 1.5G
Prefetch metadata: < 0.1 % 549.6k

Cache misses by data type:
Demand data: 50.6 % 196.2M
Prefetch data: 23.0 % 89.4M
Demand metadata: 26.4 % 102.3M
Prefetch metadata: < 0.1 % 85.0k

DMU prefetch efficiency: 27.0M
Hit ratio: 49.1 % 13.3M
Miss ratio: 50.9 % 13.8M

L2ARC not detected, skipping section

Solaris Porting Layer (SPL):
spl_hostid 0
spl_hostid_path /etc/hostid
spl_kmem_alloc_max 1048576
spl_kmem_alloc_warn 65536
spl_kmem_cache_kmem_threads 4
spl_kmem_cache_magazine_size 0
spl_kmem_cache_max_size 32
spl_kmem_cache_obj_per_slab 8
spl_kmem_cache_reclaim 0
spl_kmem_cache_slab_limit 16384
spl_max_show_tasks 512
spl_panic_halt 0
spl_schedule_hrtimeout_slack_us 0
spl_taskq_kick 0
spl_taskq_thread_bind 0
spl_taskq_thread_dynamic 1
spl_taskq_thread_priority 1
spl_taskq_thread_sequential 4



VDEV cache disabled, skipping section

ZIL committed transactions: 15.2M
Commit requests: 170.2k
Flushes to stable storage: 170.1k
Transactions to SLOG storage pool: 0 Bytes 0
Transactions to non-SLOG storage pool: 6.7 GiB 184.0k
 
Yep. You get what you pay for. The HDDs are probably the bottleneck, as they can't handle the random reads fast enough.
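One way to confirm that (just a sketch, using the pool name from the zpool status posted below) is to watch the disks while a verify task is running:
Bash:
# Per-vdev latency and throughput of the backup pool, refreshed every 5 seconds
zpool iostat -v Backups-Storage1 5

# Per-disk utilization and wait times (iostat is part of the sysstat package)
iostat -x 5
If the HDDs sit near 100 % utilization with high await times while throughput stays low, the random-read pattern of the verify job, not the network, is the limit.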
 
ZFS raidz1? raidz2?
IMO this is expected with HDD-only storage.
Bash:
zpool status
  pool: Backups-Storage1
 state: ONLINE
config:

    NAME                                  STATE     READ WRITE CKSUM
    Backups-Storage1                      ONLINE       0     0     0
      mirror-0                            ONLINE       0     0     0
        ata-ST8000NM000A-2KE101_WKD2KGQE  ONLINE       0     0     0
        ata-ST8000NM000A-2KE101_WKD2MY6A  ONLINE       0     0     0
      mirror-1                            ONLINE       0     0     0
        ata-ST8000NM000A-2KE101_WKD2N2DB  ONLINE       0     0     0
        ata-ST8000NM000A-2KE101_WKD2NVWF  ONLINE       0     0     0
      mirror-2                            ONLINE       0     0     0
        ata-ST8000NM000A-2KE101_WKD2PKLH  ONLINE       0     0     0
        ata-ST8000NM000A-2KE101_WKD2PQQN  ONLINE       0     0     0
      mirror-3                            ONLINE       0     0     0
        ata-ST8000NM000A-2KE101_WKD2Q4Q6  ONLINE       0     0     0
        ata-ST8000NM000A-2KE101_WKD2SQCH  ONLINE       0     0     0
      mirror-4                            ONLINE       0     0     0
        ata-ST8000NM000A-2KE101_WKD2SQG0  ONLINE       0     0     0
        ata-ST8000NM000A-2KE101_WKD2V7EB  ONLINE       0     0     0
      mirror-5                            ONLINE       0     0     0
        ata-ST8000NM000A-2KE101_WKD2VBDD  ONLINE       0     0     0
        ata-ST8000NM000A-2KE101_WKD2W94Y  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
config:

    NAME                                                  STATE     READ WRITE CKSUM
    rpool                                                 ONLINE       0     0     0
      mirror-0                                            ONLINE       0     0     0
        ata-INTEL_SSDSC2BB120G4_BTWL428306LR120LGN-part3  ONLINE       0     0     0
        ata-INTEL_SSDSC2BB120G4_BTWL42740501120LGN-part3  ONLINE       0     0     0

errors: No known data errors
 
I would at least add a mirror of enterprise SSDs (for example 2x 500 GB) as a special device so the metadata doesn't have to be read/written from/to the slow HDDs. That would make GC an order of magnitude faster, and verify/backup/restore performance a bit better as well.
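A minimal sketch of what adding such a special vdev could look like (the device paths are placeholders, not actual disks):
Bash:
# Add a mirrored special vdev to the backup pool; use the by-id paths of the new SSDs
zpool add Backups-Storage1 special mirror \
    /dev/disk/by-id/ata-ENTERPRISE_SSD_1 /dev/disk/by-id/ata-ENTERPRISE_SSD_2
Keep in mind that only metadata written after the special vdev has been added ends up on it; metadata of existing chunks stays on the HDDs until those chunks are rewritten.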
 
Do I understand correctly that I need to create an L2ARC from the existing system disk?
 
I would at least add a mirror of enterprise SSDs (for example 2x 500 GB) as a special device so the metadata doesn't have to be read/written from/to the slow HDDs. That would make GC an order of magnitude faster, and verify/backup/restore performance a bit better as well.
But if I'm not mistaken, part of the metadata is already in memory because of the ARC.

As for GC, I wouldn't say it runs for a particularly long time; most of the time is spent on the verification process, and while verification is running I get errors when accessing the backups of a specific VM from the hypervisor.
 
But if I'm not mistaken, part of the metadata is already in memory because of the ARC.
Yes, but ARC and L2ARC are read caches only. Special devices are not a cache: without them, data and metadata are both stored on the HDDs; with them, metadata is stored on the SSDs and data on the HDDs. So without special devices, all those metadata writes will still hit your slow HDDs, no matter how big your ARC or L2ARC is.
PBS stores everything as chunk files of at most 4 MB (in practice more like 2 MB because of compression). So with 96 TB of backup storage you end up with something like 48 million chunk files. When doing a GC, PBS needs to read and update the atime (i.e. metadata) of all those 48 million files, which is way faster when a pair of SSDs handles the millions of random read+write IOs.
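To get a rough idea of the number of chunks on your datastore, something like this should work (the datastore path below is an assumption, adjust it to where your datastore actually lives):
Bash:
# Show the configured datastores and their paths
proxmox-backup-manager datastore list

# Count the chunk files below the datastore's .chunks directory
find /Backups-Storage1/.chunks -type f | wc -l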
 
Hello everyone, I have installed the SSD drives and the problem disappeared. The topic can be closed.
 
