Backup restore on ZFS produces high load

I've got a dual 6-core E5-2600v3 system with 64 GB RAM, an LSI 9300-8i 12 Gb/s SATA/SAS HBA, and 4 Intel DC SSDs in a ZFS RAID10 pool plus 1 Intel DC SSD for log/cache.

During a restore the system load climbs to 40-50, with many ZFS processes showing up in "top". It makes no difference whether I restore from NFS or from a local ZFS-backed directory.

Is there any way to reduce the load caused by a restore job? From time to time the whole system hangs for 1-2 seconds while a restore is running.

I'm running the following versions:
Code:
proxmox-ve: 5.1-32 (running kernel: 4.13.13-2-pve)
pve-manager: 5.1-41 (running version: 5.1-41/0b958203)
pve-kernel-4.13.13-2-pve: 4.13.13-32
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-18
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
 
It makes no difference whether I restore from NFS or from a local ZFS-backed directory
Really "dir"? So you are not using a zvol (the default)? Please post some information:
Code:
pvesm status
zpool list
zpool status
zfs list
pveperf /mountpoint-zpool
 
Yeah, "dir" ;) I restore the backup from /var/lib/vz/dump/, not from NFS.

Code:
root@proxnode3:~# pvesm status
Name             Type     Status           Total            Used       Available        %
backup            nfs     active      5759822848      3848716288      1911090176   66.82%
local             dir     active       206999552         9813888       197185664    4.74%
local-zfs     zfspool     active       464767584       267581852       197185732   57.57%
Code:
root@proxnode3:~# zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool   476G   266G   210G         -    15%    55%  1.00x  ONLINE  -
root@proxnode3:~# zpool list -v
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool   476G   266G   210G         -    15%    55%  1.00x  ONLINE  -
  mirror   238G   133G   105G         -    14%    56%
    sda2      -      -      -         -      -      -
    sdb2      -      -      -         -      -      -
  mirror   238G   132G   106G         -    16%    55%
    sdc      -      -      -         -      -      -
    sdd      -      -      -         -      -      -
log      -      -      -         -      -      -
  sde1  59.5G  1.59M  59.5G         -     0%     0%
cache      -      -      -         -      -      -
  sde2   178G   134G  44.8G         -     0%    74%
Code:
root@proxnode3:~# zfs list
NAME                       USED  AVAIL  REFER  MOUNTPOINT
rpool                      273G   188G   104K  /rpool
rpool/ROOT                9.36G   188G    96K  /rpool/ROOT
rpool/ROOT/pve-1          9.36G   188G  9.36G  /
rpool/data                 255G   188G    96K  /rpool/data
rpool/data/vm-109-disk-1  53.5G   188G  53.5G  -
rpool/data/vm-203-disk-1  35.3G   188G  35.3G  -
rpool/data/vm-300-disk-1  12.8G   188G  12.8G  -
rpool/data/vm-300-disk-2  48.6M   188G  48.6M  -
rpool/data/vm-300-disk-3  56.5G   188G  56.5G  -
rpool/data/vm-303-disk-1  97.1G   188G  97.1G  -
rpool/swap                8.50G   196G   950M  -
Code:
root@proxnode3:~# pveperf /rpool
CPU BOGOMIPS:      115205.76
REGEX/SECOND:      2216472
HD SIZE:           188.05 GB (rpool)
FSYNCS/SECOND:     310.98
DNS EXT:           53.95 ms
DNS INT:           14.71 ms

I don't know why pveperf reports such low fsyncs. The system runs fast and all VMs have perfect disk I/O. The only problem is this high load while restoring backups.
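
As a rough cross-check of the fsync number, a quick synchronous-write test can be run directly on the pool's root filesystem. A minimal sketch, loosely comparable to what pveperf's fsync test exercises; the test file path is just an example:
Code:
# 1000 x 4k synchronous writes on the rpool mountpoint
dd if=/dev/zero of=/rpool/synctest bs=4k count=1000 oflag=dsync
rm /rpool/synctest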

PS: I have another production system with "normal" fsyncs and the same load problem. It's a newer dual 12-core E5-2600v4 system with about 128 GB of RAM, but the same HBA, with enterprise SAS and consumer SSDs in ZFS RAID10.
So is it a Proxmox-related issue? Writing 100 GB directly onto an empty zvol keeps the load low.
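
For context, the kind of raw zvol write test described above could look like the sketch below; the test zvol name and size are examples, and since zeros compress away when compression is enabled the numbers are only a ballpark:
Code:
# create a throwaway 100G test zvol
zfs create -V 100G rpool/data/testvol
# sequential 100 GB write straight to the zvol, bypassing the page cache
dd if=/dev/zero of=/dev/zvol/rpool/data/testvol bs=1M count=102400 oflag=direct status=progress
# clean up
zfs destroy rpool/data/testvol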
 
Since the disks in your RAID are all SSDs I would remove the log and cache devices. In your setup they can easily reduce performance. As a general rule of thumb: unless the log device has at least 4-5 times the random performance of the disks in the array, it often only lowers overall performance. In your case that means the log device would need to be an NVMe SSD (M.2 or PCIe) to actually give an improvement. I am sure your mainboard supports more RAM, so I would recommend swapping the log/cache device for additional RAM.
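
For reference, a minimal sketch of how the log and cache devices could be detached, assuming the device names from the zpool list -v output above (sde1 as log, sde2 as cache); double-check the names on your system first:
Code:
# remove the SLOG (log) vdev
zpool remove rpool sde1
# remove the L2ARC (cache) vdev
zpool remove rpool sde2
# verify they are gone
zpool status rpool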
 
My first test was without log/cache; currently I am testing with log/cache. Only the SAS system gains a bit of performance from it. This high load is not an SSD problem, though: the SAS RAID10 zpool has the same issue.
Still no explanation for the high load during the restore process.
 
Does anyone have a Proxmox 5 setup with ZFS that does not show this problem during restore?

We are testing a ZFS environment, but during the restore of a single VM all services stall, we suspect because of excessive disk I/O.
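
One way to confirm that disk I/O is the bottleneck is to watch the pool while the restore runs. A minimal sketch, assuming the pool is named rpool as earlier in this thread:
Code:
# per-vdev bandwidth and IOPS, refreshed every 2 seconds
zpool iostat -v rpool 2
# CPU, iowait and load while the restore runs
top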
 
Thanks, just yesterday we ran a test with that and it seems to work well. We are still adjusting the configuration. Thanks very much for your reply.
 
I don't see the bwlimit option anymore:

Code:
:~# man qmrestore
QMRESTORE(1)               Proxmox VE Documentation               QMRESTORE(1)

NAME
       qmrestore - Restore QemuServer `vzdump` Backups

SYNOPSIS
       qmrestore help

       qmrestore <archive> <vmid> [OPTIONS]

       Restore QemuServer vzdump backups.

       <archive>: <string>
           The backup file. You can pass - to read from standard input.

       <vmid>: <integer> (1 - N)
           The (unique) ID of the VM.

       --force <boolean>
           Allow to overwrite existing VM.

       --pool <string>
           Add the VM to the specified pool.

       --storage <string>
           Default storage.

       --unique <boolean>
           Assign a unique random ethernet address.

The restore is slowly working, but all VMs are stalling. Any way to fix this?
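
Since the man page above still accepts "-" as the archive, one possible workaround on this version is to throttle the archive stream before it reaches qmrestore. This is an untested sketch: it assumes an lzo-compressed .vma archive, that qmrestore handles the uncompressed stream on stdin, and that cstream is installed; the file name and VM ID are examples only:
Code:
# decompress, cap the stream at ~50 MB/s, feed it to qmrestore via stdin
lzop -dc /var/lib/vz/dump/vzdump-qemu-999-example.vma.lzo \
  | cstream -t 50000000 \
  | qmrestore - 999 --storage local-zfs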
 
What was your fix in the end? Is there any news that would help with the high load? What bwlimit did you set?
Thanks

For others reading this and wondering about bwlimit:
vi /etc/vzdump.conf
bwlimit: 50000
for example.
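
For clarity, the same setting as a config snippet; per the vzdump documentation the value is given in KBytes per second, so 50000 corresponds to roughly 50 MB/s (the exact value is just an example):
Code:
# /etc/vzdump.conf
# limit vzdump I/O bandwidth to ~50 MB/s (value in KBytes/s)
bwlimit: 50000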
 