Reads from ZFS are very slow

docent

Renowned Member
Jul 23, 2009
Hi,

I have a two-node cluster with ZFS on each node, and I am seeing very slow read performance from ZFS.
On the first node (pve-manager/3.4-3/2fc72fee (running kernel: 2.6.32-37-pve)):
Code:
root@hq-vmc1-1:/# dd bs=10M count=200 if=/dev/sda of=/dev/null
2097152000 bytes (2.1 GB) copied, 3.55839 s, 589 MB/s

root@hq-vmc1-1:/# dd bs=10M count=200 if=/rpool/backup/dump/vzdump-qemu-105-2015_03_29-00_16_51.vma.lzo of=/dev/null
2097152000 bytes (2.1 GB) copied, 3.86926 s, 542 MB/s

root@hq-vmc1-1:/# dd bs=10M count=200 if=/dev/zvol/rpool/vm-102-disk-1 of=/dev/null
2097152000 bytes (2.1 GB) copied, 29.2143 s, 71.8 MB/s
On the second node (pve-manager/3.4-1/3f2d890e (running kernel: 3.10.0-1-pve)):
Code:
root@hq-vmc1-2:/# dd bs=10M count=200 if=/dev/sda of=/dev/null
2097152000 bytes (2.1 GB) copied, 3.0852 s, 680 MB/s

root@hq-vmc1-2:/# dd bs=10M count=200 if=/pool2/VMs/images/100/vm-100-disk-1.qcow2 of=/dev/null
2097152000 bytes (2.1 GB) copied, 42.991 s, 48.8 MB/s

root@hq-vmc1-2:/# dd bs=10M count=200 if=/dev/zd32 of=/dev/null
2097152000 bytes (2.1 GB) copied, 107.888 s, 19.4 MB/s

Both nodes use RAID 6 on an LSI MR9265-8i (8 x 2 TB SAS).
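For comparison, a somewhat more controlled run would flush the Linux page cache before each read (note that this does not empty the ZFS ARC); a sketch using the same paths as above:
Code:
# flush the page cache so the raw-device read is not served from RAM
sync; echo 3 > /proc/sys/vm/drop_caches
dd bs=10M count=200 if=/dev/sda of=/dev/null

# repeat for the zvol; the ARC may still serve part of this read
sync; echo 3 > /proc/sys/vm/drop_caches
dd bs=10M count=200 if=/dev/zvol/rpool/vm-102-disk-1 of=/dev/null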
 
Are you using ZFS on top of hardware RAID? What size is your ZFS ARC?
Code:
root@hq-vmc1-1:~# cat /etc/modprobe.d/zfs.conf
# ZFS tuning for a proxmox machine that reserves 16GB for ZFS
#
# Don't let ZFS use less than 4GB and more than 16GB
options zfs zfs_arc_min=4294967296
options zfs zfs_arc_max=17179869184
#
# disabling prefetch is no longer required
options zfs l2arc_noprefetch=0
Code:
root@hq-vmc1-2:~# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=4299967296
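The limits actually in effect can be verified at runtime; a quick sketch using the standard ZFS-on-Linux sysfs and procfs paths:
Code:
# module parameters as currently loaded (values in bytes)
cat /sys/module/zfs/parameters/zfs_arc_min
cat /sys/module/zfs/parameters/zfs_arc_max

# current ARC size and overall hit/miss counters
grep -E '^(size|hits|misses) ' /proc/spl/kstat/zfs/arcstats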
Code:
root@hq-vmc1-1:~# zpool list -v
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
rpool  10.9T  2.34T  8.54T    21%  1.00x  ONLINE  -
  sda3  10.9T  2.34T  8.54T         -
Code:
root@hq-vmc1-2:~# zpool list -v
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
pool2  2.98T  2.64T   355G    88%  1.00x  ONLINE  -
  dm-name-pve-csv2  2.98T  2.64T   355G         -
  mirror  7.94G  25.9M  7.91G         -
    ata-INTEL_SSDSC2CW060A3_CVCV203102QQ060AGN-part1      -      -      -         -
    ata-INTEL_SSDSC2CW060A3_CVCV20310484060AGN-part1      -      -      -         -
cache      -      -      -      -      -      -
  ata-INTEL_SSDSC2CW060A3_CVCV203102QQ060AGN-part2  47.9G  47.9G  7.62M         -
  ata-INTEL_SSDSC2CW060A3_CVCV20310484060AGN-part2  47.9G  47.9G  7.98M         -
 
I can see that you have put ZFS on partitions. This is a very bad idea; ZFS wants whole disks, not partitions. Furthermore, you have put the cache on partitions of the same disks that hold the ZFS data. This is a disaster waiting to happen!
 
This is not correct. ZFS will create its filesystem on whatever you throw at it, be it a disk or a partition. The recommendation is to use whole disks.
 
No, ZFS always uses GPT partitions (see the source code). AFAIK it has some magic code to hide that fact.
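As a rough illustration: on ZFS on Linux, handing zpool a whole disk makes it write a GPT with a large data partition (-part1) and a small reserved partition (-part9). The device below is a placeholder, not one of the disks in this thread:
Code:
# illustration only; /dev/sdX stands for a spare, empty disk
zpool create testpool /dev/sdX
lsblk /dev/sdX        # ZFS created sdX1 (data) and a small reserved sdX9
zpool destroy testpool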
 
Code:
root@hq-vmc1-2:~# zpool list -v
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
pool2  2.98T  2.64T   355G    88%  1.00x  ONLINE  -
  dm-name-pve-csv2  2.98T  2.64T   355G         -
  mirror  7.94G  25.9M  7.91G         -
    ata-INTEL_SSDSC2CW060A3_CVCV203102QQ060AGN-part1      -      -      -         -
    ata-INTEL_SSDSC2CW060A3_CVCV20310484060AGN-part1      -      -      -         -
cache      -      -      -      -      -      -
  ata-INTEL_SSDSC2CW060A3_CVCV203102QQ060AGN-part2  47.9G  47.9G  7.62M         -
  ata-INTEL_SSDSC2CW060A3_CVCV20310484060AGN-part2  47.9G  47.9G  7.98M         -


If you are using hardware RAID and putting ZFS on top of it, ZFS loses control over the disks. The behaviour then depends on the hardware RAID controller, not on ZFS.
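To see what actually sits underneath that vdev, something along these lines should walk the chain down to the RAID controller's logical disk (the /dev/mapper path is inferred from the vdev name in the output above):
Code:
# show the parent devices of the LVM volume that backs pool2 (on hq-vmc1-2)
lsblk -s /dev/mapper/pve-csv2
dmsetup deps /dev/mapper/pve-csv2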
 
In fact, there is a third node in this cluster. That node has only a single 250 GB SATA HDD, and I see similar results there.

Code:
root@hq-vmc1-3:~# pveversion
pve-manager/3.4-3/2fc72fee (running kernel: 2.6.32-37-pve)

root@hq-vmc1-3:~# parted /dev/sda print free
Model: ServeRA System (scsi)
Disk /dev/sda: 251GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name                  Flags
        17.4kB  1049kB  1031kB  Free Space
 1      1049kB  2097kB  1049kB               Grub-Boot-Partition   bios_grub
 2      2097kB  136MB   134MB   fat32        EFI-System-Partition  boot, esp
 3      136MB   251GB   251GB   zfs          PVE-ZFS-Partition
        251GB   251GB   1032kB  Free Space

root@hq-vmc1-3:~# zpool list -v
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
rpool   232G  51.9G   180G    22%  1.00x  ONLINE  -
  sda3   232G  51.9G   180G         -

root@hq-vmc1-3:~# zfs list
NAME                  USED  AVAIL  REFER  MOUNTPOINT
rpool                64.5G   164G   144K  /rpool
rpool/ROOT           27.8G   164G   144K  /rpool/ROOT
rpool/ROOT/pve-1     27.8G   164G  27.8G  /
rpool/swap           12.8G   176G   118M  -
rpool/vm-100-disk-1  4.96G   164G  4.96G  -
rpool/vm-110-disk-1  13.2G   164G  13.2G  -
rpool/vm-134-disk-1  5.79G   164G  5.79G  -

root@hq-vmc1-3:~# dd bs=10M count=200 if=/dev/sda of=/dev/null
2097152000 bytes (2.1 GB) copied, 33.5967 s, 62.4 MB/s

root@hq-vmc1-3:~# dd bs=10M count=200 if=/var/lib/vz/dump/vzdump-qemu-100-2015_04_13-08_34_50.vma.lzo of=/dev/null
2097152000 bytes (2.1 GB) copied, 64.8132 s, 32.4 MB/s

root@hq-vmc1-3:~# dd bs=10M count=200 if=/dev/zd0 of=/dev/null
2097152000 bytes (2.1 GB) copied, 122.875 s, 17.1 MB/s

The first node has similar disk partitioning:
Code:
root@hq-vmc1-1:~# parted /dev/sda print free
Model: LSI MR9265-8i (scsi)
Disk /dev/sda: 12.0TB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name                  Flags
        17.4kB  1049kB  1031kB  Free Space
 1      1049kB  2097kB  1049kB               Grub-Boot-Partition   bios_grub
 2      2097kB  136MB   134MB   fat32        EFI-System-Partition  boot, esp
 3      136MB   12.0TB  12.0TB  zfs          PVE-ZFS-Partition
        12.0TB  12.0TB  1032kB  Free Space

root@hq-vmc1-1:~# zpool list -v
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
rpool  10.9T  2.34T  8.53T    21%  1.00x  ONLINE  -
  sda3  10.9T  2.34T  8.53T         -

root@hq-vmc1-1:~# zfs list
NAME                  USED  AVAIL  REFER  MOUNTPOINT
rpool                2.63T  8.08T   144K  /rpool
rpool/ROOT           17.8G  8.08T   144K  /rpool/ROOT
rpool/ROOT/pve-1     17.8G  8.08T  17.8G  /
rpool/VMs             160K  8.08T   160K  /rpool/VMs
rpool/backup         1.84T  1.16T  1.84T  /rpool/backup
rpool/swap            133G  8.21T  74.3M  -
rpool/vm-102-disk-1  34.0G  8.08T  34.0G  -
rpool/vm-102-disk-2  86.2G  8.08T  86.2G  -
rpool/vm-102-disk-3  94.3G  8.08T  94.3G  -
rpool/vm-103-disk-1   173G  8.08T   172G  -
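Since the slowdown shows up mainly on the zvols, the zvol properties that affect large sequential reads may also be worth a look; a sketch using dataset names from the listings above:
Code:
# a small volblocksize (8K default) on top of RAID 6 stripes can hurt sequential reads
zfs get volblocksize,compression,primarycache rpool/vm-102-disk-1   # first node
zfs get volblocksize,compression,primarycache rpool/vm-100-disk-1   # third node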
 
You should not have the cache on the same drives as the data. The purpose of the ZFS L2ARC is to keep commonly used data on faster drives. If the cache and the data share the same drive, you are just wasting time and throughput.
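If the decision is to stop using those SSD partitions as L2ARC, cache devices can be removed from a live pool without touching the data; device names are taken from the zpool list output above (a sketch, not something to run blindly):
Code:
zpool remove pool2 ata-INTEL_SSDSC2CW060A3_CVCV203102QQ060AGN-part2
zpool remove pool2 ata-INTEL_SSDSC2CW060A3_CVCV20310484060AGN-part2
zpool status pool2    # confirm the cache section is gone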
 
