Performance issues with ZFS

kenneth_vkd

Hi
We are running a three-node PVE 5.1 cluster: two nodes for running virtual machines and a backup node that stores VM backups and can also run virtual machines in an emergency.

All servers use locally attached storage. The two primary VM hosts each have 2x 4TB SATA drives in a ZFS mirror plus 2x 450GB NVMe drives; the backup node runs 12x 4TB SAS drives in a ZFS RAID-Z2 and no SSDs.
Within the last few weeks (shortly after Microsoft released their Spectre/Meltdown patches), we patched some Windows Server 2012 R2 VMs on one of the hosts, and users then started reporting random hangs. We removed the patches from the VMs again, but that did not change anything.
Users describe the hangs as lasting 5-30 seconds, after which everything runs fine again.
Our attention then turned to the physical PVE hosts, which have also been patched. No BIOS/UEFI patches have been installed yet.
Until a few days ago, we had one NVMe SSD for L2ARC and one NVMe SSD for the ZIL. This had been performing great so far, but as an experiment we changed the configuration so that both drives are partitioned, with one partition from each assigned to a mirrored SLOG and the other to the L2ARC.
This was done based on the user feedback: it looks like random cache misses, where data is no longer in the L2ARC and has to be read from the slower SATA drives. Because of the access times on those disks, when multiple users query data at the same time we see random hangs (Application not responding).
The server where we have the issues primarily runs Windows servers (due to Windows Server licensing), while the other node runs everything else (primarily CentOS 6, CentOS 7 and Debian 9).

Code:
# zpool status
  pool: rpool
 state: ONLINE
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            sda2       ONLINE       0     0     0
            sdb2       ONLINE       0     0     0
        logs
          mirror-1     ONLINE       0     0     0
            nvme0n1p1  ONLINE       0     0     0
            nvme1n1p1  ONLINE       0     0     0
        cache
          nvme0n1p2    ONLINE       0     0     0
          nvme1n1p2    ONLINE       0     0     0

errors: No known data errors
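
For completeness, the reconfiguration was roughly along these lines. This is only a sketch: the partition sizes are placeholders, the sgdisk step is just one way the partitions could be created, and the device names are taken from the outputs in this post.
Code:
# remove the old single-device log and cache (the original layout is shown further down)
zpool remove rpool nvme1n1
zpool remove rpool nvme0n1
# partition both NVMe drives: a small SLOG partition plus the rest for L2ARC (sizes are placeholders)
sgdisk -n1:0:+20G -n2:0:0 /dev/nvme0n1
sgdisk -n1:0:+20G -n2:0:0 /dev/nvme1n1
# re-add the first partitions as a mirrored SLOG and the second partitions as L2ARC
zpool add rpool log mirror /dev/nvme0n1p1 /dev/nvme1n1p1
zpool add rpool cache /dev/nvme0n1p2 /dev/nvme1n1p2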

We did the same configuration on both primary hosts, as they are physically the same type of server. After the change, we gave the servers a few days to rebuild the cache; when we checked again a few hours ago, performance had degraded further on both nodes.

The original configuration looked like this:
Code:
# zpool status
  pool: rpool
 state: ONLINE
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            sda2       ONLINE       0     0     0
            sdb2       ONLINE       0     0     0
        logs
          nvme1n1      ONLINE       0     0     0
        cache
          nvme0n1      ONLINE       0     0     0

errors: No known data errors

Both servers are running this version:
Code:
pve-manager/5.1-49/1e427a54 (running kernel: 4.13.13-6-pve)
Tonight we will revert to the old configuration to regain the desired performance.
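For anyone interested, the revert should roughly amount to the following. Again only a sketch; the zap step assumes we want the whole devices back without the old partition tables, and the device names match the outputs above.
Code:
# remove the mirrored SLOG vdev and the two L2ARC partitions
zpool remove rpool mirror-1
zpool remove rpool nvme0n1p2 nvme1n1p2
# wipe the partition tables so the whole devices can be re-added
sgdisk --zap-all /dev/nvme0n1
sgdisk --zap-all /dev/nvme1n1
# back to the original layout: one whole drive as SLOG, the other as L2ARC
zpool add rpool log /dev/nvme1n1
zpool add rpool cache /dev/nvme0n1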
Does anyone else see similar issues?
Does anyone have any tips for how to find out if this is the case?
Can we get anything interesting from monitoring with arcstat?
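For the arcstat question, what we were thinking of watching is roughly this (a sketch; depending on the ZFS version the tool may be called arcstat.py, and the field list is just a guess at what is relevant):
Code:
# sample ARC/L2ARC hit rates every 5 seconds
arcstat -f time,read,hits,miss,hit%,arcsz,l2hits,l2miss,l2hit%,l2size 5
# the raw counters are also available here
grep -E '^(hits|misses|l2_hits|l2_misses) ' /proc/spl/kstat/zfs/arcstats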
 
Hi,
What is the usage of your pool?
Because ZFS will lose significant performance once the pool is more than about 70% full.
 
The pool on the node running the Windows servers only has about 800GB used:
Code:
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool  3.62T   807G  2.84T         -    43%    21%  1.00x  ONLINE  -

The secondary node only has around 600GB used:
Code:
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool  3.62T   590G  3.05T         -    20%    15%  1.00x  ONLINE  -
 
