Insane load avg, disk timeouts w/ZFS

Why RAIDZ3 and not RAIDZ2?
For each increment of RAIDZ level you more or less halve the write speed.

I've discovered that the hard way!

I went and got two 3TB SATA drives today, set them up as a mirror, and am zfs-send'ing merrily away to them right now (I only had about 2TB of data).
As snapshots finish transferring, I'll have to start up a few key VMs again and run them live off the external mirror pair. Then, once everything is transferred, I'll destroy the RAIDZ3 pool and re-create it as a RAID10-style array. At that point I'll likely have to use PVE to do live disk migrations, because I can't keep all these VMs down for that long, and that will take freaking forever, but oh well. (This server is an emergency replacement for some similarly-aged hardware that had quasi-random hardware issues.)
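
For anyone following along, the replication boils down to something like the commands below. This is only a sketch; the temporary pool name ("extmirror") and the device paths are placeholders, not the actual ones used here.

Code:
# create the temporary mirror on the two new 3TB drives (device names are placeholders)
zpool create -o ashift=12 extmirror mirror \
    /dev/disk/by-id/ata-NEWDISK-A /dev/disk/by-id/ata-NEWDISK-B

# snapshot everything recursively and replicate it
zfs snapshot -r tank@migrate1
zfs send -R tank@migrate1 | zfs recv -F extmirror/tank

# later, catch up with an incremental send just before the cutover
zfs snapshot -r tank@migrate2
zfs send -R -i tank@migrate1 tank@migrate2 | zfs recv -F extmirror/tank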

I'm still not sure if it's a good idea, or worth it, to add the mirrored Toshiba "Enterprise" eMLC drives as a ZIL to that RAID10 array. Opinions welcome.
 
If your SSDs are these: http://www.storagereview.com/toshiba_px04s_enterprise_ssd_review
they should give good performance with synchronous random writes. I would personally split the partitions like this:
2 x 8 GB for a mirrored log, and a stripe of the rest for cache.
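
Roughly like this, as a sketch; the SSD device names below are placeholders:

Code:
# carve each SSD into an 8 GB log partition plus a cache partition (placeholder device names)
sgdisk -n1:0:+8G -n2:0:0 /dev/disk/by-id/SSD-A
sgdisk -n1:0:+8G -n2:0:0 /dev/disk/by-id/SSD-B

# mirrored SLOG from the small partitions, striped L2ARC from the rest
zpool add tank log mirror /dev/disk/by-id/SSD-A-part1 /dev/disk/by-id/SSD-B-part1
zpool add tank cache /dev/disk/by-id/SSD-A-part2 /dev/disk/by-id/SSD-B-part2
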
Previously discovered that there's no point in an L2ARC: this server is heavily write-biased, and the L2ARC size never got past 6GB with a 32GB ARC, so ... not much value there.
Might be more valuable as really fast VM storage - haven't made up my mind yet.
I know that attaching a ZIL to the RAIDZ3 array made things worse somehow, not better. If I knew why, I'd have more confidence in what to do next.
 
You are free to add and remove a log and a cache from the pool at any time, even while it is online, so I would try some experiments before going into production.
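
A minimal sketch of what that looks like; the vdev name comes from whatever zpool status reports, and the device paths are placeholders:

Code:
# attach a mirrored log and a cache device to a live pool
zpool add tank log mirror /dev/disk/by-id/SSD-A-part1 /dev/disk/by-id/SSD-B-part1
zpool add tank cache /dev/disk/by-id/SSD-A-part2

# and take them out again if the experiment doesn't pan out
zpool remove tank mirror-4                     # log mirror, named as shown by 'zpool status'
zpool remove tank /dev/disk/by-id/SSD-A-part2  # cache device, removed by device name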
 
I've reconfigured the 10 HDDs + 2 SSDs into a RAID10 setup with a mirrored SLOG.
Everything works great except for sustained write performance. When writing large amounts of data (e.g. during a VM clone or during backups) VMs time out while attempting to write to their virtual disks. (It doesn't matter if the virtual disk is configured as IDE, SATA, SCSI, or VirtIO.)
This causes filesystem problems inside three of the busier VMs on a nightly basis as their ext4 journal writes time out, and they remount the root filesystem read-only.

I *think* I'm hitting the ZFS write throttle; I have no idea how to overcome this - I've seen exactly the same behaviour on other systems, too, where (physical) write I/O will basically grind to a halt even though the disks are not 100% busy. It only seems to occur when very large amounts of data are written very rapidly. Unfortunately, it also seems to affect reads - as though all of ZFS just needed to rest for a few minutes before doing any more work!
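
In case it helps anyone digging into the same thing, these are the knobs I understand to govern the OpenZFS-on-Linux dirty-data write throttle; this is just a way to peek at what the throttle is doing, not a recommendation to change anything:

Code:
# current dirty-data ceiling and the point at which writes start being delayed
cat /sys/module/zfs/parameters/zfs_dirty_data_max
cat /sys/module/zfs/parameters/zfs_delay_min_dirty_percent

# recent transaction groups for the pool, including how much dirty data each carried
# (may need zfs_txg_history set > 0 to record anything)
tail -n 20 /proc/spl/kstat/zfs/tank/txgs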

I do have the ARC cache limited to 32GB (on a 72GB system), and I have memory left - no swapping is occurring. The root filesystem is on a separate pair of mirrored SSDs, and even it seems to be affected.
CPU use does not seem to be high when this happens, but the load average regularly reaches ~70-80.
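
(For reference, the 32GB cap is just the standard zfs_arc_max module parameter, value in bytes:)

Code:
# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=34359738368

# or change it on the fly:
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max
# (with root on ZFS, refresh the initramfs afterwards: update-initramfs -u)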

The presence of an SLOG device seems to *very slightly* alleviate this problem, or rather delay it - it takes slightly more writes now before ZFS freezes.

I'm quite certain it's not hardware, since I observe identical behaviour on four different, dissimilar hardware platforms.

Ideas welcome!

(Read performance is awesome, btw, and so is write performance as long as I limit it to about 90% of maximum burst throughput.)
 
Are you still using deduplication? I just saw that in your first post and was astonished that no one mentioned it before. You're running 10 SATA disks with deduplication on? Every test I tried with ordinary disks and deduplication was dead slow after some writes (once the dedup table no longer fits into RAM). Please post the output of zpool status -D <poolname> from the slow system.
 
Whoops, sorry for the late reply. Dedup is only turned on for the ../ctdata subvolume, which currently is unused (no containers on this system any more). Oh. And apparently also for backups... which makes sense. I'll try turning them off, but those are the two legitimate use cases for deduplication, really... I can live without containers, I guess, long-term, but I sure would like a way to use it with (uncompressed) backups. OTOH, the dedup ratio is abysmal anyway, so I guess there's not much point.
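
Turning it off is just a property change per dataset (existing DDT entries stay around until the blocks are rewritten):

Code:
zfs set dedup=off tank/ctdata
zfs set dedup=off tank/backups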

Code:
root@pve5:~# zpool status -D
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0

errors: No known data errors

 dedup: no DDT entries

  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME                                               STATE     READ WRITE CKSUM
        tank                                               ONLINE       0     0     0
          mirror-0                                         ONLINE       0     0     0
            pci-0000:05:00.0-sas-0x500065b36789abeb-lun-0  ONLINE       0     0     0
            pci-0000:05:00.0-sas-0x500065b36789abea-lun-0  ONLINE       0     0     0
          mirror-1                                         ONLINE       0     0     0
            pci-0000:05:00.0-sas-0x500065b36789abe9-lun-0  ONLINE       0     0     0
            pci-0000:05:00.0-sas-0x500065b36789abe8-lun-0  ONLINE       0     0     0
          mirror-2                                         ONLINE       0     0     0
            pci-0000:05:00.0-sas-0x500065b36789abe7-lun-0  ONLINE       0     0     0
            pci-0000:05:00.0-sas-0x500065b36789abe6-lun-0  ONLINE       0     0     0
          mirror-3                                         ONLINE       0     0     0
            pci-0000:05:00.0-sas-0x500065b36789abe5-lun-0  ONLINE       0     0     0
            pci-0000:05:00.0-sas-0x500065b36789abe4-lun-0  ONLINE       0     0     0
          mirror-5                                         ONLINE       0     0     0
            pci-0000:05:00.0-sas-0x500065b36789abe3-lun-0  ONLINE       0     0     0
            pci-0000:05:00.0-sas-0x500065b36789abe2-lun-0  ONLINE       0     0     0
        logs
          mirror-4                                         ONLINE       0     0     0
            pci-0000:05:00.0-sas-0x500065b36789abe1-lun-0  ONLINE       0     0     0
            pci-0000:05:00.0-sas-0x500065b36789abe0-lun-0  ONLINE       0     0     0

errors: No known data errors

 dedup: DDT entries 31746831, size 309 on disk, 176 in core

bucket              allocated                       referenced          
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    30.0M   3.75T   3.73T   3.73T    30.0M   3.75T   3.73T   3.73T
     2     107K   13.3G   13.3G   13.3G     267K   33.4G   33.4G   33.4G
     4     137K   17.2G   17.2G   17.2G     695K   86.9G   86.9G   86.9G
     8    13.9K   1.74G   1.74G   1.74G     121K   15.1G   15.1G   15.1G
    16        3    384K    384K    384K       48      6M      6M      6M
 Total    30.3M   3.78T   3.76T   3.76T    31.1M   3.89T   3.86T   3.86T

root@pve5:~# zpool get all | grep dedup
rpool  dedupditto                  100                         local
rpool  dedupratio                  1.00x                       -
tank   dedupditto                  0                           default
tank   dedupratio                  1.02x                       -
root@pve5:~# zfs get all | grep dedup
rpool                                  dedup                  off                    default
rpool/ROOT                             dedup                  off                    default
rpool/ROOT/pve-1                       dedup                  off                    default
rpool/swap                             dedup                  off                    default
tank                                   dedup                  off                    default
tank/backups                           dedup                  sha256,verify          received
tank/ctdata                            dedup                  sha256,verify          received
tank/templates                         dedup                  off                    default
tank/vmdata                            dedup                  off                    default
tank/vmdata/vm-100-disk-1              dedup                  off                    default
tank/vmdata/vm-100-disk-2              dedup                  off                    default
tank/vmdata/vm-101-disk-1              dedup                  off                    default
tank/vmdata/vm-101-disk-2              dedup                  off                    default
tank/vmdata/vm-102-disk-1              dedup                  off                    default
tank/vmdata/vm-102-disk-2              dedup                  off                    default
tank/vmdata/vm-103-disk-1              dedup                  off                    default
tank/vmdata/vm-103-disk-2              dedup                  off                    default
tank/vmdata/vm-117-disk-1              dedup                  off                    default
tank/vmdata/vm-118-disk-1              dedup                  off                    default
tank/vmdata/vm-120-disk-1              dedup                  off                    default
tank/vmdata/vm-120-disk-2              dedup                  off                    default
tank/vmdata/vm-122-disk-1              dedup                  off                    default
tank/vmdata/vm-122-disk-2              dedup                  off                    default
tank/vmdata/vm-125-disk-1              dedup                  off                    default
tank/vmdata/vm-127-disk-1              dedup                  off                    default
tank/vmdata/vm-127-disk-2              dedup                  off                    default
tank/vmdata/vm-127-disk-3              dedup                  off                    default
tank/vmdata/vm-226-disk-1              dedup                  off                    default
tank/vmdata/vm-228-disk-1              dedup                  off                    default
tank/vmdata/vm-901-disk-1              dedup                  off                    default
tank/vmdata/vm-901-disk-2              dedup                  off                    default
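
For scale, the in-core DDT footprint can be estimated straight from that output (entries x bytes per entry in core):

Code:
echo $((31746831 * 176))    # 5587442256 bytes, i.e. roughly 5.2 GiB of ARC just for the dedup table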
 
Also, on at least one system that exhibits nearly identical behaviour, dedup isn't enabled anywhere. But I'll turn it off anyway.
 
Default Proxmox VE backups are not deduplicatable. They use a proprietary format (even when uncompressed) which does not work with deduplication. You have to unpack the files; then they deduplicate perfectly, and you can use snapshots to build the backup infrastructure. I built such a server and was able to store at least 100x the amount of backups on a single (!) 4 TB drive (as an external ZFS pool for off-site storage). That's really amazing.
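
A sketch of the unpacking step using the vma tool that ships with PVE; the archive name and target path below are just examples:

Code:
# extract an uncompressed vzdump archive into a directory of raw disk images
vma extract /tank/backups/dump/vzdump-qemu-100-2017_01_01-00_00_01.vma /tank/backups/extracted/vm-100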

Deduplication can be turned off, but that alone will not make things faster: the tables and structures are all still in place. You have to destroy and recreate the pool to get rid of all deduplication entries. You won't see better performance on the datasets where you disabled it unless you send/receive those datasets.
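
A send/receive rewrite of a single dataset would look roughly like this (snapshot and dataset names are placeholders):

Code:
zfs snapshot tank/ctdata@flatten
zfs send tank/ctdata@flatten | zfs recv tank/ctdata-new   # plain send carries no properties, so the copy inherits dedup=off from tank
zfs destroy -r tank/ctdata
zfs rename tank/ctdata-new tank/ctdata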
 
Yes, I've discovered that about backups :-(.
However, I do actually still want dedup for containers; all the OS files should dedupe, producing substantial savings. At least in theory.
 

Yes, it's also true in practice, but you need a lot of RAM or very fast disks. I use dedup for containers on SSDs, which works very fast and saves a lot of space, so you're right. Yet you will lose performance, and the loss grows as the data grows. I'd suggest having two pools for your containers: one for the deduplicatable container OS data and one for the application data, e.g. fileserver, webserver, and SQL data go on the non-dedup side and the OS goes on the dedup side. This is a more complex setup, but you'll have a smaller dedup table (in RAM and on disk).
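
If separate pools aren't practical, the same effect on DDT size can be approximated with per-dataset dedup settings, since only blocks written to dedup-enabled datasets land in the table; a sketch with placeholder dataset names:

Code:
zfs create -o dedup=sha256,verify tank/ct-os
zfs create -o dedup=off tank/ct-appdata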
 
I have since found one potential fix for my performance issues (although it may cause other problems, don't know yet): setting the zfs_arc_lotsfree_percent parameter to zero (0).
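
For reference, a sketch of how that parameter gets set on ZFS on Linux:

Code:
# live change
echo 0 > /sys/module/zfs/parameters/zfs_arc_lotsfree_percent

# to make it stick across reboots, add to /etc/modprobe.d/zfs.conf:
#   options zfs zfs_arc_lotsfree_percent=0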
 