Write speeds fall off / VM lag during file copies (ZFS)

[attachment: screenshot of the host's memory and swap usage]


I think swap usage is quite normal for a virtualization system when the real hardware does not have enough physical memory for everything. The picture is from my homelab HP MicroServer Gen8, and this server only has 16 GB of ECC buffered RAM (I would like to grow it, but the CPU limits it anyway). I use this server for static homelab guests such as NFS, iSCSI, AD, monitoring and a log server, and I also use it as a file server. On this server I have both BTRFS and ZFS filesystems for different jobs.

At the end of the day, my server's physical memory is simply not enough for everything, so KVM needs some exchange area as swap; that is all.
 
I think swap usage is quite normal for a virtualization system when the real hardware does not have enough physical memory for everything.

Yes, so plug in more memory, or reduce the number of VMs or the memory assigned to them. The system is already operating at its limits and will be noticeably slow.
 
Performance is not that important to me on this machine, and I cannot add more RAM because of the CPU limit... I have another two servers for all real-scenario testing.
 
@ertanerbek
@LnxBil

Ok so there were a lot of suggestions:
  • I disabled Discard on all the drives connected to the VM in question and this resulted in no improvement - 1. Disable discard in the GUEST config.
  • I checked "zfs get compression" and all my drives have compression enabled already - 2. You should enable compression on the ZFS pool; otherwise ZFS cannot deal with zeroes, which means you cannot use your level-1 cache effectively. This is also a better solution than guest DISCARD; use ZFS's global discard for your SSDs instead.
  • I limited my ARC to 8GB a week ago and my server has 10GB of memory free. I have 2 SSDs and 4 magnetic drives using ZFS (30TB total), should I lower this number further? - 3. Limit your ARC usage (you can find this in the PVE documentation); for 16 GB of RAM, 2 GB is enough.
  • I set these options in /etc/ksmtuned.conf, do I need to reboot Proxmox to make the changes take effect? - 4. Use the KSM system effectively: KSM_NPAGES_MAX=10000 KSM_THRES_COEF=80.
  • Can you tell me more about how this will help my problem and how you suggest I implement it? - 5. Use ZRAM (VMware also uses this, so why don't you?).
  • This server is for home use and only a few users use it sparingly for different services. Are you saying do not use a separate LOG or CACHE disk for my spinning drives? - 6. Do not use a LOG or CACHE disk unless many people access your system at the same time.
  • Ok so before I had Discard enabled and then disabled it as suggested in this thread. How do I configure it to only run daily instead of after every delete? - Only applicable to KVM VMs, and it will waste storage space, so I would also suggest using it, e.g. daily rather than on every delete.
  • This is already getting overly complicated so I would rather not deal with this unless you think this would solve my issue - Depends on your problem. A better way is to reduce what is cached and what is not (e.g. via the primarycache ZFS attribute; see the sketch after this list). Optimizing ZFS for the best cache settings is hard work and extremely problem-oriented. A database needs different settings than message logfiles, and other settings than the general OS.
  • Ok so it sounds like the KSM suggestion will not help my Windows VM - Also only applicable to KVM VMs, due to a special syscall that only KVM/QEMU uses.
  • My system has enough RAM and I do not think I have ever needed swap yet. Do you still suggest I do this? - Always a good suggestion; zram is extremely useful. I replaced all physical swap disks/partitions with zram on my systems.
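
For reference, a minimal sketch of the primarycache tuning mentioned above, assuming a hypothetical dataset name tank/vm-disks (replace it with your own pool/dataset):

Code:
# check what ZFS currently caches for the dataset
zfs get primarycache,secondarycache tank/vm-disks

# cache only metadata in ARC for a dataset that does not benefit from data caching
zfs set primarycache=metadata tank/vm-disks

# revert to the default (cache both data and metadata)
zfs set primarycache=all tank/vm-disks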
 
@ertanerbek
@LnxBil

Ok so there were a lot of suggestions:
  • I disabled Discard on all the drives connected to the VM in question and this resulted in no improvement - 1. Disable discard in the GUEST config.
  • Discard creates unnecessary IO on your storage, and it does not work reliably on Windows guests anyway. ZFS already knows how much dirty disk area each guest uses on the real disk pool.
  • I checked "zfs get compression" and all my drives have compression enabled already - 2. You should enable compression on the ZFS pool; otherwise ZFS cannot deal with zeroes, which means you cannot use your level-1 cache effectively. This is also a better solution than guest DISCARD; use ZFS's global discard for your SSDs instead.
  • Activate TRIM on your SSD pool: "zpool set autotrim=on POOLNAME".
  • I limited my ARC to 8GB a week ago and my server has 10GB of memory free. I have 2 SSDs and 4 magnetic drives using ZFS (30TB total), should I lower this number further? - 3. Limit your ARC usage (you can find this in the PVE documentation); for 16 GB of RAM, 2 GB is enough.
  • People generally forget that Linux also uses RAM as cache and buffers, so RAM is never "enough" for disk read or write workloads. Watch your system with dstat: when you start any read or write operation, Linux also starts using RAM, and if you do not have enough RAM, all disk operations become slow. "dstat -c -m -d -D sda,sdb,sdc,sdd" etc.
  • I set these options in /etc/ksmtuned.conf, do I need to reboot Proxmox to make the changes take effect? - 4. Use the KSM system effectively: KSM_NPAGES_MAX=10000 KSM_THRES_COEF=80.
  • Just restart it with "systemctl restart ksmtuned.service"; no reboot is needed. If you also change the logging settings, you will see what is happening on the KSM side (a fuller example is sketched after this list):

    LOGFILE=/var/log/syslog

    DEBUG=1

  • Can you tell me more about how this will help my problem and how you suggest I implement it? - 5. Use ZRAM (VMware also uses this, so why don't you?).
  • On Linux everything goes through the cache and buffer system, and that buffer/cache lives in your RAM, so the more RAM you have, the faster all your disk operations will be. You can watch this with dstat.
  • This server is for home use and only a few users use it sparingly for different services. Are you saying do not use a separate LOG or CACHE disk for my spinning drives? - 6. Do not use a LOG or CACHE disk unless many people access your system at the same time.
  • I think you do not need them. You can watch with "zpool iostat -i POOLNAME 1" (without a pool name, zpool shows all pools). You will see that as long as you have free RAM, ZFS never uses the LOG device. For reads, a CACHE device means ZFS sends the same data to the cache disk and to the slow pool disks at the same time, which creates a lot of unnecessary IO on your SSD; and after sending the data to both the cache and the normal pool, ZFS checks the CACHE when you want to read the data back. With virtualization we work with big files, so a CACHE device is not for us; it may be a very good solution for a file-server type system.
  • Ok so before I had Discard enabled and then disabled it as suggested in this thread. How do I configure it to only run daily instead of after every delete? - Only applicable to KVM VMs, and it will waste storage space, so I would also suggest using it, e.g. daily rather than on every delete.
  • If you activate ZFS trim, ZFS itself takes care of erasing dirty data on the real pool. The discard option on the GUEST just means "the guest operating system tells the virtual disk that it has deleted this file". That is actually a good feature, but ZFS does not need more than its own trim, and on Windows the discard operation does not work reliably anyway. All solid-state chips are like a lighter: if you open your lighter and forget to close it, its lifetime goes down. TRIM is for this; it closes the cells that are no longer needed. On the speed side, a solid-state chip gets its high speed from accessing many cells at the same time (like a striped disk pool); if you never TRIM, or you run defrag-type operations on an SSD, your speed drops dramatically.
    ZFS already has its own discard system; just activate it with "zpool set autotrim=on POOLNAME" and then watch the TRIM operations with "zpool iostat -r POOLNAME 1".
  • Ok so it sounds like the KSM suggestion will not help my Windows VM - Also only applicable to KVM VMs, due to a special syscall that only KVM/QEMU uses.
  • KSM works for KVM, not inside any particular guest. KSM talks to the kernel, and the kernel knows how many memory pages each guest uses :) whether it is a Windows, Linux or ESXi guest...
  • My system has enough RAM and I do not think I have ever needed swap yet. Do you still suggest I do this? - Always a good suggestion; zram is extremely useful. I replaced all physical swap disks/partitions with zram on my systems.
  • With ZRAM you should not need more swap, but if KSM crashes, or all guests try to use their full memory at once, then you will.
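
For reference, a minimal sketch of /etc/ksmtuned.conf with the values suggested above; any parameter not shown keeps its default, and the logging lines are only needed if you want to watch KSM activity:

Code:
# /etc/ksmtuned.conf (excerpt)
KSM_NPAGES_MAX=10000
KSM_THRES_COEF=80

# optional: log ksmtuned decisions to syslog
LOGFILE=/var/log/syslog
DEBUG=1

Then apply it with "systemctl restart ksmtuned.service" as noted above; a reboot is not required.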



For all of your speed issues, I have written one FIO test line below. Test all your pools with this tool and you will see your disks' real speed. Also, do not forget to enable AHCI mode in your server's BIOS.

fio --randrepeat=1 --ioengine=libaio --direct=0 --gtod_reduce=1 --name=test --filename=test --bs=1M --iodepth=32 --size=5G --readwrite=randrw --rwmixread=50 --numjobs=8 --time_based --runtime=120

 
  • I limited my ARC to 8GB a week ago and my server has 10GB of memory free. I have 2 SSDs and 4 magnetic drives using ZFS (30TB total), should I lower this number further? - 3. Limit your ARC usage (you can find this in the PVE documentation); for 16 GB of RAM, 2 GB is enough.

The lower the ARC, the less performant your system is. According to the general rule that can be read all over the internet for ZFS, you should have 1-2 GB of ARC per 1 TB of data stored in your pool - if you want performance. You can get away with any lower number, but you will have a lot of cache misses. To optimize this, the general approach is to try out different ARC settings while running standardized tests of your usual workload. Anything else is just asking the oracle.
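
A minimal sketch of how the ARC cap is usually set on Proxmox VE, with an 8 GiB limit purely as an example (the value is in bytes):

Code:
# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=8589934592

# rebuild the initramfs so the option is picked up at boot
update-initramfs -u

# or change it at runtime without a reboot
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max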

  • This server is for home use and only a few users use it sparingly for different services. Are you saying do not use a separate LOG or CACHE disk for my spinning drives? - 6. Do not use a LOG or CACHE disk unless many people access your system at the same time.

An SLOG will only help if you have sync writes; if you do not have them, you will not gain any performance improvement.
As stated previously, an L2ARC can improve your performance, but most of the time it does not, because you have a very, very limited ARC with only 8 GB and the L2ARC will eat away at it.

Best is to try it out for yourself. If your SSD is fast and at an enterprise level (I had never heard of Seagate SSDs before), this could work. If the SSD is not good, you will not see any improvement at all - you can even worsen your results. Still, try it for yourself.
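
If you do want to try it, log and cache vdevs can be added to and removed from an existing pool without rebuilding it. A rough sketch, where POOLNAME and the /dev/disk/by-id paths are placeholders for your own pool and devices:

Code:
# add a separate log (SLOG) device
zpool add POOLNAME log /dev/disk/by-id/ata-EXAMPLE-SSD-part1

# add an L2ARC (cache) device
zpool add POOLNAME cache /dev/disk/by-id/ata-EXAMPLE-SSD-part2

# both can be removed again if they do not help
zpool remove POOLNAME ata-EXAMPLE-SSD-part1
zpool remove POOLNAME ata-EXAMPLE-SSD-part2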

  • Ok so before I had Discard enabled and then disabled it as suggested in this thread. How do I configure it to only run daily instead of after every delete? - Only applicable to KVM VMs, and it will waste storage space, so I would also suggest using it, e.g. daily rather than on every delete.

If you disabled it at the VM level, you cannot use it inside of the VM. Disabling it at the VM level - as you already experienced yourself - does not yield any performance gain and is counterproductive for ZFS, because you waste space. ZFS is faster the less you have stored in your pool. Discard takes care of releasing unused space inside of your VM (e.g. deleted data) and freeing it inside of your ZFS dataset (this only applies to VMs, not containers).

Please refer to the guest OS manual to set up discard as a scheduled maintenance task.
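
As a sketch of what that can look like (note the VM's virtual disk must still have the Discard option enabled so the trims reach ZFS): on a Linux guest the systemd fstrim timer already does a weekly trim, and on a Windows guest the built-in drive optimization task retrims on a schedule and can also be triggered manually from PowerShell:

Code:
# inside a Linux guest: periodic TRIM instead of discard on every delete
systemctl enable --now fstrim.timer
systemctl list-timers fstrim.timer

# inside a Windows guest (PowerShell), an on-demand retrim looks like:
#   Optimize-Volume -DriveLetter C -ReTrim -Verbose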


  • Ok so it sounds like the KSM suggestion will not help my Windows VM - Also only applicable to KVM VMs, due to a special syscall that only KVM/QEMU uses.

KSM can only help if you have multiple VMs with the same OS, and only if you use KVM VMs. Windows can only be virtualized as a KVM VM, so you already have this use case.
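
To see whether KSM is actually merging anything across your VMs, the kernel exposes counters under /sys/kernel/mm/ksm; a quick check might look like this (the 4 KiB page size below is an assumption for the usual x86 setup):

Code:
# 1 = KSM is running (ksmtuned toggles this based on memory pressure)
cat /sys/kernel/mm/ksm/run

# how many pages are currently shared / deduplicated
cat /sys/kernel/mm/ksm/pages_shared /sys/kernel/mm/ksm/pages_sharing

# rough estimate of RAM saved, in MiB
echo $(( $(cat /sys/kernel/mm/ksm/pages_sharing) * 4096 / 1048576 ))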

  • My system has enough RAM and I do not think I have ever needed swap yet. Do you still suggest I do this? - Always a good suggestion; zram is extremely useful. I replaced all physical swap disks/partitions with zram on my systems.

Check your swap usage with free; zram is optional. It is, however, better for overall performance than a storage-backed swap device. If you don't swap and you do not intend to, just keep your current setting. Without any swap, your system will kill processes if you run out of memory, so just be aware of that and monitor dmesg.
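
For completeness, a hand-rolled zram swap device might look like the sketch below; the 4 GiB size and the lz4 algorithm are just examples (algorithm availability depends on the kernel), and packages such as zram-tools can do the same persistently across reboots:

Code:
modprobe zram
zramctl --find --size 4G --algorithm lz4   # prints the device, e.g. /dev/zram0
mkswap /dev/zram0
swapon --priority 100 /dev/zram0           # prefer zram over any disk swap
swapon --show                              # verify it is in use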
 
fio --randrepeat=1 --ioengine=libaio --direct=0 --gtod_reduce=1 --name=test --filename=test --bs=1M --iodepth=32 --size=5G --readwrite=randrw --rwmixread=50 --numjobs=8 --time_based --runtime=120


Hi,

Any test you do on ZFS with a data size smaller than the ZFS ARC size has NO practical value (5 GB < 8 GB). And the author of this thread does not use random read/write, his block size is not 1M, and so on.

His main problem is insufficient RAM for ZFS. The second problem is that he has 2 ZFS pools, so the same RAM is split between 2 different pools. That is why the speed is fine until the ZFS cache fills up; once the ARC is full, the speed goes down.

Good luck / Bafta
 
In that test, 512 MB/s of incompressible data cycles through the cache; if your disks cannot read and write 256 MB/s, the cache fills up in a short time, and the test keeps running for 120 seconds. Also, I gave that test to exercise the LOG disk, because it fills the ARC very quickly; so if a LOG device really is needed, the system should continue onto the LOG device once the ARC is full.

If anyone objects that this test only creates random IO, they can switch the ioengine to sync.
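
For what it is worth, if the goal is to reflect the 5 GB copy from SSD to spinners while making sure the working set is bigger than the ARC, a sequential variant along these lines might be closer to the real workload (the 40G size and the target path are examples only; delete the test file afterwards):

Code:
fio --name=seqcopy --ioengine=libaio --direct=0 --rw=write --bs=1M \
    --iodepth=16 --numjobs=1 --size=40G --filename=/storage/fio-testfile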
 
@ertanerbek
@LnxBil
@guletz

Ok so everyone was mentioning that an 8GB ARC cache is not enough, so I raised it to 20GB. I still have the same problem when doing a 5 gigabyte file copy from my SSDs to my spinners. I was hovering at 16GB free RAM in the Proxmox user interface for my PVE server.

I also disabled discard on my Windows VM and used zpool set autotrim=on POOLNAME for all of my ZFS pools (I have 3 pools).

I ran systemctl restart ksmtuned.service to restart the service after adjusting the values.

The SSDs are Seagate Nytro 1230 960GB Enterprise SATA drives. The spinners are HGST SATA 8TB 7200RPM NAS drives.

Here is the result of zpool iostat storage -n 1 during a 5GB file copy from SSD to spinners:

Code:
zpool iostat storage -n 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
storage     1.91T  5.36T     27      4  4.80M  2.04M
storage     1.91T  5.36T      0      0      0      0
storage     1.91T  5.36T      0      0      0      0
storage     1.91T  5.36T      0      0      0      0
storage     1.91T  5.36T      0      0      0      0
storage     1.91T  5.36T      0      0      0      0
storage     1.91T  5.36T      0      0      0      0
storage     1.91T  5.36T      0      0      0      0
storage     1.91T  5.36T      0      0      0      0
storage     1.91T  5.36T      0      0      0      0
storage     1.91T  5.36T      0      0      0      0
storage     1.91T  5.36T      0      0      0      0
storage     1.91T  5.36T      0      0      0      0
storage     1.91T  5.36T      3      0   228K      0
storage     1.91T  5.36T      2      0  36.0K      0
storage     1.91T  5.36T      3    421   188K   256M
storage     1.91T  5.36T      0    363      0   322M
storage     1.91T  5.36T      0    526      0   388M
storage     1.91T  5.36T      5   1016   200K   310M
storage     1.91T  5.36T     23    296  95.9K   266M
storage     1.91T  5.36T      0    502      0   290M
storage     1.91T  5.36T      0    410      0   409M
storage     1.91T  5.36T      0     81      0  68.4M
storage     1.91T  5.36T      0    424      0   278M
storage     1.91T  5.36T      0    384      0   271M
storage     1.91T  5.36T      0    411      0   281M
storage     1.91T  5.36T      0    753      0   244M
storage     1.91T  5.36T      0    968      0   416M
storage     1.91T  5.36T      0    581      0   372M
storage     1.91T  5.36T      0    628      0   396M
storage     1.91T  5.36T      0    416      0   387M
storage     1.91T  5.36T      0    412      0   396M
storage     1.91T  5.36T      0    442      0   402M
storage     1.91T  5.36T      0    406      0   377M
storage     1.91T  5.36T      0    406      0   391M
storage     1.91T  5.36T      3    374  16.0K   215M
storage     1.91T  5.35T      0    607      0   242M
storage     1.91T  5.35T      0    369      0   364M
storage     1.91T  5.35T      0    357      0   358M
storage     1.91T  5.35T      0    368      0   368M
storage     1.91T  5.35T      0    348      0   346M
storage     1.91T  5.35T      0    194  20.0K   115M
storage     1.91T  5.35T      0    743  4.00K   268M
storage     1.91T  5.35T      3    675  48.0K   378M
storage     1.91T  5.35T      0    424      0   357M
storage     1.91T  5.35T      0    423      0   383M
storage     1.91T  5.35T      0    389      0   375M
storage     1.91T  5.35T      0    321      0   296M
storage     1.91T  5.35T      0    301      0   296M
storage     1.91T  5.35T      0    362      0   361M
storage     1.91T  5.35T      0    403      0   404M
storage     1.91T  5.35T      0    404      0   405M
storage     1.91T  5.35T      0    404      0   405M
storage     1.91T  5.35T      0    404      0   405M
storage     1.91T  5.35T      0    394      0   395M
storage     1.91T  5.35T      0    381      0   382M
storage     1.91T  5.35T      0    397      0   394M
storage     1.91T  5.35T      0    277      0   264M
storage     1.92T  5.35T      5    363   236K  51.1M
storage     1.92T  5.35T      0      0      0      0
storage     1.92T  5.35T      0      1      0  16.0K
storage     1.92T  5.35T      0      0      0      0
storage     1.92T  5.35T      0      0      0      0
 
@ertanerbek
@LnxBil
@guletz

Ok so everyone was mentioning that an 8GB ARC cache is not enough, so I raised it to 20GB. I still have the same problem when doing a 5 gigabyte file copy from my SSDs to my spinners. I was hovering at 16GB free RAM in the Proxmox user interface for my PVE server.


I never said anything like that; a 20 GB ARC is too big, and do not use L2ARC. About the lag and IO-delay issue: can you check your server with "atop 1" (once the atop screen is up, press Shift+C to watch CPU activity) and with "iotop -P -d 1"? Then you will see which process is creating the IO load on your CPU. I also suggest you do not use zvols; mount the ZFS pool as a directory and create qcow2 files instead.
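
A rough outline of that qcow2-on-dataset idea, with placeholder names throughout (tank/qcow2store for the dataset, qcow2store for the Proxmox storage ID, and <vmid>/scsi0 for your own VM ID and disk):

Code:
# create a plain ZFS filesystem (not a zvol) and register it as directory storage
zfs create tank/qcow2store
pvesm add dir qcow2store --path /tank/qcow2store --content images

# move an existing VM disk onto it as qcow2 (or pick the storage in the GUI)
qm move_disk <vmid> scsi0 qcow2store --format qcow2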
 
Block-based storage is always best in theory, I know, because the operating system or application can manage the block size and the on-disk format itself, which is a very useful feature. I have used, and will keep using, block-based storage in any professional project with a SAN device, because there the block-mapping work is done by the SAN device's CPU and generally all random IO lands in a big write-back cache. But on a homelab-type system all of that block-mapping calculation is done by the central CPU, which means much more IO load on the CPU. On top of that, if your disks cannot handle the requested transactions and cannot deliver enough IO, the CPU ends up handling each request three times: in the ARC (random), in the buffer, and as the sequential read or write on disk.

So try one more time and share the result with us, please...
 
