Very high IO Delay on any load

Goal 1 and biggest priority -- Fix the performance issue that is killing this production server. Starting to wonder if I made a mistake not going with VMWare.

Cold ZFS can be (and is) slow. It has to load data and metadata into the ARC first.

Goal 2 - Storage problem, after reading how great ZFS is in the documentation and the threads I decided to go that route. I don't fully regret this, but there are some consequences I was not aware of and would like to fix. The only image option is RAW and that means thick provisioning and wasted space. PVE thinks I am completely out of drive space even though "zpool list" shows I have 4.95TB free. I am thinking I might have missed a step and PVE is using the raw RAIDZ pool as a block device. I have been wondering if there was a way to put a file system on it before telling PVE to use the storage. I mean for example, could I have created the RAIDZ and gotten all of the advantages of that for the LVM portion of the picture, but then formatted it EXT4 or something so I could have used a better image format that would have allowed for thin provisioning? How do I get there from here? Can I move my images to an external HD, redo the storage pool, move the images back and reattach them? I assume so, but I have been unable to figure out how to do that.

Add the ZFS pool to PVE as directory storage (a dataset with a mountpoint) and you can use qcow2 images.
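
A minimal sketch of what that could look like, assuming a new dataset named guests-zpool/images, a mountpoint of /guests-images and a storage ID of guests-dir (all three are made up for illustration). The pool itself has its mountpoint set to none, so the dataset needs an explicit one. The script only prints the commands unless DRY_RUN=0 is set.

```shell
# Sketch only: prints the commands by default. Set DRY_RUN=0 to execute them.
POOL="guests-zpool"            # pool name from the zpool status output
DATASET="$POOL/images"         # hypothetical dataset that will hold qcow2 files
MOUNTPOINT="/guests-images"    # hypothetical mountpoint
STORAGE_ID="guests-dir"        # hypothetical PVE storage ID

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

add_dir_storage() {
    # A filesystem dataset (not a zvol) gives PVE a directory for image files.
    run zfs create -o mountpoint="$MOUNTPOINT" "$DATASET"
    # Register that directory as file-based storage for VM disk images.
    run pvesm add dir "$STORAGE_ID" --path "$MOUNTPOINT" --content images
}

add_dir_storage
```

Existing zvol-backed disks stay where they are; they would still have to be moved to the new storage one by one, and the pool needs enough free space for that.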

Code:
root@the-verse:~# zpool status -v guests-zpool
  pool: guests-zpool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 32h25m with 0 errors on Mon Oct  9 08:49:09 2017
config:

        NAME                           STATE     READ WRITE CKSUM
        guests-zpool                   ONLINE       0     0     0
          raidz1-0                     ONLINE       0     0     0
            ata-ST2000NX0403_W46069PW  ONLINE       0     0     0
            ata-ST2000NX0403_W4606CZP  ONLINE       0     0     0
            ata-ST2000NX0403_W46069Z8  ONLINE       0     0     0
            ata-ST2000NX0403_W4606DHW  ONLINE       0     0     0

errors: No known data errors

It's not mine, but look at the ZFS calculator in this post: https://forum.proxmox.com/threads/slow-io-and-high-io-waits.37422/#post-184974

Code:
root@the-verse:~# zfs list
NAME                                         USED  AVAIL  REFER  MOUNTPOINT
guests-zpool                                4.95T   156G   140K  none
guests-zpool/vm-100-disk-1                   792G   695G   253G  -
guests-zpool/vm-101-disk-1                   792G   458G   490G  -
guests-zpool/vm-102-disk-1                   792G   432G   516G  -
guests-zpool/vm-103-disk-1                   823G   772G   197G  -
guests-zpool/vm-104-disk-1                   792G   889G  58.6G  -
guests-zpool/vm-105-disk-1                   542G   611G  79.4G  -
guests-zpool/vm-105-state-Still_Evaluation  8.76G   163G  2.02G  -
guests-zpool/vm-106-disk-1                   318G   401G  59.2G  -
guests-zpool/vm-106-state-Sentora           8.76G   160G  4.14G  -
guests-zpool/vm-107-disk-1                   198G   331G  22.8G  -

Do you use snapshots?
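
To see where the space goes, the listing above can be totalled. This is a rough sketch that sums the USED and REFER columns of the child volumes; it assumes every value is printed in G, which holds for the listing in this thread but not in general (zfs list -Hp prints raw bytes and would be the safer input on a live system).

```shell
# Sum the USED and REFER columns of `zfs list` output read from stdin.
# Only child datasets (names containing "/") are counted.
# Assumption: all values carry a G suffix, as in the listing from this thread.
sum_used_refer() {
    awk 'index($1, "/") > 0 {
             sub(/G$/, "", $2); sub(/G$/, "", $4)
             used += $2; refer += $4
         }
         END { printf "USED %.2fG REFER %.2fG\n", used, refer }'
}

sum_used_refer <<'EOF'
guests-zpool                                4.95T   156G   140K  none
guests-zpool/vm-100-disk-1                   792G   695G   253G  -
guests-zpool/vm-101-disk-1                   792G   458G   490G  -
guests-zpool/vm-102-disk-1                   792G   432G   516G  -
guests-zpool/vm-103-disk-1                   823G   772G   197G  -
guests-zpool/vm-104-disk-1                   792G   889G  58.6G  -
guests-zpool/vm-105-disk-1                   542G   611G  79.4G  -
guests-zpool/vm-105-state-Still_Evaluation  8.76G   163G  2.02G  -
guests-zpool/vm-106-disk-1                   318G   401G  59.2G  -
guests-zpool/vm-106-state-Sentora           8.76G   160G  4.14G  -
guests-zpool/vm-107-disk-1                   198G   331G  22.8G  -
EOF
```

For this listing it prints USED 5066.52G REFER 1682.16G, so the volumes account for the full 4.95T while referencing only about a third of it. The gap is most likely reservations and snapshots; zfs list -o space breaks USED down per volume.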


Suggestions for performance
1. compression = lz4
2. volblocksize = 128k
3. Work out your RAM usage. Set ARC min/max limits and don't let other programs shrink it.
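
A small sketch for point 3, assuming example limits of 8 GiB and 16 GiB (placeholders, not a recommendation for this server). The module parameters take bytes, so the helper does the conversion and prints the line for /etc/modprobe.d/zfs.conf. The commented commands cover points 1 and 2; note that volblocksize can only be chosen when a zvol is created, so existing disks keep whatever they have.

```shell
# Print the modprobe option line that pins the ARC between two sizes given in GiB.
arc_options() {
    min_gib=$1
    max_gib=$2
    echo "options zfs zfs_arc_min=$((min_gib * 1024 * 1024 * 1024)) zfs_arc_max=$((max_gib * 1024 * 1024 * 1024))"
}

# Example sizes only; derive real ones from host RAM minus what the VMs need.
arc_options 8 16

# To apply: put the printed line in /etc/modprobe.d/zfs.conf and reboot
# (run update-initramfs -u first if the root filesystem is on ZFS).

# Compression can be switched on live; it only affects newly written blocks:
#   zfs set compression=lz4 guests-zpool
# Check what the existing disks use:
#   zfs get volblocksize,compression guests-zpool/vm-100-disk-1
```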

Suggestions for storage usage.
1. ashift 12 will use more space than ashift 9
2. Look at ZFS calculator
3. Persistent L2ARC can be a good improvement https://github.com/zfsonlinux/zfs/pull/2672
4. volblocksize 8k uses more space than volblocksize 128k.
5. On Linux VMs, discard/fstrim can free space on ZFS. On Windows it is done already.
Just use scsi and discard=on in the VM config.
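
For point 5, a hedged sketch of the command involved. It assumes the disk is (or will be) attached as scsi0 and that the volume follows the vm-ID-disk-1 naming from the listing; moving a disk to another bus can need driver or boot-order changes in the guest, so this only prints the command.

```shell
# Build the qm command that declares a disk on the SCSI bus with discard enabled.
# Arguments: VM ID, PVE storage ID, volume name.
discard_cmd() {
    echo "qm set $1 --scsi0 $2:$3,discard=on"
}

discard_cmd 100 guests-zpool vm-100-disk-1

# Inside a Linux guest afterwards, release unused blocks with:
#   fstrim -av
```

Discard only hands blocks back to the pool; a volume that still carries a full reservation will keep showing that reservation as used.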

Try to defragment your VM storage to gain performance and reduce space usage.

P.S. There was a time when it took ~2 hours to start all VMs after a server reboot. After some tuning I need only ~1 hour on the same server. BTW, I start the VMs manually.
 
In another server (not an impressive spec) with 3x240G SSDs (KINGSTON SHFS37A240G), VM read I/O runs with no delay.
As for ZFS sync writes, I have disabled them. I don't need double writes to the SSDs.
 
Available resources and $$, brother. The Proxmox server is brand new hardware, it was not cheap, and I had to push to get it. The ESXi servers it is replacing are 10 years old. Fine for what I will repurpose them for, but not generally what I want to continue to run my VMs on.

Have you priced hosting recently? How much do you think customers are paying for? Certainly not enough for us to set up HA. I don't have the scale of RackSpace or GoDaddy; we are a small mom and pop place. The advantages of using local hosting are:
  • The old adage “you get what you pay for” certainly applies to web hosting. Cut rate web hosting companies often overload their servers or double sell bandwidth. I am using CloudLinux Limits and low occupancy servers, and don't over subscribe the bandwidth.
  • Service - You can call us and speak with the person that has both direct control and direct responsibility for what you need done.
    • Most large web hosting companies only offer support through a ticketing system.
    • They also restrict the issues they will help with to billing and account access. When you need help using your website they will simply refer you to the support forums, leaving you with hours of research or unanswered questions.
  • If you own a local business, hosting your site in the area where your customers live may provide a boost to your rankings on Google.
    • From Google Webmaster Central: In our understanding of web content, Google considers both the IP address and the top-level domain (e.g. .com, .co.uk)… we often use the web server’s IP address as an added hint in our understanding of content.
    • If you're marketing to a local market, this may also help with targeting your search results.
  • Relationship – Knowing someone locally will help establish a more ‘real’ business relationship.
  • Face-to-Face – You can always meet in person.
  • Hosting local – Since the servers are local, they are also physically closer than the big hosting data centers. Often this means faster than connecting to a site halfway across the country or even in another country.
It is already hard to compete with the big guy and we can't do it on price. You're insane if you think a small, local place could afford 2x the cost and ongoing admin/maint/power expense to do a 2x HA, much less 3 times.

Yes, you're not wrong, depending on the customers and stuff.

In my line of work, the breakdown of a service for more than e.g. 30 minutes will result in non-productive time worth at least a medium sized car, so it is normal that everything is HA.

Did you also read my other remark concerning the space?
 
I got the two additional hard drives in. I have scheduled a maintenance window this weekend to convert the existing RAID-Z to a Stripe/Mirror setup. With the additional two drives the final raw capacity should be on par with my current capacity. Once this is done I will report back and let everyone know how it went. For now, I am getting ready to print this entire thread so I can crawl through it at least one more time and compile everything I need to remember when I actually do this tomorrow night.
 
I need a better option to do these backups. I don't know if I am just hitting the limit on how fast I can write to the external USB drive, or if there is something deeper broken about my Proxmox I/O. I gave myself a 12-hour window of downtime to do this and there was no way I was going to make it. Here is how it went.

I shut down all VMs to do offline backups before rebuilding. I currently have 7 VMs and this took about an hour to do. Then I started the backup on my first VM. It was done in offline mode with lzo compression. There was nothing else running on the server, so full resources should have been available to do this backup exclusively. It took 1 hour 35 minutes and the final file was 138GB. The next one was also offline with lzo, but the final file size was 285GB and it took 2 hours 39 minutes to complete. For the next one I tried no compression to see if I could speed this up. The backup file size ended up being 339GB and it took 3 hours to complete. At this point I had to give up, so I started the staged powering on of the VMs. After starting each one it took 30-45 minutes before the IO Delay had settled down enough for me to start the next one. Starting one at a time like this, my peak IO Delay hit 45.

I need suggestions as to how I am going to be able to do this in a timely manner. One idea that I had was to give myself a longer window by doing online backups and not taking the servers down until I was ready to rebuild. The issues I have here are that the backup will get stale the older it gets. If the servers are running transactions (CMS or e-commerce sites), there may be changes after the backup. The other issue is that the IO wait jumps to the point where the servers are hanging and very slow if I do a backup while they are all running.

Any ideas or help on how to proceed? Thank you.
 
Assuming no performance penalty for compression:
138GB/95min ≈ 24.2 MB/sec
285GB/159min ≈ 29.9 MB/sec
339GB/180min ≈ 31.4 MB/sec

Speeds look consistent with a USB 2.0 connection. Can't make a turtle into a hare...
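
The arithmetic can be reproduced with a few lines of shell. This sketch takes the file sizes and durations reported for the three backups (1 h 35 min, 2 h 39 min which is 159 minutes, and 3 h) and assumes decimal units, 1 GB = 1000 MB.

```shell
# Average throughput in MB/s for a backup of <size in GB> that took <hours> <minutes>.
# Assumes decimal units: 1 GB = 1000 MB.
mb_per_sec() {
    awk -v gb="$1" -v h="$2" -v m="$3" \
        'BEGIN { printf "%.1f\n", gb * 1000 / ((h * 60 + m) * 60) }'
}

mb_per_sec 138 1 35   # first VM, lzo
mb_per_sec 285 2 39   # second VM, lzo
mb_per_sec 339 3 0    # third VM, no compression
```

All three land between roughly 24 and 31 MB/s, which is in the range a USB 2.0 High Speed link typically sustains in practice.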
 
It is a USB 3 drive and port.

EDIT: So after my comment I went and checked. Like I thought, both the drive and the port are USB 3. HOWEVER: the output of lsusb -vv shows that the bcdUSB line states 2.10. Apparently the OS is not recognizing and connecting it as USB 3. Hmm....

Code:
Bus 002 Device 003: ID 1058:25ee Western Digital Technologies, Inc.
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.10
  bDeviceClass            0 (Defined at Interface level)
  bDeviceSubClass         0
  bDeviceProtocol         0
  bMaxPacketSize0        64
  idVendor           0x1058 Western Digital Technologies, Inc.
  idProduct          0x25ee
  bcdDevice           40.04
  iManufacturer           2 Western Digital
  iProduct                3 My Book 25EE
  iSerial                 1 375347524B423343
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength           32
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          0
    bmAttributes         0xc0
      Self Powered
    MaxPower               26mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           2
      bInterfaceClass         8 Mass Storage
      bInterfaceSubClass      6 SCSI
      bInterfaceProtocol     80 Bulk-Only
      iInterface              0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x02  EP 2 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
Binary Object Store Descriptor:
  bLength                 5
  bDescriptorType        15
  wTotalLength           22
  bNumDeviceCaps          2
  USB 2.0 Extension Device Capability:
    bLength                 7
    bDescriptorType        16
    bDevCapabilityType      2
    bmAttributes   0x0000f41e
      Link Power Management (LPM) Supported
  SuperSpeed USB Device Capability:
    bLength                10
    bDescriptorType        16
    bDevCapabilityType      3
    bmAttributes         0x00
    wSpeedsSupported   0x000e
      Device can operate at Full Speed (12Mbps)
      Device can operate at High Speed (480Mbps)
      Device can operate at SuperSpeed (5Gbps)
    bFunctionalitySupport   1
      Lowest fully-functional device speed is Full Speed (12Mbps)
    bU1DevExitLat          10 micro seconds
    bU2DevExitLat        2047 micro seconds
Device Status:     0x0001
  Self Powered

So how do I get this thing to operate at SuperSpeed so I can get closer to 300MB/sec?
This is what I have in fstab:
/dev/sdg1 /media/mybook auto noauto,x-systemd.automount 0 2
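
One way to confirm what the link actually negotiated, independent of the descriptor dump: the kernel exposes the speed of each attached USB device in sysfs, in Mbit/s. This is a sketch under the assumption that the usual /sys/bus/usb/devices layout is present; lsusb -t shows the same information as 480M or 5000M per port.

```shell
# Translate the sysfs "speed" value (Mbit/s) into the USB speed class it implies.
usb_speed_label() {
    case "$1" in
        1.5)   echo "Low Speed (USB 1.x)" ;;
        12)    echo "Full Speed (USB 1.x)" ;;
        480)   echo "High Speed (USB 2.0)" ;;
        5000)  echo "SuperSpeed (USB 3.0)" ;;
        10000) echo "SuperSpeed+ (USB 3.1)" ;;
        *)     echo "unknown" ;;
    esac
}

# List every USB device that reports a speed, with its product string if any.
for dev in /sys/bus/usb/devices/*; do
    [ -r "$dev/speed" ] || continue
    product=$(cat "$dev/product" 2>/dev/null || echo "?")
    speed=$(cat "$dev/speed")
    echo "${dev##*/}: $product -> $(usb_speed_label "$speed")"
done
```

If the My Book shows up at 480, the drive enumerated on the USB 2.0 side of the port, which would fit the bcdUSB 2.10 value in the dump. Cable, port and any hub in between are the usual things to swap.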

Also, I have seen it asked many times, but never answered satisfactorily. Why do offline backups fire up the VM? I have seen the answers about it using a qemu environment and how it is safe to copy the data online because of how it is done. That is just saying what they do, not why. So I guess, restated, the better question is: if the system is powered off, should the backup not consist of simply copying the HD image and some config to a new location? You don't need to boot the VM to copy the HD image somewhere.
 
Doing more research, both of the rear USB ports on my Dell PowerEdge R730 are definitely USB 3. However, I do have a USB to PS/2 keyboard and mouse adapter plugged in back there, going to one of the ports for my KVM. There isn't any chance that could cause the internal USB hub to go all 2.1 only on me, is there? The only USB 2 ports on this server are the two in the front. I can move the adapter up there, but it wouldn't be ideal.

Another update: I am going to bypass USB altogether. I just ordered this (https://www.amazon.com/SMAKN-22-pin...coding=UTF8&psc=1&refRID=HCGT9ZSQW6D34RYWYAAP) and I am going to pull the hard drive out of the USB enclosure for my backup and restore. At least this is a temporary solution that lets me move forward. I researched the external USB drive before I purchased it: the Western Digital drives inside have serial numbers that are under warranty with WD independent of the enclosure. Also, they are standard SATA drives and don't have the funky USB direct drive electronics WD sometimes uses.
 
Your current setup is ZFS based, right? Just power off all machines and take a snapshot of everything. Power them up again and transfer the data to an external drive with zfs (send/receive). At the time of the actual rebuild of your disks, just power them off, take a snapshot again, transfer the difference and the configs to the external drive, reinstall, copy the files and the configs back, and you should be golden.
 
The steps are here, but what you are saying is new and foreign to me. Since I am rebuilding, I am worried about making sure my backups are 100% guaranteed to be restorable. Could you please expound on this?
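
A rough sketch of how those steps could map to commands. Several things here are assumptions for illustration: a second pool named backup already created on the external drive and mounted at /backup, the snapshot names pre and final, and the tarball name. The script only prints the commands unless DRY_RUN=0 is set, so it can be read through before anything runs.

```shell
# Sketch only: prints the commands by default. Set DRY_RUN=0 to execute them.
SRC="guests-zpool"   # pool from this thread
DST="backup"         # hypothetical pool on the external drive, mounted at /backup
CFG_TAR="/$DST/pve-qemu-configs.tar.gz"   # hypothetical archive of the VM configs

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1: VMs powered off. Snapshot everything, then the VMs can be started
# again while the bulk of the data is copied from the snapshot.
first_pass() {
    run zfs snapshot -r "$SRC@pre"
    run sh -c "zfs send -R $SRC@pre | zfs receive -u $DST/$SRC"
}

# Step 2: maintenance window, VMs powered off. Only the changes since @pre
# are sent, plus the VM config files from /etc/pve/qemu-server.
final_pass() {
    run zfs snapshot -r "$SRC@final"
    run sh -c "zfs send -R -i $SRC@pre $SRC@final | zfs receive -F -u $DST/$SRC"
    run tar czf "$CFG_TAR" -C /etc/pve qemu-server
    # Check that every volume arrived with both snapshots before destroying anything.
    run zfs list -r -t snapshot "$DST/$SRC"
}

# Step 3: after the pool has been rebuilt under the same name.
restore() {
    run sh -c "zfs send -R $DST/$SRC@final | zfs receive -F $SRC"
    run tar xzf "$CFG_TAR" -C /etc/pve
}

first_pass
final_pass
restore
```

The replicated stream carries dataset properties along with the data, which may include the reservations on the volumes, so the external pool needs enough room for them. A trial restore of one small volume to a scratch name before the rebuild is a cheap way to gain confidence that the copy is usable.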