Very high IO Delay on any load

John C. Reid

Oct 17, 2017
I am getting very high IO Delay on any kind of load. It basically stalls all VMs. We are a web host and we have five CloudLinux/cPanel VMs (all with about 100 sites each), a Windows 2016/Plesk VM, and a CentOS7/Sentora VM running. When this happens all sites stop responding, I get kernel:NMI watchdog: BUG: soft lockup messages, daemons die or panic, and we have had a MySQL database become corrupt and need repair. I have been struggling with this for weeks and following the other threads on the forums from the many other users having similar problems.

Very early this morning I did an update, hoping the upgrade to PVE 5.1-36 with the 4.13.4-1 kernel would help, but it did not. Upon boot of the first VM the IO Delay instantly shot to above 25 and within 30 seconds hovered between 30 and 40 the entire time the system was booting. I have to wait a minimum of 15 minutes between starting each VM and let it settle back down or it kills the entire system with IO Delays hitting between 70 and 90. This is the same as it was prior to the update. I have also had to turn off all backups and system updates because a single VM starting either of those is going to push the IO Delay into the 30+ range and basically stall all VMs.

Prior to the update I was on PVE 5.0-32 with kernel 4.10.17-4. I have been having these issues since 2 or 3 updates and at least 1 kernel ago. Like I mentioned, it has been a few weeks.

Here is my current setup:
Dell PowerEdge R730
CPU: Dual Xeon E5-2630 v4 (10 core, 20 Thread)
RAM: 128GB (4x32 RDIMM, 2400MT/s, 2RX4, 8G, DDR4, R)
Storage: PERC H330 RAID Controller
Slots 0 & 1 in RAID 1 mirror - OS Drives - 128 GB SSD
Slot 2 is a 128GB SSD set up as the cache drive for the ZFS pool (no longer in use; I removed the ZIL and L2ARC while troubleshooting)
Slots 3-6 are 2TB platter drives in RAIDZ configuration (HBA mode on controller)

The OS is installed on the mirrored SSD drives. The images are on a RAIDZ1 pool. The images are RAW, which I would like to change (more on this later). I will now post some interesting, if not necessarily relevant, lines from my last boot dmesg. They should answer questions about the hardware setup and what the system is seeing, and possibly provide some clues.

Code:
[    0.000000] [Firmware Bug]: TSC_DEADLINE disabled due to Errata; please update microcode to version: 0xb000020 (or later)
[    0.000000] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[    0.124043] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
[    0.124043] ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)
[    0.225138] smp: Brought up 2 nodes, 40 CPUs
[    0.225138] smpboot: Total of 40 processors activated (176003.68 BogoMIPS)
[    2.377991] scsi 0:0:3:0: Direct-Access     ATA      ST2000NX0403     NA05 PQ: 0 ANSI: 6
[    2.441801] scsi 0:0:4:0: Direct-Access     ATA      ST2000NX0403     NA05 PQ: 0 ANSI: 6
[    2.477820] scsi 0:0:5:0: Direct-Access     ATA      ST2000NX0403     NA05 PQ: 0 ANSI: 6
[    2.513780] scsi 0:0:6:0: Direct-Access     ATA      ST2000NX0403     NA05 PQ: 0 ANSI: 6
[    2.548685] scsi 0:2:0:0: Direct-Access     DELL     PERC H330 Mini   4.27 PQ: 0 ANSI: 5
[    2.589046] sd 0:0:2:0: Attached scsi generic sg0 type 0
[    2.589167] sd 0:0:3:0: Attached scsi generic sg1 type 0
[    2.589286] sd 0:0:4:0: Attached scsi generic sg2 type 0
[    2.589422] sd 0:0:5:0: Attached scsi generic sg3 type 0
[    2.589543] sd 0:0:6:0: Attached scsi generic sg4 type 0
[    2.589659] sd 0:2:0:0: Attached scsi generic sg5 type 0
[    2.590117] sd 0:2:0:0: [sdf] 233308160 512-byte logical blocks: (119 GB/111 GiB)
[    2.590240] sd 0:2:0:0: [sdf] Write Protect is off
[    2.590241] sd 0:2:0:0: [sdf] Mode Sense: 1f 00 10 08
[    2.590369] sd 0:2:0:0: [sdf] Write cache: disabled, read cache: disabled, supports DPO and FUA
[    2.592969] sd 0:0:2:0: [sda] 468862128 512-byte logical blocks: (240 GB/224 GiB)
[    2.593170]  sdf: sdf1 sdf2 sdf3
[    2.593708] sd 0:2:0:0: [sdf] Attached SCSI disk
[    2.594348] sd 0:0:5:0: [sdd] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[    2.594392] sd 0:0:3:0: [sdb] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[    2.594705] sd 0:0:2:0: [sda] Write Protect is off
[    2.594706] sd 0:0:2:0: [sda] Mode Sense: 9b 00 10 08
[    2.594816] sd 0:0:4:0: [sdc] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[    2.594824] sd 0:0:6:0: [sde] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[    2.595314] sd 0:0:2:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    2.600817]  sda: sda1 sda2
[    2.607987] sd 0:0:2:0: [sda] Attached SCSI disk
[    2.646535] ata4: SATA link down (SStatus 0 SControl 300)
[    2.720004] sd 0:0:4:0: [sdc] Write Protect is off
[    2.720008] sd 0:0:4:0: [sdc] Mode Sense: 9b 00 10 08
[    2.721243] sd 0:0:4:0: [sdc] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    2.721611] sd 0:0:5:0: [sdd] Write Protect is off
[    2.721614] sd 0:0:5:0: [sdd] Mode Sense: 9b 00 10 08
[    2.722888] sd 0:0:5:0: [sdd] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    2.726344] sd 0:0:3:0: [sdb] Write Protect is off
[    2.726347] sd 0:0:3:0: [sdb] Mode Sense: 9b 00 10 08
[    2.727628] sd 0:0:3:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    2.729133] sd 0:0:6:0: [sde] Write Protect is off
[    2.729136] sd 0:0:6:0: [sde] Mode Sense: 9b 00 10 08
[    2.730367] sd 0:0:6:0: [sde] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    2.894610]  sdd: sdd1 sdd9
[    2.895920]  sdc: sdc1 sdc9
[    2.897016]  sdb: sdb1 sdb9
[    2.905710]  sde: sde1 sde9
[    2.958925] ata6: SATA link down (SStatus 0 SControl 300)
[    3.021492] sd 0:0:5:0: [sdd] Attached SCSI disk
[    3.050236] sd 0:0:4:0: [sdc] Attached SCSI disk
[    3.056027] sd 0:0:3:0: [sdb] Attached SCSI disk
[    3.064492] sd 0:0:6:0: [sde] Attached SCSI disk

I have a couple of different goals here, and some of the required steps to meet the goals may be related to each other or be convenient to do at the same time.

Goal 1 and biggest priority -- Fix the performance issue that is killing this production server. Starting to wonder if I made a mistake not going with VMWare.

Goal 2 - Storage problem, after reading how great ZFS is in the documentation and the threads I decided to go that route. I don't fully regret this, but there are some consequences I was not aware of and would like to fix. The only image option is RAW and that means thick provisioning and wasted space. PVE thinks I am completely out of drive space even though "zpool list" shows I have 4.95TB free. I am thinking I might have missed a step and PVE is using the raw RAIDZ pool as a block device. I have been wondering if there was a way to put a file system on it before telling PVE to use the storage. I mean for example, could I have created the RAIDZ and gotten all of the advantages of that for the LVM portion of the picture, but then formatted it EXT4 or something so I could have used a better image format that would have allowed for thin provisioning? How do I get there from here? Can I move my images to an external HD, redo the storage pool, move the images back and reattach them? I assume so, but I have been unable to figure out how to do that.

This server (provided I add the storage space and RAM to keep up) should easily be able to do 5-10 times this number of VMs without breaking a sweat, and it was purchased with the promise of being able to do that. Something is seriously wrong here. Below is some more information that might be of interest.

Entire dmesg from last boot: pastebin(dot)com/FScr3LUe -- I can't post external links, because that might be helpful.

Code:
root@the-verse:~# zpool list
NAME           SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
guests-zpool  7.25T  2.30T  4.95T         -    29%    31%  1.00x  ONLINE  -
Code:
root@the-verse:~# zpool status -v guests-zpool
  pool: guests-zpool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 32h25m with 0 errors on Mon Oct  9 08:49:09 2017
config:

        NAME                           STATE     READ WRITE CKSUM
        guests-zpool                   ONLINE       0     0     0
          raidz1-0                     ONLINE       0     0     0
            ata-ST2000NX0403_W46069PW  ONLINE       0     0     0
            ata-ST2000NX0403_W4606CZP  ONLINE       0     0     0
            ata-ST2000NX0403_W46069Z8  ONLINE       0     0     0
            ata-ST2000NX0403_W4606DHW  ONLINE       0     0     0

errors: No known data errors
Code:
root@the-verse:~# zfs list
NAME                                         USED  AVAIL  REFER  MOUNTPOINT
guests-zpool                                4.95T   156G   140K  none
guests-zpool/vm-100-disk-1                   792G   695G   253G  -
guests-zpool/vm-101-disk-1                   792G   458G   490G  -
guests-zpool/vm-102-disk-1                   792G   432G   516G  -
guests-zpool/vm-103-disk-1                   823G   772G   197G  -
guests-zpool/vm-104-disk-1                   792G   889G  58.6G  -
guests-zpool/vm-105-disk-1                   542G   611G  79.4G  -
guests-zpool/vm-105-state-Still_Evaluation  8.76G   163G  2.02G  -
guests-zpool/vm-106-disk-1                   318G   401G  59.2G  -
guests-zpool/vm-106-state-Sentora           8.76G   160G  4.14G  -
guests-zpool/vm-107-disk-1                   198G   331G  22.8G  -
Code:
root@the-verse:~# zfs get all guests-zpool
NAME          PROPERTY              VALUE                  SOURCE
guests-zpool  type                  filesystem             -
guests-zpool  creation              Fri Jul 28 17:19 2017  -
guests-zpool  used                  4.95T                  -
guests-zpool  available             156G                   -
guests-zpool  referenced            140K                   -
guests-zpool  compressratio         1.00x                  -
guests-zpool  mounted               no                     -
guests-zpool  quota                 none                   default
guests-zpool  reservation           none                   default
guests-zpool  recordsize            128K                   default
guests-zpool  mountpoint            none                   local
guests-zpool  sharenfs              off                    default
guests-zpool  checksum              on                     default
guests-zpool  compression           off                    default
guests-zpool  atime                 off                    local
guests-zpool  devices               on                     default
guests-zpool  exec                  on                     default
guests-zpool  setuid                on                     default
guests-zpool  readonly              off                    default
guests-zpool  zoned                 off                    default
guests-zpool  snapdir               hidden                 default
guests-zpool  aclinherit            restricted             default
guests-zpool  createtxg             1                      -
guests-zpool  canmount              on                     default
guests-zpool  xattr                 on                     default
guests-zpool  copies                1                      default
guests-zpool  version               5                      -
guests-zpool  utf8only              off                    -
guests-zpool  normalization         none                   -
guests-zpool  casesensitivity       sensitive              -
guests-zpool  vscan                 off                    default
guests-zpool  nbmand                off                    default
guests-zpool  sharesmb              off                    default
guests-zpool  refquota              none                   default
guests-zpool  refreservation        none                   default
guests-zpool  guid                  5417188825227061568    -
guests-zpool  primarycache          all                    default
guests-zpool  secondarycache        all                    default
guests-zpool  usedbysnapshots       0B                     -
guests-zpool  usedbydataset         140K                   -
guests-zpool  usedbychildren        4.95T                  -
guests-zpool  usedbyrefreservation  0B                     -
guests-zpool  logbias               latency                default
guests-zpool  dedup                 off                    local
guests-zpool  mlslabel              none                   default
guests-zpool  sync                  standard               default
guests-zpool  dnodesize             legacy                 default
guests-zpool  refcompressratio      1.00x                  -
guests-zpool  written               140K                   -
guests-zpool  logicalused           1.14T                  -
guests-zpool  logicalreferenced     40K                    -
guests-zpool  volmode               default                default
guests-zpool  filesystem_limit      none                   default
guests-zpool  snapshot_limit        none                   default
guests-zpool  filesystem_count      none                   default
guests-zpool  snapshot_count        none                   default
guests-zpool  snapdev               hidden                 default
guests-zpool  acltype               off                    default
guests-zpool  context               none                   default
guests-zpool  fscontext             none                   default
guests-zpool  defcontext            none                   default
guests-zpool  rootcontext           none                   default
guests-zpool  relatime              off                    default
guests-zpool  redundant_metadata    all                    default
guests-zpool  overlay               off                    default
Code:
I can't post images either!!
Again, don't want to allow a user to provide useful information (SERIOUSLY MODS?!?)

photos(dot)app(dot)goo(dot)gl/zJuUfJc9d9eDNFMr1
photos(dot)app(dot)goo(dot)gl/fQEV9KW9DLPvGmhu1
photos(dot)app(dot)goo(dot)gl/X9cyIkJk0eLmqIUn1
 
Honestly, you have several serious issues with the way the zpool was configured, and unfortunately the only way to recover from this is to back up all data, destroy the pool, re-create it and then restore all data. The easiest way to accomplish that would be to use Proxmox's built-in backup feature. Mount an external drive (or NFS/SMB share, or whatever), add that as a Directory type storage with 'VZDump backup file' as the content type (under Datacenter -> Storage), then back up each VM. Make sure the external storage is selected when you back up each VM.
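
As a rough sketch of that backup step from the shell instead of the web UI: this assumes the external drive is already mounted at the hypothetical path /mnt/usb-backup and uses a hypothetical storage ID of usb-backup. The VM IDs are the ones visible in the 'zfs list' output above. The commands are only printed unless DRY_RUN=0 is set, so nothing touches a production host by accident.

```shell
#!/bin/sh
# Sketch only: storage ID and mount path are hypothetical placeholders.
STORAGE_ID="usb-backup"
MOUNT_PATH="/mnt/usb-backup"

# Print each command by default; set DRY_RUN=0 to actually run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Register the mounted drive as Directory storage that holds backups.
add_backup_storage() {
    run pvesm add dir "$STORAGE_ID" --path "$MOUNT_PATH" --content backup
}

# Back up one VM to that storage; stop mode shuts the guest down
# for the duration of the backup.
backup_vm() {
    run vzdump "$1" --storage "$STORAGE_ID" --mode stop --compress lzo
}

add_backup_storage
for vmid in 100 101 102 103 104 105 106 107; do
    backup_vm "$vmid"
done
```

Review the printed commands first, then rerun with DRY_RUN=0 once they look right for your setup.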

Storage: PERC H330 RAID Controller ... (HBA mode on controller)
I did a quick search and it doesn't appear that the H330 can be easily cross-flashed to IT mode firmware. Without doing so, this card is probably not operating in a pure HBA mode and still has some active RAID functions, which is against best practices with ZFS. This is probably not the source of the performance issue, however, as at least one ZFS on Linux user said they used this card for over a year with no ill effects.

The images are on a RAIDZ1 pool.
You really should be using striped mirrors (similar to RAID10) rather than any raidz level for performance reasons. Especially when running VMs, maintaining high IOPS on storage is critical. The random IOPS of a raidz vdev will be roughly equivalent to a single hard drive, and that's really bad. Striped mirrors will have the performance of roughly n/2 disks, so 10 disks in 5 striped mirrored pairs will have the IOPS performance of about 5 disks. I would only use raidz for archival/backup purposes where maximizing storage space is the most important and performance isn't critical. Also, with modern large capacity disks a raidz level of less than raidz2 is ill-advised.
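
To put rough numbers on that rule of thumb, here is a small sketch. The figure of 100 random IOPS per disk is an assumed ballpark for a spinning drive, not a measurement of these particular disks, and the function names are made up for illustration.

```shell
#!/bin/sh
# Rule-of-thumb random IOPS estimates. DISK_IOPS is an assumed
# ballpark for one spinning disk, not a measured value.
DISK_IOPS=100

# A raidz vdev delivers roughly the random IOPS of one member disk.
raidz_iops() {   # $1 = number of raidz vdevs in the pool
    echo $(( $1 * DISK_IOPS ))
}

# Striped mirrors: about one disk's worth of IOPS per mirrored pair.
mirror_iops() {  # $1 = total number of disks (even)
    echo $(( $1 / 2 * DISK_IOPS ))
}

echo "4 disks, one raidz1 vdev:    ~$(raidz_iops 1) IOPS"
echo "4 disks, 2 striped mirrors:  ~$(mirror_iops 4) IOPS"
echo "10 disks, 5 striped mirrors: ~$(mirror_iops 10) IOPS"
```

Under these assumptions the same four disks give about twice the random IOPS as striped mirrors, at the cost of usable space.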

The only image option is RAW and that means thick provisioning and wasted space.
Proxmox does support thin provisioning with ZFS. Under Datacenter -> Storage -> Add -> ZFS, there is a check box for Thin provisioning. I am currently using this and it does work. You should also enable compression on the pool, as the space savings can be significant and the performance penalty on modern processors is negligible (`zfs set compression=lz4 $POOL_NAME`). Note that turning on compression on a pool will only compress newly written data; it will not compress data that has already been written to the pool.
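
A minimal command-line sketch of both settings, assuming the pool name from this thread and a hypothetical Proxmox storage ID of the same name. The `sparse` storage option is, to my understanding, what the Thin provisioning check box sets; verify against your own /etc/pve/storage.cfg. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
POOL_NAME="guests-zpool"    # pool name from the thread
STORAGE_ID="guests-zpool"   # hypothetical Proxmox storage ID

# Print each command by default; set DRY_RUN=0 to actually run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Turn on lz4 compression; only data written afterwards is compressed.
enable_compression() {
    run zfs set compression=lz4 "$POOL_NAME"
}

# Show the setting and the ratio achieved so far.
show_compression() {
    run zfs get compression,compressratio "$POOL_NAME"
}

# Ask Proxmox to create new VM disks as sparse (thin) volumes.
enable_thin_provisioning() {
    run pvesm set "$STORAGE_ID" --sparse 1
}

enable_compression
show_compression
enable_thin_provisioning
```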

"zpool list" shows I have 4.95TB free
This can be confusing because the 'zpool list' command is showing all raw disk capacity in the pool, without redundancy. So if you have 3x 2TB disks in a raidz1, 'zpool list' will show the pool size as 6TB, even though only 4TB can be used due to one disk being used for parity information. 'zfs list' will more closely provide the numbers you're expecting.
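
The arithmetic behind that, as a quick sketch. It ignores metadata, padding and the TB versus TiB difference, so treat the results as approximations; the function names are made up for illustration.

```shell
#!/bin/sh
# Raw capacity: every disk counted, which is what 'zpool list' reports.
raidz_raw() {      # $1 = disks, $2 = size per disk in TB
    echo $(( $1 * $2 ))
}

# Approximate usable capacity: parity disks subtracted first.
raidz_usable() {   # $1 = disks, $2 = parity disks, $3 = size per disk in TB
    echo $(( ($1 - $2) * $3 ))
}

echo "3 x 2TB raidz1: raw $(raidz_raw 3 2) TB, usable about $(raidz_usable 3 1 2) TB"
echo "4 x 2TB raidz1: raw $(raidz_raw 4 2) TB, usable about $(raidz_usable 4 1 2) TB"
```

For the four-disk pool in this thread that is 8 TB raw, which is roughly the 7.25T SIZE that 'zpool list' prints once expressed in binary units, and about 6 TB usable.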

I have been wondering if there was a way to put a file system on it before telling PVE to use the storage.
There is no way to do this; ZFS is a combined volume manager (like LVM), software RAID and filesystem all in one. The numerous features and benefits of ZFS wouldn't be possible otherwise.

Finally, used space in a zpool should be kept under 80% for performance reasons. Otherwise read, write and resilvering performance will suffer (in my experience, greatly).
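
One way to keep an eye on that 80% mark is a small check like the sketch below. The function name is made up; the input format is what `zpool list -H -o name,capacity` prints (name, tab, percentage). A sample line matching the pool in this thread is fed in so the sketch runs without ZFS installed.

```shell
#!/bin/sh
# Flag pools whose utilization is above a threshold (percent).
# Expects lines of "name<TAB>capacity%" on stdin.
check_capacity() {   # $1 = threshold in percent
    awk -v limit="$1" '{
        cap = $2
        sub(/%/, "", cap)
        if (cap + 0 > limit)
            print "WARNING: " $1 " is " cap "% full"
        else
            print "OK: " $1 " is " cap "% full"
    }'
}

# Sample input matching the CAP column shown earlier in the thread.
printf 'guests-zpool\t31%%\n' | check_capacity 80

# On a live system:
#   zpool list -H -o name,capacity | check_capacity 80
```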

I understand your frustration and I'm sorry to hear about the situation you're having. ZFS is complicated and there are a lot of gotchas/best practices/rules of thumb that aren't always clear if you haven't worked with it before. It's very different from standard RAID and a typical filesystem. That being said, once you get the hang of it the benefits far outweigh its quirks.
 
Thank you very much for taking the time to read all of that and providing me with a very usable answer to my questions. Noting my second goal, I already guessed a rebuild was in order.

If I get an 8TB USB 3 external drive for backups, will that work to rebuild?

Any guesses on how long I would be down to backup, rebuild, restore?

I have (4) 2TB drives in a RAIDZ1 providing roughly 6TB of storage. I am thinking that to do the striped mirrors option and keep roughly the same storage size I will need two more disks. That is about $1000 (OUCH) because these are 2.5" drives. Will I reclaim enough space after the rebuild using the thin provisioning that I should still be able to expand? Can I continue to add drives to grow later, or must I appropriately size my storage now? How likely is it that I just missed the checkbox for thin provisioning, or is it because my setup somehow only allowed me the RAW image option?

Thank you again for your assistance.
 
John, make sure you test those backed up guests. Since you may be using a USB external drive for the backup storage, reliability comes into question.
 
Starting to wonder if I made a mistake not going with VMWare.
I've deployed both VMWare and Proxmox in production environments. VMWare is very expensive to license, especially if you want to use the live migration features (Storage vMotion). It's a solid platform and there's tons of quality documentation and technicians available to support it. However, it's not flawless. I've experienced PSODs (purple screen of death) related to bugs in the VMWare kernel, even on hardware fully supported by the compatibility list. VMFS is also nothing like ZFS, and some of the older versions had weird limitations: a maximum volume size of 2TB, no block level checksumming, and all RAID functions had to be done with expensive RAID hardware. RAID rebuilds could take days (with a performance impact) on busy arrays since even empty blocks had to be rebuilt (compared to ZFS resilvering, which only rebuilds allocated blocks). You also couldn't manage the RAID array with any built-in functions, so simple things like replacing a failed disk were a nightmare with vendor-supplied tools like MegaCli. It was also maddening having to use a Windows VM (or workstation) to run the vSphere client just to manage everything, and vCenter Server also had to run on a Windows server (2 more licenses to pay for). A lot of this has improved with the vSphere Web Client and the vCenter Server Appliance... a bit too little, too late in my opinion.

Proxmox is built on open source components (if that matters to you), has ZFS support (a huge win), a great community, helpful developers, full web UI, broader hardware support and runs on top of a standard Debian install so you can actually log into the base OS and perform troubleshooting, tweak settings, repair things, install drivers, etc. You're not locked into some limited esxcli interface. I would highly recommend purchasing a Proxmox subscription ( https://www.proxmox.com/en/proxmox-ve/pricing ) as their pricing is extremely reasonable and as I mentioned the developers are very helpful and knowledgeable.
 
If I get an 8TB USB 3 external drive for backups, will that work to rebuild?
Yes, that should work. I would format it as ext4 for simplicity's sake and mount it as directory storage in Proxmox.

Any guesses on how long I would be down to backup, rebuild, restore?
Sorry, I really have no idea. From the situation you've described though, it's going to take a long time. Shutting down the VM you're backing up rather than trying to back it up while it's running will help speed up the backup a bit. Ideally, if you had a 2nd server to restore to (even temporarily), that would allow you to back up 1 VM at a time and restore it to the 2nd server so only 1 VM was down at a time. Once restored to the 2nd server you could delete the VM off the 1st server and possibly speed up the backup process for the remaining VMs (since your pool appears to be almost full and that will impact performance).

I have (4) 2TB drives in a RAIDZ1 providing roughly 6TB of storage. I am thinking that to do the striped mirrors option and keep roughly the same storage size I will need two more disks. That is about $1000 (OUCH) because these are 2.5" drives.
Yeah...that's the downside to 2.5" disks. They're way more expensive than their 3.5" counterparts for large capacity disks. On the bright side if you ever decide to move to flash/SSD storage, you won't need 2.5" to 3.5" adapters. One other note (and some may disagree with me on this) but if you can afford it, I would strongly recommend using only SAS drives for production workloads as opposed to SATA. Often the price difference is small and there's a much smaller chance (i.e., 100 times smaller) of encountering a hard error versus consumer grade SATA drives. Furthermore, the SATA data channel has a higher occurrence of silent data corruption than SAS. Also, if using SAS expanders of any kind (which I wouldn't recommend to begin with), you're extremely likely to have issues with SATA drives attached to them.

Will I reclaim enough space after the rebuild using the thin provisioning that I should still be able to expand? Can I continue to add drives to grow later, or must I appropriately size my storage now? How likely is it that I just missed the checkbox for thin provisioning, or is it because my setup somehow only allowed me the RAW image option?
According to your 'zfs list' output, you're using almost 5TB of storage, so I think you'll need more disks. The way to tell for sure is to log into each VM and add up the space used from a 'df' output. You want to make sure your 'zpool list' utilization doesn't ever exceed 80% or things can get fragmented and there's no going back from that. The nice thing about striped mirrors is you can always add a pair of drives at any time as a new mirror vdev in the existing pool, expanding the available space. It's not as good performance-wise as provisioning everything from the beginning since the space utilization won't be balanced across all the mirrored pairs, so keep that in mind. I'm a little confused about the RAW image option; I can't seem to find that anywhere related to the ZFS settings on my Proxmox install, so I'm not exactly sure how that happened.
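
For reference, a sketch of what creating a striped-mirror pool and later growing it by one pair could look like. The pool name and the /dev/disk/by-id paths are hypothetical placeholders, not the disks in this thread, and the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
POOL_NAME="guests-zpool"              # reuse or change as needed
DISK_DIR="/dev/disk/by-id"            # stable device names
# Hypothetical disk IDs; substitute the real ones from your system.
D1="$DISK_DIR/ata-EXAMPLE_DISK_1"
D2="$DISK_DIR/ata-EXAMPLE_DISK_2"
D3="$DISK_DIR/ata-EXAMPLE_DISK_3"
D4="$DISK_DIR/ata-EXAMPLE_DISK_4"
D5="$DISK_DIR/ata-EXAMPLE_DISK_5"
D6="$DISK_DIR/ata-EXAMPLE_DISK_6"

# Print each command by default; set DRY_RUN=0 to actually run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create a pool striped across two mirrored pairs.
create_striped_mirrors() {
    run zpool create "$POOL_NAME" mirror "$D1" "$D2" mirror "$D3" "$D4"
}

# Later: grow the pool by adding one more mirrored pair as a new vdev.
add_mirror_pair() {   # $1, $2 = the two new disks
    run zpool add "$POOL_NAME" mirror "$1" "$2"
}

create_striped_mirrors
add_mirror_pair "$D5" "$D6"
```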

Thank you again for your assistance.
You're welcome, I'm glad I was able to help.

Oh one more thing, deduplication should almost never be enabled. Deduplication built into ZFS has many additional considerations including resource utilization, performance impact and can't be effectively disabled on a zpool without destroying, recreating it and restoring data from backup.

Here's a few Proxmox-specific links on ZFS:
This is also a very good read: http://nex7.blogspot.com/2013/03/readme1st.html
 
John, make sure you test those backed up guests. Since you may be using a USB external drive for the backup storage, reliability comes into question.
Solid advice. Especially since you'll be destroying the pool and re-creating it. Also as a general rule, if you're not testing your backups regularly by restoring them, you don't have backups.
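
One possible way to test a backup before destroying the pool is to restore it under a spare VM ID and boot it. This is only a sketch: the archive file name, the spare VM ID and the target storage are hypothetical placeholders, and the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical values; adjust to the real archive name and a free VM ID.
ARCHIVE="/mnt/usb-backup/dump/vzdump-qemu-100-EXAMPLE.vma.lzo"
TEST_VMID="9100"
TARGET_STORAGE="local"

# Print each command by default; set DRY_RUN=0 to actually run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Restore the archive as a new VM, leaving the original VM untouched.
restore_test() {   # $1 = archive path, $2 = spare VM ID
    run qmrestore "$1" "$2" --storage "$TARGET_STORAGE"
}

restore_test "$ARCHIVE" "$TEST_VMID"
```

Keep the test VM's network disconnected when you boot it so it doesn't conflict with the live guest.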
 
Thank you to everyone who has responded. I have a lot to do it looks like.

Can you print #arc_summary
Here it is:
Code:
root@the-verse:~# arc_summary

------------------------------------------------------------------------
ZFS Subsystem Report                            Thu Oct 26 06:28:17 2017
ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                9.80m
        Mutex Misses:                           2.37k
        Evict Skips:                            2.37k

ARC Size:                               10.50%  6.72    GiB
        Target Size: (Adaptive)         10.56%  6.76    GiB
        Min Size (Hard Limit):          6.25%   4.00    GiB
        Max Size (High Water):          16:1    64.00   GiB

ARC Size Breakdown:
        Recently Used Cache Size:       30.62%  2.07    GiB
        Frequently Used Cache Size:     69.38%  4.69    GiB

ARC Hash Breakdown:
        Elements Max:                           7.12m
        Elements Current:               16.00%  1.14m
        Collisions:                             6.66m
        Chain Max:                              6
        Chains:                                 37.75k

ARC Total accesses:                                     116.38m
        Cache Hit Ratio:                90.95%  105.84m
        Cache Miss Ratio:               9.05%   10.53m
        Actual Hit Ratio:               78.76%  91.66m

        Data Demand Efficiency:         85.64%  36.29m
        Data Prefetch Efficiency:       93.06%  71.99m

        CACHE HITS BY CACHE LIST:
          Anonymously Used:             12.80%  13.55m
          Most Recently Used:           10.29%  10.89m
          Most Frequently Used:         76.31%  80.77m
          Most Recently Used Ghost:     0.45%   479.37k
          Most Frequently Used Ghost:   0.14%   153.38k

        CACHE HITS BY DATA TYPE:
          Demand Data:                  29.36%  31.07m
          Prefetch Data:                63.30%  67.00m
          Demand Metadata:              1.87%   1.98m
          Prefetch Metadata:            5.47%   5.79m

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  49.47%  5.21m
          Prefetch Data:                47.43%  5.00m
          Demand Metadata:              1.56%   163.96k
          Prefetch Metadata:            1.55%   162.78k


DMU Prefetch Efficiency:                                        79.94m
        Hit Ratio:                      39.03%  31.20m
        Miss Ratio:                     60.97%  48.74m



ZFS Tunable:
        zvol_volmode                                      1
        l2arc_headroom                                    2
        dbuf_cache_max_shift                              5
        zfs_free_leak_on_eio                              0
        zfs_free_max_blocks                               100000
        zfs_read_chunk_size                               1048576
        metaslab_preload_enabled                          1
        zfs_dedup_prefetch                                0
        zfs_txg_history                                   0
        zfs_scrub_delay                                   4
        zfs_vdev_async_read_max_active                    3
        zfs_read_history                                  0
        zfs_arc_sys_free                                  0
        l2arc_write_max                                   8388608
        zil_slog_bulk                                     786432
        zfs_dbuf_state_index                              0
        zfs_sync_taskq_batch_pct                          75
        metaslab_debug_unload                             0
        zvol_inhibit_dev                                  0
        zfs_abd_scatter_enabled                           1
        zfs_arc_pc_percent                                0
        zfetch_max_streams                                8
        zfs_recover                                       0
        metaslab_fragmentation_factor_enabled             1
        zfs_deadman_checktime_ms                          5000
        zfs_sync_pass_rewrite                             2
        zfs_object_mutex_size                             64
        zfs_arc_min_prefetch_lifespan                     0
        zfs_arc_meta_prune                                10000
        zfs_read_history_hits                             0
        zfetch_max_distance                               8388608
        l2arc_norw                                        0
        zfs_dirty_data_max_percent                        10
        zfs_per_txg_dirty_frees_percent                   30
        zfs_arc_meta_min                                  0
        metaslabs_per_vdev                                200
        zfs_arc_meta_adjust_restarts                      4096
        spa_load_verify_maxinflight                       10000
        spa_load_verify_metadata                          1
        zfs_multihost_history                             0
        zfs_send_corrupt_data                             0
        zfs_delay_min_dirty_percent                       60
        zfs_vdev_sync_read_max_active                     10
        zfs_dbgmsg_enable                                 0
        zfs_metaslab_segment_weight_enabled               1
        zio_requeue_io_start_cut_in_line                  1
        l2arc_headroom_boost                              200
        zfs_zevent_cols                                   80
        zfs_dmu_offset_next_sync                          0
        spa_config_path                                   /etc/zfs/zpool.cache
        zfs_vdev_cache_size                               0
        dbuf_cache_hiwater_pct                            10
        zfs_multihost_interval                            1000
        zfs_multihost_fail_intervals                      5
        zio_dva_throttle_enabled                          1
        zfs_vdev_sync_write_min_active                    10
        zfs_vdev_scrub_max_active                         2
        ignore_hole_birth                                 1
        zvol_major                                        230
        zil_replay_disable                                0
        zfs_dirty_data_max_max_percent                    25
        zfs_expire_snapshot                               300
        zfs_sync_pass_deferred_free                       2
        spa_asize_inflation                               24
        dmu_object_alloc_chunk_shift                      7
        zfs_vdev_mirror_rotating_seek_offset              1048576
        l2arc_feed_secs                                   1
        zfs_autoimport_disable                            1
        zfs_arc_p_aggressive_disable                      1
        zfs_zevent_len_max                                640
        zfs_arc_meta_limit_percent                        75
        l2arc_noprefetch                                  1
        zfs_vdev_raidz_impl                               [fastest] original scalar sse2 ssse3 avx2
        zfs_arc_meta_limit                                0
        zfs_flags                                         0
        zfs_dirty_data_max_max                            4294967296
        zfs_arc_average_blocksize                         8192
        zfs_vdev_cache_bshift                             16
        zfs_vdev_async_read_min_active                    1
        zfs_arc_dnode_reduce_percent                      10
        zfs_free_bpobj_enabled                            1
        zfs_arc_grow_retry                                0
        zfs_vdev_mirror_rotating_inc                      0
        l2arc_feed_again                                  1
        zfs_vdev_mirror_non_rotating_inc                  0
        zfs_arc_lotsfree_percent                          10
        zfs_zevent_console                                0
        zvol_prefetch_bytes                               131072
        zfs_free_min_time_ms                              1000
        zfs_arc_dnode_limit_percent                       10
        zio_taskq_batch_pct                               75
        dbuf_cache_max_bytes                              104857600
        spa_load_verify_data                              1
        zfs_delete_blocks                                 20480
        zfs_vdev_mirror_non_rotating_seek_inc             1
        zfs_multihost_import_intervals                    10
        zfs_dirty_data_max                                4294967296
        zfs_vdev_async_write_max_active                   10
        zfs_dbgmsg_maxsize                                4194304
        zfs_nocacheflush                                  0
        zfetch_array_rd_sz                                1048576
        zfs_arc_meta_strategy                             1
        zfs_dirty_data_sync                               67108864
        zvol_max_discard_blocks                           16384
        zvol_threads                                      32
        zfs_vdev_async_write_active_max_dirty_percent     60
        zfs_arc_p_dampener_disable                        1
        zfs_txg_timeout                                   5
        metaslab_aliquot                                  524288
        zfs_mdcomp_disable                                0
        zfs_vdev_sync_read_min_active                     10
        zfs_arc_dnode_limit                               0
        dbuf_cache_lowater_pct                            10
        zfs_abd_scatter_max_order                         10
        metaslab_debug_load                               0
        zfs_vdev_aggregation_limit                        131072
        metaslab_lba_weighting_enabled                    1
        zfs_vdev_scheduler                                noop
        zfs_vdev_scrub_min_active                         1
        zfs_no_scrub_io                                   0
        zfs_vdev_cache_max                                16384
        zfs_scan_idle                                     50
        zfs_arc_shrink_shift                              0
        spa_slop_shift                                    5
        zfs_vdev_mirror_rotating_seek_inc                 5
        zfs_deadman_synctime_ms                           1000000
        send_holes_without_birth_time                     1
        metaslab_bias_enabled                             1
        zvol_request_sync                                 0
        zfs_admin_snapshot                                1
        zfs_no_scrub_prefetch                             0
        zfs_metaslab_fragmentation_threshold              70
        zfs_max_recordsize                                1048576
        zfs_arc_min                                       4294967296
        zfs_nopwrite_enabled                              1
        zfs_arc_p_min_shift                               0
        zfs_multilist_num_sublists                        0
        zfs_vdev_queue_depth_pct                          1000
        zfs_mg_fragmentation_threshold                    85
        l2arc_write_boost                                 8388608
        zfs_prefetch_disable                              0
        l2arc_feed_min_ms                                 200
        zio_delay_max                                     30000
        zfs_vdev_write_gap_limit                          4096
        zfs_pd_bytes_max                                  52428800
        zfs_scan_min_time_ms                              1000
        zfs_resilver_min_time_ms                          3000
        zfs_delay_scale                                   500000
        zfs_vdev_async_write_active_min_dirty_percent     30
        zfs_vdev_sync_write_max_active                    10
        zfs_mg_noalloc_threshold                          0
        zfs_deadman_enabled                               1
        zfs_resilver_delay                                2
        zfs_metaslab_switch_threshold                     2
        zfs_arc_max                                       68719476736
        zfs_top_maxinflight                               32
        zfetch_min_sec_reap                               2
        zfs_immediate_write_sz                            32768
        zfs_vdev_async_write_min_active                   2
        zfs_sync_pass_dont_compress                       5
        zfs_vdev_read_gap_limit                           32768
        zfs_compressed_arc_enabled                        1
        zfs_vdev_max_active                               1000
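
Several of the values in the listing above are raw byte counts, which are hard to read at a glance. A minimal sketch that converts a few of them to binary units; the dictionary is copied by hand from the listing and the `human` helper is a hypothetical name, not part of any ZFS tooling:

```python
# Byte-valued tunables copied by hand from the listing above.
TUNABLES = {
    "zfs_arc_max": 68719476736,
    "zfs_arc_min": 4294967296,
    "zfs_dirty_data_max": 4294967296,
    "dbuf_cache_max_bytes": 104857600,
}


def human(n):
    """Format a byte count using binary units (KiB, MiB, GiB, TiB)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(n)
    for unit in units:
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit we know about.
        if value < 1024 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1024


for name, raw in TUNABLES.items():
    print(f"{name:<24}{human(raw)}")
```

Read this way, zfs_arc_max works out to 64 GiB, which is half of the 128GB of RAM in this box.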
 
I just thought of an additional question. When I rebuild this into a striped mirror, which scenario is preferable and why: ZFS RAID S+M or hardware (PERC H330 RAID Controller) RAID10 with the single logical volume as a ZFS Pool?

This is not a battery-backed controller, but as was mentioned, it is possible that the direct attach mode may not be as direct as one might hope. I'm not sure if using it in native RAID mode would increase IO at all over my current non-RAID, no-cache setup.

https://photos.app.goo.gl/eP6x3CRr8rOB0p2Z2

https://photos.app.goo.gl/khoicqUCRSlum09A2
 
ZFS RAID S+M or hardware (PERC H330 RAID Controller) RAID10 with the single logical volume as a ZFS Pool?
Ideally you should never use ZFS without direct access to the disks. In your case, the reason is that ZFS is not aware of any redundancy beneath it and so gets little benefit from its checksums: it may detect corruption, but it cannot tell which disk is at fault or self-heal. With a single RAID volume underneath you should use LVM. You will lose out on much of what ZFS has to offer, but you're not really gaining it without direct access to the disks.
 
That is in line with what I was thinking. Fault detection and self-healing are the primary reasons I wish to use it.

So to "build" it properly (because RAID1+0 and RAID0+1 differ vastly in which drives you can lose and still recover), I think what I want to do is create a tank of two mirrored drives, and then keep adding another pair of mirrored drives to the tank until I have run out of drives to use. Is this correct?
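
The layout described above (one mirrored pair to start, then more mirrored pairs added to the same pool) maps onto `zpool create <pool> mirror <a> <b>` followed by `zpool add <pool> mirror <c> <d>`. A minimal sketch that only builds and prints those command lines without running anything; the pool name `tank` and the `/dev/sdX` device names are placeholders, not the devices in this server:

```python
def striped_mirror_commands(pool, disks):
    """Return zpool command lines that build a pool from mirrored pairs."""
    if len(disks) < 2 or len(disks) % 2:
        raise ValueError("need an even number of disks, at least two")
    # Split the disk list into consecutive pairs; each pair becomes one mirror vdev.
    pairs = [disks[i:i + 2] for i in range(0, len(disks), 2)]
    # The first pair creates the pool, every later pair is added as another mirror.
    commands = [["zpool", "create", pool, "mirror", *pairs[0]]]
    for pair in pairs[1:]:
        commands.append(["zpool", "add", pool, "mirror", *pair])
    return commands


# Hypothetical device names, for illustration only.
example_disks = ["/dev/sdc", "/dev/sdd", "/dev/sde", "/dev/sdf"]
for cmd in striped_mirror_commands("tank", example_disks):
    print(" ".join(cmd))
```

ZFS stripes writes across all mirror vdevs in the pool, so the result behaves like RAID1+0: the pool survives losing one drive from each pair, but not both drives of the same pair.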
 
Ok. I have asked for an 8TB, external, USB 3 drive and two additional 2TB drives of the same model number currently in the server. Hopefully this is not too much friction for the boss/owner. I will report back when I have progress to report.

Thank you all.
 
From the above post I'm guessing you don't have any provisions for backup. Might be a good time to discuss that more broadly; you CAN use an external USB disk for backup, but that's not very reliable/resilient. Just sayin'.
 
Actually, once I get the kinks worked out and can migrate all the ESXi 5.0 VMs to the ProxMox server, I will have an older but still OK Dell storage chassis I can repurpose for backup. It is currently the storage for two vSphere servers. Also, it is not critical that the VMs themselves get image backups, because the underlying servers have OS-level backups going on. I can't rebuild a CloudLinux/cPanel server and restore the system and all the account backups as quickly as I could restore a VM image, but I can do it in a couple of hours.

One of the old ESXi servers will become a front-end firewall for the ProxMox server (so the firewall can take a good traffic hit and not put that load on the ProxMox box). As an ISP and hosting provider we tend to get scanned a lot, and light DDoS attempts happen frequently.

For the other, I plan on taking most of the RAM out of the box that will become the firewall, leaving it with just what it needs, and moving that RAM to the second box. The second box will be a Varnish Cache server, so the more RAM the better.
 
I'd use two boxes for that, or use three and build a PVE cluster. You always want to have HA for anything customers pay you for.
Code:
root@the-verse:~# zpool list
NAME           SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
guests-zpool  7.25T  2.30T  4.95T         -    29%    31%  1.00x  ONLINE

This implies that you need at most 2.3 TB of external storage, and in fact less once you compress it and leave out the RAID-Z parity (zpool list reports raw space on a RAID-Z pool, parity included).
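
As a rough illustration of that estimate: on a RAID-Z1 of four disks, about three quarters of the raw allocation is data and the rest is parity. The sketch below is a back-of-the-envelope figure only; it assumes the four-disk RAID-Z1 described at the top of the thread, ignores padding, metadata and compression, and `raidz_data_estimate` is a hypothetical helper name:

```python
def raidz_data_estimate(alloc, disks, parity=1):
    """Estimate the data portion of a raw RAID-Z allocation, ignoring padding."""
    if disks <= parity:
        raise ValueError("need more disks than parity devices")
    return alloc * (disks - parity) / disks


# ALLOC from the zpool list output above, in the units zpool prints (T).
estimate = raidz_data_estimate(2.30, disks=4)
print(f"roughly {estimate:.2f}T of data before compression")
```

With small zvol block sizes the real parity and padding overhead can be higher than this simple ratio, so the number is best treated as an upper-end guess for sizing the external drive.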
 
Available resources and $$, brother. The ProxMox server is brand new hardware, it was not cheap, and I had to push to get it. The ESXi servers it is replacing are 10 years old. Fine for what I will repurpose them for, but not generally what I want to continue to run my VMs on.

Have you priced hosting recently? How much do you think customers are paying? Certainly not enough for us to set up HA. I don't have the scale of RackSpace or GoDaddy; we are a small mom-and-pop place. The advantages of hosting locally are:
  • The old adage “you get what you pay for” certainly applies to web hosting. Cut rate web hosting companies often overload their servers or double sell bandwidth. I am using CloudLinux Limits and low occupancy servers, and don't over subscribe the bandwidth.
  • Service - You can call us and speak with the person that has both direct control and direct responsibility for what you need done.
    • Most large web hosting companies only offer support through a ticketing system.
    • They also restrict the issues they will help with to billing and account access. When you need help using your website they will simply refer you to the support forums, leaving you with hours of research or unanswered questions.
  • If you own a local business, hosting your site in the area where your customers live may provide a boost to your rankings on Google.
    • From Google Webmaster Central: In our understanding of web content, Google considers both the IP address and the top-level domain (e.g. .com, .co.uk)… we often use the web server’s IP address as an added hint in our understanding of content.
    • If you're marketing to a local market, this may also help with targeting your search results.
  • Relationship – Knowing someone locally will help establish a more ‘real’ business relationship.
  • Face-to-Face – You can always meet in person.
  • Hosting local – Since the servers are local, they are also physically closer than the big hosting data centers. Often this means a faster connection than to a site halfway across the country or even in another country.
It is already hard to compete with the big guy and we can't do it on price. You're insane if you think a small, local place could afford 2x the cost and ongoing admin/maint/power expense to do a 2x HA, much less 3 times.
 
Cloud servers with any realistic storage capacity cost a bloody fortune. Get a test VM from Azure and you will see what I mean. We use a CloudberryLabs client to back up our data to Amazon S3, but that is cheap by comparison. Not everyone has millions of dollars to spend, and Proxmox works quite well if you do adequate testing of your setup/environment.
 
Yeah, we host everything in house. When I say local hosting, that is exactly what I mean. We are probably about the only one in our area that does it. If you are a local business and you need a website designed by people you can sit down with in person, and then hosted in the local area (which really helps with SEO in many ways), then you come to us.

It also makes a significant difference for E-mail. So many local businesses and residential customers in the area use us, that often E-mail never has to leave our server for delivery. I maintain a separate physical mail server due to the volume of mail we deal with.