ZFS sync question

sahostking

Just set up 2 new servers.

The old ones had HW RAID 10 with 6 x 2TB enterprise SATA drives.

The new ones are 6 x 1TB Crucial SSDs in ZFS RAID 10.

Now, this is what I first got on the SSD servers when running pveperf:

CPU BOGOMIPS: 57599.64
REGEX/SECOND: 2440575
HD SIZE: 2631.09 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND: 459.30
DNS EXT: 252.50 ms
DNS INT: 1.55 ms (hkdns.co.za)


Then I read somewhere that turning sync off gives better speed, so I did that by running:

zfs set sync=disabled rpool


Now fsyncs jumped from 459 to 28806.26

CPU BOGOMIPS: 57599.64
REGEX/SECOND: 2427467
HD SIZE: 2631.09 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND: 28806.26
DNS EXT: 319.44 ms
DNS INT: 1.76 ms (hkdns.co.za)


That is one massive difference. Is it safe to have sync off? Note that this server is for shared hosting, so I need optimal performance. Also note that these servers are brand new; I just installed Proxmox and then ran that. No VMs etc. on them yet.

Also, I will only be using KVM with qcow2 so that I can do snapshots.
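
For reference, checking the setting and putting it back to the default should just be this, if I understand the zfs tooling right:

Code:
zfs get sync rpool
zfs set sync=standard rpool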

Thanks
 
Yeah, I was thinking it may not be a good idea, as some of the clients' VMs would be database servers etc.

How about RAIDZ2 across 5 disks, which gives 4.53TB, plus 1 SSD dedicated to ZIL and L2ARC?
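
Something like this is roughly what I have in mind for the extra SSD (the by-id path is just a placeholder, and I realise a single, unmirrored log device carries risk):

Code:
# one partition as log (ZIL/SLOG), another as cache (L2ARC)
zpool add rpool log /dev/disk/by-id/<spare-ssd>-part1
zpool add rpool cache /dev/disk/by-id/<spare-ssd>-part2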
 
The new ones are 6 x 1TB Crucial SSDs in ZFS RAID 10
Sorry, did I understand you right: you are using six 1TB SSD drives, yet you only get that many fsyncs/sec?

Anyway, some consumer grade drives are supposedly very bad at sync writes, according to this link. It might stem from the fact that these drives don't have protection against unexpected power loss, so forcing data all the way to the flash cells on every sync is expensive (i.e. slow). But that's a different story.
 
You want to use qcow2 on ZFS? Please don't.
ZFS can do anything you want (snapshots)... and more (compression, copy-on-write cloning, replication, thin-provisioning, ... ).

Don't use a one-disk ZIL either; that is the worst idea possible. If that disk fails, you can end up with data loss.

Stick to your install settings:
Normally, ZFS-backed KVM machines do not write synchronously (write-back cache), so you do not need to disable sync. Enable compression on your pool (LZ4), which will, in general, speed things up. For further improvement you can disable atime if you do not need it. You also have to size your ARC; normally half of the RAM is used for that.
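
For reference, those settings could be applied roughly like this, assuming the default rpool from the Proxmox installer and 64 GB of RAM (adjust the ARC value for your machine):

Code:
zfs set compression=lz4 rpool
zfs set atime=off rpool
# cap the ARC at 32 GiB; read at module load / boot time
echo "options zfs zfs_arc_max=34359738368" >> /etc/modprobe.d/zfs.conf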

Your shiny new machine will be lightning fast. I have a similar machine with 6x 960 GB Samsung enterprise SSDs. Unfortunately it was upgraded to PVE (rather than reinstalled with ZFS), so I have a hardware RAID 10 and LVM underneath; still, the figures are good:

Code:
FSYNCS/SECOND:     4606.41 (ZFS, sync=standard)
FSYNCS/SECOND:     12117.63 (ZFS, sync=disabled)
FSYNCS/SECOND:     3414.38 (ext4 ROOT fs)

My machine is a 6-year-old Supermicro server with "only" a PCI-E 2.0 bottleneck and an LSI MR9271-8i.
 
Thanks. I did the following.

Set up the 6 disks as RAIDZ2 using the Proxmox installer.

Then I created the ZFS pool entry via the storage plugin in the ZFS administration console under Storage.

I then created /etc/modprobe.d/zfs.conf, as it did not exist, and added the following since the server has 64GB of ECC memory:

options zfs zfs_arc_max=34359738368

and I also ran:

sysctl -w vm.swappiness=10

Compression lz4 is enabled.
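
To make the swappiness setting survive a reboot and to double-check the compression setting, I believe this is all that's needed (standard paths assumed):

Code:
echo "vm.swappiness=10" >> /etc/sysctl.conf
zfs get compression rpool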

Now, though, even after a reboot I still get the same FSYNCS value with these 6 SSDs and sync=standard:

CPU BOGOMIPS: 57597.72
REGEX/SECOND: 2427935
HD SIZE: 3526.58 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND: 415.66
DNS EXT: 23.38 ms
DNS INT: 1.63 ms


Not sure what I am missing or if this is just expected. The weird thing is that if I run arc_summary I get nothing. How can I confirm the L2ARC is working?
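
If arc_summary shows nothing, I guess the raw kstats can be read directly; as far as I understand, an L2ARC only exists if a cache vdev was actually added to the pool:

Code:
# current ARC size, ARC target maximum and L2ARC size (in bytes)
awk '$1 == "size" || $1 == "c_max" || $1 == "l2_size"' /proc/spl/kstat/zfs/arcstats
# cache (L2ARC) and log devices appear as their own sections here
zpool status rpool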

OK, so I got arcstat.py to show me some info.

When I ran the following, the 32GB limit worked for a brief moment and then somehow changed to 190M:

root@vz-jhb-1:~# echo "options zfs zfs_arc_max=34359738368" >> /etc/modprobe.d/zfs.conf
root@vz-jhb-1:~# echo "34359738368" > /sys/module/zfs/parameters/zfs_arc_max



time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
08:41:09 96 19 20 19 21 0 0 0 0 200M 32G
08:41:14 17 1 5 1 5 0 0 0 17 201M 32G
08:41:19 6 0 8 0 8 0 0 0 10 201M 32G
08:41:24 96 20 20 20 20 0 0 20 27 201M 32G
08:41:29 68 3 4 2 3 0 44 3 15 189M 190M
08:41:34 223 23 10 23 10 0 0 23 25 188M 190M
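
(I also read that on a ZFS-root install the options in /etc/modprobe.d/zfs.conf are only picked up at boot after the initramfs is regenerated, so I assume this is needed too:)

Code:
update-initramfs -u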
 
ARC is freed if it is of no further use; 32 GB is also the default (as described earlier) with 64 GB of RAM.

The problem with your "slow" disks is what @kobuki already described: your SSDs are not ready for enterprise use, therefore sync writes will be very slow. Have you checked and performed the test Sébastien Han described in his blog post?
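
That test is essentially a small synchronous 4k write benchmark; a rough equivalent with fio, writing to a scratch file on the pool instead of a raw device, might look like this (the file name is just an example):

Code:
fio --name=syncwrite --filename=/rpool/fio-test.bin --rw=write --bs=4k --size=1G --fsync=1 --ioengine=sync
rm /rpool/fio-test.bin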

Again, who cares if you're not going to do a lot of synced writes?
 
@sahostking, for future reference, can you tell us what SSD models you're using? BTW, since you have invested in that many of them, even if it may sound a little weird, you could invest in one more that has fast sync write rates and use it as a log device. But for mostly-read workloads, I can imagine your setup flying as is.
 
=== START OF INFORMATION SECTION ===
Device Model: Crucial_CT1024MX200SSD1
Serial Number: 16031184DD45
LU WWN Device Id: 5 00a075 11184dd45
Firmware Version: MU03
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 4
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Apr 16 15:57:44 2016 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
Thanks. Well, yes, if that table I linked earlier holds true information, this drive is pretty slow for sync write ops.
 
Thanks.
I just checked and noticed this now: a new server and a disk has already failed? I purchased these new yesterday and already 1 has failed :(


I checked zpool status:

root@vz-jhb-1:~# zpool status
pool: rpool
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-4J
scan: none requested
config:

NAME                     STATE     READ WRITE CKSUM
rpool                    DEGRADED     0     0     0
  raidz2-0               DEGRADED     0     0     0
    4855097323164774743  UNAVAIL      0     0     0  was /dev/sda2
    sdb2                 ONLINE       0     0     0
    sdc2                 ONLINE       0     0     0
    sdd2                 ONLINE       0     0     0
    sde2                 ONLINE       0     0     0
    sdf2                 ONLINE       0     0     0

errors: No known data errors



Is that really a faulty disk? Because if I run smartctl I get this:


root@vz-jhb-1:~# smartctl -a /dev/sda
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.2.8-1-pve] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: Crucial_CT1024MX200SSD1
Serial Number: 16031184DD45
LU WWN Device Id: 5 00a075 11184dd45
Firmware Version: MU03
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 4
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Apr 16 15:57:44 2016 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 263) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 8) minutes.
Conveyance self-test routine
recommended polling time: ( 3) minutes.
SCT capabilities: (0x0035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
 
Sad to hear that the drives are already failing, but the MX200 is at the lower end of consumer devices. The good thing is that you'll have warranty, right?
 
I got it working. I just had to replace sda2 with the disk ID; all worked after that. Probably some label issue.
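
The replace command was roughly like this (the by-id path is from memory, so check /dev/disk/by-id for the exact name on your system):

Code:
# replace the missing member (referenced by its GUID) with the same disk via its by-id path
zpool replace rpool 4855097323164774743 /dev/disk/by-id/ata-Crucial_CT1024MX200SSD1_16031184DD45-part2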

I added it to monitoring.

Speeds are far better than with our HW RAID 10 enterprise SATA disks though. So even though the drives are lower grade, we are using RAIDZ2, which is comparable to RAID 6, so 2 disks would need to fail before we have issues. We are also doing regular backups :) So I guess it should be fine.
 
Does it not use that by default when backing up?

I noticed that when I did a backup it states "snapshot" in the Proxmox backup log:

INFO: Starting Backup of VM 106 (qemu)
INFO: status = running
INFO: update VM 106: -lock backup
INFO: backup mode: snapshot
INFO: ionice priority: 7
 
I also noticed that a Proxmox staff member stated this, and I quote:

https://forum.proxmox.com/threads/backup-using-zfs-snapshots.17228/

"just to note, a snapshot is not a backup archive"

Hence the normal backups, which go to a remote NFS server, are best, as they are full backups which can be restored anywhere? ZFS snapshots, as he states, are not backup archives, if I am understanding correctly, so backup archives are better? I think I will keep using vzdump for the KVM servers.
 
you are confusing two very separate issues:
  • backups in proxmox store the current state (data, configuration and optionally for KVM RAM) in an archive file, for restoring in case of a failure, etc
  • snapshots store the current state internally within a storage solution, to allow rollbacks to this state / point in time
a single snapshot alone is not a valid backup, because snapshots (in order to allow fast creation and deletion and save space) usually only contain a delta of the data (either to the previous or next snapshot, depending on technology). the snapshot backup mode only uses snapshot mechanisms of the underlying storage to achieve backups with very short downtime (it will create a temporary snapshot, and use this as source for the backup, so that the guest can continue to run while data is backed up, the temporary snapshot is deleted afterwards).

neither the "snapshot mode backup" nor the snapshot feature in proxmox uses zfs-send/receive (but they use zfs snapshots if the underlying storage is ZFS). there is a tool that integrates zfs-send/receive with proxmox called pve-zsync, but it is separate from the backup and snapshot feature.
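
for completeness, a manual zfs send/receive would look roughly like this (dataset and host names are only examples, adjust them to your storage layout):

Code:
zfs snapshot rpool/data/vm-106-disk-1@transfer1
zfs send rpool/data/vm-106-disk-1@transfer1 | ssh backuphost zfs receive tank/vm-106-disk-1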
 