KVM on top of DRBD and out of sync: long-term investigation results

> A) If what you say is correct, why does the VM start faster when I do things in this order?
I'm not sure if it is tunable.

> B) Why don't I get OOS in DRBD when the verification of the replicated volumes finishes? (This test runs automatically once a week.)
It depends on the OS type (Windows? Linux?), OS configuration (write cache without barriers? swap partition?), and OS usage (does it use swap, or is it free most of the time?).
 
I am not an IO expert nor do I know everything there is to know about IO, but this is what I believe is happening and why it makes a difference.

With cache=none this can happen:
1. Guest issues write
2. DRBD sends write to remote host
3. Guest modifies the write (it has not issued sync yet)
4. DRBD writes the modified block locally
Now you are out of sync

With cache=directsync it works like this:
1. Guest issues write
2. DRBD sends write to remote host and local disks
3. Guest modifies the write (has not issued sync yet)
4. DRBD sends write to remote host and local disks

O_DSYNC is the difference here.
With O_DSYNC each write is flushed and written the moment it happens, leaving no window of opportunity for the guest to modify the write.

Only directsync and writethrough offer O_DSYNC.
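
For anyone who wants to see this difference from the host side, here is a rough sketch using GNU dd (the file name, block size, and count are just placeholders, and the target must be on a real filesystem, not tmpfs): oflag=direct corresponds to O_DIRECT only, while oflag=direct,dsync adds O_DSYNC, which is roughly the cache=none versus cache=directsync situation described above.

  # O_DIRECT only: the data leaves the user-space buffer, but completion of each
  # write does not mean it is already on stable storage (roughly cache=none)
  dd if=/dev/zero of=./testfile bs=4k count=1000 oflag=direct

  # O_DIRECT + O_DSYNC: every single write is also synchronised before dd
  # continues (roughly cache=directsync); expect this run to be clearly slower
  dd if=/dev/zero of=./testfile bs=4k count=1000 oflag=direct,dsync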
 
Hi e100

On the PVE wiki page "Performance Tweaks":
https://pve.proxmox.com/wiki/Performance_Tweaks

I see that it literally says:

some interesting articles:
barriers : http://monolight.cc/2011/06/barriers-caches-filesystems/
cache mode and fsync : http://www.ilsistemista.net/index.p...s-on-red-hat-enterprise-linux-62.html?start=2

And on the "cache mode and fsync" website, I see an explanatory graph, and below it the page literally says:


  • if the cache policy is set to “writeback”, data will be cached in the host-side pagecache (red arrow);
  • if the cache policy is set to “none”, data will be immediately flushed to the physical disk/controller cache (gray arrow);
  • if the cache policy is set to “writethrough”, data will be immediately flushed to the physical disk platters (blue arrow).

But you said that with "none" I will have a write cache in the VM, right?

If your answer is yes, I understand that the information on this website is wrong and contradicts the actual behavior, right?
 
The SUSE page you linked to really shows the differences:
none = O_DIRECT
writethrough = O_DSYNC
directsync = O_DIRECT + O_DSYNC

With O_DIRECT data is copied to the IO device directly from the user-space buffer bypassing the cache. It does not guarantee that the operation is synchronous.
With O_DSYNC the data is written synchronously.
Combine the two and you bypass the buffer cache and perform synchronous IO (the safest form of IO possible, AFAIK).
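
For reference, this is roughly how those three combinations are expressed on the QEMU/KVM command line (a sketch only; /dev/drbd0 and the virtio interface are placeholders, and on Proxmox the -drive line is generated for you from the VM configuration):

  # cache=none: O_DIRECT, host page cache bypassed, no O_DSYNC
  -drive file=/dev/drbd0,if=virtio,cache=none
  # cache=writethrough: O_DSYNC, host page cache still used for reads
  -drive file=/dev/drbd0,if=virtio,cache=writethrough
  # cache=directsync: O_DIRECT + O_DSYNC, bypass the host cache and sync every write
  -drive file=/dev/drbd0,if=virtio,cache=directsync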
 
> The SUSE page you linked to really shows the differences:
> none = O_DIRECT
> writethrough = O_DSYNC
> directsync = O_DIRECT + O_DSYNC
>
> With O_DIRECT data is copied to the IO device directly from the user-space buffer bypassing the cache. It does not guarantee that the operation is synchronous.
> With O_DSYNC the data is written synchronously.
> Combine the two and you bypass the buffer cache and perform synchronous IO (the safest form of IO possible, AFAIK).

Hi e100, but now I am very confused about directsync:

Why do you think that it is the safest form of I/O possible?
If with O_DIRECT the VM needs to do fsync, why does directsync use both flags, and what is the advantage?
 
Directsync gets the data to the disk synchronously, and since it bypasses the host buffer cache it should get the data to the disk faster than writethrough.

Less time for data to get lost and the guest not getting confirmation of the write until the data is on permanent storage seems like the safest method possible.

The guest OS can (and does) implement its own read cache and benefit from using writethrough IO itself. It makes no sense and provides no benefit to have a read cache of the same block data in the host AND the guest. Caching the data twice just wastes RAM, RAM bandwidth, and CPU cycles. If a read cache is needed, add more RAM to the guest so it can use it for caching.
That is why I am switching to directsync.

Writethrough has never shown any benefit whenever I have tested it, and it is often slower at both reading and writing than cache=none or directsync. But I have only tested on high-performance IO systems; it might be beneficial on slower IO subsystems, but I doubt it. Allowing the guest to cache block data using the RAM allocated to it is likely to provide the best IO performance in most situations.
 
> The guest OS can (and does) implement its own read cache and benefit from using writethrough IO itself. It makes no sense and provides no benefit to have a read cache of the same block data in the host AND the guest. Caching the data twice just wastes RAM, RAM bandwidth, and CPU cycles. If a read cache is needed, add more RAM to the guest so it can use it for caching.
> That is why I am switching to directsync.
Thanks, e100, for your answer, but the PVE Wiki ( https://pve.proxmox.com/wiki/Performance_Tweaks ) says:

cache=writethrough
- host does read cache
- guest disk cache mode is writethrough
- Writethrough issues an fsync for each write. So it's the most secure cache mode; you can't lose data. It's also the slowest.

And it doesn't say anything more about the read cache.

In this SUSE link ( https://www.suse.com/documentation/sles11/book_kvm/data/sect1_1_chapter_book_kvm.html ), the documentation for writethrough talks about the write cache but says nothing about the read cache. So why do you say that with writethrough we have two read caches? (I understand that the VM has its own read cache.)

Many thanks for sharing your knowledge and experience with us (to me, it is worth gold).

Best regards
Cesar
 
Spirit's suggestion of cache=none is not helpful here; giner reports that when he used cache=none he got inconsistencies on DRBD.
Only cache=writethrough or cache=directsync did not cause inconsistencies.

He also suggests adding more RAM to the VM for read performance, the same suggestion I have made here.
Let the guest do the read caching, not the host. That rules out writethrough, leaving only one solution: directsync.
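
On Proxmox that is just the cache option on the virtual disk. A minimal sketch, assuming VM 100 with a virtio disk named vm-100-disk-1 on a DRBD-backed storage called drbd-storage (all of these names are placeholders for your own setup):

  # reattach the existing disk of VM 100 with cache=directsync
  qm set 100 --virtio0 drbd-storage:vm-100-disk-1,cache=directsync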
 
My observation is also that cache=none or cache=directsync provides the best performance if your storage layer has a decent cache. E.g. don't use cache settings which involve the host cache. If you use the host cache you will not use your controller's cache.
 
> Spirit's suggestion of cache=none is not helpful here; giner reports that when he used cache=none he got inconsistencies on DRBD.
> Only cache=writethrough or cache=directsync did not cause inconsistencies.
>
> He also suggests adding more RAM to the VM for read performance, the same suggestion I have made here.
> Let the guest do the read caching, not the host. That rules out writethrough, leaving only one solution: directsync.

Many thanks, e100; directsync will be my best option if I have a RAID controller with WB and BBU enabled.

But if I have a single SATA HDD to use with DRBD and no RAID controller in the middle, and I want the host/VM not to use RAM as a write cache, and not to consider the data written to disk when it is really still in DRBD's buffer, how should I configure DRBD and the cache of the VM?

Best regards
Cesar
 
> Many thanks, e100; directsync will be my best option if I have a RAID controller with WB and BBU enabled.
>
> But if I have a single SATA HDD to use with DRBD and no RAID controller in the middle, and I want the VM not to use RAM as a write cache, and not to consider the data written to disk when it is really still in DRBD's buffer, how should I configure DRBD and the cache of the VM?
>
> Best regards
> Cesar

I suppose the best option here is writeback with barriers. So, if my VM were Linux I would create one virtual HDD with ext4 (barriers are enabled by default) and another HDD with a swap partition. The first HDD can be attached with cache=writeback and the second one must be attached with directsync/writethrough. At the same time we have to be sure that all layers between the VM and the physical drive support barriers: VM, LVM (if used), DRBD, the physical drive, and anything else (if used).
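
A sketch of how that two-disk layout could look on the Proxmox side, assuming VM 101 on a storage called drbd-storage (the VM ID, storage name, and disk names are placeholders): the disk carrying the ext4 filesystem is attached with writeback, the disk carrying only swap is attached with directsync.

  # disk holding the ext4 root filesystem (barriers enabled by default): writeback is acceptable
  qm set 101 --virtio0 drbd-storage:vm-101-disk-1,cache=writeback
  # disk holding only the swap partition: keep every write synchronous
  qm set 101 --virtio1 drbd-storage:vm-101-disk-2,cache=directsync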
 
> I suppose the best option here is writeback with barriers. So, if my VM were Linux I would create one virtual HDD with ext4 (barriers are enabled by default) and another HDD with a swap partition. The first HDD can be attached with cache=writeback and the second one must be attached with directsync/writethrough. At the same time we have to be sure that all layers between the VM and the physical drive support barriers: VM, LVM (if used), DRBD, the physical drive, and anything else (if used).

Many thanks, giner, for your answer, but I have some doubts. If you look at this link:
http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/index.jsp?topic=/liaat/liaatbpkvmguestcache.htm

you will see that it says:

  • none With caching mode set to none, the host page cache is disabled, but the disk write cache is enabled for the guest. In this mode, the write performance in the guest is optimal because write operations bypass the host page cache and go directly to the disk write cache. If the disk write cache is battery-backed, or if the applications or storage stack in the guest transfer data properly (either through fsync operations or file system barriers), then data integrity can be ensured. However, because the host page cache is disabled, the read performance in the guest would not be as good as in the modes where the host page cache is enabled, such as writethrough mode.

Then I understand that:
disk write cache = the buffer of the RAID controller or the buffer of the HDD
host page cache = the buffer in host RAM
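
For what it is worth, a quick way to check from the host whether that disk write cache is actually enabled on a plain SATA drive is hdparm (assuming /dev/sdb is the DRBD backing disk; adjust the device name for your setup):

  # report the current state of the drive's volatile write cache
  hdparm -W /dev/sdb
  # disable it if there is no BBU and the stack above cannot be trusted to flush properly
  hdparm -W0 /dev/sdb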

On the other hand, I understand that:
1- We can conclude that the writes go to the HDD's buffer, and as this buffer is a layer below DRBD, the data will obviously always be replicated.
2- And this web page literally says about cache=none: "If the disk write cache is battery-backed, or if the applications or storage stack in the guest transfer data properly (either through fsync operations or file system barriers), then data integrity can be ensured." I understand that the data will be guaranteed on the same host thanks to the BBU.
3- For this reason, I never get "OOS" in DRBD in my verification of the DRBD volumes that runs automatically once a week.

If I missed anything, please comment on my errors.

Best regards
Cesar
 
Cesar,
Sorry, I didn't get your point. What is the question?

Excuse me please; I believe that "cache=none" is a good configuration for the DRBD volumes. If I am wrong, please let me know.

Best regards
cesar

Re-edited: this is always based on the IBM report in the link that I put in the previous post.
 
1. You asked about the configuration when we have a single HDD without RAID, so that is what I suggested.
2. Modes other than directsync and writethrough shouldn't be used if we can't be sure that the upper layer won't try to submit new data before old data is committed. So ext4 with barriers enabled will work with any of the caching modes without problems, but a swap partition won't.
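
Inside a Linux guest, that split could look like the following /etc/fstab sketch (device names are placeholders): the ext4 filesystem keeps barriers explicitly enabled on the first virtual disk (the one attached with cache=writeback), while swap lives on the second virtual disk, the one attached with directsync or writethrough.

  # /etc/fstab inside the guest
  /dev/vda1  /     ext4  defaults,barrier=1  0 1
  /dev/vdb1  none  swap  sw                  0 0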
 
> So ext4 with barriers enabled will work with any of the caching modes without problems, but a swap partition won't.

Having swap inconsistent would be bad for live migrations.

Cesar, there are two issues here:
1. Data safety/consistency
2. Performance

To ensure safety/consistency, when using DRBD, the only options you can use are directsync or writethrough.
I believe that performance will be best with directsync no matter what the storage is (single disk, RAID array, or whatever).
However I have only performed benchmarks on RAID arrays.

If someone wants to benefit from a read cache, in my experience the best performance will be if the guest does the caching, not the host.
 
