KVM on top of DRBD and out of sync: long term investigation results

giner

Hello,

A long time ago, when I found a guide on how to configure a Proxmox two-node cluster + DRBD, I was really happy. We only needed two nodes for online migration and quick recovery after a hardware failure. I saw it as a solution that could be widely used. And it worked for a while.

Then I noticed that online migration sometimes failed for no visible reason. While reading the DRBD documentation I found a recommendation to check DRBD synchronization consistency at least once a month, and I started doing so. Surprisingly, I found new out-of-sync sectors every week.
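
For reference, this is roughly how such a check can be run (DRBD 8.x; it assumes a verify-alg, e.g. md5, is already configured in the resource's net section):

shell# drbdadm verify all
shell# cat /proc/drbd                  # while running, the connection state shows VerifyS/VerifyT and the oos: counter grows for every mismatch found
shell# dmesg | grep "Out of sync"      # the kernel log lists the exact sectors that differ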

I went deeper and found that:
- most of the time the out-of-sync blocks appear in the swap space of Linux VMs (not critical for primary/secondary mode, but critical for primary/primary as it can cause memory corruption)
- sometimes (quite rarely) out-of-sync blocks appear for Windows VMs
- out-of-sync blocks never appear on the ext4 volumes of VMs

At the beginning I thought it was a hardware issue: we tried disabling every kind of offload, disabled rr-bonding, and we even asked Dell to replace the hard drives (we assumed there was a firmware issue). However, nothing helped.

Finally we found out (thanks to Lars Ellenberg) that KVM can change buffers while data is in flight if a cache mode with O_DIRECT is used for a particular virtual hard drive. Switching the cache mode solves (or works around) the issue.

So far I have the following recommendations for KVM on top of DRBD:
- WRONG: use writethrough or directsync for all drives of all VMs on DRBD (i.e. no write cache)
- CORRECT: use writethrough or writeback for all drives of all VMs on DRBD (i.e. no O_DIRECT; see the sketch after this list)
- use a hardware RAID controller with write cache and BBU (this is extremely important, as we disabled the write cache for the VMs)
- you can enable modes other than writethrough or writeback for some virtual drives that have reliable barrier support, for example if a particular drive holds only an ext4 partition with barriers enabled and no swap
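
For the record, a minimal sketch of how the cache mode can be changed on a Proxmox VE node (the VM ID, bus and volume name are made up for illustration; the same setting is also available in the GUI disk options):

shell# qm set 101 --virtio0 drbd-storage:vm-101-disk-1,cache=writeback
shell# qm config 101 | grep virtio0    # check that the drive line now contains cache=writeback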

Any more ideas and suggestions are welcome.

More information here: http://www.gossamer-threads.com/lists/drbd/users/25227


Best regards,
Stanislav German-Evtushenko
 
Last edited:
LVM does write cache by default in PVE and also in many other distros (in my case, with PVE, LVM is on top of DRBD). That is why I have it disabled, and for more than one year (since I installed it, and with the latest version), it never was "out of sync". I have also installed a cron script that verifies the replicated volumes every week and generates reports by log and mail.
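
A minimal sketch of such a weekly job (the schedule, paths and mail address are only examples; it assumes a verify-alg is configured and that the verify run finishes within the sleep window):

shell# cat /etc/cron.d/drbd-verify
# Sunday 03:00: start an online verify of all DRBD resources, wait, then mail the out-of-sync counters
0 3 * * 0  root  /sbin/drbdadm verify all; sleep 21600; grep oos: /proc/drbd | mail -s "DRBD verify report" admin@example.com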

And if you want much faster replication, please see this link:
http://blogs.linbit.com/p/469/843-random-writes-faster/

Also see the drbd link for tuning:
http://www.drbd.org/users-guide/s-throughput-tuning.html
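
For reference, the kind of knobs those guides talk about sit in the DRBD resource configuration roughly like this (DRBD 8.3 syntax; the values are purely illustrative and have to be tuned for your own network and disks):

  syncer {
    rate 110M;          # cap for resync traffic, not for normal replication
    al-extents 3389;    # larger activity log, fewer metadata updates on random writes
  }
  net {
    max-buffers 8000;
    max-epoch-size 8000;
    sndbuf-size 512k;
  }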

Good luck
Cesar

Re-edit: from the standpoint of DRBD, the BBU is unnecessary, because if a PVE node is shut down brutally, all the data will always be on the other node; but it is better to have a BBU if you have VMs outside of the DRBD volumes, or in case DRBD ever does not replicate correctly.
 
Last edited:
> LVM does write cache
What do you mean when you say "LVM write cache"?

> it never was "out of sync"
What kind of VMs do you use on top? In case of Linux do any of them use swap actively?
 
> > LVM does write cache
> What do you mean by "LVM write cache"?
>
> > it never was "out of sync"
> What kind of VMs do you use on top? In case of Linux do any of them use swap actively?

Please look at the file /etc/lvm/lvm.conf on your PVE host (the write cache part)

And the DRBD verification is a comparison of one volume with the other DRBD volume; so even if you have several MBs of cache inside the VM, once the VM issues an fsync this data will be replicated. In conclusion, the write cache of the VM is not a reason for DRBD volumes to go out of sync (it can only cause some lost data if the server goes down)

Best regards
Cesar

Re-Edit:
1- The DRBD and HA PDF documentation (you can download it from the LinBit web page) says that the LVM write cache should be disabled.
2- On this web page you can see that the best option to use with DRBD should be nocache (because it will use the cache of the RAID controller):
"if the cache policy is set to “none”, data will be immediately flushed to the physical disk/controller cache (gray arrow);"
http://www.ilsistemista.net/index.p...s-on-red-hat-enterprise-linux-62.html?start=2
 
Last edited:
> Please look at the file /etc/lvm/lvm.conf on your PVE host (the write cache part)

Could you be more specific? I don't see anything regarding write cache in lvm.conf. There is only an option related to filter scan cache (i.e. /etc/lvm/.cache).

> And the DRBD verification is a comparison of one volume with the other DRBD volume; so even if you have several MBs of cache inside the VM, once the VM issues an fsync this data will be replicated. In conclusion, the write cache of the VM is not a reason for DRBD volumes to go out of sync (it can only cause some lost data if the server goes down)

We would not have out of sync blocks if that was so simple: http://www.gossamer-threads.com/lists/drbd/users/25946#25946
 
> > Please look at the file /etc/lvm/lvm.conf on your PVE host (the write cache part)
>
> Could you be more specific? I don't see anything regarding write cache in lvm.conf. There is only an option related to filter scan cache (i.e. /etc/lvm/.cache).
>
> > And the DRBD verification is a comparison of one volume with the other DRBD volume; so even if you have several MBs of cache inside the VM, once the VM issues an fsync this data will be replicated. In conclusion, the write cache of the VM is not a reason for DRBD volumes to go out of sync (it can only cause some lost data if the server goes down)
>
> We would not have out of sync blocks if that was so simple: http://www.gossamer-threads.com/lists/drbd/users/25946#25946

Please run in the PVE Host:
shell# grep write_cache /etc/lvm/lvm.conf

And I think you may have other problems:
1- Do you have the latest firmware version installed for the NICs and for the RAID controller?
2- Do you have the latest drivers installed (NICs and RAID controller)?
3- Please read my previous post, which I re-edited; you will see the part about nocache (to avoid losing data, but this should not be a reason for DRBD to be OOS - simply, what is in the write cache will not yet be replicated by DRBD, and that data will not be on both disks)

This post was re-edited.

Best regards
Cesar
 
Last edited:
One question: how many DRBD volumes and hard drives do your two nodes have?
If you explain your whole configuration very clearly, maybe I can help you; please also tell me the brand and model of every piece of hardware in both servers,
and, if you can, the driver versions of the NICs and the RAID controller.

Note: I also have two PVE + DRBD setups in production on Asus desktop motherboards (and DRBD has never been OOS)

Best regards
Cesar
 
Last edited:
> Please run in the PVE Host:

$ man lvm.conf | grep write_cache
write_cache_state — Set to 0 to disable the writing out of the persistent filter cache file when lvm exits. Defaults to 1.

> 1- Do you have the latest firmware version installed for the NICs and for the RAID controller?
> 2- Do you have the latest drivers installed (NICs and RAID controller)?
I assume I wasn't clear. We tried everything we could try in relation to hardware.

> 2- On this web page you can see that the best option to use with DRBD should be nocache (because it will use the cache of the RAID controller):
> "if the cache policy is set to “none”, data will be immediately flushed to the physical disk/controller cache (gray arrow);"
With cache=none the VM can change buffers before the data is committed to DRBD, and that causes out of sync.
With cache=writethrough or cache=directsync the VM waits for the commit and only then can change the buffers. You are right that it won't use any cache in the case of software RAID or RAID without BBU. However, a RAID controller with BBU will acknowledge the commit as soon as the data reaches the RAID cache.
 
Hi Giner

First of all, I want to correct something I said incorrectly in my previous posts, and I apologize for guiding you incorrectly: if you have any software that does "write cache" (such as QEMU in the VM or on the host, LVM, or any other software that applies a "write cache" in any other layer), then, when DRBD verifies the replicated volumes, it will give negative results.

Second: you must have the LVM "write cache" configured off, so that it does not do "write cache" (in order to avoid another layer of write cache that would give problems with the verification of DRBD volumes).

Third - about the write cache configured in the VM in PVE:
1- You are wrong if you think that cache=none will give negative results when DRBD verifies the replicated volumes.
2- The word "none", does that not tell you something?
3- With cache=none the VM cannot change the buffers before the data is committed to DRBD, because the VM has no write cache!!!!
4- For me this configuration ("cache=none" and "LVM without write cache") runs very well and without problems; just try it.
5- In Windows VMs, you must have the write cache disabled if your data is very important.

Write cache on the RAID controller:
- It is correct that the RAID controller should have "write cache" enabled for better performance; it does not affect DRBD replication (speaking of OOS in DRBD), and it is of great importance if you have a database, because a database does a lot of I/O to the HDDs.

Write cache on the HDDs:
- Always keep in mind that all HDDs also have a write/read cache; it does not affect DRBD replication (speaking of OOS in DRBD), but with the RAID controller installed and doing write cache this should not cause data loss, because the buffer of the RAID controller is larger than the buffer of the HDDs.

Best regards
Cesar
 
Last edited:
I won't make any comments on the points I already mentioned. The only one I would like to add a note about is cache=none

"none" means "avoid host cache", however it still doesn't wait for commit from hardware level and can change buffers while data is not completely committed.

From man qemu:
The host page cache can be avoided entirely with cache=none. This will attempt to do disk IO directly to the guest's memory. QEMU may still perform an internal copy of the data. Note that this is considered a writeback mode and the guest OS must handle the disk write cache correctly in order to avoid data corruption on host crashes.
 
> With cache=none the VM can change buffers before the data is committed to DRBD, and that causes out of sync.

This is what I find interesting; it seems that LinBit also confirms this: http://www.gossamer-threads.com/lists/drbd/users/26003#26003
Lars clearly states
The virtual machine itself is most likely doing "it".

I've never personally seen this myself and I run many windows machines, some of them doing high IOPS all the time too.
I am using a RAID card with 4GB of battery backed cache RAM, maybe this is why I've not seen the problem.

What sort of IO subsystem do you have giner?
 
I think I am going to switch all of my VMs to directsync; its performance is close to cache=none and in my case far better than writethrough.

Thanks for bringing this to everyone's attention, we should update the wiki to mention this.

Generally writethrough should be faster as it utilizes host memory for read cache.
BTW: How often do you run verify?
 
Writethrough, in my experience, is slower. I think the extra memory copies are why it is slower. I've done tons of benchmarks over the last few years, but I do suspect that with slower storage writethrough might be faster.
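
If anyone wants to reproduce this kind of comparison, here is a minimal sketch of a random-write test run inside a guest (it assumes fio is installed in the VM; the file name, size and runtime are arbitrary):

shell# fio --name=cachetest --filename=/root/fio.test --size=1G --bs=4k --rw=randwrite --ioengine=libaio --direct=1 --iodepth=16 --runtime=60 --time_based

Run it once with the virtual disk set to directsync and once with writethrough and compare the reported IOPS.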

I usually do a verify once a quarter but since I've never seen an issue before I have not made this much of a concern. I ran a verify on three DRBD volumes today and did find some issues.
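
In case it helps others: as far as I know the blocks that verify marks out of sync are not repaired automatically; the usual way (DRBD 8.x) is to disconnect and reconnect the resource on one node so that they get resynchronized from the sync source, e.g.:

shell# drbdadm disconnect r0           # "r0" is just an example resource name
shell# drbdadm connect r0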

Cesar, giner is correct about cache=none doing some cache. I know it sounds dumb that none does not actually mean none, maybe the KVM folk should have called it cache=minimal....
 
@giner:
@e100 (a master of masters):

After reading giner's message and the QEMU man page, I am very surprised; the definition of QEMU's "cache=none" is misleading. I am learning thanks to you, and I am also disappointed by QEMU's choice of terms. Now I can only say thank you for sharing your knowledge.

Maybe I don't have "out of sync" problems in DRBD because I have disabled the write cache in my VMs and in the lvm.conf file of my PVE host.

But all of us have the read cache enabled on the PVE host due to the LVM configuration (here we have one more layer), so I believe this can help to get more read performance in all cases, and having another read cache would only be a redundancy and a waste of RAM.

@e100:
Can you give your opinion about the best cache configuration for QEMU, and also inside the VMs, if I have this scenario and these requirements:

1- LVM is on top of DRBD
2- The data of the VMs must not be lost (the data is on LVM, which is on top of DRBD)
3- The LVM write cache on the PVE host is turned off
4- The LVM read cache on the PVE host is turned on
5- One RAID controller with 512 MB of cache and another with 1024 MB (both with a flash unit, zero maintenance)

@somebody:
What is the difference between "cache=writethrough" and "cache=directsync" in practical terms for QEMU?


Best regards
Cesar
 
Last edited:
Hello cesarpk,

> Maybe I don't have "out of sync" problems in DRBD because I have disabled the write cache in my VMs and in the lvm.conf file of my PVE host.

There is no option in lvm.conf to enable or disable write cache.

> What is the difference between "cache=writethrough" and "cache=directsync" in practical terms for QEMU?

cache=directsync means no read cache and no write cache; every write request has to be committed before the VM considers the data written and a new write can be performed.
cache=writethrough is similar to cache=directsync, with the difference that host RAM (i.e. Proxmox VE hardware RAM) is used as a read cache for the VM block device.
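
To make the difference concrete, this is how I understand the QEMU cache modes map onto the host page cache and O_DIRECT (a rough summary, not an authoritative table):

cache=writethrough: host page cache yes, O_DIRECT no,  guest sees a write cache: no
cache=directsync:   host page cache no,  O_DIRECT yes, guest sees a write cache: no
cache=none:         host page cache no,  O_DIRECT yes, guest sees a write cache: yes
cache=writeback:    host page cache yes, O_DIRECT no,  guest sees a write cache: yes

The out-of-sync problem discussed in this thread follows the O_DIRECT column, not the write cache column.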

Cheers,
Stanislav German-Evtushenko
 
> Hello cesarpk,
>
> > Maybe I don't have "out of sync" problems in DRBD because I have disabled the write cache in my VMs and in the lvm.conf file of my PVE host.
>
> There is no option in lvm.conf to enable or disable write cache.
>
> > What is the difference between "cache=writethrough" and "cache=directsync" in practical terms for QEMU?
>
> cache=directsync means no read cache and no write cache; every write request has to be committed before the VM considers the data written and a new write can be performed.
> cache=writethrough is similar to cache=directsync, with the difference that host RAM (i.e. Proxmox VE hardware RAM) is used as a read cache for the VM block device.
>
> Cheers,
> Stanislav German-Evtushenko

Hi giner

Many thanks for the information

But if you run:
shell# head -n 93 /etc/lvm/lvm.conf |tail -2

You will see this text:
# You can turn off writing this cache file by setting this to 0.
write_cache_state = 1

And my manual drbd-users-guide-1.3.4.pdf (downloaded from the LinBit website a long time ago: http://www.linbit.com/en/downloads/tech-guides ) literally says:

In addition, you should disable the LVM cache by setting:
write_cache_state = 0

After disabling the LVM cache, make sure you remove any stale cache entries by deleting /etc/lvm/cache/.cache.

You must repeat the above steps on the peer node.

------- 0 -------

I would like you to read this information from the PDF file, and afterwards I would like to hear your judgment about the read and write cache of LVM and how it can help or harm us (maybe I am understanding something mistakenly)

Best regards
Cesar
 
Last edited:
Hi Cesar,

"write_cache_state = 1" is not related to block cache. As i mentioned before this is a filter cache. This is assumed to be disabled in order to avoild posibility of two hosts trying to manage volume information concurently. Let's say you try to create volumes on both nodes at the same time. If a node caches volumes information than it won't be aware of updates on another node.

Cheers,
Stanislav German-Evtushenko
 
> Hi Cesar,
>
> "write_cache_state = 1" is not related to block cache. As I mentioned before, this is a filter cache. It is supposed to be disabled in order to avoid the possibility of two hosts trying to manage volume information concurrently. Let's say you try to create volumes on both nodes at the same time. If a node caches volume information then it won't be aware of updates on the other node.
>
> Cheers,
> Stanislav German-Evtushenko

Thanks giner, and let me ask two more questions:

A) If what you say is correct, why does the VM start faster when I do things in this order?
1- The PVE host is off
2- I turn on the PVE host
3- I turn on a VM (it starts slowly, so to speak)
4- I turn off this same VM
5- I turn on this same VM (it starts quickly, so to speak)

B) Why don't I get OOS in DRBD when the verification of the replicated volumes finishes? (this test runs once a week automatically)

Best regards
Cesar
 
Last edited:
