Time drift during backup

chrisu

Renowned Member
Jun 5, 2012
Hello everybody,

Since a few days ago (by my feeling, since I ran the update process on the Proxmox host) my Windows VMs experience bad time drift while vzdump (the Proxmox backup job) is running. The VMs slow down and fall behind the host time. The drift is so severe that the guests are about 4 hours behind every morning. After the backup has finished, the clock seems to run faster and the machines slowly catch up to the host time.
I have already changed the hard disks of the VMs to IDE and set various clock parameter combinations I found in the forum regarding the time drift.
Here is one example configuration of a drifting VM.
args: -clock base=localtime,clock=guest,driftfix=slew -no-kvm-pit-reinjection -no-hpet
boot: cd
bootdisk: sata0
cores: 2
ide2: local:iso/virtio-win-0.1.126.iso,media=cdrom,size=152204K
memory: 8192
name: SERVER11-Prod
net0: e1000=46:E6:14:AC:FA:41,bridge=vmbr0
numa: 0
onboot: 1
ostype: w2k8
sata0: local-vm-data-zfs:vm-100-disk-3,size=82G
sata1: local-vm-data-zfs:vm-100-disk-4,size=200G
sata2: local-vm-data-zfs:vm-100-disk-2,cache=writeback,size=180G
scsihw: lsi
smbios1: uuid=a71d950b-9626-45c8-a9a2-0c60627f35f8
sockets: 1
startup: order=11,up=60,down=360

This host is running Windows 2008 (SBS 2008); the other drifting hosts are running 2k3 and Windows 7.

The pveversion output is:
proxmox-ve: 4.4-90 (running kernel: 4.4.67-1-pve)
pve-manager: 4.4-13 (running version: 4.4-13/7ea56165)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.59-1-pve: 4.4.59-87
pve-kernel-4.4.67-1-pve: 4.4.67-90
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0.1-1
pve-cluster: 4.0-52
qemu-server: 4.0-110
pve-firmware: 1.1-11
libpve-common-perl: 4.0-95
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-100
pve-firewall: 2.0-33
pve-ha-manager: 1.0-41
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
openvswitch-switch: 2.6.0-2

Has anybody experienced such a problem too, or an idea how to get rid of these time drifts?

Thanks in advance,

Chris
 
Hi,
you have to use a time server.
 
Hello,
thank you for the answer. The central time server is drifting too, and the sync intervals are affected by the drift. The other thing is that the systems get unresponsive as a result of the drift. So the time drift looks like the result of some kind of timer/tick issue that affects the execution of the programs running within the VMs. Using an external time server only fixes the Windows clock values, not the programs running slow. The Proxmox host clock runs fine during all this, so the hardware timer is OK. There were similar problems years ago with VMware's ESXi, but they were finally solved at the hypervisor level: a parameter-based solution for keeping the ticks in sync, or a change to the way the VM timer/clock works.

Thanks & greetings

Chris
 
What you describe is the expected behavior.

When a backup job runs, it slows down the VM, because every write the guest issues must first be caught and copied to the backup before it can be written to the virtual disk.
 
Hello Wolfgang,

thank you for the feedback. Yes, backup will increase the IO, and vzdump will take a lot of CPU and IO cycles, but there is much IO capacity left on the host. This can lead to performance issues, but it should never lead to timekeeping issues. Losing the local time results in the problem that all applications (and the OS) run into trouble: the machines hang for some time, then run again, hang, run, ... The system can be used but is slow and gets stuck from time to time, though it is not totally frozen. This behavior was observed with VMware when they had an issue on multi-core hosts where the cores had different TSCs that had not been synchronized. Every time a VM was moved to a different core, it stalled until the TSC reached the value of the prior core's TSC.
In our case the time simply seems not to be updated any more (I don't know whether the TSC or the RTC is used, but I guess both would be virtualized in KVM). After the backup is finished, the VMs' clocks run faster until they reach the correct time again, indicating some kind of clock synchronization. By the way, interesting: Linux systems seem not to be affected here. They run smoothly during backup, slower but smooth. I wonder what the difference in the VM parameters is. Could it be an idea to switch the OS type in the config, maybe?
I am still doing some tests, but if it finally turns out that Proxmox (or KVM) is not able to ensure that the VMs get a correct time base even in phases of high IO load, the platform cannot be used for most applications.

Thank you for your feedback. Maybe there are some ideas to get closer to a solution for the timekeeping problem(s)?

Greetings

Chris
 
I remember timing issues on Windows years ago, but not recently. Also, there are quite few such postings in this forum recently.

Please also check your hardware (mainboard) for BIOS updates and make sure you disable all power-saving CPU settings.

And if possible, test with a current Windows version (Windows Server 2016 or Windows 10).
 
After the backup is finished, the VMs' clocks run faster until they reach the correct time again, indicating some kind of clock synchronization.
Yes, this is a mechanism to catch up time.

The problem is that there are many different approaches to counting time, and it depends on the setup and configuration which one is used.

This is the reason why you should use an external, working time-sync server.
 
Hello,

I have been able to set up the Proxmox host as the time server in my environment. This seems to make sense because all VMs run on this host and should be linked to its clock. After some difficulty forcing the Windows SBS 2008 domain controller to accept the new time server (FYI: SBS 2008 has an active GPO forcing the local CMOS clock to be the time source), I was able to keep the server's time more or less good. More or less, because even with a 60-second sync interval the time may drift by some minutes due to the slow clock. But it stays within the 10-minute Kerberos ticket tolerance.

In any case, the time drift and clock skew are indicators of a bad implementation or design of the timing mechanism. But this seems to be a KVM error.

Thanks to everyone for the support on this.

Greetings

Chris
 
Hi,
I have put together the steps that worked for getting NTP up and running with the Proxmox host as time server and a W2k8 DC as client:
1. Run:

aptitude install ntp
or
apt-get install ntp

to install the NTP packages. I don't know if Proxmox ships only an NTP client by default.

2. Configure /etc/ntp.conf to allow your local network hosts to query time from ntpd by adding:

restrict 127.0.0.1
restrict ::1
#your network follows here
restrict 192.168.0.0 mask 255.255.255.0

3. Start ntpd (on Debian the service is named ntp):
service ntp start

4. On the Windows 2008 DC, open the Group Policy Management Console and add the Proxmox host as time server to the Default Domain Controller Policy, or add a new policy assigning the time server setting:

Computer Configuration | Administrative Template | System | Windows Time Service | Time Provider
Activate "Configure Windows NTP Client" and "Enable Windows NTP Client"
Options:
NtpServer: Proxmox host IP
Type: NTP
Other defaults are ok.

Save the GPO Object.

5. Run gpupdate from an administrative console on the DC.
Wait a few minutes for the GPO to become effective.
6. Check the time source with: w32tm /query /status
Now the Proxmox server should be shown as the time source.
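For reference, the same peer can also be set without a GPO (a sketch, assuming an elevated command prompt on the DC and that 192.168.0.10 stands in for the Proxmox host's IP; note that on an SBS DC a GPO like the one mentioned above may override these settings again):

```
w32tm /config /manualpeerlist:"192.168.0.10" /syncfromflags:manual /update
net stop w32time && net start w32time
w32tm /resync
```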

Nevertheless, even if the DC is now more or less in time with the host, you can see that the VM will still pause for moments during backup and then continue to run. The clock is as slow as before, but every few minutes the time gets corrected by the time server. Those wait cycles badly decrease the overall performance of the VM, but this is much better than having the whole domain/directory out of time sync. Hopefully the KVM developers will fix this timer issue soon.

If you have difficulties setting up the time source on the DC, I think it would be best to give the tool mentioned before a try.

Greetings Chris
 
If running a backup job slows down time in a VM, then there is definitely something wrong. This should not happen!

BTW, even with an NTP client you might run into problems. ntpd can correct time either by "slewing" (speeding up/slowing down the timer frequency) or by "stepping" (a rapid change in a single step). IIRC, the standard kernel configuration allows speeding/slowing by up to +/-0.5 ms per second (500 ppm). This is not enough to fix a few hours (sic!) of delay every day. Again IIRC, ntpd uses a default threshold of 128 ms for switching from "slewing" to "stepping". If the time drift is higher, "stepping" is used, but this might cause problems for some services (I have seen it with a mail server; error logs like "detected incorrectly running system clock")...
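To put numbers on the slewing limit (a back-of-envelope sketch, assuming the usual 500 ppm cap, i.e. 0.5 ms of correction per second of wall time):

```shell
offset_ms=$((4 * 3600 * 1000))   # a 4-hour offset, expressed in milliseconds
slew_s=$((offset_ms * 2))        # at 0.5 ms corrected per second, each ms takes 2 s
days=$((slew_s / 86400))
echo "slewing away ${offset_ms} ms takes ${slew_s} s (~${days} days)"
```

So a drift of hours per day can only ever be corrected by stepping, never by slewing alone.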
 
Hi Rhinox,

thanks for the reply. But all my investigations showed that it is somehow normal in KVM that the time in a Windows VM drifts when the host is under heavy IO load. There is no specification of "heavy IO" mentioned anywhere. All sources handle the problem only by offering ways to get the clock back to the correct time; I could not find any solution addressing the underlying error that the clock is simply not updated by the host for the VM (or at least those parts Windows uses for time measurement). But I have not gone into further detail regarding the different approaches used to implement the CMOS clock function in virtual machines on the different operating system platforms. At the moment it seems that KVM may not be usable in critical environments for Windows VMs (at least when using vzdump or any other IO-intensive backup tool).

The current settings I am using sync host and VM time every 60 seconds (more or less, depending on the time skew of the VM), and that is enough to keep the domain in sync, but that is only fighting the symptoms, not solving the problem.

Greetings
Chris
 
Maybe vzdump bandwidth limiting and ionice tuning will help with the time drifting?
In the file /etc/vzdump.conf:
#bwlimit: KBPS
#ionice: PRI
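Uncommented with example values, that could look like this (the numbers are only illustrative; bwlimit is in KB/s, and a higher ionice value means less IO priority for the backup):

```
# /etc/vzdump.conf -- illustrative values, tune for your storage
# limit the backup read rate to ~40 MB/s (the value is in KB/s)
bwlimit: 40960
# run the backup with low IO priority
ionice: 8
```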

And one more thing: it seems that only Windows VMs are really affected, not Linux.
 
One more thing I want to share with you, guys.
Let's look at some pictures:
jupiter-io-delay.png
venus-io-delay.png
These are monthly average graphs from two Proxmox 4.4 nodes.
The IO delay peaks are backup events to the NAS storage (a QNAP TS-412). And, as you can see, these peaks go on up to 2017-06-15.
What happened on that day, you ask? I'll answer.
On June 15, I changed the RAID level on my NAS to RAID10. Before the change it was RAID5.
I think this helped because no more RAID5 checksums need to be calculated.
And -- miracle! -- since June 15, no more time drifting on the Windows Server 2008 R2 virtual machine!

Update.
According to the Windows system log, the time drifting still exists, but it has decreased by a factor of at least 10-15: before June 15 it was 1-2 min per day, now it is 2-15 seconds.

That's all, folks!
 
...But all my investigations showed that it is somehow normal in KVM that the time in a Windows VM drifts when the host is under heavy IO load.
It might be "normal" in KVM, but I have never seen this in Xen or ESXi. The clock in a VM (be it KVM, LXC or whatever) should be tied to the HW clock in the host (with the host-VM timezone delta considered), which never drifts under load. So why should the VM clock drift?

But anyway, you should never do an online VM backup (with the VM running), because in fact it is broken.
 
It might be "normal" in KVM, but I have never seen this in Xen or ESXi.
AFAIK the time sync is handled by vmware-tools. Try without it and you will see.
 
Check the VMware vSphere documentation. I bet you will still find something there like "keeping time in a VM using vmware-tools is NOT recommended", for very good reasons which were discussed on the VMware web many times...

The proper solution is using an NTP client. Actually, I'm running a stratum-1 NTP server on one of the VMs, and both the ESXi server as well as all VMs sync their time with it...
 
Hi all,

thanks for all the feedback.
On June 15, I changed the RAID level on my NAS to RAID10. Before the change it was RAID5.
I think this helped because no more RAID5 checksums need to be calculated.
And -- miracle! -- since June 15, no more time drifting on the Windows Server 2008 R2 virtual machine!
This indicates that the storage delay causes the time drift, not the local IO load. I believe the overall performance increased with the switch to RAID10. Interesting result.
In my scenario the time drift is 3-4 hours during a 6-hour backup cycle. That's quite a lot.
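Spread evenly over the backup window, that rate looks like this (a rough estimate, assuming the drift accumulates linearly):

```shell
lost_s=$((4 * 3600))     # about 4 hours lost...
window_s=$((6 * 3600))   # ...over a 6-hour backup window
per_interval=$((lost_s * 60 / window_s))
echo "the guest falls ~${per_interval} s behind per 60 s of wall time"
```

Which is why even a 60-second sync interval still leaves a visible skew between corrections.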

Regarding VMWare and clock sync thru VMWare-Tools:
VMware/ESXi had the problem that they failed to sync the TSCs of the different CPU cores, leading to the situation that every time a guest was moved to a different core with a lower TSC, the guest waited for that TSC to reach the value provided by the previous core. This affected all guest platforms. The first fix was to activate CPU affinity to prevent TSC changes; this worked fine but was only usable with single-core guests. The later fix was to keep the TSCs of the cores in sync. There is no more need to sync the time via VMware Tools (I believe this functionality has been removed from the tools entirely).

I am going to give different ionice settings a try.

So long

Chris
 
Well, about performance.
Here is a piece of the backup log:
Code:
Jul 01 02:55:05 INFO: status: 95% (306017992704/322122547200), sparse 2% (7066804224), duration 14102, 20/19 MB/s
Jul 01 02:57:39 INFO: status: 96% (309241315328/322122547200), sparse 2% (7115296768), duration 14256, 20/20 MB/s
Jul 01 03:00:08 INFO: status: 97% (312460836864/322122547200), sparse 2% (7263166464), duration 14405, 21/20 MB/s
Jul 01 03:02:35 INFO: status: 98% (315680358400/322122547200), sparse 2% (7387230208), duration 14552, 21/21 MB/s
Jul 01 03:04:57 INFO: status: 99% (318911283200/322122547200), sparse 2% (7553269760), duration 14694, 22/21 MB/s
Jul 01 03:07:25 INFO: status: 100% (322122547200/322122547200), sparse 2% (7696805888), duration 14842, 21/20 MB/s
Jul 01 03:07:25 INFO: transferred 322122 MB in 14842 seconds (21 MB/s)
Jul 01 03:07:31 INFO: archive file size: 192.90GB
Jul 01 03:07:31 INFO: delete old backup '/mnt/pve/backup-nas/dump/vzdump-qemu-108-2017_06_26-23_00_02.vma.gz'
Jul 01 03:07:56 INFO: Finished Backup of VM 108 (04:07:54)

21 MB/s on a bonded 2 x 1 Gbit network is quite slow. I believe it is a single-core CPU performance issue on the NAS; when I just copy a file to the NAS over NFS or SMB, it sits at about 100% CPU load:
nas.png

On RAID5 the speed was the same, but the IO delay is several times lower with RAID10, as I described above.
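As a quick sanity check, the average in the last log line matches the totals (integer division, figures taken from the log above):

```shell
mb=322122     # transferred MB, from the final log line
secs=14842    # total duration in seconds
echo "average: $((mb / secs)) MB/s"
```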
 
