Hi again,
well here is what we gathered enduring some testing and rebooting...
exec summary: pve-2.6.32-4 and pve-2.6.18-4 kernels do work fine.
pve-kernel-2.6.18-5-pve and pve-kernel-2.6.18-6-pve give us trouble.
please don't take it as blame or whatever - we can only try and observe and maybe point the developers into the right direction, as we're definitely no kernel-devs and have no idea why those two kernels give this strange behaviour. the last two kernel docs were shortened/removed out as we hit the text-length-limit of this forum.
----------
all tests were performed on proxmox v1.8 running the proxmox-ve-2.6.18 framework and for the 2.6.32 of course the proxmox-ve-2.6.32 framework was installed.
the lenny ntp package 4.2.4p4+dfsg-8lenny3 was used.
after installing promox we downgraded to 2.6.18-6 kernel and framework and one of the puzzling things is when we do a "watch -n 5 date" we get output like this (this is the watch timestamp is accurate but the date output is wrong (and so is most of the timing in the OS)):
---------
Every 5.0s: date Sun Jun 5 19:01:36 2011
Sun Jun 5 19:06:36 CEST 2011
---------
the kernel used:
uname -a: Linux k71 2.6.18-6-pve #1 SMP Mon May 9 13:27:11 CEST 2011 x86_64 GNU/Linux
pveversion -v
pve-manager: 1.8-17 (pve-manager/1.8/5948)
running kernel: 2.6.18-6-pve
proxmox-ve-2.6.18: 1.8-15
pve-kernel-2.6.32-4-pve: 2.6.32-33
pve-kernel-2.6.18-6-pve: 2.6.18-15
qemu-server: 1.1-30
pve-firmware: 1.0-11
libpve-storage-perl: 1.0-17
vncterm: 0.9-2
vzctl: 3.0.26-1pve4
vzdump: 1.2-12
vzprocps: 2.0.11-2
vzquota: 3.0.11-1
pve-qemu-kvm-2.6.18: 0.9.1-15
we observe the bootprocess of the hardwarenode and soon it seems ntp is getting in and setting things up, but after a few minutes reports servers unreachable and takes an instant timewarp (the following line timestamped 19:05:04 was real time 18:59:04 but the clock just got kicked):
syslog for ntp:
Jun 5 18:57:44 k71 ntpd[8436]: kernel time sync status 0040
Jun 5 18:57:44 k71 ntpd[8436]: frequency initialized 59.309 PPM from /var/lib/ntp/ntp.drift
Jun 5 18:57:44 k71 ntpd[8454]: signal_no_reset: signal 17 had flags 4000000
Jun 5 18:57:46 k71 ntpd_initres[8454]: signal_no_reset: signal 14 had flags 4000000
Jun 5 18:59:04 k71 ntpd[8436]: synchronized to 91.206.8.34, stratum 2
Jun 5 18:59:04 k71 ntpd[8436]: kernel time sync status change 0001
Jun 5 19:05:04 k71 ntpd[8436]: synchronized to 94.136.3.70, stratum 2
Jun 5 19:05:10 k71 ntpd[8436]: no servers reachable
Jun 5 19:05:18 k71 ntpd[8436]: synchronized to 78.46.40.125, stratum 2
but still: ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
k70.vss.kapper. 193.171.23.163 2 u 320 64 3 0.001 -150058 149967.
ntp1.sil.at 62.112.154.21 3 u 312 64 3 0.001 -150054 149967.
*ns4.nosuchhost. 131.130.251.107 2 u 342 64 1 0.001 -150064 2.414
86.59.113.116 193.171.23.163 2 u 305 64 3 0.001 -150054 113364.
svn.mediainvent 131.130.251.107 2 u 326 64 3 0.001 -150060 149967.
then timewarp back to real time:
Jun 5 19:02:43 k71 ntpd[8436]: no servers reachable
ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
k70.vss.kapper. 193.171.23.163 2 u 50 64 37 0.154 -41.817 80183.0
ntp1.sil.at 62.112.154.21 3 u 43 64 37 0.745 -39.808 80182.3
ns4.nosuchhost. 131.130.251.107 2 u 28 64 37 21.993 -40.793 138889.
86.59.113.116 193.171.23.163 2 u 40 64 37 0.901 -40.681 126788.
svn.mediainvent 193.171.23.163 2 u 55 64 37 0.615 -42.902 80183.6
and within seconds another timewarp again five minutes into the future (feeling a bit like Max Headroom here...).
ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
k70.vss.kapper. 131.130.251.107 2 u 330 64 77 0.145 -36.949 80185.6
ntp1.sil.at 62.112.154.21 3 u 324 64 77 0.001 -150017 126729.
ns4.nosuchhost. 131.130.251.107 2 u 308 64 77 0.001 -150018 80164.9
86.59.113.116 193.171.23.163 2 u 319 64 77 0.001 -150018 80164.4
svn.mediainvent 193.171.23.163 2 u 336 64 77 0.717 -37.957 80186.2
this goes on virtually forever five-minutes-warps occur again and again - and of course time-dependent services (especially ospf used here) go nuts.
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
next we tried with 2.6.18-5 (a simple downgrade for the kernel):
the "watch -n 10 date" again gives strange results:
Every 10.0s: date Sun Jun 5 18:21:36 2011
Sun Jun 5 18:26:36 CEST 2011
(in case you wonder why this time is now before the first try it is because we have redocumented the differnt tests in more detail afterwards, so time seems to overlap, but keep in mind we're timewarping too here...)
---------------------------------------------------------------
uname -a: Linux k71 2.6.18-5-pve #1 SMP Mon Mar 28 07:05:51 CEST 2011 x86_64 GNU/Linux
pveversion -v
pve-manager: 1.8-17 (pve-manager/1.8/5948)
running kernel: 2.6.18-5-pve
proxmox-ve-2.6.18: 1.8-15
pve-kernel-2.6.32-4-pve: 2.6.32-33
pve-kernel-2.6.18-5-pve: 2.6.18-14
pve-kernel-2.6.18-6-pve: 2.6.18-15
qemu-server: 1.1-30
pve-firmware: 1.0-11
libpve-storage-perl: 1.0-17
vncterm: 0.9-2
vzctl: 3.0.26-1pve4
vzdump: 1.2-12
vzprocps: 2.0.11-2
vzquota: 3.0.11-1
pve-qemu-kvm-2.6.18: 0.9.1-15
here too strange note in syslog says after a few minutes - no servers reachable....
Jun 5 18:40:46 k71 ntpd[8422]: kernel time sync status 0040
Jun 5 18:40:47 k71 ntpd[8422]: frequency initialized 59.937 PPM from /var/lib/ntp/ntp.drift
Jun 5 18:40:47 k71 ntpd[8435]: signal_no_reset: signal 17 had flags 4000000
Jun 5 18:40:49 k71 ntpd_initres[8435]: signal_no_reset: signal 14 had flags 4000000
Jun 5 18:42:10 k71 ntpd[8422]: synchronized to 86.59.113.117, stratum 2
Jun 5 18:42:10 k71 ntpd[8422]: kernel time sync status change 0001
Jun 5 18:47:37 k71 ntpd[8422]: sendto(91.206.8.70) (fd=24): Network is unreachable
Jun 5 18:48:08 k71 ntpd[8422]: no servers reachable
Jun 5 18:48:09 k71 ntpd[8422]: synchronized to 91.206.8.70, stratum 2
Jun 5 18:45:51 k71 ntpd[8422]: synchronized to 78.41.115.242, stratum 2
Jun 5 18:46:03 k71 ntpd[8422]: no servers reachable
kind of confirmed:
ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
k70.vss.kapper. 131.130.251.107 2 u 27 64 377 0.001 -150024 98181.0
91.206.8.70 193.171.23.163 2 u 62 64 177 0.001 -150024 80168.5
pils.linux-kern 129.132.2.21 3 u 31 64 377 0.001 -150025 113371.
86.59.113.117 193.171.23.163 2 u 29 64 377 0.001 -150024 113371.
boxi.trexler.at 193.171.23.163 2 u 47 64 177 0.001 -150000 80167.8
after a while we get
ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
k70.vss.kapper. 193.171.23.163 2 u 6 64 377 0.156 -47.725 113372.
91.206.8.70 193.171.23.163 2 u 42 64 377 0.877 -46.976 113375.
pils.linux-kern 129.132.2.21 3 u 11 64 377 1.511 -48.431 98183.8
86.59.113.117 193.171.23.163 2 u 10 64 377 1.910 -47.701 98184.1
boxi.trexler.at 193.171.23.163 2 u 29 64 377 2.078 -17.470 113381.
then we get another 5 minute timewarp and we see:
ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
94.136.3.70 193.171.23.163 2 u 307 64 377 0.001 -150032 113380.
91.206.8.70 193.171.23.163 2 u 343 64 377 0.641 -49.776 98182.7
213.129.242.82 129.132.2.21 3 u 376 64 376 1.511 -48.431 98183.8
86.59.113.117 193.171.23.163 2 u 375 64 376 1.910 -47.701 98184.1
78.41.115.242 193.171.23.163 2 u 330 64 377 0.817 -18.200 98188.3
in the syslog we get added:
Jun 5 18:57:47 k71 ntpd[8422]: sendto(86.59.1dto(78.41.115.242) (fd=24): Network is unreachable
Jun 5 18:58:50 k71 ntpd[8422]: sendto(86.59.113.117) (fd=24): Network is unreachable
Jun 5 18:58:54 k71 ntpd[8422]: sendto(213.129.242.82) (fd=24): Network is unreachable
and of course ospf goes nuts.
------------------------------------------------------------------------------------------
then we go back to where it last worked for us on old systems - 2.6.18-4 and yes, "watch -n 10 date" gets consistent:
Every 10.0s: date Sun Jun 5 18:32:03 2011
Sun Jun 5 18:32:03 CEST 2011
------------------------------------------------------------------------------------
here things go fine (and the previous timeerror gets corrected):
Jun 5 18:34:13 k71 ntpd[8403]: signal_no_reset: signal 17 had flags 4000000
Jun 5 18:34:15 k71 ntpd_initres[8403]: signal_no_reset: signal 14 had flags 4000000
Jun 5 18:30:40 k71 ntpd[8390]: synchronized to 86.59.113.115, stratum 2
Jun 5 18:30:40 k71 ntpd[8390]: time reset -300.036970 s
Jun 5 18:30:40 k71 ntpd[8390]: kernel time sync status change 0001
Jun 5 18:30:49 k71 ntpd[8390]: synchronized to 86.59.113.114, stratum 2
regards,
hk