Hi there,
I am seeing strange behaviour with Proxmox, DRBD, and LVM on top of DRBD for the guests.
I have two similar hosts with RAID 0+1 and DRBD in primary/primary mode. More info is at the bottom.
My setup ran well for about 6 months. Then the RAM on one host failed.
After that host failed, I started its guests on the remaining host without problems (thanks to DRBD).
After two days I called the datacenter to have the failed host rebooted. At that time I didn't know the cause (failed RAM).
The guy at the datacenter rebooted the wrong (remaining) host, which came back online, of course.
About half an hour later he rebooted the right (failed) host, which also came back, with half of its RAM (the RAM was replaced a few days later).
The DRBD resync started and finished successfully.
I ran fsck on all partitions of all guests. Nothing really bad there, just some unreferenced inodes.
But then, after some days, one of my guests (the most important one, of course) froze.
top on the host showed its KVM process at 600% CPU usage.
It wasn't responding to SSH or to the VNC console in the Proxmox web interface, so I restarted it from the Proxmox web interface.
It came back online and ran as if nothing had happened, except for some unreferenced inodes that were deleted during startup.
This problem has persisted for nearly 3 months now.
Every few days (1-9 days, at different times, under different load) the guest freezes, shows 600% CPU usage on the host, and won't respond to SSH, while the other guests keep running.
One day I happened to be on the guest with top open in an SSH session when it froze.
top froze showing ~70%wa, so I think there is some problem with the storage.
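To capture that iowait figure continuously instead of catching it by chance in top, something like the following could run inside the guest. This is only a minimal sketch: the function names, the 5-second interval, and the idea of printing to stdout are my own assumptions, and it relies on the standard Linux /proc/stat layout, where the fifth counter on the aggregate "cpu" line is iowait.

```python
import time


def parse_cpu_line(line):
    """Return the counters from the aggregate 'cpu' line of /proc/stat as ints."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples, in percent."""
    deltas = [b - a for a, b in zip(before, after)]
    # Only the first eight counters (user .. steal) are summed; guest time
    # is already accounted for in user time.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    # Index 4 is the iowait counter.
    return 100.0 * deltas[4] / total


def read_cpu_counters(path="/proc/stat"):
    """Read the aggregate CPU counters (the first line of /proc/stat)."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())


def monitor(interval_seconds, samples):
    """Print a timestamped iowait percentage once per interval (hypothetical helper)."""
    previous = read_cpu_counters()
    for _ in range(samples):
        time.sleep(interval_seconds)
        current = read_cpu_counters()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print("%s iowait %.1f%%" % (stamp, iowait_percent(previous, current)), flush=True)
        previous = current
```

Calling monitor(5, 720) would cover one hour. If the output is piped to the remote syslog host, the last lines received before a freeze would show whether iowait climbs gradually or jumps at once.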
The Adaptec Storage Manager shows no problems with the hard disks.
fsck shows no problems on the guests' partitions.
dmesg on the hosts shows no problems.
The guest's logs show nothing interesting either.
I haven't found any corrupted data so far.
I moved the guest to the other host. It still hangs after some days. I halted one host and let all guests run on the remaining one; the guest still froze after some days.
One day I did an online migration to the other host while the guest was frozen. The migration was successful, but the guest stayed frozen until I restarted it.
I thought it might be a coincidence and purely guest-related, because only one of my three guests had this problem.
So I set up a new guest with similar settings and rsynced the data over from the problematic guest. This guest has also frozen, but only once so far (in about 4 weeks, perhaps because of the lower load).
I configured rsyslog to send its data to another host, but logging there also just stops when the guest freezes: no kernel messages, nothing.
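Since the remote log simply goes silent, the silence itself can be used to pin down when each freeze began. The sketch below looks for gaps between consecutive log lines on the receiving host. It assumes the traditional syslog timestamp prefix ("Jan 12 03:14:15"), which carries no year, so the year is passed in separately. It also assumes the guest logs something at least every few minutes (for example a cron job calling logger, which is not part of my current setup). All function names are hypothetical.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year):
    """Parse the 15-character traditional syslog timestamp at the start of a line."""
    return datetime.strptime("%d %s" % (year, line[:15]), "%Y %b %d %H:%M:%S")


def find_gaps(timestamps, threshold):
    """Return (last_seen, next_seen) pairs where the log was silent longer than threshold."""
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > threshold:
            gaps.append((previous, current))
    return gaps


def gaps_in_log(lines, year, threshold=timedelta(minutes=5)):
    """Convenience wrapper: parse every non-empty line and report the silent periods."""
    stamps = [parse_syslog_time(line, year) for line in lines if line.strip()]
    return find_gaps(stamps, threshold)
```

The first element of each pair is the last sign of life before a freeze, which could then be compared against the host's own logs and the DRBD activity at that moment.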
Today I ran DRBD's online verification. It showed no errors, but during the first hour of the run another guest froze about 10 times, every few minutes.
It feels like the online verification caused the freezes, because that guest had never frozen before today.
So far, 3 of 4 guests have shown the problem.
I have no clue what's going on, and I'm really interested in finding the cause rather than just setting everything up from scratch.
Here is some information about my setup:
Hosts:
Intel(R) Xeon(R) CPU X3440 @ 2.53GHz, 16 GB RAM, 4x 1 TB SATA in RAID 0+1
uname -a:
Linux host1 2.6.32-6-pve #1 SMP Fri Nov 4 06:54:05 CET 2011 x86_64 GNU/Linux
pveversion:
pve-manager: 1.9-26 (pve-manager/1.9/6567)
running kernel: 2.6.32-6-pve
proxmox-ve-2.6.32: 1.9-50
pve-kernel-2.6.32-4-pve: 2.6.32-33
pve-kernel-2.6.32-6-pve: 2.6.32-50
qemu-server: 1.1-32
pve-firmware: 1.0-14
libpve-storage-perl: 1.0-19
vncterm: 0.9-2
vzctl: 3.0.29-3pve1
vzdump: 1.2-16
vzprocps: 2.0.11-2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.15.0-1
ksm-control-daemon: 1.0-6
pveperf:
CPU BOGOMIPS: 40528.22
REGEX/SECOND: 839699
HD SIZE: 50.20 GB (/dev/mapper/pve-root)
BUFFERED READS: 411.09 MB/sec
AVERAGE SEEK TIME: 7.94 ms
FSYNCS/SECOND: 438.11
DNS EXT: 26.74 ms
DNS INT: 8.13 ms
Guest:
name: Hosting1
ide2: none,media=cdrom
vlan0: virtio=xx:xx:xx:xx:xx:xx
ostype: l26
memory: 4096
onboot: 0
sockets: 8
cores: 1
boot: c
freeze: 0
cpuunits: 50000
acpi: 1
kvm: 1
virtio1: Storage:vm-101-disk-2
description:
bootdisk: virtio1
uname -a:
Linux hosting1 2.6.32-5-686-bigmem #1 SMP Thu Nov 3 05:12:00 UTC 2011 i686 GNU/Linux
I hope I didn't miss anything fundamental.
Thanks in advance for any helpful information.
Wanja
Oh, and sorry for the wall of text and my bad English.