Problem migrating virtual machines in a cluster

iwes

New Member · May 10, 2010
In the Proxmox cluster central management, the migration of virtual machines does not run correctly.

The config is copied, but the actual disk image is not.
The command finishes with:

/usr/sbin/qmigrate 141.51.159.80 108
May 10 10:49:03 starting migration of VM 108 to host '141.51.159.80'
May 10 10:49:04 copying disk images
May 10 10:49:04 migration finished successfuly (duration 00:00:01)
VM 108 migration done
 
Maybe you should give us more information? What about your storage, for example?

# cat /etc/pve/storage.cfg
 
on master:

dir: local
path /var/lib/vz
shared
content images,iso,vztmpl,rootdir

dir: kvmstorage_01
path /kvmstorage_01
shared
content images

dir: backupstorage
path /backups
content backup


on node:

dir: local
path /var/lib/vz
shared
content images,iso,vztmpl,rootdir

dir: kvmstorage_01
path /kvmstorage_01
shared
content images

dir: backupstorage
path /backups
content backup
 
In the Proxmox cluster central management, the migration of virtual machines does not run correctly.

The config is copied, but the actual disk image is not.
The command finishes with:

/usr/sbin/qmigrate 141.51.159.80 108
May 10 10:49:03 starting migration of VM 108 to host '141.51.159.80'
May 10 10:49:04 copying disk images
May 10 10:49:04 migration finished successfuly (duration 00:00:01)
VM 108 migration done
Hi,
why do you think it's not correct? If your disk file is on shared storage, Proxmox doesn't copy the files, because both nodes (should) have access to them:
Code:
/usr/sbin/qmigrate --online 172.20.3.62 110
May 10 12:08:55 starting migration of VM 110 to host '172.20.3.62'
May 10 12:08:55 copying disk images
May 10 12:08:55 starting VM on remote host '172.20.3.62'
May 10 12:08:56 starting migration tunnel
May 10 12:08:56 starting online/live migration
May 10 12:08:58 migration status: active (transferred 310248KB, remaining 755672KB), total 1065216KB)
May 10 12:09:00 migration status: active (transferred 367668KB, remaining 698616KB), total 1065216KB)
May 10 12:09:02 migration status: active (transferred 437776KB, remaining 628508KB), total 1065216KB)
May 10 12:09:04 migration status: active (transferred 506272KB, remaining 560036KB), total 1065216KB)
May 10 12:09:06 migration status: active (transferred 571336KB, remaining 495016KB), total 1065216KB)
May 10 12:09:08 migration status: active (transferred 634320KB, remaining 432056KB), total 1065216KB)
May 10 12:09:10 migration status: active (transferred 700988KB, remaining 365492KB), total 1065216KB)
May 10 12:09:12 migration status: active (transferred 768656KB, remaining 297824KB), total 1065216KB)
May 10 12:09:14 migration status: active (transferred 838856KB, remaining 227768KB), total 1065216KB)
May 10 12:09:16 migration status: active (transferred 956468KB, remaining 111052KB), total 1065216KB)
May 10 12:09:19 migration status: completed
May 10 12:09:19 migration speed: 44.52 MB/s
May 10 12:09:20 migration finished successfuly (duration 00:00:26)
VM 110 migration done

Udo
 
Oh, thanks, now it works:

/usr/sbin/qmigrate 141.51.159.80 108
May 10 12:18:29 starting migration of VM 108 to host '141.51.159.80'
May 10 12:18:29 copying disk images
vm-108-disk-1.raw

rsync status: 4543119360 14% 44.80MB/s 0:10:03
 
In the Proxmox cluster central management, the migration of virtual machines does not run correctly.

The config is copied, but the actual disk image is not.
The command finishes with:

/usr/sbin/qmigrate 141.51.159.80 108
May 10 10:49:03 starting migration of VM 108 to host '141.51.159.80'
May 10 10:49:04 copying disk images
May 10 10:49:04 migration finished successfuly (duration 00:00:01)
VM 108 migration done

That's what's happening to me, too.
I now understand that "shared" means "no copying, both nodes see the same files".
I simply changed the config /etc/pve/storage.cfg on the master.
It got synced to the slave, and it works. Is this the correct way?
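For reference, this is roughly what the changed entry looks like (a sketch based on the storage.cfg format shown above; the point is only that the directory storage no longer carries the "shared" flag, so qmigrate actually copies the image to the target node):
Code:
dir: local
path /var/lib/vz
content images,iso,vztmpl,rootdir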
 
Migration worked fine for me last week, but there are issues now.
I'm using KVM via LVM/DRBD and local storage (*.raw, qcow2).
I added the following block; maybe it helps to identify a major issue.
Currently the cluster doesn't really run smoothly...

I'm using a Nehalem (Intel) and a Phenom X4 (AMD) for the two nodes, which is bad, I know.
In the past it was at least possible to hot-migrate from the primary machine (Intel) to the
secondary. It didn't work in the other direction, though.
Is there a way to "degrade" the CPUs via a parameter to make them similar enough?
Some VMs currently stay alive after the move, some die.
My IPCop (single-core install), for example, dies. Windows 2008 R2 at least was fully functional
after moving it from Intel to AMD. I haven't checked the other direction yet.

Something was wrong with the synchronisation of the ISOs for a while. I thought it was
caused by an extremely long filename... but there are longer ones. It now finally works
again... hmm. I didn't find any errors, but the ISOs on the nodes were not synced for a while.

Backup (tar.gz) runs forever from time to time, on different VMs (I'm backing
up to a NAS via SMB), and I finally had to kill it (someone else mentioned the
missing failure note in the backup log; probably that's already fixed). The backup
continues with the next VMs without trouble and finally sends out the status mail. This is probably
caused by the NAS with an unexpected timeout or so. I've already had random trouble
with other Linux machines (SLES10). So this is just a note, in case someone else sees
this issue, too.

Today an XP machine (DRBD/LVM) got damaged (LZ32.dll, maybe because of a failed hotfix,
maybe because of some other error inside or outside). I removed the machine and
qmrestore'd it successfully. The restore changed the drive name to xxx-disk-2 and showed
disk-1 under unused disks in the web GUI. I tried to migrate and got an error because of the
old drive. I didn't expect to still see the old LV after the removal, but OK, the old LV was
still there and I removed it. Then I tried the migration while the machine was not running (it
had been stopped because of the previously failed step). The config migrates almost instantly, but
the secondary node doesn't show the disk (size: 0 KB). I migrated back and the hard disk
was available again. The machine works again if I start it. I can reproduce this.

Any idea where to look?

Is it possible to restart (parts of) the "administrative environment" of PVE without stopping the
running VMs? In the last few days there have been some strange things on the primary machine, and I
would love to restart, but currently that's not a good idea.
 
I have had a closer look at the logs:
Regarding the SAMBA/VFS issue, there are some posts which recommend disabling
"opportunistic locking" via echo 0 > /proc/fs/cifs/OpLockEnabled (works until reboot).
I'll check whether it works for me.
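If it does help, one way to keep the setting across reboots would be to re-apply it at boot, e.g. from /etc/rc.local (just a sketch; the exact name of the proc entry may differ between kernel versions):
Code:
# /etc/rc.local - re-apply the CIFS oplock setting at every boot (sketch)
if [ -w /proc/fs/cifs/OpLockEnabled ]; then
    echo 0 > /proc/fs/cifs/OpLockEnabled
fi
exit 0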
===============

My kernel log shows tons of:
Aug 30 06:25:03 rzsv1360 kernel: Buffer I/O error on device dm-5, logical block 0
Aug 30 06:25:03 rzsv1360 kernel: lost page write due to I/O error on dm-5
Aug 30 06:25:03 rzsv1360 kernel: EXT3-fs error (device dm-5): ext3_find_entry: reading directory #69656577 offset 0
...
Aug 31 08:32:59 rzsv1360 kernel: __ratelimit: 32 callbacks suppressed
Aug 31 08:32:59 rzsv1360 kernel: Buffer I/O error on device dm-5, logical block 0
Aug 31 08:32:59 rzsv1360 kernel: lost page write due to I/O error on dm-5

I managed to find out which storage is in trouble via:
ls -l /dev/mapper* /dev/dm*
with the minor device numbers showing the dm-X association:
...
brw-rw---- 1 root disk 251, 5 Sep 1 17:08 /dev/dm-5
...
crw-rw---- 1 root root 10, 59 Jul 30 14:44 /dev/mapper/control
brw-rw---- 1 root disk 251, 4 Aug 31 08:57 /dev/mapper/drbd-vm--106--disk--1
brw-rw---- 1 root disk 251, 6 Aug 26 01:30 /dev/mapper/drbd-vm--106--disk--1-real
brw-rw---- 1 root disk 251, 5 Sep 1 20:02 /dev/mapper/drbd-vm--114--disk--2
brw-rw---- 1 root disk 251, 11 Aug 26 08:39 /dev/mapper/drbd-vm--129--disk--1
brw-rw---- 1 root disk 251, 7 Aug 17 00:02 /dev/mapper/drbd-vm--152--disk--1
brw-rw---- 1 root disk 251, 2 Jul 30 14:44 /dev/mapper/pve-data
brw-rw---- 1 root disk 251, 1 Jul 30 14:44 /dev/mapper/pve-root
brw-rw---- 1 root disk 251, 0 Jul 30 14:44 /dev/mapper/pve-swap
brw-rw---- 1 root disk 251, 3 Jul 30 14:44 /dev/mapper/raid-vz
brw-rw---- 1 root disk 251, 9 Sep 1 19:49 /dev/mapper/raid-vz-real
brw-rw---- 1 root disk 251, 8 Sep 1 19:49 /dev/mapper/raid-vzsnap--rzsv1360--0
brw-rw---- 1 root disk 251, 10 Sep 1 19:49 /dev/mapper/raid-vzsnap--rzsv1360--0-cow
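
A quicker way to get the same dm-X to name mapping, for reference (standard device-mapper tools, nothing PVE-specific):
Code:
# Map dm-X minor numbers to device-mapper names without parsing ls output:
dmsetup ls     # prints each mapped device with its (major, minor) pair
dmsetup info   # verbose per-device info, including the major/minor numbers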

It matches the machine that crashed. Hmm. But why ext3? It's an XP machine?!
The RAID doesn't report trouble, and it's DRBD. I doubt it's hardware. The 3ware does
a verify (in kernel.log directly after those messages), and "logical block 0" seems a bit too
convenient for a hardware fault.
By the way, why is there a "real" suffix left on machine 106? What's the right thing to do?

This helped me understand how snapshotting uses the "base, real, snap and cow" suffixes:
http://www.mjmwired.net/kernel/Documentation/device-mapper/snapshot.txt#66

The second node shows CIFS messages too, but no dm-5 trouble... BUT it was running VM 114 when it died... I'm confused now :-(

Any ideas how to proceed further?
Many thanks in advance!
 
I'm using a Nehalem (Intel) and a Phenom X4 (AMD) for the two nodes, which is bad, I know.
In the past it was at least possible to hot-migrate from the primary machine (Intel) to the
secondary. It didn't work in the other direction, though.
Is there a way to "degrade" the CPUs via a parameter to make them similar enough?
Some VMs currently stay alive after the move, some die.
My IPCop (single-core install), for example, dies. Windows 2008 R2 at least was fully functional
after moving it from Intel to AMD. I haven't checked the other direction yet.
Here you have it: http://www.linux-kvm.org/page/Tuning_KVM
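For illustration, that page discusses QEMU's -cpu option: choosing a conservative baseline model such as qemu64 (or kvm64) hides vendor-specific CPU flags from the guest, which is the usual way to make an Intel/AMD pair "equal enough" for live migration. A sketch of how that could look, assuming the VM config allows passing extra KVM arguments (the args: line here is an assumption, not something confirmed in this thread):
Code:
# /etc/qemu-server/<vmid>.conf - sketch only
# Present a generic baseline CPU instead of the host's real model:
args: -cpu qemu64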
 
DRBD/LVM migration doesn't work any longer for new machines

Thanks for the link, I'll take a closer look. By the way: my Win2008R2 migrates in both directions (AMD/Intel) without trouble and without changing anything.

More detail on the problem above:
The old machines work like the W2008R2 above. I recovered one from backup and created a new one.
Both migrate the config but not the data (on DRBD), because the "volume doesn't exist" - but it does. Is this a problem on the master or on the slave? Where should I look?
It basically still works, so DRBD is OK; trouble with LVM???

/usr/sbin/qmigrate --online 10.47.1.2 151
Sep 04 11:41:13 starting migration of VM 151 to host '10.47.1.2'
Sep 04 11:41:13 copying disk images
Sep 04 11:41:13 starting VM on remote host '10.47.1.2'
One or more specified logical volume(s) not found.
command '/sbin/lvchange -aly /dev/drbd/vm-151-disk-1' failed with exit code 5
volume 'drbd:vm-151-disk-1' does not exist
Sep 04 11:41:15 online migrate failure - command '/usr/bin/ssh -c blowfish -o BatchMode=yes root@10.47.1.2 /usr/sbin/qm --skiplock start 151 --incoming tcp' failed with exit code 2
Sep 04 11:41:15 migration finished with problems (duration 00:00:03)
VM 151 migration failed -

Storage is there:
drbd:vm-151-disk-1 151 48.00 -
 
Re: DRBD/LVM migration doesn't work any longer for new machines

Again, more details:
The working machines have LVs on both nodes, as I would expect.
The non-working machines have LVs only on the master node.
In my understanding, the cluster sync of the LVM volumes doesn't work
any longer.
How can I re-activate it (if possible without a reboot)?
The sync of the configs is done differently from the ISOs, and differently from
the DRBD/LVM setup, isn't it?

That's probably the reason for all the other trouble, too.

Many thanks in advance!
 
Re: DRBD/LVM migration doesn't work any longer for new machines

Please run the command below on the target node:

# /sbin/lvchange -aly /dev/drbd/vm-151-disk-1

You need to find out why it does not work.
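A few generic checks on the target node usually narrow this down (a sketch using standard lvm2/DRBD commands, nothing PVE-specific):
Code:
vgs              # does the 'drbd' volume group show up on this node?
lvs drbd         # is vm-151-disk-1 listed in that volume group?
lvscan           # rescan and list all logical volumes known here
cat /proc/drbd   # is the resource Connected and Primary/Primary?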
 
Re: DRBD/LVM migration doesn't work any longer for new machines

In my understanding, the cluster sync of the LVM volumes doesn't work
any longer.

We do not sync anything on a shared storage - that is what the storage needs to do.
Is your DRBD still in Primary/Primary mode?
 
Re: DRBD/LVM migration doesn't work any longer for new machines

We do not sync anything on a shared storage - that is what the storage needs to do.
Is your DRBD still in Primary/Primary mode?

rzsvxxxx:~# cat /proc/drbd
version: 8.3.4 (api:88/proto:86-91)
GIT-hash: 70a645ae080411c87b4482a135847d69dc90a6a2 build by root@oahu, 2010-04-15 10:24:43
0: cs:WFConnection ro:primary/Unknown ds:UpToDate/DUnknown C r----
ns:0 nr:0 dw:45957872 dr:1020989176 al:30754 bm:30695 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:15090836

:-((
OK, I'll have a look.

=================
Update:

PVE-Master DRBD CSTATE: StandAlone
PVE-Node DRBD CSTATE: WFConnection

There should be a warning or a check... this is very dangerous.

=================
Next update:

I removed the LVs on the PVE node so that I could change it to secondary, then ran
drbdadm -- --discard-my-data connect all
At least it syncs now... to my (very limited) understanding I should be able to change
it back to primary now.

=================

Primary/Primary and UpToDate now! But the data is inconsistent. The VMs are working somewhat, but they were damaged :-(
Thanks a lot!

I'd recommend some kind of alert/check in PVE.
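
For anyone hitting the same situation: the manual split-brain recovery described in the DRBD 8.3 documentation is roughly the following (a sketch; "r0" stands in for the actual resource name, and you have to deliberately pick the node whose changes get thrown away):
Code:
# On the node whose data will be discarded (the split-brain "victim"):
drbdadm secondary r0
drbdadm -- --discard-my-data connect r0

# On the surviving node (only needed if it is StandAlone):
drbdadm connect r0

# Watch the resync and only promote back to Primary once it is UpToDate:
watch cat /proc/drbd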
 
Re: DRBD/LVM migration doesn't work any longer for new machines

Yes, you need to configure that yourself.

Hmmm...
Understanding the risks is one thing, getting a warning is another, and the
third thing is preventing PVE from migrating to a device that is no longer shared. Isn't
there a chance to stop the migration, or make it fail, with some kind of check?

If DRBD fails while a migration is running, I don't have a chance to prevent
the crash. That's what I don't like.
Maybe you could call a self-written script, like a hook (if it's placed somewhere,
like some of the configs), which does a check and is able to prevent the final OK? See the sketch below for what I mean.
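Code:
#!/bin/sh
# Sketch of such a pre-migration check (not an existing PVE hook): refuse to
# continue unless every DRBD resource in /proc/drbd is in state Connected,
# i.e. not StandAlone/WFConnection after a split brain.
if grep "cs:" /proc/drbd | grep -qv "cs:Connected"; then
    echo "DRBD is not fully connected - refusing to migrate" >&2
    exit 1
fi
exit 0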

What are the right steps to shut down a cluster node running DRBD
(master/slave)? Is there a howto somewhere? I think the node being shut down
should be changed to a DRBD secondary first, and of course its running machines
have to be moved to the other node first; roughly the sequence sketched below.
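Code:
# Rough sketch of the shutdown sequence I have in mind (resource name 'r0'
# and the target IP / VM ID are only placeholders taken from the examples above):
/usr/sbin/qmigrate --online 10.47.1.2 151   # repeat for every VM still running here
drbdadm secondary r0                        # demote DRBD on this node
shutdown -h now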
If the cluster ran into a split brain, do I have to back up the running machines on
the PVE slave node, switch them off, and restore them on the PVE master,
or is there an HA solution for this kind of repair?

Many thanks in advance!
JP
 
Re: DRBD/LVM migration doesn't work any longer for new machines

Hmmm...
Understanding the risks is one thing, getting a warning is another, and the
third thing is preventing PVE from migrating to a device that is no longer shared. Isn't
there a chance to stop the migration, or make it fail, with some kind of check?

What are the right steps to shut down a cluster node running DRBD
(master/slave)? Is there a howto somewhere? I think the node being shut down
should be changed to a DRBD secondary first, and of course its running machines
have to be moved to the other node first.

I guess you'd better ask DRBD-related questions on the DRBD mailing lists.

If the cluster ran into a split brain, do I have to back up the running machines on
the PVE slave node, switch them off, and restore them on the PVE master,
or is there an HA solution for this kind of repair?

No, there is currently no HA solution (it is planned for 2.0).
 
Re: DRBD/LVM migration doesn't work any longer for new machines

I guess you'd better ask DRBD-related questions on the DRBD mailing lists.

Please don't get me wrong! I'm really happy with PVE and like to give feedback to further
improve it.
In my understanding, the DRBD mailing list is not the right place to ask how to reboot the PVE master or
a PVE node as intended by its makers (with DRBD as one of the storage solutions). The people there don't know your concepts. I would have found it helpful to read how someone else fixed such a problem; that's why I tried to write down the steps that were successful for me (with your help). Nothing more.
But sure, I'll have a look at their mailing list.

No, there is currently no HA solution (it is planned for 2.0).

OK, then change "HA" to "service back as soon as possible".
 
