IO error after live migration

soessle

New Member
Feb 23, 2009
Hi,

I just did an online migration of a KVM node and it worked almost fine - except I get IO errors :-(

On the node I have Ubuntu 8.04 installed. If I do a "sudo -s" I get "-bash: /usr/bin/sudo: Input/output error" in response.

After "switch off/switch on" of the node everything is OK.

Could somebody help me find the root of this problem? Which logfile should I look at?

Thank you.
 

Do both hosts have access to the guest's file image / block device? That is required; otherwise, after migration, the guest will not be able to access its drive.
 

What are you talking about? We copy the files, so the guest is always able to access the drive after migration.
 

If you use NFS or iSCSI, you don't copy anything, obviously, but configure the hosts to access the same paths.


I haven't used live migration with Proxmox VE yet, but are you sure this is what is done?

- pausing the guest
- copying the guest's disk (file) image, e.g. 20 GB, to the other host
- transferring runtime data (RAM, state, etc.)
- resuming the guest on the other host

If yes, I wouldn't call it a "live" migration.
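
To illustrate the shared-storage setup I mean, just a sketch - the NFS server address and export path are made up here, and /var/lib/vz is simply the default Proxmox VE image directory:

# on BOTH hosts, mount the same export at the same path,
# so the guest's image file is visible at an identical location
mount -t nfs 192.168.80.1:/export/vz /var/lib/vz
# with this in place, a migration only has to move RAM and device state,
# never the disk image itself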
 

Downtime depends on the size of the disk file and the memory setting. The process of the current live migration of KVM VMs is a bit more advanced than you wrote: we do several rsyncs (before suspend), so there is minimal downtime here.

If you have reasonably fast systems (server-class hardware and network), you can live migrate a Win XP guest with 512 MB RAM with about 10 to 20 seconds of downtime (the less memory in the guest, the faster).
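
Roughly, the sequence looks like this (a simplified sketch, not the exact qmigrate code; the VM id, paths and target host are only examples):

# 1st rsync while the VM is still running - moves the bulk of the disk image
rsync -a /var/lib/vz/images/101/ root@192.168.80.155:/var/lib/vz/images/101/
# then: suspend the VM and dump its state (downtime starts here)
# 2nd rsync only transfers what changed during the first pass,
# so the suspended phase stays short
rsync -a /var/lib/vz/images/101/ root@192.168.80.155:/var/lib/vz/images/101/
# finally: copy the state dump and start/restore the VM on the target host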
 
Is the error reproducible?

The error is not always reproducible.

This is the log of the migration:
--- begin log ---
/usr/bin/ssh -t -t -n -o BatchMode=yes 192.168.80.154 /usr/sbin/qmigrate --online 192.168.80.155 101
tcgetattr: Inappropriate ioctl for device
starting migration of VM 101 to host '192.168.80.155'
starting data sync
building file list ...
0 files... 2 files to consider
created directory /var/lib/vz/images/101
./
vm-101-disk.qcow2
rsync status: 13453234176 100% 54.26MB/s 0:03:56 (xfer#1, to-check=0/2)

sent 13454876547 bytes received 48 bytes 56652111.98 bytes/sec
total size is 13453234176 speedup is 1.00
suspending running VM
dumping state
copying dumpfile
building file list ...
0 files... 1 file to consider
VM101.state
rsync status: 237226489 100% 55.32MB/s 0:00:04 (xfer#1, to-check=0/1)

sent 237255541 bytes received 42 bytes 52723462.89 bytes/sec
total size is 237226489 speedup is 1.00
starting second sync
building file list ...
0 files... 2 files to consider
vm-101-disk.qcow2
rsync status: 13453234176 100% 78.92MB/s 0:02:42 (xfer#1, to-check=0/2)

sent 1392061 bytes received 927986 bytes 5821.95 bytes/sec
total size is 13453234176 speedup is 5798.69
starting/restoring VM on remote host
qemu_popen: returning result of qemu_fopen_ops
online again after 415 seconds
Connection to 192.168.80.154 closed.
VM 101 migration done
--- end log ---

When I connect through the VNC console I now get the following error:
[ a timecode ] sd 2:0:0:0: rejecting I/O to offline device

After Stop/Start of the KVM everything is OK.

In the logfiles I don't find a hint as to what it could be :-(
 

Maybe it's because the downtime is too long, causing the device to go offline, but that's just a guess.
 

It may be a good guess. I think the SCSI timeout is 2 minutes by default, settable in /sys.
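
Something like this inside the guest (just a sketch; I assume the disk shows up as sda - adjust the device name):

cat /sys/block/sda/device/timeout        # current SCSI command timeout in seconds
echo 300 > /sys/block/sda/device/timeout # raise it as a test, so a long pause is less likely to trigger it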

To OP:

1) How long does the migration take?
2) Could you log in _before_ the migration on the VNC console and run "dmesg -c"? After the migration, see what "dmesg" shows. You may also configure a serial connection if you can't log in via SSH anymore. This will also allow you to copy and paste all kernel logs here.

http://pve.proxmox.com/wiki/FAQ#How_can_I_access_Linux_guests_through_a_serial_console.3F
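
Also, if the disk gets flagged offline again after a migration, it can sometimes be brought back without a full stop/start of the VM (untested here, again assuming sda; a remount may still be needed if the filesystem went read-only):

cat /sys/block/sda/device/state            # shows "offline" in that situation
echo running > /sys/block/sda/device/state # put the device back into service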
 
The migration takes about 2 to 3 minutes.

Right now I'm on another project at work, so I can't play with it. I will do what you proposed in a few days.
 

I see these problems (IO errors) as well with guests having 2 GB RAM.
Works fine for smaller guests.


Which brings us to a question: why is savevm/loadvm used for migration?

Live migration is way better/faster (almost no guest downtime).

It also doesn't break big guests like the one above (at least when the guest's storage is on a shared SAN).
 

Hmm, or does it break as well? I have to do some more tests.
 
I imagine that could work if the guest were paused before the migration and "cont" (continued) after the migration is done.
I didn't test if it works, though.
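
In QEMU monitor terms that would be roughly (untested, as said):

# monitor of the source VM, before its state is saved
stop
# monitor of the restored VM on the target, after the state is loaded
cont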

And why do you think that is faster than the current approach? Ah, maybe because we can avoid some disk IO (save/restore of the memory)?
 

Yes, skipping the save/restore of memory could save some IO - here, we only copy memory to the other host. When guests have 1 or 2 GB of RAM, it can make a difference.


But the main advantage would of course be with shared storage.
With "live migration", the guest is not "paused" at all.
It keeps working while its pages are being migrated between the hosts.
For example, with live migration, if you ping the guest from another host, you will lose 1-2 pings, or a similarly low number. Even when logged in via SSH you may not notice that the guest is being migrated!

With the save/restore approach and a 2 GB RAM guest, you will lose 2-3 minutes of connectivity - long enough to reset existing TCP connections.


As I understand it, shared storage is planned for the next Proxmox VE release - I would like to see "live migration" there.
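
For comparison, plain QEMU/KVM live migration with shared storage looks roughly like this (a sketch - the port is made up, the target IP is just the one from the log above, and this is not what the current Proxmox VE migration does):

# target host: start the VM with an identical configuration, waiting for incoming state
kvm ... -incoming tcp:0:6000
# source host, in the QEMU monitor:
migrate -d tcp:192.168.80.155:6000
info migrate   # poll until it reports "completed"; the guest keeps running meanwhile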
 
