IO error after live migration

soessle

New Member
Feb 23, 2009
Hi,

I just did an online migration of a KVM node and it worked almost fine - except I get IO errors :-(

On the node I have Ubuntu 8.04 installed. If I do a "sudo -s" I get "-bash: /usr/bin/sudo: Input/output error" in response.

After "switch off/switch on" of the node everything is OK.

Could somebody help me find the root of this problem? Which logfile should I look at?

Thank you.
 

Do both hosts have access to the guest's file image / block device? That is required; otherwise, after migration, the guest will not be able to access its drive.
 

What are you talking about? We copy the files, so the guest is always able to access the drive after migration.
 

If you use NFS or iSCSI, you don't copy anything, obviously, but configure the hosts to access the same paths.


I haven't used live migration with Proxmox VE yet, but are you sure this is what is done?

- pausing the guest
- copying the guest's disk (file) image, e.g. 20 GB, to the other host
- transferring runtime data (RAM, state, etc.)
- resuming the guest on the other host

If yes, I wouldn't call it a "live" migration.
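
To illustrate the shared-storage setup I mean, just a sketch - the NFS server address and export path are made up here, and /var/lib/vz is simply the default Proxmox VE image directory:

# on BOTH hosts, mount the same export at the same path,
# so the guest's image file is visible at an identical location
mount -t nfs 192.168.80.1:/export/vz /var/lib/vz
# with this in place, a migration only has to move RAM and device state,
# never the disk image itself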
 

Downtime depends on the size of the disk file and the memory setting. The process of the current live migration of KVM VMs is a bit more advanced than you wrote: we do several rsyncs (before suspend), so there is minimal downtime here.

If you have reasonably fast systems (server-class hardware and network), you can live migrate a Win XP guest with 512 MB RAM with about 10 to 20 seconds of downtime (the less memory in the guest, the faster).
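
Roughly, the sequence looks like this (a simplified sketch, not the exact qmigrate code; the VM id, paths and target host are only examples):

# 1st rsync while the VM is still running - moves the bulk of the disk image
rsync -a /var/lib/vz/images/101/ root@192.168.80.155:/var/lib/vz/images/101/
# then: suspend the VM and dump its state (downtime starts here)
# 2nd rsync only transfers what changed during the first pass,
# so the suspended phase stays short
rsync -a /var/lib/vz/images/101/ root@192.168.80.155:/var/lib/vz/images/101/
# finally: copy the state dump and start/restore the VM on the target host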
 
Is the error reproducible?

The error is not always reproducible.

This is the log of the migration:
--- begin log ---
/usr/bin/ssh -t -t -n -o BatchMode=yes 192.168.80.154 /usr/sbin/qmigrate --online 192.168.80.155 101
tcgetattr: Inappropriate ioctl for device
starting migration of VM 101 to host '192.168.80.155'
starting data sync
building file list ...
0 files... 2 files to consider
created directory /var/lib/vz/images/101
./
vm-101-disk.qcow2
rsync status: 13453234176 100% 54.26MB/s 0:03:56 (xfer#1, to-check=0/2)

sent 13454876547 bytes received 48 bytes 56652111.98 bytes/sec
total size is 13453234176 speedup is 1.00
suspending running VM
dumping state
copying dumpfile
building file list ...
0 files... 1 file to consider
VM101.state
rsync status: 237226489 100% 55.32MB/s 0:00:04 (xfer#1, to-check=0/1)

sent 237255541 bytes received 42 bytes 52723462.89 bytes/sec
total size is 237226489 speedup is 1.00
starting second sync
building file list ...
0 files... 2 files to consider
vm-101-disk.qcow2
rsync status: 13453234176 100% 78.92MB/s 0:02:42 (xfer#1, to-check=0/2)

sent 1392061 bytes received 927986 bytes 5821.95 bytes/sec
total size is 13453234176 speedup is 5798.69
starting/restoring VM on remote host
qemu_popen: returning result of qemu_fopen_ops
online again after 415 seconds
Connection to 192.168.80.154 closed.
VM 101 migration done
--- end log ---

When I connect through the VNC console I now get the following error:
[ a timecode ] sd 2:0:0:0: rejecting I/O to offline device

After Stop/Start of the KVM everything is OK.

In the logfiles I don't find a hint as to what it could be :-(
 

Maybe it's because the downtime is too long, causing the device to go offline, but that's just a guess.
 

It may be a good guess. I think the SCSI timeout is 2 minutes by default, settable in /sys.
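
Something like this inside the guest (just a sketch; I assume the disk shows up as sda - adjust the device name):

cat /sys/block/sda/device/timeout        # current SCSI command timeout in seconds
echo 300 > /sys/block/sda/device/timeout # raise it as a test, so a long pause is less likely to trigger it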

To OP:

1) How long does the migration take?
2) Could you log in _before_ the migration on the VNC console and run "dmesg -c"? After the migration, see what "dmesg" shows. You may also configure a serial connection if you can't log in via SSH anymore. This will also allow you to copy and paste all kernel logs here.

http://pve.proxmox.com/wiki/FAQ#How_can_I_access_Linux_guests_through_a_serial_console.3F
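
Also, if the disk gets flagged offline again after a migration, it can sometimes be brought back without a full stop/start of the VM (untested here, again assuming sda; a remount may still be needed if the filesystem went read-only):

cat /sys/block/sda/device/state            # shows "offline" in that situation
echo running > /sys/block/sda/device/state # put the device back into service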
 
The migration takes about 2 to 3 minutes.

Right now I'm on another project at work, so I can't play with it. I will do what you proposed in a few days.
 

I see these problems (IO errors) as well with guests having 2 GB RAM.
Works fine for smaller guests.


Which brings us to a question: why is savevm/loadvm used for migration?

Live migration is way better/faster (almost no guest downtime).

It also doesn't break big guests like the one above (at least when the guest's storage is on a shared SAN).
 

Hmm, or does it break as well? I have to do some more tests.
 
I imagine that could work if the guest were paused before the migration and "cont" (continued) after the migration is done.
I didn't test if it works, though.
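
In QEMU monitor terms that would be roughly (untested, as said):

# monitor of the source VM, before its state is saved
stop
# monitor of the restored VM on the target, after the state is loaded
cont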

And why do you think that is faster than the current approach? Ah, maybe because we can avoid some disk IO (save/restore of the memory)?
 

Yes, skipping the save/restore of memory could save some IO - here, we only copy memory to the other host. When guests have 1 or 2 GB of RAM, it can make a difference.


But the main advantage would of course be with shared storage.
With "live migration", the guest is not "paused" at all.
It keeps working while its pages are being migrated between the hosts.
For example, with live migration, if you ping the guest from another host, you will lose 1-2 pings, or a similarly low number. Even when logged in via SSH you may not notice that the guest is being migrated!

With the save/restore approach and a 2 GB RAM guest, you will lose 2-3 minutes of connectivity - long enough to reset existing TCP connections.


As I understand it, shared storage is planned for the next Proxmox VE release - I would like to see "live migration" there.
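
For comparison, plain QEMU/KVM live migration with shared storage looks roughly like this (a sketch - the port is made up, the target IP is just the one from the log above, and this is not what the current Proxmox VE migration does):

# target host: start the VM with an identical configuration, waiting for incoming state
kvm ... -incoming tcp:0:6000
# source host, in the QEMU monitor:
migrate -d tcp:192.168.80.155:6000
info migrate   # poll until it reports "completed"; the guest keeps running meanwhile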
 
