Proxmox DRBD + GFS2 - Online migration fails

frost

Feb 17, 2014
Good day.

I have a two-node cluster based on pve-manager/3.1-21/93bf03d4 (running kernel 2.6.32-26-pve), with storage configured using DRBD replication and a GFS2 filesystem. Offline migration works well, but online migration fails with this error:
Code:
Feb 17 16:55:36 starting migration of CT 100 to node 'srv1' (192.168.0.1)
Feb 17 16:55:36 container is running - using online migration
Feb 17 16:55:36 container data is on shared storage 'ssd-replica'
Feb 17 16:55:36 start live migration - suspending container
Feb 17 16:55:46 # vzctl --skiplock chkpnt 100 --suspend
Feb 17 16:55:36 Setting up checkpoint...
Feb 17 16:55:36 	suspend...
Feb 17 16:55:46 Can not suspend container: Interrupted system call
Feb 17 16:55:46 Error: timed out (10 seconds).
Feb 17 16:55:46 Error: Unfrozen tasks (no more than 10): see dmesg output.
Feb 17 16:55:46 ERROR: Failed to suspend container: Checkpointing failed
Feb 17 16:55:46 aborting phase 1 - cleanup resources
Feb 17 16:55:46 start final cleanup
Feb 17 16:55:46 ERROR: migration aborted (duration 00:00:11): Failed to suspend container: Checkpointing failed
TASK ERROR: migration aborted

In syslog:
Code:
Feb 17 16:55:35 srv2 pvedaemon[3761]: <root@pam> starting task UPID:srv2:00003DA1:0003340B:530206C7:vzmigrate:100:root@pam:
Feb 17 16:55:39 srv2 pvedaemon[3761]: command '/usr/sbin/vzctl exec 100 /bin/cat /proc/net/dev' failed: exit code 8
Feb 17 16:55:42 srv2 pvestatd[4090]: command '/usr/sbin/vzctl exec 100 /bin/cat /proc/net/dev' failed: exit code 8
Feb 17 16:55:45 srv2 pvestatd[4090]: command '/usr/sbin/vzctl exec 100 /bin/cat /proc/net/dev' failed: exit code 8
Feb 17 16:55:45 srv2 pvestatd[4090]: status update time (6.070 seconds)
Feb 17 16:55:46 srv2 kernel: CPT ERR: ffff880bedae9000,100 :timed out (10 seconds).
Feb 17 16:55:46 srv2 kernel: CPT ERR: ffff880bedae9000,100 :Unfrozen tasks (no more than 10): see dmesg output.
Feb 17 16:55:46 srv2 kernel: saslauthd     D ffff880beb8a71a0     0 15636  15598  100 0x00800004
Feb 17 16:55:46 srv2 kernel: ffff880c26ad1dd8 0000000000000082 0000000000000000 ffff880beb8a71a0
Feb 17 16:55:46 srv2 kernel: 0000000126ad1e48 0000000000000000 0000000000000000 ffffffffa05f2720
Feb 17 16:55:46 srv2 kernel: 0000000000000286 00000001001b3cc5 ffff880c26ad1fd8 ffff880c26ad1fd8
Feb 17 16:55:46 srv2 kernel: Call Trace:
Feb 17 16:55:46 srv2 kernel: [<ffffffff8109b40e>] ? prepare_to_wait+0x4e/0x80
Feb 17 16:55:46 srv2 kernel: [<ffffffffa05e5ea3>] dlm_posix_lock+0x193/0x360 [dlm]
Feb 17 16:55:46 srv2 kernel: [<ffffffff8109b440>] ? autoremove_wake_function+0x0/0x40
Feb 17 16:55:46 srv2 kernel: [<ffffffffa081d599>] gfs2_lock+0x79/0xf0 [gfs2]
Feb 17 16:55:46 srv2 kernel: [<ffffffff811f2ff3>] vfs_lock_file+0x23/0x40
Feb 17 16:55:46 srv2 kernel: [<ffffffff811f3693>] fcntl_setlk+0x143/0x2f0
Feb 17 16:55:46 srv2 kernel: [<ffffffff811b3c67>] sys_fcntl+0xc7/0x550
Feb 17 16:55:46 srv2 kernel: [<ffffffff81543b65>] ? page_fault+0x25/0x30
Feb 17 16:55:46 srv2 kernel: [<ffffffff8100b182>] system_call_fastpath+0x16/0x1b
Feb 17 16:55:46 srv2 pvedaemon[15777]: migration aborted
Feb 17 16:55:46 srv2 pvedaemon[3761]: <root@pam> end task UPID:srv2:00003DA1:0003340B:530206C7:vzmigrate:100:root@pam: migration aborted
If I execute the "/usr/sbin/vzctl exec 100 /bin/cat /proc/net/dev" command directly in the console, I get the following output:
Code:
root@srv2:/var/log# /usr/sbin/vzctl exec 100 /bin/cat /proc/net/dev
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
venet0:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
root@srv2:/var/log#
Please help me find and fix the problem. Thank you!
 
Frost,

I fought with this problem for some time, and there is a workaround.
The problem is related to the locks on the filesystem, which GFS2 caches and shares via DLM so they can be published to the other host.
If you mount the GFS2 filesystem with the option "-o localflocks" (see the example below), checkpointing of the VZ container works correctly.
This is not a recommended setting for filesystems that are shared with another host. However, when running virtualization on top of GFS2, there is no risk of lock conflicts, because both nodes never touch the same filesystem areas in parallel.
I have been using this configuration in a test environment for a few days, and it seems to work fine.
Please test it as well.
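
For example (a minimal sketch; the DRBD device /dev/drbd0 and the mount point /mnt/ssd-replica are just placeholders, adjust them to your setup):
Code:
# one-off mount with local file locking instead of DLM-coordinated locks
mount -t gfs2 -o localflocks /dev/drbd0 /mnt/ssd-replica

# or make it persistent with an /etc/fstab entry
/dev/drbd0  /mnt/ssd-replica  gfs2  noatime,localflocks  0 0
Keep in mind that with localflocks, fcntl/flock locks are no longer coordinated across the nodes through DLM, so this is only safe as long as each container's data is accessed from a single node at a time.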

Best Regards
barni
 
Barni, thank you very much! This option really fixes online migration.
Now it works well, but I have another problem with powering on VMs. After a VM shutdown or a server restart, I can't power on the VM and/or migrate it to another host from the GUI with HA activated. I get the following errors on VM start:
Code:
Mar 17 11:42:52 rgmanager #43: Service pvevm:100 has failed; can not start.
Mar 17 11:42:52 rgmanager #13: Service pvevm:100 failed to stop cleanly
and this on VM migration:
Code:
Executing HA migrate for CT 101 to node srv1
Trying to migrate pvevm:101 to srv1...Temporary failure; try again
TASK ERROR: command 'clusvcadm -M pvevm:101 -m srv1' failed: exit code 250
The only way to start the VM is to execute these commands:
Code:
root@srv2:/var/log/cluster# clusvcadm -d pvevm:100
Local machine disabling pvevm:100...Success
root@srv2:/var/log/cluster# clusvcadm -e pvevm:100
Local machine trying to enable pvevm:100...Success
pvevm:100 is now running on srv2
How can I fix this problem?

Thank you for the help.
 