ZFS migration issue on one node

optim

I have a three-node cluster that uses ZFS local storage on each node. For some reason, an error is generated when I try to migrate some VMs to a certain node. The error text looks like an uncaught exception bubbling up into the output of the migration script. Here's the error text (note the bold):

Feb 25 10:16:41 starting migration of VM 128 to node 'pve-atom16' (192.168.10.15)
Feb 25 10:16:41 copying disk images
internal error: Value too large for defined data type
Feb 25 10:16:42 ERROR: Failed to sync data - command 'zpool import -d /dev/disk/by-id/ -a' failed: got signal 6
Feb 25 10:16:42 aborting phase 1 - cleanup resources
Feb 25 10:16:42 ERROR: migration aborted (duration 00:00:01): Failed to sync data - command 'zpool import -d /dev/disk/by-id/ -a' failed: got signal 6
TASK ERROR: migration aborted

All nodes are running the latest version in pve-test:

root@pve-atom16:~# pveversion -v

proxmox-ve: 4.1-38 (running kernel: 4.2.8-1-pve)
pve-manager: 4.1-15 (running version: 4.1-15/8cd55b52)
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.2.8-1-pve: 4.2.8-38
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-33
qemu-server: 4.0-59
pve-firmware: 1.1-7
libpve-common-perl: 4.0-49
libpve-access-control: 4.0-11
libpve-storage-perl: 4.0-41
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-7
pve-container: 1.0-46
pve-firewall: 2.0-18
pve-ha-manager: 1.0-23
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie

My zpool setup is a mirror (with SSD ZIL/Cache on 2 nodes). All zpools are named the same so that the migration works across nodes.

root@pve-atom16:~# zpool status

pool: zfspool
state: ONLINE
scan: none requested
config:

NAME          STATE     READ WRITE CKSUM
zfspool       ONLINE       0     0     0
  mirror-0    ONLINE       0     0     0
    sdf       ONLINE       0     0     0
    sdg       ONLINE       0     0     0


The cluster status follows:

root@pve-atom16:~# pvecm status
Quorum information
------------------
Date: Thu Feb 25 11:23:47 2016
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000003
Ring ID: 200
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid      Votes  Name
0x00000002  1      192.168.10.14
0x00000003  1      192.168.10.15 (local)
0x00000001  1      192.168.10.16


What is really weird is that it only occurs for some VMs. Others migrate fine. I am also able to do a "zfs send" of the zvol to the remote node without issue. I can then copy the .conf over and the VM starts up fine.

I'm going to try to sift through the Perl scripts (even though I'm a Python guy!) to see if I can figure out where the error is coming from, but I thought I'd ask here in case anyone could point me to why this might be happening or where I should start looking first.

Thanks for any/all help!

Daniel
 
Figured it out.

I launched the migration with the "qm migrate" command, running it under the Perl debugger. A breakpoint and some logging statements led me to lines 196-197 of /usr/share/perl5/PVE/QemuMigrate.pm.

PVE::QemuMigrate::sync_disks(/usr/share/perl5/PVE/QemuMigrate.pm:196):
196: my $dl = PVE::Storage::vdisk_list($self->{storecfg}, $storei
197: PVE::Storage::foreach_volid($dl, sub {


To briefly summarize, I had defined an old SSD ZFS pool in my storage configuration as being applicable to all nodes, when only one node had access to that ZFS pool. As the routine above was parsing the missing pool (I think I dumped the $volid during my tests), it would error out with the generic error. I'm not a big Perl guy, so I am documenting it here in the hope that the devs can have a look and perhaps bolster the code to give a better error message (if possible).

The good news (for me!) is that the migration worked fine once I changed my storage configuration to list that SSD pool as being applicable only to the node it was physically located on.
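For anyone who wants to check for the same misconfiguration before migrating, here is a rough Python sketch (not part of Proxmox) that lists ZFS storages with no "nodes" restriction. It assumes the usual layout of /etc/pve/storage.cfg: a "type: id" header followed by indented "key value" option lines. The storage ID "ssdpool" and node name "pve-node1" in the sample are made up for illustration, and the function names are my own. An unrestricted entry is only a problem if the pool does not actually exist on every node, so treat the output as a list of candidates to verify by hand.

```python
def parse_storage_cfg(text):
    """Parse storage.cfg-style text into {storage_id: {"type": ..., option: value}}."""
    storages = {}
    current = None
    for raw in text.splitlines():
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        if not raw[0].isspace():
            # Unindented line is a section header, e.g. "zfspool: ssdpool"
            stype, _, sid = raw.partition(":")
            current = {"type": stype.strip()}
            storages[sid.strip()] = current
        elif current is not None:
            # Indented line is an option of the current section, e.g. "nodes pve-node1"
            parts = raw.split(None, 1)
            current[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return storages


def unrestricted_zfs_storages(text):
    """Return sorted IDs of zfspool storages that have no 'nodes' option."""
    return sorted(
        sid
        for sid, opts in parse_storage_cfg(text).items()
        if opts["type"] == "zfspool" and "nodes" not in opts
    )


# Hypothetical sample: "zfspool" is shared by all nodes, "ssdpool" is pinned to one.
SAMPLE = (
    "zfspool: zfspool\n"
    "\tpool zfspool\n"
    "\tcontent images,rootdir\n"
    "\n"
    "zfspool: ssdpool\n"
    "\tpool ssdpool\n"
    "\tcontent images\n"
    "\tnodes pve-node1\n"
)

print(unrestricted_zfs_storages(SAMPLE))
```

On a real node you would pass in the contents of /etc/pve/storage.cfg instead of SAMPLE, then confirm with "zpool list" on each node that every reported pool really exists there.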

Hopefully that'll save someone else some frustration in the future.

Thank you to the Proxmox team and all the other devs who have contributed. I love the product and truly appreciate all your efforts!

Daniel
 
