I have a three-node cluster that uses ZFS local storage on each node. For some reason, an error is generated when I try to migrate some VMs to a certain node. The error text looks to be from an uncaught exception that is bubbling up into the output from the migration script. Here's the error text (note the "internal error" line):
Feb 25 10:16:41 starting migration of VM 128 to node 'pve-atom16' (192.168.10.15)
Feb 25 10:16:41 copying disk images
internal error: Value too large for defined data type
Feb 25 10:16:42 ERROR: Failed to sync data - command 'zpool import -d /dev/disk/by-id/ -a' failed: got signal 6
Feb 25 10:16:42 aborting phase 1 - cleanup resources
Feb 25 10:16:42 ERROR: migration aborted (duration 00:00:01): Failed to sync data - command 'zpool import -d /dev/disk/by-id/ -a' failed: got signal 6
TASK ERROR: migration aborted
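For anyone skimming, here is how I read the two parts of that error (a quick sketch; the EOVERFLOW interpretation is mine, not confirmed from the ZFS source):

```shell
# Decoding the failure:
#  - "Value too large for defined data type" is the strerror() text for
#    EOVERFLOW (errno 75 on Linux), classically a 32-bit stat()/readdir()
#    path hitting a 64-bit value.
#  - "got signal 6" is SIGABRT: the child command aborted on an internal
#    assertion instead of exiting cleanly.
# The failing child can be re-run by hand to watch it abort (taken
# verbatim from the log above):
#   zpool import -d /dev/disk/by-id/ -a
# Confirm the errno-to-message mapping on this platform:
python3 -c 'import os, errno; print(errno.EOVERFLOW, os.strerror(errno.EOVERFLOW))'
# -> 75 Value too large for defined data type
```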
All nodes are running the latest version in pve-test:
root@pve-atom16:~# pveversion -v
proxmox-ve: 4.1-38 (running kernel: 4.2.8-1-pve)
pve-manager: 4.1-15 (running version: 4.1-15/8cd55b52)
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.2.8-1-pve: 4.2.8-38
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-33
qemu-server: 4.0-59
pve-firmware: 1.1-7
libpve-common-perl: 4.0-49
libpve-access-control: 4.0-11
libpve-storage-perl: 4.0-41
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-7
pve-container: 1.0-46
pve-firewall: 2.0-18
pve-ha-manager: 1.0-23
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie
My zpool setup is a mirror (with SSD ZIL/Cache on 2 nodes). All zpools are named the same so that the migration works across nodes.
root@pve-atom16:~# zpool status
pool: zfspool
state: ONLINE
scan: none requested
config:
        NAME         STATE     READ WRITE CKSUM
        zfspool      ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            sdf      ONLINE       0     0     0
            sdg      ONLINE       0     0     0
The cluster status follows:
root@pve-atom16:~# pvecm status
Quorum information
------------------
Date: Thu Feb 25 11:23:47 2016
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000003
Ring ID: 200
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000002 1 192.168.10.14
0x00000003 1 192.168.10.15 (local)
0x00000001 1 192.168.10.16
What is really weird is that the error only occurs for some VMs; others migrate fine. I am also able to do a manual "zfs send" of the zvol to the remote node without issue. I can then copy the .conf over and the VM will start up fine.
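For reference, the manual workaround I described is roughly this (a sketch only: the pool name, VMID, and target IP are from this post, but the vm-128-disk-1 dataset name and the /etc/pve config layout are standard PVE defaults that I'm assuming apply here):

```shell
# Snapshot the zvol and send it to the target node (pool names match on
# all nodes, per the setup above). vm-128-disk-1 is the assumed PVE
# default dataset name, not confirmed for this particular VM.
zfs snapshot zfspool/vm-128-disk-1@migrate
zfs send zfspool/vm-128-disk-1@migrate | \
    ssh root@192.168.10.15 zfs recv zfspool/vm-128-disk-1

# Move the VM config so the cluster sees it on the target node
# (assumption: the standard clustered /etc/pve/nodes/<node>/qemu-server
# layout; <source-node> is a placeholder for the node it came from).
mv /etc/pve/nodes/<source-node>/qemu-server/128.conf \
   /etc/pve/nodes/pve-atom16/qemu-server/128.conf
```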
I'm going to try to sift through the perl scripts (even though I'm a Python guy!) to see if I can figure out where the error is coming from, but I thought I'd ask here in case anyone could point me to why this might be happening, or where I should start looking first.
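As a starting point for that, my plan is to grep the installed storage plugins for the literal command string from the error (assumption: the stock Debian install path for PVE's perl modules, with the ZFS local-storage logic in ZFSPoolPlugin.pm):

```shell
# Find where the failing 'zpool import' command is issued inside the
# PVE storage code (path is the standard PVE install location; an
# assumption for this box).
grep -rn "zpool import" /usr/share/perl5/PVE/Storage/
```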
Thanks for any/all help!
Daniel