Hello,
I'm testing HA on a 4-node Proxmox VE 5 cluster (latest version) with local ZFS storage + replication.
When a node goes down, the container is migrated correctly to the second node, but when the failed node comes back online the failback fails with this error:
task started by HA resource agent
2017-11-14 18:06:40 starting migration of CT 100 to node 'nodo1' (192.168.100.11)
2017-11-14 18:06:40 found local volume 'SSDstorage:subvol-100-disk-1' (in current VM config)
full send of SSDstorage/subvol-100-disk-1@rep_TestBackup_2017-11-14_17:27:52 estimated size is 547M
send from @rep_TestBackup_2017-11-14_17:27:52 to SSDstorage/subvol-100-disk-1@rep_TestBackup_2017-11-14_17:28:01 estimated size is 66.6K
send from @rep_TestBackup_2017-11-14_17:28:01 to SSDstorage/subvol-100-disk-1@__replicate_100-0_1510678800__ estimated size is 1.19M
send from @__replicate_100-0_1510678800__ to SSDstorage/subvol-100-disk-1@__migration__ estimated size is 1.19M
total estimated size is 549M
TIME SENT SNAPSHOT
SSDstorage/subvol-100-disk-1 name SSDstorage/subvol-100-disk-1 -
volume 'SSDstorage/subvol-100-disk-1' already exists
command 'zfs send -Rpv -- SSDstorage/subvol-100-disk-1@__migration__' failed: got signal 13
send/receive failed, cleaning up snapshot(s)..
2017-11-14 18:06:40 ERROR: command 'set -o pipefail && pvesm export SSDstorage:subvol-100-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=nodo1' root@192.168.100.11 -- pvesm import SSDstorage:subvol-100-disk-1 zfs - -with-snapshots 0 -delete-snapshot __migration__' failed: exit code 255
2017-11-14 18:06:40 aborting phase 1 - cleanup resources
2017-11-14 18:06:40 ERROR: found stale volume copy 'SSDstorage:subvol-100-disk-1' on node 'nodo1'
2017-11-14 18:06:40 start final cleanup
2017-11-14 18:06:40 ERROR: migration aborted (duration 00:00:00): command 'set -o pipefail && pvesm export SSDstorage:subvol-100-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=nodo1' root@192.168.100.11 -- pvesm import SSDstorage:subvol-100-disk-1 zfs - -with-snapshots 0 -delete-snapshot __migration__' failed: exit code 255
TASK ERROR: migration aborted
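The failure seems to be the receiving side refusing to overwrite the dataset that is still present on the recovered node ("volume 'SSDstorage/subvol-100-disk-1' already exists"). A quick way to confirm the leftover copy on that node (just a sketch, assuming the pool name and CT ID from the log above):

root@nodo1:~# zfs list -t all -r SSDstorage | grep subvol-100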
----------------------------------------------
- Workaround: delete the snapshot and the container image on the failed node, then migrate the container back manually (see the example below).
E.g.:
root@nodo2:~# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
SSDstorage 599M 3.59T 30.6K /SSDstorage
SSDstorage/subvol-101-disk-1 598M 7.42G 597M /SSDstorage/subvol-101-disk-1
SSDstorage/subvol-101-disk-1@__replicate_101-0_1510683082__ 1.31M - 597M -
rpool 9.85G 221G 96K /rpool
rpool/ROOT 1.34G 221G 96K /rpool/ROOT
rpool/ROOT/pve-1 1.34G 221G 1.34G /
rpool/data 96K 221G 96K /rpool/data
rpool/swap 8.50G 229G 64K -
root@nodo2:~# zfs destroy -r SSDstorage/subvol-101-disk-1
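Once the leftover dataset is gone, the migration can be triggered again by hand. A minimal sketch, assuming the affected guest is CT 100 and the target node is nodo1 as in the log above (HA-managed guests are moved via ha-manager rather than pct):

root@nodo2:~# ha-manager migrate ct:100 nodo1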
- Live migration works fine after deleting the old volume on the failed server.
- Live migration also works when no node has failed, i.e. in the normal state.
Any ideas?
Thanks!
pveversion -V
proxmox-ve: 5.1-26 (running kernel: 4.13.4-1-pve)
pve-manager: 5.1-36 (running version: 5.1-36/131401db)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.10.15-1-pve: 4.10.15-15
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9