cloning on ZFS fails most of the time

tuxmin

Hi all,

I'm experimenting with the latest Proxmox VE 3.1 and the ZFS plugin. I have attached a ZFS volume from a ReadyDATA 516 using the Comstar provider. Everything works surprisingly well, but unfortunately I'm having issues while cloning a VM. Most of the time when I trigger a clone operation on a stopped VM, the task either never starts (according to the logs) or crashes after reaching around 10%. "Most of the time" means 4 out of 5 attempts in my case.
I can clone from and to NFS without a hitch, though.
Can anyone help me shed some light on this issue? If only I had an idea what "got signal 13" in the error message means...
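
For reference, the signal number can be translated to a name with kill -l; assuming a standard Linux shell, 13 maps to PIPE (SIGPIPE, a broken pipe), which would fit a connection being dropped while qemu-img is still writing:

root@SRV01:~# kill -l 13
PIPE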


Here are two error logs of crashed cloning tasks:

## log 1
create full clone of drive virtio0 (ZFS-ReadyDATA:vm-101-disk-1)
transferred: 0 bytes remaining: 53687091200 bytes total: 53687091200 bytes progression: 0.00 %
TASK ERROR: clone failed: copy failed: command '/usr/bin/qemu-img convert -t writeback -p -C -f raw -O raw
iscsi://192.168.100.4/iqn.1994-11.com.netgear:storage01:168bdb93:proxmox/4
iscsi://192.168.100.4/iqn.1994-11.com.netgear:storage01:168bdb93:proxmox/0'
failed: got signal 13


## log 2
create full clone of drive virtio0 (ZFS-ReadyDATA:vm-101-disk-1)
transferred: 0 bytes remaining: 53687091200 bytes total: 53687091200 bytes progression: 0.00 %
transferred: 1073741824 bytes remaining: 52613349376 bytes total: 53687091200 bytes progression: 2.00 %
transferred: 2152852357 bytes remaining: 51534238843 bytes total: 53687091200 bytes progression: 4.01 %
transferred: 3226594181 bytes remaining: 50460497019 bytes total: 53687091200 bytes progression: 6.01 %
transferred: 4305704714 bytes remaining: 49381386486 bytes total: 53687091200 bytes progression: 8.02 %
transferred: 5379446538 bytes remaining: 48307644662 bytes total: 53687091200 bytes progression: 10.02 %
TASK ERROR: clone failed: copy failed: command '/usr/bin/qemu-img convert -t writeback -p -C -f raw -O raw
iscsi://192.168.100.4/iqn.1994-11.com.netgear:storage01:168bdb93:proxmox/4
iscsi://192.168.100.4/iqn.1994-11.com.netgear:storage01:168bdb93:proxmox/0'
failed: got signal 13
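
(To narrow this down, the command from the task log can also be re-run by hand outside of the Proxmox task worker, so any error output is visible on the terminal instead of only in the task log. Be careful: the second URL is the clone target and will be overwritten, so only do this against a scratch disk you don't mind losing.)

root@SRV01:~# /usr/bin/qemu-img convert -t writeback -p -C -f raw -O raw \
    'iscsi://192.168.100.4/iqn.1994-11.com.netgear:storage01:168bdb93:proxmox/4' \
    'iscsi://192.168.100.4/iqn.1994-11.com.netgear:storage01:168bdb93:proxmox/0'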


Here are some details about my setup:

root@SRV01:~# cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content images,iso,vztmpl,rootdir
        maxfiles 0

zfs: ZFS-ReadyDATA
        blocksize 64k
        target iqn.1994-11.com.netgear:storage01:168bdb93:proxmox
        pool Volume01
        iscsiprovider comstar
        portal 192.168.100.4
        content images

nfs: nfs-readydata
        path /mnt/pve/nfs-readydata
        server 192.168.100.4
        export /Volume01/nfs-proxmox
        options vers=3
        content images,iso,vztmpl,backup
        maxfiles 1


root@SRV01:~# pveversion -v
proxmox-ve-2.6.32: 3.1-114 (running kernel: 2.6.32-26-pve)
pve-manager: 3.1-24 (running version: 3.1-24/060bd5a6)
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-26-pve: 2.6.32-114
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.0-2
pve-cluster: 3.0-8
qemu-server: 3.1-8
pve-firmware: 1.0-23
libpve-common-perl: 3.0-9
libpve-access-control: 3.0-8
libpve-storage-perl: 3.0-18
pve-libspice-server1: 0.12.4-2
vncterm: 1.1-6
vzctl: 4.0-1pve4
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.4-17
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.1-1



Cheers,

alex
 
What do you see in the log files on the Proxmox host and on the storage server?

Can you paste:
zpool status -v
zpool get all <pool>
zfs get all <pool>/<dataset>
 
Hi mir,

here is the requested information. I gathered it while a clone task from VM 101 to 666 was running:

root@Storage01:~# zpool status -v
pool: Volume01
state: ONLINE
scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        Volume01    ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0

errors: No known data errors

pool: rpool
state: ONLINE
scan: resilvered 272M in 0h0m with 0 errors on Wed Nov 13 22:56:55 2013
config:

        NAME                          STATE     READ WRITE CKSUM
        rpool                         ONLINE       0     0     0
          /Volume01/._system/Volume01 ONLINE       0     0     0

errors: No known data errors


root@Storage01:~# zpool get all Volume01
NAME PROPERTY VALUE SOURCE
Volume01 size 8.12T -
Volume01 capacity 7% -
Volume01 altroot - default
Volume01 health ONLINE -
Volume01 guid 12705663366712162022 default
Volume01 version 28 default
Volume01 bootfs - default
Volume01 delegation on default
Volume01 autoreplace off default
Volume01 cachefile - default
Volume01 failmode continue local
Volume01 listsnapshots off default
Volume01 autoexpand on local
Volume01 dedupditto 0 default
Volume01 dedupratio 1.00x -
Volume01 free 7.50T -
Volume01 allocated 645G -
Volume01 readonly off -


root@Storage01:~# zfs list
NAME USED AVAIL REFER MOUNTPOINT
Volume01 525G 4.81T 46.6K /Volume01
Volume01/._share 1G 4.81T 74.6K /Volume01/._share
Volume01/._system 10.0G 4.82T 3.54G /Volume01/._system
Volume01/backup 250G 288G 212G /Volume01/backup
Volume01/home 46.6K 4.81T 46.6K /Volume01/home
Volume01/nfs-proxmox 68.8G 431G 68.8G /Volume01/nfs-proxmox
Volume01/vm-101-disk-1 18.1G 4.81T 18.1G -
Volume01/vm-325-disk-1 80.6G 4.81T 80.6G -
Volume01/vm-326-disk-1 29.1G 4.81T 29.1G -
Volume01/vm-327-disk-1 17.3G 4.81T 17.3G -
Volume01/vm-666-disk-1 50.2G 4.86T 128M -
rpool 436M 9.41G 430M /.system


root@Storage01:~# zfs get all Volume01/vm-666-disk-1
NAME PROPERTY VALUE SOURCE
Volume01/vm-666-disk-1 type volume -
Volume01/vm-666-disk-1 creation Wed Feb 19 19:26 2014 -
Volume01/vm-666-disk-1 used 50.2G -
Volume01/vm-666-disk-1 available 4.86T -
Volume01/vm-666-disk-1 referenced 128M -
Volume01/vm-666-disk-1 compressratio 1.09x -
Volume01/vm-666-disk-1 reservation none default
Volume01/vm-666-disk-1 volsize 50G local
Volume01/vm-666-disk-1 volblocksize 64K -
Volume01/vm-666-disk-1 checksum on default
Volume01/vm-666-disk-1 compression zle local
Volume01/vm-666-disk-1 readonly off default
Volume01/vm-666-disk-1 copies 1 default
Volume01/vm-666-disk-1 refreservation 50.2G local
Volume01/vm-666-disk-1 primarycache all default
Volume01/vm-666-disk-1 secondarycache all default
Volume01/vm-666-disk-1 usedbysnapshots 0 -
Volume01/vm-666-disk-1 usedbydataset 128M -
Volume01/vm-666-disk-1 usedbychildren 0 -
Volume01/vm-666-disk-1 usedbyrefreservation 50.1G -
Volume01/vm-666-disk-1 logbias latency default
Volume01/vm-666-disk-1 dedup off default
Volume01/vm-666-disk-1 mlslabel none default
Volume01/vm-666-disk-1 sync standard default


On the storage device I see the following messages, which come up the moment I trigger the clone task:

Feb 19 19:26:05 Storage01 zfs: [ID 327478 kern.notice] NOTICE: The property is refreservation
Feb 19 19:26:05 Storage01 zfs: [ID 327478 kern.notice] NOTICE: The property is refreservation
Feb 19 19:26:05 Storage01 zfs: [ID 327478 kern.notice] NOTICE: The property is stmf_sbd_lu
Feb 19 19:26:05 Storage01 zfs: [ID 327478 kern.notice] NOTICE: The property is stmf_sbd_lu
Feb 19 19:26:06 Storage01 stmf_sbd: [ID 716975 kern.warning] WARNING: prop lookup for compression failed
Feb 19 19:26:06 Storage01 stmf_sbd: [ID 716975 kern.warning] WARNING: prop lookup for compression failed
Feb 19 19:26:06 Storage01 zfs: [ID 327478 kern.notice] NOTICE: The property is compression
Feb 19 19:26:06 Storage01 zfs: [ID 327478 kern.notice] NOTICE: The property is compression


In the logs on the Proxmox host there is not much that looks suspicious:


root@SRV01:/var/log/pve/tasks# cat ./F/UPID:SRV01:00067184:021FD565:5304F78F:qmclone:101:runge@pve:
create full clone of drive virtio0 (ZFS-ReadyDATA:vm-101-disk-1)
transferred: 0 bytes remaining: 53687091200 bytes total: 53687091200 bytes progression: 0.00 %
TASK ERROR: clone failed: copy failed: command '/usr/bin/qemu-img convert -t writeback -p -C -f raw -O raw iscsi://192.168.100.4/iqn.1994-11.com.netgear:storage01:168bdb93:proxmox/4 iscsi://192.168.100.4/iqn.1994-11.com.netgear:storage01:168bdb93:proxmox/0' failed: got signal 13


I do find a warning in /var/log/daemon.log, though:
Feb 19 19:57:10 SRV01 pvestatd[2835]: WARNING: command '/usr/bin/ssh -o 'BatchMode=yes' -i /etc/pve/priv/zfs/192.168.100.4_id_rsa root@192.168.100.4 zfs get -o value -Hp available,used Volume01' failed: got timeout
Feb 19 19:57:10 SRV01 pvestatd[2835]: status update time (5.035 seconds)


However, when I run the above command manually, it succeeds:

root@SRV01:/var/log# /usr/bin/ssh -o 'BatchMode=yes' -i /etc/pve/priv/zfs/192.168.100.4_id_rsa root@192.168.100.4 zfs get -o value -Hp available,used Volume01
5346856248067
510048886013

I am not sure whether the above message is related to the clone issue, as there have been hundreds (456) of these log messages over the last 3 days.
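
If it is a borderline timeout, wrapping the same command from the warning in time would show how close the round trip gets to whatever limit pvestatd applies:

root@SRV01:/var/log# time /usr/bin/ssh -o 'BatchMode=yes' -i /etc/pve/priv/zfs/192.168.100.4_id_rsa \
    root@192.168.100.4 zfs get -o value -Hp available,used Volume01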


Cheers,

alex
 
This setting caught my attention: "compression zle local"

Could you try changing the compression algorithm to either lzjb or lz4? I have no experience with zle.
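
Assuming the ReadyDATA gives you a normal shell with the standard zfs tools (your zfs get output suggests it does), something along these lines on the zvol should do it; lz4 will only work if the pool actually supports it:

Code:
zfs set compression=lzjb Volume01/vm-666-disk-1
# or, if the pool supports it:
zfs set compression=lz4 Volume01/vm-666-disk-1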

It could also be an (i)SCSI error.

Do you know which OS the Netgear runs and which ZFS version it uses?
 
Could you try the following:
Code:
stmfadm list-view -l $(sbdadm list-lu | grep Volume01/vm-666-disk-1 | awk '{print $1}')
You should see:
View Entry: 0
    Host group   : All
    Target group : All
    LUN          : 4


stmfadm list-view -l $(sbdadm list-lu | grep Volume01/vm-101-disk-1 | awk '{print $1}')
You should see:
View Entry: 0
    Host group   : All
    Target group : All
    LUN          : 0

The fact that the command fails when run from Proxmox but succeeds when run manually could indicate that root's SSH key from one or all Proxmox servers is not installed properly on the Netgear, or point to some other permission- or network-related error.

I also just noticed that you have chosen a blocksize of 64k. This is very big, since the default is 4k, and it could also be part of the problem.
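
If you want to test with a smaller block size, it is set per storage in storage.cfg; note that it only applies to newly created disks, existing zvols keep their volblocksize. Roughly:

Code:
zfs: ZFS-ReadyDATA
        blocksize 4k
        target iqn.1994-11.com.netgear:storage01:168bdb93:proxmox
        pool Volume01
        iscsiprovider comstar
        portal 192.168.100.4
        content images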

You have:
Volume01 failmode continue local (I have: failmode wait)
Volume01 autoexpand on local (I have: autoexpand off)

I just noticed you have zfs version 28. Is this Oracle ZFS or OpenZFS?
In the case of OpenZFS, what feature flags do you have?

They should be displayed at the bottom of a zpool get all <pool>, e.g.:
vMotion feature@async_destroy enabled local
vMotion feature@empty_bpobj active local
vMotion feature@lz4_compress active local
vMotion feature@multi_vdev_crash_dump enabled local
vMotion feature@spacemap_histogram active local
vMotion feature@extensible_dataset enabled local
 
CLOSED: cloning on ZFS fails most of the time

Hi mir,

Sorry for the late response; I got ill in the meantime and could only catch up recently. Anyway, thanks for all your suggestions. It turned out that this is an issue with the iSCSI implementation on the Netgear device, or a general network issue: I was able to trigger other connection interruptions from outside Proxmox when moving large amounts of data via iSCSI. NFS is never a problem, though.
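
(One way to exercise the iSCSI path outside of Proxmox is to read an existing LUN through qemu-img into a scratch file, for example on the NFS share; the LUN number and target path below are just examples, and the target needs enough free space for the whole volume:)

root@SRV01:~# qemu-img convert -p -f raw -O raw \
    'iscsi://192.168.100.4/iqn.1994-11.com.netgear:storage01:168bdb93:proxmox/0' \
    /mnt/pve/nfs-readydata/lun0-readtest.raw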


Best,


alex
 
