ZFS "Connection error - Timeout."

Aug 19, 2019
Greetings to all,

I am currently running into a serious issue with ZFS on one of my Proxmox systems. Cloning a VM from a template is extremely slow, although it does complete. The real problem, however, is that I am unable to extend the size of a disk; whenever I try, I get the following error:
[Screenshot: "Connection error - Timeout." error when resizing the disk]


When I try to load the contents of the local-zfs storage from the Proxmox GUI, I get this message:
[Screenshot: "Connection error - Timeout." message]

My disk setup is ZFS (configured by the Proxmox installer ISO) with one Samsung SSD 860 disk. This is our testing machine. What can I do to fix this issue? This has been occurring for a while on this system now and I don't know what to do.

My blessing to everyone.

Code:
$ pvesm status
Name             Type     Status           Total            Used       Available        %
local             dir     active       625009664        40871040       584138624    6.54%
local-zfs     zfspool     active       901275328       317136584       584138744   35.19%

Code:
$ pveperf
CPU BOGOMIPS:      140000.00
REGEX/SECOND:      3973855
HD SIZE:           596.06 GB (rpool/ROOT/pve-1)

Code:
$ free -h
              total        used        free      shared  buff/cache   available
Mem:          125Gi        67Gi        56Gi       1.3Gi       1.6Gi        55Gi
Swap:            0B          0B          0B

Code:
$ pveversion --verbose
proxmox-ve: 6.1-2 (running kernel: 5.0.21-1-pve)
pve-manager: 6.1-7 (running version: 6.1-7/13e58d5e)
pve-kernel-5.3: 6.1-4
pve-kernel-helper: 6.1-4
pve-kernel-5.0: 6.0-11
pve-kernel-4.15: 5.4-8
pve-kernel-5.3.18-1-pve: 5.3.18-1
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-5.3.13-2-pve: 5.3.13-2
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-4-pve: 5.0.21-9
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-5.0.21-2-pve: 5.0.21-7
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.14-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-12
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-4
libpve-storage-perl: 6.1-4
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-19
pve-docs: 6.1-4
pve-edk2-firmware: 2.20191127-1
pve-firewall: 4.0-10
pve-firmware: 3.0-5
pve-ha-manager: 3.0-8
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-5
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1

Code:
$ zpool status
  pool: rpool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 3 days 05:04:01 with 0 errors on Wed Feb 12 05:28:46 2020
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          sda3      ONLINE       0     0     0

errors: No known data errors
 
Hi,
it seems you run into timeouts (probably due to the slow ZFS you mentioned). Maybe related to some I/O going on in the background? zpool iostat -v should give you some info. Also have a look in the journal with journalctl -rb.
How long does zfs list -o name,volsize,origin,type,refquota -t volume,filesystem -Hr take if you run it yourself?
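Put together, those checks could be run roughly like this (just a sketch; run them on the affected node):
Code:
$ zpool iostat -v
$ journalctl -rb
$ time zfs list -o name,volsize,origin,type,refquota -t volume,filesystem -Hr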
 
Dear,

I deeply appreciate the time you have taken to help me out. As requested, here is the information.

This one appears to be alarmingly low. I compared the data with another server with similar specifications and it seemed to have much higher speeds.
Code:
$ zpool iostat -v
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool        335G   593G      8     49   293K  2.50M
  sda3       335G   593G      8     49   293K  2.50M
----------  -----  -----  -----  -----  -----  -----

Occasionally this error shows up in the journal, but otherwise nothing there looks like a problem.
Code:
pveproxy[15609]: got inotify poll request in wrong process - disabling inotify

It didn't take long to get a result from the command.
Code:
$ time zfs list -o name,volsize,origin,type,refquota -t volume,filesystem -Hr

...

real    0m0.038s
user    0m0.009s
sys     0m0.029s

Do you know what could be done to fix this? Thank you again for helping out!

My kindest regards
 
Could you please provide the config of the VM? qm config <VMID>.
Are you able to resize the disk on the command line with qm resize <VMID> <disk> <size>?
E.g. qm resize 100 scsi0 +10G resizes the disk attached to scsi0 of VM 100 by 10G.
 
To Chris,

Thank you for your prompt and friendly response. I have collected the requested information, which is presented below.

Code:
$ qm config 100
agent: 1
boot: c
bootdisk: scsi0
ciuser: root
description: Test
ide2: local-zfs:vm-100-cloudinit,media=cdrom,size=4M
memory: 1024
name: testvm
net0: virtio=16:B6:90:99:F8:0D,bridge=vmbr0,rate=112.5
scsi0: local-zfs:vm-100-disk-1,size=2252M
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=1725beec-7577-4774-9b3b-1bc9401aa138
vga: serial0
vmgenid: f672b5ce-e2ba-4630-bee1-a01a9469fdf8
Code:
$ time qm resize 100 scsi0 +10G
command 'zfs set 'volsize=244109312k' rpool/data/vm-100-disk-1' failed: got timeout

real    2m17.923s
user    0m0.408s
sys     0m0.082s

Best regards.
 
Ok, so the config seems fine; this is odd. It seems that zfs commands executed directly work fine, while going over the PVE API runs into timeouts.
It seems that there is a communication problem. Could you try to check if all the pve services are running correctly? systemctl status "pve*". You could also try to restart the pveproxy and pvedaemon with systemctl restart pveproxy pvedaemon and see if this changes the behaviour.
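For reference, those checks could look roughly like this (a sketch only; the final status call is just an extra sanity check after the restart, not something required):
Code:
$ systemctl status "pve*"
$ systemctl restart pveproxy pvedaemon
$ systemctl status pveproxy pvedaemon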
 
Dear Chris,

Per your thoughtful instructions, I have checked the systemctl status of pve* and everything except the firewall was running. I have also restarted pveproxy and pvedaemon, however the issue persists.

What else could be done to diagnose the connection between ZFS and PVE API?

Kindest wishes.
 
To all,

May I ask whether anyone is aware of how I may test the connection between PVE API and ZFS?

My warmest regards.
 
Something is very strange here. I had a look at the code and cannot see any apparent reason why the zfs set 'volsize=244109312k' rpool/data/vm-100-disk-1 should run into a timeout, except if zfs takes too long.
You can try (at your own risk, make sure you have a backup of your data!) the following:
* Detach the disk from the VM in the WebUI's VM->Hardware tab
* Set the volsize to the correct value with zfs set 'volsize=244109312k' rpool/data/vm-100-disk-1. ATTENTION: make sure the size is correct; first check the current value with zfs get volsize. Never decrease the size, as this could destroy the filesystem on the zvol. (A rough CLI sketch follows after this list.)
* Reattach the disk to the VM, it should now show the correct size.
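A rough CLI sketch of the zfs part of those steps, using the dataset and target size from this thread (verify both against your own setup before running anything, and never set a smaller value):
Code:
$ zfs get volsize rpool/data/vm-100-disk-1
$ zfs set volsize=244109312k rpool/data/vm-100-disk-1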
 
Dear,

An interesting discovery has been made regarding this issue.

Upon checking zfs get volsize, I found out that each time I ran the disk resize function in Proxmox, the resize actually ran on ZFS, but Proxmox never recognised it.

For example, here is the output.

Code:
$ zfs get volsize
rpool/data/vm-100-cloudinit           volsize   4M       local
rpool/data/vm-100-disk-0              volsize   5.62T    local
rpool/data/vm-100-disk-1              volsize   243G     local

So the issue must be that Proxmox never detects the response from ZFS?

How may I fix this?

Kind regards
 
This is indeed a very odd behavior. It seems that the ZFS command gets executed, but either it does not return as expected or its return is not recognized, and in both cases PVE runs into the timeout.
Nevertheless, I am not able to reproduce this behavior on any of my zfs based storages, so it seems related to your setup.
 
My friend Chris,

I wholeheartedly appreciate your support through this strange journey.

Do you know whether it is possible to change the PVE timeout limits, in case ZFS is simply running slowly during the resize?

Kind regards
 
Hi,

Sadly, it's not possible to do this without patching the source code, and since we have a general timeout for API calls, you might run into other problems if you do that.

What you can do as a workaround is to resize the volume manually and then do a qm rescan --vmid <ID> to update the size of the disk in the configuration file.
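With the VM from this thread, the workaround would look roughly like this (the size shown is just the value from the failed resize above; double-check it before running):
Code:
$ zfs set volsize=244109312k rpool/data/vm-100-disk-1
$ qm rescan --vmid 100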
 
Dear Fabian E,

Thank you for providing the workaround. I have come up with a potential hypothesis about this peculiar issue.

Code:
$ pveperf
CPU BOGOMIPS:      139996.40
REGEX/SECOND:      4146804
HD SIZE:           652.57 GB (rpool/ROOT/pve-1)

The pveperf command does not finish even after 5 minutes on a Samsung 860 EVO 1TB with ZFS. Why might this be?

My kindest wishes
 
The next thing pveperf does after printing the HD size is buffered reads, so really nothing special. Are other I/O operations on this drive also hanging/slow? You could test with fio or simply dd. If they are, this might indicate a hardware issue or maybe a ZFS issue.
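A quick read-only sketch of such a test, assuming the pool's disk is /dev/sda as suggested by the zpool status output above (double-check the device node first; both commands only read from the disk):
Code:
$ dd if=/dev/sda of=/dev/null bs=1M count=2048 iflag=direct   # /dev/sda assumed, adjust if different
$ fio --name=seqread --readonly --filename=/dev/sda --rw=read --bs=1M --size=2G --direct=1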

What's the output of zpool status -v and zpool get all? Is there anything that could be related in the syslog or dmesg?
 
Hi together, I encountered the same issue in my lab when creating a mountpoint for an LXC container.
After a bit of testing I noticed that an automatic ZFS scrub was running on the affected zpool. After it was finished, everything worked again just fine! Perhaps this information could be useful to reproduce the problem. I will test if I can reproduce it once more next week.

I also have one related question: because of the testing I now have a load of "unused" subvols. Is there an easy way of listing all subvols that are actually attached to containers?

Thanks!
Alex
 
