ZFS "Connection error - Timeout."

Greetings to all,

I am currently running into a serious issue with ZFS on one of my Proxmox systems. Cloning a VM from a template is extremely slow, although it does complete. The real problem is that I am unable to extend the size of a disk; whenever I try, I get the following error:
[attachment: S91pRY5.png]


When trying to browse the contents of the local-zfs storage in the Proxmox GUI, I get this message:
[attachment: nAlRV04.png]

My disk setup is ZFS (configured by the Proxmox installer ISO) with a single Samsung SSD 860. This is our testing machine. What can I do to fix this issue? It has been occurring on this system for a while now and I am not sure what to do.

My blessing to everyone.

Code:
$ pvesm status
Name             Type     Status           Total            Used       Available        %
local             dir     active       625009664        40871040       584138624    6.54%
local-zfs     zfspool     active       901275328       317136584       584138744   35.19%

Code:
$ pveperf
CPU BOGOMIPS:      140000.00
REGEX/SECOND:      3973855
HD SIZE:           596.06 GB (rpool/ROOT/pve-1)

Code:
$ free -h
              total        used        free      shared  buff/cache   available
Mem:          125Gi        67Gi        56Gi       1.3Gi       1.6Gi        55Gi
Swap:            0B          0B          0B

Code:
$ pveversion --verbose
proxmox-ve: 6.1-2 (running kernel: 5.0.21-1-pve)
pve-manager: 6.1-7 (running version: 6.1-7/13e58d5e)
pve-kernel-5.3: 6.1-4
pve-kernel-helper: 6.1-4
pve-kernel-5.0: 6.0-11
pve-kernel-4.15: 5.4-8
pve-kernel-5.3.18-1-pve: 5.3.18-1
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-5.3.13-2-pve: 5.3.13-2
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-4-pve: 5.0.21-9
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-5.0.21-2-pve: 5.0.21-7
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.14-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-12
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-4
libpve-storage-perl: 6.1-4
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-19
pve-docs: 6.1-4
pve-edk2-firmware: 2.20191127-1
pve-firewall: 4.0-10
pve-firmware: 3.0-5
pve-ha-manager: 3.0-8
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-5
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1

Code:
$ zpool status
  pool: rpool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 3 days 05:04:01 with 0 errors on Wed Feb 12 05:28:46 2020
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          sda3      ONLINE       0     0     0

errors: No known data errors
 
Hi,
it seems you are running into timeouts (probably due to the slow ZFS you mentioned). Maybe some I/O is going on in the background? zpool iostat -v should give you some info. Also have a look at the journal with journalctl -rb.
How long does zfs list -o name,volsize,origin,type,refquota -t volume,filesystem -Hr take if you run it yourself?
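For reference, a minimal sketch of those diagnostics (the 5-second interval on zpool iostat is optional and just refreshes the statistics continuously):
Code:
$ zpool iostat -v rpool 5       # per-vdev I/O statistics, refreshed every 5 seconds
$ journalctl -rb                # journal of the current boot, newest entries first
$ time zfs list -o name,volsize,origin,type,refquota -t volume,filesystem -Hr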
 
Dear,

I deeply appreciate the time you have taken to help me out. As requested, here is the information.

The throughput here appears alarmingly low. I compared it with another server with similar specifications, which showed much higher numbers.
Code:
$ zpool iostat -v
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool        335G   593G      8     49   293K  2.50M
  sda3       335G   593G      8     49   293K  2.50M
----------  -----  -----  -----  -----  -----  -----

Occasionally this error shows up in the journal, but otherwise nothing there looks like a problem.
Code:
pveproxy[15609]: got inotify poll request in wrong process - disabling inotify

It didn't take long to get a result from the command.
Code:
$ time zfs list -o name,volsize,origin,type,refquota -t volume,filesystem -Hr

...

real    0m0.038s
user    0m0.009s
sys     0m0.029s

Do you know what could be done to fix this? Thank you again for helping out!

My kindest regards
 
Could you please provide the config of the VM? qm config <VMID>.
Are you able to resize the disk on the command line with qm resize <VMID> <disk> <size>?
E.g. qm resize 100 scsi0 +10G resizes the disk attached to scsi0 of VM 100 by 10G.
 
To Chris,

Thank you for your prompt and friendly response. I have collected the requested information; it is presented below.

Code:
$ qm config 100
agent: 1
boot: c
bootdisk: scsi0
ciuser: root
description: Test
ide2: local-zfs:vm-100-cloudinit,media=cdrom,size=4M
memory: 1024
name: testvm
net0: virtio=16:B6:90:99:F8:0D,bridge=vmbr0,rate=112.5
scsi0: local-zfs:vm-100-disk-1,size=2252M
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=1725beec-7577-4774-9b3b-1bc9401aa138
vga: serial0
vmgenid: f672b5ce-e2ba-4630-bee1-a01a9469fdf8
Code:
$ time qm resize 100 scsi0 +10G
command 'zfs set 'volsize=244109312k' rpool/data/vm-100-disk-1' failed: got timeout

real    2m17.923s
user    0m0.408s
sys     0m0.082s

Best regards.
 
OK, the config seems fine, so this is odd. It seems that ZFS commands executed directly work fine, while you run into timeouts when going over the PVE API.
It looks like there is a communication problem. Could you check whether all the PVE services are running correctly? systemctl status "pve*". You could also try restarting pveproxy and pvedaemon with systemctl restart pveproxy pvedaemon and see if this changes the behaviour.
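A minimal sketch of those checks (the final journalctl line is an extra suggestion, not mentioned above, to look at the logs of both services afterwards):
Code:
$ systemctl status "pve*"                    # state of all PVE-related units
$ systemctl restart pveproxy pvedaemon       # restart the API proxy and the daemon
$ journalctl -u pveproxy -u pvedaemon -rb    # their logs for the current boot, newest first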
 
Dear Chris,

Per your instructions, I have checked systemctl status "pve*" and everything except the firewall was running. I have also restarted pveproxy and pvedaemon, but the issue persists.

What else could be done to diagnose the connection between ZFS and PVE API?

Kindest wishes.
 
Something is very strange here. I had a look at the code and cannot see any apparent reason why zfs set 'volsize=244109312k' rpool/data/vm-100-disk-1 should run into a timeout, except if zfs takes too long.
You can try the following (at your own risk, make sure you have a backup of your data!); a rough CLI sketch of these steps follows below:
* Detach the disk from the VM in the WebUI's VM->Hardware tab.
* Set the volsize to the correct value with zfs set 'volsize=244109312k' rpool/data/vm-100-disk-1. ATTENTION: make sure the size is correct; check the current value first with zfs get volsize. Never decrease the size, as this could destroy the filesystem on the zvol.
* Reattach the disk to the VM; it should now show the correct size.
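A rough CLI sketch of these steps, using the names from this thread (the qm set calls are meant as command-line equivalents of the WebUI detach/reattach; adjust VMID, storage, and size to your setup):
Code:
$ zfs get -H -o value volsize rpool/data/vm-100-disk-1    # check the current size first
$ qm set 100 --delete scsi0                               # detach the disk; it shows up as unusedX
$ zfs set volsize=244109312k rpool/data/vm-100-disk-1     # never set a value smaller than the current one!
$ qm set 100 --scsi0 local-zfs:vm-100-disk-1              # reattach the existing zvol to scsi0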
 
Dear,

I have made an interesting discovery regarding this issue.

Checking zfs get volsize, I found that each time I ran the disk resize in Proxmox, the resize actually happened on ZFS, but Proxmox never recognised it.

For example, here is the output.

Code:
$ zfs get volsize
rpool/data/vm-100-cloudinit           volsize   4M       local
rpool/data/vm-100-disk-0              volsize   5.62T    local
rpool/data/vm-100-disk-1              volsize   243G     local

So the issue seems to be that Proxmox does not detect any response from ZFS?

How may I fix this?

Kind regards
 
This is indeed very odd behavior. It seems that the ZFS command gets executed, but either it does not return as expected or its return is not recognized, and so PVE runs into the timeout.
Nevertheless, I am not able to reproduce this behavior on any of my zfs based storages, so it seems related to your setup.
 
My friend Chris,

I wholeheartedly appreciate your support through this strange journey.

Do you know whether it is possible to change the PVE timeout limits? That might resolve the issue if ZFS is simply running slowly on the resize.

Kind regards
 
Hi,


sadly it's not possible to do this without patching the source code, and since we have a general timeout for API calls, you might run into other problems if you do that.

What you can do as a workaround is to resize the volume manually and then do a qm rescan --vmid <ID> to update the size of the disk in the configuration file.
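A minimal sketch of that workaround with the disk from this thread (never set a smaller volsize than the current one):
Code:
$ zfs set volsize=244109312k rpool/data/vm-100-disk-1    # resize the zvol manually
$ qm rescan --vmid 100                                   # update the size= entry in the VM config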
 
Dear Fabian E,

Thank you for providing the workaround. I may have found a clue to this peculiar issue.

Code:
$ pveperf
CPU BOGOMIPS:      139996.40
REGEX/SECOND:      4146804
HD SIZE:           652.57 GB (rpool/ROOT/pve-1)

The pveperf command hangs after printing the HD size and does not finish even after 5 minutes, on a Samsung 860 EVO 1TB with ZFS. Why might this be?

My kindest wishes
 
The next thing pveperf does after printing the HD size is buffered reads, so really nothing special. Are other I/O operations on this drive also hanging/slow? You could test with fio or simply dd. If they are, this might indicate a hardware issue or maybe a ZFS issue.
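For example, a simple test could look like this (a rough sketch; /rpool/ddtest and /rpool/fiotest are placeholder files on the pool, and without O_DIRECT the read results may be skewed by the ARC cache):
Code:
$ dd if=/dev/zero of=/rpool/ddtest bs=1M count=1024 conv=fdatasync    # sequential write, flushed at the end
$ dd if=/rpool/ddtest of=/dev/null bs=1M                              # sequential read of the same file
$ fio --name=randread --filename=/rpool/fiotest --size=1G --bs=4k --rw=randread --runtime=30 --time_based
$ rm /rpool/ddtest /rpool/fiotest                                     # clean up the test files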

What's the output of zpool status -v and zpool get all? Is there anything that could be related in the syslog or dmesg?
 
Hi together, I encountered the same issue in my lab when creating a mount point for an LXC container.
After a bit of testing I noticed that an automatic ZFS scrub was running on the affected zpool. After it finished, everything worked fine again! Perhaps this information is useful for reproducing the problem. I will test whether I can reproduce it once more next week.

I also have one related question: because of the testing I now have a load of "unused" subvols. Is there an easy way of listing all subvols that are actually attached to containers?

Thanks!
Alex
 