Different block size between nodes

orion8888

Hi Everyone,

I have an unusual problem with the block size of ZFS virtual machine volumes (zvols).
Proxmox version: 7.3

I have a Proxmox cluster consisting of two servers plus a QDevice.
I set up a ZFS datastore called SSD_ZPOOL_1 from the Proxmox web GUI and set its block size to 1024k for both nodes in the cluster; Proxmox added the datastore successfully.
I added a new virtual machine to this SSD_ZPOOL_1 storage and checked the block size of the newly created machine with "zfs get volblocksize"; the block size was correct, it showed 1M (1024k). Then I set up replication of this virtual machine from node A to node B. The replication to the second node completed, but when I checked the block size of the machine after replication, it showed 128k. Now, after migrating the virtual machine from node A to B, the reverse migration from node B to A throws an error and it won't migrate back.

I suppose the cause is that the SSD_ZPOOL_1 storage uses a block size of 1024k on one node while, after replication, it is 128k on the other node. But the question is: why does the block size change when replicating to the same storage on the second node?

Anyone know what's going on?
I will add that the large_blocks feature is enabled for the SSD_ZPOOL_1 pool.
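
For reference, roughly the commands I used to check on both nodes (the zvol name here is just an example, yours will differ):
Code:
# on node A, and again on node B after replication
zfs get volblocksize SSD_ZPOOL_1/vm-100-disk-0
# check the large_blocks feature on the pool
zpool get feature@large_blocks SSD_ZPOOL_1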

Best Regards
Orion8888
 

Attachments

  • PX_BUG_REPLICATION_BLOCK_SIZE_1024k_node_B.png
  • PX_BUG_REPLICATION_BLOCK_SIZE_128k_node_A.png
From the `zfsprops` man page (`man zfsprops`):
Code:
     volblocksize          For volumes, specifies the block size of the volume.  The blocksize cannot be changed once the volume has been written, so it should be set at volume
                           creation time.  The default blocksize for volumes is 8 Kbytes.  Any power of 2 from 512 bytes to 128 Kbytes is valid.

I'm not sure how you created one with a volblocksize of 1MiB, but at least based on the man page it should not be possible.

Could you explain in detail how you did that?

Were the pools on both machines created with the same `ashift`?
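
Something like this on each node should show it (pool name taken from your post; if the property reports 0, it was left on auto and the per-vdev value can be read with zdb):
Code:
zpool get ashift SSD_ZPOOL_1
# per-vdev value, in case the pool property reports 0 (auto)
zdb -C SSD_ZPOOL_1 | grep ashift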
 
Could you explain in detail how you did that?
That can easily be done on a PVE node (pool is using ashift=12):
Code:
root@j3710:~# zfs create -V 1G -o volblocksize=1M VMpool/1Mzvol
root@j3710:~# zfs get volblocksize VMpool/1Mzvol
NAME           PROPERTY      VALUE     SOURCE
VMpool/1Mzvol  volblocksize  1M        -

Would be a terrible idea because of the massive overhead when doing small reads/writes, but still possible ;)
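
If you want to see that overhead, here is a rough illustration using the test zvol from above (just to make the write amplification visible, exact numbers will vary):
Code:
# write some data so blocks exist, then overwrite just 4K of one of them
dd if=/dev/urandom of=/dev/zvol/VMpool/1Mzvol bs=1M count=200 oflag=direct
dd if=/dev/urandom of=/dev/zvol/VMpool/1Mzvol bs=4K count=1 seek=100 oflag=direct
# watch how much actually gets written to the pool compared to the 4K requested
zpool iostat -v VMpool 1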
 

I'm not sure how you created one with a volblocksize of 1MiB, but at least based on the man page it should not be possible.

Could you explain in detail how you did that?

Were the pools on both machines created with the same `ashift`?

It's very easy to do that :)
I just did it from the web GUI by entering the value in the Block Size field when adding a new datastore.

The question is why, after replication, the block size on the second node is 128k when it should be 1M.
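
If it helps, this is roughly what the resulting entry in /etc/pve/storage.cfg looks like on my nodes (the content/sparse lines may differ in your setup):
Code:
zfspool: SSD_ZPOOL_1
        pool SSD_ZPOOL_1
        blocksize 1024k
        content images,rootdir
        sparse 1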
 

Attachments

  • Create_1024k_block.png
Would be a terrible idea because of the massive overhead when doing small reads/writes, but still possible ;)

I have a virtual machine to restore which takes up 6TB of disk space after the restore. The machine was backed up with vzdump, compressed with zstd. I've found that restoring and backing up with vzdump is much faster with a larger block size.

Restoring the 6TB machine from the zstd archive to a datastore with a 128k block took over 13 hours, while restoring it to a datastore with a 1M block took over 8 hours.
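
For completeness, both restores were run with something like this (the archive name is just an example); only the target storage differed:
Code:
qmrestore /mnt/backup/vzdump-qemu-100-2023_01_10-01_00_00.vma.zst 100 --storage SSD_ZPOOL_1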
 
I have a virtual machine to restore which takes up 6TB of disk space after the restore. The machine was backed up with vzdump, compressed with zstd. I've found that restoring and backing up with vzdump is much faster with a larger block size.

Restoring the 6TB machine from the zstd archive to a datastore with a 128k block took over 13 hours, while restoring it to a datastore with a 1M block took over 8 hours.
You should have a look at the Proxmox Backup Server. It will be way faster, as it can do incremental backups, so you only need to back up the part of the 6TB disk that has changed since the last backup. You probably only have to back up a few GBs instead of 6TB each time.
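
Roughly what that looks like (server and datastore names are placeholders): once the PBS datastore is added as a storage, normal vzdump backups to it use QEMU dirty bitmaps, so after the first full run only changed blocks get read and uploaded.
Code:
pvesm add pbs pbs-backup --server 192.168.1.50 --datastore store1 --username backup@pbs --fingerprint <fingerprint> --password <password>
vzdump 100 --storage pbs-backup --mode snapshot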

The question is why, after replication, the block size on the second node is 128k when it should be 1M.
Just a guess: maybe replication follows the manual and limits the volblocksize to 128K, while the web UI and pvesm don't care about that limit and allow a 1M volblocksize? I bet @mira could check that easily in the source code.
 
You should have a look at the Proxmox Backup Server. It will be way faster, as it can do incremental backups, so you only need to back up the part of the 6TB disk that has changed since the last backup. You probably only have to back up a few GBs instead of 6TB each time.

I know the Proxmox Backup Server (PBS) solution; I even have it set up in a test environment for learning. From what I remember, the first backup to PBS must still be made in full and only the subsequent ones are incremental.

But unfortunately the system I want to transfer is on the old Proxmox 5.3, where only backup via vzdump is available.
I have already prepared a new production environment (PVE 7.3) on new servers for this system; I just need to transfer it.
In the near future, I plan to introduce PBS as the primary backup for our Proxmox environments.
 
Hi,
I suppose the cause is that the SSD_ZPOOL_1 storage uses a block size of 1024k on one node while, after replication, it is 128k on the other node. But the question is: why does the block size change when replicating to the same storage on the second node?
I think you'd need to pass the -L option to the zfs send command to generate a stream with blocks larger than 128K, which Proxmox VE currently doesn't do. Feel free to open up a feature request on our bugtracker: https://bugzilla.proxmox.com/

EDIT:
Hmm, it's not actually enough to pass -L:
Code:
root@pve701 ~ # zfs create myzpool/test -V 1G -b 1M
root@pve701 ~ # zfs snap myzpool/test@ohsnap
root@pve701 ~ # zfs send myzpool/test@ohsnap -L -Rpv | zfs recv -F myzpool/test2     
full send of myzpool/test@ohsnap estimated size is 29.1K
total estimated size is 29.1K
root@pve701 ~ # zfs get volblocksize myzpool/test2 
NAME           PROPERTY      VALUE     SOURCE
myzpool/test2  volblocksize  128K      -
root@pve701 ~ # zfs get volblocksize myzpool/test 
NAME          PROPERTY      VALUE     SOURCE
myzpool/test  volblocksize  1M        -
 
