Different block size between nodes

orion8888

Hi Everyone,

I have an unusual problem with the ZFS virtual machine block size (zvol).
Proxmox version 7.3

I have a Proxmox cluster consisting of two servers plus a QDevice.
I set up a ZFS datastore called SSD_ZPOOL_1 from the Proxmox web GUI and set its block size to 1024k on both nodes in the cluster. Proxmox added the datastore successfully.
I created a new virtual machine on SSD_ZPOOL_1 and checked the block size of its disk with the command "zfs get volblocksize". The block size was correct: it showed 1M (1024k). Then I set up replication of this virtual machine from node A to node B. After the replication to the second node finished, I checked the block size there and it showed 128k. Now, if I migrate the virtual machine from node A to B, the reverse migration from node B to A throws an error and the machine cannot be migrated back.

I suppose the cause is that the SSD_ZPOOL_1 resource has a block size of 1024k on one node, while after replication the block size on the other node is 128k. But why does the block size change when replicating to the same resource on the second node?

Does anyone know what's going on?
I will add that the Large Block option is enabled for the SSD_ZPOOL_1 pool.

Best Regards
Orion8888
 

Attachments:
  • PX_BUG_REPLICATION_BLOCK_SIZE_1024k_node_B.png
  • PX_BUG_REPLICATION_BLOCK_SIZE_128k_node_A.png

From the `zfsprops` man page (`man zfsprops`):
Code:
     volblocksize          For volumes, specifies the block size of the volume.  The blocksize cannot be changed once the volume has been written, so it should be set at volume creation time.  The default blocksize for volumes is 8
                           Kbytes.  Any power of 2 from 512 bytes to 128 Kbytes is valid.

I'm not sure how you created one with a volblocksize of 1MiB, but at least based on the man page it should not be possible.

Could you explain in detail how you did that?

Were the pools on both machines created with the same `ashift`?
 
Could you explain in detail how you did that?
That can easily be done on a PVE node (pool is using ashift=12):
Code:
root@j3710:~# zfs create -V 1G -o volblocksize=1M VMpool/1Mzvol
root@j3710:~# zfs get volblocksize VMpool/1Mzvol
NAME           PROPERTY      VALUE     SOURCE
VMpool/1Mzvol  volblocksize  1M        -

Would be a terrible idea because of the massive overhead when doing small reads/writes, but still possible ;)
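
To put a rough number on that overhead, here is a minimal sketch of the worst case. It assumes a small guest write forces ZFS to rewrite one whole volblocksize-sized block, and it ignores compression and caching. The function name `amplification` is made up for this illustration.

```shell
#!/bin/sh
# Worst-case write amplification for a small guest write on a zvol.
# Assumption: a partial-block write rewrites one full volblocksize block,
# with no compression and nothing already cached.
amplification() {
    volblocksize_bytes=$1   # e.g. 1048576 for volblocksize=1M
    io_bytes=$2             # e.g. 4096 for a 4K guest write
    echo $(( volblocksize_bytes / io_bytes ))
}

echo "8K zvol, 4K write:   $(amplification 8192 4096)x"
echo "128K zvol, 4K write: $(amplification 131072 4096)x"
echo "1M zvol, 4K write:   $(amplification 1048576 4096)x"
```

Under these assumptions a 4K write to a 1M zvol touches 256 times more data than the guest asked for, compared with 2 times for the 8K default mentioned in the man page excerpt.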
 
It's very easy to do that :)
I just did it from the webgui by entering the appropriate value in the block size field when adding a new datastore.

The question is why, after replication, the block size on the second node is 128k when it should be 1M.
 

Attachment: Create_1024k_block.png

I have a virtual machine to restore which takes up 6TB of disk space after the restore. The machine was backed up with vzdump compressed to zstd format. I've found that restoring and backing up with vzdump with a larger block is much faster.

Restoring a 6TB machine from zstd format to a datastore with a 128k block took over 13 hours and restoring a machine to a datastore with a 1M block took over 8 hours.
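
For comparison, those restore times can be turned into rough average throughput figures. This is only a back-of-the-envelope sketch: it assumes "6TB" means 6 TiB and uses exactly 13 and 8 hours, so the real rates were somewhat lower since both restores took "over" those times. The function name `throughput_mib_s` is made up for this illustration.

```shell
#!/bin/sh
# Average restore throughput in MiB/s, rounded down.
# Assumption: "6TB" is treated as 6 TiB; the thread does not say which unit.
throughput_mib_s() {
    size_tib=$1
    hours=$2
    echo $(( size_tib * 1024 * 1024 / (hours * 3600) ))
}

echo "128k datastore: about $(throughput_mib_s 6 13) MiB/s"
echo "1M datastore:   about $(throughput_mib_s 6 8) MiB/s"
```

That works out to roughly 134 MiB/s versus 218 MiB/s at most.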
 
You should have a look at Proxmox Backup Server. It will be much faster because it can do incremental backups, so you only need to back up the parts of the 6TB disk that have changed since the last backup. You will probably only have to back up a few GB each time instead of the full 6TB.

The question is why when I start replication, after replication on the second node, the block size is 128k and it should be 1M.
Just a guess: replication may be following the manual, which limits volblocksize to 128K, while the web UI and pvesm don't enforce that limit and allow a 1M volblocksize. I bet @mira could easily check that in the source code.
 
I know the Proxmox Backup Server (PBS) solution; I even have it set up in a test environment for learning. From what I remember, the first backup to PBS must also be a full one, and only the following ones are incremental.

Unfortunately, the system I want to transfer runs on the old Proxmox 5.3, where only backup with vzdump is available.
I have already prepared a new production environment (PVE 7.3) on new servers for this system; I just need to transfer it.
In the near future, I plan to introduce PBS as the standard backup for our Proxmox environments.
 
Hi,
I suppose it is the fault that the SSD_ZPOOL_1 resource is set to the block size of 1024k on one node and the block size is set to 128k on the other node after replication, but the question is why the block size is changed when replicating to the same resource on the second node?
I think you'd need to pass the -L option to the zfs send command to generate a stream with blocks larger than 128K, which Proxmox VE currently doesn't do. Feel free to open up a feature request on our bugtracker: https://bugzilla.proxmox.com/

EDIT:
Hmm, it's not actually enough to pass -L:
Code:
root@pve701 ~ # zfs create myzpool/test -V 1G -b 1M
root@pve701 ~ # zfs snap myzpool/test@ohsnap
root@pve701 ~ # zfs send myzpool/test@ohsnap -L -Rpv | zfs recv -F myzpool/test2     
full send of myzpool/test@ohsnap estimated size is 29.1K
total estimated size is 29.1K
root@pve701 ~ # zfs get volblocksize myzpool/test2 
NAME           PROPERTY      VALUE     SOURCE
myzpool/test2  volblocksize  128K      -
root@pve701 ~ # zfs get volblocksize myzpool/test 
NAME          PROPERTY      VALUE     SOURCE
myzpool/test  volblocksize  1M        -
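
A check like the one above could be scripted to run after each replication. The following is only a sketch: `check_volblocksize` is a hypothetical helper, not part of Proxmox VE or ZFS, and it simply compares two values as printed by `zfs get -H -o value volblocksize`.

```shell
#!/bin/sh
# Compare the volblocksize reported for a source zvol and its replica.
# Arguments are the values as printed by `zfs get -H -o value volblocksize`.
# Hypothetical helper for illustration only.
check_volblocksize() {
    src=$1
    dst=$2
    if [ "$src" = "$dst" ]; then
        echo "match: $src"
    else
        echo "mismatch: source=$src target=$dst"
        return 1
    fi
}

# Example with the values from the session above.
check_volblocksize 1M 128K || true
```

On a real system the two arguments would come from the datasets themselves, for example `check_volblocksize "$(zfs get -H -o value volblocksize myzpool/test)" "$(zfs get -H -o value volblocksize myzpool/test2)"`.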
 