Replication seems to think thinly provisioned ZFS volume is thickly provisioned

Chazz

Boy, I did NOT know how to write a good title for this. Something weird is going on, or I'm losing my marbles.

Two of my virtual machines have thinly provisioned ZFS volumes of 200GB each, but whenever I migrate them between nodes, the process takes a long time and appears to replicate the full 200GB per disk (400GB across the two VMs). When I say "appears", that's how it looks in the progress window and from how long it takes. Making things weirder, I don't see corresponding disk usage on either node showing that much data. It's almost as if it's replicating the full 200GB into "thin air".

Bash:
zpc/pve/vm-111-disk-0      40.8G  4.27T     40.8G  -
zpc/pve/vm-113-disk-0      8.46G  4.27T     8.46G  -

Here are the properties for VM 111. It's basically the same for 113:

Bash:
root@host:~# zfs get all zpc/pve/vm-111-disk-0
NAME                   PROPERTY              VALUE                  SOURCE
zpc/pve/vm-111-disk-0  type                  volume                 -
zpc/pve/vm-111-disk-0  creation              Mon May 15 22:54 2023  -
zpc/pve/vm-111-disk-0  used                  40.8G                  -
zpc/pve/vm-111-disk-0  available             4.27T                  -
zpc/pve/vm-111-disk-0  referenced            40.8G                  -
zpc/pve/vm-111-disk-0  compressratio         1.75x                  -
zpc/pve/vm-111-disk-0  reservation           none                   default
zpc/pve/vm-111-disk-0  volsize               200G                   local
zpc/pve/vm-111-disk-0  volblocksize          8K                     default
zpc/pve/vm-111-disk-0  checksum              on                     default
zpc/pve/vm-111-disk-0  compression           lz4                    inherited from zpc
zpc/pve/vm-111-disk-0  readonly              off                    default
zpc/pve/vm-111-disk-0  createtxg             1449614                -
zpc/pve/vm-111-disk-0  copies                1                      default
zpc/pve/vm-111-disk-0  refreservation        none                   default
zpc/pve/vm-111-disk-0  guid                  10072052703899627452   -
zpc/pve/vm-111-disk-0  primarycache          all                    default
zpc/pve/vm-111-disk-0  secondarycache        all                    default
zpc/pve/vm-111-disk-0  usedbysnapshots       0B                     -
zpc/pve/vm-111-disk-0  usedbydataset         40.8G                  -
zpc/pve/vm-111-disk-0  usedbychildren        0B                     -
zpc/pve/vm-111-disk-0  usedbyrefreservation  0B                     -
zpc/pve/vm-111-disk-0  logbias               latency                default
zpc/pve/vm-111-disk-0  objsetid              2315                   -
zpc/pve/vm-111-disk-0  dedup                 off                    default
zpc/pve/vm-111-disk-0  mlslabel              none                   default
zpc/pve/vm-111-disk-0  sync                  standard               default
zpc/pve/vm-111-disk-0  refcompressratio      1.75x                  -
zpc/pve/vm-111-disk-0  written               40.8G                  -
zpc/pve/vm-111-disk-0  logicalused           53.4G                  -
zpc/pve/vm-111-disk-0  logicalreferenced     53.4G                  -
zpc/pve/vm-111-disk-0  volmode               default                default
zpc/pve/vm-111-disk-0  snapshot_limit        none                   default
zpc/pve/vm-111-disk-0  snapshot_count        none                   default
zpc/pve/vm-111-disk-0  snapdev               hidden                 default
zpc/pve/vm-111-disk-0  context               none                   default
zpc/pve/vm-111-disk-0  fscontext             none                   default
zpc/pve/vm-111-disk-0  defcontext            none                   default
zpc/pve/vm-111-disk-0  rootcontext           none                   default
zpc/pve/vm-111-disk-0  redundant_metadata    all                    default
zpc/pve/vm-111-disk-0  encryption            off                    default
zpc/pve/vm-111-disk-0  keylocation           none                   default
zpc/pve/vm-111-disk-0  keyformat             none                   default
zpc/pve/vm-111-disk-0  pbkdf2iters           0                      default
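
Side note for anyone reading along: instead of dumping everything with zfs get all, you can pull just the properties that matter here. The gap between volsize and used, plus refreservation being none, is what makes these volumes thin. This is just a convenience sketch; the dataset names are from my pool.

Bash:
# Only the properties relevant to thin provisioning; a thin zvol shows
# refreservation=none and "used" well below "volsize".
zfs get volsize,used,refreservation,logicalused zpc/pve/vm-111-disk-0 zpc/pve/vm-113-disk-0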

Below is the used storage. If both of those drives really were a full 200GB each, I'd be well into the 600GB+ range.

[Screenshot: storage usage on the nodes]

When these drives were originally created, I think they were thick provisioned. To correct that, I followed the instructions here: https://forum.proxmox.com/threads/zfs-enable-thin-provisioning.41549/. Everything seems fine except for the replication.
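
For reference, what that thread boils down to (the storage and dataset names below are from my setup, so double-check before copying): mark the zfspool storage as sparse so new disks are created thin, and clear the refreservation on existing zvols that were created thick.

Bash:
# Make the storage create new zvols thin-provisioned. If pvesm doesn't accept
# this option, add "sparse 1" to the zfspool entry in /etc/pve/storage.cfg instead.
pvesm set local-zpc --sparse 1

# Existing thick zvols keep their reservation; clearing it makes them thin.
zfs set refreservation=none zpc/pve/vm-111-disk-0
zfs set refreservation=none zpc/pve/vm-113-disk-0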

Does this make sense? Am I off my rocker?
 
So to recap:
You have two machines, each with a ZFS pool and replication enabled, and when you migrate a VM, the whole underlying disk is copied instead of just the difference?
 
Actually, take a look at that replication log: do you eventually notice a sudden, massive increase in read speed and a large drop in written data, possibly even to 0?
It still has to check all the underlying blocks, but it skips the ones that aren't allocated; on the surface it looks like it's copying the full size of the VM, but it isn't really. [I notice this a lot when I run backups.]

Example:
[Screenshot: backup log showing the read speed increase]
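
If you want to know ahead of time how much data a ZFS replication run will actually transfer, a dry-run send prints the estimated stream size without sending anything. Sketch only; the dataset and snapshot names are placeholders matching the ones used elsewhere in this thread.

Bash:
# Dry-run (-n), verbose (-v) full send: for a thin zvol the estimate tracks
# allocated data, not the 200G volsize.
zfs send -nv zpc/pve/vm-113-disk-0@__replicate_113-0_1684357308__

# Same idea for the incremental send between two replication snapshots.
zfs send -nv -i @__replicate_113-0_1684357275__ zpc/pve/vm-113-disk-0@__replicate_113-0_1684357308__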
 
So to recap:
You have two machines, each with a ZFS pool and replication enabled, and when you migrate a VM, the whole underlying disk is copied instead of just the difference?
Yes, except that the result is not the full disk. Imagine replication taking the time it would take to copy 200GB, but in the end the result is only 8GB. That's what seems to be happening.

Actually, take a look at that replication log: do you eventually notice a sudden, massive increase in read speed and a large drop in written data, possibly even to 0?
It still has to check all the underlying blocks, but it skips the ones that aren't allocated; on the surface it looks like it's copying the full size of the VM, but it isn't really. [I notice this a lot when I run backups.]


I'll run this now and report back in a bit. Thanks for taking the time.
 
Given your reply, I decided to first check whether this was user error, so I looked at the replication settings for these VMs. To my surprise, replication was not enabled, so when I was migrating them between machines, I'm guessing there was no replica on the other node to send a "diff" against. Once replication was configured and allowed to complete, I attempted migrating to my second node, which took less than a minute. So I think I'm squared away. Thanks for your time.

Code:
2023-05-17 14:01:48 use dedicated network address for sending migration traffic (192.168.20.2)
2023-05-17 14:01:48 starting migration of VM 113 to node **second node**
2023-05-17 14:01:48 found local, replicated disk 'local-zpc:vm-113-disk-0' (in current VM config)
2023-05-17 14:01:48 scsi0: start tracking writes using block-dirty-bitmap 'repl_scsi0'
2023-05-17 14:01:48 replicating disk images
2023-05-17 14:01:48 start replication job
2023-05-17 14:01:48 guest => VM 113, running => 870790
2023-05-17 14:01:48 volumes => local-zpc:vm-113-disk-0
2023-05-17 14:01:50 create snapshot '__replicate_113-0_1684357308__' on local-zpc:vm-113-disk-0
2023-05-17 14:01:50 using secure transmission, rate limit: none
2023-05-17 14:01:50 incremental sync 'local-zpc:vm-113-disk-0' (__replicate_113-0_1684357275__ => __replicate_113-0_1684357308__)
2023-05-17 14:01:52 send from @__replicate_113-0_1684357275__ to zpc/pve/vm-113-disk-0@__replicate_113-0_1684357308__ estimated size is 4.02M
2023-05-17 14:01:52 total estimated size is 4.02M
2023-05-17 14:01:53 successfully imported 'local-zpc:vm-113-disk-0'
2023-05-17 14:01:53 delete previous replication snapshot '__replicate_113-0_1684357275__' on local-zpc:vm-113-disk-0
2023-05-17 14:01:54 (remote_finalize_local_job) delete stale replication snapshot '__replicate_113-0_1684357275__' on local-zpc:vm-113-disk-0
2023-05-17 14:01:54 end replication job
2023-05-17 14:01:54 starting VM 113 on remote node '**second node**'
2023-05-17 14:02:01 volume 'local-zpc:vm-113-disk-0' is 'local-zpc:vm-113-disk-0' on the target
2023-05-17 14:02:01 start remote tunnel
2023-05-17 14:02:02 ssh tunnel ver 1
2023-05-17 14:02:02 starting storage migration
2023-05-17 14:02:02 scsi0: start migration to nbd:unix:/run/qemu-server/113_nbd.migrate:exportname=drive-scsi0
drive mirror re-using dirty bitmap 'repl_scsi0'
drive mirror is starting for drive-scsi0
drive-scsi0: transferred 448.0 KiB of 1.9 MiB (22.58%) in 0s
drive-scsi0: transferred 1.9 MiB of 1.9 MiB (100.00%) in 1s, ready
all 'mirror' jobs are ready
2023-05-17 14:02:03 starting online/live migration on unix:/run/qemu-server/113.migrate
2023-05-17 14:02:03 set migration capabilities
2023-05-17 14:02:03 migration downtime limit: 100 ms
2023-05-17 14:02:03 migration cachesize: 1.0 GiB
2023-05-17 14:02:03 set migration parameters
2023-05-17 14:02:03 start migrate command to unix:/run/qemu-server/113.migrate
2023-05-17 14:02:04 migration active, transferred 153.4 MiB of 6.0 GiB VM-state, 179.7 MiB/s
2023-05-17 14:02:05 migration active, transferred 363.2 MiB of 6.0 GiB VM-state, 200.1 MiB/s
2023-05-17 14:02:06 migration active, transferred 714.7 MiB of 6.0 GiB VM-state, 351.0 MiB/s
2023-05-17 14:02:07 migration active, transferred 1.0 GiB of 6.0 GiB VM-state, 376.3 MiB/s
2023-05-17 14:02:08 migration active, transferred 1.4 GiB of 6.0 GiB VM-state, 357.2 MiB/s
2023-05-17 14:02:09 migration active, transferred 1.8 GiB of 6.0 GiB VM-state, 418.3 MiB/s
2023-05-17 14:02:10 migration active, transferred 2.2 GiB of 6.0 GiB VM-state, 373.9 MiB/s
2023-05-17 14:02:11 migration active, transferred 2.5 GiB of 6.0 GiB VM-state, 412.7 MiB/s
2023-05-17 14:02:12 migration active, transferred 2.9 GiB of 6.0 GiB VM-state, 403.9 MiB/s
2023-05-17 14:02:13 migration active, transferred 3.3 GiB of 6.0 GiB VM-state, 411.1 MiB/s
2023-05-17 14:02:14 migration active, transferred 3.7 GiB of 6.0 GiB VM-state, 418.3 MiB/s
2023-05-17 14:02:15 migration active, transferred 4.1 GiB of 6.0 GiB VM-state, 444.2 MiB/s
2023-05-17 14:02:17 migration active, transferred 4.7 GiB of 6.0 GiB VM-state, 376.3 MiB/s
2023-05-17 14:02:17 average migration speed: 440.1 MiB/s - downtime 202 ms
2023-05-17 14:02:17 migration status: completed
all 'mirror' jobs are ready
drive-scsi0: Completing block job_id...
drive-scsi0: Completed successfully.
drive-scsi0: mirror-job finished
2023-05-17 14:02:18 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=hostalias' root@192.168.20.2 pvesr set-state 113 \''{"local/lsnhg":{"last_iteration":1684357308,"storeid_list":["local-zpc"],"duration":6.392624,"last_node":"lsnhg","last_try":1684357308,"last_sync":1684357308,"fail_count":0}}'\'
2023-05-17 14:02:20 stopping NBD storage migration server on target.
2023-05-17 14:02:26 migration finished successfully (duration 00:00:39)
TASK OK
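
For anyone landing here with the same issue: the replication job can be set up per VM in the GUI (Replication -> Add) or on the CLI with pvesr. A rough sketch, where the job id, node name and schedule are just examples:

Bash:
# Create replication job "113-0" (<vmid>-<job number>) for VM 113 to the node
# named "secondnode", syncing every 15 minutes.
pvesr create-local-job 113-0 secondnode --schedule '*/15'

# Check that the job has run and the last sync succeeded.
pvesr status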
 
