Speedy ZFS Backups

Gurn_Blanston

Greetings,

For several months, our small PVE environment was limited to 1 Gbit Ethernet, so that was generally the limiting factor for backup performance. Then we picked up some old used Infiniband gear cheap on eBay. When it worked, iperf showed 15+ Gbit of memory-buffer-to-memory-buffer throughput. In practice, however, this never translated into fast VZdump backups. For example, we have a Windows 2012 server with a 100 GB OS zvol and a 1 TB database zvol. A VZdump backup of this server using lzo compression takes 3 hours and results in a roughly 400 GB backup file.

When I study the iostat output, it appears to be mostly idle with brief surges of the expected throughput. Searching through older PVE posts, I found some discussion of using local storage to "stage" the vzdump process, which I tried, but it doesn't seem to do much with a local staging target.

Whether or not I use compression, it still takes about 3 hours. We now have 10 Gbit Ethernet, so I can no longer blame Infiniband.

If I copy this same 400 GB .lzo file from one host to another over NFS (at 10G), I get throughput of around 400 MB/s. This is pretty close to the maximum sequential write performance for the target zpool and the kind of speed I want in backups and recovery!

With ZFS there are some alternative backup mechanisms. I am now experimenting with zfs send/receive, but so far SSH seems to be another bottleneck, keeping us from getting much more than about 150 MB/s of throughput. There are options like mbuffer that sidestep the SSH transport, but they all seem to require synchronized commands on both the sending and receiving servers, making them complicated to automate for a noob such as myself. Testing zfs send/receive with mbuffer gives about 250 MB/s (compressed, not actual), which is better but still nowhere close to a bulk NFS file transfer.
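For reference, here is roughly what the two-ended mbuffer arrangement looks like in my testing (just a sketch: the snapshot name, the target dataset and the port are placeholders, and the receiving command has to be started first):

Code:
# take a snapshot to send (placeholder name)
zfs snapshot zpool_fast/zfs_16k/vm-100-disk-1@backup-test

# on the receiving host, start this first: listen on TCP 9090 and feed the stream into zfs receive
mbuffer -s 128k -m 1G -I 9090 | zfs receive -F nastyzpool/zfs-recv-test

# on the sending host: pipe the snapshot stream through mbuffer straight to the receiver
zfs send zpool_fast/zfs_16k/vm-100-disk-1@backup-test | mbuffer -s 128k -m 1G -O 10.0.0.90:9090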

I might get better zfs send/receive throughput with a 128k recordsize/block size, but NFS seems to like 16k best for bulk file transfers on our system. I may try changing the record/block size to see if it makes a big difference, but I don't expect that it will. The application vendor for the server being backed up requires a 16k block size for their Pervasive-based database, so we are pretty much stuck with it anyway. I suppose we could make a special 128k dataset on the receiving end, which might cut down on IP overhead.
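If we do go the special-dataset route, it would just be a couple of commands on the NAS, something like this (the dataset name here is made up):

Code:
# a separate dataset tuned for big sequential backup files
zfs create -o recordsize=128k -o compression=lz4 nastyzpool/backups-128k
zfs get recordsize,compression nastyzpool/backups-128k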

Proxmox itself uses zfs send/receive to migrate VMs from one host to another, as long as your zpools and datasets have the same names on each host. Furthermore, there is a setting whereby you can do this migration in the clear, without SSH (unsecure). How is PVE doing that? Is there a syntax I can use at the CLI to make at-will zfs copies from a PVE host to a non-PVE Linux host? In this case, the target is running Ubuntu 16.04 LTS but is not part of the PVE cluster itself. It is essentially just a NAS that happens to use ZFS.

There is pve-zsync, which is handy for keeping a few snapshots' worth of replicas, but I believe it too goes over SSH, which seems to be a throughput killer.

Lastly, if any of you know how to make vzdump work at line speed, that would be my preferred option, since there is already a handy backup schedule interface built into PVE. For it to be useful to us, however, it needs to handle a continuous 300 MB/s or more, and I can't make it go much above 100 MB/s, if even that.

Let me summarize what I want to learn:

  • How do I perform zfs send/receive in a single command on a single host and still avoid the penalty of SSH?
  • Can vzdump be made to go faster, closer to line speed, assuming my zpools can handle the uptake?
  • If neither option is optimal for getting the most out of 10G, have any of you found a package that is?
Most Sincerely,

GB
 
are you backing up a running VM, or a stopped one? what kind of source storage are you using? note that vzdump for VMs is not a simple "copy disk image from A to B", so it will always be slower than that.
 
The server I am using as an example is running during backup. It resides on a zpool made of two raidz1 vdevs of four 2 TB Samsung EVO SSDs each. There are three virtual drives on the VM, which are zvols in this pool.
root@pve-2:~# zpool status zpool_fast
  pool: zpool_fast
 state: ONLINE
  scan: none requested
config:

        NAME                                             STATE     READ WRITE CKSUM
        zpool_fast                                       ONLINE       0     0     0
          raidz1-0                                       ONLINE       0     0     0
            ata-Samsung_SSD_850_EVO_2TB_S2HCNXAGB01190Z  ONLINE       0     0     0
            ata-Samsung_SSD_850_EVO_2TB_S2HCNXAGB01225K  ONLINE       0     0     0
            ata-Samsung_SSD_850_EVO_2TB_S2HCNXAGB00581N  ONLINE       0     0     0
            ata-Samsung_SSD_850_EVO_2TB_S2HCNXAGB01184P  ONLINE       0     0     0
          raidz1-1                                       ONLINE       0     0     0
            ata-Samsung_SSD_850_EVO_2TB_S2HCNXAGB01215P  ONLINE       0     0     0
            ata-Samsung_SSD_850_EVO_2TB_S2HCNXAGB01185B  ONLINE       0     0     0
            ata-Samsung_SSD_850_EVO_2TB_S2HCNXAGB00575R  ONLINE       0     0     0
            ata-Samsung_SSD_850_EVO_2TB_S2HCNXAGB01193D  ONLINE       0     0     0
        spares
          ata-Samsung_SSD_850_EVO_2TB_S2HCNXAGB01019H    AVAIL

errors: No known data errors


# zfs list
NAME                              USED   AVAIL  REFER  MOUNTPOINT
zpool_fast/zfs_16k                1.55T  1.45T  25.4K  /zpool_fast/zfs_16k
zpool_fast/zfs_16k/vm-100-disk-1  1.48T  2.48T   450G  -
zpool_fast/zfs_16k/vm-100-disk-2  71.0G  1.45T  43.5G  -
zpool_fast/zfs_16k/vm-100-disk-3   301K  1.45T   301K  -


The backup target is a Precision T7500 running Ubuntu 16.04. It has 192 GB of RAM and is stuffed with six 5 TB 7200 RPM spindles configured thusly:

root@NASty:/home/devl# zpool status
  pool: nastyzpool
 state: ONLINE
  scan: none requested
config:

        NAME                                                   STATE     READ WRITE CKSUM
        nastyzpool                                             ONLINE       0     0     0
          mirror-0                                             ONLINE       0     0     0
            ata-TOSHIBA_HDWE150_46I1KBEFF57D                   ONLINE       0     0     0
            ata-TOSHIBA_HDWE150_46KHKB8YF57D                   ONLINE       0     0     0
          mirror-1                                             ONLINE       0     0     0
            ata-TOSHIBA_HDWE150_Z5O2KH41F57D                   ONLINE       0     0     0
            ata-TOSHIBA_HDWE150_Z5O2KH3SF57D                   ONLINE       0     0     0
          mirror-2                                             ONLINE       0     0     0
            ata-TOSHIBA_HDWE150_66QBK4CKF57D                   ONLINE       0     0     0
            ata-TOSHIBA_HDWE150_66SAK8YLF57D                   ONLINE       0     0     0
        logs
          ata-Samsung_SSD_850_PRO_256GB_S39KNX0J109053N-part1  ONLINE       0     0     0
        cache
          ata-Samsung_SSD_850_PRO_256GB_S39KNX0J109053N-part2  ONLINE       0     0     0

errors: No known data errors


I may add a fourth vdev, but stuffing 8 spindles plus some drives for the OS is a tight fit and I doubt I can talk anyone into purchasing a new chassis. Still, this arrangement can handle sequential writes steadily at almost 500 MB/s. Ignore the slog/cache; I am experimenting and will most likely drop the cache since we have plenty of RAM. We have an Acard RAM-based ZIL device, but it is pretty large, so I am using an SSD instead.

The VZdump backup target is an NFS-shared dataset on this zpool. I have sync disabled, logbias set to throughput, and noatime for this particular NFS share. Here is how PVE sees it:

root@pve-2:~# pveperf /mnt/pve/nfs-16k-throughput
CPU BOGOMIPS: 192032.40
REGEX/SECOND: 1914613
HD SIZE: 3450.60 GB (10.0.0.90:/nastyzpool/nfs-16k-throughput)
FSYNCS/SECOND: 1701.78
DNS EXT: 50.70 ms
DNS INT: 0.79 ms (secretstuff.local)


At any rate, whether the target is local to the host or remote over a network transport, I can never get more than 100 MB/s out of VZdump. Can I perform more than one VZdump backup at a time, since I seem to have excess throughput? I may have tried this before and gotten file locking errors, but I don't remember for sure. I am less afraid of getting an error that causes the backup to abort than I am of NOT getting an abort and having it bring the host to its knees for nine hours while it grinds through the jobs.

Just to prove my statement about local storage, I am doing a backup right now to a local zpool, also made of three mirrored spindle vdevs. Like the NFS-based target, the local zpool seems to take writes at about 1/3 of full speed. Could VZdump do its processing in a RAM disk, maybe? We spent quite a chunk of change on the cluster hardware with the motivation of making things like backups and recovery faster, a lot faster.
I read some discussions about giving VZdump a higher scheduling priority, but I think the ionice settings no longer have any effect, though I forget why. It seems like for every technical issue we solve, there is another one waiting just beyond it.

While writing this screed, the test backup completed. It took 28 minutes and 2 seconds. The zvol sizes add up to 629 GB (much of it empty space), so VZdump reports a throughput of 384 MB/s. Here MB stands for megabogus bytes. The actual size on disk is 136.7 GB, so that is what I am counting as true throughput, which comes out to a feeble trickle of 88 MB/s. I realize that VZdump considers every byte to be equal and actually has to handle even the whitespace, which it seems to do very quickly, but I consider it cheating to count that as actual storage throughput. Just to punish any of you still reading even further, I am including the PVE output of the whole job so you can see what I mean about what I am guessing is the whitespace. I want this whole 28 minute process to take nine minutes. How can I make that happen?

Code:
INFO: starting new backup job: vzdump 102 --compress lzo --storage pve1-zfs-backup --remove 0 --node pve-1 --mode snapshot
INFO: Starting Backup of VM 102 (qemu)
INFO: status = running
INFO: update VM 102: -lock backup
INFO: VM Name: dcfs2
INFO: include disk 'virtio0' 'zpool_fast_128k:vm-102-disk-2' 100G
INFO: exclude disk 'virtio1' 'zpool_fast_128k:vm-102-disk-3' (backup=no)
INFO: include disk 'virtio2' 'zpool_fast_128k:vm-102-disk-1' 500G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/pve1zpool2/backup//dump/vzdump-qemu-102-2017_03_16-12_13_23.vma.lzo'
INFO: started backup task 'a7b6a19a-573c-4dcb-8823-bb9de00f43a4'
INFO: status: 0% (767819776/644245094400), sparse 0% (96894976), duration 3, 255/223 MB/s
INFO: status: 1% (6564478976/644245094400), sparse 0% (216211456), duration 36, 175/172 MB/s
INFO: status: 2% (13014925312/644245094400), sparse 0% (1629220864), duration 63, 238/186 MB/s
INFO: status: 3% (19476774912/644245094400), sparse 0% (1715650560), duration 101, 170/167 MB/s
INFO: status: 4% (26779058176/644245094400), sparse 1% (7034912768), duration 116, 486/132 MB/s
INFO: status: 5% (33480835072/644245094400), sparse 2% (13736689664), duration 121, 1340/0 MB/s
INFO: status: 6% (39880163328/644245094400), sparse 3% (20136017920), duration 126, 1279/0 MB/s
INFO: status: 7% (46303084544/644245094400), sparse 4% (26558939136), duration 131, 1284/0 MB/s
INFO: status: 8% (52497022976/644245094400), sparse 5% (32752877568), duration 137, 1032/0 MB/s
INFO: status: 9% (58898055168/644245094400), sparse 6% (39153909760), duration 142, 1280/0 MB/s
INFO: status: 10% (65187086336/644245094400), sparse 7% (45442940928), duration 147, 1257/0 MB/s
INFO: status: 11% (71397146624/644245094400), sparse 8% (51653001216), duration 152, 1242/0 MB/s
INFO: status: 12% (78430797824/644245094400), sparse 9% (58686652416), duration 158, 1172/0 MB/s
INFO: status: 13% (84843888640/644245094400), sparse 10% (65099743232), duration 163, 1282/0 MB/s
INFO: status: 14% (91231617024/644245094400), sparse 11% (71487471616), duration 168, 1277/0 MB/s
INFO: status: 15% (97615872000/644245094400), sparse 12% (77871726592), duration 173, 1276/0 MB/s
INFO: status: 16% (104029814784/644245094400), sparse 13% (84285669376), duration 178, 1282/0 MB/s
INFO: status: 17% (109524549632/644245094400), sparse 13% (87661473792), duration 192, 392/151 MB/s
INFO: status: 18% (116073824256/644245094400), sparse 13% (88755093504), duration 222, 218/181 MB/s
INFO: status: 19% (122516668416/644245094400), sparse 13% (89561497600), duration 249, 238/208 MB/s
INFO: status: 20% (128940507136/644245094400), sparse 13% (89770745856), duration 282, 194/188 MB/s
INFO: status: 21% (135436566528/644245094400), sparse 14% (90343931904), duration 313, 209/191 MB/s
INFO: status: 22% (141902217216/644245094400), sparse 14% (90675912704), duration 344, 208/197 MB/s
INFO: status: 23% (148330315776/644245094400), sparse 14% (90683244544), duration 372, 229/229 MB/s
INFO: status: 24% (154681475072/644245094400), sparse 14% (90704252928), duration 402, 211/211 MB/s
INFO: status: 25% (161245954048/644245094400), sparse 14% (90747977728), duration 434, 205/203 MB/s
INFO: status: 26% (167517749248/644245094400), sparse 14% (91717914624), duration 461, 232/196 MB/s
INFO: status: 27% (174116438016/644245094400), sparse 14% (92002959360), duration 490, 227/217 MB/s
INFO: status: 28% (180441448448/644245094400), sparse 14% (92771303424), duration 518, 225/198 MB/s
INFO: status: 29% (186937507840/644245094400), sparse 14% (94841249792), duration 546, 232/158 MB/s
INFO: status: 30% (193323335680/644245094400), sparse 14% (95131381760), duration 574, 228/217 MB/s
INFO: status: 31% (199822147584/644245094400), sparse 15% (97007603712), duration 599, 259/184 MB/s
INFO: status: 32% (206304051200/644245094400), sparse 15% (98377969664), duration 629, 216/170 MB/s
INFO: status: 33% (212651868160/644245094400), sparse 15% (99034341376), duration 660, 204/183 MB/s
INFO: status: 34% (219075706880/644245094400), sparse 15% (100108836864), duration 692, 200/167 MB/s
INFO: status: 35% (225488142336/644245094400), sparse 15% (100301127680), duration 725, 194/188 MB/s
INFO: status: 36% (231961395200/644245094400), sparse 15% (101449129984), duration 755, 215/177 MB/s
INFO: status: 37% (238392836096/644245094400), sparse 15% (101538258944), duration 778, 279/275 MB/s
INFO: status: 38% (244911702016/644245094400), sparse 15% (101538467840), duration 801, 283/283 MB/s
INFO: status: 39% (251453374464/644245094400), sparse 15% (101538697216), duration 824, 284/284 MB/s
INFO: status: 40% (257812594688/644245094400), sparse 15% (102070624256), duration 860, 176/161 MB/s
INFO: status: 41% (264160411648/644245094400), sparse 15% (102666985472), duration 908, 132/119 MB/s
INFO: status: 42% (270774304768/644245094400), sparse 16% (103794282496), duration 944, 183/152 MB/s
INFO: status: 43% (277099315200/644245094400), sparse 16% (103794282496), duration 974, 210/210 MB/s
INFO: status: 44% (283694202880/644245094400), sparse 16% (103956340736), duration 1005, 212/207 MB/s
INFO: status: 45% (289951907840/644245094400), sparse 16% (106316496896), duration 1027, 284/177 MB/s
INFO: status: 46% (296560885760/644245094400), sparse 16% (107052527616), duration 1060, 200/177 MB/s
INFO: status: 47% (303056945152/644245094400), sparse 16% (108136714240), duration 1086, 249/208 MB/s
INFO: status: 48% (309321138176/644245094400), sparse 16% (109062623232), duration 1112, 240/205 MB/s
INFO: status: 49% (315703164928/644245094400), sparse 17% (109905002496), duration 1145, 193/167 MB/s
INFO: status: 50% (322180218880/644245094400), sparse 17% (110447050752), duration 1195, 129/118 MB/s
INFO: status: 51% (328581251072/644245094400), sparse 17% (110880829440), duration 1244, 130/121 MB/s
INFO: status: 52% (335126724608/644245094400), sparse 17% (111388708864), duration 1278, 192/177 MB/s
INFO: status: 53% (341459337216/644245094400), sparse 17% (111587766272), duration 1326, 131/127 MB/s
INFO: status: 54% (348004810752/644245094400), sparse 17% (111978987520), duration 1376, 130/123 MB/s
INFO: status: 55% (354367832064/644245094400), sparse 17% (112362356736), duration 1421, 141/132 MB/s
INFO: status: 56% (361193734144/644245094400), sparse 18% (117329674240), duration 1439, 379/103 MB/s
INFO: status: 57% (367359688704/644245094400), sparse 19% (123495628800), duration 1444, 1233/0 MB/s
INFO: status: 58% (374593421312/644245094400), sparse 20% (130729361408), duration 1450, 1205/0 MB/s
INFO: status: 59% (380554641408/644245094400), sparse 21% (136690581504), duration 1455, 1192/0 MB/s
INFO: status: 60% (386642018304/644245094400), sparse 22% (142777958400), duration 1460, 1217/0 MB/s
INFO: status: 61% (394024058880/644245094400), sparse 23% (150159998976), duration 1466, 1230/0 MB/s
INFO: status: 62% (400146890752/644245094400), sparse 24% (156282830848), duration 1471, 1224/0 MB/s
INFO: status: 63% (406166831104/644245094400), sparse 25% (162302771200), duration 1476, 1203/0 MB/s
INFO: status: 64% (413400825856/644245094400), sparse 26% (169536765952), duration 1482, 1205/0 MB/s
INFO: status: 65% (419549741056/644245094400), sparse 27% (175685681152), duration 1487, 1229/0 MB/s
INFO: status: 66% (425581412352/644245094400), sparse 28% (181717352448), duration 1492, 1206/0 MB/s
INFO: status: 67% (431668920320/644245094400), sparse 29% (187804860416), duration 1497, 1217/0 MB/s
INFO: status: 68% (438989946880/644245094400), sparse 30% (195125886976), duration 1503, 1220/0 MB/s
INFO: status: 69% (445051109376/644245094400), sparse 31% (201187049472), duration 1509, 1010/0 MB/s
INFO: status: 70% (451027992576/644245094400), sparse 32% (207163932672), duration 1514, 1195/0 MB/s
INFO: status: 71% (458230988800/644245094400), sparse 33% (214366928896), duration 1520, 1200/0 MB/s
INFO: status: 72% (464245948416/644245094400), sparse 34% (220381888512), duration 1525, 1202/0 MB/s
INFO: status: 73% (470383656960/644245094400), sparse 35% (226519597056), duration 1530, 1227/0 MB/s
INFO: status: 74% (477398761472/644245094400), sparse 36% (233534701568), duration 1536, 1169/0 MB/s
INFO: status: 75% (483477815296/644245094400), sparse 37% (239613755392), duration 1541, 1215/0 MB/s
INFO: status: 76% (490725900288/644245094400), sparse 38% (246861840384), duration 1547, 1208/0 MB/s
INFO: status: 77% (496700293120/644245094400), sparse 39% (252836233216), duration 1552, 1194/0 MB/s
INFO: status: 78% (502760538112/644245094400), sparse 40% (258896478208), duration 1557, 1212/0 MB/s
INFO: status: 79% (510002921472/644245094400), sparse 41% (266138861568), duration 1563, 1207/0 MB/s
INFO: status: 80% (516151181312/644245094400), sparse 42% (272287121408), duration 1568, 1229/0 MB/s
INFO: status: 81% (522092937216/644245094400), sparse 43% (278228877312), duration 1573, 1188/0 MB/s
INFO: status: 82% (529151098880/644245094400), sparse 44% (285287038976), duration 1579, 1176/0 MB/s
INFO: status: 83% (535196336128/644245094400), sparse 45% (291332276224), duration 1584, 1209/0 MB/s
INFO: status: 84% (541169811456/644245094400), sparse 46% (297305751552), duration 1589, 1194/0 MB/s
INFO: status: 85% (548216897536/644245094400), sparse 47% (304352837632), duration 1595, 1174/0 MB/s
INFO: status: 86% (554084139008/644245094400), sparse 48% (310220079104), duration 1600, 1173/0 MB/s
INFO: status: 87% (561206067200/644245094400), sparse 49% (317342007296), duration 1606, 1186/0 MB/s
INFO: status: 88% (567058759680/644245094400), sparse 50% (323194699776), duration 1611, 1170/0 MB/s
INFO: status: 89% (574299176960/644245094400), sparse 51% (330435117056), duration 1617, 1206/0 MB/s
INFO: status: 90% (580025909248/644245094400), sparse 52% (336161849344), duration 1623, 954/0 MB/s
INFO: status: 91% (587149934592/644245094400), sparse 53% (343285874688), duration 1629, 1187/0 MB/s
INFO: status: 92% (592905109504/644245094400), sparse 54% (349041049600), duration 1634, 1151/0 MB/s
INFO: status: 93% (599997546496/644245094400), sparse 55% (356133486592), duration 1640, 1182/0 MB/s
INFO: status: 94% (606020698112/644245094400), sparse 56% (362156638208), duration 1645, 1204/0 MB/s
INFO: status: 95% (613091508224/644245094400), sparse 57% (369227448320), duration 1651, 1178/0 MB/s
INFO: status: 96% (618891182080/644245094400), sparse 58% (375027122176), duration 1656, 1159/0 MB/s
INFO: status: 97% (625802346496/644245094400), sparse 59% (381938286592), duration 1662, 1151/0 MB/s
INFO: status: 98% (631941038080/644245094400), sparse 60% (388076978176), duration 1667, 1227/0 MB/s
INFO: status: 99% (638214930432/644245094400), sparse 61% (394350870528), duration 1672, 1254/0 MB/s
INFO: status: 100% (644245094400/644245094400), sparse 62% (400381026304), duration 1677, 1206/0 MB/s
INFO: transferred 644245 MB in 1677 seconds (384 MB/s)
INFO: archive file size: 136.73GB
INFO: Finished Backup of VM 102 (00:28:02)
INFO: Backup job finished successfully
TASK OK


I still would like to know how PVE migrates VMs "unsecurely" so I can do the same with zfs send/receive. That seems to be the best option remaining that I know about.

PVE-zsync is nice in that you only have to copy over the deltas since the last snapshot, but if you want to use it as a true backup with lots of recovery points, you have to keep all of those snapshots on your production VMs as well as on the replicas. To have a month's worth of recovery points, your production zvols would also carry at least thirty snapshots. If you tell me this is fine and doesn't come at some performance penalty, I will take it under advisement; to my caveman brain, multiplying 30 by the number of virtual disks in our cluster makes my stomach churn. I would wear out a keyboard trying to remove all of those snapshots if I ever had to. Not to mention, there must be some sort of performance penalty, however much smaller than in alternative file systems/volume managers, to keeping lots of snapshots. I know this was strongly discouraged in VMware due to the performance hit, and while much of that is mitigated in ZFS, it still seems like something to avoid.
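(If it ever came to that, I suppose the keyboard could be spared with a one-liner along these lines. This is an untested sketch that destroys every snapshot under a dataset, so review the list it produces before piping it into zfs destroy.)

Code:
# list every snapshot under the dataset (check this output first!), then feed it to zfs destroy
zfs list -H -t snapshot -o name -r zpool_fast/zfs_16k | xargs -n 1 zfs destroy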

I read a post here where some PVE user unlocked what he thought was a secret feature of VZdump: incremental backups. It turned out to be known but deliberately turned off to keep PVE simple. It isn't the PVE authors' intent that VZdump be a complete backup solution. OK, I will accept that judgement, so what are other PVE users doing for backups? In my VMware days I used Veeam, but since we are gradually transforming into a mostly Linux shop, I am looking for a true backup product that fits PVE and, maybe more importantly, ZFS. I would hate to resort to an rsync-based product, and I doubt it would perform better than VZdump. I will Google around in the meantime, but surely someone out there in PVE land has found a solution that doesn't involve creating one from scratch. So again to summarize, my original three questions:

  • How do I perform zfs send/receive in a single command on a single host and still avoid the penalty of SSH?
  • Can vzdump be made to go faster, closer to line speed, assuming my zpools can handle the uptake?
  • If neither option is optimal for getting the most out of 10G, have any of you found a package that is?

Thanks to any of you still reading at this point!

GB


 
...

At any rate, whether the target is local to the host or remote over a network transport, I can never get more than 100 MB/s out of VZdump. Can I perform more than one VZdump backup at a time, since I seem to have excess throughput? I may have tried this before and gotten file locking errors, but I don't remember for sure. I am less afraid of getting an error that causes the backup to abort than I am of NOT getting an abort and having it bring the host to its knees for nine hours while it grinds through the jobs.

the vzdump process takes a lock for the whole node - so only one vzdump process may run at a time.

I read some discussions about giving VZdump a higher scheduling priority, but I think the ionice settings no longer have any effect, though I forget why. It seems like for every technical issue we solve, there is another one waiting just beyond it.

While writing this screed, the test backup completed. It took 28 minutes and 2 seconds. The zvol sizes add up to 629 GB (much of it empty space), so VZdump reports a throughput of 384 MB/s. Here MB stands for megabogus bytes. The actual size on disk is 136.7 GB, so that is what I am counting as true throughput, which comes out to a feeble trickle of 88 MB/s. I realize that VZdump considers every byte to be equal and actually has to handle even the whitespace, which it seems to do very quickly, but I consider it cheating to count that as actual storage throughput. Just to punish any of you still reading even further, I am including the PVE output of the whole job so you can see what I mean about what I am guessing is the whitespace. I want this whole 28 minute process to take nine minutes. How can I make that happen?


if you look at the output, you can see that reading is actually the limiting factor (the "X/Y MB/s" is read/write rate): when not skipping over holes (the lines where you have > 1GB/s read and 0MB/s write), the write rate is equal or almost equal to the read rate. the 136.73GB is what you end up with after compression, and the final rate of 384MB/s refers to how much data was logically transferred out of the VM.

an actual backup where the target is slow looks like this (keep in mind, this is again a very sparse VM disk!):
Code:
...
INFO: started backup task 'cb4168cd-ad66-4e8b-8477-69a836700e4d'
INFO: status: 1% (426115072/34359738368), sparse 0% (203313152), duration 3, 142/74 MB/s
INFO: status: 2% (817692672/34359738368), sparse 0% (296427520), duration 7, 97/74 MB/s
INFO: status: 3% (1296629760/34359738368), sparse 1% (541806592), duration 10, 159/77 MB/s
INFO: status: 16% (5740298240/34359738368), sparse 14% (4956479488), duration 13, 1481/9 MB/s
INFO: status: 25% (8921677824/34359738368), sparse 23% (7978250240), duration 17, 795/39 MB/s
INFO: status: 31% (10924523520/34359738368), sparse 28% (9864531968), duration 20, 667/38 MB/s
INFO: status: 38% (13068664832/34359738368), sparse 34% (11759931392), duration 23, 714/82 MB/s
INFO: status: 44% (15324545024/34359738368), sparse 40% (13808025600), duration 27, 563/51 MB/s
INFO: status: 51% (17552179200/34359738368), sparse 46% (15889768448), duration 30, 742/48 MB/s
INFO: status: 55% (18979356672/34359738368), sparse 49% (17154867200), duration 33, 475/54 MB/s
INFO: status: 57% (19631636480/34359738368), sparse 50% (17512181760), duration 37, 163/73 MB/s
INFO: status: 58% (20165033984/34359738368), sparse 51% (17830518784), duration 40, 177/71 MB/s
INFO: status: 60% (20636631040/34359738368), sparse 52% (18072412160), duration 43, 157/76 MB/s
INFO: status: 61% (21248540672/34359738368), sparse 53% (18481831936), duration 46, 203/67 MB/s
INFO: status: 63% (21826371584/34359738368), sparse 54% (18829721600), duration 49, 192/76 MB/s
INFO: status: 64% (22316646400/34359738368), sparse 55% (19093024768), duration 53, 122/56 MB/s
INFO: status: 66% (22848995328/34359738368), sparse 56% (19350163456), duration 56, 177/91 MB/s
INFO: status: 67% (23120445440/34359738368), sparse 56% (19391524864), duration 59, 90/76 MB/s
INFO: status: 69% (23850647552/34359738368), sparse 57% (19890466816), duration 63, 182/57 MB/s
INFO: status: 83% (28800188416/34359738368), sparse 72% (24824401920), duration 66, 1649/5 MB/s
INFO: status: 97% (33605943296/34359738368), sparse 86% (29630156800), duration 69, 1601/0 MB/s
INFO: status: 100% (34359738368/34359738368), sparse 88% (30383951872), duration 70, 753/0 MB/s
INFO: transferred 34359 MB in 70 seconds (490 MB/s)
INFO: archive file size: 3.71GB


I still would like to know how PVE migrates VMs "unsecurely" so I can do the same with zfs send/receive. That seems to be the best option remaining that I know about.

send-receive is always over SSH in PVE. "unsecure" migration is only relevant for live-migrating currently - instead of tunneling a socket over SSH, we can use plain TCP.

  • How do I perform zfs send/receive in a single command on a single host and still avoid the penalty of SSH?
  • Can vzdump be made to go faster, closer to line speed, assuming my zpools can handle the uptake?
  • If neither option is optimal for getting the most out of 10G, have any of you found a package that is?

zfs send prints the stream to stdout, zfs receive reads it from stdin. you can create your own backup mechanism using zfs if you deem vzdump too slow. if your network is private, you can use a plain socket/netcat/socat/... instead of SSH, or you could try to optimize the SSH parameters (e.g., choose an encryption algorithm that your CPU implements in hardware). you can of course also zfs send into a file and later zfs receive from it - that seems a bit roundabout, but it allows you to use NFS or CIFS as the "transport" ;)
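for illustration, the netcat and file variants could look roughly like this (untested sketch - the snapshot name, port and target dataset are placeholders, and netcat flags differ between the traditional and BSD variants):

Code:
# receiver (start first): listen on a TCP port and pipe the stream into zfs receive
nc -l -p 9000 | zfs receive -F nastyzpool/recv-test
# sender: stream the snapshot over plain TCP, no SSH involved
zfs send zpool_fast/zfs_16k@snap1 | nc 10.0.0.90 9000

# or dump the stream into a file on the NFS mount, then receive from it later on the NAS side
zfs send zpool_fast/zfs_16k@snap1 > /mnt/pve/nfs-16k-throughput/zfs_16k-snap1.zstream
zfs receive -F nastyzpool/recv-test < /nastyzpool/nfs-16k-throughput/zfs_16k-snap1.zstream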

I think you are already at the limit of how fast vzdump can read from your pool. you could compare with the performance when the pool is otherwise completely idle and the VM is not running, to see whether less load improves the situation.
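for a quick comparison you could run a one-off stop mode backup of the same VM while nothing else is touching the pool, e.g. something like this (sketch, reusing the storage name from your log):

Code:
# backs up VM 102 with the guest shut down for the duration, then starts it again
vzdump 102 --mode stop --compress lzo --storage pve1-zfs-backup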
 
Hello Group,

I think I have stumbled onto what seems to be a good solution for us. I resorted to Googling for a solution that uses ZFS features and tried a few of them. The one I really liked is Znapzend. The one complaint I have about it is that I have to remember to substitute "z" for "s", and the cuteness wears thin after a while. It turns out this is not the first mention of znapzend in the PVE forums:

https://forum.proxmox.com/threads/p...e-znapsend-or-sanoid.27663/page-2#post-158706

https://forum.proxmox.com/threads/znapzend-backup-generator-script.33309/

http://www.znapzend.org/
It is a Perl script, but you install it using make, and for that you need a few things, such as a C compiler for its Perl module dependencies, that are not in the PVE image. What makes it fit our needs is that you can select two destinations for your zfs snapshots, and each destination can have its own "progressive thinning" scheme. In our situation, we have three nodes, one of which is pretty much just there for quorum purposes.

PVE-1 - vms are in two datasets - zpool_fast/zfs-16k, and zpool_fast/zfs-128k
- there is a spindle based zpool to accept backups from pve-2 (pve1zpool2/znapzend-backups)

PVE-2 - (vms are in two datasets - zpool_fast/zfs-16k, and zpool_fast/zfs-128k)
- there is a spindle based zpool to accept backups from pve-1 (pve2zpool2/znapzend-backups)

PVE-3q - No storage on this one, just for quorum.

We also have a fourth server running Ubuntu 16.04 with a 15 TB zpool. It happens to be an NFS server as well, but Znapzend doesn't care about NFS; it just needs a host with a zpool. Originally, we were using pve-zsync to make replicas of VMs from the fast SSD zpool on pve-1 to a spindle-based zpool on pve-2 and vice versa. This would be our first line of recovery in the event of losing either a host or the host's SSD zpool. I was going to use the NFS server for VZdump backups of both pve-1 and pve-2, but there are a few drawbacks to this arrangement. For one, it is not as fast as I expected or hoped. For another, each backup is a full backup, so 15 TB doesn't cut it if your nightly backups are a TB or more and you want to maintain recovery points.

So I have been messing around with Znapzend for the past few days and have put it into production. Here is what it does for us:

You configure the backup jobs with the znapzendzetup command, which creates custom properties on your source dataset. This could be (and maybe should be) the root zpool. It is in these properties that the backup config lives. The source keeps 24 hourly snapshots. Destination a is a dataset on the backup server; it keeps the last twenty-four hourly snapshots, plus a daily for 30 days, a monthly for a year, and a yearly for five years. Destination b is the slow zpool on pve-2, and it keeps an hourly snapshot for 24 hours as well. This is what the guy who wrote the script calls "progressive thinning". There is a recursive switch, --recursive, that tells the job to also send/receive the zvols residing in the specified dataset.

Here is one of the two jobs running on host pve-1:

Code:
znapzendzetup create --recursive --mbuffer=/usr/bin/mbuffer --mbuffersize=1G --tsformat='%Y-%m-%d-%H%M%S' SRC '1d=>1h' zpool_fast/zfs_16k DST:a '1d=>1h,30d=>1d,1y=>1m,5y=>1y' root@10.0.0.90:nastyzpool/znapzend-backups-pve1 DST:b '1d=>1h' root@10.0.0.70:pve2zpool2/znapzend-backups

Here is the job in a more readable format using the znapzendzetup export command:

Code:
znapzendzetup export zpool_fast/zfs_16k
dst_a=root@10.0.0.90:nastyzpool/znapzend-backups-pve1
dst_a_plan=1days=>1hours,30days=>1days,1years=>1months,5years=>1years
dst_b=root@10.0.0.70:pve2zpool2/znapzend-backups
dst_b_plan=1days=>1hours
enabled=on
mbuffer=/usr/bin/mbuffer
mbuffer_size=1G
post_znap_cmd=off
pre_znap_cmd=off
recursive=on
src=zpool_fast/zfs_16k
src_plan=1days=>1hours
tsformat=%Y-%m-%d-%H%M%S
zend_delay=0
NOTE: if you have modified your configuration, send a HUP signal
(pkill -HUP znapzend) to your znapzend daemon for it to notice the change.

At long last I am getting our money's worth out of our 10G! I let both pve-1 and pve-2 start backing up at the same time and can see in the PVE web UI that we are finally pushing 700 MB/s and more! Notice that Znapzend uses mbuffer!

[screenshot attachment: upload_2017-3-23_15-55-40.png]

I created four jobs because we have four datasets where production VMs are kept, and I was a little hesitant to apply anything to the root zpool: two jobs on pve-1 and two on pve-2, one for the 16k recordsize dataset and one for the 128k recordsize dataset. These then use a single destination dataset on 10.0.0.90 per host. So I am not sure what happens when two source datasets (16k and 128k) both try to replicate to a single destination dataset. Znapzend doesn't complain (it sends its log output to syslog) and the VM zvols residing in the two datasets are getting backed up, but if I wanted to recover an entire source dataset, e.g. zpool_fast/zfs_128k, I have no idea what I would get. I don't know why it works at all, actually.

I think I am going to redo these four jobs as two jobs at the root zpool level (one job on each host, each with its own dedicated destination zfs dataset on the backup server). That way, regardless of what datasets I create on the production zpools, the whole pools and all of their nested contents get backed up.

Here is a snippet of what the backups look like on the backup server. Notice the snapshots of the root dataset nastyzpool/znapzend-backups-pve1. There are two backup jobs both trying to send different datasets to this single destination. You can't see it from the zfs list, but I am sure it is screwy, with the dataset first being overwritten by the zpool_fast/zfs_128k source and then by the zfs_16k source. It took me a while to come around to thinking in terms of the entire dataset rather than individual vm zvols.

Code:
nastyzpool/znapzend-backups-pve1                                            563G  3.45T    96K  /nastyzpool/znapzend-backups-pve1
nastyzpool/znapzend-backups-pve1@2017-03-23-000000                            8K      -    96K  -
nastyzpool/znapzend-backups-pve1@2017-03-23-010000                            8K      -    96K  -
nastyzpool/znapzend-backups-pve1@2017-03-23-020000                            8K      -    96K  -
nastyzpool/znapzend-backups-pve1@2017-03-23-030000                            8K      -    96K  -
nastyzpool/znapzend-backups-pve1@2017-03-23-040000                            8K      -    96K  -
nastyzpool/znapzend-backups-pve1@2017-03-23-050000                            8K      -    96K  -
nastyzpool/znapzend-backups-pve1@2017-03-23-060000                            8K      -    96K  -
nastyzpool/znapzend-backups-pve1@2017-03-23-070000                            8K      -    96K  -
nastyzpool/znapzend-backups-pve1@2017-03-23-080000                            8K      -    96K  -
nastyzpool/znapzend-backups-pve1@2017-03-23-090000                            8K      -    96K  -
nastyzpool/znapzend-backups-pve1@2017-03-23-100000                            8K      -    96K  -
nastyzpool/znapzend-backups-pve1@2017-03-23-110000                            8K      -    96K  -
nastyzpool/znapzend-backups-pve1@2017-03-23-120000                            8K      -    96K  -
nastyzpool/znapzend-backups-pve1@2017-03-23-130000                            8K      -    96K  -
nastyzpool/znapzend-backups-pve1@2017-03-23-140000                            8K      -    96K  -
nastyzpool/znapzend-backups-pve1@2017-03-23-150000                             0      -    96K  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-1                              172G  3.45T   172G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-1@2017-03-23-000000              8K      -   172G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-1@2017-03-23-010000              8K      -   172G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-1@2017-03-23-020000              8K      -   172G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-1@2017-03-23-030000              8K      -   172G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-1@2017-03-23-040000              8K      -   172G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-1@2017-03-23-050000              8K      -   172G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-1@2017-03-23-060000              8K      -   172G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-1@2017-03-23-070000              8K      -   172G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-1@2017-03-23-080000              8K      -   172G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-1@2017-03-23-090000              8K      -   172G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-1@2017-03-23-100000              8K      -   172G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-1@2017-03-23-110000              8K      -   172G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-1@2017-03-23-120000              8K      -   172G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-1@2017-03-23-130000              8K      -   172G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-1@2017-03-23-140000              8K      -   172G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-1@2017-03-23-150000               0      -   172G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-2                             14.1G  3.45T  14.0G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-2@2017-03-23-000000           2.91M      -  14.0G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-2@2017-03-23-010000           1.68M      -  14.0G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-2@2017-03-23-020000           1.75M      -  14.0G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-2@2017-03-23-030000           1.55M      -  14.0G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-2@2017-03-23-040000           1.43M      -  14.0G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-2@2017-03-23-050000           1.42M      -  14.0G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-2@2017-03-23-060000           1.40M      -  14.0G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-2@2017-03-23-070000           1.60M      -  14.0G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-2@2017-03-23-080000           1.67M      -  14.0G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-2@2017-03-23-090000           1.68M      -  14.0G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-2@2017-03-23-100000           1.46M      -  14.0G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-2@2017-03-23-110000           1.34M      -  14.0G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-2@2017-03-23-120000           1.28M      -  14.0G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-2@2017-03-23-130000           1.30M      -  14.0G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-2@2017-03-23-140000           1.28M      -  14.0G  -
nastyzpool/znapzend-backups-pve1/vm-102-disk-2@2017-03-23-150000


OK, it wasn't my intention to give a complete rundown of Znapzend, but I did want to point out that you need to be careful when backing up multiple hosts to a single backup server with zfs. Try to keep everything segregated, because ZFS will happily let you mush multiple sources into a single destination dataset namespace and you would never know.
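For the record, here is roughly what I have in mind for the root-zpool redo on pve-1 (a sketch adapted from the job shown earlier, not yet in production; pve-2 would get the mirror image of it):

Code:
# one root-level job per host; each host gets its own dedicated destination dataset on the backup server
znapzendzetup create --recursive --mbuffer=/usr/bin/mbuffer --mbuffersize=1G \
  --tsformat='%Y-%m-%d-%H%M%S' \
  SRC '1d=>1h' zpool_fast \
  DST:a '1d=>1h,30d=>1d,1y=>1m,5y=>1y' root@10.0.0.90:nastyzpool/znapzend-backups-pve1 \
  DST:b '1d=>1h' root@10.0.0.70:pve2zpool2/znapzend-backups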

I am going to try and clean this up after hours and I will send another post by way of followup.

Sincerely,

GB
 
Greetings,

I did clean up my backup sources. I am now just using the root zpools where the production VMs live. Each of the two hosts backs up to the other and to its own destination zfs dataset on the backup machine.

I let them both run at the same time, and now I have a good sense of what our system can handle if really pushed. From the VMs' point of view, everything was just fine while the spindles were thrashing their hearts out. Here is the network graph from one of the PVE hosts.

[screenshot attachment: upload_2017-3-28_12-44-7.png]

Notice that at one point there were 500 MB/s going out and 500 MB/s coming in! Once the initial sync completed, the load drops to a brief trickle every hour, as long as nothing is going on with production, that is. Here is yesterday's history during production; you can see that every hour there is a spike.

[screenshot attachment: upload_2017-3-28_12-49-58.png]

This is from pve-1, which is the backup target for pve-2, where a virtual Windows 2012R2 server gets a nightly series of downloads into a Market database. Our system seems to be keeping up so far, but I am surprised there is this much activity; the downloads are not all that huge, a few hundred MB.

I performed a test recovery of a clone I made of a Windows 2016 server. The process goes like this:

  • Create a new VM with the same virtual hardware and same storage type, i.e., VIRTIO, IDE, etc.
  • From the backup server use zfs send/receive to overwrite the disks you put in your new VM. Here is an example:
    zfs send nastyzpool/znapzend/zpool_fast-pve1/zfs_128k/vm-200-disk-1@2017-03-24-140000 | ssh root@10.0.0.60 "zfs receive -F pve1zpool2/zfs_recovery/vm-401-disk-1"
  • Run this on the host where the new vm lives: qm rescan
That is it. It just starts up. Be careful to keep all the settings, such as SPICE; I had to power it off and set the graphics back to SPICE because I forgot and left the video at the default. Note that you need the -F switch because you are overwriting existing (empty) zvols.

While this worked fine, sending the empty 1 TB data virtual disk, vm-200-disk-2, took many hours; the transfer rate seemed to hover around 1 MB/s, compared to the 300 MB/s that vm-200-disk-1 enjoyed. I Ctrl-C'd the first attempt thinking it was some sort of fluke, but alas, it was still extremely slow. If anyone has any notion as to why, I would love to hear about it. My restoration target was a non-backed-up zpool on pve-1, so I don't think it was the ongoing hourly backup interfering.

I am going to try another recovery and this time a more real-life VM with real content.

Sincerely,

GB
 
So I made a fresh Win2016 VM with a 500 GB OS and data virtual disk, then robocopied a bunch of real production files over to it: about 155 GB across 515,000 files, with an average file size of about 300 KB. Both the source VM and the target were on the same host, which enjoys a 10G virtual NIC using VIRTIO. File transfer performance was disappointing, about 20 MB/s. Tiny files make bulk transfers slow, but that slow? I then tried copying some larger ISO files and got about 111 MB/s over 16.5 GB, better but not 10G better! Anyway, this is an aside.

Now that I had a real-life-ish test server to back up and recover, I let the hourly backup do its thing, and once the snapshot was safely parked on the backup server I disabled the backups on the hosts and made a fresh VM with the same virtual hardware details as the original. Then, from the backup server, I zfs sent/received the backed-up zvol snapshot over to the host with the new recovery VM, overwriting its virtual disk zvol.

This time performance was better than 1 MB/s, but not what I was looking for: it averaged 70 MB/s and took a bit over 30 minutes for a zvol of about 148 GB. I expected it to take about 7 minutes. Thinking this was somehow limited by the target zpool (three mirrored spindle vdevs), I tried sending once more, this time to an all-SSD zpool. Performance didn't really improve. A single 7200 RPM spindle could deliver this kind of throughput! I have six spindles in three mirrored vdevs on the backup server and expected at least three times this performance. Could it be that there is something about sending zvols that is more cumbersome and thus produces poorer throughput? Would it work better to send the parent zfs dataset of the zvol? I suppose I could test that over the weekend.
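If I do test sending the parent dataset instead of an individual zvol, I believe the replication-stream form would look something like this (a sketch only; the snapshot name and the destination dataset are made up, and -R pulls in everything nested under the source):

Code:
# recursive snapshot of the parent dataset, then send the whole tree as one replication stream
zfs snapshot -r zpool_fast/zfs_128k@recovery-test
zfs send -R zpool_fast/zfs_128k@recovery-test | ssh root@10.0.0.90 "zfs receive -F nastyzpool/recv-test-128k"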

SSH is a bottleneck for zfs send/receive transfers, but not to this extent. I recall seeing over 250 MB/s using zfs send/receive with SSH as the transport. 70 MB/s is just plain slow, and recovering a 2 TB or larger server would take 8 white-knuckled hours! There has to be a faster way.
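One thing I have not tried yet is forcing a cheaper cipher on the SSH pipe. Something like this is on my list (a sketch based on the recovery command above; whether aes128-gcm@openssh.com is available depends on the OpenSSH builds on both ends):

Code:
# same recovery stream as before, but with a lighter AEAD cipher and ssh compression off
zfs send nastyzpool/znapzend/zpool_fast-pve1/zfs_128k/vm-200-disk-1@2017-03-24-140000 | \
  ssh -c aes128-gcm@openssh.com -o Compression=no root@10.0.0.60 \
  "zfs receive -F pve1zpool2/zfs_recovery/vm-401-disk-1"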

Has anyone gone through this exercise and found a faster way? It seems our system can handle 300 to 400 MB/s throughput if it is pushed. How does one make it work harder for a smaller task like recovering a single VM?
 
Hello Gurn,

Your post is interesting, at least ;) I am not an old Proxmox user, but I know a little about ZFS (not so much). I can tell you this:
- zfs send/receive performance depends on a lot of things
- it is better to do zfs send/receive with the same record/block size on both systems (source and destination hosts)
- it also depends on whether the source and destination hosts have the same zfs layout (for example, source is a zfs mirror and destination is raidz1)
- if the zvols on the source have a small block size (8k, which Proxmox uses by default without any option - a wrong decision if you ask me), the speed cannot be the best, because you need to read X blocks of 8k instead of fewer blocks of, say, 64k on the source side - and most of the time these blocks are not sequential, so your hdd reads block X and then the head has to wait one rotation of the spindle to read the next sector (this is IOPS)
- the space already allocated in the pool can also have a big impact (70% full is the yellow zone, 80% is red)
- the age of your pools can be a limiting factor (an old pool has more fragmentation than a new pool holding the same data)
- the load on the source/destination can have a big impact


Some bad ideas ;)
- source and destination hosts should have the same topology (zpool layout, number of disks, and so on)
- use as big a sector/block size as you can (and if you can), > 16k (32 or 64, for example) - and the same inside the guests (also if you can)
- use datasets if possible (128k is a very good option compared with, say, 16k for a zvol)
- but remember, nothing is perfect "like me" (just kidding ;)), so a big block/sector size becomes a problem when you have many tiny files (smaller than the sector/block size)
- in some situations you need a fixed sector/block size (mostly for databases)

Any zfs setup with any raidzX (X=1, 2, 3) has about the same IOPS as a single disk...


Remember, I am not a ZFS guru, and the same is true of my bad English ;)

I truly hope that my bad post gives you some ideas, and maybe some answers to your questions. But I know this for sure: ZFS is huge and complicated stuff, like an iceberg, and Proxmox only shows you the tip. Sometimes the tip is wonderful, but in many cases it can be very nasty ;)

Have a nice zfs setup, everyone :)
 
