Proxmox with StarWind VSA as HA iSCSI storage

stevehughes

Regarding the references to Proxmox not supporting VM snapshots with iSCSI - is this just with its own iSCSI target? We have been using StarWind VSA with VMware for years and are looking to switch to Proxmox. I'm planning to continue using StarWind as the iSCSI target. Does the loss of snapshot capability apply here also? That would require a serious rethink of my plans.
 
Snapshots are not supported with iSCSI in general, whichever solution you choose. You can find a detailed table of supported storage types and their features here: https://pve.proxmox.com/wiki/Storage
If your storage supports HA (StarWind does), you will get HA, Live Migration, and PBS backups of your iSCSI-based storage. VM-level snapshots won't be supported, unfortunately. There is an option to go with a cluster-aware file system (like GFS2 or OCFS2), but there is no native support from Proxmox.
https://forum.proxmox.com/threads/is-there-any-way-that-iscsi-can-offer-snapshots-ha.112378/
 
The iSCSI protocol has no concept of snapshots. It's literally SCSI commands sent over TCP/IP.

Anyone can implement an iSCSI target (Microsoft, Linux, StarWind, Blockbridge). Every one of them implements snapshots internally in their storage. How that snapshot is initiated and accessed is different for each vendor. Most implement some sort of out-of-band management API. This API has nothing to do with iSCSI.

So in order to have native Proxmox snapshot support, one either needs to use an additional layer on top of iSCSI (i.e. a cluster-aware filesystem with qcow2) or use a storage vendor who provides integrated support for Proxmox that enables native snapshots.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thanks for the replies. How would we go about making consistent point-in-time backups of VMs running on StarWind iSCSI storage using PBS?
 
Thanks for the replies. How would we go about making consistent point-in-time backups of VMs running on StarWind iSCSI storage using PBS?
Fortunately for you (but arguably not for all), PBS does not take advantage of storage snapshots at all. If you use the "backup/snapshot" option, PBS utilizes QEMU's snapshot functionality. It essentially inserts a filter inline with writes during the backup, which blocks new writes to areas that are being backed up until that sector is backed up (on demand). There are obvious drawbacks to this approach, so the PVE developers are working on a new technology, "backup fleecing", but that is some time away from prime time.
In summary - you can already get consistent backups today with any type of storage via PBS, but you need to size your backup environment appropriately.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thanks bbgeek. I have a lot to get my brain around before committing to a new architecture.

I understand that iSCSI itself has no concept of snapshots. The process you describe via QEMU looks very similar to what ESXi does, in that it doesn't rely on snapshot capability in the storage, but rather it diverts writes to a separate file on whatever storage it is using.

Apart from achieving consistency of backups, the other scenario where we would use a snapshot is when making an update to a VM, so that we can quickly roll back if something goes awry. Can Proxmox handle this using the QEMU mechanism you describe above, without requiring the storage itself to be snapshot capable?
 
QEMU looks very similar to what ESXi does, in that it doesn't rely on snapshot capability in the storage, but rather it diverts writes to a separate file on whatever storage it is using.
It looks similar on the surface, but the underlying technology is completely different. Specifically, ESXi has VMFS, which is a specialized cluster-aware filesystem. The data in ESXi (in 99% of cases) is stored as files (vmdk). These can be roughly compared to qcow2. So the snapshot technologies can be somewhat compared between qcow2 and vmdk.

However, QEMU fleecing is different. I don't think it's meant to be a long-term type of snapshot, nor does it have the ability to hold multiple snapshots. Admittedly, I have not studied the design docs in detail and could be mistaken here.
Apart from achieving consistency of backups, the other scenario where we would use a snapshot is when making an update to a VM, so that we can quickly roll back if something goes awry. Can Proxmox handle this using the QEMU mechanism you describe above, without requiring the storage itself to be snapshot capable?
No, see above. The special backup integration is not meant to be a long-term, repeatable snapshot, nor does it have roll-back capability. The fleecing tech has one very specific use case: backups.

It seems like you are looking for standard snapshot functionality. In this case you need to use storage that is capable of it.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
PBS utilizes QEMU's snapshot functionality. It essentially inserts a filter inline with writes during the backup, which blocks new writes to areas that are being backed up until that sector is backed up (on demand).

I've been looking up QEMU resources to understand better how this mechanism works and I pretty much struck out - the official QEMU wiki dates to 2016. My understanding is that by 'blocking writes' you mean that QEMU directs the writes to a separate 'difference' file (I expect that 'file' is probably not the correct term for storage that is managed by LVM). Nothing weird or scary about that.

However I can't figure out what happens at the conclusion of the backup. ESXi merges the difference file back into the primary file (which does have its own issues). What does QEMU do at this point? Does it leave the difference file in place forever (over time this could result in a huge number of difference files), or does it merge the difference file back into the primary file?
 
I've been looking up QEMU resources to understand better how this mechanism works and I pretty much struck out - the official QEMU wiki dates to 2016. My understanding is that by 'blocking writes' you mean that QEMU directs the writes to a separate 'difference' file (I expect that 'file' is probably not the correct term for storage that is managed by LVM). Nothing weird or scary about that.

no, it actually inserts a copy-before-write filter, and if the guest writes to an area that hasn't yet been backed up, that write is stalled while the original data is copied to the backup "device".
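
To make that concrete, here is a rough Python sketch of what a copy-before-write filter does conceptually. This is not QEMU code; the classes and names are made up purely for illustration.

```python
# Conceptual model of a copy-before-write (CBW) filter - not QEMU code,
# just an illustration of the behaviour described above.

class Image:
    """Stand-in for a disk image or backup archive: a block -> bytes map."""
    def __init__(self):
        self.blocks = {}

    def read(self, block):
        return self.blocks.get(block, b"\x00")

    def write(self, block, data):
        self.blocks[block] = data


class CopyBeforeWriteFilter:
    def __init__(self, disk, backup_target, num_blocks):
        self.disk = disk                    # the VM's real disk image
        self.backup_target = backup_target  # wherever the backup data ends up
        self.backed_up = [False] * num_blocks

    def backup_read(self, block):
        # called by the backup job, in whatever order it likes
        self.backup_target.write(block, self.disk.read(block))
        self.backed_up[block] = True

    def guest_write(self, block, data):
        # called for every guest write while the backup is running
        if not self.backed_up[block]:
            # the guest write is stalled here: the *old* data has to be
            # copied to the backup target before the new data may land
            self.backup_target.write(block, self.disk.read(block))
            self.backed_up[block] = True
        self.disk.write(block, data)
```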

However I can't figure out what happens at the conclusion of the backup. ESXi merges the difference file back into the primary file (which does have its own issues). What does QEMU do at this point? Does it leave the difference file in place forever (over time this could result in a huge number of difference files), or does it merge the difference file back into the primary file?
nothing happens, since there is no difference file.

fleecing works in a similar fashion, but instead of copying directly to the backup target device, there is another "fleecing" image in-between. this fleecing image is sort of a half-transparent overlay over the original image - the backup happens on the fleecing image, with reads falling through to the original image if that part of the image hasn't been written to since the start of the backup. writes are handled similar to the current approach, but instead of copying to the backup target, the copy happens to the (hopefully local and fast) fleecing image. to keep the space usage low, the client can mark areas of the fleecing image as "already backed up", which tells Qemu to skip the copy-before-write action for those areas.
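
And a similarly rough sketch of how fleecing changes that picture - again purely illustrative, with invented names (it reuses the Image stand-in from the previous sketch), not the actual QEMU implementation:

```python
# Illustrative sketch of the fleecing approach described above - not QEMU code.

class FleecingBackup:
    def __init__(self, disk, fleecing_image, num_blocks):
        self.disk = disk                    # the VM's real disk image
        self.fleecing = fleecing_image      # local, hopefully fast, temporary image
        self.copied = [False] * num_blocks  # old data copied to the fleecing image?
        self.done = [False] * num_blocks    # marked "already backed up" by the client?

    def guest_write(self, block, data):
        # copy-before-write, but the copy goes to the local fleecing image
        # instead of the (possibly slow) backup target - and is skipped
        # entirely for areas the client already marked as backed up
        if not self.done[block] and not self.copied[block]:
            self.fleecing.write(block, self.disk.read(block))
            self.copied[block] = True
        self.disk.write(block, data)

    def backup_read(self, block):
        # the backup works on the fleecing image; reads fall through to the
        # original image if that block hasn't been overwritten by the guest
        # since the backup started
        if self.copied[block]:
            return self.fleecing.read(block)
        return self.disk.read(block)

    def mark_backed_up(self, block):
        # the client tells QEMU this area is done, so future guest writes
        # there no longer trigger a copy-before-write
        self.done[block] = True
```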
 
no, it actually inserts a copy-before-write filter, and if the guest writes to an area that hasn't yet been backed up, that write is stalled while the original data is copied to the backup "device".
Thanks fabian. When you say the write is stalled, I assume you don't mean that the write from the guest OS is delayed until the data has been backed up - that could be many minutes, and the guest application wouldn't tolerate that. Perhaps you mean that the write is held in a temporary buffer and is not committed until the backup is done. Could you elaborate please?

Also, bbgeek17 has indicated that the fleecing technique is under development. Could you elaborate on its availability?
 
Thanks fabian. When you say the write is stalled, I assume you don't mean that the write from the guest OS is delayed until the data has been backed up - that could be many minutes, and the guest application wouldn't tolerate that. Perhaps you mean that the write is held in a temporary buffer and is not committed until the backup is done. Could you elaborate please?

no, I actually mean delayed (in the worst case - since of course there is usually some level of in-flight/caching/buffering going on depending on how exactly your backup target is implemented, so the "copy-to-backup" part might return OK before it's actually persisted into the backup archive/..).

"fleecing" is basically such a (big) temporary buffer with extra features, to avoid guest writes failing even in the worst case if the backup target is too slow or unresponsive, while keeping the buffer size manageable and improving error handling.

the last iteration of the fleecing patches was sent a few days ago: https://lists.proxmox.com/pipermail/pve-devel/2024-April/062815.html - the cover letter contains a digestible summary of the pros and cons ;)
 
Thanks fabian. Sorry this has gone off-topic, but it's extremely useful information to me (and hopefully others).

Sanity check my understanding of the CoW filter please: While a backup is in progress, if the guest writes to a block that has not yet been backed up, then the data for that block will be read and sent to the backup destination before the guest is allowed to write the block. The write performance for that block, as seen by the guest, will be determined by the write performance of the backup destination (mitigated by whatever caching or buffering mechanisms might be present in the data path between the storage and the backup destination). The guest write is delayed only while that block is transferred to the backup destination (not for the entire duration of the backup, since the backup is capable of processing blocks out of sequence). If the backup destination is fast, then the impact may go unnoticed; if it is very slow and the guest attempts to write data rapidly during the backup, then there could be a significant impact on guest performance.

The fleecing mechanism will avoid this issue by providing temporary storage for the CoW data so that the guest does not need to be significantly delayed.

Score out of 10 please?
 
9/10 ;)

there is still some level of delay with fleecing, since a guest write of a block that hasn't been backed up yet is still basically transformed into a read + 2x write. the idea is that the fleecing storage is as fast as the regular VM storage, and not slower (or higher latency) like the backup storage, keeping the delay small enough to not matter. combined with the efficiency provided by marking areas either not interesting for the backup in the first place, or already backed up, you basically have a tradeoff between additional (local) storage space requirements and the upper bound of the guest write delay induced by the backup. it also has other advantages, like allowing the actual backup to be completely externalized (by exposing the fleecing image via NBD, for example, and letting the backup software read+discard on top of that to drive it) - but we won't use those for now, since it would require changing a lot of other parts as well (and also adds additional overhead).
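
As a back-of-the-envelope illustration of that tradeoff (the latency numbers below are invented for the example, not measurements):

```python
# Invented example latencies (ms) - purely illustrative, not measurements.
VM_READ, VM_WRITE = 0.3, 0.5      # fast VM storage
FLEECING_WRITE = 0.5              # fleecing image on storage as fast as VM storage
BACKUP_WRITE = 20.0               # slow / remote backup target

# Worst case for a guest write hitting a not-yet-backed-up block:
# without fleecing, the copy-before-write goes straight to the backup target.
no_fleecing = VM_READ + BACKUP_WRITE + VM_WRITE
# with fleecing, the copy lands on the local fleecing image (read + 2x write).
with_fleecing = VM_READ + FLEECING_WRITE + VM_WRITE

print(f"worst-case guest write delay without fleecing: {no_fleecing:.1f} ms")
print(f"worst-case guest write delay with fleecing:    {with_fleecing:.1f} ms")
```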
 
Thanks for the report card :).

The fleecing mechanism actually sounds superior to the mechanism used in ESXi. The snapshot mechanism in ESXi can cause VMs to stun during the snapshot commit at the end of the backup. This has been improved over the years but it's still an issue which prevents us from running backups of certain VMs (mostly RDS servers) during the working day.

Is there any rough expectation of when this will be ready (apart from the obvious "when it's done")?
 
I think the rough expectation is soonish - with the caveat, of course, that our internal review and testing doesn't come up with hitherto unknown show stoppers - but it will very likely be marked as a "preview" or "experimental" feature first and be opt-in. It is a big change in a rather core feature, after all.
 
The fleecing mechanism actually sounds superior to the mechanism used in ESXi. The snapshot mechanism in ESXi can cause VMs to stun during the snapshot commit at the end of the backup.
This only happens with slow storage; we have helped a few customers deal with this issue.
I think it's a bit of a misconception to compare fleecing directly to ESXi snapshots. You can get VM "stun" with snapshots in PVE too; specifically, NFS/qcow2 suffers from that. As long as the storage is fast, both hypervisors will perform seamlessly.

@stevehughes, you mentioned that you can't find information about this particular technology; here is some from the horse's mouth, as they say: https://www.youtube.com/watch?v=cjjnm1FqkS0
https://events19.linuxfoundation.or...us-Vladimir-Sementsov-Ogievskiy-Virtuozzo.pdf

Yes, the project is that old - it takes time to bring complex things to fruition.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 