[SOLVED] ZFS snapshot rollback broken (patched with libpve-storage-perl >= 7.0-16)

Jan 7, 2022
Hello,

I started with Proxmox just a few days ago and have found a problem that I cannot find mentioned anywhere else so far:

I have a datastore attached via ZFS over iSCSI. Creating VM disks works, and the VM itself runs fine.
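For context, my storage definition in /etc/pve/storage.cfg looks roughly like this (a sketch; the storage ID, target name, and tuning options are placeholders, not my exact values):
Code:
zfs: omnios-zfs
        blocksize 8k
        iscsiprovider comstar
        pool tank
        portal 1.2.3.4
        target iqn.2010-09.org.example:target0
        content images
        sparse 1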

I can create a snapshot of a running or shut-down VM, no problem so far.
The dataset gets a snapshot, and the memory state is also created.

Now the problem:
When I try to roll back, the process fails after the progress bar reaches 10%, with this message:
Code:
cannot open 'tank/vm-100-disk-0': missing '@' delimiter in snapshot name
TASK ERROR: command '/usr/bin/ssh -o 'BatchMode=yes' -i /etc/pve/priv/zfs/1.2.3.4_id_rsa root@1.2.3.4 zfs list -H -t snapshot -o name -s creation tank/vm-100-disk-0' failed: exit code 1

I've researched the command and tried running it verbatim on my ZFS server (OmniOS, managed via napp-it).

I do not know how Proxmox would continue after this command, but I have an idea what's wrong:

the `zfs list` call is missing a `-r`. I've looked around a bit in pve-zsync, and there the `-r` is present most of the time.
When I try the call with the additional `-r` switch, it finds all snapshots belonging to the dataset in question, but I have no idea why it is missing in this particular call.
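To illustrate, this is the difference when running both variants on the storage server:
Code:
# the call as Proxmox issues it -- fails
zfs list -H -t snapshot -o name -s creation tank/vm-100-disk-0
# the same call with -r added -- lists the snapshots of the dataset
zfs list -H -r -t snapshot -o name -s creation tank/vm-100-disk-0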

I've just looked a bit through the source without knowing much about Proxmox internals, so I'm hoping to find someone here who can help work out a solution.
I also could not find a specific point at which this `-r` switch was removed, so right now I'm not sure why only I seem to have this problem.

For now I'm on the no-subscription repository, so I'm unsure whether this is already resolved in the enterprise repository. If it is, that would also be a nice solution :)
As I said, I'm just starting with Proxmox, so a subscription will be an option if I stay with it. Until now I've used ESXi and NFS datastores, but I found ZFS over iSCSI in Proxmox very interesting, so if the snapshot handling works, this is the only thing keeping me from finally switching away from ESXi.

Greetings,
Uno/Georg
 
OK, I did some more testing...
I installed from the Proxmox 7.0-2 ISO (I think it then showed 7.0-11 in the top bar), set up the ZFS over iSCSI storage, and created a new VM.

Here everything works as it should: I can create a snapshot, and I can roll back to it.

After an upgrade to 7.1-8, it breaks again.

So I think there is definitely a bug introduced somewhere between Proxmox 7.0 and 7.1.

Also, my previous assumption that it has something to do with pve-zsync seems wrong; I now think it is related to libpve-storage-perl.

The libpve-storage-perl version when it was working was 7.0-10;
the newest Proxmox ships libpve-storage-perl 7.0-15, and with this it is not working...
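For anyone wanting to compare: the installed version shows up in the output of pveversion, a standard PVE command:
Code:
pveversion -v | grep libpve-storage-perl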

So right now:
Proxmox 7.0: rollback of ZFS snapshots works
Proxmox 7.1: rollback does not work

Any ideas how to proceed?
 
Hi,
What's the result when you manually run these commands on the server?
Code:
zfs list -H -t snapshot -o name -s creation tank/vm-100-disk-0
zfs list -H -r -t snapshot -o name -s creation tank/vm-100-disk-0

What's the ZFS version on the server?

Thanks!

Regarding the suspected missing -r: for me, it's not necessary to use -r when -t snapshot is used, but maybe that depends on the ZFS version.

 
Hello Fabian,

thanks for your time regarding this issue (especially if it works for you without the -r ;) )

The output of the two commands you suggested:

Code:
root@cube-it:~# zfs list -H -t snapshot -o name -s creation tank/vm-100-disk-0
cannot open 'tank/vm-100-disk-0': missing '@' delimiter in snapshot name
root@cube-it:~# zfs list -r -H -t snapshot -o name -s creation tank/vm-100-disk-0
tank/vm-100-disk-0@hmpf
tank/vm-100-disk-0@tnak
tank/vm-100-disk-0@hmhm

So as I wrote above: without the -r it complains about the missing delimiter; this message also shows up in the Proxmox GUI as part of the error (when using Proxmox 7.1).
With the -r, it apparently lists every snapshot belonging to the given dataset.

As for the ZFS version on the server: I'm using OmniOS r151038 (LTS) for my storage server, managed by napp-it. So it's not ZFS on Linux but Solaris-based.

I'm not sure how to find the exact ZFS version; the output of pkg list is:

Code:
system/file-system/zfs                            0.5.11-151038.0            i--
system/library/python/zfs-39                      0.5.11-151038.0            i--

The pool version is 5.

Regarding your remark that the -r isn't necessary for you: here is the manpage of my zfs command (reduced to these three parts):

Code:
zfs list [-r|-d depth] [-Hp] [-o property[,property]...] [-s property]...
       [-S property]... [-t type[,type]...] [filesystem|volume|snapshot]...
       Lists the property information for the given datasets in tabular form.
       If specified, you can list property information by the absolute
       pathname or the relative pathname.  By default, all file systems and
       volumes are displayed.  Snapshots are displayed if the listsnaps
       property is on (the default is off).  The following fields are
       displayed, name,used,available,referenced,mountpoint.

       -r  Recursively display any children of the dataset on the command
           line.

       -t type
           A comma-separated list of types to display, where type is one of
           filesystem, snapshot, volume, bookmark, or all.  For example,
           specifying -t snapshot displays only snapshots.

So one desperate hope: could it be related to the 'listsnaps' property (which is neither set nor present right now on my pool, so it is off)?

I have not tested setting this property yet because, as I wrote, it works with Proxmox 7.0.
So far I could not see whether Proxmox 7.0 uses the -r, or what else differs so that it works in 7.0 but not in 7.1.
Right now I have Proxmox 7.0 installed, but if it helps debugging I can switch back and forth, as nothing critical is on this testing machine.
So if I can do more to help you debug this, just tell me.

Greetings,
Georg
 
I sent a patch fixing the regression to the mailing list (still needs to be reviewed and packaged). You could try listsnaps as a workaround in the meantime or apply the patch yourself.

Thanks again!
 
Hello Fabian,

that was quick :)

I've had a look in the meantime regarding the listsnaps property.
As it is a pool property, I looked in the wrong place when I wrote my reply (I used zfs instead of zpool ;) ). The property was already switched to on when I issued the commands above.
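For anyone following along, this is where the property actually lives (pool level; as far as I can tell, the short name listsnaps is also accepted):
Code:
zpool get listsnapshots tank
zpool set listsnapshots=on tank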

So apparently it is not part of the problem, or at least not part of the solution.

Thanks for this.

I've updated to Proxmox 7.1, patched the file myself, and did a short test: it seems to work now :)
(This was just a quick test of whether the rollback works, but I assume the underlying parts are correct, since we identified the missing -r as the problematic item in this case.)

Another, perhaps related, thing:

I had a quick glance at ZFSPoolPlugin.pm and found a potential problem in the lines around 513:
Perl:
sub volume_snapshot_info {
    my ($class, $scfg, $storeid, $volname) = @_;

    my $vname = ($class->parse_volname($volname))[1];

    # note: no -r here either
    my @params = ('-Hp', '-t', 'snapshot', '-o', 'name,guid,creation', "$scfg->{pool}\/$vname");
    my $text = $class->zfs_request($scfg, undef, 'list', @params);
    my @lines = split(/\n/, $text);

    my $info = {};
    for my $line (@lines) {
        # each line: <dataset>@<snapname> <guid> <creation>
        my ($snapshot, $guid, $creation) = split(/\s+/, $line);
        (my $snap_name = $snapshot) =~ s/^.*@//;    # strip everything up to and including '@'
        $info->{$snap_name} = {
            id => $guid,
            timestamp => $creation,
        };
    }
    return $info;
}

I have no idea when and where exactly this is used, or how to test it. But maybe you have more insight into whether the missing -r could also cause trouble for certain tasks in this call.
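For comparison, as far as I can tell this subroutine builds a call like the following (with my dataset as the example), which would presumably hit the same delimiter error on my ZFS version without -r:
Code:
zfs list -Hp -t snapshot -o name,guid,creation tank/vm-100-disk-0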

One last thing regarding the subscription/enterprise repo:
As I understand it now, tested packages end up in the enterprise repository, so the pve-no-subscription repository gets newer packages more quickly than the enterprise one?
If I now get a subscription and thus enable the enterprise repository: is this regression currently part of the enterprise repository? If so, how long would the proposed patch from the mailing list take to be included, so that no manual patching is needed?
Or is it perhaps not in enterprise yet, which would explain why nobody else has complained about it?
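(For reference, the repository line I would be enabling, assuming the standard location of /etc/apt/sources.list.d/pve-enterprise.list for PVE 7:)
Code:
deb https://enterprise.proxmox.com/debian/pve bullseye pve-enterprise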

Greetings,
Georg
 
volume_snapshot_info is only used for replication, which (currently) requires a local ZFS pool. But you are correct: it would be affected by the same issue if replication is ever extended to support ZFS over iSCSI and the remote ZFS version requires -r.

Regarding the repositories: yes, and because of this the packages in the enterprise repository are better tested. If a fix is important, we obviously try our best to get it into the enterprise repository quickly too.

The package (i.e. libpve-storage-perl_7.0-14_all.deb) has been in the enterprise repository since mid-November, so I suppose not too many ZFS versions are affected by this.
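If you want to check which version each of your configured repositories currently offers, apt can tell you:
Code:
apt-cache policy libpve-storage-perl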

 
Hello Fabian,

thanks for all your help. This issue was resolved much faster than I hoped :)

So as I understand it, I can switch to the enterprise repository and update, but for the time being I need to patch ZFSPoolPlugin.pm myself.

As it only affects snapshot rollback, I can do regular updates, but I have to be careful to re-apply the patch after each one until the patch notes tell me that this specific issue is addressed, correct? Is a reboot needed for this file, or is the Perl part evaluated freshly on, e.g., each rollback?
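Until then, my working plan would be roughly this (a sketch; the module path is where I found the file on my install):
Code:
# keep a copy of the unpatched module for comparison
cp /usr/share/perl5/PVE/Storage/ZFSPoolPlugin.pm /root/ZFSPoolPlugin.pm.unpatched
# then apply the -r change to the installed file; after each
# libpve-storage-perl update, compare again -- if there is no diff,
# the update overwrote my change without shipping the fix, so re-patch
diff /root/ZFSPoolPlugin.pm.unpatched /usr/share/perl5/PVE/Storage/ZFSPoolPlugin.pm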

And what about this thread?
Should I change the title to something like "solved/patch inside"? Or will a moderator do this later?

Greetings and thanks again,
Georg
 
Yes, but luckily the patch has already been applied, so it should be part of libpve-storage-perl >= 7.0-16. I'm not sure when it will be packaged and reach the enterprise repository, though. For CLI commands, a change to the module takes effect immediately, but for the API/UI you need to execute
Code:
systemctl reload-or-restart pveproxy.service pvedaemon.service
to load the modified module.

As for marking the thread: you can either wait for the package or mark it as solved now (by using Edit thread and selecting the SOLVED prefix).
 