Hello,
We have two independent PVE clusters, each connected to its own PBS server. On one cluster, live-restore works properly, as does normal restore. On the other, live-restore always fails (on all nodes).
The only difference we can see between the two clusters is that the working one was recently installed with PVE 8, while the other was upgraded from version 7.
The PVE version is pve-manager/8.0.4/d258a813cfa6b390 (running kernel: 6.2.16-15-pve).
The error received is:
Code:
new volume ID is 'storage:vm-999-disk-0'
rescan volumes...
VM is locked (create)
starting VM for live-restore
repository: 'USER@XXX:BackupServer', snapshot: 'vm/100000004/2023-10-24T02:27:57Z'
restoring 'drive-virtio0' to 'storage:vm-999-disk-0'
kvm: -drive file.filename=rbd:storage/vm-999-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/storage.keyring,if=none,id=drive-virtio0,discard=on,throttling.iops-read=350,throttling.iops-read-max=1000,throttling.iops-write=350,throttling.iops-write-max=1000,format=alloc-track,file.driver=rbd,cache=none,aio=io_uring,file.detect-zeroes=unmap,backing=drive-virtio0-pbs,auto-remove=on: warning: RBD options encoded in the filename as keyvalue pairs is deprecated
restore-drive-virtio0: transferred 0.0 B of 20.0 GiB (0.00%) in 0s
restore-drive-virtio0: Cancelling block job
An error occurred during live-restore: block job (stream) error: VM 999 not running
trying to acquire lock...
OK
error before or during data restore, some or all disks were not completely restored. VM 999 state is NOT cleaned up.
live-restore failed
We have tried to gather more information by looking at the qmrestore Perl code, and what we can say is:
- the VM launches properly and stays alive if we put a sleep in the code.
- in QemuServer.pm, in sub pbs_live_restore, there are these two lines of code. If we sleep before mon_cmd(), the VM is still running according to "ps aux"; after that line, the process disappears (a rough sketch of our sleep instrumentation follows the snippet).
Code:
mon_cmd($vmid, 'cont');
qemu_drive_mirror_monitor($vmid, undef, $jobs, 'auto', 0, 'stream');
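To show exactly where the sleep went, our instrumentation looked roughly like this (the print lines and the sleep duration are just illustrative debug additions on our side, not upstream code):
Code:
# our debug additions around the live-restore start (not upstream code)
print "about to resume VM $vmid for live-restore\n";
sleep 60;                 # while this sleep runs, `ps aux` still shows the kvm process
mon_cmd($vmid, 'cont');   # resume the paused VM via QMP 'cont'
qemu_drive_mirror_monitor($vmid, undef, $jobs, 'auto', 0, 'stream');
# at some point after resuming, the kvm process disappears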
- in QemuServer/Helpers.pm, sub vm_running_locally: if we add some debug output so that the function looks like this
Code:
sub vm_running_locally {
    my ($vmid) = @_;

    my $pidfile = pidfile_name($vmid);
    print "entering vm_running_locally\n";

    if (my $fd = IO::File->new("<$pidfile")) {
        my $st = stat($fd);
        my $line = <$fd>;
        close($fd);

        my $mtime = $st->mtime;
        if ($mtime > time()) {
            warn "file '$pidfile' modified in future\n";
        }

        if ($line =~ m/^(\d+)$/) {
            print "in if (\$line =~ ...\n";
            my $pid = $1;
            my $cmdline = parse_cmdline($pid);
            if ($cmdline && defined($cmdline->{pidfile}) && $cmdline->{pidfile}->{value}
                && $cmdline->{pidfile}->{value} eq $pidfile) {
                print "in if(\$cmdline...\n";
                if (my $pinfo = PVE::ProcFSTools::check_process_running($pid)) {
                    print "return pid $pid\n";
                    return $pid;
                }
            }
        }
    }

    print "return null\n";
    return;
}
then the resulting trace seems to show that the VM dies somewhere after restoring the first bytes:
Code:
kvm: -drive file.filename=rbd:storage/vm-999-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/storage.keyring,if=none,id=drive-virtio0,discard=on,throttling.iops-read=350,throttling.iops-read-max=1000,throttling.iops-write=350,throttling.iops-write-max=1000,format=alloc-track,file.driver=rbd,cache=none,aio=io_uring,file.detect-zeroes=unmap,backing=drive-virtio0-pbs,auto-remove=on: warning: RBD options encoded in the filename as keyvalue pairs is deprecated
entering vm_running_locally
in if ($line =~ ...
in if($cmdline...
return pid 3524918
entering vm_running_locally
in if ($line =~ ...
in if($cmdline...
return pid 3524918
entering vm_running_locally
in if ($line =~ ...
in if($cmdline...
return pid 3524918
entering vm_running_locally
in if ($line =~ ...
in if($cmdline...
return pid 3524918
entering vm_running_locally
in if ($line =~ ...
in if($cmdline...
return pid 3524918
restore-drive-virtio0: transferred 0.0 B of 20.0 GiB (0.00%) in 0s
entering vm_running_locally
in if ($line =~ ...
return null
restore-drive-virtio0: Cancelling block job
entering vm_running_locally
in if ($line =~ ...
return null
entering vm_running_locally
in if ($line =~ ...
return null
An error occurred during live-restore: block job (stream) error: VM 999 not running
entering vm_running_locally
in if ($line =~ ...
return null
error before or during data restore, some or all disks were not completely restored. VM 999 state is NOT cleaned up.
live-restore failed
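If it would help, we can also watch the restore target's kvm process from outside while a restore runs, for example with a small helper script like this (our own debug tool, not part of PVE; VMID 999 and the usual qemu-server pidfile path are hard-coded):
Code:
#!/usr/bin/perl
# Watch the restore target's kvm process from outside (our own debug helper,
# not part of PVE); VMID 999 and the standard qemu-server pidfile path are assumed.
use strict;
use warnings;
use Time::HiRes qw(sleep);

my $vmid = 999;
my $pidfile = "/var/run/qemu-server/$vmid.pid";

while (1) {
    if (open(my $fh, '<', $pidfile)) {
        chomp(my $pid = <$fh> // '');
        close($fh);
        # consider the process alive as long as /proc/<pid> still exists
        my $state = ($pid =~ /^\d+$/ && -d "/proc/$pid") ? "alive (pid $pid)" : "gone";
        print scalar(localtime) . ": VM $vmid $state\n";
    } else {
        print scalar(localtime) . ": no pidfile for VM $vmid\n";
    }
    sleep 0.5;
}
Running it during a failed live-restore should pin down more precisely when the process goes away.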
I would be happy to provide more information, but I'm not sure what would be relevant.
Many thanks for your help.