Urgent Help Needed: VMs Not Starting on Proxmox with LINSTOR Shared Storage

MARVELMafia

New Member
Aug 10, 2023
I'm facing a critical issue with my Proxmox setup using LINSTOR for shared storage. Both my production and development VMs are failing to start, and I urgently need assistance to resolve this.

Here's the error I'm encountering in the Proxmox GUI when I try to start the VM:
Code:
task started by HA resource agent
WARN: no efidisk configured! Using temporary efivars disk.
blockdev: cannot open /dev/drbd/by-res/vm-102-disk-1/0: No such file or directory
blockdev: cannot open /dev/drbd/by-res/vm-102-disk-2/0: No such file or directory
blockdev: cannot open /dev/drbd/by-res/vm-102-disk-3/0: No such file or directory
kvm: -drive file=/dev/drbd/by-res/vm-102-disk-1/0,if=none,id=drive-scsi0,cache=writeback,format=raw,aio=io_uring,detect-zeroes=on: Could not open '/dev/drbd/by-res/vm-102-disk-1/0': No such file or directory

NOTICE
  Intentionally removing diskless assignment (vm-102-disk-1) on (kube14).
  It will be re-created when the resource is actually used on this node.
API Return-Code: 500. Message: Could not delete diskless resource vm-102-disk-1 on kube14, because:
[{"ret_code":53739522,"message":"Node: kube14, Resource: vm-102-disk-1 preparing for deletion.","details":"Node: kube14, Resource: vm-102-disk-1 UUID is: 3c8f8d4a-97ca-400c-82fb-6ac09c5d7d2a","obj_refs":{"RscDfn":"vm-102-disk-1","Node":"kube14"},"created_at":"2024-06-04T03:13:57.42399Z"},{"ret_code":-4611686018373647386,"message":"(Node: 'linstor2') No response generated by handler.","details":"In API call 'ChangedRsc'.","obj_refs":{"RscDfn":"vm-102-disk-1","Node":"kube14"},"created_at":"2024-06-04T03:13:57.431612Z"},{"ret_code":53739523,"message":"Preparing deletion of resource on 'kube14'","obj_refs":{"RscDfn":"vm-102-disk-1","Node":"kube14"},"created_at":"2024-06-04T03:13:57.432698Z"},{"ret_code":53739523,"message":"Preparing deletion of resource on 'linstor3'","obj_refs":{"RscDfn":"vm-102-disk-1","Node":"kube14"},"created_at":"2024-06-04T03:13:57.673295Z"},{"ret_code":53739523,"message":"Preparing deletion of resource on 'linstor1'","obj_refs":{"RscDfn":"vm-102-disk-1","Node":"kube14"},"created_at":"2024-06-04T03:14:02.501121Z"},{"ret_code":-4611686018373647386,"message":"Deletion of resource 'vm-102-disk-1' on node 'kube14' failed due to an unknown exception.","details":"Node: kube14, Resource: vm-102-disk-1","error_report_ids":["656E857A-00000-054648"],"obj_refs":{"RscDfn":"vm-102-disk-1","Node":"kube14"},"created_at":"2024-06-04T03:14:02.517225Z"}]
 at /usr/share/perl5/PVE/Storage/Custom/LINSTORPlugin.pm line 470.
PVE::Storage::Custom::LINSTORPlugin::deactivate_volume("PVE::Storage::Custom::LINSTORPlugin", "linstor-prod-triplicate", HASH(0x564759f7e068), "vm-102-disk-1", undef, HASH(0x56475a2522c0)) called at /usr/share/perl5/PVE/Storage.pm line 1240
eval {...} called at /usr/share/perl5/PVE/Storage.pm line 1239
PVE::Storage::deactivate_volumes(HASH(0x564759f74c48), ARRAY(0x564759fc3838)) called at /usr/share/perl5/PVE/QemuServer.pm line 5921
eval {...} called at /usr/share/perl5/PVE/QemuServer.pm line 5921
PVE::QemuServer::vm_start_nolock(HASH(0x564759f74c48), 102, HASH(0x56475392eaf8), HASH(0x564758fdc738), HASH(0x564759f6a710)) called at /usr/share/perl5/PVE/QemuServer.pm line 5604
PVE::QemuServer::__ANON__() called at /usr/share/perl5/PVE/AbstractConfig.pm line 299
PVE::AbstractConfig::__ANON__() called at /usr/share/perl5/PVE/Tools.pm line 259
eval {...} called at /usr/share/perl5/PVE/Tools.pm line 259
PVE::Tools::lock_file_full("/var/lock/qemu-server/lock-102.conf", 10, 0, CODE(0x564758a0c758)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 302
PVE::AbstractConfig::__ANON__("PVE::QemuConfig", 102, 10, 0, CODE(0x564758bdd750)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 322
PVE::AbstractConfig::lock_config_full("PVE::QemuConfig", 102, 10, CODE(0x564758bdd750)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 330
PVE::AbstractConfig::lock_config("PVE::QemuConfig", 102, CODE(0x564758bdd750)) called at /usr/share/perl5/PVE/QemuServer.pm line 5605
PVE::QemuServer::vm_start(HASH(0x564759f74c48), 102, HASH(0x564758fdc738), HASH(0x564759f6a710)) called at /usr/share/perl5/PVE/API2/Qemu.pm line 2999
PVE::API2::Qemu::__ANON__("UPID:kube14:00002C1C:00006BA5:665E864E:qmstart:102:root\@pam:") called at /usr/share/perl5/PVE/RESTEnvironment.pm line 620
eval {...} called at /usr/share/perl5/PVE/RESTEnvironment.pm line 611
PVE::RESTEnvironment::fork_worker(PVE::RPCEnvironment=HASH(0x5647538f8798), "qmstart", 102, "root\@pam", CODE(0x564759f7e3e0)) called at /usr/share/perl5/PVE/API2/Qemu.pm line 3003
PVE::API2::Qemu::__ANON__(HASH(0x564759f74840)) called at /usr/share/perl5/PVE/RESTHandler.pm line 499
PVE::RESTHandler::handle("PVE::API2::Qemu", HASH(0x564759a15ba8), HASH(0x564759f74840)) called at /usr/share/perl5/PVE/RESTHandler.pm line 337
PVE::RESTHandler::__ANON__("PVE::API2::Qemu", HASH(0x564759f74840)) called at /usr/share/perl5/PVE/HA/Resources/PVEVM.pm line 74
PVE::HA::Resources::PVEVM::start("PVE::HA::Resources::PVEVM", PVE::HA::Env=HASH(0x564759f6eb98), 102) called at /usr/share/perl5/PVE/HA/LRM.pm line 918
PVE::HA::LRM::exec_resource_agent(PVE::HA::LRM=HASH(0x564759f6ecd0), "vm:102", HASH(0x564759f6f8d0), "started", HASH(0x56475392ecf0)) called at /usr/share/perl5/PVE/HA/LRM.pm line 631
eval {...} called at /usr/share/perl5/PVE/HA/LRM.pm line 630
PVE::HA::LRM::run_workers(PVE::HA::LRM=HASH(0x564759f6ecd0)) called at /usr/share/perl5/PVE/HA/LRM.pm line 700
PVE::HA::LRM::manage_resources(PVE::HA::LRM=HASH(0x564759f6ecd0)) called at /usr/share/perl5/PVE/HA/LRM.pm line 503
eval {...} called at /usr/share/perl5/PVE/HA/LRM.pm line 458
PVE::HA::LRM::work(PVE::HA::LRM=HASH(0x564759f6ecd0)) called at /usr/share/perl5/PVE/HA/LRM.pm line 329
PVE::HA::LRM::do_one_iteration(PVE::HA::LRM=HASH(0x564759f6ecd0)) called at /usr/share/perl5/PVE/Service/pve_ha_lrm.pm line 28
PVE::Service::pve_ha_lrm::run(PVE::Service::pve_ha_lrm=HASH(0x564759ef2b88)) called at /usr/share/perl5/PVE/Daemon.pm line 398
eval {...} called at /usr/share/perl5/PVE/Daemon.pm line 379
PVE::Daemon::__ANON__(PVE::Service::pve_ha_lrm=HASH(0x564759ef2b88), undef) called at /usr/share/perl5/PVE/Daemon.pm line 551
eval {...} called at /usr/share/perl5/PVE/Daemon.pm line 549
PVE::Daemon::start(PVE::Service::pve_ha_lrm=HASH(0x564759ef2b88), undef) called at /usr/share/perl5/PVE/Daemon.pm line 658
PVE::Daemon::__ANON__(HASH(0x564759f6a2f0)) called at /usr/share/perl5/PVE/RESTHandler.pm line 499
PVE::RESTHandler::handle("PVE::Service::pve_ha_lrm", HASH(0x564759ef2f00), HASH(0x564759f6a2f0), 1) called at /usr/share/perl5/PVE/RESTHandler.pm line 985
eval {...} called at /usr/share/perl5/PVE/RESTHandler.pm line 968
PVE::RESTHandler::cli_handler("PVE::Service::pve_ha_lrm", "pve-ha-lrm start", "start", ARRAY(0x564753c199b0), ARRAY(0x564759ef3218), undef, undef, undef) called at /usr/share/perl5/PVE/CLIHandler.pm line 594
PVE::CLIHandler::__ANON__(ARRAY(0x5647538f8708), CODE(0x56475393fd20), undef) called at /usr/share/perl5/PVE/CLIHandler.pm line 673
PVE::CLIHandler::run_cli_handler("PVE::Service::pve_ha_lrm", "prepare", CODE(0x56475393fd20)) called at /usr/sbin/pve-ha-lrm line 30

NOTICE
  Intentionally removing diskless assignment (vm-102-disk-2) on (kube14).
  It will be re-created when the resource is actually used on this node.
API Return-Code: 500. Message: Could not delete diskless resource vm-102-disk-2 on kube14, because:
[{"ret_code":53739522,"message":"Node: kube14, Resource: vm-102-disk-2 preparing for deletion.","details":"Node: kube14, Resource: vm-102-disk-2 UUID is: 18b7f4a2-e5e9-454d-9de1-92fa1b168fe0","obj_refs":{"RscDfn":"vm-102-disk-2","Node":"kube14"},"created_at":"2024-06-04T03:14:02.770854Z"},{"ret_code":-4611686018373647386,"message":"(Node: 'linstor2') No response generated by handler.","details":"In API call 'ChangedRsc'.","obj_refs":{"RscDfn":"vm-102-disk-2","Node":"kube14"},"created_at":"2024-06-04T03:14:02.77783Z"},{"ret_code":53739523,"message":"Preparing deletion of resource on 'kube14'","obj_refs":{"RscDfn":"vm-102-disk-2","Node":"kube14"},"created_at":"2024-06-04T03:14:02.780142Z"},{"ret_code":53739523,"message":"Preparing deletion of resource on 'linstor1'","obj_refs":{"RscDfn":"vm-102-disk-2","Node":"kube14"},"created_at":"2024-06-04T03:14:29.652947Z"},{"ret_code":-4611686018373647386,"message":"Deletion of resource 'vm-102-disk-2' on node 'kube14' failed due to an unknown exception.","details":"Node: kube14, Resource: vm-102-disk-2","error_report_ids":["656E857A-00000-054649"],"obj_refs":{"RscDfn":"vm-102-disk-2","Node":"kube14"},"created_at":"2024-06-04T03:14:29.660518Z"}]
 at /usr/share/perl5/PVE/Storage/Custom/LINSTORPlugin.pm line 470.
PVE::Storage::Custom::LINSTORPlugin::deactivate_volume("PVE::Storage::Custom::LINSTORPlugin", "linstor-prod-duplicate", HASH(0x564759f7e308), "vm-102-disk-2", undef, HASH(0x56475a2522c0)) called at /usr/share/perl5/PVE/Storage.pm line 1240
eval {...} called at /usr/share/perl5/PVE/Storage.pm line 1239
PVE::Storage::deactivate_volumes(HASH(0x564759f74c48), ARRAY(0x564759fc3838)) called at /usr/share/perl5/PVE/QemuServer.pm line 5921
eval {...} called at /usr/share/perl5/PVE/QemuServer.pm line 5921
PVE::QemuServer::vm_start_nolock(HASH(0x564759f74c48), 102, HASH(0x56475392eaf8), HASH(0x564758fdc738), HASH(0x564759f6a710)) called at /usr/share/perl5/PVE/QemuServer.pm line 5604

PVE::CLIHandler::__ANON__(ARRAY(0x5647538f8708), CODE(0x56475393fd20), undef) called at /usr/share/perl5/PVE/CLIHandler.pm line 673
PVE::CLIHandler::run_cli_handler("PVE::Service::pve_ha_lrm", "prepare", CODE(0x56475393fd20)) called at /usr/sbin/pve-ha-lrm line 30

NOTICE
  Intentionally removing diskless assignment (vm-102-disk-3) on (kube14).
  It will be re-created when the resource is actually used on this node.
API Return-Code: 500. Message: Could not delete diskless resource vm-102-disk-3 on kube14, because:
[{"ret_code":53739522,"message":"Node: kube14, Resource: vm-102-disk-3 preparing for deletion.","details":"Node: kube14, Resource: vm-102-disk-3 UUID is: faf735f3-f92c-4d7c-a372-6199993f4f1a","obj_refs":{"RscDfn":"vm-102-disk-3","Node":"kube14"},"created_at":"2024-06-04T03:14:30.517016Z"},{"ret_code":-4611686018373647386,"message":"(Node: 'linstor2') No response generated by handler.","details":"In API call 'ChangedRsc'.","obj_refs":{"RscDfn":"vm-102-disk-3","Node":"kube14"},"created_at":"2024-06-04T03:14:30.524225Z"},{"ret_code":53739523,"message":"Preparing deletion of resource on 'kube14'","obj_refs":{"RscDfn":"vm-102-disk-3","Node":"kube14"},"created_at":"2024-06-04T03:14:30.525404Z"},{"ret_code":53739523,"message":"Preparing deletion of resource on
volume deactivation failed: linstor-prod-triplicate:vm-102-disk-1 linstor-prod-duplicate:vm-102-disk-2 linstor-prod-duplicate:vm-102-disk-3 at /usr/share/perl5/PVE/Storage.pm line 1248.
TASK ERROR: start failed: QEMU exited with code 1

This is what I see on the linstor end for VM 100:

[screenshot attachment]

and this is for VM 102:

[screenshot attachment]
What about the linstor list commands?

Bash:
linstor node list

Bash:
linstor storage-pool list

Bash:
linstor resource-group list

Bash:
linstor resource list
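
Beyond the list commands, the controller also keeps error reports, and DRBD has its own status view. Two extra checks that should work on any standard LINSTOR/DRBD install (a generic sketch, not specific to your setup):

Bash:
# error reports collected by the LINSTOR controller
linstor error-reports list

# DRBD's view of every resource on the local node
drbdadm status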
 
I have no idea where to look. I just want to be able to start vm-100 :eek:

It looks like one of your pools is down.

I am speculating. It might be because of "some" hardware issues with the computer / disk / temperature / network / ...

Again. Speculation.

However - it doesn't look like a Proxmox problem. Proxmox can't get the disk because a LINSTOR pool is offline / broken.
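
If you want to confirm that from the Proxmox side, check whether the device nodes from the task log even exist on kube14. A quick sanity check (paths taken from your log):

Bash:
# the device node QEMU tried to open
ls -l /dev/drbd/by-res/vm-102-disk-1/

# DRBD's local view of that resource
drbdadm status vm-102-disk-1

If the by-res path is missing and DRBD reports the resource as down or unknown, the failure is below Proxmox.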
 
So the linstor2 server probably has some hardware issues.

How can I be sure, or prove, that it is not an issue on the Proxmox end? Two of my VMs are not able to start because of this. Is there any way out, or am I just stuck with the two VMs not working? Any tips on how I recover the two VMs?

And how would one go about debugging this? Do I check the RAID setup to see which HDD/SSD is failing?
 

NO - that is speculation (I wrote this twice!)

It might be hardware, it might be the fan, it might be the software, it might be the hard disk, it might be a cable, it might be RAM, it might be the CPU, it might be the motherboard...



We know for a fact - a pool is down - we don't know why.

[attached screenshot: 1717492790715.png]
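
If you want to start digging on the affected node (linstor2), the usual first checks would be something like this; /dev/sdX is a placeholder for whichever disk backs the pool:

Bash:
# is the LINSTOR satellite running on the node?
systemctl status linstor-satellite

# recent satellite logs
journalctl -u linstor-satellite --since "2 hours ago"

# kernel-level I/O errors pointing at a failing disk
dmesg | grep -iE "i/o error|blk_update_request|ata[0-9]"

# SMART health of the backing disk (replace /dev/sdX)
smartctl -H /dev/sdX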
 
Yes I understand that. But where does that leave us?
Any resource (VM or anything) using the linstor2 storage pool would be affected. So what are my options for recovering from this?
 

Very simple.

It's not a Proxmox issue. Proxmox can't fix it. Proxmox is not the reason this happened. Proxmox just can't access the data, which results in a VM that does not start.

Whoever installed and configured linstor is probably the first person to ask. It looks like you have 150+ TB of storage (that means "a lot of data"). Someone who is a LINBIT / LINSTOR expert might help you. You have to carefully check why the pool is in an error state.
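
One concrete starting point: the task log in the first post already names two error reports. You can pull them on the controller node; the IDs below are the ones from that log:

Bash:
linstor error-reports show 656E857A-00000-054648
linstor error-reports show 656E857A-00000-054649

They usually contain the exception from the satellite, which tends to point at the failing storage layer.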
 

The person who configured it, our 'expert', is on leave, and I am the junior responding to this outage for now.

Thank you very much for your help. I will try to figure out why the pool is in an error state.

Off the top of my head: can I copy/move data from the failing VM's hard-disk resource to a new hard-disk resource, and then attach that new disk to a newly created VM in Proxmox? We use DRBD internally.
 

If you are on Discord or somewhere, we might be able to have a look at this together. PM me. I might be available in ~3 hours.


As for copying the data off: I think that is actually what happened with the other pools / data. They are in a warning state, running on whatever redundancy is left. But the machines whose resources are in an error state are down.
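
If another node still holds an UpToDate replica, a sketch of the manual route could look like this. Treat it as an outline, not a recipe: vm-999-disk-1 is a placeholder for a disk you would first create on a new VM through Proxmox, and you must verify replica state before promoting anything.

Bash:
# check which nodes still have a healthy replica
linstor resource list -r vm-102-disk-1

# on a node with an UpToDate replica: make the DRBD device readable
drbdadm primary vm-102-disk-1

# raw block copy into the freshly created disk of the new VM
dd if=/dev/drbd/by-res/vm-102-disk-1/0 \
   of=/dev/drbd/by-res/vm-999-disk-1/0 \
   bs=4M status=progress

# demote again when done
drbdadm secondary vm-102-disk-1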
 
