We are currently migrating to Proxmox (PVE 8.3.0, LVM storage). With automated bulk starts of VMs we observe hanging tasks. Others in our company could mitigate or work around the problem, but since we perform many automated VM starts over the REST API, workarounds only shift the issue for us.
What we see is that the start command never makes any progress; it is stuck forever. Even giving it a timeout via the API did not help: one has to detect the hang and stop the task manually from the outside, otherwise, because of the lock file, the VM stays in an unusable state indefinitely. It seems to happen when there is somewhat more disk I/O on the system.
Digging deeper, we can see the hanging child and parent processes from fork_worker in RESTEnvironment.pm. The parent is stuck in a blocking read where no data ever appears, while the child executes vm_start and blocks in a select() on a UNIX socket where no data becomes available either (see attached files). Even with gdb and extra syslog output in the Perl code, we could not pin down the root cause so far. Any pointers or hints would be highly appreciated.
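To illustrate the shape of the hang (this is a hypothetical minimal sketch in Python, not the actual PVE/Perl code): the parent blocks in a read on a pipe from the child, while the child itself blocks in select() on a socket that never becomes readable. The sketch also shows the mitigation shape we have been experimenting with, namely polling the pipe with a timeout instead of a plain blocking read, so the parent can at least detect the stall:

```python
import os
import select
import socket

def parent_read_with_timeout(fd, timeout):
    """Poll the pipe instead of blocking forever in os.read().

    Returns None if nothing arrived within the timeout -- the case
    where a plain blocking read would have hung indefinitely.
    """
    ready, _, _ = select.select([fd], [], [], timeout)
    if not ready:
        return None
    return os.read(fd, 4096)

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: mimic vm_start blocking in select() on a UNIX socketpair
    # on which no data ever arrives.
    a, b = socket.socketpair()
    readable, _, _ = select.select([a], [], [], 1.0)
    if not readable:
        # Report the stalled select() back to the parent.
        os.write(w, b"child: select timed out, no data\n")
    os._exit(0)

# Parent: with a plain os.read(r, ...) and a child that never writes,
# we would block here forever -- the hang we observe in fork_worker.
os.close(w)
data = parent_read_with_timeout(r, timeout=5.0)
os.waitpid(pid, 0)
print("hung" if data is None else data.decode().strip())
```

This only reproduces the blocking pattern, not the underlying reason why the socket in vm_start never becomes readable under disk I/O load.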