Maximal Workers / Bulk-Action setting ignored

Feb 10, 2024
Proxmox doesn't appear to honor the Maximal Workers / Bulk-Action setting. I currently have it set to 4, and creating 21 VMs simultaneously causes some to fail.

Code:
for a in $(seq 201 221); do
  pvesh create nodes/pm105/qemu/103/clone --full --name cl$a --newid $a \
    --target pm$(( a % 3 + 105 )) >log.clone.$a &
done
or
Code:
for a in $(seq 201 221); do
  qm clone 103 $a --full true --name cl$a --target pm$(( a % 3 + 105 )) >log.clone.$a &
done

Several fail with clone failed: cfs-lock 'storage-iSAN8LVM' error: got lock request timeout

Lock handling appears to be suboptimal, but that wouldn't be noticed if Maximal Workers operated as expected. (I'm not sure whether the problem is the lock handling itself, or whether the lock is simply held too long during the partition operations, but I suspect the locking/releasing is the suboptimal part.)

When automating VM creation, is there a general preference for pvesh create, qm clone, or some other method? Or is there some option I need to add so that Maximal Workers is honored?

Expected behavior: only 4 jobs (or whatever the cluster's Maximal Workers is set to) run simultaneously, and additional requests are queued and started after the initial jobs complete.

Is there a better place to submit a bug report than the forum? (I am planning on signing up for support later this month, so it's fine if you say putting in a support ticket is the best option.)
 
I assume you are talking about the setting in the datacenter.cfg (Datacenter > Options)? This setting is used for bulk tasks only, such as 'start/stop all' or for the HA manager. Since you are spawning several independent tasks this limit does not apply to the clone actions.
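
For reference, in /etc/pve/datacenter.cfg this corresponds to the 'max_workers' option (the value here is just an example):

Code:
max_workers: 4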

The locking errors you are getting most likely occur because each clone tries to lock the storage, and some then fail because the other clone tasks take longer than the lock timeout, which is expected in this case.

It would make sense to space out the clone tasks a bit, depending on how long each clone task takes. A simple sleep for a few seconds could already be sufficient if cloning doesn't take too long.
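
For example, something along these lines (reusing the IDs from your loop) would give each clone a head start before the next one asks for the storage lock:

Code:
for a in $(seq 201 221); do
  qm clone 103 $a --full true --name cl$a --target pm$(( a % 3 + 105 )) >log.clone.$a &
  sleep 5    # adjust to roughly how long a single clone takes
done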
 
Yes, your assumption is right. The equivalent setting in VMware does also apply to individual tasks such as cloning. Adding a delay only helps a little, as it still runs into problems with this many VMs: each VM has several disks (boot/OS, data disk, and EFI disk), each of which is locked separately, so the clones just start failing later in the process instead of on the first disk.

The locking/unlocking mechanism seems to be rather slow, or possibly the iSCSI partition slows down once enough worker processes are hitting it. Is there a way to increase the timeout or the number of retries? Have you considered moving to something like a resource queue in Redis, which could handle thousands of locks/unlocks per second from hundreds of workers and would also serve requests in FIFO order instead of the random order they currently hit the lock in?

I can do a workaround, such as using xargs to keep the parallelism of a single job while limiting how many clones run concurrently (sketched below). That said, as soon as a few different bulk jobs are run by different devs, we would be right back at the same problem. It would be nice if Proxmox were a bit more robust out of the box.
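
For reference, the xargs workaround I have in mind looks roughly like this (same IDs and node names as in my earlier example; the -P value is what actually limits concurrency):

Code:
seq 201 221 | xargs -P 4 -I{} sh -c \
  'qm clone 103 {} --full true --name cl{} --target pm$(( {} % 3 + 105 )) >log.clone.{}'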
 
