Sometimes API requests time out after 3 minutes. How can we write solutions that act only after longer requests have actually completed?
The first option is to set your own timeout on the API call, using the option defined here:
--timeout <s> Timeout for server communication. Defaults to 180 seconds (3 minutes).
For example, this simple command waits up to 10 minutes rather than the default of 3 minutes:
ectool --timeout 600 getResources
This approach helps in most cases, but it is not perfect. You may pick a time that seems appropriate, but if the overall system is temporarily bogged down, perhaps by a heavy burst of workload, what you thought was a sufficient extension may not be enough, and your procedure will fail in such situations.
Such failures often won’t show up for years, and then stump those wondering why something previously stable is only now failing intermittently.
One way around this might be to set the timeout for this command to something excessive, say 24 hours (86400 seconds).
However, if there is something else creating a blockage (a bug or underlying environment issue), this approach may mean that the underlying problem could take longer to get attention.
Using an extended time like this will also lock up the resource slot for the entire time you are waiting, which can create blockages for other work that requires this resource or associated resource pool(s).
One might consider writing code that loops to check whether a task has completed. For example, your code could continue as normal when the request completes in time and, when the request times out, loop and check until it can confirm that the operation completed.
The most common pattern is spawning another procedure and waiting for its run to complete before the caller’s process continues.
The problem with a looping technique is that you are placing a burden onto the system to constantly check the status of the spawned procedure.
These status checks continue to tax the DB. You might therefore add a sleep operation inside the loop so that you only check at set intervals, such as once a minute.
While this can help cut down on the DB query overhead costs, it can also mean waiting the coded sleep time for things to continue after the spawned procedure has already finished, which is also non-ideal.
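To make the trade-off concrete, a polling loop with a sleep might be sketched as follows. This illustrates the pattern rather than recommending it. It assumes the job's status can be read from the /jobs/<jobId>/status property with ectool getProperty; wait_for_job and its interval and attempt-limit arguments are hypothetical names, not part of ectool.

```shell
# Sketch: poll a job's status until it is "completed" or attempts run out.
# Usage: wait_for_job <jobId> [interval_seconds] [max_attempts]
# Assumption: /jobs/<jobId>/status is readable via "ectool getProperty".
wait_for_job() {
  job_id=$1
  interval=${2:-60}
  max_attempts=${3:-30}
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    # Every pass through the loop is another request the server must answer.
    job_status=$(ectool getProperty "/jobs/$job_id/status") || return 2
    if [ "$job_status" = "completed" ]; then
      return 0
    fi
    attempt=$((attempt + 1))
    # The job may finish right after this check; the caller still waits
    # out the rest of the interval before noticing.
    sleep "$interval"
  done
  return 1  # gave up: the job never reported "completed"
}
```

Calling wait_for_job "$jobId" 60 30 would check once a minute for up to 30 minutes, and the calling step's resource stays occupied for that whole time.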
For these reasons, we generally discourage using a self-written looping approach to monitor a request directly.
Rather than keeping a resource locked while looping to check on a request that takes longer than the timeout (3 minutes by default), a better model is to split the desired work into two separate steps. In the example above, where a new procedure is being spawned, have the first step do its pre-work and then kick off the called procedure as its last item of work, storing the resulting jobId. Then give the next step a precondition that waits for the status of that jobId to be “completed”. Your code can then handle each of that job’s possible outcomes: “success”, “warning” or “error”.
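The two-step model might be sketched as a pair of step commands like the ones below. This is a minimal sketch under several assumptions: that ectool runProcedure prints the new job's ID, that the ID is stored in a property named childJobId on the calling job (a hypothetical name), and that the spawned job's outcome can be read from /jobs/<jobId>/outcome. The function names are illustrative only.

```shell
# Step 1 (sketch): launch the long-running procedure and record its job ID.
# Usage: launch_and_record <projectName> <procedureName>
launch_and_record() {
  project=$1
  procedure=$2
  # Assumption: runProcedure prints the job ID of the job it starts.
  child_job_id=$(ectool runProcedure "$project" --procedureName "$procedure") || return 1
  # Store the ID on the calling job so a later step can look it up.
  # "childJobId" is a hypothetical property name.
  ectool setProperty "/myJob/childJobId" "$child_job_id"
}

# Step 2 (sketch): meant to run only once the step's precondition holds,
# then branch on the spawned job's outcome.
handle_child_outcome() {
  child_job_id=$(ectool getProperty "/myJob/childJobId") || return 1
  outcome=$(ectool getProperty "/jobs/$child_job_id/outcome") || return 1
  case "$outcome" in
    success) echo "child job succeeded" ;;
    warning) echo "child job finished with warnings" ;;
    error)   echo "child job failed"; return 1 ;;
    *)       echo "unexpected outcome: $outcome"; return 2 ;;
  esac
}
```

The second step's precondition could then be an expression along the lines of $[/javascript getProperty("/jobs/" + myJob.childJobId + "/status") == "completed"]. Treat that exact expression as an assumption to verify on your server version. Because the server evaluates the precondition, no resource is held while the second step waits.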
If you are not waiting on a procedure to complete, but on some other long-running query (e.g., a very open-ended findObjects query may take the DB an excessive time to collect), you could create a utility procedure responsible for launching such queries through a separate job, and then monitor the result in the same way.
If you need to launch a set of queries, perhaps to calculate a result or produce a report, launching them in parallel could be helpful. In that case, you could use a procedure’s ability to spawn dynamic jobSteps and have a single collector step monitor the set of steps before it starts. The collector could run after all of the parallel steps have completed, or run as part of the parallel block with precondition logic if only some of the steps’ outputs are needed to move forward. The API timeout for each individual data request inside the parallel block may still need to be raised with the --timeout option described above.
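Spawning the parallel query steps might look roughly like the sketch below, run from inside a job step. It assumes ectool createJobStep accepts the --jobStepName, --parallel and --command options as shown (worth confirming with ectool createJobStep --help). spawn_query_steps, run_query.sh and its --timeout argument are hypothetical placeholders for whatever actually performs each query.

```shell
# Sketch: create one parallel dynamic job step per query name.
# Usage: spawn_query_steps <timeout_seconds> <query-name>...
# run_query.sh is a hypothetical script that performs a single query and
# passes the given timeout on to its own ectool call.
spawn_query_steps() {
  timeout_s=$1
  shift
  for query in "$@"; do
    ectool createJobStep \
      --jobStepName "query-$query" \
      --parallel 1 \
      --command "./run_query.sh --timeout $timeout_s $query" || return 1
  done
}
```

For example, spawn_query_steps 600 resources projects would request two parallel steps, each allowing its query up to 10 minutes. A collector step placed after them in the procedure could then gather the results.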