For quite some time now I have followed the best practice shared by Nerdio reps at various points: setting up a notification rule for a blanket 'all' failures condition. This allows a select group of recipients to see what is failing and why. Sometimes it is operator error on the part of the administrator, sometimes it exposes an unexpected problem with a customer that we need to remedy; the list of benefits goes on and on.
In the last month or so I have noticed a significant increase in the number of tasks failing for what appears to be a transient issue with NMM calling the respective Azure API. In an error notification (or in the task logs and Insights downloads) this shows up first as a fairly descriptive error for a subtask (let's use AutoScale as an example), followed by a nondescript portion:
Failed get RAM usage for host mysessionhost1.mycustomer.com. Current values count: 0 count unknown: 0
Error: An error occurred while sending the request.
Support has advised this is a transient error with Microsoft, but so long as the task retries and is successful, we can go about our day.
This seems like a candidate for a backend process improvement: extra retry logic that would cut down on failures in general and cause a task to fail only after sufficient attempts have been made. If this kind of logic is already there, it isn't apparent; I asked Support and they had no insight.
Theory: when this particular common error is encountered (I have collected a few variants), NMM should internally wait and retry for a reasonable amount of time, say wait 30 seconds and retry up to 2 or 3 times. If all of those attempts fail to produce the expected result, then fail the task and the rest of the magic happens.
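To make the idea concrete, here is a minimal sketch of the wait-and-retry behavior described above. It is not how NMM works internally (I have no visibility into that); `TransientApiError`, `call_with_retry`, and the `operation` callable are all hypothetical names made up for illustration, and the 30 second wait and 3 retries are just the numbers from my suggestion.

```python
import time


class TransientApiError(Exception):
    """Hypothetical stand-in for the 'An error occurred while sending the request.' failure."""


def call_with_retry(operation, retries=3, wait_seconds=30, sleep=time.sleep):
    """Run operation(); on a transient error, wait and try again.

    Makes one initial attempt plus up to `retries` retries. Only after the
    last retry fails is the error re-raised, which is the point where the
    task would be marked failed and the notification sent.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except TransientApiError:
            if attempt >= retries:
                # All retries used up: let the failure surface to the caller.
                raise
            attempt += 1
            # Fixed pause before the next attempt (assumed value, not NMM's).
            sleep(wait_seconds)
```

With something like this wrapped around the API call, a hiccup that clears within a minute or two would never reach the notification rule, while a persistent problem would still fail the task as it does today. The `sleep` parameter is only there so the wait can be swapped out when testing.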
When this happens for automatic scaling tasks (Host Pools or Files), it naturally resolves by just trying again at the next scheduled interval. However, the increased frequency lately has me leaning away from just ignoring these and toward asking for better logic.
Sharing here in case (1) others are seeing the increase too, and (2) it just makes sense (unless NMM is already doing this and we just can't see it).