Failed status and backups - auto-recovery

I recently encountered a scenario I had not observed previously. Maybe it is a recent change at MS, but regardless, I saw an opportunity for NMM to help maintain environment health for AVD and Azure VM deployments.

In my case, we had several session hosts that were left in a failed state due to this incident: https://app.azure.com/h/_M5B-9RZ/eb5ea4. Auto-scale tried to bring them online (because the pool exceeded CPU metrics, not because we needed user capacity). They failed to reach a running state because of said incident and were left in a 'Failed' status. NMM and auto-scale left them there because there was no longer a need to bring them online. From that point forward, until they reached either 'Stopped' or 'Running' status, Azure Recovery Services Vault backups were failing. The error indicated the disks were not in a proper state to allow for backups. As soon as I manually powered them on, backups normalized and life was great.

While the incident was notable, this condition is always at risk of happening: any service management issue that results in a failed power-on or create action can cause it. The risk is especially high for SKUs that are in high demand and for high-utilization regions like East US and West US.

In this particular case, 2 session hosts reached this state, and 1 was already powered on by auto-scale the next morning for pre-staging. But I could envision environments with more excess capacity sitting around than mine, where this could have persisted quite a while longer, with no backups being taken and failed backup job alerts piling up.

So my feature request is for NMM to recognize that status and offer an option, which can be enabled, that works with auto-scale to get the session host out of the 'Failed' condition. I would contend that when a host is in a failed state, NMM should try every so often (say every 5 or 15 minutes) to power it back on. I don't think we can actually attempt to deallocate in that condition (I didn't think to capture that datapoint), but if it is an option, all the better. Even after powering it on, auto-scale, if enabled, could bring it back offline fairly quickly if the capacity isn't needed.

Having it tied to auto-scale makes it work at regular intervals. Having it as an option to enable or disable lets the partner choose whether this is something to be concerned about. And for extra brownie points, a notification condition that alerts an appropriate team when this condition occurs would be valuable.
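
To make the request a bit more concrete, here is a rough sketch in Python of the behavior I have in mind. Everything in it is hypothetical: `SessionHost`, `recover_failed_hosts`, the `start_vm` and `notify` callbacks, and the 15-minute default are placeholders of my own, not NMM or Azure APIs. It only illustrates the idea of checking for 'Failed' hosts on each auto-scale pass, retrying a power-on at a set interval, and alerting once when the condition is first seen.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical setting: how long to wait between power-on attempts (seconds).
RETRY_INTERVAL_SECONDS = 15 * 60


@dataclass
class SessionHost:
    name: str
    status: str                            # e.g. "Running", "Stopped", "Failed"
    last_attempt: Optional[float] = None   # time of the last recovery attempt


def recover_failed_hosts(
    hosts: List[SessionHost],
    now: float,
    start_vm: Callable[[str], None],   # placeholder for the real power-on action
    notify: Callable[[str], None],     # placeholder for the real alert action
    enabled: bool = True,              # the partner-controlled on/off option
    retry_interval: float = RETRY_INTERVAL_SECONDS,
) -> List[str]:
    """Run once per auto-scale pass; return the hosts a power-on was tried for."""
    attempted: List[str] = []
    if not enabled:
        return attempted
    for host in hosts:
        if host.status.lower() != "failed":
            continue
        first_seen = host.last_attempt is None
        # Skip hosts that were retried more recently than the interval allows.
        if not first_seen and now - host.last_attempt < retry_interval:
            continue
        if first_seen:
            notify(host.name)  # alert the team once, when the condition is first seen
        host.last_attempt = now
        start_vm(host.name)
        attempted.append(host.name)
    return attempted
```

If the power-on succeeds and the capacity isn't needed, the normal auto-scale logic would then be free to stop the host again, which is all that is needed to get backups working.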

Thank you for the consideration!
