If this has been considered previously, please don't hesitate to point me to that discussion and its outcome.
In a recent Azure Files storage performance/scaling incident, we learned some valuable lessons that gave our team a real-life example of the challenges FSLogix and Azure Storage dance around. What I've realized is that Auto-Scale looks at consumed and total space quota (good) and latency (good), but does not consider IOPS. Maybe there is a reason why, but in our case, had Auto-Scale been monitoring IOPS (baseline IOPS are a simple ratio of 1 IOPS/GB of provisioned quota), it could have seen the spike, scaled up until the IOPS were within spec, and then scaled back in according to all the appropriate rules.
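To make the idea concrete, here is a rough sketch of the arithmetic an IOPS-aware rule might use. It assumes the 1 IOPS/GB baseline ratio described above; the function name, the 20% headroom, and the quota ceiling are all hypothetical placeholders, not anything NMM or Azure exposes today.

```python
def required_quota_gib(observed_iops, current_quota_gib,
                       iops_per_gib=1, headroom_pct=20,
                       max_quota_gib=102400):
    """Return the quota (GiB) whose baseline IOPS cover the observed IOPS.

    Assumptions (hypothetical, for illustration only):
      - baseline IOPS = iops_per_gib * provisioned quota, per the post
      - headroom_pct adds a safety margin on top of the observed IOPS
      - max_quota_gib is a placeholder ceiling
    This only ever scales up; scaling back in is left to the existing rules.
    """
    numerator = observed_iops * (100 + headroom_pct)
    denominator = 100 * iops_per_gib
    # Integer ceiling division, so we never provision a fraction short.
    needed = -(-numerator // denominator)
    return min(max(current_quota_gib, needed), max_quota_gib)
```

For example, a 100 GB share seeing a sustained 500 IOPS would be resized to 600 GB under these assumptions, while a share seeing only 50 IOPS would be left alone.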
Available burst IOPS helped contain the negative effects, but only for so long. In our case, the key indicator was the rapid detach/reattach log entries from FSLogix in the Insights data; that is what set us on the path to the right fix, so I would add it to the mix as well. Surfacing this as an indicator of Azure Files storage health would be extra beneficial! Some arbitrary (or configurable) threshold of detach/reattach events in an evaluation window like 30 minutes could be useful. There may be other FSLogix events that should trigger a scale-up, but this is the experience I have in front of me.
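As a sketch of what that check could look like: given the timestamps of detach/reattach events (assumed to be already pulled out of the FSLogix log data), flag the share when any 30-minute window contains at least some number of events. The function name and the default threshold of 5 are made up for illustration; the right value would need testing.

```python
from datetime import datetime, timedelta


def detach_storm_detected(event_times, threshold=5,
                          window=timedelta(minutes=30)):
    """Return True if any sliding window holds >= threshold events.

    event_times: iterable of datetime objects, one per detach/reattach
    event. How they are extracted from the logs is not modeled here.
    threshold and window are hypothetical, configurable defaults.
    """
    times = sorted(event_times)
    start = 0
    for end, current in enumerate(times):
        # Drop events that have fallen out of the evaluation window.
        while current - times[start] > window:
            start += 1
        if end - start + 1 >= threshold:
            return True
    return False
```

Five events spread over eight minutes would trip this check; five events spread an hour apart would not.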
I recognize it's possible that this could be notably ‘expensive’ if a login storm or other such ‘normal’ activity spikes IOPS and triggers a scale-up that is not really necessary and cannot be brought back down for another 24 hours. Thorough testing should build confidence that a default threshold, like the one used for latency, would work. And to be fair, this event happened because auto-scale was not properly configured on our side. I would contend that the most value here likely comes from smaller environments that might not scale past the minimum 100GB for the Azure Files Premium tier, where latency on its own may be insufficient to trigger a needed scale-up.
Bonus points for making it an optional metric. Off the cuff, I suspect this would be valuable for ensuring a quality user experience for smaller deployments, or for initial deployments where the initial quota setting can be a bit of a mystery. We trust in auto-scale to keep us ahead of the curve, and this enhancement would be another step forward in maximizing the value of NMM.