Hello first time poster long time listener,
TL;DR: If user load < scaling logic, put the servers with the fewest connections into drain mode, to avoid users getting sent to a server and keeping it online longer than needed.
I've heard through our sales engineer that this feature may be in the enterprise version, but I wanted to highlight the reasoning behind why it's important.
An example setup:
40 users
Scale-in restriction set to Low to avoid kicking people off
4 AVD servers, with auto-scaling set to scale out to a total of 3 servers in the morning and leave one running at all times – this server is SessionHost-1
Host Pool set to Breadth-First
Inactivity timer: 2 hours idle before disconnect, then 2 hours disconnected before logoff (coming from RDS we can't really change that, as people want to disconnect and reconnect without losing work)
Now a brief description of Breadth First load balancing in AVD: The breadth-first algorithm first queries session hosts in a host pool that allow new connections. The algorithm then selects a session host randomly from half the set of available session hosts with the fewest sessions. For example, if there are nine session hosts with 11, 12, 13, 14, 15, 16, 17, 18, and 19 sessions, a new session doesn't automatically go to the session host with the fewest sessions. Instead, it can go to any of the first five session hosts with the fewest sessions at random. Due to the randomization, some sessions may not be evenly distributed across all session hosts.
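To make the randomness concrete, here's a minimal Python sketch of the selection rule described above (host names and session counts are just the example numbers from the docs, not anything real):

```python
import math
import random

def breadth_first_pick(session_counts):
    """Sketch of breadth-first selection: rank hosts by session count,
    then pick at random from the half of hosts with the fewest sessions."""
    # Sort (host, sessions) pairs ascending by session count.
    ranked = sorted(session_counts.items(), key=lambda kv: kv[1])
    # Candidate pool: half of the hosts, rounded up (9 hosts -> first 5,
    # matching the example in the docs).
    pool = ranked[: math.ceil(len(ranked) / 2)]
    return random.choice(pool)[0]

# The nine-host example from the description above.
hosts = {f"SessionHost-{i + 1}": n
         for i, n in enumerate([11, 12, 13, 14, 15, 16, 17, 18, 19])}
print(breadth_first_pick(hosts))  # one of SessionHost-1 through SessionHost-5
```

Note the new session never lands on SessionHost-6 through SessionHost-9, but within the bottom five it's pure chance, which is exactly what makes scale-in timing unpredictable.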
Found here: Host pool load balancing algorithms in Azure Virtual Desktop - Azure | Microsoft Learn
Problem: At the end of the day, Person 1 (connected to SessionHost-1) and Person 2 (connected to SessionHost-2) are working late. Person 2 leaves for the day but doesn't log off; they just close the Remote Desktop client. This starts the 2-hour timer before they're fully logged off, after which the scale-in restriction, when set to Low, will power off that server.
During that 2-hour window, Person 1 has logged off, but since they were on SessionHost-1 it will always remain online. Now you have 2 session hosts powered on: SessionHost-1 with 0 connections and SessionHost-2 with 1 disconnected user. Person 3 then decides to log in and gets load balanced to SessionHost-2. If Person 3 does the same thing and closes out at the end of the day, and Person 4, Person 5, and Person 6 do some late work, you have 2 servers up when you only need 1 server for that load. This problem only gets worse the smaller your host pool is, keeping session hosts online longer, because the breadth-first algorithm picks at random from the half of the available session hosts with the fewest sessions.
Solution: Barring any scale-in restrictions, if your user count < scaling logic but servers remain online due to disconnected or connected users, put those servers in drain mode so no new sessions can be sent to them (unless the other servers are at their maximum number of users). This would remove the problem we see where session hosts that should be scaled in stay online all night, because new users would never hit those servers while they're in drain mode. It would be nice not to have to rely on randomness for those users to be sent to the server that is set to always remain online.
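Roughly, the logic I'm asking for looks like this. This is just a sketch in Python with made-up names (the host dict, the `users_per_host` capacity threshold, the always-on host); the real feature would of course use the actual scaling plan and the existing drain mode setting:

```python
import math

def hosts_to_drain(hosts, users_per_host, always_on="SessionHost-1"):
    """Return which session hosts should be put into drain mode.

    hosts: dict of host name -> current session count (active + disconnected).
    users_per_host: how many users one host is expected to absorb
    (i.e. the scaling logic's capacity threshold).
    """
    total_users = sum(hosts.values())
    # How many hosts the current load actually needs (at least the always-on one).
    needed = max(1, math.ceil(total_users / users_per_host))
    surplus = len(hosts) - needed
    if surplus <= 0:
        return []
    # Drain the surplus hosts with the fewest sessions, never the always-on host.
    # Drain mode blocks new sessions but still lets disconnected users reconnect.
    candidates = sorted((h for h in hosts if h != always_on), key=hosts.get)
    return candidates[:surplus]

# The scenario above: SessionHost-1 empty, SessionHost-2 has one disconnected user.
print(hosts_to_drain({"SessionHost-1": 0, "SessionHost-2": 1}, users_per_host=20))
# -> ['SessionHost-2']: new logins go to SessionHost-1, and SessionHost-2
#    can scale in once its disconnected session times out.
```

With SessionHost-2 in drain mode, Persons 3 through 6 all land on SessionHost-1, and SessionHost-2 powers off as soon as Person 2's disconnected session hits the logoff timer, instead of collecting new users all night.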
I know there are other ways we could set up the scaling logic to avoid this, like using depth-first, but I'd argue that's a much more complicated solution and worse for the user experience, since it jams the maximum number of users onto a single server.
Thanks if you made it this far