Today, my main Policy Baseline is applied to ~100 tenants and has ~100 policies. It currently takes over an hour to run, and I believe it is scheduled hourly, so it is effectively always running. It also appears that the "Republish assignments in baseline" job has no parallelization and processes its tasks one at a time, which means a single job has roughly 10,000 tasks (~100 tenants x ~100 policies). Furthermore, the entire job fails if any one task fails for any reason. Yesterday, that job ran 19 times and failed 7 times. Typically, the run following a failed execution completes successfully, but each failure means that every policy not yet reached in that run is left unprocessed.
My feature request/proposal is to revamp how these jobs are executed. For example, have the "Republish assignments in baseline" job kick off an independent child job for each tenant and allow those child jobs to run in parallel. Even with a cap on how many tenants are processed in parallel, this should shorten the time to completion, and it would also allow policies to apply to other tenants when a specific tenant has errors. Additionally, a task that fails within the job should not abort the entire job. If there is no retry logic today, I would love to see it implemented. If a task still fails after retrying, the job status should show an error, but exiting the whole job because of one policy seems counterintuitive. If one policy has an issue, the other policies should still be applied/checked/managed. This would make the solution more robust and allow policy changes to reach tenants much more quickly.
This same logic could be applied to other jobs at the MSP level, like Solutions Baseline, but in my experience, the Policy Baselines are the most impacted.
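
To illustrate the proposal, here is a rough Python sketch of the execution model described above: one child job per tenant, a cap on how many run in parallel, per-task retries, and failures that are recorded instead of aborting the run. This is not how the product is implemented. Every name here (`republish_baseline`, `republish_tenant`, `apply_with_retry`, `apply_policy`, `max_parallel`, `attempts`) is hypothetical and only stands in for whatever the platform does internally.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def apply_with_retry(apply_policy, tenant, policy, attempts=3, delay=0.0):
    """Try one policy on one tenant. Return None on success, else the last error."""
    last_error = None
    for attempt in range(attempts):
        try:
            apply_policy(tenant, policy)  # hypothetical callable supplied by the caller
            return None
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)  # assumption: a real system would likely back off here
    return last_error


def republish_tenant(apply_policy, tenant, policies, attempts=3):
    """Child job: process every policy for one tenant and collect failures."""
    failures = {}
    for policy in policies:
        error = apply_with_retry(apply_policy, tenant, policy, attempts)
        if error is not None:
            # Record the failure and keep going with the remaining policies.
            failures[policy] = str(error)
    return failures


def republish_baseline(apply_policy, tenants, policies, max_parallel=10, attempts=3):
    """Parent job: one child job per tenant, at most max_parallel at a time."""
    tenants = list(tenants)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = pool.map(
            lambda tenant: republish_tenant(apply_policy, tenant, policies, attempts),
            tenants,
        )
        report = dict(zip(tenants, results))
    # The job still reports an error if anything failed, but only after
    # every tenant and every policy has been attempted.
    status = "error" if any(report.values()) else "success"
    return status, report
```

With this shape, a failing policy on one tenant shows up in `report` under that tenant, the overall `status` is `"error"`, and every other tenant and policy is still processed.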


