Windows Service Monitoring at Scale using Cloud Native Azure Components

Recently, I was challenged to build a scalable, cloud-native solution for monitoring critical Windows services on both Azure native and Azure Arc servers. The solution should scale to more than 1,000 servers.

Requirements included a server maintenance mode (true/false); mail and SMS notifications when a service stops and when it is running again; a run every 5 minutes; easy management using tags; and integration with Azure Update Manager maintenance configurations for patching (pre/post events).

This blog post shows how I built it using Azure tags, Kusto queries against Azure Resource Graph, and Azure Log Analytics Change Tracking data collected with Azure Monitor Agent through an Azure Data Collection Rule, with everything wrapped into an Azure Logic App.

Content

Output: Service is stopped

Output: Service is running

Overview: Azure Logic Apps Engine

Tags used on Azure Native & Azure Arc Servers

| TagName | TagValues | Purpose |
| --- | --- | --- |
| MonitorMaintenanceModeActive | FALSE or TRUE | Set the value to TRUE to put a server into maintenance mode. No alerts will arrive until the value is changed back to FALSE. You can also control this tag value using Azure Update Manager pre/post events. |
| MonitorWindowsServices | Windows service names, separated by , (comma). Sample: IISADMIN,Dfs,DFSR,DNS,ADSync,AADConnectProvisioningAgent | Lists the services that should be monitored |
| MonitorContactMail | Mail address like mok@mortenknudsen.net | Address of the IT alerts group that should receive notifications |
| MonitorContactSMS | Mobile number like +45 12345678 | Mobile number of the recipient |
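To make the tag semantics concrete, here is a minimal Python sketch of how the four tags on one server could be interpreted. The helper name parse_monitor_tags and the returned keys are hypothetical and not part of the solution; the sketch also trims whitespace around service names, which the Kusto queries in this post do not do.

```python
def parse_monitor_tags(tags):
    """Interpret the monitoring tags of one server (hypothetical helper)."""
    raw_services = tags.get("MonitorWindowsServices", "")
    # Comma-separated list of service names; empty entries are dropped.
    services = [s.strip() for s in raw_services.split(",") if s.strip()]
    # Alerts are only sent when the tag explicitly says FALSE (case-insensitive),
    # so a missing tag means the server is not alerted on.
    mode = tags.get("MonitorMaintenanceModeActive", "")
    alerts_enabled = mode.strip().upper() == "FALSE"
    return {
        "services": services,
        "alerts_enabled": alerts_enabled,
        "contact_mail": tags.get("MonitorContactMail", ""),
        "contact_sms": tags.get("MonitorContactSMS", ""),
    }


if __name__ == "__main__":
    example = {
        "MonitorMaintenanceModeActive": "FALSE",
        "MonitorWindowsServices": "IISADMIN,Dfs,DFSR",
    }
    print(parse_monitor_tags(example))
```

Treating anything other than FALSE as "do not alert" mirrors the filter on MonitorMaintenanceModeActive used in the queries later in this post.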

Azure Logic App configuration

The Logic App consists of the following steps:

| Step # | Step | Action |
| --- | --- | --- |
| 1 | Recurrence | Runs every 5 min |
| 2 | Query stopped/started (list output format) | Checks if any service in scope has stopped (or started) during the last 5 min. |
| 3 | Condition (output) from step 2 | If there is output, it continues to step 4 |
| 4 | Query stopped/started (HTML output format) | Checks if any service in scope has stopped (or started) during the last 5 min. The output is 100% the same as in step 2, but it is built as HTML with tables. |
| 5 | Send mail | Mail is sent to the address in the variable ContactMail |
| 6 | Send SMS (optional) | SMS is sent to the number in the variable ContactSMS |
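The control flow of steps 2 to 6 can be sketched in Python as follows. This is only an illustration of the order of operations, not the Logic App itself: run_once and the four callables passed to it are hypothetical names, and sending one message per distinct contact is an assumption of the sketch.

```python
def run_once(query_list, query_html, send_mail, send_sms=None):
    """One scheduled run (step 1 is the 5-minute recurrence that calls this)."""
    rows = query_list()                      # step 2: list output format
    if not rows:                             # step 3: stop when there is no output
        return False
    html = query_html()                      # step 4: same data as an HTML table
    mails = sorted({r["ContactMail"] for r in rows if r.get("ContactMail")})
    for address in mails:                    # step 5: send mail
        send_mail(address, html)
    if send_sms is not None:                 # step 6: optional SMS
        numbers = sorted({r["ContactSMS"] for r in rows if r.get("ContactSMS")})
        text = f"{len(rows)} Windows service state change(s) detected"
        for number in numbers:
            send_sms(number, text)
    return True


if __name__ == "__main__":
    rows = [{"ContactMail": "ops@example.com", "ContactSMS": ""}]
    run_once(lambda: rows, lambda: "<table></table>",
             lambda to, body: print("mail to", to))
```

Running the HTML query only after the list query returned rows keeps the more expensive formatting step out of the common case where nothing has changed.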

Step 1 – Recurrence

Step 2 – Query stopped/started (list output format)

Step 2 – Query – Detect Windows Service Stopped

arg("").resources
| where type in ('microsoft.compute/virtualmachines','microsoft.hybridcompute/machines')
| extend ContactMail = tostring(tags["MonitorContactMail"])
| extend ContactSMS = tostring(tags["MonitorContactSMS"])
| extend MonitorMaintenanceModeActive = tostring(tags["MonitorMaintenanceModeActive"])
| extend PatchGroup = tostring(tags["PatchGroup"])
| extend PatchActive = tostring(tags["PatchActive"])
| mv-expand tags
| extend Name = toupper(name)
| extend tagKey = tostring(bag_keys(tags)[0])
| extend tagValue = split(tostring(tags[tagKey]), ",")
| extend tagValueStr = tostring(tags[tagKey])
| where tagKey in ("MonitorContactMail","MonitorContactSMS","MonitorMaintenanceModeActive","MonitorWindowsServices","PatchActive","PatchGroup","Environment")
| project Name, tagKey, tagValue, tagValueStr, type, location, resourceGroup, subscriptionId, id, ContactMail, ContactSMS, MonitorMaintenanceModeActive, PatchGroup, PatchActive
| join kind=inner (
    ConfigurationChange
    | extend Name = toupper(Computer)
) on $left.Name == $right.Name
| where TimeGenerated > ago(5m)
| where MonitorMaintenanceModeActive =~ "FALSE"
| where ConfigChangeType == "WindowsServices"
| where SvcChangeType == "State"
| where SvcState == "Stopped"
| where SvcPreviousState == "Running"
| where tagKey =~ "MonitorWindowsServices"
| where SvcStartupType == "Auto"
| sort by TimeGenerated desc
| mv-apply tagValue on (
    where SvcName == tostring(tagValue)
)
| project Name, TimeGenerated, SvcName, SvcDisplayName, SvcState, SvcPreviousState, SvcStartupType, tagKey, tagValue, tagValueStr, ContactMail, ContactSMS, MonitorMaintenanceModeActive, PatchGroup, PatchActive

Step 2 – Query – Detect Windows Service Running

arg("").resources
| where type in ('microsoft.compute/virtualmachines','microsoft.hybridcompute/machines')
| extend ContactMail = tostring(tags["MonitorContactMail"])
| extend ContactSMS = tostring(tags["MonitorContactSMS"])
| extend MonitorMaintenanceModeActive = tostring(tags["MonitorMaintenanceModeActive"])
| extend PatchGroup = tostring(tags["PatchGroup"])
| extend PatchActive = tostring(tags["PatchActive"])
| mv-expand tags
| extend Name = toupper(name)
| extend tagKey = tostring(bag_keys(tags)[0])
| extend tagValue = split(tostring(tags[tagKey]), ",")
| extend tagValueStr = tostring(tags[tagKey])
| where tagKey in ("MonitorContactMail","MonitorContactSMS","MonitorMaintenanceModeActive","MonitorWindowsServices","PatchActive","PatchGroup","Environment")
| project Name, tagKey, tagValue, tagValueStr, type, location, resourceGroup, subscriptionId, id, ContactMail, ContactSMS, MonitorMaintenanceModeActive, PatchGroup, PatchActive
| join kind=inner (
    ConfigurationChange
    | extend Name = toupper(Computer)
) on $left.Name == $right.Name
| where TimeGenerated > ago(5m)
| where MonitorMaintenanceModeActive =~ "FALSE"
| where ConfigChangeType == "WindowsServices"
| where SvcChangeType == "State"
| where SvcState == "Running"
| where SvcPreviousState == "Stopped"
| where tagKey =~ "MonitorWindowsServices"
| where SvcStartupType == "Auto"
| sort by TimeGenerated desc
| mv-apply tagValue on (
    where SvcName == tostring(tagValue)
)
| project Name, TimeGenerated, SvcName, SvcDisplayName, SvcState, SvcPreviousState, SvcStartupType, tagKey, tagValue, tagValueStr, ContactMail, ContactSMS, MonitorMaintenanceModeActive, PatchGroup, PatchActive
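The two queries above differ only in the state transition they look for. The following Python sketch restates their filter logic on in-memory data so the conditions are easier to follow. It is an illustration under assumptions: matching_changes is a hypothetical function, the sample records are invented, and the 5-minute time filter and the final sort are left out.

```python
def matching_changes(changes, servers, new_state, previous_state):
    """Return change records for monitored services that made the given transition."""
    # Index servers by upper-cased name, like toupper(name) in the query.
    tags_by_name = {s["name"].upper(): s["tags"] for s in servers}
    hits = []
    for c in changes:
        tags = tags_by_name.get(c["Computer"].upper())
        if tags is None:
            continue  # inner join: no tagged server with this name
        if tags.get("MonitorMaintenanceModeActive", "").lower() != "false":
            continue  # server is in maintenance mode or the tag is missing
        if c["ConfigChangeType"] != "WindowsServices" or c["SvcChangeType"] != "State":
            continue
        if c["SvcState"] != new_state or c["SvcPreviousState"] != previous_state:
            continue
        if c["SvcStartupType"] != "Auto":
            continue
        # Only services listed in the MonitorWindowsServices tag are in scope.
        if c["SvcName"] in tags.get("MonitorWindowsServices", "").split(","):
            hits.append(c)
    return hits


if __name__ == "__main__":
    servers = [{"name": "srv01", "tags": {
        "MonitorMaintenanceModeActive": "FALSE",
        "MonitorWindowsServices": "DNS,DFSR"}}]
    changes = [{"Computer": "SRV01", "ConfigChangeType": "WindowsServices",
                "SvcChangeType": "State", "SvcName": "DNS", "SvcState": "Stopped",
                "SvcPreviousState": "Running", "SvcStartupType": "Auto"}]
    print(matching_changes(changes, servers, "Stopped", "Running"))
```

Calling it with ("Stopped", "Running") corresponds to the first query, and ("Running", "Stopped") to the second.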

Step 3 – Conditions

Validate code

"type": "If",
"expression": {
  "and": [
    {
      "equals": [
        "@length(body('Detect_Windows_Service_Stopped_-_LIST')?['value'])",
        0
      ]
    }
  ]
}
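Read literally, the expression above is true when the list query returned zero rows. A minimal Python equivalent, assuming the query action's body is a dictionary with a value list (no_changes_detected is a hypothetical name), looks like this:

```python
def no_changes_detected(query_body):
    """True when the 'value' list of the query result is empty or missing."""
    # Treating a missing 'value' key as empty is an assumption of this sketch.
    return len(query_body.get("value", [])) == 0


if __name__ == "__main__":
    print(no_changes_detected({"value": []}))
    print(no_changes_detected({"value": [{"SvcName": "DNS"}]}))
```

The notification steps therefore belong on the path where this check evaluates to false, that is, when at least one row came back.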

Step 4 – Query stopped/started (HTML output format)

Use the same queries as in step 2 (stopped/running).

Step 5 – Send Mail

Final layout

Validate that Azure Logic Apps runs every 5 min

Azure Logic Apps (Code View)


Azure Data Collection Rules for Change Tracking Collection

Windows service data is collected every 60 seconds. You can control this frequency by adjusting the serviceCollectionFrequency value in the data collection rule.

"servicesSettings": {
  "serviceCollectionFrequency": 60

Don't forget to associate the data collection rule with the servers from which you want to collect Windows service Change Tracking information.

Azure Data Collection Rules Template (JSON)

Azure Data Collection Rules Template (JSON). Used with Custom Deployment

Azure LogAnalytics Workspace for Change Tracking
