
Task 04: Setup Alerting


DOCUMENT CATEGORY: Runbook
SCOPE: Azure alerting configuration
PURPOSE: Configure proactive alerts for cluster health, performance, and hardware issues
MASTER REFERENCE: Microsoft Learn - Azure Alerts

Status: Active


Primary Alerting Platform

Datadog, integrated with ServiceNow, is the primary alerting platform for Azure Local Cloud deployments of Azure Local. The Azure Monitor alerts documented here provide supplementary alerting for Azure-native metrics and health faults. For primary alerting configuration, see Task 7 - Configure Datadog Integration.

Azure Monitor alerts provide proactive notification when cluster health degrades, performance thresholds are exceeded, or critical events occur. This step configures action groups for notifications and alert rules for common Azure Local scenarios.

Prerequisites

| Requirement | Description | Validation |
|---|---|---|
| Log Analytics Workspace | Data collection active | Queries return data |
| HCI Insights | Enabled (Step 3) | Workbook shows data |
| Notification Targets | Email addresses, Teams webhooks, etc. | Contact list prepared |
| RBAC Permissions | Monitoring Contributor | Role assignment verified |

Variables from variables.yml

| Variable | Config Path | Example |
|---|---|---|
| AZURE_SUBSCRIPTION_ID | azure.subscription.id | 00000000-0000-0000-0000-000000000000 |
| AZURE_SUBSCRIPTION_NAME | azure.subscription.name | Azure Local Production |
| AZURE_RESOURCE_GROUP | azure.resource_group.name | rg-azurelocal-prod-eus2 |
| SITE_CODE | site.code | DAL |
| NOC_EMAIL | alerting.noc_email | noc@contoso.com |
| CUSTOMER_EMAIL | alerting.customer_email | ops@customer.com |
| TEAMS_WEBHOOK_URL | alerting.teams_webhook_url | https://outlook.office.com/webhook/... |
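Assembled from the config paths in the table above, a matching `variables.yml` fragment might look like the following sketch (the key layout is inferred from the paths; verify it against your actual schema):

```yaml
# Sketch of variables.yml derived from the config paths above — verify against your schema
azure:
  subscription:
    id: "00000000-0000-0000-0000-000000000000"
    name: "Azure Local Production"
  resource_group:
    name: "rg-azurelocal-prod-eus2"
site:
  code: "DAL"
alerting:
  noc_email: "noc@contoso.com"
  customer_email: "ops@customer.com"
  teams_webhook_url: "https://outlook.office.com/webhook/..."
```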

Action Groups

Action groups define who is notified, and how, when an alert fires. Create them before creating alert rules.

1. Navigate to Azure Monitor → Alerts → Action groups
2. Click + Create
3. Configure Basics:

   | Setting | Value |
   |---|---|
   | Subscription | {{AZURE_SUBSCRIPTION_NAME}} |
   | Resource Group | {{AZURE_RESOURCE_GROUP}} |
   | Action group name | ag-azl-{{SITE_CODE}}-critical |
   | Display name | AZL Critical |

4. Configure Notifications:

   | Notification Type | Name | Target |
   |---|---|---|
   | Email/SMS/Push/Voice | Azure Local Cloud NOC | {{NOC_EMAIL}} |
   | Email/SMS/Push/Voice | Customer Contact | {{CUSTOMER_EMAIL}} |

5. Configure Actions (optional):

   | Action Type | Name | Configuration |
   |---|---|---|
   | Webhook | Teams Channel | {{TEAMS_WEBHOOK_URL}} |
   | Azure Function | Auto-Remediation | Function App URL |

6. Click Review + create → Create
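The portal steps above can also be scripted. The sketch below uses the `New-AzActionGroupReceiver` and `Set-AzActionGroup` cmdlets from Az.Monitor 4.x; newer module versions replace `Set-AzActionGroup` with `New-AzActionGroup` and different receiver objects, so verify against your installed module before running:

```powershell
# Sketch: create the critical action group via Az.Monitor 4.x
# (cmdlet surface differs in newer Az.Monitor versions — verify before use)
$noc = New-AzActionGroupReceiver -Name "AzureLocalCloudNOC" `
    -EmailReceiver -EmailAddress "{{NOC_EMAIL}}"
$customer = New-AzActionGroupReceiver -Name "CustomerContact" `
    -EmailReceiver -EmailAddress "{{CUSTOMER_EMAIL}}"

# Short name is limited to 12 characters by Azure
Set-AzActionGroup -ResourceGroupName "{{AZURE_RESOURCE_GROUP}}" `
    -Name "ag-azl-{{SITE_CODE}}-critical" `
    -ShortName "AZLCrit" `
    -Receiver $noc, $customer
```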

Cluster Health Alerts

Create alerts based on HCI health events and metrics:

Alert 1: Node Health Critical

```kusto
// Alert when a cluster node reports unhealthy
Event
| where Source == "Microsoft-Windows-SDDC-Management"
| where EventID == 3000
| extend HealthData = parse_json(RenderedDescription)
| where HealthData.HealthState != "Healthy"
| project TimeGenerated, Computer, HealthState = tostring(HealthData.HealthState)
```

| Setting | Value |
|---|---|
| Alert Rule Name | Node Health - Critical |
| Severity | Sev 1 - Critical |
| Evaluation Frequency | Every 5 minutes |
| Lookback Period | 5 minutes |
| Threshold | Greater than 0 |
| Action Group | ag-azl-{{SITE_CODE}}-critical |

Alert 2: Storage Volume Unhealthy

```kusto
// Alert on storage volume health issues
Event
| where Source == "Microsoft-Windows-SDDC-Management"
| where EventID == 3002
| extend VolumeData = parse_json(RenderedDescription)
| where VolumeData.HealthStatus != "Healthy"
| project TimeGenerated, VolumeName = tostring(VolumeData.Name),
    HealthStatus = tostring(VolumeData.HealthStatus)
```

Alert 3: High CPU Utilization

```kusto
// Alert when CPU exceeds threshold
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where InstanceName == "_Total"
| summarize AvgCPU = avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
| where AvgCPU > 90
```

| Setting | Value |
|---|---|
| Alert Rule Name | High CPU Utilization |
| Severity | Sev 2 - Warning |
| Threshold | CPU > 90% for 15 minutes |

Alert 4: Low Available Memory

```kusto
// Alert when available memory drops below threshold
Perf
| where ObjectName == "Memory" and CounterName == "Available Bytes"
| extend AvailableGB = CounterValue / 1073741824
| summarize AvgAvailableGB = avg(AvailableGB) by Computer, bin(TimeGenerated, 5m)
| where AvgAvailableGB < 16 // Alert if less than 16 GB available
```

Create Alert Rule (Portal)

1. Navigate to Azure Monitor → Alerts → Alert rules
2. Click + Create
3. Select Scope: your Log Analytics workspace
4. Configure Condition:
   - Signal type: Custom log search
   - Paste the KQL query
   - Set threshold logic
5. Configure Actions: select the action group
6. Configure Details:
   - Alert rule name
   - Severity
   - Resource group
7. Click Review + create → Create

PowerShell: Create Log Alert Rule

```powershell
# Create a scheduled query rule for node health
$workspace = Get-AzOperationalInsightsWorkspace `
    -ResourceGroupName "{{AZURE_RESOURCE_GROUP}}" `
    -Name "{{LOG_ANALYTICS_WORKSPACE_NAME}}"

$actionGroup = Get-AzActionGroup `
    -ResourceGroupName "{{AZURE_RESOURCE_GROUP}}" `
    -Name "ag-azl-{{SITE_CODE}}-critical"

$query = @"
Event
| where Source == "Microsoft-Windows-SDDC-Management"
| where EventID == 3000
| extend HealthData = parse_json(RenderedDescription)
| where HealthData.HealthState != "Healthy"
| summarize Count = count() by Computer
"@

# Note: Use New-AzScheduledQueryRule for full implementation
Write-Host "Alert rule configuration prepared" -ForegroundColor Cyan
Write-Host "Query: $query"
Write-Host "Action Group: $($actionGroup.Id)"
```
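The full implementation that the note above defers to might be sketched as follows, reusing `$workspace`, `$actionGroup`, and `$query` from the preceding script. Parameter names are taken from recent Az.Monitor releases (3.x+) and should be verified against your installed module version:

```powershell
# Sketch: complete the rule with New-AzScheduledQueryRule (Az.Monitor 3.x+;
# verify cmdlet parameters against your installed module version)
$criterion = New-AzScheduledQueryRuleCriterionObject `
    -Query $query `
    -TimeAggregation "Count" `
    -Operator "GreaterThan" `
    -Threshold 0

New-AzScheduledQueryRule `
    -Name "Node Health - Critical" `
    -ResourceGroupName "{{AZURE_RESOURCE_GROUP}}" `
    -Location $workspace.Location `
    -Scope $workspace.ResourceId `
    -Severity 1 `
    -WindowSize ([TimeSpan]::FromMinutes(5)) `
    -EvaluationFrequency ([TimeSpan]::FromMinutes(5)) `
    -CriterionAllOf $criterion `
    -ActionGroupResourceId $actionGroup.Id
```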
Alert Rules Summary

| Alert Name | Severity | Condition | Action |
|---|---|---|---|
| Node Health Critical | Sev 1 | Node unhealthy event | Email + Teams |
| Storage Volume Unhealthy | Sev 1 | Volume status != Healthy | Email + Teams |
| High CPU Utilization | Sev 2 | CPU > 90% for 15 min | Email |
| Low Available Memory | Sev 2 | Memory < 16 GB | Email |
| VM Failed | Sev 2 | VM state = Failed | Email |
| Storage Latency High | Sev 3 | Write latency > 100 ms | Email |
| Arc Connectivity Lost | Sev 1 | No heartbeat > 30 min | Email + Teams |

Validation

Test Alert Rule

  1. Navigate to Azure Monitor → Alerts → Alert rules
  2. Select your alert rule
  3. Click Test action group to verify notifications work

Check Alert History

```powershell
# Get recent alerts (requires the Az.AlertsManagement module)
Get-AzAlert -ResourceGroupName "{{AZURE_RESOURCE_GROUP}}" `
    -TimeRange "1d" | # Last 24 hours; valid values include 1h, 1d, 7d, 30d
    Select-Object Name, Severity, MonitorCondition, ResolvedTime |
    Format-Table -AutoSize
```

Verify Action Group

```kusto
// Check for alert notifications in Activity Log
AzureActivity
| where CategoryValue == "Alert"
| where TimeGenerated > ago(24h)
| project TimeGenerated, OperationName, Status, Description
| order by TimeGenerated desc
```

Troubleshooting

| Issue | Possible Cause | Resolution |
|---|---|---|
| No alerts firing | Query returns no results | Test query in Log Analytics |
| Email not received | Email in spam | Check spam folder, add to safe senders |
| Action group error | Invalid webhook URL | Verify Teams/webhook URL format |
| Alert fires repeatedly | Threshold too sensitive | Adjust threshold or aggregation period |
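For the "no alerts firing" case in the table above, a quick sanity check is to run the alert's source query in Log Analytics with the health filter removed, confirming that SDDC events are arriving at all. For example:

```kusto
// Sanity check: are SDDC management events reaching the workspace at all?
Event
| where Source == "Microsoft-Windows-SDDC-Management"
| where TimeGenerated > ago(24h)
| summarize Events = count() by EventID
| order by Events desc
```

If this returns no rows, the problem is data collection (Step 3), not the alert rule itself.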

Variables Reference

| Variable | Description | Example |
|---|---|---|
| {{NOC_EMAIL}} | Azure Local Cloud NOC email | noc@azurelocal.cloud |
| {{CUSTOMER_EMAIL}} | Customer contact | admin@customer.com |
| {{TEAMS_WEBHOOK_URL}} | Teams channel webhook | https://outlook.office.com/webhook/... |

Next Steps

After setting up alerting:

  1. ➡️ Task 5: Deploy OMIMSWAC Monitoring — Hardware monitoring
  2. Review alert rules weekly and adjust thresholds as needed
  3. Create runbooks for common alert responses
  4. Document escalation procedures for critical alerts


| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0.0 | 2026-03-24 | Azure Local Cloudnology Team | Initial release |