Task 04: Setup Alerting
DOCUMENT CATEGORY: Runbook
SCOPE: Azure alerting configuration
PURPOSE: Configure proactive alerts for cluster health, performance, and hardware issues
MASTER REFERENCE: Microsoft Learn - Azure Alerts
Status: Active
Datadog is the primary alerting platform for Azure Local deployments, with ServiceNow integration. The Azure Monitor alerts documented here provide supplementary alerting for Azure-native metrics and health faults. For primary alerting configuration, see Task 07: Configure Datadog Integration.
Azure Monitor alerts provide proactive notification when cluster health degrades, performance thresholds are exceeded, or critical events occur. This step configures action groups for notifications and alert rules for common Azure Local scenarios.
Prerequisites
| Requirement | Description | Validation |
|---|---|---|
| Log Analytics Workspace | Data collection active | Queries return data |
| HCI Insights | Enabled (Task 03) | Workbook shows data |
| Notification Targets | Email addresses, Teams webhooks, etc. | Contact list prepared |
| RBAC Permissions | Monitoring Contributor | Role assignment verified |
Variables from variables.yml
| Variable | Config Path | Example |
|---|---|---|
| AZURE_SUBSCRIPTION_ID | azure.subscription.id | 00000000-0000-0000-0000-000000000000 |
| AZURE_SUBSCRIPTION_NAME | azure.subscription.name | Azure Local Production |
| AZURE_RESOURCE_GROUP | azure.resource_group.name | rg-azurelocal-prod-eus2 |
| SITE_CODE | site.code | DAL |
| NOC_EMAIL | alerting.noc_email | noc@contoso.com |
| CUSTOMER_EMAIL | alerting.customer_email | ops@customer.com |
| TEAMS_WEBHOOK_URL | alerting.teams_webhook_url | https://outlook.office.com/webhook/... |
Action Groups
Action groups define who to notify and how when alerts fire. Create them before creating alert rules.
Action groups can be created by any of three methods: the Azure Portal, a direct script run on a cluster node, or a standalone script run from a management workstation.
- Navigate to Azure Monitor → Alerts → Action groups
- Click + Create
- Configure Basics:
| Setting | Value |
|---|---|
| Subscription | {{AZURE_SUBSCRIPTION_NAME}} |
| Resource Group | {{AZURE_RESOURCE_GROUP}} |
| Action group name | ag-azl-{{SITE_CODE}}-critical |
| Display name | AZL Critical |
- Configure Notifications:
| Notification Type | Name | Target |
|---|---|---|
| Email/SMS/Push/Voice | Azure Local Cloud NOC | {{NOC_EMAIL}} |
| Email/SMS/Push/Voice | Customer Contact | {{CUSTOMER_EMAIL}} |
- Configure Actions (optional):
| Action Type | Name | Configuration |
|---|---|---|
| Webhook | Teams Channel | {{TEAMS_WEBHOOK_URL}} |
| Azure Function | Auto-Remediation | Function App URL |
- Click Review + create → Create
# Variables
SUBSCRIPTION_ID="{{AZURE_SUBSCRIPTION_ID}}"
RESOURCE_GROUP="{{AZURE_RESOURCE_GROUP}}"
SITE_CODE="{{SITE_CODE}}"
NOC_EMAIL="{{NOC_EMAIL}}"
# Create Action Group for Critical Alerts
az monitor action-group create \
--resource-group "$RESOURCE_GROUP" \
--name "ag-azl-$SITE_CODE-critical" \
--short-name "AZLCrit" \
--action email noc-email "$NOC_EMAIL" \
--tags Environment=Production Application=AzureLocal
# Create Action Group for Warning Alerts
az monitor action-group create \
--resource-group "$RESOURCE_GROUP" \
--name "ag-azl-$SITE_CODE-warning" \
--short-name "AZLWarn" \
--action email noc-email "$NOC_EMAIL"
echo "✅ Action groups created"
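To confirm both groups exist and carry the expected email receivers, a quick check with `az monitor action-group show` can follow the script above (a sketch reusing the `RESOURCE_GROUP` and `SITE_CODE` variables already set):

```shell
# Verify both action groups and list their email receivers
for SUFFIX in critical warning; do
  az monitor action-group show \
    --resource-group "$RESOURCE_GROUP" \
    --name "ag-azl-$SITE_CODE-$SUFFIX" \
    --query "{name:name, enabled:enabled, emails:emailReceivers[].emailAddress}" \
    --output table
done
```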
#Requires -Modules Az.Monitor
# Variables
$SubscriptionId = "{{AZURE_SUBSCRIPTION_ID}}"
$ResourceGroup = "{{AZURE_RESOURCE_GROUP}}"
$SiteCode = "{{SITE_CODE}}"
$NocEmail = "{{NOC_EMAIL}}"
$CustomerEmail = "{{CUSTOMER_EMAIL}}"
# Connect to Azure
Connect-AzAccount -Subscription $SubscriptionId
# Create email receivers
$nocReceiver = New-AzActionGroupReceiver `
-Name "Azure Local Cloud-NOC" `
-EmailReceiver `
-EmailAddress $NocEmail
$customerReceiver = New-AzActionGroupReceiver `
-Name "Customer-Contact" `
-EmailReceiver `
-EmailAddress $CustomerEmail
# Create Critical Action Group
$criticalAG = Set-AzActionGroup `
-ResourceGroupName $ResourceGroup `
-Name "ag-azl-$SiteCode-critical" `
-ShortName "AZLCrit" `
-Receiver $nocReceiver, $customerReceiver `
-Tag @{
Environment = "Production"
Application = "AzureLocal"
Severity = "Critical"
}
Write-Host "✅ Action Group created: $($criticalAG.Id)" -ForegroundColor Green
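As a sanity check after the script above, `Get-AzActionGroup` can confirm the group and its receivers (a sketch assuming the Az.Monitor module; the exact output property names can vary between module versions):

```powershell
# Retrieve the action group just created and list its email receivers
$ag = Get-AzActionGroup -ResourceGroupName $ResourceGroup -Name "ag-azl-$SiteCode-critical"
$ag.EmailReceiver | Select-Object Name, EmailAddress | Format-Table -AutoSize
```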
Recommended Alert Rules
Cluster Health Alerts
Create alerts based on HCI health events and metrics:
Alert 1: Node Health Critical
// Alert when a cluster node reports unhealthy
Event
| where Source == "Microsoft-Windows-SDDC-Management"
| where EventID == 3000
| extend HealthData = parse_json(RenderedDescription)
| where HealthData.HealthState != "Healthy"
| project TimeGenerated, Computer, HealthState = tostring(HealthData.HealthState)
| Setting | Value |
|---|---|
| Alert Rule Name | Node Health - Critical |
| Severity | Sev 1 - Critical |
| Evaluation Frequency | Every 5 minutes |
| Lookback Period | 5 minutes |
| Threshold | Greater than 0 |
| Action Group | ag-azl-{{SITE_CODE}}-critical |
Alert 2: Storage Volume Unhealthy
// Alert on storage volume health issues
Event
| where Source == "Microsoft-Windows-SDDC-Management"
| where EventID == 3002
| extend VolumeData = parse_json(RenderedDescription)
| where VolumeData.HealthStatus != "Healthy"
| project TimeGenerated, VolumeName = tostring(VolumeData.Name),
HealthStatus = tostring(VolumeData.HealthStatus)
Alert 3: High CPU Utilization
// Alert when CPU exceeds threshold
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where InstanceName == "_Total"
| summarize AvgCPU = avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
| where AvgCPU > 90
| Setting | Value |
|---|---|
| Alert Rule Name | High CPU Utilization |
| Severity | Sev 2 - Warning |
| Threshold | CPU > 90% for 15 minutes |
Alert 4: Low Available Memory
// Alert when available memory drops below threshold
Perf
| where ObjectName == "Memory" and CounterName == "Available Bytes"
| extend AvailableGB = CounterValue / 1073741824
| summarize AvgAvailableGB = avg(AvailableGB) by Computer, bin(TimeGenerated, 5m)
| where AvgAvailableGB < 16 // Alert if less than 16 GB available
Create Alert Rule (Portal)
- Navigate to Azure Monitor → Alerts → Alert rules
- Click + Create
- Select Scope: Your Log Analytics workspace
- Configure Condition:
- Signal type: Custom log search
- Paste the KQL query
- Set threshold logic
- Configure Actions: Select action group
- Configure Details:
- Alert rule name
- Severity
- Resource group
- Click Review + create → Create
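The portal steps above can also be scripted with the Azure CLI `scheduled-query` command group (a hedged sketch: the extension installs on first use, the condition grammar can vary by version, and `WORKSPACE_ID` and `ACTION_GROUP_ID` are assumed to hold the full resource IDs of the workspace and action group):

```shell
# Create the Node Health alert rule against the Log Analytics workspace.
# WORKSPACE_ID and ACTION_GROUP_ID are placeholders for full Azure resource IDs.
az monitor scheduled-query create \
  --name "Node Health - Critical" \
  --resource-group "{{AZURE_RESOURCE_GROUP}}" \
  --scopes "$WORKSPACE_ID" \
  --severity 1 \
  --evaluation-frequency 5m \
  --window-size 5m \
  --condition "count 'UnhealthyNodes' > 0" \
  --condition-query UnhealthyNodes="Event | where Source == 'Microsoft-Windows-SDDC-Management' | where EventID == 3000" \
  --action-groups "$ACTION_GROUP_ID"
```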
PowerShell: Create Log Alert Rule
# Create a scheduled query rule for node health
$workspace = Get-AzOperationalInsightsWorkspace `
-ResourceGroupName "{{AZURE_RESOURCE_GROUP}}" `
-Name "{{LOG_ANALYTICS_WORKSPACE_NAME}}"
$actionGroup = Get-AzActionGroup `
-ResourceGroupName "{{AZURE_RESOURCE_GROUP}}" `
-Name "ag-azl-{{SITE_CODE}}-critical"
$query = @"
Event
| where Source == "Microsoft-Windows-SDDC-Management"
| where EventID == 3000
| extend HealthData = parse_json(RenderedDescription)
| where HealthData.HealthState != "Healthy"
| summarize Count = count() by Computer
"@
# Note: Use New-AzScheduledQueryRule for full implementation
Write-Host "Alert rule configuration prepared" -ForegroundColor Cyan
Write-Host "Query: $query"
Write-Host "Action Group: $($actionGroup.Id)"
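Building on the prepared `$query`, `$workspace`, and `$actionGroup` objects above, the full rule creation can be sketched with `New-AzScheduledQueryRule` (Az.Monitor 3.x+ syntax; parameter names differ in older module versions, so treat this as a hedged sketch rather than a verified command):

```powershell
# Build the alert condition: fire when the query returns any rows (count > 0)
$condition = New-AzScheduledQueryRuleCriterionObject `
    -Query $query `
    -TimeAggregation "Count" `
    -Operator "GreaterThan" `
    -Threshold 0

# Create the scheduled query rule scoped to the Log Analytics workspace
New-AzScheduledQueryRule `
    -Name "Node Health - Critical" `
    -ResourceGroupName "{{AZURE_RESOURCE_GROUP}}" `
    -Location $workspace.Location `
    -Scope $workspace.ResourceId `
    -Severity 1 `
    -WindowSize ([System.TimeSpan]::FromMinutes(5)) `
    -EvaluationFrequency ([System.TimeSpan]::FromMinutes(5)) `
    -CriterionAllOf $condition `
    -ActionGroupResourceId $actionGroup.Id
```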
Recommended Alert Summary
| Alert Name | Severity | Condition | Action |
|---|---|---|---|
| Node Health Critical | Sev 1 | Node unhealthy event | Email + Teams |
| Storage Volume Unhealthy | Sev 1 | Volume status != Healthy | Email + Teams |
| High CPU Utilization | Sev 2 | CPU > 90% for 15 min | |
| Low Available Memory | Sev 2 | Memory < 16 GB | |
| VM Failed | Sev 2 | VM state = Failed | |
| Storage Latency High | Sev 3 | Write latency > 100ms | |
| Arc Connectivity Lost | Sev 1 | No heartbeat > 30 min | Email + Teams |
Validation
Test Alert Rule
- Navigate to Azure Monitor → Alerts → Alert rules
- Select your alert rule
- Click Test action group to verify notifications work
Check Alert History
# Get recent alerts (requires the Az.AlertsManagement module)
Get-AzAlert -TargetResourceGroup "{{AZURE_RESOURCE_GROUP}}" `
    -TimeRange 1d | # Last 24 hours
Select-Object Name, Severity, MonitorCondition, ResolvedTime |
Format-Table -AutoSize
Verify Action Group
// Check for alert notifications in Activity Log
AzureActivity
| where CategoryValue == "Alert"
| where TimeGenerated > ago(24h)
| project TimeGenerated, OperationName, Status, Description
| order by TimeGenerated desc
Troubleshooting
| Issue | Possible Cause | Resolution |
|---|---|---|
| No alerts firing | Query returns no results | Test query in Log Analytics |
| Email not received | Email in spam | Check spam folder, add to safe senders |
| Action group error | Invalid webhook URL | Verify Teams/webhook URL format |
| Alert fires repeatedly | Threshold too sensitive | Adjust threshold or aggregation period |
Variables Reference
| Variable | Description | Example |
|---|---|---|
| {{NOC_EMAIL}} | Azure Local Cloud NOC email | noc@azurelocal.cloud |
| {{CUSTOMER_EMAIL}} | Customer contact | admin@customer.com |
| {{TEAMS_WEBHOOK_URL}} | Teams channel webhook | https://outlook.office.com/webhook/... |
Next Steps
After setting up alerting:
- ➡️ Task 05: Deploy OMIMSWAC Monitoring — Hardware monitoring
- Review alert rules weekly and adjust thresholds as needed
- Create runbooks for common alert responses
- Document escalation procedures for critical alerts
Navigation
| Previous | Up | Next |
|---|---|---|
| ← Task 03: HCI Insights | Phase 02: Monitoring & Observability | Phase 04: Security & Governance → |
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0.0 | 2026-03-24 | Azure Local Cloud Technology Team | Initial release |