
Lab Monitoring Plan

DOCUMENT CATEGORY: Runbook
SCOPE: Lab environment monitoring and observability
PURPOSE: Define the monitoring strategy for the Azure Local lab environment, modelled after what a production deployment would use
MASTER REFERENCE: Monitor Azure Local with Azure Monitor

Status: Example / Reference

Lab Reference Document

This monitoring plan is a worked example showing how a real Azure Local deployment would be documented. Replace all [EXAMPLE] values with your environment-specific details before using in production.


Monitoring Objectives

The lab monitoring strategy delivers the same observability posture as a production deployment, enabling:

  • Cluster health visibility — Real-time and historical health across all nodes, storage, and network
  • Workload performance — Per-VM and per-service metrics for workloads running on the cluster
  • Alerting — Automated notification when critical thresholds are breached
  • Audit trail — Log retention sufficient for security review and troubleshooting
  • Capacity planning — Trend data for predicting growth and right-sizing

Environment Scope

| Attribute | Value |
| --- | --- |
| Environment Name | [EXAMPLE] azlocal-lab-001 |
| Cluster Node Count | [EXAMPLE] 3 |
| Azure Region | [EXAMPLE] australiaeast |
| Azure Subscription | [EXAMPLE] AzureLocal-Lab-Sub |
| Resource Group | [EXAMPLE] rg-azlocal-lab-monitoring |
| Log Analytics Workspace | [EXAMPLE] law-azlocal-lab-001 |
| Retention Period | [EXAMPLE] 30 days (hot) / 90 days (archive) |

Monitoring Architecture

The lab uses the same stack documented in the Monitoring and Observability solution:

```text
Cluster Nodes (3x)
├──▶ Azure Monitor Agent (AMA)
│    ├──▶ Log Analytics Workspace (law-azlocal-lab-001)
│    │    └──▶ KQL Queries / Workbooks / Alerts
│    └──▶ Azure Monitor Metrics
├──▶ Prometheus (workload metrics — deployed on AKS Arc)
│    └──▶ Azure Managed Grafana (grafana-azlocal-lab-001)
└──▶ Azure Monitor Alerts ──▶ Action Group (lab-ops-email)
```

Components

| Component | Deployment | Notes |
| --- | --- | --- |
| Azure Monitor Agent | Arc-enabled VM extension | Replaces legacy MMA/OMS |
| Log Analytics Workspace | Azure-hosted | Centralized for all nodes and workloads |
| Prometheus | AKS Arc cluster | Scrapes node-exporter and kube-state-metrics |
| Azure Managed Grafana | Azure-hosted | Pre-built Azure Local dashboards |
| Azure Monitor Alerts | Azure-hosted | Threshold and anomaly-based rules |
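As a sketch of how the agent component lands on each node: AMA is deployed as an extension on the Arc-enabled machine resource. The machine name below is a placeholder (not from this plan); the other names are the [EXAMPLE] values above, and a data collection rule association is still required separately.

```shell
# Install Azure Monitor Agent on one Arc-enabled node.
# Machine name is a hypothetical placeholder; adjust for each node.
az connectedmachine extension create \
  --machine-name azlocal-lab-node1 \
  --resource-group rg-azlocal-lab-monitoring \
  --location australiaeast \
  --name AzureMonitorWindowsAgent \
  --publisher Microsoft.Azure.Monitor \
  --type AzureMonitorWindowsAgent
```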

Key Metrics and Thresholds

Cluster Infrastructure

| Metric | Warning Threshold | Critical Threshold | Collection Interval |
| --- | --- | --- | --- |
| Node CPU utilization | 70% | 90% | 60 seconds |
| Node memory utilization | 75% | 90% | 60 seconds |
| Storage pool capacity | 70% | 85% | 5 minutes |
| Storage I/O latency (avg) | 5 ms | 20 ms | 60 seconds |
| Network packet loss | 0.1% | 1% | 60 seconds |
| Node heartbeat | N/A | No heartbeat > 5 min | 1 minute |
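The CPU thresholds above can be spot-checked ad hoc with a single query; this is a sketch, and the 15-minute lookback window is an arbitrary choice, not part of the plan:

```kql
Perf
| where TimeGenerated > ago(15m)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize AvgCPU = avg(CounterValue) by Computer
| extend Severity = case(AvgCPU > 90, "Critical", AvgCPU > 70, "Warning", "OK")
| where Severity != "OK"
```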

Workload Metrics

| Metric | Warning Threshold | Critical Threshold | Collection Interval |
| --- | --- | --- | --- |
| VM CPU utilization | 80% | 95% | 5 minutes |
| VM memory utilization | 80% | 95% | 5 minutes |
| VM disk read latency | 10 ms | 30 ms | 5 minutes |
| VM disk write latency | 10 ms | 30 ms | 5 minutes |
| AVD session count | 75% of capacity | 95% of capacity | 5 minutes |

Azure Local-Specific

| Metric | Warning Threshold | Critical Threshold |
| --- | --- | --- |
| Arc connectivity status | N/A | Disconnected > 15 min |
| Storage Spaces Direct (S2D) health | Degraded | Failed |
| Failover Cluster health | N/A | Any node offline |
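The Arc connectivity condition maps directly onto the Heartbeat table. A minimal check for machines that have been silent past the 15-minute critical threshold might look like:

```kql
Heartbeat
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| where LastHeartbeat < ago(15m)
```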

Log Analytics Configuration

Data Sources

| Source | Table | Retention |
| --- | --- | --- |
| Windows Event Logs (System, Application) | Event | 30 days |
| Performance Counters | Perf | 30 days |
| Azure Activity Log | AzureActivity | 90 days |
| Arc Agent Heartbeat | Heartbeat | 30 days |
| Custom Syslog (Linux nodes) | Syslog | 30 days |
| Security Events | SecurityEvent | 90 days |
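The 90-day tables above override the 30-day workspace default, which can be done per table via the CLI. A sketch using the [EXAMPLE] resource names from this plan:

```shell
# Extend interactive retention for the SecurityEvent table only
az monitor log-analytics workspace table update \
  --resource-group rg-azlocal-lab-monitoring \
  --workspace-name law-azlocal-lab-001 \
  --name SecurityEvent \
  --retention-time 90
```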

Key KQL Queries

Node availability (last 24 hours):

```kql
Heartbeat
| where TimeGenerated > ago(24h)
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| extend MinutesSince = datetime_diff('minute', now(), LastHeartbeat)
| project Computer, LastHeartbeat, MinutesSince
| order by MinutesSince desc
```

Top CPU consumers (last 1 hour):

```kql
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize AvgCPU = avg(CounterValue) by Computer
| order by AvgCPU desc
| take 10
```

Storage latency trend:

```kql
Perf
| where TimeGenerated > ago(6h)
| where ObjectName == "LogicalDisk" and CounterName == "Avg. Disk sec/Read"
| summarize AvgLatencyMs = avg(CounterValue * 1000) by bin(TimeGenerated, 5m), Computer
| render timechart
```

Alert Configuration

Action Group

| Setting | Value |
| --- | --- |
| Action Group Name | [EXAMPLE] ag-azlocal-lab-ops |
| Email Recipients | [EXAMPLE] lab-ops@contoso.com |
| Webhook (optional) | [EXAMPLE] Teams channel webhook URL |
| SMS (optional) | [EXAMPLE] N/A for lab |
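A minimal sketch of creating this action group with the CLI, using the [EXAMPLE] values above (the short name and receiver name are assumptions):

```shell
az monitor action-group create \
  --resource-group rg-azlocal-lab-monitoring \
  --name ag-azlocal-lab-ops \
  --short-name labops \
  --action email lab-ops lab-ops@contoso.com
```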

Alert Rules

| Alert Name | Severity | Condition | Action |
| --- | --- | --- | --- |
| Node CPU Critical | Sev 1 | CPU > 90% for 5 min | Email + Teams |
| Node Memory Critical | Sev 1 | Memory > 90% for 5 min | Email + Teams |
| Node Heartbeat Lost | Sev 0 | No heartbeat > 5 min | Email + Teams |
| Storage Pool Warning | Sev 2 | Capacity > 70% | Email |
| Storage Pool Critical | Sev 1 | Capacity > 85% | Email + Teams |
| Storage Latency Critical | Sev 1 | Avg latency > 20 ms | Email |
| Arc Connectivity Lost | Sev 1 | Disconnected > 15 min | Email + Teams |
| VM CPU Critical | Sev 2 | VM CPU > 95% for 10 min | Email |
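One possible CLI expression of the first rule, as a metric alert. This is a sketch only: the scope placeholder and the metric name are assumptions, since the metric namespace exposed for Azure Local nodes depends on how they are onboarded.

```shell
# Sketch: Node CPU Critical as a metric alert (metric name is an assumption)
az monitor metrics alert create \
  --name "Node CPU Critical" \
  --resource-group rg-azlocal-lab-monitoring \
  --scopes <node-resource-id> \
  --condition "avg Percentage CPU > 90" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 1 \
  --action ag-azlocal-lab-ops
```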

Grafana Dashboards

| Dashboard | Source | Purpose |
| --- | --- | --- |
| Azure Local Cluster Overview | azurelocal-monitoring repo | Node health, storage, network |
| AVD Session Monitoring | azurelocal-avd | Session counts, host pool health |
| VM Performance | Community / Custom | Per-VM CPU, memory, disk |
| Storage Spaces Direct | Microsoft | S2D health, capacity, latency |

Access: [EXAMPLE] https://grafana-azlocal-lab-001.australiaeast.grafana.azure.com


Response Procedures

Severity 0 — Node Down

  1. Validate in Azure portal → Azure Arc → Servers
  2. Check physical hardware (iDRAC/iLO console)
  3. Attempt remote power cycle if hardware permits
  4. Escalate to on-site team if remote resolution fails
  5. Document incident in Support Instructions

Severity 1 — Critical Threshold Breached

  1. Acknowledge alert within 15 minutes
  2. Open Log Analytics and run relevant KQL query to confirm scope
  3. Check workload impact — live-migrate VMs off the node if it is saturated
  4. Review recent changes via Azure Activity Log
  5. Document findings in incident log
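Step 4 of the procedure above can be run as a query; this sketch lists successful control-plane operations from the last 24 hours, newest first:

```kql
AzureActivity
| where TimeGenerated > ago(24h)
| where ActivityStatusValue == "Success"
| project TimeGenerated, Caller, OperationNameValue, ResourceGroup
| order by TimeGenerated desc
```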

Severity 2 — Warning Threshold Breached

  1. Review trend over last 24 hours in Grafana
  2. Determine if threshold is trending toward critical
  3. Open capacity review ticket if storage warning persists > 48 hours
  4. No immediate escalation required unless trend worsens

Review Cadence

| Review | Frequency | Owner |
| --- | --- | --- |
| Alert rule tuning | Monthly | Lab Ops |
| Log Analytics cost review | Monthly | Lab Ops |
| Dashboard review | Quarterly | Lab Ops |
| Retention policy review | Quarterly | Lab Ops |
| Monitoring architecture review | Annually | Platform Team |

References