Task 01: Infrastructure Health Validation
DOCUMENT CATEGORY: Runbook
SCOPE: Infrastructure health validation
PURPOSE: Validate cluster infrastructure health before performance testing
MASTER REFERENCE: Microsoft Learn - Health Service
Status: Active
Overview
This task validates cluster nodes, cluster resources, storage, Azure Arc connectivity, and recent event logs before performance testing begins. All validation results are captured in a report for the customer handover package.
Prerequisites
- Azure Local cluster deployed and accessible
- Administrative credentials available
- Windows PowerShell 5.1 or PowerShell 7
Variables from variables.yml
| Variable Path | Type | Description |
|---|---|---|
| platform.cluster_name | String | Cluster name used in validation report headers |
| compute.nodes[].name | String | Node hostnames for per-node health checks |
| networking.management.gateway | String | Default gateway IP for connectivity tests |
| networking.management.dns_servers | Array | DNS server IPs for resolution validation |
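The gateway and DNS variables above are listed for connectivity and resolution tests. A minimal sketch of how those values might be flattened into a list of targets to probe is shown here. The `$Config` hashtable stands in for the parsed variables.yml, and the function name `Get-ConnectivityTarget` and the sample addresses are hypothetical, not part of the toolkit.

```powershell
# Hypothetical stand-in for values parsed from variables.yml (sample addresses only).
$Config = @{
    networking = @{
        management = @{
            gateway     = '192.0.2.1'
            dns_servers = @('192.0.2.10', '192.0.2.11')
        }
    }
}

# Hypothetical helper: flatten the gateway and DNS servers into one list of targets.
function Get-ConnectivityTarget {
    param([Parameter(Mandatory)][hashtable]$Config)

    $mgmt = $Config.networking.management
    [PSCustomObject]@{ Role = 'Gateway'; Address = $mgmt.gateway }
    foreach ($dns in $mgmt.dns_servers) {
        [PSCustomObject]@{ Role = 'DNS'; Address = $dns }
    }
}

$Targets = @(Get-ConnectivityTarget -Config $Config)
$Targets | Format-Table -AutoSize
```

Each target could then be probed with `Test-Connection -ComputerName <address> -Count 1 -Quiet`, and each DNS server with `Resolve-DnsName -Name <host> -Server <address>`, with the results appended to the report.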
Report Output
Validation results are saved to:
\\<ClusterName>\ClusterStorage$\Collect\validation-reports\01-infrastructure-health-report-YYYYMMDD.txt
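The report location above is a UNC path, while the script in Part 1 writes to the local `C:\ClusterStorage` path on a node. The sketch below shows how the two forms might be derived from the same cluster name and date stamp. It assumes both paths point at the same `Collect` volume; the helper name `Get-ReportLocation` and the sample cluster name are hypothetical.

```powershell
# Hypothetical helper: build the local and UNC report paths for a cluster and date.
function Get-ReportLocation {
    param(
        [Parameter(Mandatory)][string]$ClusterName,
        [datetime]$Date = (Get-Date)
    )

    $stamp    = $Date.ToString('yyyyMMdd')
    $relative = "Collect\validation-reports\01-infrastructure-health-report-$stamp.txt"

    [PSCustomObject]@{
        # Path as seen from a cluster node
        LocalPath = "C:\ClusterStorage\$relative"
        # Path as seen from a remote machine (the backtick keeps the $ of the share name literal)
        UncPath   = "\\$ClusterName\ClusterStorage`$\$relative"
    }
}

$Location = Get-ReportLocation -ClusterName 'cluster01' -Date ([datetime]'2026-03-24')
$Location | Format-List
```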
Part 1: Initialize Validation Environment
1.1 Create Report Directory and File
# Initialize variables
$ClusterName = (Get-Cluster).Name
$DateStamp = Get-Date -Format "yyyyMMdd"
$ReportPath = "C:\ClusterStorage\Collect\validation-reports"
$ReportFile = "$ReportPath\01-infrastructure-health-report-$DateStamp.txt"
# Create directory if not exists
if (-not (Test-Path $ReportPath)) {
    New-Item -Path $ReportPath -ItemType Directory -Force
}
# Initialize report
$ReportHeader = @"
================================================================================
INFRASTRUCTURE HEALTH VALIDATION REPORT
================================================================================
Cluster: $ClusterName
Date: $(Get-Date -Format "yyyy-MM-dd HH:mm:ss")
Generated By: $(whoami)
================================================================================
"@
$ReportHeader | Out-File -FilePath $ReportFile -Encoding UTF8
Part 2: Cluster Node Validation
2.1 Node Status Check
"CLUSTER NODE STATUS" | Add-Content $ReportFile
"="*40 | Add-Content $ReportFile
$Nodes = Get-ClusterNode | Select-Object Name, State, DrainStatus, DynamicWeight
$Nodes | Format-Table -AutoSize | Out-String | Add-Content $ReportFile
# Check for any nodes not in "Up" state
$FailedNodes = $Nodes | Where-Object { $_.State -ne "Up" }
if ($FailedNodes) {
    "WARNING: Nodes not in Up state:" | Add-Content $ReportFile
    $FailedNodes | Format-Table | Out-String | Add-Content $ReportFile
} else {
    "All nodes are Up and healthy." | Add-Content $ReportFile
}
2.2 Run Full Test-Cluster Validation
"`n" + "="*40 | Add-Content $ReportFile
"TEST-CLUSTER VALIDATION" | Add-Content $ReportFile
"="*40 | Add-Content $ReportFile
# Run comprehensive cluster validation
$TestClusterReport = "$ReportPath\TestClusterReport-$DateStamp.html"
# Run Test-Cluster (storage tests skipped as cluster is operational)
Test-Cluster -Include "Cluster Configuration","Inventory","Network","System Configuration" `
    -ReportName $TestClusterReport `
    -Verbose 4>&1 | Out-String | Add-Content $ReportFile
"Test-Cluster HTML report saved to: $TestClusterReport" | Add-Content $ReportFile
Part 3: Cluster Resource Validation
3.1 Core Cluster Resources
"`n" + "="*40 | Add-Content $ReportFile
"CLUSTER RESOURCES" | Add-Content $ReportFile
"="*40 | Add-Content $ReportFile
# Check for any resources not Online
$OfflineResources = Get-ClusterResource | Where-Object { $_.State -ne "Online" }
if ($OfflineResources) {
    "WARNING: Resources not Online:" | Add-Content $ReportFile
    $OfflineResources | Format-Table Name, State, OwnerGroup, ResourceType -AutoSize | Out-String | Add-Content $ReportFile
} else {
    "All cluster resources are Online." | Add-Content $ReportFile
}
# List critical resources
"Critical Cluster Resources:" | Add-Content $ReportFile
Get-ClusterResource | Where-Object { $_.ResourceType -in "IP Address","Network Name","Cluster Shared Volume","Virtual Machine" } |
    Format-Table Name, State, OwnerGroup, OwnerNode -AutoSize | Out-String | Add-Content $ReportFile
3.2 Cluster Quorum Status
"`nCLUSTER QUORUM" | Add-Content $ReportFile
Get-ClusterQuorum | Format-List | Out-String | Add-Content $ReportFile
Part 4: Storage Health Validation
4.1 Storage Pool Health
"`n" + "="*40 | Add-Content $ReportFile
"STORAGE HEALTH" | Add-Content $ReportFile
"="*40 | Add-Content $ReportFile
# Storage pools
"Storage Pools:" | Add-Content $ReportFile
Get-StoragePool | Where-Object { $_.IsPrimordial -eq $false } |
    Format-Table FriendlyName, HealthStatus, OperationalStatus, Size, AllocatedSize -AutoSize | Out-String | Add-Content $ReportFile
# Virtual disks
"Virtual Disks:" | Add-Content $ReportFile
Get-VirtualDisk | Format-Table FriendlyName, HealthStatus, OperationalStatus, Size, FootprintOnPool, ResiliencySettingName -AutoSize | Out-String | Add-Content $ReportFile
# Physical disks
"Physical Disks (Summary):" | Add-Content $ReportFile
Get-PhysicalDisk | Group-Object HealthStatus | Format-Table Name, Count -AutoSize | Out-String | Add-Content $ReportFile
4.2 Health Service Faults
"`nHEALTH SERVICE FAULTS" | Add-Content $ReportFile
$HealthFaults = Get-HealthFault
if ($HealthFaults) {
    "Active Health Faults:" | Add-Content $ReportFile
    $HealthFaults | Format-Table FaultType, FaultingObjectDescription, Reason -AutoSize -Wrap | Out-String | Add-Content $ReportFile
} else {
    "No active health faults detected." | Add-Content $ReportFile
}
4.3 Cluster Shared Volumes
"`nCLUSTER SHARED VOLUMES" | Add-Content $ReportFile
Get-ClusterSharedVolume | ForEach-Object {
    $CSV = $_
    [PSCustomObject]@{
        Name        = $CSV.Name
        State       = $CSV.State
        OwnerNode   = $CSV.OwnerNode.Name
        Path        = $CSV.SharedVolumeInfo.FriendlyVolumeName
        SizeGB      = [math]::Round($CSV.SharedVolumeInfo.Partition.Size / 1GB, 2)
        FreeGB      = [math]::Round($CSV.SharedVolumeInfo.Partition.FreeSpace / 1GB, 2)
        UsedPercent = [math]::Round((1 - ($CSV.SharedVolumeInfo.Partition.FreeSpace / $CSV.SharedVolumeInfo.Partition.Size)) * 100, 1)
    }
} | Format-Table -AutoSize | Out-String | Add-Content $ReportFile
Part 5: Azure Arc Connectivity Validation
5.1 Arc Agent Status
"`n" + "="*40 | Add-Content $ReportFile
"AZURE ARC CONNECTIVITY" | Add-Content $ReportFile
"="*40 | Add-Content $ReportFile
$Nodes = (Get-ClusterNode).Name
foreach ($Node in $Nodes) {
    "Node: $Node" | Add-Content $ReportFile
    $ArcStatus = Invoke-Command -ComputerName $Node -ScriptBlock {
        & "$env:ProgramFiles\AzureConnectedMachineAgent\azcmagent.exe" show 2>&1
    }
    # Parse key values. Split on the first colon only so timestamps keep their colons,
    # and record "Unknown" when a field is missing instead of failing on a null match.
    $ArcFields = @{}
    foreach ($Key in "Agent Version", "Agent Status", "Last Heartbeat") {
        $Match = $ArcStatus | Select-String $Key | Select-Object -First 1
        $ArcFields[$Key] = if ($Match) { ($Match.Line -split ":", 2)[1].Trim() } else { "Unknown" }
    }
    "  Agent Version: $($ArcFields['Agent Version'])" | Add-Content $ReportFile
    "  Connection Status: $($ArcFields['Agent Status'])" | Add-Content $ReportFile
    "  Last Heartbeat: $($ArcFields['Last Heartbeat'])" | Add-Content $ReportFile
    "" | Add-Content $ReportFile
}
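The parsing step above relies on the agent output being laid out as `Key : Value` lines. As a minimal sketch of that idea in isolation, the helper below turns such lines into a hashtable, splitting on the first colon only so that values containing colons (for example timestamps) stay intact. The function name `ConvertFrom-KeyValueLine` is hypothetical, and the sample lines are made up for illustration rather than captured from a real agent.

```powershell
# Hypothetical helper: turn "Key : Value" lines into a hashtable.
function ConvertFrom-KeyValueLine {
    param([string[]]$Line)

    $result = @{}
    foreach ($entry in $Line) {
        # Split on the first colon only, so values such as timestamps keep their colons.
        $parts = $entry -split ':', 2
        if ($parts.Count -eq 2) {
            $result[$parts[0].Trim()] = $parts[1].Trim()
        }
    }
    $result
}

# Made-up sample lines that only assume the "Key : Value" layout.
$Sample = @(
    'Agent Version        : 1.0.0'
    'Agent Status         : Connected'
    'Agent Last Heartbeat : 2026-03-24T10:15:30Z'
)

$Parsed = ConvertFrom-KeyValueLine -Line $Sample
$Parsed['Agent Last Heartbeat']
```

Looking fields up by key in a hashtable avoids repeating the match-and-split logic for every value that goes into the report.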
5.2 Azure Local Registration
"AZURE LOCAL REGISTRATION" | Add-Content $ReportFile
$AzureLocalReg = Get-AzureStackHCI
$AzureLocalReg | Format-List ClusterStatus, RegistrationStatus, ConnectionStatus, LastConnected | Out-String | Add-Content $ReportFile
Part 6: Event Log Review
6.1 Critical Events (Last 24 Hours)
"`n" + "="*40 | Add-Content $ReportFile
"EVENT LOG REVIEW (CRITICAL/ERROR - LAST 24 HOURS)" | Add-Content $ReportFile
"="*40 | Add-Content $ReportFile
$StartTime = (Get-Date).AddHours(-24)
$CriticalEvents = Get-WinEvent -FilterHashtable @{
    LogName   = 'System','Application'
    Level     = 1,2 # Critical and Error
    StartTime = $StartTime
} -MaxEvents 50 -ErrorAction SilentlyContinue
if ($CriticalEvents) {
    $CriticalEvents | Group-Object ProviderName | Sort-Object Count -Descending | Select-Object -First 10 |
        Format-Table @{N='Source';E={$_.Name}}, Count -AutoSize | Out-String | Add-Content $ReportFile
    "Recent Critical/Error Events:" | Add-Content $ReportFile
    $CriticalEvents | Select-Object -First 20 TimeCreated, ProviderName, Id, Message |
        Format-Table -Wrap | Out-String | Add-Content $ReportFile
} else {
    "No critical or error events in the last 24 hours." | Add-Content $ReportFile
}
Part 7: Generate Summary
# Wrap each query in @() so .Count is reliable when zero or one object is returned
$NodeCount = @(Get-ClusterNode | Where-Object State -eq "Up").Count
$TotalNodes = @(Get-ClusterNode).Count
$HealthFaultCount = @(Get-HealthFault).Count
$OfflineResourceCount = @(Get-ClusterResource | Where-Object State -ne "Online").Count
$ArcStatus = (Get-AzureStackHCI).ConnectionStatus
$Summary = @"
================================================================================
INFRASTRUCTURE HEALTH SUMMARY
================================================================================
Validation Category             Status
------------------------------- --------
Cluster Nodes                   $NodeCount of $TotalNodes Up
Cluster Resources               $(if($OfflineResourceCount -eq 0){"All Online"}else{"$OfflineResourceCount Offline"})
Storage Health Faults           $(if($HealthFaultCount -eq 0){"None"}else{"$HealthFaultCount Active"})
Azure Arc Connection            $ArcStatus
Test-Cluster                    See HTML Report
OVERALL STATUS: $(if($NodeCount -eq $TotalNodes -and $HealthFaultCount -eq 0 -and $OfflineResourceCount -eq 0){"PASS"}else{"REVIEW REQUIRED"})
================================================================================
Report saved to: $ReportFile
================================================================================
"@
$Summary | Add-Content $ReportFile
Write-Host $Summary
Validation Checklist
| Category | Requirement | Status |
|---|---|---|
| Nodes | All nodes in "Up" state | ☐ |
| Nodes | No nodes in drain status | ☐ |
| Resources | All cluster resources online | ☐ |
| Quorum | Quorum established | ☐ |
| Storage | All storage pools healthy | ☐ |
| Storage | All virtual disks healthy | ☐ |
| Storage | No active health faults | ☐ |
| CSV | All CSVs online | ☐ |
| Arc | All nodes connected | ☐ |
| Events | No critical events (last 24h) | ☐ |
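Some of the checklist items above map directly onto counts that the summary step already collects. The sketch below shows one way those counts might be turned into pass/fail items and an overall status, using the same PASS / REVIEW REQUIRED wording as the summary. The function name `Get-ChecklistResult` and its parameters are hypothetical, and only three of the checklist rows are covered.

```powershell
# Hypothetical helper: evaluate a few checklist items from counts gathered earlier.
function Get-ChecklistResult {
    param(
        [int]$NodesUp,
        [int]$TotalNodes,
        [int]$OfflineResources,
        [int]$HealthFaults
    )

    $items = @(
        [PSCustomObject]@{ Category = 'Nodes';     Requirement = 'All nodes in "Up" state';      Passed = ($NodesUp -eq $TotalNodes) }
        [PSCustomObject]@{ Category = 'Resources'; Requirement = 'All cluster resources online'; Passed = ($OfflineResources -eq 0) }
        [PSCustomObject]@{ Category = 'Storage';   Requirement = 'No active health faults';      Passed = ($HealthFaults -eq 0) }
    )

    # Any failed item downgrades the overall status.
    $failed = @($items | Where-Object { -not $_.Passed })

    [PSCustomObject]@{
        Items   = $items
        Overall = if ($failed.Count -eq 0) { 'PASS' } else { 'REVIEW REQUIRED' }
    }
}

# Example: four of four nodes up, one offline resource, no health faults.
$Result = Get-ChecklistResult -NodesUp 4 -TotalNodes 4 -OfflineResources 1 -HealthFaults 0
$Result.Items | Format-Table -AutoSize
$Result.Overall
```

Keeping each requirement as its own object makes it straightforward to list which items failed, rather than reporting only the overall status.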
Common Issues
Node Not Responding
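# Check physical connectivity, then check and restart the cluster service on the node.
# Restart-Service has no -ComputerName parameter, and Get-Service -ComputerName is not
# available in PowerShell 7, so both commands are run remotely.
Invoke-Command -ComputerName <NodeName> -ScriptBlock { Get-Service -Name ClusSvc }
Invoke-Command -ComputerName <NodeName> -ScriptBlock { Restart-Service -Name ClusSvc }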
Storage Pool Unhealthy
# Check for failed physical disks
Get-PhysicalDisk | Where-Object HealthStatus -ne "Healthy"
# Repair virtual disk
Repair-VirtualDisk -FriendlyName <VirtualDiskName>
Arc Agent Disconnected
# Reconnect Arc agent
azcmagent connect --resource-group <RG> --tenant-id <TenantID> --subscription-id <SubID>
Troubleshooting
| Issue | Cause | Resolution |
|---|---|---|
| Cluster validation reports node unreachable | WinRM disabled or network partition | Verify connectivity: Test-Connection <node-ip>; enable WinRM: Enable-PSRemoting -Force on the failing node |
| Storage pool health shows Degraded | Physical disk failed or missing | Run Get-PhysicalDisk \| Where-Object HealthStatus -ne Healthy to identify failed disks; replace the disk and repair the virtual disk |
| Arc agent shows Disconnected status | Agent service stopped or token expired | Restart agent: Restart-Service himds; if the issue persists, reconnect: azcmagent connect with fresh credentials |
Next Step
Proceed to Task 2: VMFleet Storage Testing once infrastructure validation passes.
This task can be run in one of three ways:
- Manual: Use this option for manual step-by-step execution. Follow the procedure steps above.
- Orchestrated Script: Use this option when deploying across multiple nodes from a management server using variables.yml. See azurelocal-toolkit for the orchestrated script for this task.
- Standalone Script: Use this option for a self-contained deployment without a shared configuration file. See azurelocal-toolkit for the standalone script for this task.
Scripts for this task are located in the azurelocal-toolkit repository under scripts/deploy/ in the appropriate task folder.
Alternatives
The procedures in this task use the scripted methods described above. Additional deployment methods, including Azure CLI and Bash scripts, are available in the azurelocal-toolkit repository under scripts/deploy/.
| Method | Description |
|---|---|
| Azure CLI | PowerShell-based Azure CLI scripts for Azure resource operations |
| Bash | Linux/macOS compatible shell scripts for pipeline environments |
Version Control
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0.0 | 2026-03-24 | Azure Local Cloud | Initial release |