Task 01: Infrastructure Health Validation

Runbook Azure

DOCUMENT CATEGORY: Runbook
SCOPE: Infrastructure health validation
PURPOSE: Validate cluster infrastructure health before performance testing
MASTER REFERENCE: Microsoft Learn - Health Service

Status: Active


Overview

This task validates cluster nodes, cluster resources, storage, Azure Arc connectivity, and recent event logs before performance testing begins. All results are written to a single report that becomes part of the customer handover package.

Prerequisites

  • Azure Local cluster deployed and accessible
  • Administrative credentials available
  • PowerShell 5.1+ or PowerShell 7

Variables from variables.yml

| Variable Path | Type | Description |
|---|---|---|
| platform.cluster_name | String | Cluster name used in validation report headers |
| compute.nodes[].name | String | Node hostnames for per-node health checks |
| networking.management.gateway | String | Default gateway IP for connectivity tests |
| networking.management.dns_servers | Array | DNS server IPs for resolution validation |

Report Output

Validation results are saved to:

\\<ClusterName>\ClusterStorage$\Collect\validation-reports\01-infrastructure-health-report-YYYYMMDD.txt
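The UNC path above and the local path used in Part 1 point at the same file, assuming the ClusterStorage$ administrative share maps to C:\ClusterStorage on the node that answers for the cluster name. A minimal sketch of a helper that builds either form follows; the function name Get-ValidationReportPath and its -Unc switch are hypothetical and not part of the runbook.

```powershell
# Hypothetical helper: builds the report file path in local or UNC form.
function Get-ValidationReportPath {
    param(
        [Parameter(Mandatory)] [string] $ClusterName,
        [datetime] $Date = (Get-Date),
        [switch] $Unc   # return the \\<ClusterName>\ClusterStorage$ form
    )
    # Assumption: ClusterStorage$ is the share for C:\ClusterStorage.
    $root = if ($Unc) { "\\$ClusterName\ClusterStorage`$" } else { 'C:\ClusterStorage' }
    $stamp = $Date.ToString('yyyyMMdd')
    "$root\Collect\validation-reports\01-infrastructure-health-report-$stamp.txt"
}

# Example: local path for today's report
Get-ValidationReportPath -ClusterName 'clus01'
```

The local form is what the script writes to; the UNC form is convenient when collecting the report from a management workstation.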

Part 1: Initialize Validation Environment

1.1 Create Report Directory and File

# Initialize variables
$ClusterName = (Get-Cluster).Name
$DateStamp = Get-Date -Format "yyyyMMdd"
$ReportPath = "C:\ClusterStorage\Collect\validation-reports"
$ReportFile = "$ReportPath\01-infrastructure-health-report-$DateStamp.txt"

# Create the report directory if it does not exist
if (-not (Test-Path $ReportPath)) {
    New-Item -Path $ReportPath -ItemType Directory -Force | Out-Null
}

# Initialize report
$ReportHeader = @"
================================================================================
INFRASTRUCTURE HEALTH VALIDATION REPORT
================================================================================
Cluster: $ClusterName
Date: $(Get-Date -Format "yyyy-MM-dd HH:mm:ss")
Generated By: $(whoami)
================================================================================

"@
$ReportHeader | Out-File -FilePath $ReportFile -Encoding UTF8

Part 2: Cluster Node Validation

2.1 Node Status Check

"CLUSTER NODE STATUS" | Add-Content $ReportFile
"="*40 | Add-Content $ReportFile

$Nodes = Get-ClusterNode | Select-Object Name, State, DrainStatus, DynamicWeight
$Nodes | Format-Table -AutoSize | Out-String | Add-Content $ReportFile

# Check for any nodes not in "Up" state
$FailedNodes = $Nodes | Where-Object { $_.State -ne "Up" }
if ($FailedNodes) {
    "WARNING: Nodes not in Up state:" | Add-Content $ReportFile
    $FailedNodes | Format-Table | Out-String | Add-Content $ReportFile
} else {
    "All nodes are Up and healthy." | Add-Content $ReportFile
}
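The checklist later in this runbook also requires that no node is in drain status, which the check above does not cover. The following is a minimal sketch of one way to flag both conditions. Get-NodeHealthIssue is a hypothetical helper, and it assumes that an undrained node reports a DrainStatus of NotInitiated.

```powershell
# Hypothetical helper: returns one record per problem found on a node.
function Get-NodeHealthIssue {
    param([Parameter(Mandatory)] [object[]] $Nodes)
    foreach ($n in $Nodes) {
        # Cast to string so enum values and plain strings compare the same way.
        $state = "$($n.State)"
        $drain = "$($n.DrainStatus)"
        if ($state -ne 'Up') {
            [PSCustomObject]@{ Name = $n.Name; Issue = "State is $state" }
        }
        # Assumption: 'NotInitiated' means the node has not been drained.
        if ($drain -and $drain -ne 'NotInitiated') {
            [PSCustomObject]@{ Name = $n.Name; Issue = "DrainStatus is $drain" }
        }
    }
}

# Illustrative input; on a cluster node use: Get-NodeHealthIssue -Nodes (Get-ClusterNode)
$sample = @(
    [PSCustomObject]@{ Name = 'node01'; State = 'Up';     DrainStatus = 'NotInitiated' }
    [PSCustomObject]@{ Name = 'node02'; State = 'Paused'; DrainStatus = 'Completed' }
)
Get-NodeHealthIssue -Nodes $sample | Format-Table -AutoSize
```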

2.2 Run Full Test-Cluster Validation

"`n" + "="*40 | Add-Content $ReportFile
"TEST-CLUSTER VALIDATION" | Add-Content $ReportFile
"="*40 | Add-Content $ReportFile

# Run comprehensive cluster validation
$TestClusterReport = "$ReportPath\TestClusterReport-$DateStamp.html"

# Run Test-Cluster (storage tests skipped as cluster is operational)
Test-Cluster -Include "Cluster Configuration","Inventory","Network","System Configuration" `
    -ReportName $TestClusterReport `
    -Verbose 4>&1 | Out-String | Add-Content $ReportFile

"Test-Cluster HTML report saved to: $TestClusterReport" | Add-Content $ReportFile

Part 3: Cluster Resource Validation

3.1 Core Cluster Resources

"`n" + "="*40 | Add-Content $ReportFile
"CLUSTER RESOURCES" | Add-Content $ReportFile
"="*40 | Add-Content $ReportFile

# Check for any resources not Online
$OfflineResources = Get-ClusterResource | Where-Object { $_.State -ne "Online" }

if ($OfflineResources) {
    "WARNING: Resources not Online:" | Add-Content $ReportFile
    $OfflineResources | Format-Table Name, State, OwnerGroup, ResourceType -AutoSize | Out-String | Add-Content $ReportFile
} else {
    "All cluster resources are Online." | Add-Content $ReportFile
}

# List critical resources
"Critical Cluster Resources:" | Add-Content $ReportFile
Get-ClusterResource | Where-Object { $_.ResourceType -in "IP Address","Network Name","Cluster Shared Volume","Virtual Machine" } |
    Format-Table Name, State, OwnerGroup, OwnerNode -AutoSize | Out-String | Add-Content $ReportFile

3.2 Cluster Quorum Status

"`nCLUSTER QUORUM" | Add-Content $ReportFile
Get-ClusterQuorum | Format-List | Out-String | Add-Content $ReportFile

Part 4: Storage Health Validation

4.1 Storage Pool Health

"`n" + "="*40 | Add-Content $ReportFile
"STORAGE HEALTH" | Add-Content $ReportFile
"="*40 | Add-Content $ReportFile

# Storage pools
"Storage Pools:" | Add-Content $ReportFile
Get-StoragePool | Where-Object { $_.IsPrimordial -eq $false } |
    Format-Table FriendlyName, HealthStatus, OperationalStatus, Size, AllocatedSize -AutoSize | Out-String | Add-Content $ReportFile

# Virtual disks
"Virtual Disks:" | Add-Content $ReportFile
Get-VirtualDisk | Format-Table FriendlyName, HealthStatus, OperationalStatus, Size, FootprintOnPool, ResiliencySettingName -AutoSize | Out-String | Add-Content $ReportFile

# Physical disks
"Physical Disks (Summary):" | Add-Content $ReportFile
Get-PhysicalDisk | Group-Object HealthStatus | Format-Table Name, Count -AutoSize | Out-String | Add-Content $ReportFile
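The tables above list every pool, virtual disk, and physical disk, so an unhealthy item has to be spotted by eye. A minimal sketch of a pipeline filter that keeps only the items whose health is not "Healthy" follows. Select-Unhealthy is a hypothetical helper, and it assumes the storage cmdlets expose HealthStatus as a value that converts to the string "Healthy".

```powershell
# Hypothetical filter: passes through only objects that are not healthy.
function Select-Unhealthy {
    param(
        [Parameter(ValueFromPipeline)] $InputObject,
        [string] $Property = 'HealthStatus',
        [string] $Expected = 'Healthy'
    )
    process {
        # String comparison, so enum-like values and plain strings both work.
        if ("$($InputObject.$Property)" -ne $Expected) { $InputObject }
    }
}

# Illustrative input; on a cluster node use e.g.: Get-VirtualDisk | Select-Unhealthy
$disks = @(
    [PSCustomObject]@{ FriendlyName = 'Volume01'; HealthStatus = 'Healthy' }
    [PSCustomObject]@{ FriendlyName = 'Volume02'; HealthStatus = 'Warning' }
)
$disks | Select-Unhealthy | Format-Table -AutoSize
```

An empty result from the filter corresponds to the "All storage pools healthy" and "All virtual disks healthy" checklist items.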

4.2 Health Service Faults

"`nHEALTH SERVICE FAULTS" | Add-Content $ReportFile

$HealthFaults = Get-HealthFault
if ($HealthFaults) {
    "Active Health Faults:" | Add-Content $ReportFile
    $HealthFaults | Format-Table FaultType, FaultingObjectDescription, Reason -AutoSize -Wrap | Out-String | Add-Content $ReportFile
} else {
    "No active health faults detected." | Add-Content $ReportFile
}

4.3 Cluster Shared Volumes

"`nCLUSTER SHARED VOLUMES" | Add-Content $ReportFile

Get-ClusterSharedVolume | ForEach-Object {
    $CSV = $_
    [PSCustomObject]@{
        Name        = $CSV.Name
        State       = $CSV.State
        OwnerNode   = $CSV.OwnerNode.Name
        Path        = $CSV.SharedVolumeInfo.FriendlyVolumeName
        SizeGB      = [math]::Round($CSV.SharedVolumeInfo.Partition.Size / 1GB, 2)
        FreeGB      = [math]::Round($CSV.SharedVolumeInfo.Partition.FreeSpace / 1GB, 2)
        UsedPercent = [math]::Round((1 - ($CSV.SharedVolumeInfo.Partition.FreeSpace / $CSV.SharedVolumeInfo.Partition.Size)) * 100, 1)
    }
} | Format-Table -AutoSize | Out-String | Add-Content $ReportFile
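The UsedPercent arithmetic above can be isolated into a small function, which also makes it easy to guard against a zero-size partition and to flag volumes that are nearly full. This is a sketch only: Get-CsvUsage and the 80 percent warning threshold are hypothetical and should be replaced with the customer's own capacity policy.

```powershell
# Hypothetical helper: computes CSV usage figures from raw byte counts.
function Get-CsvUsage {
    param(
        [Parameter(Mandatory)] [string] $Name,
        [Parameter(Mandatory)] [double] $SizeBytes,
        [Parameter(Mandatory)] [double] $FreeBytes,
        [double] $WarnPercent = 80   # assumption: illustrative threshold only
    )
    # Avoid dividing by zero when the partition size is not reported.
    $used = if ($SizeBytes -gt 0) {
        [math]::Round((1 - $FreeBytes / $SizeBytes) * 100, 1)
    } else { $null }

    [PSCustomObject]@{
        Name        = $Name
        SizeGB      = [math]::Round($SizeBytes / 1GB, 2)
        FreeGB      = [math]::Round($FreeBytes / 1GB, 2)
        UsedPercent = $used
        Warning     = ($null -ne $used -and $used -ge $WarnPercent)
    }
}

# Illustrative values: a 100 GB volume with 10 GB free
Get-CsvUsage -Name 'Volume01' -SizeBytes 100GB -FreeBytes 10GB
```

On a cluster node the byte counts would come from SharedVolumeInfo.Partition.Size and SharedVolumeInfo.Partition.FreeSpace, as in the script above.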

Part 5: Azure Arc Connectivity Validation

5.1 Arc Agent Status

"`n" + "="*40 | Add-Content $ReportFile
"AZURE ARC CONNECTIVITY" | Add-Content $ReportFile
"="*40 | Add-Content $ReportFile

$Nodes = (Get-ClusterNode).Name

foreach ($Node in $Nodes) {
    "Node: $Node" | Add-Content $ReportFile

    $ArcStatus = Invoke-Command -ComputerName $Node -ScriptBlock {
        & "$env:ProgramFiles\AzureConnectedMachineAgent\azcmagent.exe" show 2>&1
    }

    # Parse key values: take the first matching line and strip everything up to the
    # first colon, so timestamps keep their own colons and a missing field yields "".
    $AgentVersion  = ("$($ArcStatus | Select-String 'Agent Version' | Select-Object -First 1)" -replace '^[^:]*:\s*').Trim()
    $ConnStatus    = ("$($ArcStatus | Select-String 'Agent Status' | Select-Object -First 1)" -replace '^[^:]*:\s*').Trim()
    $LastHeartbeat = ("$($ArcStatus | Select-String 'Last Heartbeat' | Select-Object -First 1)" -replace '^[^:]*:\s*').Trim()

    " Agent Version: $AgentVersion" | Add-Content $ReportFile
    " Connection Status: $ConnStatus" | Add-Content $ReportFile
    " Last Heartbeat: $LastHeartbeat" | Add-Content $ReportFile
    "" | Add-Content $ReportFile
}
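The per-field parsing above can also be done once for the whole output by splitting each "Key : Value" line on its first colon. The sketch below assumes that azcmagent show prints one field per line in that form; ConvertFrom-AzcmagentShow is a hypothetical helper, and the sample lines are illustrative rather than captured output.

```powershell
# Hypothetical helper: turns "Key : Value" lines into a hashtable.
function ConvertFrom-AzcmagentShow {
    param([Parameter(Mandatory)] [AllowEmptyString()] [string[]] $Lines)
    $fields = @{}
    foreach ($line in $Lines) {
        # Split on the first colon only so timestamps and URLs stay intact.
        if ($line -match '^\s*([^:]+?)\s*:\s*(.*?)\s*$') {
            $fields[$Matches[1]] = $Matches[2]
        }
    }
    $fields
}

# Illustrative sample, not real agent output
$sample = @(
    'Agent Version          : 1.0.0'
    'Agent Status           : Connected'
    'Agent Last Heartbeat   : 2026-03-24T10:15:00Z'
)
$arc = ConvertFrom-AzcmagentShow -Lines $sample
$arc['Agent Status']
```

Looking fields up by their full name avoids accidental matches when two field names share a substring.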

5.2 Azure Local Registration

"AZURE LOCAL REGISTRATION" | Add-Content $ReportFile

$AzureLocalReg = Get-AzureStackHCI
$AzureLocalReg | Format-List ClusterStatus, RegistrationStatus, ConnectionStatus, LastConnected | Out-String | Add-Content $ReportFile

Part 6: Event Log Review

6.1 Critical Events (Last 24 Hours)

"`n" + "="*40 | Add-Content $ReportFile
"EVENT LOG REVIEW (CRITICAL/ERROR - LAST 24 HOURS)" | Add-Content $ReportFile
"="*40 | Add-Content $ReportFile

$StartTime = (Get-Date).AddHours(-24)
$CriticalEvents = Get-WinEvent -FilterHashtable @{
    LogName   = 'System','Application'
    Level     = 1,2   # Critical and Error
    StartTime = $StartTime
} -MaxEvents 50 -ErrorAction SilentlyContinue

if ($CriticalEvents) {
    $CriticalEvents | Group-Object ProviderName | Sort-Object Count -Descending | Select-Object -First 10 |
        Format-Table @{N='Source';E={$_.Name}}, Count -AutoSize | Out-String | Add-Content $ReportFile

    "Recent Critical/Error Events:" | Add-Content $ReportFile
    $CriticalEvents | Select-Object -First 20 TimeCreated, ProviderName, Id, Message |
        Format-Table -Wrap | Out-String | Add-Content $ReportFile
} else {
    "No critical or error events in the last 24 hours." | Add-Content $ReportFile
}

Part 7: Generate Summary

# Wrap each query in @() so .Count is reliable when zero or one object is returned
$NodeCount = @(Get-ClusterNode | Where-Object State -eq "Up").Count
$TotalNodes = @(Get-ClusterNode).Count
$HealthFaultCount = @(Get-HealthFault).Count
$OfflineResourceCount = @(Get-ClusterResource | Where-Object State -ne "Online").Count
$ArcStatus = (Get-AzureStackHCI).ConnectionStatus

$Summary = @"

================================================================================
INFRASTRUCTURE HEALTH SUMMARY
================================================================================

Validation Category             Status
------------------------------- --------
Cluster Nodes                   $NodeCount of $TotalNodes Up
Cluster Resources               $(if($OfflineResourceCount -eq 0){"All Online"}else{"$OfflineResourceCount Offline"})
Storage Health Faults           $(if($HealthFaultCount -eq 0){"None"}else{"$HealthFaultCount Active"})
Azure Arc Connection            $ArcStatus
Test-Cluster                    See HTML Report

OVERALL STATUS: $(if($NodeCount -eq $TotalNodes -and $HealthFaultCount -eq 0 -and $OfflineResourceCount -eq 0){"PASS"}else{"REVIEW REQUIRED"})

================================================================================
Report saved to: $ReportFile
================================================================================

"@

$Summary | Add-Content $ReportFile
Write-Host $Summary
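The OVERALL STATUS expression above considers node count, health faults, and offline resources, but not the Azure Arc connection that the summary also reports. A minimal sketch of a variant that includes it and names the failed checks follows. Get-OverallStatus is a hypothetical helper, and it assumes that a healthy registration reports a ConnectionStatus of "Connected".

```powershell
# Hypothetical helper: derives PASS / REVIEW REQUIRED from the summary inputs.
function Get-OverallStatus {
    param(
        [int] $NodesUp,
        [int] $TotalNodes,
        [int] $HealthFaultCount,
        [int] $OfflineResourceCount,
        [string] $ArcConnectionStatus
    )
    $checks = [ordered]@{
        'Cluster Nodes'         = ($TotalNodes -gt 0 -and $NodesUp -eq $TotalNodes)
        'Cluster Resources'     = ($OfflineResourceCount -eq 0)
        'Storage Health Faults' = ($HealthFaultCount -eq 0)
        # Assumption: 'Connected' is the healthy value of ConnectionStatus.
        'Azure Arc Connection'  = ($ArcConnectionStatus -eq 'Connected')
    }
    $failed = @($checks.Keys | Where-Object { -not $checks[$_] })
    $status = if ($failed.Count -eq 0) { 'PASS' } else { 'REVIEW REQUIRED' }
    [PSCustomObject]@{ Status = $status; FailedChecks = $failed }
}

# Illustrative values: everything healthy except the Arc connection
$result = Get-OverallStatus -NodesUp 2 -TotalNodes 2 -HealthFaultCount 0 `
    -OfflineResourceCount 0 -ArcConnectionStatus 'Disconnected'
"$($result.Status): $($result.FailedChecks -join ', ')"
```

Listing the failed checks alongside the status tells the reviewer where to start without rereading the whole report.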

Validation Checklist

| Category | Requirement | Status |
|---|---|---|
| Nodes | All nodes in "Up" state | |
| Nodes | No nodes in drain status | |
| Resources | All cluster resources online | |
| Quorum | Quorum established | |
| Storage | All storage pools healthy | |
| Storage | All virtual disks healthy | |
| Storage | No active health faults | |
| CSV | All CSVs online | |
| Arc | All nodes connected | |
| Events | No critical events (last 24h) | |

Common Issues

Node Not Responding

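# Check physical connectivity, then check and restart the cluster service on the node.
# Invoke-Command is used because Restart-Service cannot target a remote computer directly.
Invoke-Command -ComputerName <NodeName> -ScriptBlock { Get-Service -Name ClusSvc }
Invoke-Command -ComputerName <NodeName> -ScriptBlock { Restart-Service -Name ClusSvc }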

Storage Pool Unhealthy

# Check for failed physical disks
Get-PhysicalDisk | Where-Object HealthStatus -ne "Healthy"

# Repair virtual disk
Repair-VirtualDisk -FriendlyName <VirtualDiskName>

Arc Agent Disconnected

# Reconnect the Arc agent (run on the affected node; <Region> is the Azure region of the resource)
azcmagent connect --resource-group <RG> --tenant-id <TenantID> --location <Region> --subscription-id <SubID>

Troubleshooting

Issue: Cluster validation reports node unreachable
Cause: WinRM disabled or network partition
Resolution: Verify connectivity: Test-Connection <node-ip>; enable WinRM: Enable-PSRemoting -Force on the failing node

Issue: Storage pool health shows Degraded
Cause: Physical disk failed or missing
Resolution: Run Get-PhysicalDisk | Where-Object HealthStatus -ne Healthy to identify failed disks; replace and repair

Issue: Arc agent shows Disconnected status
Cause: Agent service stopped or token expired
Resolution: Restart agent: Restart-Service himds; if the issue persists, reconnect: azcmagent connect with fresh credentials

Next Step

Proceed to Task 2: VMFleet Storage Testing once infrastructure validation passes.


Version Control

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0.0 | 2026-03-24 | Azure Local Cloudnology Team | Initial release |