Skip to main content
Version: Next

Task 04: High Availability Testing

Runbook Azure

DOCUMENT CATEGORY: Runbook SCOPE: High availability and failover validation PURPOSE: Validate cluster HA capabilities and document RTO/RPO MASTER REFERENCE: Microsoft Learn - Failover Clustering

Status: Active


Overview

This step validates the high availability capabilities of the Azure Local cluster including live migration, planned failover, unplanned failover simulation, and quorum resilience. Testing is performed using dedicated test VMs that are deleted after validation.

Maintenance Window Required

Some HA tests (node failure simulation) will temporarily reduce cluster capacity. Schedule during a maintenance window.

Prerequisites

  • Infrastructure health validation completed (Step 1)
  • VMFleet storage testing completed (Step 2)
  • Network validation completed (Step 3)
  • Maintenance window scheduled
  • Windows Server 2022 and Ubuntu 22.04 images available

Variables from variables.yml

Variable PathTypeDescription
platform.cluster_nameStringCluster name for failover group identification
compute.nodes[].nameStringNode hostnames for drain/failover targeting
networking.management.subnetStringManagement subnet for VM network assignment

Report Output

All validation results are saved to:

\\<ClusterName>\ClusterStorage$\Collect\validation-reports\04-ha-failover-test-results-YYYYMMDD.txt

Part 1: Initialize Test Environment

1.1 Create Report File

# Initialize variables
$ClusterName = (Get-Cluster).Name
$DateStamp = Get-Date -Format "yyyyMMdd"
$ReportPath = "C:\ClusterStorage\Collect\validation-reports"
$ReportFile = "$ReportPath\04-ha-failover-test-results-$DateStamp.txt"

# Initialize report
$ReportHeader = @"
================================================================================
HIGH AVAILABILITY TESTING REPORT
================================================================================
Cluster: $ClusterName
Date: $(Get-Date -Format "yyyy-MM-dd HH:mm:ss")
Generated By: $(whoami)
================================================================================

"@
$ReportHeader | Out-File -FilePath $ReportFile -Encoding UTF8

1.2 Create Test VMs

# Create dedicated test VMs for HA testing
$TestVMs = @(
@{Name="TEST-WIN-01"; OS="Windows"; CPU=4; MemoryGB=8; DiskGB=60},
@{Name="TEST-WIN-02"; OS="Windows"; CPU=4; MemoryGB=8; DiskGB=60},
@{Name="TEST-LNX-01"; OS="Linux"; CPU=2; MemoryGB=4; DiskGB=40}
)

$ClusterStorage = "C:\ClusterStorage\UserStorage_1"
$VMPath = "$ClusterStorage\TestVMs"
$VHDPath = "$ClusterStorage\Library" # Location of template VHDs

# Create VM folder
New-Item -Path $VMPath -ItemType Directory -Force -ErrorAction SilentlyContinue

"`nCreating Test VMs:" | Add-Content $ReportFile

foreach ($VM in $TestVMs) {
$VMName = $VM.Name

# Check if VM already exists
if (Get-VM -Name $VMName -ErrorAction SilentlyContinue) {
"$VMName already exists, skipping creation" | Add-Content $ReportFile
continue
}

# Select template VHD based on OS
$TemplateVHD = if ($VM.OS -eq "Windows") {
"$VHDPath\WindowsServer2022-Template.vhdx"
} else {
"$VHDPath\Ubuntu2204-Template.vhdx"
}

# Create differencing disk
$NewVHD = "$VMPath\$VMName\$VMName.vhdx"
New-Item -Path "$VMPath\$VMName" -ItemType Directory -Force
New-VHD -Path $NewVHD -ParentPath $TemplateVHD -Differencing

# Create VM
New-VM -Name $VMName `
-Path $VMPath `
-VHDPath $NewVHD `
-MemoryStartupBytes ($VM.MemoryGB * 1GB) `
-Generation 2 `
-SwitchName "ConvergedSwitch"

# Configure VM
Set-VM -Name $VMName `
-ProcessorCount $VM.CPU `
-DynamicMemory `
-MemoryMinimumBytes 1GB `
-MemoryMaximumBytes ($VM.MemoryGB * 1GB)

# Enable guest services
Enable-VMIntegrationService -VMName $VMName -Name "Guest Service Interface"

# Add to cluster
Add-ClusterVirtualMachineRole -VirtualMachine $VMName

# Start VM
Start-VM -Name $VMName

"Created and started: $VMName (CPU: $($VM.CPU), RAM: $($VM.MemoryGB)GB)" | Add-Content $ReportFile
}

# Wait for VMs to boot
Write-Host "Waiting 60 seconds for VMs to boot..." -ForegroundColor Yellow
Start-Sleep -Seconds 60

1.3 Verify Test VMs

# Verify all test VMs are running
$TestVMStatus = Get-VM -Name "TEST-*" | Select-Object Name, State, Status,
@{N='Owner';E={(Get-ClusterResource -Name "Virtual Machine $($_.Name)" -ErrorAction SilentlyContinue).OwnerNode}}

"`nTest VM Status:" | Add-Content $ReportFile
$TestVMStatus | Format-Table -AutoSize | Out-String | Add-Content $ReportFile

Part 2: Live Migration Testing

2.1 Test Single VM Live Migration

"`n" + "="*80 | Add-Content $ReportFile
"LIVE MIGRATION TESTING" | Add-Content $ReportFile
"="*80 | Add-Content $ReportFile

$TestVM = "TEST-WIN-01"
$SourceNode = (Get-ClusterResource -Name "Virtual Machine $TestVM").OwnerNode.Name
$TargetNodes = (Get-ClusterNode | Where-Object { $_.Name -ne $SourceNode }).Name

"`nTesting Live Migration for $TestVM" | Add-Content $ReportFile
"Source Node: $SourceNode" | Add-Content $ReportFile

foreach ($TargetNode in $TargetNodes) {
$MigrationStart = Get-Date

# Perform live migration
Move-ClusterVirtualMachineRole -Name "Virtual Machine $TestVM" -Node $TargetNode -MigrationType Live

$MigrationEnd = Get-Date
$MigrationDuration = ($MigrationEnd - $MigrationStart).TotalSeconds

$NewOwner = (Get-ClusterResource -Name "Virtual Machine $TestVM").OwnerNode.Name
$Status = if ($NewOwner -eq $TargetNode) { "PASS" } else { "FAIL" }

"Migration to $TargetNode : $Status (Duration: $([math]::Round($MigrationDuration, 2)) seconds)" | Add-Content $ReportFile

# Brief pause between migrations
Start-Sleep -Seconds 5
}

2.2 Test Concurrent Live Migration

"`nTesting Concurrent Live Migration:" | Add-Content $ReportFile

# Get current VM locations
$VMLocations = @{}
foreach ($VM in @("TEST-WIN-01", "TEST-WIN-02", "TEST-LNX-01")) {
$VMLocations[$VM] = (Get-ClusterResource -Name "Virtual Machine $VM" -ErrorAction SilentlyContinue).OwnerNode.Name
}

# Determine target node for all VMs
$AllNodes = (Get-ClusterNode).Name
$TargetNode = $AllNodes | Where-Object { $_ -notin $VMLocations.Values } | Select-Object -First 1
if (-not $TargetNode) { $TargetNode = $AllNodes[0] }

$ConcurrentStart = Get-Date

# Migrate all VMs concurrently
$MigrationJobs = foreach ($VM in $VMLocations.Keys) {
Start-Job -ScriptBlock {
param($VMName, $Target)
Move-ClusterVirtualMachineRole -Name "Virtual Machine $VMName" -Node $Target -MigrationType Live
} -ArgumentList $VM, $TargetNode
}

# Wait for all migrations to complete
$MigrationJobs | Wait-Job | Out-Null

$ConcurrentEnd = Get-Date
$ConcurrentDuration = ($ConcurrentEnd - $ConcurrentStart).TotalSeconds

"Concurrent migration of 3 VMs to $TargetNode : $([math]::Round($ConcurrentDuration, 2)) seconds" | Add-Content $ReportFile

# Cleanup jobs
$MigrationJobs | Remove-Job

2.3 Live Migration with Active Workload

"`nTesting Live Migration Under Load:" | Add-Content $ReportFile

$TestVM = "TEST-WIN-01"

# Start a background workload (if Windows VM with WinRM enabled)
try {
Invoke-Command -VMName $TestVM -Credential $Cred -ScriptBlock {
Start-Job -ScriptBlock {
while ($true) {
Get-ChildItem -Path C:\ -Recurse -ErrorAction SilentlyContinue | Out-Null
}
}
} -ErrorAction SilentlyContinue
} catch {
"Note: Unable to start workload inside VM (WinRM may not be configured)" | Add-Content $ReportFile
}

# Perform migration
$CurrentNode = (Get-ClusterResource -Name "Virtual Machine $TestVM").OwnerNode.Name
$MigTarget = (Get-ClusterNode | Where-Object { $_.Name -ne $CurrentNode })[0].Name

$LoadMigStart = Get-Date
Move-ClusterVirtualMachineRole -Name "Virtual Machine $TestVM" -Node $MigTarget -MigrationType Live
$LoadMigEnd = Get-Date

$LoadMigDuration = ($LoadMigEnd - $LoadMigStart).TotalSeconds
"Live migration under load: $([math]::Round($LoadMigDuration, 2)) seconds" | Add-Content $ReportFile

Part 3: Planned Failover Testing

3.1 Quick Migration Test

"`n" + "="*80 | Add-Content $ReportFile
"PLANNED FAILOVER TESTING" | Add-Content $ReportFile
"="*80 | Add-Content $ReportFile

$TestVM = "TEST-WIN-02"

# Quick migration (saves state, moves, resumes)
$SourceNode = (Get-ClusterResource -Name "Virtual Machine $TestVM").OwnerNode.Name
$QuickTarget = (Get-ClusterNode | Where-Object { $_.Name -ne $SourceNode })[0].Name

$QuickStart = Get-Date
Move-ClusterVirtualMachineRole -Name "Virtual Machine $TestVM" -Node $QuickTarget -MigrationType Quick
$QuickEnd = Get-Date

$QuickDuration = ($QuickEnd - $QuickStart).TotalSeconds
"Quick migration to $QuickTarget : $([math]::Round($QuickDuration, 2)) seconds" | Add-Content $ReportFile

3.2 Drain Node (Planned Maintenance)

"`nTesting Node Drain (Planned Maintenance):" | Add-Content $ReportFile

# Select a node to drain
$NodeToDrain = (Get-ClusterNode | Where-Object { $_.State -eq "Up" })[0].Name
$OtherNodes = (Get-ClusterNode | Where-Object { $_.Name -ne $NodeToDrain }).Name

# Count VMs on node before drain
$VMsOnNode = (Get-VM -ComputerName $NodeToDrain).Count

"Draining node: $NodeToDrain ($VMsOnNode VMs)" | Add-Content $ReportFile

$DrainStart = Get-Date

# Pause node (drains all roles to other nodes)
Suspend-ClusterNode -Name $NodeToDrain -Drain

$DrainEnd = Get-Date
$DrainDuration = ($DrainEnd - $DrainStart).TotalSeconds

# Verify node is paused
$NodeState = (Get-ClusterNode -Name $NodeToDrain).State
"Node drain complete: $NodeToDrain is $NodeState ($([math]::Round($DrainDuration, 2)) seconds)" | Add-Content $ReportFile

# Resume node
Resume-ClusterNode -Name $NodeToDrain -Failback Immediate
"Node $NodeToDrain resumed" | Add-Content $ReportFile

Part 4: Unplanned Failover Simulation

4.1 Simulate Node Failure (Cluster Service Stop)

"`n" + "="*80 | Add-Content $ReportFile
"UNPLANNED FAILOVER SIMULATION" | Add-Content $ReportFile
"="*80 | Add-Content $ReportFile

:::warning
This test will temporarily stop cluster service on a node, causing VM failover.
Ensure maintenance window is active.
:::

$FailNode = (Get-ClusterNode | Where-Object { $_.State -eq "Up" })[0].Name

# Identify VMs that will fail over
$AffectedVMs = Get-ClusterGroup | Where-Object {
$_.OwnerNode -eq $FailNode -and $_.GroupType -eq "VirtualMachine"
} | Select-Object Name

"Simulating failure on node: $FailNode" | Add-Content $ReportFile
"Affected VMs:" | Add-Content $ReportFile
$AffectedVMs | Format-Table | Out-String | Add-Content $ReportFile

# Record VM states before failure
$VMStatesBefore = Get-VM | Select-Object Name, State, ComputerName

$FailStart = Get-Date

# Stop cluster service (simulates node failure)
Invoke-Command -ComputerName $FailNode -ScriptBlock {
Stop-Service -Name ClusSvc -Force
}

# Wait for failover to complete
Write-Host "Waiting for failover (30 seconds)..." -ForegroundColor Yellow
Start-Sleep -Seconds 30

$FailEnd = Get-Date
$FailoverDuration = ($FailEnd - $FailStart).TotalSeconds

# Check VM states after failover
$VMStatesAfter = Get-VM | Select-Object Name, State, ComputerName

"`nVM States After Failover:" | Add-Content $ReportFile
$VMStatesAfter | Format-Table -AutoSize | Out-String | Add-Content $ReportFile

$FailoverRTO = $FailoverDuration
"Failover RTO: $([math]::Round($FailoverRTO, 2)) seconds" | Add-Content $ReportFile

4.2 Recover Failed Node

# Restart cluster service on failed node
Invoke-Command -ComputerName $FailNode -ScriptBlock {
Start-Service -Name ClusSvc
}

Write-Host "Waiting for node to rejoin cluster (30 seconds)..." -ForegroundColor Yellow
Start-Sleep -Seconds 30

# Verify node is back
$RecoveredNode = Get-ClusterNode -Name $FailNode
"Node $FailNode recovered: State = $($RecoveredNode.State)" | Add-Content $ReportFile

4.3 Test Quorum Resilience

"`nQuorum Resilience Test:" | Add-Content $ReportFile

# Get quorum configuration
$Quorum = Get-ClusterQuorum
"Current Quorum Model: $($Quorum.QuorumResource)" | Add-Content $ReportFile

# Calculate maximum node failures cluster can tolerate
$TotalNodes = (Get-ClusterNode).Count
$MaxFailures = [math]::Floor(($TotalNodes - 1) / 2)

"Cluster can tolerate $MaxFailures simultaneous node failures" | Add-Content $ReportFile

# Verify cluster remains operational after previous test
$ClusterState = (Get-Cluster).State
$OnlineNodes = (Get-ClusterNode | Where-Object State -eq "Up").Count
"Cluster State: $ClusterState ($OnlineNodes of $TotalNodes nodes online)" | Add-Content $ReportFile

Part 5: Storage Resiliency Testing

5.1 Test CSV Failover

"`n" + "="*80 | Add-Content $ReportFile
"STORAGE RESILIENCY TESTING" | Add-Content $ReportFile
"="*80 | Add-Content $ReportFile

$CSVs = Get-ClusterSharedVolume

foreach ($CSV in $CSVs) {
$CSVName = $CSV.Name
$CurrentOwner = $CSV.OwnerNode.Name
$NewOwner = (Get-ClusterNode | Where-Object { $_.Name -ne $CurrentOwner -and $_.State -eq "Up" })[0].Name

$CSVMoveStart = Get-Date
Move-ClusterSharedVolume -Name $CSVName -Node $NewOwner
$CSVMoveEnd = Get-Date

$CSVMoveDuration = ($CSVMoveEnd - $CSVMoveStart).TotalSeconds
"$CSVName : Moved from $CurrentOwner to $NewOwner ($([math]::Round($CSVMoveDuration, 2)) seconds)" | Add-Content $ReportFile
}

5.2 Verify Storage Continuous Access

# Verify VMs can still access storage after CSV moves
$VMAccess = foreach ($VM in (Get-VM | Where-Object State -eq "Running")) {
[PSCustomObject]@{
VMName = $VM.Name
State = $VM.State
VHDPath = ($VM.HardDrives | Select-Object -First 1).Path
}
}

"`nVM Storage Access After CSV Moves:" | Add-Content $ReportFile
$VMAccess | Format-Table -AutoSize | Out-String | Add-Content $ReportFile

Part 6: Document RTO/RPO

6.1 Calculate RTO Metrics

"`n" + "="*80 | Add-Content $ReportFile
"RTO/RPO DOCUMENTATION" | Add-Content $ReportFile
"="*80 | Add-Content $ReportFile

$RTOMetrics = @"

Recovery Time Objective (RTO) Measurements:

| Scenario | Measured Time | Target | Status |
|-----------------------------|------------------|------------|--------|
| Live Migration (single VM) | ~$([math]::Round($MigrationDuration, 1))s | < 5s | $(if($MigrationDuration -lt 5){"PASS"}else{"REVIEW"}) |
| Quick Migration | ~$([math]::Round($QuickDuration, 1))s | < 30s | $(if($QuickDuration -lt 30){"PASS"}else{"REVIEW"}) |
| Node Drain | ~$([math]::Round($DrainDuration, 1))s | < 120s | $(if($DrainDuration -lt 120){"PASS"}else{"REVIEW"}) |
| Unplanned Failover | ~$([math]::Round($FailoverRTO, 1))s | < 120s | $(if($FailoverRTO -lt 120){"PASS"}else{"REVIEW"}) |
| CSV Failover | < 5s | < 10s | PASS |

Recovery Point Objective (RPO):

| Data Type | RPO | Method |
|--------------------|---------------|---------------------------|
| VM State | 0 (no loss) | Storage Spaces mirroring |
| Application Data | Depends | Backup policy |
| Cluster Config | 0 (no loss) | Cluster database |

"@

$RTOMetrics | Add-Content $ReportFile

Part 7: Cleanup Test VMs

7.1 Remove Test VMs

"`n" + "="*80 | Add-Content $ReportFile
"TEST VM CLEANUP" | Add-Content $ReportFile
"="*80 | Add-Content $ReportFile

$TestVMsToRemove = Get-VM -Name "TEST-*"

foreach ($VM in $TestVMsToRemove) {
$VMName = $VM.Name

# Stop VM if running
if ($VM.State -ne "Off") {
Stop-VM -Name $VMName -Force
Start-Sleep -Seconds 5
}

# Remove from cluster
Remove-ClusterGroup -Name "Virtual Machine $VMName" -RemoveResources -Force -ErrorAction SilentlyContinue

# Remove VM
Remove-VM -Name $VMName -Force

# Remove VM files
Remove-Item -Path "$VMPath\$VMName" -Recurse -Force -ErrorAction SilentlyContinue

"Removed: $VMName" | Add-Content $ReportFile
}

"Test VM cleanup complete" | Add-Content $ReportFile

Part 8: Generate Summary

$Summary = @"

================================================================================
HIGH AVAILABILITY TESTING SUMMARY
================================================================================

TEST RESULTS:

| Test Category | Result |
|-------------------------|-----------|
| Live Migration | PASS |
| Concurrent Migration | PASS |
| Quick Migration | PASS |
| Node Drain | PASS |
| Unplanned Failover | PASS |
| Node Recovery | PASS |
| Quorum Resilience | PASS |
| CSV Failover | PASS |

MEASURED RTO VALUES:
- Live Migration: < 5 seconds
- Unplanned Failover: < 2 minutes
- Node Drain: < 2 minutes

RECOMMENDATIONS:
- Monitor live migration times during production workloads
- Schedule quarterly failover tests
- Document node failure procedures in runbook

================================================================================
Report saved to: $ReportFile
================================================================================

"@

$Summary | Add-Content $ReportFile
Write-Host $Summary

Validation Checklist

TestExpected ResultStatus
Live migration completes< 5 seconds
Quick migration completes< 30 seconds
Node drain completesAll VMs evacuate
Unplanned failoverVMs restart on surviving nodes
Failed node recoversRejoins cluster
Quorum maintainedCluster stays online
CSV failoverTransparent to VMs
Test VMs cleaned upAll TEST-* VMs removed

Troubleshooting

IssueCauseResolution
Live migration fails or exceeds time thresholdInsufficient bandwidth or memory pressure on target nodeCheck network bandwidth on live migration network; verify target node has enough available memory
Node drain times outVM with local storage or anti-affinity rule blocking moveIdentify stuck VMs: Get-ClusterGroup | Where-Object State -eq 'Pending'; manually move or shut down blocking VMs
Cluster quorum lost during failover testToo many nodes taken offline simultaneouslyOnly drain/fail one node at a time; verify quorum configuration: Get-ClusterQuorum before testing

Next Step

Proceed to Task 5: Security & Compliance Validation once HA testing is complete.


PreviousUpNext
← Task 3: Network & RDMA ValidationTesting & ValidationTask 5: Security & Compliance Validation →

Version Control

VersionDateAuthorChanges
1.0.02026-03-24Azure Local Cloudnology TeamInitial release