Version: Next

Task 03: Test DR Procedures

Runbook Azure

DOCUMENT CATEGORY: Runbook
SCOPE: DR testing and validation
PURPOSE: Validate failover and failback procedures without impacting production
MASTER REFERENCE: Microsoft Learn - Test Failover to Azure

Status: Active


Regular disaster recovery testing validates that replication is working correctly and that your organization can recover within defined RTO (Recovery Time Objective) and RPO (Recovery Point Objective) targets. This step covers test failover procedures, validation steps, and cleanup.

Prerequisites

| Requirement | Description | Validation |
|-------------|-------------|------------|
| Replication configured | VMs replicating to Azure | Step 2 complete |
| Initial sync complete | All VMs show 100% synced | Replicated items status |
| Test network available | Isolated Azure VNet for testing | No production connectivity |
| Recovery plan | Multi-VM orchestration plan | Created in Site Recovery |
| Change window | Approved maintenance window | For production failover tests |

Variables from variables.yml

| Variable | Config Path | Example |
|----------|-------------|---------|
| AZURE_RESOURCE_GROUP | azure.resource_group.name | rg-azurelocal-prod-eus2 |
| AZURE_REGION | azure.resource_group.location | eastus2 |
| RECOVERY_VAULT_NAME | dr.recovery_vault_name | rsv-azl-dal-dr-01 |
| DR_TEST_VNET_NAME | dr.test_vnet_name | vnet-dr-test-isolated |
| RECOVERY_PLAN_NAME | dr.recovery_plan_name | RP-CriticalApps-DAL |
| TARGET_RTO | dr.target_rto_minutes | 240 |
| TARGET_RPO | dr.target_rpo_minutes | 15 |
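The Config Path column reads as a dotted path into the nested structure of variables.yml. A minimal sketch of that lookup follows, assuming the file has already been parsed into a nested dictionary. The `CONFIG` literal is a hypothetical stand-in built from the example values in the table, and `resolve` and `load_variables` are illustrative names, not part of any existing tooling.

```python
from typing import Any, Mapping

# Hypothetical stand-in for the parsed contents of variables.yml.
# Values are the examples from the table above.
CONFIG = {
    "azure": {
        "resource_group": {"name": "rg-azurelocal-prod-eus2", "location": "eastus2"},
    },
    "dr": {
        "recovery_vault_name": "rsv-azl-dal-dr-01",
        "test_vnet_name": "vnet-dr-test-isolated",
        "recovery_plan_name": "RP-CriticalApps-DAL",
        "target_rto_minutes": 240,
        "target_rpo_minutes": 15,
    },
}

# Variable name -> config path, as listed in the table.
VARIABLE_PATHS = {
    "AZURE_RESOURCE_GROUP": "azure.resource_group.name",
    "AZURE_REGION": "azure.resource_group.location",
    "RECOVERY_VAULT_NAME": "dr.recovery_vault_name",
    "DR_TEST_VNET_NAME": "dr.test_vnet_name",
    "RECOVERY_PLAN_NAME": "dr.recovery_plan_name",
    "TARGET_RTO": "dr.target_rto_minutes",
    "TARGET_RPO": "dr.target_rpo_minutes",
}


def resolve(config: Mapping[str, Any], dotted_path: str) -> Any:
    """Walk a nested mapping along a dotted path such as 'dr.test_vnet_name'."""
    node: Any = config
    for key in dotted_path.split("."):
        node = node[key]  # A KeyError here means the path is missing from the file.
    return node


def load_variables(config: Mapping[str, Any]) -> dict:
    """Resolve every runbook variable from the parsed configuration."""
    return {name: resolve(config, path) for name, path in VARIABLE_PATHS.items()}


variables = load_variables(CONFIG)
print(variables["RECOVERY_VAULT_NAME"], variables["TARGET_RTO"])
```

Failing loudly on a missing key is deliberate here: a typo in variables.yml is cheaper to catch before a test failover than during one.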

Test Failover Overview

Test failover creates VMs in Azure from the latest recovery point without affecting production:

| Failover Type | Impact | Use Case |
|---------------|--------|----------|
| Test failover | None (creates isolated copy) | Regular DR testing |
| Planned failover | Source VMs shut down first | Planned migration or DR |
| Unplanned failover | Immediate failover | Actual disaster scenario |

Best Practice

Run test failovers at least quarterly to validate DR readiness and train operations staff.

Test Failover Procedures

Step 3.1: Create Isolated Test Network

Create a virtual network isolated from production for safe testing:

  1. Navigate to Virtual networks → Create
  2. Configure:
     • Name: {{DR_TEST_VNET_NAME}}
     • Region: Same as Recovery Services vault
     • Address space: Non-overlapping CIDR
  3. Do NOT configure VNet peering or VPN gateways
  4. Click Create

Network Isolation

Never connect the test network to production. Test VMs are copies of production machines, so their duplicate hostnames and domain identities could conflict with the originals if the networks are connected.
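The "non-overlapping CIDR" requirement from Step 3.1 can be checked before the VNet is created. The sketch below uses Python's standard `ipaddress` module; the address ranges are hypothetical examples, and `overlapping` is an illustrative helper name.

```python
import ipaddress


def overlapping(candidate: str, existing: list) -> list:
    """Return the existing CIDR blocks that overlap the candidate address space."""
    candidate_net = ipaddress.ip_network(candidate)
    return [
        cidr for cidr in existing
        if candidate_net.overlaps(ipaddress.ip_network(cidr))
    ]


# Hypothetical production address spaces (assumptions for illustration only).
production_spaces = ["10.0.0.0/16", "10.1.0.0/16"]

print(overlapping("10.0.8.0/24", production_spaces))    # conflicts with 10.0.0.0/16
print(overlapping("10.250.0.0/24", production_spaces))  # no conflict
```

An empty result means the candidate range is safe to use for the test network under the listed address spaces.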

Step 3.2: Run Test Failover for Single VM

Test individual VMs before testing recovery plans:

  1. Navigate to Recovery Services vault → Replicated items
  2. Select the VM to test
  3. Click Test Failover
  4. Configure:
     • Recovery point: Latest processed (recommended) or specific point
     • Azure virtual network: {{DR_TEST_VNET_NAME}}
  5. Click OK to start test failover
  6. Monitor progress in Site Recovery jobs

Step 3.3: Validate Test VM

Once test failover completes, validate the VM:

Connect to Test VM

  1. Navigate to Virtual machines in Azure Portal
  2. Find the test VM (named {VMName}-test)
  3. Assign a public IP if needed:
     • Go to VM → Networking → NIC → IP configurations
     • Associate a public IP
  4. Connect via RDP or SSH

Validation Checklist

| Validation | Command/Check | Expected Result |
|------------|---------------|-----------------|
| VM boots successfully | RDP/SSH connection | Login successful |
| Network connectivity | Test-NetConnection | Internal ping works |
| Disk volumes mounted | Get-Volume | All drives present |
| Services running | Get-Service | Critical services started |
| Application health | App-specific checks | Application responds |
| Data integrity | Verify recent data | Latest data present |
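To turn the checklist into a single Passed/Failed value per VM, the results can be recorded and summarized. This is a minimal sketch, assuming a VM passes only when every checklist item was run and succeeded; `CheckResult` and `summarize` are illustrative names.

```python
from dataclasses import dataclass

# Checklist items, taken from the Validation column above.
CHECKS = (
    "VM boots successfully",
    "Network connectivity",
    "Disk volumes mounted",
    "Services running",
    "Application health",
    "Data integrity",
)


@dataclass
class CheckResult:
    check: str
    passed: bool
    note: str = ""


def summarize(results):
    """Return (status, failed checks, checks that were never recorded)."""
    recorded = {r.check: r for r in results}
    missing = [c for c in CHECKS if c not in recorded]
    failed = [c for c in CHECKS if c in recorded and not recorded[c].passed]
    status = "Passed" if not missing and not failed else "Failed"
    return status, failed, missing


# Hypothetical results for one test VM: one failure, one check not yet run.
results = [CheckResult(c, True) for c in CHECKS[:4]]
results.append(CheckResult("Application health", False, "HTTP 503 from health endpoint"))
print(summarize(results))
```

Treating an unrecorded check as a failure keeps a partially validated VM from being reported as recovered.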

Application-Specific Validation

# Example: Validate SQL Server (Invoke-Sqlcmd requires the SqlServer module)
Get-Service -Name 'MSSQLSERVER'
Invoke-Sqlcmd -Query "SELECT @@VERSION" -ServerInstance "localhost"

# Example: Validate IIS
Get-Service -Name 'W3SVC'
Invoke-WebRequest -Uri "http://localhost" -UseBasicParsing

# Example: Validate Domain Controller
# dcdiag /q prints only errors, so no output means the tests passed
Get-Service -Name 'NTDS'
dcdiag /q

Step 3.4: Document Test Results

Record test results for compliance and improvement:

## DR Test Report - {{TEST_DATE}}

### Test Summary
- **Test Type**: Test Failover
- **VMs Tested**: {{VM_LIST}}
- **Recovery Point**: {{RECOVERY_POINT_TIME}}
- **Test Duration**: {{DURATION_MINUTES}} minutes

### Results
| VM Name | Failover Time | Boot Time | Validation | Status |
|---------|---------------|-----------|------------|--------|
| VM1 | 5 min | 3 min | Passed | ✅ |
| VM2 | 7 min | 4 min | Passed | ✅ |

### Issues Identified
- Issue 1: [Description and resolution]

### RTO/RPO Analysis
- **Actual RTO**: {{ACTUAL_RTO}} minutes
- **Target RTO**: {{TARGET_RTO}} minutes
- **Actual RPO**: {{ACTUAL_RPO}} minutes (data age)
- **Target RPO**: {{TARGET_RPO}} minutes

### Recommendations
1. [Improvement recommendation]
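The RTO/RPO Analysis fields of the report can be derived from three timestamps captured during the test. The sketch below assumes that actual RTO is measured from failover start to the moment the service is validated, and that actual RPO is the age of the recovery point at failover start. Both definitions are assumptions for illustration, and the timestamps are hypothetical.

```python
from datetime import datetime


def minutes_between(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in minutes."""
    return (end - start).total_seconds() / 60


def rto_rpo_analysis(recovery_point, failover_started, service_validated,
                     target_rto=240, target_rpo=15):
    """Compare measured recovery time and data age against the targets (minutes)."""
    actual_rto = minutes_between(failover_started, service_validated)
    actual_rpo = minutes_between(recovery_point, failover_started)
    return {
        "ACTUAL_RTO": round(actual_rto, 1),
        "TARGET_RTO": target_rto,
        "RTO_MET": actual_rto <= target_rto,
        "ACTUAL_RPO": round(actual_rpo, 1),
        "TARGET_RPO": target_rpo,
        "RPO_MET": actual_rpo <= target_rpo,
    }


# Hypothetical timestamps from a single test failover.
analysis = rto_rpo_analysis(
    recovery_point=datetime(2025, 1, 15, 8, 52),
    failover_started=datetime(2025, 1, 15, 9, 0),
    service_validated=datetime(2025, 1, 15, 9, 12),
)
print(analysis)
```

The default targets mirror the TARGET_RTO and TARGET_RPO example values; pass the values from variables.yml when they differ.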

Step 3.5: Cleanup Test Failover

Always clean up test failovers to avoid orphaned resources and charges:

  1. Navigate to Recovery Services vault → Replicated items
  2. Select the VM that was tested
  3. Click Cleanup test failover
  4. Add notes about the test (optional)
  5. Check Testing is complete. Delete test failover virtual machine(s)
  6. Click OK

Cleanup Required

VMs marked for test failover will show a warning banner until cleanup is completed. Cleanup removes the test VM and associated resources.

Recovery Plan Testing

Step 3.6: Test Recovery Plan

Test multi-VM failover with proper sequencing:

  1. Navigate to Recovery Services vault → Recovery Plans
  2. Select {{RECOVERY_PLAN_NAME}}
  3. Click Test failover
  4. Configure:
     • Recovery point: Latest processed
     • Azure virtual network: {{DR_TEST_VNET_NAME}}
  5. Click OK
  6. Monitor failover sequence:
     • Group 1 fails over first
     • Each group completes before the next starts
     • Pre/post actions execute at appropriate times

Step 3.7: Validate Recovery Plan Sequence

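| Group | VMs | Expected Boot Order | Dependencies |
|-------|-----|---------------------|--------------|
| 1 | Domain Controllers | First | None |
| 2 | SQL Servers | After Group 1 | AD available |
| 3 | App Servers | After Group 2 | Database available |
| 4 | Web Servers | After Group 3 | App tier available |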
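The sequence in the table can be sanity-checked by confirming that every tier sits in a later group than the tiers it depends on. This is a minimal sketch; the mapping of the Dependencies column to tier names ("AD available" meaning Domain Controllers, and so on) is an interpretation, and `order_violations` is an illustrative name.

```python
# Group number per tier, from the table above.
GROUPS = {
    "Domain Controllers": 1,
    "SQL Servers": 2,
    "App Servers": 3,
    "Web Servers": 4,
}

# Tier-level reading of the Dependencies column (an interpretation).
DEPENDS_ON = {
    "Domain Controllers": [],
    "SQL Servers": ["Domain Controllers"],
    "App Servers": ["SQL Servers"],
    "Web Servers": ["App Servers"],
}


def order_violations(groups, depends_on):
    """List (tier, dependency) pairs where the dependency does not boot in an earlier group."""
    return [
        (tier, dep)
        for tier, deps in depends_on.items()
        for dep in deps
        if groups[dep] >= groups[tier]
    ]


print(order_violations(GROUPS, DEPENDS_ON))  # the plan as documented

# A misconfigured plan where the app tier starts before the database tier.
misordered = dict(GROUPS, **{"SQL Servers": 3, "App Servers": 2})
print(order_violations(misordered, DEPENDS_ON))
```

Running the same check against the group assignments exported from the actual recovery plan would catch a VM that was placed in the wrong group.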

Cleanup Recovery Plan Test

  1. Navigate to Recovery Plans → Select plan
  2. Click Cleanup test failover
  3. Confirm cleanup of all VMs in the plan

Production Failover (Planned)

For a planned failover during an actual DR event or a migration:

Production Impact

Planned and unplanned failovers affect production. Only execute during approved change windows or actual disasters.

Step 3.8: Execute Planned Failover

  1. Pre-failover checklist:
     • Change approval obtained
     • Stakeholders notified
     • DNS TTL reduced (optional)
     • Recent recovery point available
  2. Navigate to Replicated items or Recovery Plans
  3. Click Failover (not Test Failover)
  4. Configure:
     • Recovery point: Latest or specific point
     • Shut down machine before failover: Yes (for planned)
  5. Click OK
  6. After failover completes:
     • Verify VMs in Azure
     • Update DNS records
     • Validate applications
     • Commit the failover
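The pre-failover checklist can be expressed as a gate that lists whatever still blocks the failover. The sketch below assumes that a "recent" recovery point means one no older than the RPO target (15 minutes in the examples). That threshold is an assumption, and `failover_blockers` is an illustrative name. The optional DNS TTL item is left out because it does not block the failover.

```python
from datetime import datetime, timedelta


def failover_blockers(change_approved, stakeholders_notified,
                      latest_recovery_point, now, max_age_minutes=15):
    """Return the mandatory checklist items that are not yet satisfied."""
    blockers = []
    if not change_approved:
        blockers.append("Change approval not obtained")
    if not stakeholders_notified:
        blockers.append("Stakeholders not notified")
    if now - latest_recovery_point > timedelta(minutes=max_age_minutes):
        blockers.append(f"Latest recovery point is older than {max_age_minutes} minutes")
    return blockers


# Hypothetical state shortly before a planned failover.
now = datetime(2025, 1, 15, 22, 0)
print(failover_blockers(True, True, datetime(2025, 1, 15, 21, 50), now))   # ready
print(failover_blockers(True, False, datetime(2025, 1, 15, 21, 0), now))   # two blockers
```

An empty list means the mandatory checklist items are satisfied.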

Commit Failover

After validating successful failover:

  1. Navigate to Replicated items → Select VM
  2. Click Commit
  3. Committing finalizes the failover and deletes the stored recovery points, so the recovery point can no longer be changed

Failback Procedures

Step 3.9: Plan Failback

After on-premises infrastructure is restored:

| Failback Option | Description | Use Case |
|-----------------|-------------|----------|
| Minimize downtime | Syncs data before failover | Production workloads |
| Full download | Downloads entire disk | On-premises VM deleted |
| Create new VM | Creates new VM on-premises | Original VM unavailable |

Step 3.10: Execute Failback

  1. Prepare on-premises environment:
     • Restore Hyper-V hosts
     • Reinstall Site Recovery provider if needed
     • Verify network connectivity
  2. Re-protect Azure VMs:
     • Navigate to Replicated items → Select VM
     • Click Re-protect
     • Replication reverses: Azure → On-premises
  3. Execute failback:
     • After reverse replication sync completes
     • Click Failover with direction Recovery to Primary
     • Select recovery point
     • Execute failover
  4. Commit and re-protect:
     • Validate on-premises VM
     • Commit failback
     • Re-enable Azure replication for future DR

Testing Schedule

Recommended testing frequency:

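| Test Type | Frequency | Duration | Participants |
|-----------|-----------|----------|--------------|
| Single VM test | Monthly | 2-4 hours | DR team |
| Recovery plan test | Quarterly | 4-8 hours | DR team, App owners |
| Full DR drill | Annually | 1-2 days | All stakeholders |
| Documentation review | Quarterly | 2 hours | DR team |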
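The frequencies above translate directly into next-due dates. The sketch below uses only the standard library and reads Monthly, Quarterly, and Annually as 1, 3, and 12 calendar months. The dates are hypothetical, and `add_months` and `next_due` are illustrative names.

```python
import calendar
from datetime import date

# Interval in calendar months for each test type in the schedule above.
INTERVAL_MONTHS = {
    "Single VM test": 1,
    "Recovery plan test": 3,
    "Full DR drill": 12,
    "Documentation review": 3,
}


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping to the last day of shorter months."""
    index = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(index, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def next_due(test_type: str, last_run: date) -> date:
    """Date by which the given test type should be run again."""
    return add_months(last_run, INTERVAL_MONTHS[test_type])


# Hypothetical last-run dates.
print(next_due("Single VM test", date(2024, 1, 31)))
print(next_due("Recovery plan test", date(2024, 11, 15)))
```

Clamping to the month end means a test last run on the 31st falls due on the final day of a shorter month instead of raising an error.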

Validation

Post-Test Validation

| Component | Verification | Expected Result |
|-----------|--------------|-----------------|
| Test cleanup | No orphaned VMs | Test VMs deleted |
| Replication health | Replicated items status | Healthy, Protected |
| Recovery point currency | Latest recovery point | Recent timestamp |
| Test records | Documentation complete | Report filed |
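The "no orphaned VMs" check can be sketched as a filter over an exported list of VM names. It relies on the {VMName}-test naming noted in Step 3.3; the VM names below are hypothetical, and a production VM whose real name happens to end in "-test" would be flagged as well.

```python
def orphaned_test_vms(vm_names, suffix="-test"):
    """Return VM names that still carry the test failover suffix."""
    return sorted(name for name in vm_names if name.lower().endswith(suffix))


# Hypothetical inventory exported after cleanup.
inventory = ["SQL01", "APP01-test", "WEB01", "DC01-Test"]
print(orphaned_test_vms(inventory))
```

An empty list is the expected result after cleanup has completed for every tested VM.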

Monitor for Issues

// Query Site Recovery events in Log Analytics
AzureDiagnostics
| where Category == "AzureSiteRecoveryEvents"
| where TimeGenerated > ago(24h)
| where Level == "Error" or Level == "Warning"
| project TimeGenerated, OperationName, Message, Resource
| order by TimeGenerated desc

Troubleshooting

| Symptom | Likely Cause | Resolution |
|---------|--------------|------------|
| Test failover fails | Quota exceeded | Request Azure quota increase |
| VM fails to boot | Boot order or drivers | Check boot diagnostics, update drivers |
| No network connectivity | NSG rules | Check NSG allows test traffic |
| App validation fails | Missing dependencies | Verify all app-tier VMs included |
| Cleanup incomplete | Azure resource lock | Remove locks, retry cleanup |
| Long failover time | Large disk sizes | Consider managed disks, SSD |

Metrics and Reporting

Key DR Metrics

| Metric | Definition | Target |
|--------|------------|--------|
| RTO | Time to restore service | < 4 hours |
| RPO | Maximum data loss | < 15 minutes |
| Test success rate | % of tests passed | 100% |
| Mean time to recover | Average recovery time | < 2 hours |
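Test success rate and mean time to recover can be computed from the filed test records. This is a minimal sketch with hypothetical records; it assumes the mean is taken over successful tests only, which is one possible reading of "average recovery time", and `dr_metrics` is an illustrative name.

```python
from statistics import mean


def dr_metrics(tests):
    """Compute success rate (%) and mean recovery time (minutes) from test records."""
    if not tests:
        raise ValueError("at least one test record is required")
    recovered = [t["recovery_minutes"] for t in tests if t["passed"]]
    return {
        "success_rate_pct": 100 * len(recovered) / len(tests),
        # Assumption: failed tests have no meaningful recovery time, so they are excluded.
        "mean_time_to_recover_min": mean(recovered) if recovered else None,
    }


# Hypothetical test history.
history = [
    {"name": "Q1 recovery plan test", "passed": True, "recovery_minutes": 90},
    {"name": "Q2 recovery plan test", "passed": True, "recovery_minutes": 110},
    {"name": "Q3 recovery plan test", "passed": True, "recovery_minutes": 100},
    {"name": "Q4 recovery plan test", "passed": False, "recovery_minutes": None},
]

metrics = dr_metrics(history)
print(metrics)
print("MTTR target met:", metrics["mean_time_to_recover_min"] < 120)
```

With these records the success rate falls short of the 100% target while the mean recovery time stays under the two-hour target.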

Sample Report Query

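# Get failover history
# Assumes the Az.RecoveryServices module is installed and Connect-AzAccount has been run
$vault = Get-AzRecoveryServicesVault -Name '{{RECOVERY_VAULT_NAME}}' -ResourceGroupName '{{AZURE_RESOURCE_GROUP}}'
Set-AzRecoveryServicesAsrVaultContext -Vault $vault

$jobs = Get-AzRecoveryServicesAsrJob |
    Where-Object { $_.JobType -like "*Failover*" } |
    Select-Object ScenarioName, State, StartTime, EndTime,
        @{N='Duration';E={($_.EndTime - $_.StartTime).TotalMinutes}}

# -NoTypeInformation keeps the type header out of the CSV on Windows PowerShell 5.1
$jobs | Export-Csv "DR-Test-Report-$(Get-Date -Format 'yyyy-MM').csv" -NoTypeInformation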

Variables Reference

| Variable | Description | Example |
|----------|-------------|---------|
| {{DR_TEST_VNET_NAME}} | Isolated test network | vnet-dr-test-isolated |
| {{RECOVERY_PLAN_NAME}} | Recovery plan name | RP-CriticalApps-DAL |
| {{TARGET_RTO}} | RTO target in minutes | 240 |
| {{TARGET_RPO}} | RPO target in minutes | 15 |

Next Steps

After completing DR testing:

  1. ➡️ Phase 20: Security and Governance — Configure security baselines
  2. Schedule next quarterly DR test
  3. Update DR runbooks based on lessons learned
  4. Review and update RTO/RPO targets if needed
  5. Train new staff on DR procedures