
VMFleet Monitoring

Tool: VMFleet Category: Tool Guide

The monitoring subsystem collects real-time storage, network, and compute metrics during VMFleet test execution. It uses a hybrid architecture: local PowerShell/PerfMon collection (always active) with optional push to Azure Monitor/Log Analytics.

Monitoring Architecture

Cluster nodes (PerfMon counters)
        │
PowerShell Get-Counter (remote sessions)
        ├──► Local CSV/JSON files (results/{run-id}/metrics/)
        └──► Azure Monitor / Log Analytics (optional)
                └──► Azure Portal dashboards, alerts
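
The remote-collection leg of this pipeline can be sketched as follows. The node names, output path, and the `Cluster CSVFS` counter path are illustrative assumptions; the real collectors read node names from the cluster config:

```powershell
# Sketch: sample a PerfMon counter remotely from each cluster node and
# append the samples as JSON lines. Counter path and node names are
# assumptions to verify on your cluster (Get-Counter -ListSet).
$nodes = 'hci-node-01', 'hci-node-02'
$samples = Get-Counter -ComputerName $nodes `
    -Counter '\Cluster CSVFS(*)\Reads/sec' -SampleInterval 5 -MaxSamples 1

foreach ($s in $samples.CounterSamples) {
    [pscustomobject]@{
        timestamp = $samples.Timestamp.ToUniversalTime().ToString('o')
        node      = ($s.Path -split '\\')[2]   # machine name from the counter path
        counter   = 'CSVFS_ReadIOPS'
        value     = [math]::Round($s.CookedValue)
    } | ConvertTo-Json -Compress | Add-Content 'results/run-001/metrics/storage/csvfs.jsonl'
}
```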

Live Console Monitoring

Use the built-in watcher during active tests:

.\src\solutions\vmfleet\scripts\Watch-VMFleetMonitor.ps1 `
    -ClusterConfigPath "config/clusters/my-cluster.yml" `
    -RefreshIntervalSeconds 5 `
    -DurationMinutes 30

The watcher prints rolling VM count, read/write IOPS, read/write latency, and throughput.

Metric Categories

Storage Metrics

Collected by Collect-StorageMetrics.ps1:

| Counter | Description |
|---|---|
| CSVFS: Read IOPS | Cluster Shared Volume read operations per second |
| CSVFS: Write IOPS | Cluster Shared Volume write operations per second |
| CSVFS: Read Throughput (MB/s) | Aggregate read bandwidth across CSVs |
| CSVFS: Write Throughput (MB/s) | Aggregate write bandwidth across CSVs |
| CSVFS: Read Latency (ms) | Average read latency per I/O operation |
| CSVFS: Write Latency (ms) | Average write latency per I/O operation |
| S2D Cache Hit Ratio | Percentage of I/O served from cache tier |
| Physical Disk Queue Depth | Outstanding I/O requests per physical disk |
| Storage Pool Health | Degraded/healthy status of storage pools |
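
These counters can also be inspected interactively with `Get-Counter`. The `Cluster CSVFS` set name and the `Avg. sec/Read` path below are assumptions; list the set first to confirm the exact names on your nodes:

```powershell
# Sketch: discover and sample CSVFS latency counters on the local node.
# Counter set/path names are assumptions - verify with -ListSet first.
Get-Counter -ListSet 'Cluster CSVFS' | Select-Object -ExpandProperty Counter

(Get-Counter '\Cluster CSVFS(*)\Avg. sec/Read').CounterSamples |
    ForEach-Object { '{0}: {1:N2} ms' -f $_.InstanceName, ($_.CookedValue * 1000) }
```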

Network Metrics

Collected by Collect-NetworkMetrics.ps1:

| Counter | Description |
|---|---|
| RDMA Activity: Bytes Sent/sec | RDMA network adapter transmit throughput |
| RDMA Activity: Bytes Received/sec | RDMA network adapter receive throughput |
| SMB Direct: Bytes Sent/sec | SMB Direct (RDMA) data transfer rate |
| Network Adapter: Throughput (Gbps) | Total NIC throughput per adapter |
| Network Adapter: Packets Dropped | Dropped packets indicating congestion or errors |
| TCP: Retransmits/sec | TCP retransmission rate (indicates network issues) |
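
A quick spot check of RDMA throughput can be sketched like this; the `RDMA Activity` counter paths are assumptions to confirm with `Get-Counter -ListSet 'RDMA Activity'`:

```powershell
# Sketch: sample RDMA send/receive throughput for all RDMA-capable NICs.
# Counter paths are assumptions - verify the set on your hardware first.
$counters =
    '\RDMA Activity(*)\RDMA Inbound Bytes/sec',
    '\RDMA Activity(*)\RDMA Outbound Bytes/sec'
Get-Counter -Counter $counters -SampleInterval 5 -MaxSamples 3
```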

Compute Metrics

Collected by Collect-ComputeMetrics.ps1:

| Counter | Description |
|---|---|
| Processor: % Total CPU | Host CPU utilization across all logical processors |
| Hyper-V Hypervisor Logical Processor: % Total Run Time | Hypervisor-level CPU consumption |
| Guest vCPU Utilization | Per-VM CPU usage (via VMFleet Watch-FleetCPU) |
| Memory: Available MBytes | Free physical memory on each host |
| Memory: % Committed Bytes In Use | Memory pressure indicator |
| Hyper-V Dynamic Memory: Current Pressure | Dynamic memory pressure per VM |
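
Comparing host-level and hypervisor-level CPU is a common sanity check; both paths below are the standard Windows/Hyper-V counter names:

```powershell
# Sketch: sample host CPU alongside hypervisor logical-processor run time.
# If the two diverge sharply, guest load is hidden from the host's view.
$counters =
    '\Processor(_Total)\% Processor Time',
    '\Hyper-V Hypervisor Logical Processor(_Total)\% Total Run Time'
Get-Counter -Counter $counters -SampleInterval 5 -MaxSamples 3
```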

Local Metric Collection

Local collection is always active during test runs:

# Collect all metric categories with 5-second intervals
.\tools\vmfleet\monitoring\Collect-StorageMetrics.ps1 `
    -ClusterConfig "config/clusters/my-cluster.yml" `
    -SampleIntervalSeconds 5 `
    -OutputPath "results/run-001/metrics/storage/"

.\tools\vmfleet\monitoring\Collect-NetworkMetrics.ps1 `
    -ClusterConfig "config/clusters/my-cluster.yml" `
    -SampleIntervalSeconds 5 `
    -OutputPath "results/run-001/metrics/network/"

.\tools\vmfleet\monitoring\Collect-ComputeMetrics.ps1 `
    -ClusterConfig "config/clusters/my-cluster.yml" `
    -SampleIntervalSeconds 5 `
    -OutputPath "results/run-001/metrics/compute/"

Output is JSON Lines (one JSON object per line) for machine parsing:

{"timestamp":"2026-02-13T10:05:00Z","node":"hci-node-01","counter":"CSVFS_ReadIOPS","value":45230}
{"timestamp":"2026-02-13T10:05:00Z","node":"hci-node-01","counter":"CSVFS_WriteIOPS","value":12840}
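
A file in this format can be summarized directly in PowerShell; the file path is illustrative:

```powershell
# Sketch: per-counter average and maximum from a JSON-lines metrics file.
Get-Content 'results/run-001/metrics/storage/csvfs.jsonl' |
    ForEach-Object { $_ | ConvertFrom-Json } |
    Group-Object counter |
    ForEach-Object {
        [pscustomobject]@{
            Counter = $_.Name
            Avg     = [math]::Round(($_.Group.value | Measure-Object -Average).Average, 1)
            Max     = ($_.Group.value | Measure-Object -Maximum).Maximum
        }
    }
```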

Azure Monitor Integration

Optionally push collected metrics to Azure Monitor for centralized dashboards and alerting:

.\tools\vmfleet\monitoring\Push-MetricsToAzureMonitor.ps1 `
    -MetricsPath "results/run-001/metrics/" `
    -WorkspaceId "your-log-analytics-workspace-id" `
    -CredentialSource KeyVault

Each pushed metric contains these core fields:

  • timestamp
  • node
  • counter_name
  • value
  • run_id
  • profile_name

Note

Azure Monitor integration requires the monitoring.bicep infrastructure template to be deployed. See tools/vmfleet/infrastructure/bicep/monitoring.bicep.

Alert Rules

Predefined alert thresholds in monitoring/alerts/alert-rules.yml:

alerts:
  - name: high_storage_latency
    metric: csvfs_write_latency_ms
    threshold: 50
    severity: warning
    description: "Write latency exceeds 50 ms: possible storage bottleneck"

  - name: critical_cpu
    metric: host_cpu_percent
    threshold: 95
    severity: critical
    description: "Host CPU exceeds 95%: test may be compute-bound"
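
The same thresholds can be evaluated offline against collected JSON lines; the file path and the metric-to-counter mapping below are assumptions:

```powershell
# Sketch: check the high_storage_latency rule against collected samples.
# Assumes csvfs_write_latency_ms maps to the CSVFS_WriteLatencyMs counter.
$threshold = 50
$breaches = Get-Content 'results/run-001/metrics/storage/csvfs.jsonl' |
    ForEach-Object { $_ | ConvertFrom-Json } |
    Where-Object { $_.counter -eq 'CSVFS_WriteLatencyMs' -and $_.value -gt $threshold }

if ($breaches) {
    Write-Warning ('high_storage_latency: {0} samples above {1} ms' -f @($breaches).Count, $threshold)
}
```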

Real-Time Dashboard

During or after test execution, generate a dashboard report from the collected metrics:

# Export a metrics dashboard report (pair with VMFleet's Watch-FleetCluster for a live console view)
.\tools\vmfleet\monitoring\Export-MetricsDashboard.ps1 `
    -InputPath "results/run-001/metrics/" `
    -OutputPath "reports/run-001/" `
    -Title "VMFleet Run 001"

Counter Reference

| Category | Counter name | Purpose |
|---|---|---|
| Storage | CSVFS_ReadIOPS | CSVFS read operations per second |
| Storage | CSVFS_WriteIOPS | CSVFS write operations per second |
| Storage | CSVFS_ReadMBps | Read throughput |
| Storage | CSVFS_WriteMBps | Write throughput |
| Storage | CSVFS_ReadLatencyMs | Read latency |
| Storage | CSVFS_WriteLatencyMs | Write latency |
| Compute | HostCpuPercent | Host CPU saturation |
| Compute | HostAvailableMemoryMB | Available memory headroom |
| Compute | HyperVLogicalProcessorRunTime | Hypervisor CPU pressure |

KQL Examples

Average read/write IOPS per minute:

VMFleetMetrics_CL
| where CounterName_s in ("CSVFS_ReadIOPS", "CSVFS_WriteIOPS")
| summarize AvgValue=avg(Value_d) by bin(TimeGenerated, 1m), CounterName_s
| render timechart

95th-percentile read/write latency per minute:

VMFleetMetrics_CL
| where CounterName_s in ("CSVFS_ReadLatencyMs", "CSVFS_WriteLatencyMs")
| summarize P95=percentile(Value_d, 95) by bin(TimeGenerated, 1m), CounterName_s
| render timechart