Performance Baseline¶
This domain captures short-horizon operational signals that help teams understand the current state of the environment.
It is a baseline, not a full long-term observability platform.
What Ranger Collects¶
The performance-baseline domain should document:
- host CPU and memory utilization snapshots
- storage IOPS, throughput, latency, cache, and health-adjacent signals
- network throughput and error indicators
- active alert or Health Service summaries
- recent event patterns that materially affect interpretation of the current state
Manifest Sub-Domains¶
The v1 collector writes to these named sections of the performance manifest domain:
| Sub-domain | Content |
|---|---|
nodes |
Per-node CPU utilization, memory pressure, and uptime snapshot |
compute |
Cluster-wide vCPU and memory allocation vs. capacity with overcommit indicators |
storage |
Storage IOPS, throughput, latency, and cache hit-ratio baselines |
networking |
Network throughput and adapter error counts |
outliers |
Nodes or workloads that significantly exceed utilization norms |
events |
Recent Health Service events and cluster health faults |
summary |
Aggregate performance risk signals — high-CPU nodes, high-memory nodes, and alert counts |
Current Collector Depth¶
Current v1 collection also covers:
- RDMA and host-network performance counters where available.
- CSV cache and storage-latency context used for platform health interpretation.
- Event-log aggregation and outlier detection across the nodes.
- Point-in-time performance baselines that feed management and technical report sections.
Why It Matters¶
Ranger should not stop at static inventory. A short-horizon operational baseline helps explain whether a healthy-looking design is currently under stress, degraded, or simply under-observed.
Connectivity and Credentials¶
| Requirement | Purpose |
|---|---|
| WinRM / PowerShell remoting | Host-side performance and event data |
| Cluster credential | Required |
| Optional Azure credential | Useful for Azure Monitor, metrics, alerts, and HCI Insights correlation |
Default Behavior¶
This domain should run by default when cluster credentials are available.
Azure-side observability overlays should be partial or skipped when Azure credentials are missing.
Variant Behavior¶
Hyperconverged¶
Baseline host, storage, and network metrics apply directly.
Disconnected Operations¶
Monitoring paths differ because local control-plane observability matters more than public Azure telemetry assumptions.
Multi-Rack Preview¶
Performance interpretation may need to respect managed networking and SAN-backed storage rather than assuming standard hyperconverged behaviors.
Example Manifest Data¶
A successful collect produces entries like this:
{
"id": "managementPerformance",
"status": "success",
"domains": {
"performance": {
"nodes": [
{ "host": "tplabs-01-n01", "cpuUsagePercent": 12.4, "memoryUsedGB": 148,
"memoryTotalGB": 256, "uptimeDays": 14 },
{ "host": "tplabs-01-n02", "cpuUsagePercent": 8.1, "memoryUsedGB": 132,
"memoryTotalGB": 256, "uptimeDays": 14 }
],
"storage": {
"iopsRead": 4821, "iopsWrite": 2310,
"latencyReadMs": 0.4, "latencyWriteMs": 0.6,
"cacheHitRatePercent": 94.2
},
"outliers": [],
"summary": { "highCpuNodeCount": 0, "highMemoryNodeCount": 0, "activeAlertCount": 0 }
}
}
}
Common Findings¶
| Finding | Severity | What it means |
|---|---|---|
| Node CPU utilization above 80% | Warning | Host is under significant compute load at time of collection; investigate workload distribution |
| Node memory utilization above 90% | Warning | Memory pressure detected; workload consolidation may be needed |
| Storage latency above threshold | Warning | Read or write latency is elevated; investigate cache hit rate and disk health |
| Storage cache hit rate below 80% | Warning | Cache is not effective; working set may exceed cache capacity |
| Active Health Service alerts | Warning | Cluster has open health faults; investigate before treating run data as a clean baseline |
Partial Status¶
status: partial on the performance collector means:
- Per-node CPU/memory snapshots succeeded on some nodes but not others
- Azure Monitor overlay (HCI Insights, DCR state) failed while local performance counters succeeded — local baseline is still valid
- Event-log aggregation timed out while compute and storage metrics succeeded
Local performance snapshots are collected independently of Azure Monitor. Azure Monitor absence does not invalidate the on-premises performance baseline.
Domain Dependencies¶
Depends on the cluster-and-node domain for a node list. Azure Monitor overlay sections also depend on valid targets.azure configuration and an active Azure credential.
Evidence Boundaries¶
- Direct discovery: host-side utilization and event snapshots
- Azure-side discovery: Azure Monitor, alerts, DCRs, and HCI Insights context
- Manual/imported evidence: operator-supplied baselines or SLO context when needed
v1 and Future Boundaries¶
v1 should provide a useful snapshot and anomaly framing.
It should not promise long-term trending, forecasting, or full APM-style workload performance analysis.