stress-ng: Troubleshooting

stress-ng Not Found on Target Node

Symptom: Start-StressNgTest.ps1 returns bash: stress-ng: command not found from one or more nodes.

Resolution:

# Ubuntu
sudo apt-get install -y stress-ng

# Rocky Linux / RHEL
sudo dnf install -y epel-release && sudo dnf install -y stress-ng

See Installation for the version verification procedure across all nodes.
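Before reinstalling, it can be worth confirming whether the binary is really missing or only absent from the PATH that a non-interactive SSH session sees. The following is a minimal sketch meant to be run on the node itself; have_cmd and report_stress_ng are hypothetical helper names, not part of the repository scripts.

```bash
# have_cmd NAME: succeed if NAME resolves to a command on the current PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# report_stress_ng: print the installed version, or show the PATH that was
# searched if the binary cannot be found. (Hypothetical helper.)
report_stress_ng() {
    if have_cmd stress-ng; then
        stress-ng --version
    else
        echo "stress-ng not found on PATH: $PATH"
        return 1
    fi
}
```

Running report_stress_ng through the same SSH invocation the test script uses shows the PATH that the script actually gets, which can differ from an interactive login.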

OOM Killer Terminates Workers Mid-Run

Symptom: The run completes but the per-node YAML contains fewer stressor entries than expected, or Collect-StressNgResults.ps1 reports a node_count mismatch.

Diagnostic:

ssh azurelocaladmin@hci01-node1 "dmesg | grep -i 'Killed process' | tail -5"

If stress-ng appears in the output, the OOM killer fired.
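To make the same check scriptable, the kernel log can be piped through a small filter that counts only the kills that name a stress-ng worker. This is a sketch under the assumption that the kernel message contains the process name in the usual "Killed process PID (name)" form; count_stressng_oom_kills is a hypothetical helper.

```bash
# count_stressng_oom_kills: read kernel log text on stdin and print how many
# "Killed process" lines mention a stress-ng worker. Always exits 0, so a
# count of zero is not treated as an error.
count_stressng_oom_kills() {
    grep -i 'Killed process' | grep -c 'stress-ng' || true
}

# Example usage (not executed here):
#   ssh azurelocaladmin@hci01-node1 dmesg | count_stressng_oom_kills
```

A non-zero count means at least one worker was killed during the period the kernel ring buffer still covers.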

Resolution:

# Check available memory before the run
ssh azurelocaladmin@hci01-node1 "free -h"

Then either reduce workers in memory-stress.yml (e.g., from 4 to 2) or add temporary swap:

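# Create and enable an 8 GiB swap file for the duration of the run
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# After the run, disable and delete the temporary swap file
sudo swapoff /swapfile && sudo rm /swapfile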
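A rough way to decide between fewer workers and extra swap is to compare the total memory the workers would request against what the node reports as available. The sketch below assumes each worker allocates a fixed amount; the per-worker figure is a hypothetical input, since the layout of memory-stress.yml is not shown here.

```bash
# mem_available_kb: print MemAvailable from /proc/meminfo, in KiB.
mem_available_kb() {
    awk '/^MemAvailable:/ { print $2 }' /proc/meminfo
}

# workers_fit WORKERS PER_WORKER_KB AVAILABLE_KB:
# succeed if WORKERS * PER_WORKER_KB does not exceed AVAILABLE_KB.
workers_fit() {
    [ $(( $1 * $2 )) -le "$3" ]
}

# Example usage (not executed here), assuming 2 GiB per worker:
#   workers_fit 4 2097152 "$(mem_available_kb)" || echo "reduce workers or add swap"
```

The comparison ignores memory used by other stressors running at the same time, so treat a narrow pass as a warning rather than a guarantee.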

Bogo-ops Count Is Zero for One Stressor

Symptom: The aggregate JSON shows total_bogo_ops: 0 for a specific stressor (e.g., hdd).

Cause: The hdd stressor writes temporary files to /tmp. If that filesystem is full, every write returns an error and no operations complete.

Diagnostic:

ssh azurelocaladmin@hci01-node1 "df -h /tmp"
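If the check needs to run unattended before each test, the use percentage can be extracted from the POSIX-format df output and compared against a threshold. This is a sketch; use_pct_from_df and tmp_is_nearly_full are hypothetical helpers and the 95% default is an arbitrary illustration, not a value taken from the test suite.

```bash
# use_pct_from_df: read `df -P` output on stdin and print the Capacity
# column of the first data row as a bare integer (e.g. 87).
use_pct_from_df() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# tmp_is_nearly_full [PATH] [THRESHOLD]: succeed if the filesystem holding
# PATH (default /tmp) is at or above THRESHOLD percent (default 95).
tmp_is_nearly_full() {
    [ "$(df -P "${1:-/tmp}" | use_pct_from_df)" -ge "${2:-95}" ]
}

# Example usage (not executed here):
#   ssh azurelocaladmin@hci01-node1 "df -P /tmp" | use_pct_from_df
```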

Resolution:

# Clean up residual test files from previous runs
ssh azurelocaladmin@hci01-node1 "rm -rf /tmp/stress-ng-results/ && df -h /tmp"

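If /tmp is a separate tmpfs, increase its size in /etc/fstab and remount it (sudo mount -o remount /tmp) for the change to take effect. Because tmpfs is backed by RAM and swap, leave enough headroom for the memory stressors: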

tmpfs  /tmp  tmpfs  defaults,size=16G  0 0

CPU Throttling: Low Bogo-ops Compared to Baseline

Symptom: avg_bogo_ops_per_sec for the cpu stressor is 20%+ below a previous baseline run on identical hardware.

Investigation:

# Check CPU frequency counter on the affected node
Get-Counter "\Processor Information(_Total)\% of Maximum Frequency" `
    -ComputerName "hci01-node1" -SampleInterval 5 -MaxSamples 6 |
    ForEach-Object { $_.CounterSamples | Select-Object CookedValue }

If the frequency is below 80% of rated, the stressng_cpu_throttling alert will have fired. Review:

BIOS power profile (set to Maximum Performance, not Balanced)
Chassis airflow / fan curves
turbostat on the Linux node: sudo turbostat --quiet --show CPU,Bzy_MHz,Avg_MHz --interval 5
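On a Linux node the 80% check can also be approximated from the cpufreq sysfs files. The sketch below assumes the node exposes /sys/devices/system/cpu/cpu0/cpufreq (often absent inside VMs) and uses cpuinfo_max_freq as the reference; that value normally includes turbo, so substitute the rated base frequency if you know it. The helper names are hypothetical.

```bash
# freq_pct CUR_KHZ MAX_KHZ: print the current frequency as an integer
# percentage of the reference frequency (integer division, rounds down).
freq_pct() {
    echo $(( $1 * 100 / $2 ))
}

# below_threshold CUR_KHZ MAX_KHZ [PCT]: succeed if the current frequency is
# below PCT percent (default 80) of the reference frequency.
below_threshold() {
    [ "$(freq_pct "$1" "$2")" -lt "${3:-80}" ]
}

# Example usage on the node (not executed here; paths are an assumption):
#   d=/sys/devices/system/cpu/cpu0/cpufreq
#   below_threshold "$(cat $d/scaling_cur_freq)" "$(cat $d/cpuinfo_max_freq)" \
#       && echo "cpu0 is running below 80% of its maximum frequency"
```

A single sample taken while the node is idle will read low on most governors, so take the reading while the cpu stressor is running.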

YAML Parse Failure: PowerShell Key Access Error

Symptom: Collect-StressNgResults.ps1 throws The property 'stress-ng' cannot be found on this object.

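Cause: PowerShell cannot access a hyphenated property with unquoted dot notation, because the hyphen is parsed as an operator rather than as part of the name.

Resolution: Quote the hyphenated key. Single quotes are preferred because they rule out any string interpolation:

# Wrong: the hyphen is not treated as part of the property name
$metrics = $parsed.stress-ng.metrics

# Correct: the quoted key is read as a single property name
$metrics = $parsed.'stress-ng'.metrics

This is documented in Reporting. If you customised the collection script and introduced unquoted dot notation, revert to quoted access.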

SSH Connectivity Issues

Symptom: Start-StressNgTest.ps1 fails with Permission denied (publickey).

See the Operations Troubleshooting Guide for full SSH key resolution steps. Quick check:

ssh -i "$env:USERPROFILE\.ssh\azurelocal_rsa" -o BatchMode=yes `
    "azurelocaladmin@hci01-node1" "echo ok"

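If this returns ok, the key is correct. If it returns Permission denied, the key has not been distributed to that node; re-run ssh-copy-id. Because BatchMode=yes disables password prompts, a hang or timeout points to a network or firewall problem rather than a missing key.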
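When several nodes are involved, the same probe can be looped from a Linux or WSL control host. This is a sketch only: probe_node and check_nodes are hypothetical helpers, the key path mirrors the one used above, and the 5-second ConnectTimeout is an arbitrary choice so that unreachable nodes fail quickly instead of hanging.

```bash
# Private key to use; override by exporting SSH_KEY before calling.
SSH_KEY="${SSH_KEY:-$HOME/.ssh/azurelocal_rsa}"

# probe_node HOST: print "ok" if key-based login to HOST works without any
# prompt. -n keeps ssh from consuming stdin; errors are discarded.
probe_node() {
    ssh -n -i "$SSH_KEY" -o BatchMode=yes -o ConnectTimeout=5 \
        "azurelocaladmin@$1" "echo ok" 2>/dev/null
}

# check_nodes HOST...: print one "<host>: ok" or "<host>: FAILED" line per host.
check_nodes() {
    for node in "$@"; do
        if [ "$(probe_node "$node")" = "ok" ]; then
            echo "$node: ok"
        else
            echo "$node: FAILED"
        fi
    done
}

# Example usage (not executed here):
#   check_nodes hci01-node1
```

A FAILED line does not distinguish between a rejected key and an unreachable host; rerun the single-node command with -v against the failing node to see which it is.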

Inconsistent Bogo-ops Across Nodes (Same Hardware)

Symptom: Node bogo-ops/sec values differ by more than 10% across nodes with identical CPU and RAM.
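To put a number on the difference, the per-node values can be reduced to a single spread figure. The sketch below assumes the spread is measured as (max - min) / max; the source does not say which definition the 10% figure uses, so adjust the formula if your baseline is the mean. spread_pct is a hypothetical helper.

```bash
# spread_pct VALUE...: print (max - min) / max as a percentage with one
# decimal place. Prints nothing if the maximum is not positive.
spread_pct() {
    printf '%s\n' "$@" | awk '
        NR == 1 { min = $1 + 0; max = $1 + 0 }
        {
            v = $1 + 0
            if (v < min) min = v
            if (v > max) max = v
        }
        END { if (max > 0) printf "%.1f\n", (max - min) * 100 / max }'
}

# Example usage with illustrative bogo-ops/sec values:
#   spread_pct 1000 950 900
```

A result above 10 corresponds to the symptom described here and is a cue to work through the causes listed next.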

Common Causes:

Cause	Diagnostic	Fix
Background process (backup, AV scan)	`Get-Process -ComputerName node` sorted by CPU	Schedule maintenance windows outside test runs
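NUMA imbalance	`numactl --hardware` on each node	Pin `workers` to the CPUs of a single NUMA node, e.g. `stress-ng --cpu 16 --taskset 0-15` (assuming CPUs 0-15 belong to the same node)
Different BIOS or OS power profiles per node	`powercfg /query` for the OS power plan; compare BIOS profiles in firmware setup or the BMC	Set consistent BIOS/hypervisor policy across all nodes
VM density difference	Compare VM counts per node (one node may be running extra VMs)	Drain excess VMs before running stress-ng tests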