Azure instance CPU hog
Azure instance CPU hog is an Azure chaos fault that drives CPU utilization to CPU_LOAD percent (or saturates CPU_CORES cores when CPU_LOAD=0) on each VM listed in AZURE_INSTANCE_NAMES (in RESOURCE_GROUP, subscription AZURE_SUBSCRIPTION_ID) for TOTAL_CHAOS_DURATION seconds. The fault runs a stress workload inside the target VM via the Azure VM run-command extension, then terminates it when the duration ends.
Use this fault to test how a workload behaves when compute headroom shrinks: whether latency stays inside the SLA, whether the OS throttles correctly, whether autoscaling responds, and whether monitoring detects CPU saturation within the alerting SLA.
If you have not configured the chaos infrastructure yet, go to Quickstart to install the chaos infrastructure and run an experiment end to end.
Use cases
Run this fault when you want to answer concrete questions like:
- CPU pressure: When CPU utilization climbs, does application latency stay inside the SLA?
- Autoscaling fidelity: Does VMSS autoscale, AKS HPA, or App Service autoscale add capacity inside the alerting SLA?
- OS scheduling: Do critical processes (kubelet, agent daemons) keep getting CPU during the saturation window?
- Monitoring fidelity: Do alerts on
Percentage CPUin Azure Monitor fire inside the alerting SLA?
Prerequisites
- Kubernetes version: 1.21 or later for the chaos infrastructure cluster.
- Target VMs reachable: Each entry in
AZURE_INSTANCE_NAMESexists inRESOURCE_GROUPand is inrunningstate. - VM Agent and run-command enabled: The Azure VM Agent must be running on the target VM and the run-command extension must be reachable.
- Azure credentials available: A service principal File Secret in Harness Secret Manager, workload identity on AKS, or managed identity on the AKS node pool.
- RBAC granted: The principal includes the role listed below.
Supported environments
| Platform | Support status |
|---|---|
| Standalone Linux VMs | Supported |
| Standalone Windows VMs | Supported (use Windows CPU stress for native Windows support) |
| VMSS instances | Supported (set SCALE_SET=enable) |
| AKS worker nodes (VMSS-backed) | Supported with SCALE_SET=enable |
Permissions required
The Azure principal used by the chaos pod needs the following role on the target resource group or subscription.
Recommended built-in role: Virtual Machine Contributor
Custom role (minimum actions):
{
"Name": "Harness Chaos VM Stress",
"Actions": [
"Microsoft.Compute/virtualMachines/read",
"Microsoft.Compute/virtualMachines/runCommand/action",
"Microsoft.Compute/virtualMachines/runCommands/read",
"Microsoft.Compute/virtualMachines/runCommands/write",
"Microsoft.Compute/virtualMachines/runCommands/delete",
"Microsoft.Compute/virtualMachineScaleSets/virtualMachines/runCommand/action"
],
"AssignableScopes": ["/subscriptions/<SUBSCRIPTION_ID>"]
}
Go to Azure fault permissions to read the full permission catalog.
Authentication
Pick one of the following methods. Go to Azure authentication methods to read the full setup.
| Method | When to use it | How to configure |
|---|---|---|
| Service principal | Chaos infrastructure runs outside AKS, or you want explicit static credentials | Upload the service principal JSON file as a File Secret in Harness Secret Manager and reference it via AZURE_AUTHENTICATION_SECRET |
| Workload identity | Chaos infrastructure runs on AKS with workload identity enabled | Annotate the chaos infra service account with azure.workload.identity/client-id |
| Managed identity | Chaos infrastructure runs on AKS with a managed identity on the node pool | No tunable changes; the pod inherits the identity from IMDS |
Fault tunables
Configure the following fault parameters when you add Azure instance CPU hog to an experiment in Chaos Studio. Defaults are shown for reference.
Required parameters
| Tunable | Description | Default |
|---|---|---|
AZURE_INSTANCE_NAMES | Comma-separated list of VM names. | (required) |
RESOURCE_GROUP | Resource group that contains the VMs. | (required) |
Stress parameters
| Tunable | Description | Default |
|---|---|---|
CPU_CORES | Number of CPU cores to stress. Ignored when CPU_LOAD > 0. | 500 |
CPU_LOAD | Target CPU utilization percentage (0-100). Set to 0 to saturate CPU_CORES instead. | 0 |
Chaos parameters
| Tunable | Description | Default |
|---|---|---|
TOTAL_CHAOS_DURATION | Total duration of the fault in seconds. | 30 |
CHAOS_INTERVAL | Delay in seconds between successive iterations when running for more than one cycle. | 30 |
SCALE_SET | Set to enable when the VMs belong to a Virtual Machine Scale Set. | "" |
SEQUENCE | Order in which multiple instances are stressed: parallel or serial. | parallel |
RAMP_TIME | Wait period in seconds before and after the fault. Go to ramp time to read how it is applied. | 0 |
Authentication
| Tunable | Description | Default |
|---|---|---|
AZURE_SUBSCRIPTION_ID | Target Azure subscription ID. Required when using workload identity or managed identity. | "" |
AZURE_CLIENT_ID | Client ID of a user-assigned managed identity. | "" |
AZURE_AUTHENTICATION_SECRET | Identifier of the File Secret in Harness Secret Manager that contains the service principal JSON. | "" |
Tunables that apply to every fault are documented in common tunables for all faults.
Fault execution in brief
Uses the Azure VM run-command extension to launch a CPU stress workload on each VM in AZURE_INSTANCE_NAMES, driving utilization to CPU_LOAD percent across CPU_CORES cores for TOTAL_CHAOS_DURATION seconds, then terminates the workload.
Expected behavior during fault execution
- CPU utilization on the affected VMs climbs to
CPU_LOADpercent (or saturatesCPU_COREScores) for the duration. - Application latency increases proportionally to the load.
- Azure Monitor
Percentage CPUreflects the spike. - After the duration ends, the stress workload exits and CPU utilization returns to baseline.
The chaos pod terminates the stress workload through a follow-up run-command call. CPU utilization returns to baseline within seconds.
Signals to watch
Attach resilience probes to assert each layer:
- CPU on the VM: Use a Prometheus probe on
node_cpu_seconds_total(Linux) orwindows_cpu_time_total(Windows) and assert the spike is observed. - Application latency: Use an HTTP probe on the user-visible endpoint and assert p95 stays inside the SLA.
Verify the fault execution effect
-
Inspect Azure Monitor CPU metric.
az monitor metrics list \--resource /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm> \--metric "Percentage CPU" \--interval PT1MYou should see the spike during the chaos window.
-
SSH into the VM and run
top.The stress process should be visible during the chaos window.
Recovery and cleanup
- End of duration: The chaos pod terminates the stress workload via run-command.
- Abort the experiment: Stopping the experiment from Chaos Studio also terminates the workload.
- Manual recovery: SSH into the VM and kill the stress process if it survived.
- Workload recovery: Latency returns to baseline within seconds once stress stops.
Limitations
- VM Agent dependency: The Azure VM Agent and run-command extension must be running on the target VM.
- Same-subscription targeting: A single experiment targets one
AZURE_SUBSCRIPTION_ID. - CPU_LOAD vs CPU_CORES:
CPU_LOAD > 0applies to all available cores;CPU_LOAD=0saturates exactlyCPU_COREScores. - OS guest dependency: Stress runs inside the guest OS; OS limits (cgroups, ulimits) apply.
Troubleshooting
Azure instance CPU hog fails with run-command failed in Harness Chaos Engineering
The Azure VM Agent must be running and the run-command extension must be installed. Verify with az vm get-instance-view -g <rg> -n <vm> --query 'instanceView.vmAgent.statuses'. Reinstall the VM Agent if the status is not Ready.
Azure instance CPU hog fails with AuthorizationFailed
The Azure principal is missing Microsoft.Compute/virtualMachines/runCommand/action. Assign Virtual Machine Contributor (or a custom role with the runCommand action) on the target resource group or subscription.
CPU spike not visible in Azure Monitor
Azure Monitor metrics aggregate at 1-minute granularity. For short chaos windows, inspect node-level metrics through the OS (top/Windows Task Manager) or a Prometheus node exporter scrape.
Related faults
- Azure instance memory hog: Stress memory instead of CPU.
- Azure instance IO stress: Stress disk IO instead of CPU.
- Azure instance stop: Stop the VM instead of stressing it.