Azure instance CPU hog

Last updated on Jun 22, 2026

Azure instance CPU hog is an Azure chaos fault that drives CPU utilization to CPU_LOAD percent (or saturates CPU_CORES cores when CPU_LOAD=0) on each VM listed in AZURE_INSTANCE_NAMES (in RESOURCE_GROUP, subscription AZURE_SUBSCRIPTION_ID) for TOTAL_CHAOS_DURATION seconds. The fault runs a stress workload inside the target VM via the Azure VM run-command extension, then terminates it when the duration ends.

Use this fault to test how a workload behaves when compute headroom shrinks: whether latency stays inside the SLA, whether the OS throttles correctly, whether autoscaling responds, and whether monitoring detects CPU saturation within the alerting SLA.

Run your first experiment

If you have not configured the chaos infrastructure yet, go to Quickstart to install the chaos infrastructure and run an experiment end to end.

Use cases

Run this fault when you want to answer concrete questions like:

CPU pressure: When CPU utilization climbs, does application latency stay inside the SLA?
Autoscaling fidelity: Does VMSS autoscale, AKS HPA, or App Service autoscale add capacity inside the alerting SLA?
OS scheduling: Do critical processes (kubelet, agent daemons) keep getting CPU during the saturation window?
Monitoring fidelity: Do alerts on Percentage CPU in Azure Monitor fire inside the alerting SLA?

Prerequisites

Kubernetes version: 1.21 or later for the chaos infrastructure cluster.
Target VMs reachable: Each entry in AZURE_INSTANCE_NAMES exists in RESOURCE_GROUP and is in the running state.
VM Agent and run-command enabled: The Azure VM Agent must be running on the target VM and the run-command extension must be reachable.
Azure credentials available: A service principal File Secret in Harness Secret Manager, workload identity on AKS, or managed identity on the AKS node pool.
RBAC granted: The principal is assigned the role listed under Permissions required.
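
The VM-state prerequisite can be checked from a shell before the experiment starts. The following is a minimal sketch, assuming the Azure CLI is installed and logged in; the function names and the example resource group and VM names are hypothetical and are not part of the fault itself.

```shell
#!/usr/bin/env bash
# Sketch: pre-flight check that every target VM is running.
# Assumes the Azure CLI (az) is installed and already logged in.

# Split a comma-separated AZURE_INSTANCE_NAMES value into one name per line.
split_instance_names() {
  printf '%s\n' "$1" | tr ',' '\n' | tr -d ' \t' | sed '/^$/d'
}

# Print the power-state code of one VM, for example "PowerState/running".
vm_power_state() {  # args: resource group, VM name
  az vm get-instance-view -g "$1" -n "$2" \
    --query "instanceView.statuses[?starts_with(code, 'PowerState/')].code | [0]" \
    -o tsv
}

# Return non-zero as soon as one VM is not running.
preflight() {  # args: resource group, comma-separated VM names
  local vm state
  for vm in $(split_instance_names "$2"); do
    state="$(vm_power_state "$1" "$vm")"
    if [ "$state" != "PowerState/running" ]; then
      echo "not ready: $vm (${state:-unknown})" >&2
      return 1
    fi
    echo "ok: $vm"
  done
}

# Usage (hypothetical names):
#   preflight my-rg "vm-a,vm-b"
```

The check covers only the power state; the VM Agent status can be inspected separately with the az vm get-instance-view query shown under Troubleshooting.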

Supported environments

Platform	Support status
Standalone Linux VMs	Supported
Standalone Windows VMs	Supported (use Windows CPU stress for native Windows support)
VMSS instances	Supported (set `SCALE_SET=enable`)
AKS worker nodes (VMSS-backed)	Supported with `SCALE_SET=enable`

Permissions required

The Azure principal used by the chaos pod needs the following role on the target resource group or subscription.

Recommended built-in role: Virtual Machine Contributor

Custom role (minimum actions):

{
  "Name": "Harness Chaos VM Stress",
  "Actions": [
    "Microsoft.Compute/virtualMachines/read",
    "Microsoft.Compute/virtualMachines/runCommand/action",
    "Microsoft.Compute/virtualMachines/runCommands/read",
    "Microsoft.Compute/virtualMachines/runCommands/write",
    "Microsoft.Compute/virtualMachines/runCommands/delete",
    "Microsoft.Compute/virtualMachineScaleSets/virtualMachines/runCommand/action"
  ],
  "AssignableScopes": ["/subscriptions/<SUBSCRIPTION_ID>"]
}

Go to Azure fault permissions to read the full permission catalog.
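
One way to register and assign the custom role is through the Azure CLI. The sketch below assumes the role definition above is saved as role.json in the current directory with a real subscription ID in AssignableScopes; the function names and the principal ID argument are hypothetical placeholders.

```shell
#!/usr/bin/env bash
# Sketch: create the custom role and assign it to the chaos principal.
# Assumes role.json holds the role definition and az is logged in.

# Build the assignment scope: a subscription, or a resource group inside it.
role_scope() {  # args: subscription ID, optional resource group
  if [ -n "${2:-}" ]; then
    printf '/subscriptions/%s/resourceGroups/%s\n' "$1" "$2"
  else
    printf '/subscriptions/%s\n' "$1"
  fi
}

# Create the role, then assign it at the chosen scope.
grant_chaos_role() {  # args: principal object ID, subscription ID, optional resource group
  az role definition create --role-definition @role.json
  az role assignment create \
    --assignee "$1" \
    --role "Harness Chaos VM Stress" \
    --scope "$(role_scope "$2" "${3:-}")"
}

# Usage (placeholders):
#   grant_chaos_role <PRINCIPAL_OBJECT_ID> <SUBSCRIPTION_ID> <RESOURCE_GROUP>
```

Scoping the assignment to the resource group rather than the subscription keeps the principal's reach limited to the VMs under test.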

Authentication

Pick one of the following methods. Go to Azure authentication methods to read the full setup.

Method	When to use it	How to configure
Service principal	Chaos infrastructure runs outside AKS, or you want explicit static credentials	Upload the service principal JSON file as a File Secret in Harness Secret Manager and reference it via `AZURE_AUTHENTICATION_SECRET`
Workload identity	Chaos infrastructure runs on AKS with workload identity enabled	Annotate the chaos infra service account with `azure.workload.identity/client-id`
Managed identity	Chaos infrastructure runs on AKS with a managed identity on the node pool	No tunable changes; the pod inherits the identity from IMDS
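
For the workload identity method, the annotation can be applied with kubectl. This is a minimal sketch; the namespace and service account names in the usage line are hypothetical and depend on how the chaos infrastructure was installed.

```shell
#!/usr/bin/env bash
# Sketch: annotate the chaos infrastructure service account for workload identity.
# Assumes kubectl points at the AKS cluster that hosts the chaos infrastructure.

annotate_chaos_sa() {  # args: namespace, service account name, identity client ID
  kubectl annotate serviceaccount "$2" -n "$1" \
    "azure.workload.identity/client-id=$3" --overwrite
}

# Usage (hypothetical names):
#   annotate_chaos_sa hce chaos-infra-sa <CLIENT_ID>
```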

Fault tunables

Configure the following fault parameters when you add Azure instance CPU hog to an experiment in Chaos Studio. Defaults are shown for reference.

Required parameters

Tunable	Description	Default
`AZURE_INSTANCE_NAMES`	Comma-separated list of VM names.	(required)
`RESOURCE_GROUP`	Resource group that contains the VMs.	(required)

Stress parameters

Tunable	Description	Default
`CPU_CORES`	Number of CPU cores to stress. Ignored when `CPU_LOAD > 0`.	`500`
`CPU_LOAD`	Target CPU utilization percentage (0-100). Set to `0` to saturate `CPU_CORES` instead.	`0`

Chaos parameters

Tunable	Description	Default
`TOTAL_CHAOS_DURATION`	Total duration of the fault in seconds.	`30`
`CHAOS_INTERVAL`	Delay in seconds between successive iterations when running for more than one cycle.	`30`
`SCALE_SET`	Set to `enable` when the VMs belong to a Virtual Machine Scale Set.	`""`
`SEQUENCE`	Order in which multiple instances are stressed: `parallel` or `serial`.	`parallel`
`RAMP_TIME`	Wait period in seconds before and after the fault. Go to ramp time to read how it is applied.	`0`

Authentication parameters

Tunable	Description	Default
`AZURE_SUBSCRIPTION_ID`	Target Azure subscription ID. Required when using workload identity or managed identity.	`""`
`AZURE_CLIENT_ID`	Client ID of a user-assigned managed identity.	`""`
`AZURE_AUTHENTICATION_SECRET`	Identifier of the File Secret in Harness Secret Manager that contains the service principal JSON.	`""`

Tunables that apply to every fault are documented in common tunables for all faults.
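
The interaction between the tunables can be checked before the values are entered in Chaos Studio. The following sketch is not part of the fault; it assumes the tunables are exported as environment variables of the same name and only encodes the rules stated in the tables above (required names, the 0-100 range for CPU_LOAD, the two SEQUENCE values, and CPU_CORES being ignored when CPU_LOAD is above 0).

```shell
#!/usr/bin/env bash
# Sketch: validate tunables read from environment variables of the same name.

validate_tunables() {
  local load="${CPU_LOAD:-0}"

  # Required parameters must be non-empty.
  [ -n "${AZURE_INSTANCE_NAMES:-}" ] || { echo "AZURE_INSTANCE_NAMES is required" >&2; return 1; }
  [ -n "${RESOURCE_GROUP:-}" ] || { echo "RESOURCE_GROUP is required" >&2; return 1; }

  # CPU_LOAD must be an integer between 0 and 100.
  case "$load" in
    ''|*[!0-9]*) echo "CPU_LOAD must be an integer" >&2; return 1 ;;
  esac
  [ "$load" -le 100 ] || { echo "CPU_LOAD must be 0-100" >&2; return 1; }

  # SEQUENCE accepts only the two documented values.
  case "${SEQUENCE:-parallel}" in
    parallel|serial) ;;
    *) echo "SEQUENCE must be parallel or serial" >&2; return 1 ;;
  esac

  # Report which stress mode the values select.
  if [ "$load" -gt 0 ]; then
    echo "mode: load ${load}% (CPU_CORES ignored)"
  else
    echo "mode: saturate CPU_CORES=${CPU_CORES:-default} cores"
  fi
}
```

For example, with CPU_LOAD=80 the function reports the load mode, and with CPU_LOAD=0 and CPU_CORES=2 it reports the core-saturation mode.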

Fault execution in brief

Uses the Azure VM run-command extension to launch a CPU stress workload on each VM in AZURE_INSTANCE_NAMES, driving utilization to CPU_LOAD percent on all available cores (or saturating CPU_CORES cores when CPU_LOAD=0) for TOTAL_CHAOS_DURATION seconds, then terminates the workload.

Expected behavior during fault execution

CPU utilization on the affected VMs climbs to CPU_LOAD percent (or saturates CPU_CORES cores) for the duration.
Application latency typically rises as CPU contention increases.
Azure Monitor Percentage CPU reflects the spike.
After the duration ends, the stress workload exits and CPU utilization returns to baseline.

When the fault ends

The chaos pod terminates the stress workload through a follow-up run-command call. CPU utilization returns to baseline within seconds.

Signals to watch

Attach resilience probes to assert each layer:

CPU on the VM: Use a Prometheus probe on node_cpu_seconds_total (Linux) or windows_cpu_time_total (Windows) and assert the spike is observed.
Application latency: Use an HTTP probe on the user-visible endpoint and assert p95 stays inside the SLA.
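
For the Linux CPU signal, a busy-percentage expression over node_cpu_seconds_total is a common choice. The sketch below builds such a query and sends it to the Prometheus HTTP query API with curl; the Prometheus base URL and the instance label value are assumptions about your monitoring setup, and the same expression could be pasted into a Prometheus probe.

```shell
#!/usr/bin/env bash
# Sketch: build a busy-CPU PromQL expression and run it against Prometheus.

# Percentage of CPU time not spent idle over the last minute.
cpu_busy_query() {  # args: optional value of the "instance" label
  if [ -n "${1:-}" ]; then
    printf '100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle",instance="%s"}[1m])))\n' "$1"
  else
    printf '100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m])))\n'
  fi
}

# Send an instant query to the Prometheus HTTP API.
query_prometheus() {  # args: Prometheus base URL, PromQL expression
  curl -sG "$1/api/v1/query" --data-urlencode "query=$2"
}

# Usage (hypothetical URL and instance label):
#   query_prometheus http://prometheus:9090 "$(cpu_busy_query vm-a:9100)"
```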

Verify the fault execution effect

Inspect the Percentage CPU metric in Azure Monitor.

az monitor metrics list \
  --resource /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm> \
  --metric "Percentage CPU" \
  --interval PT1M

You should see the spike during the chaos window.

SSH into the VM and run top.

The stress process should be visible during the chaos window.
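
On Linux guests, a numeric reading can complement top. The following sketch samples the aggregate cpu line of /proc/stat twice and prints the busy percentage between the two samples; it is an illustration that assumes a Linux VM, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: measure overall CPU busy percentage on a Linux guest from /proc/stat.

# Compute the busy percentage between two aggregate "cpu ..." lines.
# Fields 2-9 are user, nice, system, idle, iowait, irq, softirq, steal.
cpu_busy_pct() {  # args: first sample line, second sample line
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      for (i = 2; i <= 9 && i <= NF; i++) total += $i
      idle = $5 + $6                      # idle + iowait
      if (NR == 1) { t0 = total; i0 = idle }
      else {
        dt = total - t0; di = idle - i0
        printf "%d\n", (dt > 0) ? 100 * (dt - di) / dt : 0
      }
    }'
}

# Take two samples a few seconds apart and print the busy percentage.
sample_cpu() {  # args: optional interval in seconds (default 1)
  local a b
  a="$(grep '^cpu ' /proc/stat)"
  sleep "${1:-1}"
  b="$(grep '^cpu ' /proc/stat)"
  cpu_busy_pct "$a" "$b"
}

# Usage on the target VM during the chaos window:
#   sample_cpu 5
```

A reading close to CPU_LOAD during the chaos window, and close to baseline afterwards, indicates the fault was applied and cleaned up.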

Recovery and cleanup

End of duration: The chaos pod terminates the stress workload via run-command.
Abort the experiment: Stopping the experiment from Chaos Studio also terminates the workload.
Manual recovery: SSH into the VM and kill the stress process if it survived.
Workload recovery: Latency returns to baseline within seconds once stress stops.

Limitations

VM Agent dependency: The Azure VM Agent and run-command extension must be running on the target VM.
Same-subscription targeting: A single experiment targets one AZURE_SUBSCRIPTION_ID.
CPU_LOAD vs CPU_CORES: CPU_LOAD > 0 applies to all available cores; CPU_LOAD=0 saturates exactly CPU_CORES cores.
OS guest dependency: Stress runs inside the guest OS; OS limits (cgroups, ulimits) apply.

Troubleshooting

Azure instance CPU hog fails with run-command failed in Harness Chaos Engineering

The Azure VM Agent must be running and the run-command extension must be installed. Verify with az vm get-instance-view -g <rg> -n <vm> --query 'instanceView.vmAgent.statuses'. Reinstall the VM Agent if the status is not Ready.

Azure instance CPU hog fails with AuthorizationFailed

The Azure principal is missing Microsoft.Compute/virtualMachines/runCommand/action. Assign Virtual Machine Contributor (or a custom role with the runCommand action) on the target resource group or subscription.

CPU spike not visible in Azure Monitor

Azure Monitor metrics aggregate at 1-minute granularity. For short chaos windows, inspect node-level metrics through the OS (top/Windows Task Manager) or a Prometheus node exporter scrape.

Related faults

Azure instance memory hog: Stress memory instead of CPU.
Azure instance IO stress: Stress disk IO instead of CPU.
Azure instance stop: Stop the VM instead of stressing it.
