Azure instance CPU hog

Azure instance CPU hog is an Azure chaos fault that drives CPU utilization to CPU_LOAD percent (or saturates CPU_CORES cores when CPU_LOAD=0) on each VM listed in AZURE_INSTANCE_NAMES (in RESOURCE_GROUP, subscription AZURE_SUBSCRIPTION_ID) for TOTAL_CHAOS_DURATION seconds. The fault runs a stress workload inside the target VM via the Azure VM run-command extension, then terminates it when the duration ends.

Use this fault to test how a workload behaves when compute headroom shrinks: whether latency stays inside the SLA, whether the OS throttles correctly, whether autoscaling responds, and whether monitoring detects CPU saturation within the alerting SLA.

Run your first experiment

If you have not configured the chaos infrastructure yet, go to Quickstart to install the chaos infrastructure and run an experiment end to end.


Use cases

Run this fault when you want to answer concrete questions like:

  • CPU pressure: When CPU utilization climbs, does application latency stay inside the SLA?
  • Autoscaling fidelity: Does VMSS autoscale, AKS HPA, or App Service autoscale add capacity inside the alerting SLA?
  • OS scheduling: Do critical processes (kubelet, agent daemons) keep getting CPU during the saturation window?
  • Monitoring fidelity: Do alerts on Percentage CPU in Azure Monitor fire inside the alerting SLA?

Prerequisites

  • Kubernetes version: 1.21 or later for the chaos infrastructure cluster.
  • Target VMs reachable: Each entry in AZURE_INSTANCE_NAMES exists in RESOURCE_GROUP and is in running state.
  • VM Agent and run-command enabled: The Azure VM Agent must be running on the target VM and the run-command extension must be reachable.
  • Azure credentials available: A service principal File Secret in Harness Secret Manager, workload identity on AKS, or managed identity on the AKS node pool.
  • RBAC granted: The principal has the role listed under Permissions required, assigned on the target resource group or subscription.

Supported environments

Platform | Support status
Standalone Linux VMs | Supported
Standalone Windows VMs | Supported (use Windows CPU stress for native Windows support)
VMSS instances | Supported (set SCALE_SET=enable)
AKS worker nodes (VMSS-backed) | Supported with SCALE_SET=enable

Permissions required

The Azure principal used by the chaos pod needs the following role on the target resource group or subscription.

Recommended built-in role: Virtual Machine Contributor

Custom role (minimum actions):

{
  "Name": "Harness Chaos VM Stress",
  "Actions": [
    "Microsoft.Compute/virtualMachines/read",
    "Microsoft.Compute/virtualMachines/runCommand/action",
    "Microsoft.Compute/virtualMachines/runCommands/read",
    "Microsoft.Compute/virtualMachines/runCommands/write",
    "Microsoft.Compute/virtualMachines/runCommands/delete",
    "Microsoft.Compute/virtualMachineScaleSets/virtualMachines/runCommand/action"
  ],
  "AssignableScopes": ["/subscriptions/<SUBSCRIPTION_ID>"]
}

Go to Azure fault permissions to read the full permission catalog.


Authentication

Pick one of the following methods. Go to Azure authentication methods to read the full setup.

Method | When to use it | How to configure
Service principal | Chaos infrastructure runs outside AKS, or you want explicit static credentials | Upload the service principal JSON file as a File Secret in Harness Secret Manager and reference it via AZURE_AUTHENTICATION_SECRET
Workload identity | Chaos infrastructure runs on AKS with workload identity enabled | Annotate the chaos infra service account with azure.workload.identity/client-id
Managed identity | Chaos infrastructure runs on AKS with a managed identity on the node pool | No tunable changes; the pod inherits the identity from IMDS

Fault tunables

Configure the following fault parameters when you add Azure instance CPU hog to an experiment in Chaos Studio. Defaults are shown for reference.

Required parameters

Tunable | Description | Default
AZURE_INSTANCE_NAMES | Comma-separated list of VM names. | (required)
RESOURCE_GROUP | Resource group that contains the VMs. | (required)
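
As a small illustration of how the required values fit together, the sketch below splits a comma-separated AZURE_INSTANCE_NAMES value and builds the Azure resource ID of each standalone VM, in the same form that the az monitor command on this page expects. The function names are hypothetical, and ignoring whitespace and empty entries is an assumption, not documented fault behavior.

```python
from typing import List


def parse_instance_names(raw: str) -> List[str]:
    """Split a comma-separated AZURE_INSTANCE_NAMES value into VM names."""
    # Assumption: whitespace around names and empty entries are ignored.
    return [name.strip() for name in raw.split(",") if name.strip()]


def vm_resource_id(subscription_id: str, resource_group: str, vm_name: str) -> str:
    """Build the Azure resource ID of a standalone VM."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Compute/virtualMachines/{vm_name}"
    )
```

For example, parse_instance_names("vm-a, vm-b") yields two names, and each can be passed to vm_resource_id together with your subscription ID and RESOURCE_GROUP.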

Stress parameters

Tunable | Description | Default
CPU_CORES | Number of CPU cores to stress. Ignored when CPU_LOAD > 0. | 500
CPU_LOAD | Target CPU utilization percentage (0-100). Set to 0 to saturate CPU_CORES instead. | 0
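
The interaction between the two stress tunables can be sketched as a small decision function. This is an illustration of the rule stated on this page (CPU_LOAD > 0 applies to all available cores, CPU_LOAD=0 saturates CPU_CORES cores), not the fault's actual code; describe_stress is a hypothetical name.

```python
def describe_stress(cpu_load: int, cpu_cores: int) -> str:
    """Summarize which stress mode a CPU_LOAD / CPU_CORES pair selects."""
    if not 0 <= cpu_load <= 100:
        raise ValueError(f"CPU_LOAD must be between 0 and 100, got {cpu_load}")
    if cpu_load > 0:
        # CPU_CORES is ignored; the load applies to all available cores.
        return f"{cpu_load}% load on all available cores"
    # CPU_LOAD == 0: saturate exactly CPU_CORES cores.
    return f"saturate {cpu_cores} core(s)"
```

Calling describe_stress(60, 2) reports a 60% load on all cores, while describe_stress(0, 2) reports saturation of two cores.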

Chaos parameters

Tunable | Description | Default
TOTAL_CHAOS_DURATION | Total duration of the fault in seconds. | 30
CHAOS_INTERVAL | Delay in seconds between successive iterations when running for more than one cycle. | 30
SCALE_SET | Set to enable when the VMs belong to a Virtual Machine Scale Set. | ""
SEQUENCE | Order in which multiple instances are stressed: parallel or serial. | parallel
RAMP_TIME | Wait period in seconds before and after the fault. Go to ramp time to read how it is applied. | 0
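
If you generate experiment definitions from scripts, it can help to check the chaos tunables before submitting them. The sketch below reads the values from a plain mapping and falls back to the defaults listed in this table. The function name and the returned key SCALE_SET_ENABLED are hypothetical, and rejecting negative durations is an assumption about sensible input, not a documented validation rule.

```python
from typing import Dict


def parse_chaos_tunables(env: Dict[str, str]) -> Dict[str, object]:
    """Read chaos tunables from a mapping, using the documented defaults."""
    seconds = {
        "TOTAL_CHAOS_DURATION": int(env.get("TOTAL_CHAOS_DURATION", "30")),
        "CHAOS_INTERVAL": int(env.get("CHAOS_INTERVAL", "30")),
        "RAMP_TIME": int(env.get("RAMP_TIME", "0")),
    }
    for name, value in seconds.items():
        if value < 0:
            raise ValueError(f"{name} must not be negative, got {value}")

    sequence = env.get("SEQUENCE", "parallel")
    if sequence not in ("parallel", "serial"):
        raise ValueError(f"SEQUENCE must be parallel or serial, got {sequence!r}")

    result: Dict[str, object] = dict(seconds)
    result["SEQUENCE"] = sequence
    # SCALE_SET only takes effect when it is set to "enable".
    result["SCALE_SET_ENABLED"] = env.get("SCALE_SET", "") == "enable"
    return result
```

Passing an empty mapping returns the defaults: a 30-second duration, a 30-second interval, no ramp time, and parallel sequencing.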

Authentication parameters

Tunable | Description | Default
AZURE_SUBSCRIPTION_ID | Target Azure subscription ID. Required when using workload identity or managed identity. | ""
AZURE_CLIENT_ID | Client ID of a user-assigned managed identity. | ""
AZURE_AUTHENTICATION_SECRET | Identifier of the File Secret in Harness Secret Manager that contains the service principal JSON. | ""

Tunables that apply to every fault are documented in common tunables for all faults.


Fault execution in brief

The fault uses the Azure VM run-command extension to launch a CPU stress workload on each VM in AZURE_INSTANCE_NAMES. The workload drives utilization to CPU_LOAD percent (or saturates CPU_CORES cores when CPU_LOAD=0) for TOTAL_CHAOS_DURATION seconds, after which the fault terminates it.


Expected behavior during fault execution

  • CPU utilization on the affected VMs climbs to CPU_LOAD percent (or saturates CPU_CORES cores) for the duration.
  • Application latency typically increases as CPU headroom shrinks; how much depends on how CPU-bound the workload is.
  • Azure Monitor Percentage CPU reflects the spike.
  • After the duration ends, the stress workload exits and CPU utilization returns to baseline.

When the fault ends

The chaos pod terminates the stress workload through a follow-up run-command call. CPU utilization returns to baseline within seconds.

Signals to watch

Attach resilience probes to assert each layer:

  • CPU on the VM: Use a Prometheus probe on node_cpu_seconds_total (Linux) or windows_cpu_time_total (Windows) and assert the spike is observed.
  • Application latency: Use an HTTP probe on the user-visible endpoint and assert p95 stays inside the SLA.
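
To make the CPU assertion concrete: node_cpu_seconds_total is a counter, so utilization has to be derived from how much the idle counter grew over a window. The sketch below does that arithmetic for two readings of the idle counter summed across cores. The function names and the 10-point tolerance are hypothetical; in a Prometheus probe you would normally express the same idea in PromQL, as noted in the comment.

```python
def cpu_utilization_percent(idle_start: float, idle_end: float,
                            elapsed_seconds: float, cores: int) -> float:
    """Derive CPU utilization from two readings of the summed idle counter.

    Roughly equivalent PromQL for a Linux node:
    100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m])))
    """
    if elapsed_seconds <= 0 or cores <= 0:
        raise ValueError("elapsed_seconds and cores must be positive")
    # Fraction of the available core-seconds that were spent idle.
    idle_fraction = (idle_end - idle_start) / (elapsed_seconds * cores)
    return 100.0 * (1.0 - idle_fraction)


def spike_observed(utilization: float, cpu_load: float,
                   tolerance: float = 10.0) -> bool:
    """Return True when utilization reached CPU_LOAD within a tolerance."""
    return utilization >= cpu_load - tolerance
```

For instance, if the idle counter on a 2-core VM grows by 24 seconds over a 60-second window, the VM was idle for 24 of 120 core-seconds, which is 80% utilization.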

Verify the fault execution effect

  1. Inspect the Azure Monitor CPU metric.

    az monitor metrics list \
    --resource /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm> \
    --metric "Percentage CPU" \
    --interval PT1M

    You should see the spike during the chaos window.

  2. SSH into the VM and run top.

    The stress process should be visible during the chaos window.
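
If you want to check the Azure Monitor result from a script rather than by eye, one option is to save the JSON output of the az monitor metrics list command from step 1 and extract the highest average. The sketch below assumes the CLI's JSON output nests data points under value, timeseries, and data, each with an average field that may be null. peak_average and the SAMPLE values are hypothetical and illustrative only.

```python
from typing import Optional

# Illustrative, trimmed-down output; real responses carry more fields.
SAMPLE = {
    "value": [
        {
            "name": {"value": "Percentage CPU"},
            "timeseries": [
                {
                    "data": [
                        {"timeStamp": "2024-01-01T10:00:00Z", "average": 7.5},
                        {"timeStamp": "2024-01-01T10:01:00Z", "average": 96.2},
                        {"timeStamp": "2024-01-01T10:02:00Z", "average": None},
                    ]
                }
            ],
        }
    ]
}


def peak_average(metrics: dict) -> Optional[float]:
    """Return the highest non-null average data point, or None if absent."""
    averages = [
        point["average"]
        for metric in metrics.get("value", [])
        for series in metric.get("timeseries", [])
        for point in series.get("data", [])
        if point.get("average") is not None
    ]
    return max(averages) if averages else None


print(peak_average(SAMPLE))  # 96.2
```

Compare the returned peak against the CPU_LOAD you configured, keeping in mind that one-minute averages understate short spikes.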


Recovery and cleanup

  • End of duration: The chaos pod terminates the stress workload via run-command.
  • Abort the experiment: Stopping the experiment from Chaos Studio also terminates the workload.
  • Manual recovery: SSH into the VM and kill the stress process if it survived.
  • Workload recovery: Latency returns to baseline within seconds once stress stops.

Limitations

  • VM Agent dependency: The Azure VM Agent and run-command extension must be running on the target VM.
  • Same-subscription targeting: A single experiment targets one AZURE_SUBSCRIPTION_ID.
  • CPU_LOAD vs CPU_CORES: CPU_LOAD > 0 applies to all available cores; CPU_LOAD=0 saturates exactly CPU_CORES cores.
  • OS guest dependency: Stress runs inside the guest OS; OS limits (cgroups, ulimits) apply.

Troubleshooting

Azure instance CPU hog fails with run-command failed in Harness Chaos Engineering

The Azure VM Agent must be running and the run-command extension must be installed. Verify with az vm get-instance-view -g <rg> -n <vm> --query 'instanceView.vmAgent.statuses'. Reinstall the VM Agent if the status is not Ready.
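
When you check many VMs, the readiness test can be scripted against the JSON that the command above returns. The sketch below assumes the query yields a list of status objects with a displayStatus field, or null when the agent has never reported. vm_agent_ready is a hypothetical helper name.

```python
from typing import Dict, List, Optional


def vm_agent_ready(statuses: Optional[List[Dict[str, str]]]) -> bool:
    """Return True when any VM Agent status reports displayStatus 'Ready'."""
    # A missing or empty status list is treated as not ready.
    return any(
        status.get("displayStatus") == "Ready" for status in (statuses or [])
    )
```

A result of False for a target VM suggests fixing the VM Agent before rerunning the experiment.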

Azure instance CPU hog fails with AuthorizationFailed

The Azure principal is missing Microsoft.Compute/virtualMachines/runCommand/action. Assign Virtual Machine Contributor (or a custom role with the runCommand action) on the target resource group or subscription.
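
To find out which permissions a role lacks, you can compare its Actions list against the minimum actions listed under Permissions required. The sketch below is a simplified check: it treats * in a granted action as a wildcard and compares case-insensitively, and it does not account for NotActions or deny assignments. missing_actions is a hypothetical helper name.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

# Minimum actions from the custom role definition on this page.
REQUIRED_ACTIONS = [
    "Microsoft.Compute/virtualMachines/read",
    "Microsoft.Compute/virtualMachines/runCommand/action",
    "Microsoft.Compute/virtualMachines/runCommands/read",
    "Microsoft.Compute/virtualMachines/runCommands/write",
    "Microsoft.Compute/virtualMachines/runCommands/delete",
    "Microsoft.Compute/virtualMachineScaleSets/virtualMachines/runCommand/action",
]


def missing_actions(granted: Iterable[str]) -> List[str]:
    """Return the required actions that no granted action or wildcard covers."""
    patterns = [action.lower() for action in granted]
    return [
        required
        for required in REQUIRED_ACTIONS
        if not any(fnmatchcase(required.lower(), pattern) for pattern in patterns)
    ]
```

An empty result means the role's Actions cover the minimum set. Any returned entries are candidates for the cause of the AuthorizationFailed error.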

CPU spike not visible in Azure Monitor

Azure Monitor metrics aggregate at 1-minute granularity. For short chaos windows, inspect node-level metrics through the OS (top/Windows Task Manager) or a Prometheus node exporter scrape.
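
A short calculation shows why a brief spike can look small in a one-minute average. The sketch below assumes the reported average is a simple time-weighted mean of baseline and stressed utilization within one bucket, which is a simplification of how Azure Monitor samples the metric. bucket_average is a hypothetical helper name.

```python
def bucket_average(baseline: float, load: float, stressed_seconds: float,
                   bucket_seconds: float = 60.0) -> float:
    """Time-weighted average utilization for one aggregation bucket."""
    if not 0 <= stressed_seconds <= bucket_seconds:
        raise ValueError("stressed_seconds must fit inside the bucket")
    unstressed_seconds = bucket_seconds - stressed_seconds
    return (load * stressed_seconds + baseline * unstressed_seconds) / bucket_seconds
```

With a 10% baseline and the default 30-second TOTAL_CHAOS_DURATION at 100% load, a bucket that contains the whole window averages 55%. If the window straddles two buckets with 15 seconds in each, both average 32.5%, which may fall below an alert threshold.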