Azure instance memory hog

Azure instance memory hog is an Azure chaos fault that consumes RAM on each VM listed in AZURE_INSTANCE_NAMES (in RESOURCE_GROUP, subscription AZURE_SUBSCRIPTION_ID) for TOTAL_CHAOS_DURATION seconds. It consumes MEMORY_CONSUMPTION MB, or MEMORY_PERCENTAGE percent of RAM when MEMORY_PERCENTAGE is greater than 0, through NUMBER_OF_WORKERS worker processes. The fault uses the Azure VM run-command extension to launch the stress workload inside the target VM.

Use this fault to test how a workload behaves when memory headroom shrinks: whether the OOM killer fires on the right process, whether GC-heavy applications pause, and whether monitoring detects the saturation and fires memory-pressure alerts within the alerting SLA.

Run your first experiment

If you have not configured the chaos infrastructure yet, go to Quickstart to install the chaos infrastructure and run an experiment end to end.


Use cases

Run this fault when you want to answer concrete questions like:

  • Memory pressure: When RAM utilization climbs, does the OOM killer target the expected process?
  • GC behavior: Does the JVM/CLR pause for an unacceptable duration under memory pressure?
  • VMSS autoscale: Does scale-out trigger correctly on memory metrics?
  • Monitoring fidelity: Do alerts on Available Memory Bytes in Azure Monitor fire inside the alerting SLA?

Prerequisites

  • Kubernetes version: 1.21 or later for the chaos infrastructure cluster.
  • Target VMs reachable: Each entry in AZURE_INSTANCE_NAMES exists in RESOURCE_GROUP and is in running state.
  • VM Agent and run-command enabled: The Azure VM Agent must be running on the target VM, because the fault launches the stress workload through run-command.
  • Azure credentials available: A service principal File Secret in Harness Secret Manager, workload identity on AKS, or managed identity on the AKS node pool.
  • RBAC granted: The principal holds the role listed under Permissions required.
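
The VM state and VM Agent prerequisites can be checked from a terminal before the experiment runs. The following is a minimal sketch, assuming the Azure CLI is installed and logged in to the target subscription. The function names (split_names, check_vm, preflight) are hypothetical helpers and not part of the fault. It targets standalone VMs; scale set instances may need the az vmss equivalents.

```shell
#!/bin/sh
# Preflight sketch. Helper names are illustrative assumptions, not fault tunables.

# split_names "vm-a, vm-b": prints one trimmed VM name per line.
split_names() {
  printf '%s\n' "$1" | tr ',' '\n' |
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# check_vm RESOURCE_GROUP VM_NAME: prints the power state and VM Agent status.
check_vm() {
  local rg="$1" vm="$2" power agent
  power=$(az vm get-instance-view -g "$rg" -n "$vm" \
    --query "instanceView.statuses[?starts_with(code, 'PowerState/')].displayStatus | [0]" \
    -o tsv)
  agent=$(az vm get-instance-view -g "$rg" -n "$vm" \
    --query "instanceView.vmAgent.statuses[0].displayStatus" -o tsv)
  printf '%s: power=%s agent=%s\n' "$vm" "$power" "$agent"
}

# preflight RESOURCE_GROUP "vm-a,vm-b": checks every VM in the list.
preflight() {
  local rg="$1" vm
  split_names "$2" | while IFS= read -r vm; do
    # Redirect stdin so az cannot consume the remaining VM names.
    check_vm "$rg" "$vm" </dev/null
  done
}
```

For example, preflight <rg> "<vm-1>,<vm-2>" prints one line per VM. A VM that is ready for the fault typically reports a running power state and a VM Agent status of Ready.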

Supported environments

| Platform | Support status |
| --- | --- |
| Standalone Linux VMs | Supported |
| Standalone Windows VMs | Supported (use Windows memory stress for native Windows support) |
| VMSS instances | Supported (set SCALE_SET=enable) |
| AKS worker nodes (VMSS-backed) | Supported with SCALE_SET=enable |

Permissions required

The Azure principal used by the chaos pod needs the following role on the target resource group or subscription.

Recommended built-in role: Virtual Machine Contributor

Custom role (minimum actions): see the Azure instance CPU hog permissions (same actions apply).

Go to Azure fault permissions to read the full permission catalog.


Authentication

Pick one of the following methods. Go to Azure authentication methods to read the full setup.

| Method | When to use it | How to configure |
| --- | --- | --- |
| Service principal | Chaos infrastructure runs outside AKS | Upload the service principal JSON file as a File Secret in Harness Secret Manager and reference it via AZURE_AUTHENTICATION_SECRET |
| Workload identity | Chaos infrastructure runs on AKS with workload identity enabled | Annotate the chaos infra service account with azure.workload.identity/client-id |
| Managed identity | Chaos infrastructure runs on AKS with a managed identity on the node pool | No tunable changes; the pod inherits the identity from IMDS |

Fault tunables

Configure the following fault parameters when you add Azure instance memory hog to an experiment in Chaos Studio. Defaults are shown for reference.

Required parameters

| Tunable | Description | Default |
| --- | --- | --- |
| AZURE_INSTANCE_NAMES | Comma-separated list of VM names. | (required) |
| RESOURCE_GROUP | Resource group that contains the VMs. | (required) |

Stress parameters

| Tunable | Description | Default |
| --- | --- | --- |
| MEMORY_CONSUMPTION | Memory to consume in MB. Ignored when MEMORY_PERCENTAGE > 0. | 500 |
| MEMORY_PERCENTAGE | Target memory utilization percentage (0-100). Set to 0 to consume MEMORY_CONSUMPTION MB instead. | 0 |
| NUMBER_OF_WORKERS | Number of worker processes that hold the memory. | 1 |

Chaos parameters

| Tunable | Description | Default |
| --- | --- | --- |
| TOTAL_CHAOS_DURATION | Total duration of the fault in seconds. | 30 |
| CHAOS_INTERVAL | Delay in seconds between successive iterations when running for more than one cycle. | 30 |
| SCALE_SET | Set to enable when the VMs belong to a Virtual Machine Scale Set. | "" |
| SEQUENCE | Order in which multiple instances are stressed: parallel or serial. | parallel |
| RAMP_TIME | Wait period in seconds before and after the fault. Go to ramp time to read how it is applied. | 0 |

Authentication

| Tunable | Description | Default |
| --- | --- | --- |
| AZURE_SUBSCRIPTION_ID | Target Azure subscription ID. Required when using workload identity or managed identity. | "" |
| AZURE_CLIENT_ID | Client ID of a user-assigned managed identity. | "" |
| AZURE_AUTHENTICATION_SECRET | Identifier of the File Secret in Harness Secret Manager that contains the service principal JSON. | "" |

Tunables that apply to every fault are documented in common tunables for all faults.


Fault execution in brief

The fault uses the Azure VM run-command extension to launch NUMBER_OF_WORKERS memory-stress workers on each VM in AZURE_INSTANCE_NAMES. Together the workers hold MEMORY_CONSUMPTION MB (or MEMORY_PERCENTAGE percent of RAM) for TOTAL_CHAOS_DURATION seconds, after which the fault terminates them.
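
To estimate how much memory a given configuration will claim, the precedence between the two stress tunables can be worked through by hand. The sketch below is illustrative only: the function names are hypothetical, it assumes MEMORY_PERCENTAGE is applied to the VM's total RAM, and it assumes the total is split evenly across workers. The fault's internal arithmetic may differ.

```shell
#!/bin/sh
# Sketch only: mirrors the precedence described in the stress parameters.
# Function names, the percent-of-total-RAM reading, and the even per-worker
# split are assumptions.

# resolve_target_mb TOTAL_RAM_MB MEMORY_CONSUMPTION MEMORY_PERCENTAGE
resolve_target_mb() {
  local total_mb="$1" consumption_mb="$2" percentage="$3"
  if [ "$percentage" -gt 0 ]; then
    # MEMORY_PERCENTAGE > 0 takes precedence over MEMORY_CONSUMPTION.
    echo $(( total_mb * percentage / 100 ))
  else
    echo "$consumption_mb"
  fi
}

# per_worker_mb TARGET_MB NUMBER_OF_WORKERS (integer division)
per_worker_mb() {
  echo $(( $1 / $2 ))
}

# Defaults from the tunables (500 MB, 0 percent, 1 worker) on an 8 GiB VM.
target=$(resolve_target_mb 8192 500 0)
echo "target=${target}MB per_worker=$(per_worker_mb "$target" 1)MB"
```

With MEMORY_PERCENTAGE=50 and four workers on the same 8192 MB VM, the same arithmetic gives a 4096 MB target and 1024 MB per worker.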


Expected behavior during fault execution

  • Available memory drops on the target VMs for the duration.
  • Workloads under memory pressure may hit GC pauses, start swapping, or be OOM-killed.
  • Azure Monitor Available Memory Bytes reflects the drop.
  • After the duration ends, the stress workers exit and memory returns to baseline.

When the fault ends

The chaos pod terminates the stress workers through a follow-up run-command call. Memory returns to baseline within seconds, provided the guest did not start swapping heavily.

Signals to watch

  • VM memory: Use a Prometheus probe on node_memory_MemAvailable_bytes (Linux) or windows_memory_available_bytes and assert the drop is observed.
  • Application: Use an HTTP probe and assert error rate stays under threshold.

Verify the fault execution effect

  1. Inspect Azure Monitor Available Memory Bytes.

    az monitor metrics list \
    --resource /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm> \
    --metric "Available Memory Bytes" \
    --interval PT1M
  2. SSH into the VM and run free -h. The available column should drop during the chaos window.
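
To capture the drop as a time series instead of a single free -h reading, available memory can be sampled on the guest for the length of the chaos window. This is a minimal sketch for Linux guests, assuming /proc/meminfo exposes a MemAvailable line; the function names are hypothetical.

```shell
#!/bin/sh
# Sampling sketch for Linux guests. Helper names are illustrative assumptions.

# mem_available_mb [FILE]: prints MemAvailable in MB from a meminfo-format file.
# Defaults to /proc/meminfo, where the value is reported in kB.
mem_available_mb() {
  awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# watch_mem_available SAMPLES INTERVAL_SECONDS: prints a timestamped reading
# every INTERVAL_SECONDS, SAMPLES times.
watch_mem_available() {
  local samples="$1" interval="$2" i=0
  while [ "$i" -lt "$samples" ]; do
    printf '%s MemAvailable=%sMB\n' "$(date +%T)" "$(mem_available_mb)"
    sleep "$interval"
    i=$(( i + 1 ))
  done
}
```

For the default TOTAL_CHAOS_DURATION of 30 seconds, starting watch_mem_available 12 5 shortly before the fault covers the window plus some time on either side, so both the drop and the return to baseline are visible.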


Recovery and cleanup

  • End of duration: The chaos pod terminates the stress workers.
  • Abort the experiment: Stopping the experiment from Chaos Studio also terminates the workers.
  • Manual recovery: If the workers survived, SSH into the VM and run pkill -f memory-stress.
  • Workload recovery: Memory returns to baseline within seconds; affected applications may need a restart if they swapped or were OOM-killed.
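
When SSH access is not available, the same manual cleanup can be attempted through run-command, provided the VM Agent is still responsive. The sketch below assumes a Linux guest and a logged-in Azure CLI. The kill_workers function and its DRY_RUN switch are hypothetical; the process pattern memory-stress is taken from the manual recovery step above.

```shell
#!/bin/sh
# Cleanup sketch for Linux guests. kill_workers and DRY_RUN are illustrative
# assumptions, not part of the fault.

# kill_workers RESOURCE_GROUP VM_NAME
# Set DRY_RUN=1 to print the az command instead of running it.
kill_workers() {
  # Rebuild the positional parameters as the full az command line.
  set -- az vm run-command invoke -g "$1" -n "$2" \
    --command-id RunShellScript \
    --scripts "pkill -f memory-stress || true"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

Running DRY_RUN=1 kill_workers <rg> <vm> first shows the exact command before anything is sent to the VM. The trailing || true keeps the script from reporting a failure when no workers are left to kill.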

Limitations

  • OOM risk: Setting MEMORY_PERCENTAGE close to 100 may trigger the OOM killer on critical processes; start conservatively.
  • Swap behavior varies: Linux VMs with swap configured may swap instead of OOM; behavior depends on guest OS settings.
  • Same-subscription targeting: A single experiment targets one AZURE_SUBSCRIPTION_ID.

Troubleshooting

Azure instance memory hog fails with run-command failed in Harness Chaos Engineering

The Azure VM Agent must be running and reachable. Verify with az vm get-instance-view -g <rg> -n <vm> --query 'instanceView.vmAgent.statuses'. Reinstall the VM Agent if its status is not Ready.

VM became unresponsive during memory hog

If MEMORY_PERCENTAGE was very high (above 90), the OOM killer may have terminated the SSH daemon or the VM Agent. Restart the VM with az vm restart -g <rg> -n <vm> and reduce MEMORY_PERCENTAGE.