Azure instance IO stress

Azure instance IO stress is an Azure chaos fault that drives sustained disk read/write IO on the volume mounted at VOLUME_MOUNT_PATH on each VM listed in AZURE_INSTANCE_NAMES. The VMs must belong to RESOURCE_GROUP in subscription AZURE_SUBSCRIPTION_ID, and the stress runs for TOTAL_CHAOS_DURATION seconds. The fault uses the Azure VM run-command extension to start NUMBER_OF_WORKERS workers that write FILESYSTEM_UTILIZATION_BYTES GB, or FILESYSTEM_UTILIZATION_PERCENTAGE percent of free space when that tunable is set.

Use this fault to test how a workload behaves when the storage subsystem is saturated: whether application latency degrades gracefully, whether write-heavy paths back off correctly, whether Premium SSD disk bursting kicks in, and whether monitoring detects the saturation within the alerting SLA.

Run your first experiment

If you have not configured the chaos infrastructure yet, go to Quickstart to install the chaos infrastructure and run an experiment end to end.


Use cases

Run this fault when you want to answer concrete questions like:

  • IO pressure: When the disk is saturated, does application latency stay inside the SLA?
  • Premium SSD bursting: Does the disk burst credit cover the spike, or does throttling kick in?
  • Write-amplification: Do logs, journals, or background writers degrade gracefully?
  • Monitoring fidelity: Do alerts on Disk Read Bytes / Disk Write Bytes and disk queue depth fire inside the alerting SLA?

Prerequisites

  • Kubernetes version: 1.21 or later for the chaos infrastructure cluster.
  • Target VMs reachable: Each entry in AZURE_INSTANCE_NAMES exists in RESOURCE_GROUP and is in running state.
  • VM Agent and run-command enabled: The Azure VM Agent must be running on the target VM.
  • Mount path exists: VOLUME_MOUNT_PATH (default /tmp) must be a writable directory on the VM with enough free space for FILESYSTEM_UTILIZATION_BYTES.
  • Azure credentials available: A service principal File Secret in Harness Secret Manager, workload identity on AKS, or managed identity on the AKS node pool.
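
The mount-path prerequisite can be checked on the VM before the experiment runs. The sketch below is a minimal pre-flight check, assuming a POSIX shell on a Linux target. `check_mount_path` is a hypothetical helper name, not part of the fault, and treating 1 GB as 1024 * 1024 KiB is an assumption because this page does not state which unit convention the fault uses.

```shell
# Hypothetical pre-flight helper; not part of the fault itself.
# Usage: check_mount_path <path> <required_gb>
# Assumption: 1 GB is treated as 1024 * 1024 KiB.
check_mount_path() (
  path=$1
  required_gb=$2
  if [ ! -d "$path" ] || [ ! -w "$path" ]; then
    echo "FAIL: $path is not a writable directory" >&2
    exit 1
  fi
  # df -P prints one POSIX-format line per filesystem; -k reports
  # 1024-byte blocks. Column 4 of the second line is the free space.
  avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
  required_kb=$((required_gb * 1024 * 1024))
  if [ "$avail_kb" -lt "$required_kb" ]; then
    echo "FAIL: $path has $avail_kb KiB free, needs $required_kb KiB" >&2
    exit 1
  fi
  echo "OK: $path has $avail_kb KiB free"
)
```

For example, `check_mount_path /tmp 5` checks that /tmp is writable and has at least 5 GB free. The running-state prerequisite can be checked separately from a workstation with `az vm show -d -g <rg> -n <vm> --query powerState -o tsv`.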

Supported environments

Platform | Support status
Standalone Linux VMs | Supported
Standalone Windows VMs | Supported (use Windows disk stress for native Windows support)
VMSS instances | Supported (set SCALE_SET=enable)
AKS worker nodes (VMSS-backed) | Supported with SCALE_SET=enable

Permissions required

The Azure principal used by the chaos pod needs the following role on the target resource group or subscription.

Recommended built-in role: Virtual Machine Contributor

Custom role (minimum actions): see the Azure instance CPU hog permissions (same actions apply).

Go to Azure fault permissions to read the full permission catalog.


Authentication

Go to Azure authentication methods to set up Service principal, Workload identity, or Managed identity.


Fault tunables

Configure the following fault parameters when you add Azure instance IO stress to an experiment in Chaos Studio. Defaults are shown for reference.

Required parameters

Tunable | Description | Default
AZURE_INSTANCE_NAMES | Comma-separated list of VM names. | (required)
RESOURCE_GROUP | Resource group that contains the VMs. | (required)

Stress parameters

Tunable | Description | Default
FILESYSTEM_UTILIZATION_BYTES | Total amount of data, in GB, written by the stress workers. Ignored when FILESYSTEM_UTILIZATION_PERCENTAGE > 0. | 5
FILESYSTEM_UTILIZATION_PERCENTAGE | Percentage of filesystem free space to use. Set to 0 to use FILESYSTEM_UTILIZATION_BYTES. | 0
NUMBER_OF_WORKERS | Number of parallel IO stress workers. | 1
VOLUME_MOUNT_PATH | Filesystem path used as the stress target. | /tmp
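
The precedence between the two size tunables can be sketched as a small shell function. This is an illustration of the rule described in the table, not the fault's own code: `effective_stress_kb` is a hypothetical name, sizes are expressed in KiB, and 1 GB is assumed to be 1024 * 1024 KiB.

```shell
# Hypothetical helper that mirrors the documented precedence rule.
# Usage: effective_stress_kb <free_kb> <utilization_gb> <utilization_pct>
effective_stress_kb() (
  free_kb=$1
  util_gb=$2
  util_pct=$3
  if [ "$util_pct" -gt 0 ]; then
    # A percentage greater than 0 wins and is applied to the free space.
    echo $((free_kb * util_pct / 100))
  else
    # Otherwise the fixed size in GB is used (assumed 1 GB = 1048576 KiB).
    echo $((util_gb * 1024 * 1024))
  fi
)

# A volume with 50 GiB (52428800 KiB) free:
effective_stress_kb 52428800 5 0    # prints 5242880  (5 GB, percentage unset)
effective_stress_kb 52428800 5 20   # prints 10485760 (20 percent of free space)
```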

Chaos parameters

Tunable | Description | Default
TOTAL_CHAOS_DURATION | Total duration of the fault in seconds. | 30
CHAOS_INTERVAL | Delay in seconds between successive iterations when running for more than one cycle. | 30
SCALE_SET | Set to enable when the VMs belong to a Virtual Machine Scale Set. | ""
SEQUENCE | Order in which multiple instances are stressed: parallel or serial. | parallel
RAMP_TIME | Wait period in seconds before and after the fault. Go to ramp time to read how it is applied. | 0

Authentication parameters

Tunable | Description | Default
AZURE_SUBSCRIPTION_ID | Target Azure subscription ID. | ""
AZURE_CLIENT_ID | Client ID of a user-assigned managed identity. | ""
AZURE_AUTHENTICATION_SECRET | Identifier of the File Secret in Harness Secret Manager that contains the service principal JSON. | ""

Tunables that apply to every fault are documented in common tunables for all faults.


Fault execution in brief

The fault uses the Azure VM run-command extension to launch NUMBER_OF_WORKERS IO stress workers on each VM. The workers write FILESYSTEM_UTILIZATION_BYTES GB (or FILESYSTEM_UTILIZATION_PERCENTAGE percent of free space) to VOLUME_MOUNT_PATH for TOTAL_CHAOS_DURATION seconds. The fault then terminates the workers and cleans up the stress files.
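
To get a feel for the launch, write, and cleanup sequence on a scratch VM, a rough stand-in can be written with `dd`. This is an illustrative sketch only and is not the Harness implementation: `run_io_stress`, its arguments, and the `stress-<n>` file names are assumptions made for the example.

```shell
# Illustrative stand-in only; NOT the Harness implementation.
# Usage: run_io_stress <dir> <workers> <mb_per_worker> <duration_seconds>
run_io_stress() (
  dir=$1
  workers=$2
  mb=$3
  duration=$4
  end=$(( $(date +%s) + duration ))
  i=1
  while [ "$i" -le "$workers" ]; do
    (
      # Each worker rewrites its own file until the deadline passes.
      while [ "$(date +%s)" -lt "$end" ]; do
        dd if=/dev/zero of="$dir/stress-$i" bs=1048576 count="$mb" 2>/dev/null || break
        sync   # push the written data out of the page cache
      done
    ) &
    i=$((i + 1))
  done
  wait                    # block until every worker has exited
  rm -f "$dir"/stress-*   # remove the files the workers wrote
)
```

For example, `run_io_stress /tmp 2 64 30` runs two workers that each rewrite a 64 MiB file for about 30 seconds. A worker only checks the deadline between writes, so the run can overshoot the duration by one write.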


Expected behavior during fault execution

  • Disk read/write throughput on the affected VMs climbs for the duration.
  • Application latency on read/write paths grows in proportion to the load.
  • Azure Monitor Disk Read Bytes, Disk Write Bytes, and disk queue depth metrics reflect the spike.
  • After the duration ends, the workers exit, the stress files are cleaned up, and IO returns to baseline.

When the fault ends

The chaos pod terminates the workers and removes the stress files written under VOLUME_MOUNT_PATH. Disk usage returns to baseline.

Signals to watch

  • VM disk metrics: Use a Prometheus probe on node_disk_io_time_seconds_total and assert the spike is observed.
  • Application latency: Use an HTTP probe on a write-heavy endpoint.

Verify the fault execution effect

  1. Inspect Azure Monitor disk metrics.

    az monitor metrics list \
    --resource /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm> \
    --metric "Disk Read Bytes,Disk Write Bytes" \
    --interval PT1M
  2. SSH into the VM and run iostat -x 2 during the chaos window.

    %util should climb and await should grow.
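
If iostat is not installed on the VM, a comparable utilization figure can be derived from /proc/diskstats. The sketch below assumes a Linux target. The helper names `util_pct`, `io_ms`, and `sample_util` are hypothetical, and the result is an integer approximation of what iostat reports as %util.

```shell
# Usage: util_pct <io_ms_before> <io_ms_after> <interval_ms>
# Share of the interval during which the device was busy, in percent.
util_pct() (
  echo $(( ($2 - $1) * 100 / $3 ))
)

# Usage: io_ms <device>   (for example: io_ms sda)
# Field 13 of /proc/diskstats is the time spent doing I/O, in milliseconds.
io_ms() (
  awk -v dev="$1" '$3 == dev { print $13 }' /proc/diskstats
)

# Usage: sample_util <device> <seconds>
# Takes two samples <seconds> apart and prints the busy percentage.
sample_util() (
  before=$(io_ms "$1")
  sleep "$2"
  after=$(io_ms "$1")
  util_pct "$before" "$after" $(( $2 * 1000 ))
)
```

Running `sample_util sda 2` during the chaos window should print a value close to 100 when the disk is saturated. The device name sda is only an example; list the devices on the VM with `lsblk`.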


Recovery and cleanup

  • End of duration: The chaos pod terminates the workers and removes the stress files from VOLUME_MOUNT_PATH.
  • Abort the experiment: Stopping the experiment from Chaos Studio also cleans up.
  • Manual recovery: SSH into the VM and remove any stress-* files left behind under VOLUME_MOUNT_PATH.
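
The manual recovery step can be sketched as a small helper run on the VM. `cleanup_stress_files` is a hypothetical name. Limiting the search to the top level of the directory with `-maxdepth 1` is an assumption that reduces the risk of deleting unrelated files; `-maxdepth` is supported by GNU and BSD find.

```shell
# Hypothetical cleanup helper for leftover stress files.
# Usage: cleanup_stress_files <volume_mount_path>
cleanup_stress_files() (
  dir=$1
  [ -d "$dir" ] || { echo "not a directory: $dir" >&2; exit 1; }
  # Print each matching regular file, then delete it.
  find "$dir" -maxdepth 1 -type f -name 'stress-*' -print -exec rm -f {} +
)
```

For example, `cleanup_stress_files /tmp` lists and removes stress-* files directly under /tmp and leaves everything else in place.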

Limitations

  • Filesystem space requirement: The chosen VOLUME_MOUNT_PATH must have enough free space; the fault errors out if it does not.
  • Premium SSD bursting: Bursting credit may mask the impact on small VMs; size FILESYSTEM_UTILIZATION_BYTES accordingly.
  • Same-subscription targeting: A single experiment targets one AZURE_SUBSCRIPTION_ID.

Troubleshooting

Azure instance IO stress fails with No space left on device in Harness Chaos Engineering

VOLUME_MOUNT_PATH does not have enough free space. Verify with df -h <path> on the VM, then either lower FILESYSTEM_UTILIZATION_BYTES or pick a different mount path.

No IO spike visible in Azure Monitor

Azure Monitor metrics aggregate at 1-minute granularity. For short chaos windows, inspect node-level metrics through iostat or a Prometheus node exporter scrape. Premium SSD burst credits may also absorb short bursts.
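
The effect of 1-minute aggregation on a short window can be estimated with simple arithmetic. The sketch below assumes the whole spike falls inside a single 60-second bucket and that the bucket reports an average rate. The function name and the throughput figures are hypothetical and chosen only for illustration.

```shell
# Usage: minute_avg_mbps <spike_mbps> <baseline_mbps> <spike_seconds>
# Average rate over one 60-second bucket that contains the whole spike.
minute_avg_mbps() (
  spike=$1
  base=$2
  secs=$3
  echo $(( (spike * secs + base * (60 - secs)) / 60 ))
)

# A 30-second spike at 200 MB/s over a 10 MB/s baseline:
minute_avg_mbps 200 10 30   # prints 105
```

Under these assumptions the bucket shows roughly half of the real peak. If the spike straddles two buckets, each one is diluted further.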