Azure instance stop

Azure instance stop is an Azure chaos fault that stops (deallocates) one or more Virtual Machines listed in AZURE_INSTANCE_NAMES (in RESOURCE_GROUP, subscription AZURE_SUBSCRIPTION_ID) for TOTAL_CHAOS_DURATION seconds, then starts them again. When SCALE_SET=enable, the fault deallocates VMSS instances; the scale set's auto-recovery decides whether to bring them back.

Use this fault to test how a workload behaves when a VM disappears: whether load balancers shift traffic, whether VMSS auto-healing recreates the instance within the expected boot time, whether AKS node-down handling reschedules pods, and whether monitoring detects the outage within the alerting SLA.

Run your first experiment

If you have not configured the chaos infrastructure yet, go to Quickstart to install the chaos infrastructure and run an experiment end to end.


Use cases

Run this fault when you want to answer concrete questions like:

  • VM disappears: When the target VM deallocates, do load balancers fail traffic over within the SLA?
  • VMSS recovery: Does the scale set recreate the deallocated instance with the expected boot time?
  • AKS node-down handling: If the VM is an AKS worker, does the cluster drain pods and reschedule them on healthy nodes?
  • Monitoring fidelity: Do alerts on Microsoft.Compute/virtualMachines/Availability, instance count, and end-to-end availability fire within the alerting SLA?

Prerequisites

  • Kubernetes version: 1.21 or later for the chaos infrastructure cluster.
  • Target VMs reachable: Each entry in AZURE_INSTANCE_NAMES exists in RESOURCE_GROUP inside AZURE_SUBSCRIPTION_ID.
  • VM in running state: The fault refuses to deallocate a VM that is already stopped or deallocated.
  • Azure credentials available: A service principal JSON delivered as a File Secret in Harness Secret Manager, workload identity bound to the chaos infra service account, or managed identity on the AKS node pool.
  • RBAC granted: The principal holds the role listed under Permissions required.
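
The existence and running-state prerequisites can be checked before the experiment starts. The following is a minimal sketch, assuming the az CLI is already logged in to the target subscription. The function names vm_power_state and preflight_check are hypothetical helpers for illustration and are not part of the fault.

```shell
# Print the power state code (for example PowerState/running) of one VM.
vm_power_state() {
  # $1 = resource group, $2 = VM name
  az vm get-instance-view --resource-group "$1" --name "$2" \
    --query "instanceView.statuses[?starts_with(code, 'PowerState/')].code | [0]" \
    --output tsv
}

# Return 0 only if every VM in a comma-separated list is running.
preflight_check() {
  # $1 = resource group, $2 = comma-separated VM names
  pf_failed=0
  for pf_vm in $(printf '%s' "$2" | tr ',' ' '); do
    # A failed lookup (for example a missing VM) is reported, not fatal.
    pf_state=$(vm_power_state "$1" "$pf_vm") || pf_state="lookup-failed"
    if [ "$pf_state" = "PowerState/running" ]; then
      echo "ok: $pf_vm is running"
    else
      echo "not ready: $pf_vm is ${pf_state:-unknown}"
      pf_failed=1
    fi
  done
  return "$pf_failed"
}

# Example: preflight_check "$RESOURCE_GROUP" "$AZURE_INSTANCE_NAMES"
```

A non-zero return code means at least one target is missing, stopped, or deallocated, so the fault would likely refuse to run against it.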

Supported environments

| Platform | Support status |
| --- | --- |
| Standalone Virtual Machines | Supported |
| Virtual Machine Scale Set instances | Supported (set SCALE_SET=enable) |
| AKS worker nodes (VMSS-backed) | Supported with SCALE_SET=enable |
| Spot VMs | Supported (note: Azure may not restart them automatically) |

Permissions required

The Azure principal used by the chaos pod (service principal, workload identity, or managed identity) needs the following role on the target resource group or subscription.

Recommended built-in role: Virtual Machine Contributor

Custom role (minimum actions):

{
  "Name": "Harness Chaos VM Stop",
  "Actions": [
    "Microsoft.Compute/virtualMachines/read",
    "Microsoft.Compute/virtualMachines/start/action",
    "Microsoft.Compute/virtualMachines/deallocate/action",
    "Microsoft.Compute/virtualMachineScaleSets/virtualMachines/read",
    "Microsoft.Compute/virtualMachineScaleSets/virtualMachines/start/action",
    "Microsoft.Compute/virtualMachineScaleSets/virtualMachines/deallocate/action"
  ],
  "AssignableScopes": ["/subscriptions/<SUBSCRIPTION_ID>"]
}

Go to Azure fault permissions to read the full permission catalog.
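
One way to create and assign the role is with the az CLI. This is a sketch under assumptions: the custom role JSON is saved to a local file (the file name is up to you), and the caller is allowed to manage role definitions and assignments. The function names create_chaos_role and assign_chaos_role are hypothetical.

```shell
# Create the custom role from a file that holds the role definition JSON.
create_chaos_role() {
  # $1 = path to the role definition JSON file
  az role definition create --role-definition "@$1"
}

# Assign a role to the chaos principal at resource-group scope.
assign_chaos_role() {
  # $1 = principal object ID or application ID
  # $2 = subscription ID
  # $3 = resource group
  # $4 = role name (optional, defaults to the built-in role)
  az role assignment create \
    --assignee "$1" \
    --role "${4:-Virtual Machine Contributor}" \
    --scope "/subscriptions/$2/resourceGroups/$3"
}

# Example:
#   create_chaos_role role.json
#   assign_chaos_role "<principal-id>" "<SUBSCRIPTION_ID>" "<rg>" "Harness Chaos VM Stop"
```

Scoping the assignment to the resource group rather than the whole subscription keeps the principal's reach limited to the VMs under test.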


Authentication

Pick one of the following methods. Go to Azure authentication methods to read the full setup.

| Method | When to use it | How to configure |
| --- | --- | --- |
| Service principal | Chaos infrastructure runs outside AKS, or you want explicit static credentials | Upload the service principal JSON file as a File Secret in Harness Secret Manager and reference it via AZURE_AUTHENTICATION_SECRET |
| Workload identity | Chaos infrastructure runs on AKS with OIDC issuer + workload identity enabled | Annotate the chaos infra service account with azure.workload.identity/client-id; the pod authenticates without static credentials |
| Managed identity | Chaos infrastructure runs on AKS with a system-assigned or user-assigned managed identity on the node pool | No tunable changes; the pod inherits the identity from IMDS |
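
For the workload identity method, the annotation step might look like the sketch below. The namespace and service account name depend on how the chaos infrastructure was installed, so they are passed in as arguments here; the function name annotate_chaos_service_account is hypothetical. Other parts of a workload identity setup, such as the federated credential, are not shown.

```shell
# Add the workload identity client ID annotation to a service account.
annotate_chaos_service_account() {
  # $1 = namespace of the chaos infrastructure
  # $2 = service account name used by the chaos pods
  # $3 = client ID of the Azure identity to federate with
  kubectl annotate serviceaccount "$2" --namespace "$1" \
    "azure.workload.identity/client-id=$3" --overwrite
}

# Example: annotate_chaos_service_account "<namespace>" "<service-account>" "<client-id>"
```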

Fault tunables

Configure the following fault parameters when you add Azure instance stop to an experiment in Chaos Studio. Defaults are shown for reference.

Required parameters

| Tunable | Description | Default |
| --- | --- | --- |
| AZURE_INSTANCE_NAMES | Comma-separated list of VM names (for example vm-1,vm-2). | (required) |
| RESOURCE_GROUP | Resource group that contains the VMs. | (required) |

Chaos parameters

| Tunable | Description | Default |
| --- | --- | --- |
| TOTAL_CHAOS_DURATION | Total duration of the fault in seconds. The VMs stay deallocated for this period. | 30 |
| CHAOS_INTERVAL | Delay in seconds between successive iterations when running for more than one cycle. | 30 |
| SCALE_SET | Set to enable when the VMs belong to a Virtual Machine Scale Set. Otherwise leave empty. | "" |
| SEQUENCE | Order in which multiple instances are stopped: parallel or serial. | parallel |
| RAMP_TIME | Wait period in seconds before and after the fault. Go to ramp time to read how it is applied. | 0 |

Authentication

| Tunable | Description | Default |
| --- | --- | --- |
| AZURE_SUBSCRIPTION_ID | Target Azure subscription ID. Required when using workload identity or managed identity. | "" |
| AZURE_CLIENT_ID | Client ID of a user-assigned managed identity (only if you have multiple identities attached). | "" |
| AZURE_AUTHENTICATION_SECRET | Identifier of the File Secret in Harness Secret Manager that contains the service principal JSON. | "" |

Tunables that apply to every fault are documented in common tunables for all faults.


Fault execution in brief

Calls the Azure Resource Manager API to deallocate each VM in AZURE_INSTANCE_NAMES (in RESOURCE_GROUP), waits for TOTAL_CHAOS_DURATION seconds, then starts the VMs again.
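
To get a feel for the sequence, the same deallocate, wait, start cycle can be reproduced by hand with the az CLI. This is only an illustrative sketch, not the fault's implementation: it handles VMs serially, ignores SCALE_SET, CHAOS_INTERVAL, and RAMP_TIME, and the function name stop_wait_start is hypothetical.

```shell
# Deallocate every VM in a comma-separated list, wait, then start them again.
stop_wait_start() {
  # $1 = resource group, $2 = comma-separated VM names, $3 = duration in seconds
  sws_vms=$(printf '%s' "$2" | tr ',' ' ')

  # Phase 1: deallocate each target VM.
  for sws_vm in $sws_vms; do
    az vm deallocate --resource-group "$1" --name "$sws_vm"
  done

  # Phase 2: keep the VMs deallocated for the chaos duration.
  sleep "$3"

  # Phase 3: start each target VM again.
  for sws_vm in $sws_vms; do
    az vm start --resource-group "$1" --name "$sws_vm"
  done
}

# Example: stop_wait_start "<rg>" "vm-1,vm-2" 30
```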


Expected behavior during fault execution

  • The target VMs transition running → stopping → deallocated and stay there for TOTAL_CHAOS_DURATION.
  • For AKS worker nodes: the node goes to NotReady/Unknown, then Kubernetes evicts its pods and reschedules them on healthy nodes.
  • For VMs behind a load balancer: backend health probes fail; traffic shifts to healthy backends.
  • After the duration ends, the VMs transition back to running.

When the fault ends

The chaos pod calls start on every targeted VM. Boot time depends on the OS image and post-boot init scripts.

Signals to watch

Attach resilience probes to assert each layer:

  • Instance state: Use a command probe running az vm get-instance-view -g <rg> -n <vm> --query "instanceView.statuses[?starts_with(code, 'PowerState/')].code" and assert that the state is PowerState/deallocated during the chaos window.
  • Application availability: Use an HTTP probe on the user-visible endpoint behind the load balancer.

Verify the fault execution effect

  1. Inspect VM power state with az.

    az vm get-instance-view --resource-group <rg> --name <vm> \
    --query "instanceView.statuses[?starts_with(code, 'PowerState/')].code"

    The state should be PowerState/deallocated during the chaos window and PowerState/running afterwards.

  2. Inspect Azure activity log.

    az monitor activity-log list --resource-group <rg> --max-events 20 \
    --query "[?contains(operationName.value,'deallocate')]"
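
When the check needs to run unattended, the power-state query can be wrapped in a polling loop. The sketch below assumes a logged-in az CLI; the function name wait_for_power_state and its default timeout and interval are hypothetical choices, and the elapsed time is approximate because it counts only the sleep intervals.

```shell
# Poll a VM until it reaches the wanted power state or the timeout expires.
wait_for_power_state() {
  # $1 = resource group, $2 = VM name
  # $3 = wanted state without prefix (for example deallocated or running)
  # $4 = timeout in seconds (default 300), $5 = poll interval in seconds (default 10)
  wf_timeout=${4:-300}
  wf_interval=${5:-10}
  wf_elapsed=0
  while :; do
    wf_state=$(az vm get-instance-view --resource-group "$1" --name "$2" \
      --query "instanceView.statuses[?starts_with(code, 'PowerState/')].code | [0]" \
      --output tsv)
    if [ "$wf_state" = "PowerState/$3" ]; then
      echo "$2 reached $wf_state after ${wf_elapsed}s"
      return 0
    fi
    if [ "$wf_elapsed" -ge "$wf_timeout" ]; then
      echo "timed out: $2 is ${wf_state:-unknown}, wanted PowerState/$3" >&2
      return 1
    fi
    sleep "$wf_interval"
    wf_elapsed=$((wf_elapsed + wf_interval))
  done
}

# Example: wait_for_power_state "<rg>" "<vm>" deallocated 120 5
```

Calling it once with deallocated during the chaos window and once with running afterwards covers both halves of the expected transition.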

Recovery and cleanup

  • End of duration: The chaos pod calls start on every targeted VM.
  • Abort the experiment: Stopping the experiment from Chaos Studio also calls start.
  • Manual recovery: If the chaos pod exited before restart, run az vm start --resource-group <rg> --name <vm> manually.
  • Workload recovery: Boot time depends on the OS image and init scripts; AKS node Ready transitions usually complete within 2-3 minutes.

Limitations

  • Same-subscription targeting: A single experiment targets one AZURE_SUBSCRIPTION_ID.
  • Resource group scope: All entries in AZURE_INSTANCE_NAMES must be in RESOURCE_GROUP.
  • Spot VMs: Azure may not restart evicted Spot VMs automatically.
  • VMSS instance IDs: When SCALE_SET=enable, entries in AZURE_INSTANCE_NAMES are VMSS instance IDs (0, 1, ...), not VM names.

Troubleshooting

Azure instance stop fails with AuthorizationFailed in Harness Chaos Engineering

The Azure principal used by the chaos pod is missing Microsoft.Compute/virtualMachines/deallocate/action or Microsoft.Compute/virtualMachines/start/action (or the virtualMachineScaleSets equivalents when SCALE_SET=enable). Assign the Virtual Machine Contributor role (or a custom role with the required actions) on the target resource group or subscription.

Azure instance stop fails with ResourceNotFound

Confirm each VM name in AZURE_INSTANCE_NAMES exists in RESOURCE_GROUP with az vm list -g <rg> --query '[].name'. Confirm AZURE_SUBSCRIPTION_ID matches the subscription that owns the resource group.

VMs stayed deallocated after the experiment ended

If the chaos pod exited before start, run az vm start --resource-group <rg> --name <vm> manually. For VMSS instances, run az vmss start --resource-group <rg> --name <vmss> --instance-ids <id>.
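
When several targets need manual recovery, the two commands above can be looped over the same comma-separated list that was passed to the fault. This is a sketch with a hypothetical function name, recover_targets; the scale set name is an extra argument because the fault's tunables list only instance IDs.

```shell
# Start every target again: standalone VMs by name, VMSS instances by ID.
recover_targets() {
  # $1 = resource group
  # $2 = comma-separated VM names, or VMSS instance IDs when $3 is set
  # $3 = scale set name (optional; leave empty for standalone VMs)
  for rc_target in $(printf '%s' "$2" | tr ',' ' '); do
    if [ -n "${3:-}" ]; then
      az vmss start --resource-group "$1" --name "$3" --instance-ids "$rc_target"
    else
      az vm start --resource-group "$1" --name "$rc_target"
    fi
  done
}

# Examples:
#   recover_targets "<rg>" "vm-1,vm-2"
#   recover_targets "<rg>" "0,1" "<vmss>"
```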