Azure instance stop
Azure instance stop is an Azure chaos fault that stops (deallocates) one or more Virtual Machines listed in AZURE_INSTANCE_NAMES (in RESOURCE_GROUP, subscription AZURE_SUBSCRIPTION_ID) for TOTAL_CHAOS_DURATION seconds, then starts them again. When SCALE_SET=enable, the fault deallocates VMSS instances; the scale set's auto-recovery decides whether to bring them back.
Use this fault to test how a workload behaves when a VM disappears: whether load balancers shift traffic, whether VMSS auto-healing recreates the instance within the expected boot time, whether AKS node-down handling reschedules pods, and whether monitoring detects the outage within the alerting SLA.
If you have not configured the chaos infrastructure yet, go to Quickstart to install the chaos infrastructure and run an experiment end to end.
Use cases
Run this fault when you want to answer concrete questions like:
- VM disappears: When the target VM deallocates, do load balancers fail traffic over within the SLA?
- VMSS recovery: Does the scale set recreate the deallocated instance with the expected boot time?
- AKS node-down handling: If the VM is an AKS worker, does the cluster drain pods and reschedule them on healthy nodes?
- Monitoring fidelity: Do alerts on Microsoft.Compute/virtualMachines/Availability, instance count, and end-to-end availability fire within the alerting SLA?
Prerequisites
- Kubernetes version: 1.21 or later for the chaos infrastructure cluster.
- Target VMs reachable: Each entry in AZURE_INSTANCE_NAMES exists in RESOURCE_GROUP inside AZURE_SUBSCRIPTION_ID.
- VM in running state: The fault refuses to deallocate a VM that is already stopped or deallocated.
- Azure credentials available: A service principal JSON delivered as a File Secret in Harness Secret Manager, workload identity bound to the chaos infra service account, or managed identity on the AKS node pool.
- RBAC granted: The principal holds the role listed under Permissions required.
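The running-state prerequisite can be checked from a terminal before the experiment starts. The following is a minimal sketch, assuming the az CLI is installed and already logged in to the target subscription; check_vms_running is a hypothetical helper name, not part of the fault.

```shell
# Pre-flight sketch: print every VM in a comma-separated list that is not running.
# check_vms_running is a hypothetical helper; it only reads state and changes nothing.
check_vms_running() (
  rg="$1"      # resource group, same value as RESOURCE_GROUP
  names="$2"   # comma-separated VM names, same format as AZURE_INSTANCE_NAMES
  rc=0
  for vm in $(printf '%s' "$names" | tr ',' ' '); do
    # Ask Azure for the PowerState/* status code of this VM.
    state=$(az vm get-instance-view --resource-group "$rg" --name "$vm" \
      --query "instanceView.statuses[?starts_with(code, 'PowerState/')].code | [0]" \
      --output tsv) || state="lookup-failed"
    if [ "$state" != "PowerState/running" ]; then
      echo "$vm: $state"
      rc=1
    fi
  done
  exit "$rc"   # 0 only when every VM reported PowerState/running
)

# Usage: check_vms_running <rg> vm-1,vm-2 && echo "all targets are running"
```

A non-zero exit status means at least one target would be rejected by the fault, so the check can gate a pipeline step that launches the experiment.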
Supported environments
| Platform | Support status |
|---|---|
| Standalone Virtual Machines | Supported |
| Virtual Machine Scale Set instances | Supported (set SCALE_SET=enable) |
| AKS worker nodes (VMSS-backed) | Supported with SCALE_SET=enable |
| Spot VMs | Supported (note: Azure may not restart them automatically) |
Permissions required
The Azure principal used by the chaos pod (service principal, workload identity, or managed identity) needs the following role on the target resource group or subscription.
Recommended built-in role: Virtual Machine Contributor
Custom role (minimum actions):
{
  "Name": "Harness Chaos VM Stop",
  "Actions": [
    "Microsoft.Compute/virtualMachines/read",
    "Microsoft.Compute/virtualMachines/start/action",
    "Microsoft.Compute/virtualMachines/deallocate/action",
    "Microsoft.Compute/virtualMachineScaleSets/virtualMachines/read",
    "Microsoft.Compute/virtualMachineScaleSets/virtualMachines/start/action",
    "Microsoft.Compute/virtualMachineScaleSets/virtualMachines/deallocate/action"
  ],
  "AssignableScopes": ["/subscriptions/<SUBSCRIPTION_ID>"]
}
Go to Azure fault permissions to read the full permission catalog.
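One way to register the custom role and assign it is sketched below, assuming the role JSON above is saved to a local file and the caller is allowed to manage role assignments. grant_chaos_role and its argument names are hypothetical; the az subcommands are the standard role definition and role assignment commands.

```shell
# Sketch: create the custom role from a JSON file, then assign it at
# resource-group scope. grant_chaos_role is a hypothetical helper name.
grant_chaos_role() (
  role_file="$1"      # path to the custom role JSON
  assignee="$2"       # object ID or application ID of the chaos principal
  subscription="$3"   # subscription that owns the resource group
  rg="$4"             # target resource group
  scope="/subscriptions/$subscription/resourceGroups/$rg"
  # The assignment only runs if the role definition was created successfully.
  az role definition create --role-definition "@$role_file" &&
    az role assignment create --assignee "$assignee" \
      --role "Harness Chaos VM Stop" --scope "$scope"
)

# Usage: grant_chaos_role role.json <principal-id> <SUBSCRIPTION_ID> <rg>
```

Scoping the assignment to the resource group rather than the whole subscription keeps the principal's reach limited to the VMs the experiment targets.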
Authentication
Pick one of the following methods. Go to Azure authentication methods to read the full setup.
| Method | When to use it | How to configure |
|---|---|---|
| Service principal | Chaos infrastructure runs outside AKS, or you want explicit static credentials | Upload the service principal JSON file as a File Secret in Harness Secret Manager and reference it via AZURE_AUTHENTICATION_SECRET |
| Workload identity | Chaos infrastructure runs on AKS with OIDC issuer + workload identity enabled | Annotate the chaos infra service account with azure.workload.identity/client-id; the pod authenticates without static credentials |
| Managed identity | Chaos infrastructure runs on AKS with a system-assigned or user-assigned managed identity on the node pool | No tunable changes; the pod inherits the identity from IMDS |
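For the workload identity row, the annotation step might look like the sketch below. The namespace and service account names are placeholders you must supply (the actual names depend on how the chaos infrastructure was installed), and annotate_chaos_sa is a hypothetical helper.

```shell
# Sketch: attach the Azure client ID to the chaos infra service account.
# annotate_chaos_sa is a hypothetical helper; all three arguments are assumptions
# about your installation.
annotate_chaos_sa() (
  namespace="$1"         # namespace where the chaos infrastructure runs
  service_account="$2"   # service account used by the chaos pod
  client_id="$3"         # client ID of the Azure identity to federate with
  kubectl annotate serviceaccount "$service_account" \
    --namespace "$namespace" \
    "azure.workload.identity/client-id=$client_id" --overwrite
)

# Usage: annotate_chaos_sa <namespace> <service-account> <client-id>
```

The --overwrite flag makes the command safe to re-run when the client ID changes.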
Fault tunables
Configure the following fault parameters when you add Azure instance stop to an experiment in Chaos Studio. Defaults are shown for reference.
Required parameters
| Tunable | Description | Default |
|---|---|---|
| AZURE_INSTANCE_NAMES | Comma-separated list of VM names (for example vm-1,vm-2). | (required) |
| RESOURCE_GROUP | Resource group that contains the VMs. | (required) |
Chaos parameters
| Tunable | Description | Default |
|---|---|---|
| TOTAL_CHAOS_DURATION | Total duration of the fault in seconds. The VMs stay deallocated for this period. | 30 |
| CHAOS_INTERVAL | Delay in seconds between successive iterations when running for more than one cycle. | 30 |
| SCALE_SET | Set to enable when the VMs belong to a Virtual Machine Scale Set. Otherwise leave empty. | "" |
| SEQUENCE | Order in which multiple instances are stopped: parallel or serial. | parallel |
| RAMP_TIME | Wait period in seconds before and after the fault. Go to ramp time to read how it is applied. | 0 |
Authentication
| Tunable | Description | Default |
|---|---|---|
| AZURE_SUBSCRIPTION_ID | Target Azure subscription ID. Required when using workload identity or managed identity. | "" |
| AZURE_CLIENT_ID | Client ID of a user-assigned managed identity (only if you have multiple identities attached). | "" |
| AZURE_AUTHENTICATION_SECRET | Identifier of the File Secret in Harness Secret Manager that contains the service principal JSON. | "" |
Tunables that apply to every fault are documented in common tunables for all faults.
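If you keep tunable values in a script or pipeline, a quick local sanity check can catch typos before the experiment is submitted. The sketch below is an assumption about what is worth checking, based only on the tables above; it is not the validation the fault itself performs, and validate_tunables and is_uint are hypothetical helper names.

```shell
# Sketch: local sanity check for tunable values held in shell variables.
# Not the fault's own validation; the rules mirror the tables above.
is_uint() { case "$1" in "" | *[!0-9]*) return 1 ;; *) return 0 ;; esac; }

validate_tunables() (
  rc=0
  # Required parameters must be non-empty.
  [ -n "${AZURE_INSTANCE_NAMES:-}" ] || { echo "AZURE_INSTANCE_NAMES is required"; rc=1; }
  [ -n "${RESOURCE_GROUP:-}" ] || { echo "RESOURCE_GROUP is required"; rc=1; }
  # SEQUENCE accepts parallel (the default) or serial.
  case "${SEQUENCE:-parallel}" in
    parallel | serial) ;;
    *) echo "SEQUENCE must be parallel or serial"; rc=1 ;;
  esac
  # Durations are whole seconds; unset values fall back to the documented defaults.
  is_uint "${TOTAL_CHAOS_DURATION:-30}" || { echo "TOTAL_CHAOS_DURATION must be a whole number of seconds"; rc=1; }
  is_uint "${CHAOS_INTERVAL:-30}" || { echo "CHAOS_INTERVAL must be a whole number of seconds"; rc=1; }
  is_uint "${RAMP_TIME:-0}" || { echo "RAMP_TIME must be a whole number of seconds"; rc=1; }
  exit "$rc"
)

# Usage: set the variables in the current shell, then run: validate_tunables
```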
Fault execution in brief
The fault calls the Azure Resource Manager API to deallocate each VM in AZURE_INSTANCE_NAMES (in RESOURCE_GROUP), waits for TOTAL_CHAOS_DURATION seconds, then starts the VMs again.
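The same deallocate, wait, start sequence can be approximated by hand with the az CLI, which is useful for rehearsing the blast radius before running the fault. This is a manual approximation for standalone VMs in serial order, not the fault's own implementation; stop_then_start is a hypothetical helper name.

```shell
# Manual approximation of the fault flow for standalone VMs (serial order).
# stop_then_start is a hypothetical helper, not the fault's implementation.
stop_then_start() (
  rg="$1"               # resource group
  names="$2"            # comma-separated VM names
  duration="${3:-30}"   # seconds to keep the VMs deallocated
  vms=$(printf '%s' "$names" | tr ',' ' ')
  rc=0
  for vm in $vms; do
    # az waits for the deallocation to finish before returning.
    az vm deallocate --resource-group "$rg" --name "$vm" || rc=1
  done
  sleep "$duration"
  # Always attempt the start, even if a deallocate failed, so no VM is left down.
  for vm in $vms; do
    az vm start --resource-group "$rg" --name "$vm" || rc=1
  done
  exit "$rc"
)

# Usage: stop_then_start <rg> vm-1,vm-2 60
```

The start loop runs unconditionally so that a failure partway through the deallocate loop does not leave earlier VMs deallocated.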
Expected behavior during fault execution
- The target VMs transition running → deallocating → deallocated and stay there for TOTAL_CHAOS_DURATION.
- For AKS worker nodes: pods on the node go to NotReady/Unknown, then the scheduler reschedules them.
- For VMs behind a load balancer: backend health probes fail; traffic shifts to healthy backends.
- After the duration ends, the VMs transition back to running.

The chaos pod calls start on every targeted VM. Boot time depends on the OS image and post-boot init scripts.
Signals to watch
Attach resilience probes to assert each layer:
- Instance state: Use a command probe running az vm get-instance-view -g <rg> -n <vm> --query "instanceView.statuses[?starts_with(code, 'PowerState/')].code" and assert the state changed.
- Application availability: Use an HTTP probe on the user-visible endpoint behind the load balancer.
Verify the fault execution effect
- Inspect VM power state with az.

    az vm get-instance-view --resource-group <rg> --name <vm> \
      --query "instanceView.statuses[?starts_with(code, 'PowerState/')].code"

  The state should be PowerState/deallocated during the chaos window and PowerState/running afterwards.
- Inspect Azure activity log.

    az monitor activity-log list --resource-group <rg> --max-events 20 \
      --query "[?contains(operationName.value,'deallocate')]"
Recovery and cleanup
- End of duration: The chaos pod calls start on every targeted VM.
- Abort the experiment: Stopping the experiment from Chaos Studio also calls start.
- Manual recovery: If the chaos pod exited before it could start the VMs again, run az vm start --resource-group <rg> --name <vm> manually.
- Workload recovery: Boot time depends on the OS image and init scripts; AKS node Ready transitions usually complete within 2-3 minutes.
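To confirm recovery rather than assume it, you can poll the power state until the VM reports running or a timeout expires. The sketch below is one possible polling loop; wait_for_running, the 300-second default timeout, and the 10-second interval are illustrative assumptions, and the reported elapsed time counts only the sleep intervals.

```shell
# Sketch: poll a VM until it reports PowerState/running or the timeout expires.
# wait_for_running and its defaults are hypothetical choices for illustration.
wait_for_running() (
  rg="$1"; vm="$2"
  timeout="${3:-300}"    # give up after this many seconds of waiting
  interval="${4:-10}"    # seconds between polls
  elapsed=0
  state=""
  while [ "$elapsed" -le "$timeout" ]; do
    state=$(az vm get-instance-view --resource-group "$rg" --name "$vm" \
      --query "instanceView.statuses[?starts_with(code, 'PowerState/')].code | [0]" \
      --output tsv 2>/dev/null) || state=""
    if [ "$state" = "PowerState/running" ]; then
      echo "$vm running after ~${elapsed}s"
      exit 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "$vm still ${state:-unknown} after ${timeout}s" >&2
  exit 1
)

# Usage: wait_for_running <rg> <vm> 300 10
```

A running power state only means the VM is allocated and booting; application readiness still depends on the OS image and init scripts, so pair this with an application-level check.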
Limitations
- Same-subscription targeting: A single experiment targets one AZURE_SUBSCRIPTION_ID.
- Resource group scope: All entries in AZURE_INSTANCE_NAMES must be in RESOURCE_GROUP.
- Spot VMs: Azure may not restart evicted Spot VMs automatically.
- VMSS instance IDs: When SCALE_SET=enable, entries in AZURE_INSTANCE_NAMES are VMSS instance IDs (0, 1, ...), not VM names.
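Because the scale-set mode expects instance IDs, it can help to list them first. The sketch below prints the IDs of a scale set as a comma-separated string in the format AZURE_INSTANCE_NAMES uses; list_vmss_instance_ids is a hypothetical helper name.

```shell
# Sketch: print the instance IDs of a scale set as a comma-separated list.
# list_vmss_instance_ids is a hypothetical helper name.
list_vmss_instance_ids() (
  rg="$1"     # resource group of the scale set
  vmss="$2"   # scale set name
  # One ID per line from az, joined with commas by paste.
  az vmss list-instances --resource-group "$rg" --name "$vmss" \
    --query "[].instanceId" --output tsv | paste -sd ',' -
)

# Usage: list_vmss_instance_ids <rg> <vmss>
```

Pick a subset of the printed IDs for the experiment rather than the whole list, unless you intend to take the entire scale set down.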
Troubleshooting
Azure instance stop fails with AuthorizationFailed in Harness Chaos Engineering
The Azure principal used by the chaos pod is missing Microsoft.Compute/virtualMachines/deallocate/action or Microsoft.Compute/virtualMachines/start/action. Assign the Virtual Machine Contributor role (or a custom role with the required actions) on the target resource group or subscription.
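To check what the principal actually holds, you can list its role assignments at the target scope, including those inherited from parent scopes. The sketch below returns success when a named role is present; principal_has_role is a hypothetical helper, and it assumes the caller is allowed to read role assignments.

```shell
# Sketch: succeed if the principal holds the named role at (or above) the scope.
# principal_has_role is a hypothetical helper name.
principal_has_role() (
  assignee="$1"   # object ID or application ID of the chaos principal
  scope="$2"      # for example /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<rg>
  role="$3"       # role name to look for
  az role assignment list --assignee "$assignee" --scope "$scope" \
    --include-inherited --query "[].roleDefinitionName" --output tsv |
    grep -Fxq "$role"   # exact, whole-line match on the role name
)

# Usage: principal_has_role <principal-id> <scope> "Virtual Machine Contributor"
```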
Azure instance stop fails with ResourceNotFound
Confirm each VM name in AZURE_INSTANCE_NAMES exists in RESOURCE_GROUP with az vm list -g <rg> --query '[].name'. Confirm AZURE_SUBSCRIPTION_ID matches the subscription that owns the resource group.
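With more than a handful of targets, comparing the lists by eye gets error-prone. The sketch below prints only the requested names that are absent from the resource group; missing_vms is a hypothetical helper, and the comparison ignores case on the assumption that VM names are matched case-insensitively.

```shell
# Sketch: print requested VM names that do not exist in the resource group.
# missing_vms is a hypothetical helper; exit 0 = none missing, 1 = some missing,
# 2 = the listing call itself failed.
missing_vms() (
  rg="$1"      # resource group
  names="$2"   # comma-separated VM names
  existing=$(az vm list --resource-group "$rg" --query "[].name" --output tsv) || exit 2
  rc=0
  for vm in $(printf '%s' "$names" | tr ',' ' '); do
    # Whole-line, fixed-string, case-insensitive match against the listing.
    if ! printf '%s\n' "$existing" | grep -Fxqi "$vm"; then
      echo "$vm"
      rc=1
    fi
  done
  exit "$rc"
)

# Usage: missing_vms <rg> vm-1,vm-2
```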
VMs stayed deallocated after the experiment ended
If the chaos pod exited before calling start, run az vm start --resource-group <rg> --name <vm> manually. For VMSS instances, run az vmss start --resource-group <rg> --name <vmss> --instance-ids <id>.
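When several targets were left deallocated, the two commands above can be wrapped in a loop. The sketch below starts every entry in a comma-separated list, treating entries as instance IDs when a scale set name is passed; start_targets is a hypothetical helper name.

```shell
# Sketch: start every target in a comma-separated list.
# With a third argument, entries are treated as instance IDs of that scale set.
# start_targets is a hypothetical helper name.
start_targets() (
  rg="$1"          # resource group
  names="$2"       # VM names, or VMSS instance IDs when a scale set is given
  vmss="${3:-}"    # optional scale set name
  rc=0
  for target in $(printf '%s' "$names" | tr ',' ' '); do
    if [ -n "$vmss" ]; then
      az vmss start --resource-group "$rg" --name "$vmss" --instance-ids "$target" || rc=1
    else
      az vm start --resource-group "$rg" --name "$target" || rc=1
    fi
  done
  exit "$rc"   # non-zero if any start call failed; the loop still tries the rest
)

# Usage: start_targets <rg> vm-1,vm-2
#        start_targets <rg> 0,1 <vmss>
```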
Related faults
- Azure AKS node down: Deallocate AKS VMSS nodes selected by node pool or zone.
- Azure disk loss: Detach disks instead of stopping VMs.
- Azure web app stop: Stop an App Service web app instead of a VM.