GCP VM instance stop
GCP VM instance stop is a GCP chaos fault that stops one or more Compute Engine VM instances listed in VM_INSTANCE_NAMES (in ZONES, project GCP_PROJECT_ID) for TOTAL_CHAOS_DURATION seconds, then starts them again. When MANAGED_INSTANCE_GROUP=enable, the fault does not start the stopped instances; it relies on the managed instance group (MIG) auto-healer to recreate them.
Use this fault to test how a workload behaves when a VM disappears: whether managed instance groups recreate the VM inside the recovery SLA, whether clients fail over cleanly, whether GKE node-down handling kicks in (if the VM is a GKE node), and whether monitoring detects the outage within the alerting SLA.
If you have not configured the chaos infrastructure yet, go to Quickstart to install the chaos infrastructure and run an experiment end to end.
Use cases
Run this fault when you want to answer concrete questions like:
- VM disappears: When the target VM stops, do dependents (load balancers, MIGs, GKE) recover inside the SLA?
- MIG auto-healing: Does the managed instance group recreate the VM with the expected boot time?
- GKE node-down handling: If the VM is a GKE node, does the cluster drain pods and reschedule them on healthy nodes?
- Client failover: Do clients connected to the stopped VM fail over to surviving instances cleanly?
- Monitoring fidelity: Do alerts on compute.googleapis.com/instance/uptime, instance count, and end-to-end availability fire within the alerting SLA?
Prerequisites
- Kubernetes version: 1.21 or later for the chaos infrastructure cluster.
- Target VMs reachable: Each entry in VM_INSTANCE_NAMES exists in the corresponding zone in ZONES and GCP_PROJECT_ID.
- VM in RUNNING state: The fault refuses to stop a VM that is already TERMINATED or STOPPING.
- GCP credentials available: Either a Google service account JSON key uploaded as a File Secret in Harness Secret Manager (referenced via GCP_AUTHENTICATION_SECRET) or Workload Identity for chaos infrastructure running on GKE.
- IAM permissions granted: The service account includes the permissions listed below.
Supported environments
| Platform | Support status |
|---|---|
| Compute Engine VMs (any machine type) | Supported |
| GKE worker nodes (Compute Engine MIGs) | Supported |
| GKE Autopilot nodes | Not supported (nodes are managed by GCP) |
| Spot/Preemptible VMs | Supported (note: GCP may not start them back automatically) |
| Multi-zone targeting in a single run | Supported via comma-separated ZONES matching the order of VM_INSTANCE_NAMES |
Permissions required
The Google service account used by the chaos pod (delivered through GCP_AUTHENTICATION_SECRET or Workload Identity) needs the following IAM permissions on the target project.
{
  "permissions": [
    "compute.instances.get",
    "compute.instances.start",
    "compute.instances.stop",
    "compute.instances.list"
  ]
}
Granting the predefined role roles/compute.instanceAdmin.v1 covers these and is the simplest setup.
Go to GCP IAM integration to use Workload Identity instead of a service account key.
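As an illustration of granting the predefined role, the sketch below wraps the standard gcloud binding command in a small function. The function name and its arguments are hypothetical placeholders; substitute your own project ID and service account email.

```shell
# Sketch: bind roles/compute.instanceAdmin.v1 to the service account used by
# the chaos pod. The function name, project ID, and service account email are
# placeholders for illustration, not values defined by the fault.
grant_chaos_role() {
  local project_id="$1" sa_email="$2"
  gcloud projects add-iam-policy-binding "$project_id" \
    --member="serviceAccount:${sa_email}" \
    --role="roles/compute.instanceAdmin.v1"
}
```

A call might look like `grant_chaos_role my-project chaos-sa@my-project.iam.gserviceaccount.com`. If you prefer least privilege, a custom role with only the four permissions listed above is the narrower alternative.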
Authentication
The fault supports two credential delivery models. Pick one based on how your chaos infrastructure is deployed.
| Method | When to use it | How to configure |
|---|---|---|
| Harness Secret Manager File Secret | Chaos infrastructure runs outside GKE, or you want explicit static credentials | Upload the GCP service account JSON key as a File Secret in Harness Secret Manager and reference its identifier via GCP_AUTHENTICATION_SECRET |
| Workload Identity | Chaos infrastructure runs on GKE with Workload Identity enabled | Bind a Google service account to the chaos infra Kubernetes service account; no tunable changes required |
Go to Creating secrets for GCP experiments to read the secret format. Go to GCP IAM integration for Workload Identity.
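For the Workload Identity row, a binding typically has two halves: an IAM binding on the Google service account and an annotation on the Kubernetes service account. The sketch below shows one common way to do this with gcloud and kubectl. The namespace and Kubernetes service account name are assumptions, so check which ones your chaos infrastructure actually uses, and treat the GCP IAM integration page as the authoritative procedure.

```shell
# Sketch: let a Kubernetes service account act as a Google service account.
# The namespace and Kubernetes service account name are hypothetical; use the
# ones your chaos infrastructure runs under.
bind_workload_identity() {
  local project_id="$1" gsa_email="$2" namespace="$3" ksa_name="$4"

  # Allow the Kubernetes service account to impersonate the Google one.
  gcloud iam service-accounts add-iam-policy-binding "$gsa_email" \
    --project="$project_id" \
    --role="roles/iam.workloadIdentityUser" \
    --member="serviceAccount:${project_id}.svc.id.goog[${namespace}/${ksa_name}]" || return 1

  # Point the Kubernetes service account at the Google service account.
  kubectl annotate serviceaccount "$ksa_name" \
    --namespace="$namespace" --overwrite \
    "iam.gke.io/gcp-service-account=${gsa_email}"
}
```

With this in place, GCP_AUTHENTICATION_SECRET can stay empty, as noted in the table above.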
Fault tunables
Configure the following fault parameters when you add GCP VM instance stop to an experiment in Chaos Studio. Defaults are shown for reference.
Required parameters
| Tunable | Description | Default |
|---|---|---|
| GCP_PROJECT_ID | ID of the GCP project that contains the VM instances. | (required) |
| VM_INSTANCE_NAMES | Comma-separated list of VM instance names to stop (for example vm-1,vm-2). | (required) |
| ZONES | Comma-separated list of zones in the same order as VM_INSTANCE_NAMES (for example us-central1-a,us-central1-b). | (required) |
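Because ZONES is matched to VM_INSTANCE_NAMES by position, it can help to check the pairing before running an experiment. The following is a minimal sketch of that pairing in POSIX shell; the function name is hypothetical and the logic only approximates how the fault reads the two lists.

```shell
# Sketch: pair each VM name with the zone at the same position.
# Prints one "name zone" pair per line, or fails without printing anything
# if the two comma-separated lists have different lengths.
pair_instances_with_zones() (
  names="$1"
  zones="$2"
  pairs=""
  while [ -n "$names" ] || [ -n "$zones" ]; do
    name="${names%%,*}"   # first remaining entry of each list
    zone="${zones%%,*}"
    if [ -z "$name" ] || [ -z "$zone" ]; then
      echo "VM_INSTANCE_NAMES and ZONES must have the same number of entries" >&2
      exit 1
    fi
    pairs="${pairs}${name} ${zone}
"
    # Drop the consumed entry; an entry without a comma is the last one.
    case "$names" in *,*) names="${names#*,}" ;; *) names="" ;; esac
    case "$zones" in *,*) zones="${zones#*,}" ;; *) zones="" ;; esac
  done
  printf '%s' "$pairs"
)
```

For the example values in the table, `pair_instances_with_zones 'vm-1,vm-2' 'us-central1-a,us-central1-b'` pairs vm-1 with us-central1-a and vm-2 with us-central1-b.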
Chaos parameters
| Tunable | Description | Default |
|---|---|---|
| TOTAL_CHAOS_DURATION | Total duration of the fault in seconds. The VMs stay stopped for this period. | 60 |
| CHAOS_INTERVAL | Delay in seconds between successive iterations when running for more than one cycle. | 60 |
| MANAGED_INSTANCE_GROUP | When set to enable, the fault does not start the instances after the chaos; the MIG auto-healer recreates them. | disable |
| SEQUENCE | Order in which multiple instances are stopped: parallel stops all at once; serial stops them one at a time. | parallel |
| RAMP_TIME | Wait period in seconds before and after the fault. Go to ramp time to read how it is applied. | 0 |
Authentication
| Tunable | Description | Default |
|---|---|---|
| GCP_AUTHENTICATION_SECRET | Identifier of the File Secret in Harness Secret Manager that contains the GCP service account JSON key. Not required when using Workload Identity. | "" |
Tunables that apply to every fault are documented in common tunables for all faults.
Fault execution in brief
The fault calls the Compute Engine API to stop each VM in VM_INSTANCE_NAMES (in the matching zone from ZONES), waits for TOTAL_CHAOS_DURATION seconds, then starts the VMs again (unless MANAGED_INSTANCE_GROUP=enable).
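To make the sequence concrete, here is a rough manual equivalent for a single VM using gcloud. It is a sketch of the stop, wait, start flow described above, not the fault's own implementation; the function name and argument order are hypothetical.

```shell
# Sketch of the stop -> wait -> start flow for one VM.
# Approximates the behavior described above; it is not the fault's own code.
stop_wait_start() {
  local project="$1" vm="$2" zone="$3" duration="$4" mig_mode="${5:-disable}"

  gcloud compute instances stop "$vm" --zone="$zone" --project="$project" || return 1

  # Keep the VM stopped for the chaos duration (seconds).
  sleep "$duration"

  # In MIG mode the auto-healer is expected to bring the VM back instead.
  if [ "$mig_mode" != "enable" ]; then
    gcloud compute instances start "$vm" --zone="$zone" --project="$project"
  fi
}
```

Passing `enable` as the fifth argument skips the start call, mirroring MANAGED_INSTANCE_GROUP=enable.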
Expected behavior during fault execution
- The target VMs transition RUNNING → STOPPING → TERMINATED and stay there for TOTAL_CHAOS_DURATION.
- For GKE worker nodes: the node goes to NotReady/Unknown, then Kubernetes reschedules its pods onto healthy nodes.
- For VMs behind a load balancer: health checks on the affected backends start to fail and traffic shifts to healthy backends.
- When MANAGED_INSTANCE_GROUP=enable: the MIG auto-healer launches a replacement VM with a new instance ID.
- After the duration ends (and MANAGED_INSTANCE_GROUP=disable), the VMs transition back to RUNNING.
The chaos pod calls instances.start on every targeted VM unless MANAGED_INSTANCE_GROUP=enable. Boot time depends on the machine image and startup scripts.
Signals to watch
Attach resilience probes to assert each layer:
- Instance state: Use a command probe running gcloud compute instances describe <vm> --zone=<zone> --format='value(status)' and assert the state changed.
- Application availability: Use an HTTP probe on the user-visible endpoint behind the load balancer.
- MIG health: Use a command probe running gcloud compute instance-groups managed describe <mig> to confirm the auto-healer recreated the VM.
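For the instance-state check, a single describe call only captures one moment. The sketch below polls the same gcloud command until the VM reports an expected status. The function name, attempt count, and delay are assumptions chosen for illustration; how you wire such a script into a command probe depends on your probe configuration.

```shell
# Sketch: poll a VM until it reports the expected status or attempts run out.
# Attempt count and delay defaults are arbitrary illustration values.
wait_for_status() {
  local vm="$1" zone="$2" expected="$3" attempts="${4:-30}" delay="${5:-5}"
  local status="" i=0
  while [ "$i" -lt "$attempts" ]; do
    status="$(gcloud compute instances describe "$vm" --zone="$zone" --format='value(status)')"
    if [ "$status" = "$expected" ]; then
      echo "$vm reached $expected"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$vm still $status after $attempts checks" >&2
  return 1
}
```

For example, `wait_for_status vm-1 us-central1-a TERMINATED` during the chaos window, then `wait_for_status vm-1 us-central1-a RUNNING` after it.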
Verify the fault execution effect
While the experiment is running, confirm the VM stopped and then restarted:
- Inspect VM state with gcloud.

      gcloud compute instances describe <vm-name> \
        --zone=<zone> \
        --format="value(status)"

  The status should be STOPPING/TERMINATED during the chaos window and RUNNING afterwards.

- Inspect Cloud Monitoring metrics.

  Use the Cloud Console to inspect compute.googleapis.com/instance/uptime and confirm the gap during the chaos window.

- Inspect Compute Engine audit logs.

      gcloud logging read 'resource.type=gce_instance AND protoPayload.methodName=v1.compute.instances.stop' --limit=10

  The stop call from the chaos pod's service account should appear.
Recovery and cleanup
- End of duration: The chaos pod calls instances.start on every targeted VM (unless MANAGED_INSTANCE_GROUP=enable).
- Abort the experiment: Stopping the experiment from Chaos Studio also calls instances.start.
- Manual recovery: If the chaos pod exited before restarting the VM, run gcloud compute instances start <vm-name> --zone=<zone> manually.
- Workload recovery: Boot time depends on the machine image and startup scripts; GKE node Ready transitions usually complete within 2-3 minutes.
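For manual recovery, it is safer to check the current status before issuing a start. The following sketch starts a VM only when it reports TERMINATED; the function name is hypothetical, and the gcloud commands are the same ones used elsewhere on this page.

```shell
# Sketch: start a VM only if it is currently TERMINATED.
start_if_terminated() {
  local vm="$1" zone="$2" status
  status="$(gcloud compute instances describe "$vm" --zone="$zone" --format='value(status)')" || return 1
  if [ "$status" = "TERMINATED" ]; then
    gcloud compute instances start "$vm" --zone="$zone"
  else
    echo "$vm is $status; leaving it alone"
  fi
}
```

Run it once per VM and zone pair. For MIG-managed VMs, prefer waiting for the auto-healer rather than starting instances by hand.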
Limitations
- Same-project targeting: A single experiment targets one GCP_PROJECT_ID. Use multiple experiments for cross-project scope.
- Zone alignment: ZONES must match VM_INSTANCE_NAMES positionally; mismatches return an Instance not found error.
- Spot/preemptible behavior: GCP may not start preempted Spot VMs back automatically; combine with MANAGED_INSTANCE_GROUP=enable for MIG-managed VMs.
- GKE Autopilot: Not supported because GCP manages the underlying nodes.
- MIG mode skips restart: When MANAGED_INSTANCE_GROUP=enable, recovery is fully driven by the MIG auto-healer; the fault does not call instances.start.
Troubleshooting
GCP VM instance stop fails with PermissionDenied in Harness Chaos Engineering
The service account used by the chaos pod lacks compute.instances.stop or compute.instances.start. Grant roles/compute.instanceAdmin.v1 (or the four permissions listed above) on the target project and re-run.
GCP VM instance stop fails with Instance not found
VM_INSTANCE_NAMES and ZONES must align positionally. Confirm with gcloud compute instances list --filter='name=<vm>' --format='value(zone)' that the zone is correct. Also confirm GCP_PROJECT_ID matches the project that owns the VMs.
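The zone lookup can be turned into a quick check that compares the zone you configured with the zone GCP reports. This is a sketch with a hypothetical function name; it assumes the VM name is unique within the project, since a name that exists in several zones returns several lines and the comparison then fails.

```shell
# Sketch: compare the configured zone with the zone GCP reports for a VM.
# Assumes the VM name is unique within the project.
check_vm_zone() {
  local project="$1" vm="$2" configured_zone="$3" actual_zone
  actual_zone="$(gcloud compute instances list --project="$project" \
    --filter="name=${vm}" --format='value(zone.basename())')"
  if [ -z "$actual_zone" ]; then
    echo "$vm not found in project $project"
    return 1
  fi
  if [ "$actual_zone" != "$configured_zone" ]; then
    echo "$vm is in $actual_zone, not $configured_zone"
    return 1
  fi
  echo "$vm zone OK"
}
```

Running it for each name and zone pair before the experiment catches positional mismatches early.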
VMs stayed stopped (TERMINATED) after the experiment ended
If MANAGED_INSTANCE_GROUP=disable and the chaos pod exited before restart, run gcloud compute instances start <vm-name> --zone=<zone> manually. If MANAGED_INSTANCE_GROUP=enable, check the MIG auto-healer (gcloud compute instance-groups managed describe <mig>) and confirm it created a replacement.
Related faults
- GCP VM instance stop by label: Stop a percentage of VMs selected by label instead of named ones.
- GCP VM disk loss: Detach disks instead of stopping VMs.
- GCP SQL instance failover: Fail over a Cloud SQL instance.