GCP VM instance stop

GCP VM instance stop is a GCP chaos fault that stops one or more Compute Engine VM instances listed in VM_INSTANCE_NAMES (in ZONES, project GCP_PROJECT_ID) for TOTAL_CHAOS_DURATION seconds, then starts them again. When MANAGED_INSTANCE_GROUP=enable, the fault does not start the stopped instances; it relies on the managed instance group (MIG) auto-healer to recreate them.

Use this fault to test how a workload behaves when a VM disappears: whether managed instance groups recreate the VM inside the recovery SLA, whether clients fail over cleanly, whether GKE node-down handling kicks in (if the VM is a GKE node), and whether monitoring detects the outage within the alerting SLA.

Run your first experiment

If you have not configured the chaos infrastructure yet, go to Quickstart to install the chaos infrastructure and run an experiment end to end.


Use cases

Run this fault when you want to answer concrete questions like:

  • VM disappears: When the target VM stops, do dependents (load balancers, MIGs, GKE) recover inside the SLA?
  • MIG auto-healing: Does the managed instance group recreate the VM with the expected boot time?
  • GKE node-down handling: If the VM is a GKE node, does the cluster drain pods and reschedule them on healthy nodes?
  • Client failover: Do clients connected to the stopped VM fail over to surviving instances cleanly?
  • Monitoring fidelity: Do alerts on compute.googleapis.com/instance/uptime, instance count, and end-to-end availability fire within the alerting SLA?

Prerequisites

  • Kubernetes version: 1.21 or later for the chaos infrastructure cluster.
  • Target VMs reachable: Each entry in VM_INSTANCE_NAMES exists in the corresponding zone in ZONES and GCP_PROJECT_ID.
  • VM in RUNNING state: The fault refuses to stop a VM that is already TERMINATED or STOPPING.
  • GCP credentials available: Either a Google service account JSON key uploaded as a File Secret in Harness Secret Manager (referenced via GCP_AUTHENTICATION_SECRET) or Workload Identity for chaos infrastructure running on GKE.
  • IAM permissions granted: The service account includes the permissions listed below.

Supported environments

Platform | Support status
Compute Engine VMs (any machine type) | Supported
GKE worker nodes (Compute Engine MIGs) | Supported
GKE Autopilot nodes | Not supported (nodes are managed by GCP)
Spot/Preemptible VMs | Supported (note: GCP may not restart them automatically)
Multi-zone targeting in a single run | Supported via comma-separated ZONES matching the order of VM_INSTANCE_NAMES

Permissions required

The Google service account used by the chaos pod (delivered through GCP_AUTHENTICATION_SECRET or Workload Identity) needs the following IAM permissions on the target project.

{
"permissions": [
"compute.instances.get",
"compute.instances.start",
"compute.instances.stop",
"compute.instances.list"
]
}

Granting the predefined role roles/compute.instanceAdmin.v1 covers these and is the simplest setup.
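
As an illustration of granting the predefined role, the binding can be created with gcloud. This is a minimal sketch: the function name grant_chaos_role and the service account email passed to it are hypothetical placeholders, not something the fault provides.

```bash
# Hypothetical helper: bind roles/compute.instanceAdmin.v1 to the service
# account that the chaos pod uses. Both arguments are values you supply.
grant_chaos_role() {
  project_id=$1   # the project you also pass as GCP_PROJECT_ID
  sa_email=$2     # assumed service account email, for illustration only
  gcloud projects add-iam-policy-binding "$project_id" \
    --member="serviceAccount:${sa_email}" \
    --role="roles/compute.instanceAdmin.v1"
}
```

A call would look like `grant_chaos_role my-project chaos-sa@my-project.iam.gserviceaccount.com`. If the predefined role is broader than you want, a custom role holding only the four permissions listed above is the narrower alternative.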

Go to GCP IAM integration to use Workload Identity instead of a service account key.


Authentication

The fault supports two credential delivery models. Pick one based on how your chaos infrastructure is deployed.

Method | When to use it | How to configure
Harness Secret Manager File Secret | Chaos infrastructure runs outside GKE, or you want explicit static credentials | Upload the GCP service account JSON key as a File Secret in Harness Secret Manager and reference its identifier via GCP_AUTHENTICATION_SECRET
Workload Identity | Chaos infrastructure runs on GKE with Workload Identity enabled | Bind a Google service account to the chaos infra Kubernetes service account; no tunable changes required

Go to Creating secrets for GCP experiments to read the secret format. Go to GCP IAM integration for Workload Identity.


Fault tunables

Configure the following fault parameters when you add GCP VM instance stop to an experiment in Chaos Studio. Defaults are shown for reference.

Required parameters

Tunable | Description | Default
GCP_PROJECT_ID | ID of the GCP project that contains the VM instances. | (required)
VM_INSTANCE_NAMES | Comma-separated list of VM instance names to stop (for example vm-1,vm-2). | (required)
ZONES | Comma-separated list of zones in the same order as VM_INSTANCE_NAMES (for example us-central1-a,us-central1-b). | (required)
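
Because VM_INSTANCE_NAMES and ZONES are matched by position, it can help to check the pairing before running the experiment. The following is a minimal sketch of such a check; pair_targets is a hypothetical helper and does not reflect how the fault parses these values internally.

```bash
# Hypothetical pre-flight helper: print "name zone" pairs in positional order
# and fail if the two comma-separated lists have different lengths.
pair_targets() {
  names=$1
  zones=$2
  while [ -n "$names" ] && [ -n "$zones" ]; do
    # Take the first entry of each list.
    name=${names%%,*}
    zone=${zones%%,*}
    printf '%s %s\n' "$name" "$zone"
    # Drop the consumed entry; an entry with no comma left is the last one.
    case $names in *,*) names=${names#*,} ;; *) names= ;; esac
    case $zones in *,*) zones=${zones#*,} ;; *) zones= ;; esac
  done
  # Leftover entries on either side mean the lists were not the same length.
  if [ -n "$names" ] || [ -n "$zones" ]; then
    echo "VM_INSTANCE_NAMES and ZONES have different lengths" >&2
    return 1
  fi
}
```

For example, `pair_targets "vm-1,vm-2" "us-central1-a,us-central1-b"` prints one pair per line, while a call with two names and one zone prints the first pair and then returns a non-zero status.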

Chaos parameters

Tunable | Description | Default
TOTAL_CHAOS_DURATION | Total duration of the fault in seconds. The VMs stay stopped for this period. | 60
CHAOS_INTERVAL | Delay in seconds between successive iterations when running for more than one cycle. | 60
MANAGED_INSTANCE_GROUP | When set to enable, the fault does not start the instances after the chaos; the MIG auto-healer recreates them. | disable
SEQUENCE | Order in which multiple instances are stopped: parallel stops all at once; serial stops them one at a time. | parallel
RAMP_TIME | Wait period in seconds before and after the fault. Go to ramp time to read how it is applied. | 0

Authentication

Tunable | Description | Default
GCP_AUTHENTICATION_SECRET | Identifier of the File Secret in Harness Secret Manager that contains the GCP service account JSON key. Not required when using Workload Identity. | ""

Tunables that apply to every fault are documented in common tunables for all faults.


Fault execution in brief

Calls the Compute Engine API to stop each VM in VM_INSTANCE_NAMES (in the matching zone from ZONES), waits for TOTAL_CHAOS_DURATION seconds, then starts the VMs again (unless MANAGED_INSTANCE_GROUP=enable).
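
The same sequence can be approximated by hand for a single VM, which is useful for rehearsing the impact before scheduling the experiment. This sketch is not the fault's implementation; stop_wait_start and its argument order are illustrative assumptions.

```bash
# Manual approximation of the stop -> wait -> start sequence for one VM.
# NOT the fault's own code; names and argument order are illustrative.
stop_wait_start() {
  project_id=$1
  vm=$2
  zone=$3
  duration=${4:-60}        # mirrors the TOTAL_CHAOS_DURATION default
  mig_mode=${5:-disable}   # mirrors the MANAGED_INSTANCE_GROUP default

  gcloud compute instances stop "$vm" \
    --zone="$zone" --project="$project_id" || return 1
  sleep "$duration"

  # In MIG mode the restart is left to the auto-healer.
  if [ "$mig_mode" = "enable" ]; then
    return 0
  fi
  gcloud compute instances start "$vm" \
    --zone="$zone" --project="$project_id"
}
```

For example, `stop_wait_start my-project vm-1 us-central1-a 120` keeps the VM stopped for about two minutes and then starts it again.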


Expected behavior during fault execution

  • The target VMs transition RUNNING → STOPPING → TERMINATED and stay there for TOTAL_CHAOS_DURATION.
  • For GKE worker nodes: the node goes NotReady, its pods go to Unknown or Terminating, and the scheduler places replacement pods on healthy nodes.
  • For VMs behind a load balancer: health checks on the affected backends start to fail and traffic shifts to healthy backends.
  • When MANAGED_INSTANCE_GROUP=enable: the MIG auto-healer launches a replacement VM with a new instance ID.
  • After the duration ends (and MANAGED_INSTANCE_GROUP=disable), the VMs transition back to RUNNING.

When the fault ends

The chaos pod calls instances.start on every targeted VM unless MANAGED_INSTANCE_GROUP=enable. Boot time depends on the machine image and startup scripts.

Signals to watch

Attach resilience probes to assert each layer:

  • Instance state: Use a command probe running gcloud compute instances describe <vm> --zone=<zone> --format='value(status)' and assert that the status is TERMINATED during the chaos window and RUNNING afterwards.
  • Application availability: Use an HTTP probe on the user-visible endpoint behind the load balancer.
  • MIG health: Use a command probe running gcloud compute instance-groups managed describe <mig> to confirm the auto-healer recreated the VM.
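
A command probe for instance state often needs to poll, because the status does not change instantly. The sketch below shows one way to wrap the describe call in a retry loop; wait_for_status, the retry budget, and the delay are assumptions rather than probe defaults.

```bash
# Hypothetical probe helper: poll a VM until it reports the expected status.
# Returns 0 on a match, 1 if the status never matches within the attempts.
wait_for_status() {
  vm=$1
  zone=$2
  expected=$3
  attempts=${4:-12}   # assumed retry budget
  delay=${5:-5}       # assumed seconds between polls
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$(gcloud compute instances describe "$vm" \
      --zone="$zone" --format='value(status)')
    [ "$status" = "$expected" ] && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for $vm to reach $expected (last: $status)" >&2
  return 1
}
```

For example, `wait_for_status vm-1 us-central1-a TERMINATED` can back a probe during the chaos window, and the same call with RUNNING can back an end-of-test probe.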

Verify the fault execution effect

While the experiment is running, confirm the VM stopped and then restarted:

  1. Inspect VM state with gcloud.

    gcloud compute instances describe <vm-name> \
    --zone=<zone> \
    --format="value(status)"

    The status should be STOPPING/TERMINATED during the chaos window and RUNNING afterwards.

  2. Inspect Cloud Monitoring metrics.

    Use the Cloud Console to inspect compute.googleapis.com/instance/uptime and confirm the gap during the chaos window.

  3. Inspect Compute Engine audit logs.

    gcloud logging read 'resource.type=gce_instance AND protoPayload.methodName=v1.compute.instances.stop' --limit=10

    The stop call from the chaos pod's service account should appear.
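
To narrow the audit log query to calls made by the chaos pod's identity, the filter can also match on the caller's email. This is a sketch under the assumption that you know the service account email; chaos_audit_filter is a hypothetical helper that only builds the filter string.

```bash
# Hypothetical helper: build a Cloud Logging filter for stop or start calls
# made by one principal. The email is a placeholder supplied by the caller.
chaos_audit_filter() {
  action=$1     # "stop" or "start"
  sa_email=$2
  printf 'resource.type=gce_instance AND protoPayload.methodName=v1.compute.instances.%s AND protoPayload.authenticationInfo.principalEmail="%s"' \
    "$action" "$sa_email"
}
```

The result can be passed to the same command as above, for example `gcloud logging read "$(chaos_audit_filter stop <service-account-email>)" --limit=10`. Running it again with start shows whether the restart call was issued.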


Recovery and cleanup

  • End of duration: The chaos pod calls instances.start on every targeted VM (unless MANAGED_INSTANCE_GROUP=enable).
  • Abort the experiment: Stopping the experiment from Chaos Studio also calls instances.start.
  • Manual recovery: If the chaos pod exited before restarting the VM, run gcloud compute instances start <vm-name> --zone=<zone> manually.
  • Workload recovery: Boot time depends on the machine image and startup scripts; GKE node Ready transitions usually complete within 2-3 minutes.
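
When several VMs were targeted, manual recovery can be scripted rather than run one VM at a time. The following sketch applies only to runs with MANAGED_INSTANCE_GROUP=disable; start_if_terminated and its "name zone" input format are assumptions made for illustration.

```bash
# Hypothetical cleanup helper: read "name zone" pairs on stdin and start
# every VM that still reports TERMINATED. Other states are left alone.
start_if_terminated() {
  project_id=$1
  while read -r vm zone; do
    [ -n "$vm" ] || continue
    # </dev/null keeps gcloud from consuming the loop's input.
    status=$(gcloud compute instances describe "$vm" --zone="$zone" \
      --project="$project_id" --format='value(status)' </dev/null)
    if [ "$status" = "TERMINATED" ]; then
      gcloud compute instances start "$vm" --zone="$zone" \
        --project="$project_id" </dev/null
    fi
  done
}
```

For example, `printf 'vm-1 us-central1-a\nvm-2 us-central1-b\n' | start_if_terminated my-project` starts only the VMs that are still stopped.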

Limitations

  • Same-project targeting: A single experiment targets one GCP_PROJECT_ID. Use multiple experiments for cross-project scope.
  • Zone alignment: ZONES must match VM_INSTANCE_NAMES positionally; mismatches return an Instance not found error.
  • Spot/preemptible behavior: GCP may not restart preempted Spot VMs automatically; combine with MANAGED_INSTANCE_GROUP=enable for MIG-managed VMs.
  • GKE Autopilot: Not supported because GCP manages the underlying nodes.
  • MIG mode skips restart: When MANAGED_INSTANCE_GROUP=enable, recovery is fully driven by the MIG auto-healer; the fault does not call instances.start.

Troubleshooting

GCP VM instance stop fails with PermissionDenied in Harness Chaos Engineering

The service account used by the chaos pod lacks compute.instances.stop, compute.instances.start, or both. Grant roles/compute.instanceAdmin.v1 (or the four permissions listed above) on the target project and re-run the experiment.

GCP VM instance stop fails with Instance not found

VM_INSTANCE_NAMES and ZONES must align positionally. Confirm with gcloud compute instances list --filter='name=<vm>' --format='value(zone)' that the zone is correct. Also confirm GCP_PROJECT_ID matches the project that owns the VMs.

VMs stayed stopped (TERMINATED) after the experiment ended

If MANAGED_INSTANCE_GROUP=disable and the chaos pod exited before restart, run gcloud compute instances start <vm-name> --zone=<zone> manually. If MANAGED_INSTANCE_GROUP=enable, check the MIG auto-healer (gcloud compute instance-groups managed describe <mig>) and confirm it created a replacement.