VMware CPU hog

VMware CPU hog is a VMware chaos fault that drives CPU utilization to CPU_LOAD percent across CPU_CORES cores on the Linux VM VM_NAME (hosted in vCenter) for TOTAL_CHAOS_DURATION seconds, then stops the stress workload. The fault uses VMware Tools (Guest Operations API) to run the stress workload inside the guest as VM_USER_NAME and reverts cleanly at the end.

Use this fault to test how a workload on a VMware-hosted VM behaves when compute headroom shrinks: whether latency stays inside the SLA, whether the OS scheduler keeps critical processes responsive, whether vSphere DRS responds correctly, and whether monitoring detects CPU saturation within the alerting SLA.

Run your first experiment

If you have not configured the chaos infrastructure yet, go to Quickstart to install the chaos infrastructure and run an experiment end to end.


Use cases

Run this fault when you want to answer concrete questions like:

  • CPU pressure on a vSphere VM: When CPU utilization climbs, does application latency stay inside the SLA?
  • DRS migration: Does vSphere DRS migrate the VM to a less-loaded host when sustained CPU pressure persists?
  • Co-tenant impact: Do other VMs on the same ESXi host degrade because of CPU steal time?
  • Monitoring fidelity: Do vCenter performance counters and downstream alerts fire inside the alerting SLA?

Prerequisites

  • Kubernetes version: 1.21 or later for the chaos infrastructure cluster.
  • vCenter reachable: The chaos infrastructure can reach GOVC_URL over port 443.
  • VMware Tools running on the guest: Verify with vmware-toolbox-cmd -v inside the VM.
  • Stress binary installed inside the guest: Go to VMware Linux binary installation to install the CPU stress prerequisites (stress-ng and pkill).
  • vCenter chaos role: The vCenter user (GOVC_USERNAME) is mapped to the chaos role described in VMware permissions.
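
The guest-side prerequisites above can be checked before a run with a short preflight script. This is a minimal sketch, assuming Python 3 is available inside the VM; the function names are hypothetical and are not part of the fault or of any Harness tooling. It only confirms that the binaries named above resolve on PATH, not that VMware Tools is actually running.

```python
import shutil
from typing import Callable, Dict, Iterable, List, Optional

# Binaries named in the prerequisites list above.
REQUIRED_BINARIES = ("vmware-toolbox-cmd", "stress-ng", "pkill")


def check_guest_binaries(
    names: Iterable[str] = REQUIRED_BINARIES,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, Optional[str]]:
    """Map each binary name to its resolved path, or None if it is not on PATH."""
    return {name: which(name) for name in names}


def missing_binaries(report: Dict[str, Optional[str]]) -> List[str]:
    """Return the names that could not be resolved, sorted alphabetically."""
    return sorted(name for name, path in report.items() if path is None)


if __name__ == "__main__":
    report = check_guest_binaries()
    for name, path in report.items():
        print(f"{name}: {path or 'NOT FOUND'}")
    missing = missing_binaries(report)
    if missing:
        print("Install before running the fault:", ", ".join(missing))
```

Run the script as VM_USER_NAME so that PATH resolution matches what the fault sees through Guest Operations.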

Supported environments

| Platform | Support status |
| --- | --- |
| Linux VMs hosted on vSphere / vCenter (any distro with VMware Tools) | Supported |
| Linux VMs without VMware Tools | Not supported (the fault drives the guest via Guest Operations) |
| Windows VMs | Not supported (use VMware Windows CPU hog) |

Permissions required

Two layers of permissions apply.

On vCenter. Map GOVC_USERNAME to the chaos role described in VMware permissions. For this Basic fault, the role needs at minimum:

  • Virtual machine → Guest operations → Program execution, Modifications, Queries.

On the guest OS. VM_USER_NAME must be able to execute the CPU stress binary and (on abort) run pkill. For non-root accounts, configure sudo for the stress binary if needed.


Authentication

Two credential sets are required.

| Layer | Tunables |
| --- | --- |
| vCenter (control plane) | GOVC_URL, GOVC_USERNAME, GOVC_PASSWORD, GOVC_INSECURE |
| Guest OS (target VM) | VM_USER_NAME, VM_PASSWORD |

Store each credential as a text secret in Harness Secret Manager and reference the secret identifier when configuring the experiment.

Set GOVC_INSECURE=true only if your vCenter certificate is self-signed and not yet trusted.
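
Before wiring the secrets into an experiment, it can help to sanity-check the two credential sets for obvious mistakes. The following is a minimal sketch, not part of the fault itself: `credential_problems` is a hypothetical helper, and the rules it applies (non-empty values, a GOVC_URL without a scheme, a boolean GOVC_INSECURE) are taken from the tunable descriptions on this page.

```python
from typing import Dict, List

# Tunables from both credential layers that must carry a value.
REQUIRED = (
    "GOVC_URL",
    "GOVC_USERNAME",
    "GOVC_PASSWORD",
    "VM_USER_NAME",
    "VM_PASSWORD",
)


def credential_problems(tunables: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means nothing obvious is wrong."""
    problems = []
    for key in REQUIRED:
        if not tunables.get(key, "").strip():
            problems.append(f"{key} is empty")
    # The tunable table documents GOVC_URL as a host name without a scheme.
    if "://" in tunables.get("GOVC_URL", ""):
        problems.append("GOVC_URL should not include a scheme")
    if tunables.get("GOVC_INSECURE", "").lower() not in ("", "true", "false"):
        problems.append("GOVC_INSECURE should be true or false")
    return problems


example = {
    "GOVC_URL": "https://vcenter.example.com",  # scheme included by mistake
    "GOVC_USERNAME": "chaos@vsphere.local",     # placeholder values only
    "GOVC_PASSWORD": "placeholder",
    "VM_USER_NAME": "chaos",
    "VM_PASSWORD": "placeholder",
}
print(credential_problems(example))
```

The check is purely syntactic. It does not contact vCenter or the guest, so a clean result does not prove that the credentials are valid.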


Fault tunables

Configure the following fault parameters when you add VMware CPU hog to an experiment in Chaos Studio. Defaults are shown for reference.

Required parameters

| Tunable | Description | Default |
| --- | --- | --- |
| VM_NAME | Name of the target VM as it appears in vCenter. | (required) |
| VM_USER_NAME | OS user account on the target VM. | (required) |
| VM_PASSWORD | Password for VM_USER_NAME. | (required) |

Stress parameters

| Tunable | Description | Default |
| --- | --- | --- |
| CPU_CORES | Number of CPU cores to stress. | 2 |
| CPU_LOAD | Target CPU utilization percentage per stressed core (0-100). | 100 |
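
Because CPU_LOAD applies per stressed core, the aggregate utilization you see on the VM depends on how many vCPUs it has. The sketch below works through that arithmetic. It assumes the stress is applied as CPU_CORES workers, each consuming CPU_LOAD percent of one vCPU; the page does not document the exact mapping, so treat the result as an estimate. The function name and the `vcpus` parameter are hypothetical.

```python
def expected_vm_utilization(cpu_cores: int, cpu_load: int, vcpus: int) -> float:
    """Estimate aggregate VM CPU utilization (percent) added by the fault.

    Assumption: cpu_cores workers each consume cpu_load percent of one vCPU,
    and the total cannot exceed the capacity of the VM (100 percent).
    """
    if cpu_cores < 1:
        raise ValueError("CPU_CORES must be at least 1")
    if not 0 <= cpu_load <= 100:
        raise ValueError("CPU_LOAD must be between 0 and 100")
    if vcpus < 1:
        raise ValueError("vcpus must be at least 1")
    return min(100.0, cpu_cores * cpu_load / vcpus)


# Defaults (2 cores at 100 percent) on a 4-vCPU VM.
print(expected_vm_utilization(2, 100, 4))  # prints 50.0
```

With the defaults, a 4-vCPU VM is expected to sit at roughly half of its total CPU capacity plus whatever the workload already uses, which is worth accounting for when you set alert thresholds.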

Chaos parameters

| Tunable | Description | Default |
| --- | --- | --- |
| TOTAL_CHAOS_DURATION | Total duration of the fault in seconds. | 30 |
| CHAOS_INTERVAL | Delay in seconds between successive iterations when running for more than one cycle. | 10 |
| SEQUENCE | Order in which multiple targets are stressed: parallel or serial. | parallel |
| RAMP_TIME | Wait period in seconds before and after the fault. Go to ramp time to read how it is applied. | 0 |
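
To line up probes and dashboards with the run, it helps to know when the stress window starts and ends relative to the start of the fault. The sketch below is an illustration only. It assumes RAMP_TIME is applied once before and once after the chaos window, as the table describes, and it ignores CHAOS_INTERVAL; the function name is hypothetical.

```python
from typing import List, Tuple


def fault_timeline(total_chaos_duration: int, ramp_time: int = 0) -> List[Tuple[str, int, int]]:
    """Return (phase, start_s, end_s) tuples measured from the start of the fault.

    Assumption: one ramp period before the chaos window and one after it.
    """
    if total_chaos_duration <= 0:
        raise ValueError("TOTAL_CHAOS_DURATION must be positive")
    if ramp_time < 0:
        raise ValueError("RAMP_TIME must not be negative")
    chaos_start = ramp_time
    chaos_end = chaos_start + total_chaos_duration
    return [
        ("pre-ramp", 0, chaos_start),
        ("chaos", chaos_start, chaos_end),
        ("post-ramp", chaos_end, chaos_end + ramp_time),
    ]


for phase, start, end in fault_timeline(total_chaos_duration=60, ramp_time=10):
    print(f"{phase}: {start}s to {end}s")
```

Under these assumptions a 60-second fault with a 10-second ramp occupies 80 seconds in total, with CPU stress expected between the 10-second and 70-second marks.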

vCenter authentication

| Tunable | Description | Default |
| --- | --- | --- |
| GOVC_URL | vCenter server URL (without scheme), for example vcenter.example.com. | "" |
| GOVC_USERNAME | vCenter user mapped to the chaos role. | "" |
| GOVC_PASSWORD | Password for GOVC_USERNAME. | "" |
| GOVC_INSECURE | Skip SSL certificate verification when set to true. | true |

Tunables that apply to every fault are documented in common tunables for all faults.


Fault execution in brief

The fault authenticates to vCenter (GOVC_URL), opens a Guest Operations session on VM_NAME as VM_USER_NAME, launches a CPU stress workload that targets CPU_CORES at CPU_LOAD percent for TOTAL_CHAOS_DURATION seconds, and then terminates the workload.
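
To reproduce a comparable load by hand, the three stress tunables map naturally onto stress-ng options. This page does not document the exact command line the fault issues inside the guest, so the sketch below is an assumption that uses the standard stress-ng flags `--cpu`, `--cpu-load`, and `--timeout`; the helper name is hypothetical.

```python
import shlex
from typing import List


def stress_ng_argv(cpu_cores: int, cpu_load: int, duration_s: int) -> List[str]:
    """Build a stress-ng argument list equivalent to the fault's stress tunables.

    Assumed mapping (not confirmed by the fault's documentation):
      CPU_CORES -> --cpu, CPU_LOAD -> --cpu-load, TOTAL_CHAOS_DURATION -> --timeout
    """
    if cpu_cores < 1:
        raise ValueError("CPU_CORES must be at least 1")
    if not 0 <= cpu_load <= 100:
        raise ValueError("CPU_LOAD must be between 0 and 100")
    if duration_s < 1:
        raise ValueError("TOTAL_CHAOS_DURATION must be at least 1 second")
    return [
        "stress-ng",
        "--cpu", str(cpu_cores),
        "--cpu-load", str(cpu_load),
        "--timeout", f"{duration_s}s",
    ]


# Command line for the default tunables.
print(shlex.join(stress_ng_argv(2, 100, 30)))
# prints: stress-ng --cpu 2 --cpu-load 100 --timeout 30s
```

Running the printed command inside the guest as VM_USER_NAME is a quick way to confirm that the account can execute the stress binary before you start an experiment.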


Expected behavior during fault execution

  • CPU utilization on the target VM climbs to CPU_LOAD percent on CPU_CORES cores for the duration.
  • Application latency may grow in proportion to the load.
  • vCenter performance counters (cpu.usage.average) reflect the spike on the VM and may show steal time impact on co-tenant VMs.
  • After the duration ends, the stress workload exits and CPU utilization returns to baseline.

When the fault ends

The chaos pod stops the stress workload via Guest Operations. CPU utilization returns to baseline within seconds.

Signals to watch

  • VM CPU: Use a Prometheus probe on node_cpu_seconds_total from a node exporter inside the VM.
  • Application latency: Use an HTTP probe on a user-visible endpoint.
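
node_cpu_seconds_total is a cumulative counter per CPU mode, so a probe has to compare two samples rather than read a single value. In PromQL, a common form is `100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m])))`. The sketch below shows the same calculation on two raw samples. The function name and the sample numbers are illustrative only, and the samples are assumed to be already summed across cores.

```python
from typing import Dict


def busy_percent(before: Dict[str, float], after: Dict[str, float]) -> float:
    """Compute CPU busy percentage between two counter samples.

    Each dict maps a CPU mode (idle, user, system, ...) to cumulative seconds.
    Busy time is everything that is not idle.
    """
    deltas = {mode: after[mode] - before.get(mode, 0.0) for mode in after}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return 100.0 * (1.0 - deltas.get("idle", 0.0) / total)


# Illustrative samples taken before and during the chaos window.
before = {"idle": 100.0, "user": 10.0, "system": 5.0}
after = {"idle": 115.0, "user": 53.0, "system": 7.0}
print(busy_percent(before, after))  # prints 75.0
```

A probe threshold built on this value should take the VM's vCPU count into account, because stressing a subset of cores raises the average by less than CPU_LOAD.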

Verify the fault execution effect

  1. Inspect vCenter performance counters.

    In the vCenter UI, open the VM, go to Monitor → Performance, and switch to the CPU view. You should see a spike during the chaos window.

  2. SSH into the VM and run top.

    The stress process should be visible during the chaos window.
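
The second check can also be scripted so that it leaves a record. The sketch below scans /proc for processes whose command name starts with a given prefix. It assumes a Linux guest with Python 3 and assumes the stress workload is stress-ng, as the prerequisites suggest; `find_processes` is a hypothetical helper.

```python
from pathlib import Path
from typing import List, Tuple


def find_processes(name_prefix: str, proc_root: str = "/proc") -> List[Tuple[int, str]]:
    """Return (pid, command name) pairs for processes whose name starts with name_prefix."""
    matches = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            comm = (entry / "comm").read_text().strip()
        except OSError:
            continue  # the process exited or is not readable
        if comm.startswith(name_prefix):
            matches.append((int(entry.name), comm))
    return sorted(matches)


if __name__ == "__main__":
    found = find_processes("stress-ng")
    if found:
        for pid, comm in found:
            print(f"{pid}\t{comm}")
    else:
        print("no stress-ng processes found")
```

Run it during the chaos window to confirm that the workload started, and again after the fault ends to confirm that nothing was left behind.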


Recovery and cleanup

  • End of duration: The chaos pod stops the stress workload through Guest Operations.
  • Abort the experiment: Stopping the experiment from Chaos Studio also stops the workload.
  • Manual recovery: If the workload survived, SSH into the VM and run sudo pkill -f stress-ng.

Limitations

  • VMware Tools required: Without VMware Tools, vCenter cannot inject the workload.
  • Guest user privileges: If VM_USER_NAME cannot run the stress binary, the fault errors out.
  • Single VM per run: Each fault run targets one VM_NAME. Use multiple experiments to fan out.
  • ESXi co-tenant impact: Aggressive CPU_LOAD on CPU_CORES close to the VM's vCPU count can affect co-tenants; size conservatively in production.

Troubleshooting

VMware CPU hog fails with VMware Tools not running in Harness Chaos Engineering

The Guest Operations API requires VMware Tools to be installed and running on the target VM. Install or restart open-vm-tools / VMware Tools on the guest and retry.

VMware CPU hog fails with authentication failure

Verify GOVC_URL, GOVC_USERNAME, and GOVC_PASSWORD against vCenter, and VM_USER_NAME and VM_PASSWORD against the guest. For self-signed vCenter certificates, set GOVC_INSECURE=true.

No CPU spike visible after starting the fault

Confirm that the stress binary is installed inside the guest: SSH into the VM and run which stress-ng. If it is missing, reinstall it by following the VMware Linux binary installation page.