VMware process kill

VMware process kill is a VMware chaos fault that kills the processes whose PIDs are listed in PROCESS_IDS on the Linux VM named VM_NAME. The fault runs for TOTAL_CHAOS_DURATION seconds, then waits VERIFICATION_WINDOW seconds to confirm the outcome. Set FORCE=true to send SIGKILL; otherwise the fault sends SIGTERM. The fault uses VMware Tools (Guest Operations API) to act inside the guest as VM_USER_NAME.

Use this fault to test how a workload running on a VMware-hosted VM behaves when a critical process is killed: whether the supervisor (systemd, supervisord, runit) restarts it inside the SLA, whether replicas absorb the load, whether monitoring detects the regression within the alerting SLA, and whether on-call alerts fire correctly.

Run your first experiment

If you have not configured the chaos infrastructure yet, go to Quickstart to install the chaos infrastructure and run an experiment end to end.


Use cases

  • Crash resilience: When a critical PID dies, does the supervisor restart it inside the SLA?
  • Replica absorption: When one replica's process dies, do peers absorb the traffic inside the SLO budget?
  • Alert fidelity: Do downstream alerts fire inside the alerting SLA?

Prerequisites

  • Kubernetes version: 1.21 or later for the chaos infrastructure cluster.
  • vCenter reachable: The chaos infrastructure can reach GOVC_URL over port 443.
  • VMware Tools running on the guest: Verify with vmware-toolbox-cmd -v.
  • Process IDs: You know the PID(s) to kill, or your workload includes a wrapper that reports the PID(s) of supervised processes.
  • Sudo permissions: VM_USER_NAME can kill the target PID(s) (process owner or root via sudo).
  • vCenter chaos role: GOVC_USERNAME is mapped to the chaos role per VMware permissions.

Supported environments

| Platform | Support status |
| --- | --- |
| Linux VMs hosted on vSphere / vCenter (any distro with VMware Tools) | Supported |
| Linux VMs without VMware Tools | Not supported |
| Windows VMs | Not supported (use Windows process kill) |

Permissions required

On vCenter. Map GOVC_USERNAME to the chaos role described in VMware permissions. The role needs Guest Operations (Program execution, Modifications, Queries).

On the guest OS. VM_USER_NAME must own the target processes or have sudo to kill them.


Authentication

| Layer | Tunables |
| --- | --- |
| vCenter | GOVC_URL, GOVC_USERNAME, GOVC_PASSWORD, GOVC_INSECURE |
| Guest OS | VM_USER_NAME, VM_PASSWORD |

Store each credential as a text secret in Harness Secret Manager and reference the secret identifier when configuring the experiment.


Fault tunables

Required parameters

| Tunable | Description | Default |
| --- | --- | --- |
| VM_NAME | Name of the target VM as it appears in vCenter. | (required) |
| VM_USER_NAME | OS user account on the target VM. | (required) |
| VM_PASSWORD | Password for VM_USER_NAME. | (required) |
| PROCESS_IDS | Comma-separated list of PIDs to kill on the target VM. | (required) |

Chaos parameters

| Tunable | Description | Default |
| --- | --- | --- |
| FORCE | If true, send SIGKILL instead of SIGTERM. | false |
| TOTAL_CHAOS_DURATION | Total duration of the fault in seconds. | 30 |
| VERIFICATION_WINDOW | Time window in seconds after the kill during which the fault verifies the outcome. | 10 |
| RAMP_TIME | Wait period in seconds before and after the fault. Go to ramp time to read how it is applied. | 0 |

vCenter authentication

| Tunable | Description | Default |
| --- | --- | --- |
| GOVC_URL | vCenter server URL. | "" |
| GOVC_USERNAME | vCenter user mapped to the chaos role. | "" |
| GOVC_PASSWORD | Password for GOVC_USERNAME. | "" |
| GOVC_INSECURE | Skip SSL certificate verification when set to true. | true |

Tunables that apply to every fault are documented in common tunables for all faults.


Fault execution in brief

Authenticates to vCenter, opens a Guest Operations session on VM_NAME as VM_USER_NAME, sends SIGTERM (or SIGKILL when FORCE=true) to each PID in PROCESS_IDS, waits VERIFICATION_WINDOW seconds, and reports success once every targeted PID is gone.
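
The kill-then-verify sequence can be rehearsed directly on the guest before you run the fault. The following is a minimal sketch, not the fault's actual implementation: it runs locally instead of through the Guest Operations API, and the function name kill_and_verify is hypothetical. Its three arguments mirror PROCESS_IDS, FORCE, and VERIFICATION_WINDOW.

```shell
#!/bin/sh
# Local approximation of the kill-then-verify sequence (an assumption,
# not the code the fault runs).
# Usage: kill_and_verify PROCESS_IDS [FORCE] [VERIFICATION_WINDOW]
# Returns 0 when every PID is gone after the window, 1 otherwise.
kill_and_verify() {
  kv_pids=$1
  kv_force=${2:-false}
  kv_window=${3:-10}

  # SIGTERM by default, SIGKILL when FORCE=true.
  kv_signal=TERM
  if [ "$kv_force" = "true" ]; then
    kv_signal=KILL
  fi

  # Split the comma-separated list and signal each PID.
  for kv_pid in $(printf '%s' "$kv_pids" | tr ',' ' '); do
    kill -s "$kv_signal" "$kv_pid" 2>/dev/null ||
      echo "PID $kv_pid: no such process or not permitted" >&2
  done

  sleep "$kv_window"

  # kill -0 sends no signal; it only tests whether the PID can be signalled.
  # It also fails for a live process owned by another user, so run this as
  # the process owner or as root.
  kv_status=0
  for kv_pid in $(printf '%s' "$kv_pids" | tr ',' ' '); do
    if kill -0 "$kv_pid" 2>/dev/null; then
      echo "PID $kv_pid is still running" >&2
      kv_status=1
    fi
  done
  return "$kv_status"
}
```

For example, kill_and_verify "2211,2212" false 10 sends SIGTERM to both PIDs and fails if either is still alive 10 seconds later.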


Expected behavior during fault execution

  • Each PID in PROCESS_IDS receives the kill signal.
  • A supervisor (systemd, supervisord, runit) typically respawns the process inside its own restart policy.
  • Application metrics may dip while the process restarts; replicas may absorb traffic if the workload is clustered.
  • After the duration ends, the fault exits without further action; supervised processes are expected to be running normally.

When the fault ends

The fault does not restart processes. Recovery depends on the guest's process supervisor or the user's manual intervention.

Signals to watch

  • Process up: Use a command probe running pgrep -x <name> and assert the process is back inside the SLA.
  • Workload health: Use an HTTP probe on a user-visible endpoint.
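
A command probe for the "process up" signal needs a script that polls until the process returns or the SLA expires. The sketch below is one way to write it, assuming pgrep is installed on the host where the probe runs; the function name wait_for_process and the one-second polling interval are illustrative choices, not part of the fault.

```shell
#!/bin/sh
# Poll for a process by exact name until it appears or the SLA expires.
# Usage: wait_for_process NAME [SLA_SECONDS]
# Returns 0 if the process is seen within the SLA, 1 otherwise.
wait_for_process() {
  wp_name=$1
  wp_sla=${2:-30}
  wp_elapsed=0

  while [ "$wp_elapsed" -le "$wp_sla" ]; do
    # pgrep -x matches the process name exactly.
    if pgrep -x "$wp_name" >/dev/null 2>&1; then
      echo "$wp_name is running after ${wp_elapsed}s"
      return 0
    fi
    sleep 1
    wp_elapsed=$((wp_elapsed + 1))
  done

  echo "$wp_name did not return within ${wp_sla}s" >&2
  return 1
}
```

The exit status carries the result, so a probe can assert on it directly, for example wait_for_process nginx 20. Because the check polls once per second, the reported time is accurate only to about a second.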

Verify the fault execution effect

  1. SSH into the VM during the chaos window.

    ps -p <PID>

    The original PID should disappear. When the supervisor restarts the process, the same command reappears under a new PID.

  2. Inspect the supervisor log.

    journalctl -u <unit> -n 50

Recovery and cleanup

  • Supervised processes: The supervisor restarts them automatically.
  • Unsupervised processes: Restart them manually (sudo systemctl start <unit> or your own runner).
  • Abort: Stopping the experiment from Chaos Studio also stops further iterations of the fault.

Limitations

  • PID-based: The fault targets exact PIDs, not process names. If the workload's PID changes between iterations, you must look up the new PID.
  • No auto-restart: The fault does not restart killed processes; supervision is the user's responsibility.
  • VMware Tools required: Without VMware Tools, the fault cannot run.
  • Single VM per run: Each fault run targets one VM_NAME.

Troubleshooting

VMware process kill fails with no such process in Harness Chaos Engineering

The PIDs in PROCESS_IDS may have changed since you looked them up. SSH into the VM, look up the current PIDs (pgrep <name>), update PROCESS_IDS, and retry.
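
Looking up the PIDs and checking the list format can be scripted so that a stale or malformed value is caught before the experiment starts. This is a sketch under two assumptions: the guest has the procps version of pgrep (its -d option sets the output delimiter), and PROCESS_IDS must contain only digits and commas, with no spaces. Both function names are hypothetical.

```shell
#!/bin/sh
# Print the PIDs of every process named exactly NAME as a comma-separated list.
# Usage: process_ids_for NAME
process_ids_for() {
  pgrep -d, -x "$1"
}

# Succeed only if LIST looks like comma-separated decimal PIDs, for example
# "2211" or "2211,2212". Spaces, names, and empty strings are rejected.
# Usage: valid_process_ids LIST
valid_process_ids() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+(,[0-9]+)*$'
}
```

A typical use is ids=$(process_ids_for nginx) && valid_process_ids "$ids" && echo "$ids", which prints a value ready to paste into PROCESS_IDS and prints nothing when no matching process exists.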

Killed process did not restart

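The fault only kills the process; restart is up to the guest's supervisor. Check journalctl -u <unit> and confirm that the systemd unit file sets a restart policy. Restart=on-failure covers a process killed by SIGKILL (FORCE=true), but systemd counts termination by SIGTERM as a clean exit, so with the default FORCE=false the unit needs a policy such as Restart=always to come back.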
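
If the unit has no suitable restart policy, a systemd drop-in is one way to add it without editing the packaged unit file. The sketch below only prints the drop-in. It uses Restart=always because that policy restarts the service on every exit, whereas on-failure skips exits that systemd treats as clean. The function name restart_dropin and the RestartSec value of 2 seconds are illustrative assumptions; choose a delay that fits your SLA.

```shell
#!/bin/sh
# Print a systemd drop-in that restarts the service after any exit.
# RestartSec=2 is an example delay, not a recommended value.
restart_dropin() {
  cat <<'EOF'
[Service]
Restart=always
RestartSec=2
EOF
}
```

To apply it, write the output to /etc/systemd/system/<unit>.service.d/restart.conf (for example with restart_dropin | sudo tee), then run sudo systemctl daemon-reload so systemd picks up the change.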


Related faults

  • VMware service stop: Stop a service (which the supervisor manages) instead of killing a PID directly.