Run

Last updated 13 days ago

Once your experiment is fully designed, you can use the Run button to execute it. This is only possible if all of the following conditions are met:

  1. The experiment has no validation errors.

  2. Every attack resolves to at least one target at that moment.

  3. You are a member of the team that the experiment belongs to.

  4. Emergency stop has not been triggered.

Otherwise, you'll get an error message and the experiment is not started.
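
The four conditions above can be read as a simple pre-run check. The following is a minimal sketch of that logic, not Steadybit code: `ExperimentDraft`, `run_blockers`, and all field names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentDraft:
    """Hypothetical stand-in for an experiment; not a Steadybit type."""
    team: str
    validation_errors: list[str] = field(default_factory=list)
    # Number of targets each attack currently resolves to, keyed by attack name.
    resolved_targets: dict[str, int] = field(default_factory=dict)


def run_blockers(draft: ExperimentDraft, user_teams: set[str],
                 emergency_stop_active: bool) -> list[str]:
    """Return the reasons a run would be rejected; an empty list means it can start."""
    blockers = []
    if draft.validation_errors:
        blockers.append("experiment has validation errors")
    empty = [name for name, count in draft.resolved_targets.items() if count < 1]
    if empty:
        blockers.append("attacks without targets: " + ", ".join(sorted(empty)))
    if draft.team not in user_teams:
        blockers.append("user is not a member of the experiment's team")
    if emergency_stop_active:
        blockers.append("emergency stop has been triggered")
    return blockers
```

Returning every blocker at once, rather than stopping at the first, mirrors the idea that all conditions must hold at the same time.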

As soon as the experiment starts, the platform automatically switches to the run view. The platform first establishes a connection to the matching agents. In addition, the run icon at the top right indicates that an experiment is running.

The run view consists of the following elements.

  • Run Log: The run log provides more details on the experiment's attacks and actions. For instance, you can see exactly which containers are affected by an attack or the exact reason an experiment failed.

  • Deployment Replica Count: When you run an experiment in a Kubernetes context, we automatically monitor how many pods are ready in your cluster and indicate whenever there is a discrepancy.

  • Kubernetes Event Log: When you run an experiment in a Kubernetes context, we give you access to the Kubernetes events so that you can identify exactly what happens in the cluster.

  • HTTP Call: If your experiment contains an HTTP Call action, you can see the response time and the HTTP response status in a separate widget in the run window.

Every experiment run has a unique identifier (e.g. #33131), which you can use to identify older experiment runs (visible on the left side).

Experiment runs can have the following states:

| State | Description |
| --- | --- |
| REQUESTED | The experiment was requested by a user, an API call, or a schedule. |
| CREATED | The experiment was created and all targets were resolved. |
| PREPARED | The experiment was prepared and all agents are ready to execute the needed actions. |
| RUNNING | The experiment is currently running. |
| COMPLETED | The entire experiment (all attacks, actions, and checks) was executed successfully, with no failure reported by any check. |
| CANCELED | The experiment was canceled by user interaction and all attacks were rolled back. |
| FAILED | The run failed because one or more checks failed, for example an HTTP check that did not reach the required success rate. |
| ERRORED | The run failed for technical reasons, such as a failed attack execution or an agent that disconnected unexpectedly. This shouldn't happen frequently; if it does, let us know. We are constantly improving the platform to reduce failure states. |
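
When a script consumes these states, for example in a pipeline, it helps to separate a run that is still in progress from one that has finished, and a failed check (FAILED) from a technical problem (ERRORED). The sketch below models only the state names listed above. Treating the first four states as "in progress" is an assumption made for illustration, and the exit-code mapping is a hypothetical convention, not something the platform defines.

```python
from enum import Enum


class RunState(Enum):
    REQUESTED = "REQUESTED"
    CREATED = "CREATED"
    PREPARED = "PREPARED"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    CANCELED = "CANCELED"
    FAILED = "FAILED"
    ERRORED = "ERRORED"


# Assumption for this sketch: these states mean the run has not finished yet.
IN_PROGRESS = {RunState.REQUESTED, RunState.CREATED,
               RunState.PREPARED, RunState.RUNNING}


def is_finished(state: RunState) -> bool:
    """True once the run has reached one of the four end states."""
    return state not in IN_PROGRESS


def exit_code(state: RunState) -> int:
    """Map a finished run to an exit code (hypothetical convention)."""
    if not is_finished(state):
        raise ValueError(f"run is still in progress: {state.value}")
    if state is RunState.COMPLETED:
        return 0  # all checks passed
    if state is RunState.FAILED:
        return 1  # a check failed: a finding about the system under test
    return 2      # CANCELED or ERRORED: no verdict about the system
```

Keeping FAILED and ERRORED apart lets a caller distinguish a reliability finding from a run that needs to be repeated.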

If an agent loses its connection to the platform during an experiment, it immediately stops and rolls back any running attacks. Some attacks (like Stop Container) can't be rolled back due to their nature.

Experiment Player: At the top you see the sequence defined previously in the design. While the experiment is running, a special marker indicates the current point in time. Some attacks need a little extra time before they start, which is indicated by light green colouring at the front of the attack. This extra time is added to the timing of the attack and is currently required for technical reasons.

Monitoring Events: If your admin has installed a monitoring extension for Steadybit (see monitoring extensions in Reliability Hub), you can see events and alerts occurring in your setup directly in the run view.
