Maintenance & Incident Support

Last updated 9 months ago

This part of the documentation is intended only for a supported PoC (Proof of Concept) run together with the Steadybit team. Please book an appointment to scope your PoC before continuing to evaluate the on-prem solution.

If you just want to try out Steadybit, we recommend you sign up for our SaaS platform.

While operating the on-premise Steadybit platform, you may need to inform users about incidents and planned maintenance that affect them. To do so, you can tell the Steadybit platform about incidents and maintenance windows. Once done, the platform can

  • show banners in the user interface indicating the specific situation,

  • require explicit confirmation when executing experiments via the user interface and

  • optionally disable experiment runs entirely during an ongoing incident.

This feature is available in Steadybit platform versions later than 1.0.2.

Learning about Platform Activity

You can use the following metric to learn whether maintenance on the Steadybit platform would affect any users:

  • platform.experiments.executing is exposed by the Steadybit platform as a Prometheus metric. It represents the number of currently executing experiments.
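As a minimal sketch of how this metric could be checked before starting maintenance, the helper below reads Prometheus text-format output and prints the number of executing experiments. It assumes the metric shows up under the Prometheus-normalized name platform_experiments_executing (dots replaced by underscores); the function name executing_experiments and the METRICS_URL variable in the usage note are hypothetical and not part of the Steadybit platform.

```shell
#!/bin/sh
# Hypothetical helper: reads Prometheus text-format metrics on stdin and
# prints the summed value of the executing-experiments metric.
# Assumption: the metric is exported as "platform_experiments_executing".
executing_experiments() {
  awk '
    # Match the metric name followed by a label set or a space, so that
    # similarly prefixed metrics and "# HELP"/"# TYPE" lines are ignored.
    /^platform_experiments_executing[ {]/ {
      sub(/\{[^}]*\}/, "")   # drop the label set, if any
      total += $2            # the sample value is now the second field
      found = 1
    }
    END {
      if (!found) exit 1     # metric missing: signal failure to the caller
      printf "%d\n", total
    }
  '
}
```

You could feed it from wherever your Prometheus scrape target for the platform lives, for example `curl -s "$METRICS_URL" | executing_experiments`, and only proceed with maintenance when it prints 0.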

Admin API Endpoints

The maintenance and incident API endpoints are only reachable through the admin port 9090 of the platform's Kubernetes deployment. This port is not exposed to users by default, and you should only use it for administrative purposes. Keep in mind that API endpoints on port 9090 do not require authentication.

One option to reach this port from your local machine is a port forward, as the following snippet shows:

kubectl port-forward -n steadybit-platform deployment/steadybit-platform 9090

Maintenance

To configure a maintenance window, you can use the API endpoints under the path /actuator/systemstatusmaintenance. You can retrieve the maintenance configuration via HTTP GET or clear it via HTTP DELETE. The following example shows how you could use HTTP POST to configure a maintenance window.

curl -X POST \
  -H "Content-Type: application/json" \
  http://localhost:9090/actuator/systemstatusmaintenance -d '
{
  "title": "Planned maintenance",
  "message": "We need to perform maintenance on the platform",
  "statusPage": "https://status.example.com",
  "start": "2024-06-07T00:00:00Z",
  "end": "2024-06-08T00:00:00Z"
}
'
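The GET and DELETE calls mentioned above could be wrapped as follows. This is a sketch that assumes the port forward to port 9090 is active; the function names and the ADMIN_URL and CURL variables are hypothetical conveniences, not part of the platform.

```shell
#!/bin/sh
# Base URL of the admin port; assumes a local port forward to port 9090.
ADMIN_URL="${ADMIN_URL:-http://localhost:9090}"
# Hypothetical override hook: set CURL=echo to print the command instead
# of sending a request.
CURL="${CURL:-curl}"

# Retrieve the currently configured maintenance window.
maintenance_get() {
  $CURL -X GET "$ADMIN_URL/actuator/systemstatusmaintenance"
}

# Clear the maintenance window once maintenance is finished.
maintenance_clear() {
  $CURL -X DELETE "$ADMIN_URL/actuator/systemstatusmaintenance"
}
```

Calling maintenance_get before and after maintenance_clear is one way to confirm that the window has actually been removed.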

Incidents

You can use incidents to inform users of a service disruption. In contrast to scheduled maintenance, an incident is unplanned and carries a severity. To configure an incident, you can use the API endpoints under the path /actuator/systemstatusincident. You can retrieve the incident configuration via HTTP GET or clear it via HTTP DELETE. The following example shows how you could use HTTP POST to configure an incident.

curl -X POST \
  -H "Content-Type: application/json" \
  http://localhost:9090/actuator/systemstatusincident -d '
{
  "title": "Platform Overloaded",
  "message": "We are experiencing high load on the platform",
  "statusPage": "https://status.example.com",
  "allowExperimentExecution": true,
  "severity": "DEGRADED_PERFORMANCE"
}
'

The following severities are supported by the API. The user interface will adapt its banner style according to the chosen incident severity.

  • UNDER_MAINTENANCE

  • DEGRADED_PERFORMANCE

  • PARTIAL_OUTAGE

  • MAJOR_OUTAGE
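To avoid posting a misspelled severity, the payload could be built by a small helper that checks the value against the list above first. This is a sketch: the function name incident_payload and its argument order are hypothetical, and it always sends statusPage because this page does not state whether the field is optional.

```shell
#!/bin/sh
# Hypothetical helper: prints an incident payload as JSON.
# Usage: incident_payload SEVERITY TITLE MESSAGE STATUS_PAGE [ALLOW_EXECUTION]
incident_payload() {
  severity=$1 title=$2 message=$3 status_page=$4 allow=${5:-true}

  # Accept only the severities listed in this documentation.
  case "$severity" in
    UNDER_MAINTENANCE|DEGRADED_PERFORMANCE|PARTIAL_OUTAGE|MAJOR_OUTAGE) ;;
    *) echo "unsupported severity: $severity" >&2; return 1 ;;
  esac

  # allowExperimentExecution is a JSON boolean, so reject anything else.
  case "$allow" in
    true|false) ;;
    *) echo "allow must be true or false: $allow" >&2; return 1 ;;
  esac

  # Values are inserted verbatim, so they must not contain double quotes
  # or backslashes.
  printf '{"title":"%s","message":"%s","statusPage":"%s","allowExperimentExecution":%s,"severity":"%s"}\n' \
    "$title" "$message" "$status_page" "$allow" "$severity"
}
```

The output can be piped straight into the POST request, for example `incident_payload MAJOR_OUTAGE "Platform down" "We are investigating" https://status.example.com false | curl -X POST -H "Content-Type: application/json" http://localhost:9090/actuator/systemstatusincident -d @-`. Setting allowExperimentExecution to false is presumably how experiment runs are disabled during an incident; this page does not spell that out.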

System banner appearing at the top of the Steadybit UI presenting the information provided through the maintenance API
System banner appearing at the top of the Steadybit UI presenting the information provided through the incident API