
Database Runbooks

This page applies to on-premise installations of the Steadybit platform. It assumes you are running a PostgreSQL 15+ database that the platform connects to. See Database Configuration for connection settings.

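The Steadybit platform stores its state in PostgreSQL. This page covers two operational scenarios:

  • Blocking transactions that hold locks and exhaust the connection pool.

  • Running out of disk space on the database volume.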

In both cases, monitoring should give you advance warning — see Monitoring the Platform for the alerts we recommend.

Blocking Transactions

Symptom: the SteadybitHikariPoolCritical alert fires, target updates from agents are not being processed, or the UI feels stuck on operations that read or write the target table.

Cause: PostgreSQL transactions that started but were never committed or rolled back continue to hold row-level locks. New connections from the platform queue up waiting for those locks and the HikariCP connection pool fills up.

Step 1 — List transactions that hold locks

SELECT a.datname,
       l.relation::regclass,
       l.transactionid,
       l.mode,
       l.granted,
       a.usename,
       a.query,
       a.query_start,
       age(now(), a.query_start) AS age,
       a.pid
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid
ORDER BY a.query_start;
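
The lock list can be long, so it may help to see directly which backend blocks which. The following is a minimal sketch using PostgreSQL's built-in pg_blocking_pids function; the column aliases are illustrative choices, not part of the platform.

```sql
-- For every backend that is waiting on a lock, show the backend that blocks it.
-- pg_blocking_pids() returns an empty array for backends that are not blocked,
-- so those backends produce no rows.
SELECT waiting.pid         AS waiting_pid,
       waiting.query       AS waiting_query,
       blocking.pid        AS blocking_pid,
       blocking.state      AS blocking_state,
       blocking.xact_start AS blocking_xact_start,
       blocking.query      AS blocking_query
FROM pg_stat_activity AS waiting
JOIN LATERAL unnest(pg_blocking_pids(waiting.pid)) AS b(pid) ON true
JOIN pg_stat_activity AS blocking ON blocking.pid = b.pid
ORDER BY blocking.xact_start;
```

A blocking backend in state idle in transaction with an old xact_start is the typical candidate for termination.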

Step 2 — Narrow down to transactions older than 1 hour

Most legitimate platform queries finish in under a second. Anything older than an hour is almost certainly stuck.
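
One possible way to apply that filter is sketched below. It assumes that the transaction start time (xact_start) is the right measure of age; the query in Step 1 uses query_start instead, which only reflects the most recent statement.

```sql
-- Sessions whose current transaction has been open for more than one hour.
-- The session running this query is excluded.
SELECT a.pid,
       a.usename,
       a.state,
       a.xact_start,
       age(now(), a.xact_start) AS xact_age,
       a.query
FROM pg_stat_activity a
WHERE a.xact_start < now() - interval '1 hour'
  AND a.pid <> pg_backend_pid()
ORDER BY a.xact_start;
```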

Step 3 — Terminate the offending backend

Replace <pid> with the process ID returned by the previous query:
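
A minimal form of the call, with 12345 standing in as a placeholder PID:

```sql
-- 12345 is a placeholder; substitute the pid of the stuck backend.
-- Requires superuser rights, membership in pg_signal_backend,
-- or the same role that owns the target session.
SELECT pg_terminate_backend(12345);
```

The function returns true when the termination signal was sent successfully.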

pg_terminate_backend rolls back the transaction and closes the connection. The platform reconnects automatically; no restart is required.

Step 4 — Verify recovery

  • Re-run the query from Step 1 — the count of held locks should drop.

  • Check the Datasource connections panel; hikaricp_connections_active should fall back to its baseline.

  • Confirm that the message queue lead time (queue_lead_time) starts decreasing.
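
The database-side part of these checks can also be reduced to two counts. This is a sketch rather than an official health check; both numbers should trend towards zero once the blocking backend is gone.

```sql
-- Lock requests that are still waiting to be granted.
SELECT count(*) AS waiting_locks
FROM pg_locks
WHERE NOT granted;

-- Sessions sitting in an open transaction without running a statement.
SELECT count(*) AS idle_in_transaction
FROM pg_stat_activity
WHERE state = 'idle in transaction';
```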

Running Out of Disk Space

Symptom: Postgres logs report could not extend file or No space left on device, or your storage monitoring is approaching the volume capacity.

Cause: historical execution data, target snapshots, or audit log entries have accumulated beyond what the configured retention removes. Bloat in heavily updated tables (notably target) can also consume far more space than the live row count would suggest.

Step 1 — Identify the largest tables
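
A commonly used catalog query for this purpose is sketched below. It relies only on standard PostgreSQL size functions; the limit of 10 rows is an arbitrary choice.

```sql
-- Ten largest user tables. total_size includes indexes and TOAST data.
SELECT c.relname AS table_name,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size,
       pg_size_pretty(pg_relation_size(c.oid))       AS table_size,
       pg_size_pretty(pg_indexes_size(c.oid))        AS index_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;
```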

The most common offenders are:

| Table | Why it grows |
| --- | --- |
| target | Snapshot of every target seen by every agent. Heavy update churn → bloat. |
| target_stats, target_submission_tracking | Per-submission counters that the platform vacuums periodically. |
| audit_log | One row per administrative action. |
| experiment_execution, execution_log_event, execution_metric_event | Grow with the number of experiment runs. |

Step 2 — Reclaim space with VACUUM

The platform runs scheduled VACUUM/ANALYZE (see STEADYBIT_DB_MAINTENANCE_* in Configuration Options). When you need to reclaim space immediately, run a full vacuum during a low-traffic window. VACUUM FULL takes an exclusive lock on the affected table:
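
A sketch of the command for the target table is shown here; substitute whichever table the size check identified as the largest.

```sql
-- Rewrites the table into a new file and returns the freed space to the
-- operating system. Reads and writes on the table are blocked until it
-- finishes, and the rewrite temporarily needs free disk space for the
-- compacted copy of the table.
VACUUM (FULL, VERBOSE, ANALYZE) target;
```

Because the rewrite needs temporary space, VACUUM FULL can fail on a volume that is already completely full. In that case, free or add space first.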

For an alternative that avoids the long exclusive lock, install and use the pg_repack extension. It rewrites the table online and holds exclusive locks only briefly.

Step 3 — Delete obsolete data (if needed)

Only delete data after confirming that the platform's built-in retention is not enough for your situation, and after taking a backup. The data the platform can best tolerate losing is old target snapshots:
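
A hedged sketch of such a cleanup follows. The column name last_seen and the 30-day cutoff are hypothetical, since the schema of the target table is not documented on this page; check the real columns before running anything. The transaction ends in ROLLBACK so that nothing is deleted until you deliberately change it to COMMIT.

```sql
-- ASSUMPTION: "last_seen" is a hypothetical timestamp column and 30 days is
-- an example cutoff. Inspect the real schema first (\d target in psql).
BEGIN;

-- Preview how many rows the delete would remove.
SELECT count(*) AS rows_to_delete
FROM target
WHERE last_seen < now() - interval '30 days';

DELETE FROM target
WHERE last_seen < now() - interval '30 days';

-- Replace ROLLBACK with COMMIT once the reported row count looks right.
ROLLBACK;
```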

After a large delete, run VACUUM FULL on the affected table again to release the space.

Step 4 — Increase the volume size

If the database is still close to full after vacuuming, the safest fix is to increase the underlying volume. On AWS RDS this is a non-disruptive operation; on a self-hosted Postgres, follow your storage provider's resize procedure.
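
To judge how close the database is to the volume capacity, its current on-disk size can be read with a standard function. This is a minimal sketch; the volume also holds WAL and other files that this number does not include.

```sql
-- On-disk size of the database this session is connected to.
SELECT pg_size_pretty(pg_database_size(current_database())) AS database_size;
```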

Plan ahead by reviewing your Machine Requirements — at ~100 k targets we recommend at least 20 GB of database storage.

Preventive Maintenance

The platform performs automatic maintenance on a configurable schedule. The defaults are sensible for most installations; review them if you operate at scale or have observed bloat.

| Variable | Default | Description |
| --- | --- | --- |
| STEADYBIT_DB_MAINTENANCE_ENABLED | true | Enable automatic VACUUM/ANALYZE. |
| STEADYBIT_DB_MAINTENANCE_CRON | 0 0 0 ? * SAT * | Saturday at midnight. |
| STEADYBIT_DB_MAINTENANCE_TABLES | | Tables included in the maintenance window. |

Combined with the alerts described in Monitoring the Platform, this is usually enough to keep the database healthy without manual intervention.
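
To check whether the schedule keeps up with update churn, the statistics views show dead rows and the last maintenance run per table. The query below is a sketch based on the standard pg_stat_user_tables view, not on a platform-specific report.

```sql
-- Tables with the most dead rows, plus the time of the last vacuum and
-- analyze. A high n_dead_tup combined with an old last_vacuum suggests bloat.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum,
       last_analyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```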
