Monitoring the Platform
This page applies to on-premise installations of the Steadybit platform. If you are using our SaaS at platform.steadybit.com, monitoring is fully managed by Steadybit.
When you operate the Steadybit platform yourself, observing a small set of metrics is enough to spot the vast majority of issues before they impact your users. This page describes:
the metrics endpoint exposed by the platform,
the four Golden Signals to monitor (latency, traffic, errors, saturation),
a downloadable Grafana dashboard — the same one we use internally,
a ready-to-use set of recommended Prometheus alert rules with the thresholds we run in production.
Metrics Endpoint
The platform exposes Prometheus-compatible metrics on the management port 9090:
http://<platform-host>:9090/actuator/prometheus
Port 9090 is administrative and is not exposed to end users. Scrape it from your Prometheus instance the same way you would any Spring Boot Actuator endpoint. All metric names referenced on this page come from this endpoint.
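To check that the endpoint is reachable before wiring it into Prometheus, a small script can fetch it and parse the text format. The sketch below is only an illustration: it assumes plain HTTP without authentication and samples without trailing timestamps, and `parse_samples` and `scrape` are hypothetical helper names, not part of the platform.

```python
from urllib.request import urlopen


def parse_samples(text):
    """Parse Prometheus text exposition lines into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: no timestamps, so the value is the last field.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def scrape(host, port=9090, timeout=5):
    """Fetch and parse the metrics endpoint (assumes HTTP, no auth)."""
    url = f"http://{host}:{port}/actuator/prometheus"
    with urlopen(url, timeout=timeout) as response:
        return parse_samples(response.read().decode("utf-8"))
```

Calling `scrape("<platform-host>")` with your own host name returns a dictionary keyed by the full series name, including its labels.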
The Four Golden Signals
1. Latency
How long requests take to complete. Two views matter:
| View | Metric | What it tells you |
| --- | --- | --- |
| HTTP request latency | http_server_requests_seconds_sum / http_server_requests_seconds_count | End-user perceived UI/API responsiveness. |
| Message queue lead time | queue_lead_time | How far behind the platform is in processing target updates from agents. Spikes here are the earliest symptom of an overloaded platform. |
Recommended thresholds:
| Metric | Warning | Critical |
| --- | --- | --- |
| Mean GET response time (2 min window) | > 200 ms | > 500 ms |
| Mean POST response time (2 min window) | > 400 ms | > 600 ms |
| Mean POST response time, UI endpoints (/ui/experiments/validate, /ui/targets/count) | > 5 s | > 10 s |
| Message queue accumulated lead time | > 10 min | > 30 min |
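The mean response time is the increase of the `_sum` counter divided by the increase of the `_count` counter over the window. The following Python sketch shows that arithmetic for two scrapes taken two minutes apart and compares the result with the GET thresholds listed above. The function names and the labels "warning" and "critical" for the lower and higher threshold are assumptions made for illustration.

```python
def mean_latency_seconds(sum_start, count_start, sum_end, count_end):
    """Mean request duration between two scrapes of the _sum/_count pair."""
    requests = count_end - count_start
    if requests <= 0:
        # No traffic in the window, or the counters were reset.
        return None
    return (sum_end - sum_start) / requests


def classify(value, lower, higher):
    """Map a value to a level; lower/higher are the two thresholds."""
    if value is None:
        return "no data"
    if value > higher:
        return "critical"
    if value > lower:
        return "warning"
    return "ok"


# 100 GET requests took 30 s in total during the window: 0.3 s mean.
mean_get = mean_latency_seconds(10.0, 100, 40.0, 200)
level = classify(mean_get, lower=0.2, higher=0.5)
```

With these sample numbers the mean is 300 ms, which is above the 200 ms threshold but below 500 ms.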
2. Traffic
Request volume hitting the platform. Sudden spikes often correlate with too many agents reconnecting or a misbehaving integration.
| Metric | Warning | Critical |
| --- | --- | --- |
| http_server_requests_seconds_count per second | > 5 req/s | > 10 req/s |
These thresholds are conservative; tune them to match your installed scale (number of agents, tenants, targets).
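Because `http_server_requests_seconds_count` is a counter, the request rate is its increase divided by the elapsed time. The sketch below illustrates this, including a simplified treatment of counter resets after a platform restart. `per_second_rate` is a hypothetical helper, and Prometheus's own `rate()` function applies additional extrapolation that is not reproduced here.

```python
def per_second_rate(count_start, count_end, window_seconds):
    """Approximate requests per second between two counter readings."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    delta = count_end - count_start
    if delta < 0:
        # Counter reset: count only what accrued since the restart.
        delta = count_end
    return delta / window_seconds


# 900 requests in a 2 minute window: 7.5 req/s.
rate = per_second_rate(1000, 1900, 120)
```

A value of 7.5 req/s would sit between the two thresholds given above.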
3. Errors
The percentage of requests returning HTTP 5xx.
| Metric | Threshold |
| --- | --- |
| http_server_requests_seconds_count{status=~"5.."} / total request rate | > 5 % over 5 minutes |
A sustained error rate above 5 % almost always points to a database issue, a broken upstream integration, or a recently failed deployment.
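To make the ratio concrete, the following Python sketch computes the 5xx percentage from per-status request counts observed during a window. The input shape (a dictionary from status code to request count) and the name `error_percentage` are assumptions for illustration. The regular expression is matched against the whole status string, which mirrors how PromQL anchors `=~` matchers.

```python
import re

# Same pattern as the status=~"5.." matcher, applied to the full string.
FIVE_XX = re.compile(r"5..")


def error_percentage(requests_by_status):
    """Share of requests with a 5xx status, in percent."""
    total = sum(requests_by_status.values())
    if total == 0:
        return 0.0
    errors = sum(
        count
        for status, count in requests_by_status.items()
        if FIVE_XX.fullmatch(status)
    )
    return 100.0 * errors / total


# 60 of 1000 requests failed with a 5xx status: 6 %.
pct = error_percentage({"200": 900, "404": 40, "500": 50, "503": 10})
```

In this example the result is 6 %, which exceeds the 5 % threshold; 4xx responses do not count toward it.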
4. Saturation
How "full" the platform is. The two limits to watch:
| Resource | Metric | Warning | Critical |
| --- | --- | --- | --- |
| JVM heap memory | jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} | > 80 % for 5 min | none |
| Database connection pool | hikaricp_connections_active / hikaricp_connections_max | > 60 % for 5 min | > 80 % for 5 min |
A connection pool above 80 % is a strong indicator of either long-running transactions on the database or an undersized pool. See the Database Runbooks page for cleanup procedures.
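The "for 5 min" qualifier means a single spike should not trigger anything; the ratio has to stay above the threshold for the whole period. The sketch below illustrates that idea on a list of timestamped samples. It is a simplified stand-in for the `for` clause of a Prometheus alert rule, and `sustained_above` is a hypothetical name.

```python
def sustained_above(samples, threshold, duration_seconds):
    """True if the latest run of samples above threshold spans the duration.

    samples: list of (timestamp_seconds, value), sorted by timestamp.
    """
    if not samples:
        return False
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
        else:
            # Any sample at or below the threshold restarts the clock.
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= duration_seconds


# Pool utilisation (active / max) sampled once per minute.
pool = [(0, 0.50), (60, 0.85), (120, 0.90), (180, 0.82),
        (240, 0.88), (300, 0.81), (360, 0.84)]
firing = sustained_above(pool, threshold=0.8, duration_seconds=300)
```

Here the pool has been above 80 % from second 60 through second 360, so the condition holds. A single sample at or below the threshold in between would reset it.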
Grafana Dashboard
We publish the same Grafana dashboard we use to operate Steadybit SaaS. It is grouped into four sections — Message Queues, Target Ingestion, Platform Chaos Engineering Activity, and Platform Resource Consumption — each backed by the metrics described above.
Download: steadybit-platform-dashboard.json
To install, in Grafana go to Dashboards → New → Import, then either upload the file or paste its contents. Select your Prometheus data source when prompted.
Recommended Prometheus Alert Rules
The following PrometheusRule is the exact configuration we run in production. Adjust the thresholds to your scale, but keep the structure — each rule maps to a Golden Signal above.
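Independently of the production rule set, it can help to see how the thresholds on this page translate into PromQL alert expressions when you adapt them. The Python sketch below only builds expression strings. It is not the configuration Steadybit runs: the helper names are hypothetical, the use of the `method` label to separate GET from POST is an assumption, and so is reading "over 5 minutes" as the rate window.

```python
def mean_latency_expr(method, window, threshold_seconds):
    """Mean response time for one HTTP method compared with a threshold."""
    selector = f'{{method="{method}"}}'
    return (
        f"sum(rate(http_server_requests_seconds_sum{selector}[{window}]))"
        f" / sum(rate(http_server_requests_seconds_count{selector}[{window}]))"
        f" > {threshold_seconds}"
    )


def error_ratio_expr(window, threshold_ratio):
    """Share of 5xx responses among all requests (0.05 means 5 %)."""
    return (
        f'sum(rate(http_server_requests_seconds_count{{status=~"5.."}}[{window}]))'
        f" / sum(rate(http_server_requests_seconds_count[{window}]))"
        f" > {threshold_ratio}"
    )


def pool_saturation_expr(threshold_ratio):
    """Active HikariCP connections relative to the pool maximum."""
    return (
        "hikaricp_connections_active / hikaricp_connections_max"
        f" > {threshold_ratio}"
    )


expressions = {
    "get_latency": mean_latency_expr("GET", "2m", 0.2),
    "error_ratio": error_ratio_expr("5m", 0.05),
    "pool_saturation": pool_saturation_expr(0.8),
}
```

Each string could serve as the `expr` of an alert rule, with the "for 5 min" qualifier expressed through the rule's `for` field.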
When an Alert Fires
| Alert | What to check first |
| --- | --- |
| Queue lead time | Database CPU and HikariCP saturation. The post-processing pipeline writes to Postgres on every step. |
| GET / POST latency | Database, then JVM heap. |
| 5xx error rate | Platform logs (Platform Log Events panel) for stack traces. |
| JVM heap > 80 % | Capture a heap dump as described in Troubleshooting › On-prem platform. |
| HikariCP pool > 80 % | Long-running or blocked transactions. Follow the Database Runbooks. |
| Blocked outgoing requests | A user webhook or hub points to a host that is blocked by your egress policy. |
Related Pages
Database Runbooks — recover from blocking transactions or disk pressure.
Configuration Options — JVM and database tuning parameters.
Maintenance & Incident Support — communicate planned maintenance and incidents to your users.
Troubleshooting › On-prem platform — common fixes for installation issues.
