Agents
We get certificate errors when the agent connects to our on-premise Steadybit installation. What do we need to do?
The most common reasons for connectivity issues are related to:
- Self-signed certificates that are not trusted.
- Root certificate authorities that are not trusted.
- Incomplete certificate chains.
At runtime, this will typically manifest in the agent through log entries like this:
To analyze the situation from your end, we recommend starting by ensuring that TLS is correctly configured. You can do so via the `openssl` command-line tool.
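For example, a check against your platform's TLS endpoint could look like this (hostname and port are placeholders for your on-premise installation):

```sh
# Placeholder host and port; point this at the endpoint your agent connects to.
openssl s_client -connect steadybit.internal.example.com:443 \
  -servername steadybit.internal.example.com </dev/null
```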
This should print a whole lot of information. Most relevant are the `Verify return code:` sections, which should always report `Verify return code: 0 (ok)`. For other return codes, please check out the following sub-sections:
Verify return code: 19 (self-signed certificate in certificate chain)
This happens for self-signed certificates. You can typically resolve these problems through the `STEADYBIT_AGENT_EXTRA_CERTS_PATH` environment variable. This environment variable should point to a directory containing your root certificates. The agent will load all certificates in this directory into the Java key store upon agent startup. This, in turn, makes the agent trust your custom certificates.
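As an illustration, for a Docker-based agent installation this could look roughly as follows (image name, host path, and in-container path are assumptions, and all other required agent settings are omitted):

```sh
# Sketch only: mount a directory with your root certificates into the container and
# point STEADYBIT_AGENT_EXTRA_CERTS_PATH at the mount path (paths are placeholders).
docker run -d \
  -v /opt/steadybit/extra-certs:/var/lib/steadybit-agent/extra-certs \
  -e STEADYBIT_AGENT_EXTRA_CERTS_PATH=/var/lib/steadybit-agent/extra-certs \
  steadybit/agent
```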
You can also check the environment variable's effect by adding the `-CAfile root-certificate.pem` command-line argument to the `openssl` command shown in the previous section.
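For instance (hostname again a placeholder):

```sh
# Verifies the connection against your own root certificate instead of the system trust store.
openssl s_client -connect steadybit.internal.example.com:443 \
  -servername steadybit.internal.example.com \
  -CAfile root-certificate.pem </dev/null
```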
Verify return code: 21 (unable to verify the first certificate)
The agent requires a complete certificate chain configuration for security reasons and this return code indicates that your server responded with an incomplete certificate chain. You can fix this issue by modifying the server that terminates the TLS connection. Please refer to your server/proxy/CDN documentation to learn how to configure a complete certificate chain.
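As an illustration, a server terminating TLS would need to serve the leaf certificate together with all intermediate certificates as one chain (file names are assumptions):

```sh
# Concatenate the leaf certificate and all intermediates into one chain file,
# then point the server's certificate setting at it (file names are placeholders).
cat server.crt intermediate.crt > fullchain.pem
```

With NGINX, for example, the `ssl_certificate` directive would then reference `fullchain.pem` instead of the leaf certificate alone.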
Installing the agent on an AWS EKS cluster
1. First, make sure to configure the Amazon-EBS-CSI-Driver.
2. Afterwards, add the Amazon-EBS-CSI-Driver add-on to your EKS cluster with the newly created IAM role.
3. Then add your first node group to the cluster (see the sketch below).
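A rough sketch of these steps using `eksctl` (cluster name, role name, AWS account ID, and node group settings are placeholders; consult the AWS documentation for the authoritative procedure):

```sh
# 1. Create an IAM role for the EBS CSI driver's controller service account (names are placeholders).
eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace kube-system \
  --name ebs-csi-controller-sa \
  --role-name AmazonEKS_EBS_CSI_DriverRole \
  --role-only \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve

# 2. Add the aws-ebs-csi-driver add-on to the cluster, referencing the newly created role.
eksctl create addon \
  --cluster my-cluster \
  --name aws-ebs-csi-driver \
  --service-account-role-arn arn:aws:iam::111122223333:role/AmazonEKS_EBS_CSI_DriverRole

# 3. Add the first node group to the cluster.
eksctl create nodegroup --cluster my-cluster --name ng-1 --nodes 2
```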
Occasional connection timeouts on the agent's discovery calls to extensions, which cause targets to be removed from discovery
We are using Resilience4j for the retry mechanism. The default configuration is to retry 3 times with a wait duration of 30s and an exponential backoff multiplier of 2. This means that the first retry happens after 30s, the second after 60s, and the third after 120s. If all retries fail, the agent removes the target from the discovery. You can configure the retry mechanism by setting the following environment variables:
| Environment Variable | Description |
| --- | --- |
|  | Optional - Resilience4j: The maximum number of attempts (including the initial call as the first attempt) for DiscoveryKit resources |
|  | Optional - Resilience4j: A fixed wait duration between retry attempts for DiscoveryKit resources |
|  | Optional - Resilience4j: Enable or disable exponential backoff for DiscoveryKit resources |
|  | Optional - Resilience4j: The multiplier for exponential backoff for DiscoveryKit resources |
|  | Optional - Resilience4j: Enable or disable the retry mechanism. Default is true (enabled) |
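A minimal sketch of how these could be set on the agent's container; the variable names below are placeholders only and must be replaced with the actual names from the table above:

```yaml
# Placeholders only - substitute the real Resilience4j variable names from the table above.
env:
  - name: <DISCOVERYKIT_RETRY_MAX_ATTEMPTS>   # placeholder name
    value: "5"
  - name: <DISCOVERYKIT_RETRY_WAIT_DURATION>  # placeholder name
    value: "10s"
```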
Agent takes a long time to register the extensions via auto-discovery and to submit the first targets
In a very large cluster it might take a while to read all pods in your cluster and scan them for extensions. You can limit the extension auto-discovery to a single namespace using the environment variable `STEADYBIT_AGENT_EXTENSIONS_AUTODISCOVERY_NAMESPACE` (Helm value `agent.extensions.autodiscovery.namespace`).
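For example, with a Helm-based installation (release name, chart reference, and namespace are assumptions; adjust them to your setup):

```sh
# Restricts extension auto-discovery to the "steadybit-agent" namespace (example value).
helm upgrade steadybit-agent steadybit/steadybit-agent \
  --namespace steadybit-agent \
  --reuse-values \
  --set agent.extensions.autodiscovery.namespace=steadybit-agent
```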
Install Agent and extension-kubernetes in a managed Kubernetes cluster where you are only allowed to deploy to one namespace
Install the agent/extension with the following Helm settings to use Roles instead of ClusterRoles:
Full example:
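A very rough sketch of a namespace-scoped install (release name, chart reference, namespace, and especially the value that toggles ClusterRoles off are placeholders; take the exact value names from the chart's values.yaml):

```sh
# Rough sketch: install into the single namespace you are allowed to deploy to.
# <cluster-role-toggle> is a placeholder for the chart value that switches to namespaced Roles.
helm install steadybit-agent steadybit/steadybit-agent \
  --namespace my-namespace \
  --set <cluster-role-toggle>=false
```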