Containerising the DataDog Agent for HTTP health-checks using DigitalOcean App Platform and Terraform
We have been big fans of DataDog and the level of telemetry/monitoring it provides us for many years now. One aspect of monitoring that we employ throughout the services we maintain is HTTP health checks, which are intentionally run on a cloud provider separate from our primary one, AWS. DataDog has long provided the ability to run these checks via its agent, offering us a sufficient black-box means of ensuring a service is functioning as expected. This past week, we explored the viability of containerising this responsibility into a service that could be run on a serverless platform such as the DigitalOcean App Platform.
How It Was…
In the past, we ran the DataDog Agent, which powers the HTTP check integration, on a virtual private server (VPS) with a separate cloud provider. This was provisioned using Puppet, and we were responsible for handling all server concerns (patching, upgrades, etc.). This worked well at the time, as the instance was also responsible for several other infrastructural tasks. However, those responsibilities have slowly been migrated elsewhere, and hosting a full VPS just to run HTTP checks was no longer ideal.
Over the past year or so, we have explored DataDog Synthetic Monitoring, which not only evaluates HTTP checks but also includes in-depth performance analysis and browser recording capabilities. These additional features, combined with Terraform support, are extremely powerful, and we have had good success porting several key endpoint request checks over to this product. Unfortunately, this additional functionality comes with an increased price tag, and we do not see an increase in monitoring confidence in moving all our current checks over to it. As such, we decided it would be ideal to keep the same HTTP health checks in place but manage the service at a higher level of compute (Serverless 😎).
Containerising the DataDog Agent
With this in mind, we looked into the viability of containerising the DataDog Agent itself.
Fortunately, there are many documented examples of carrying out such a task using Docker and container orchestration platforms such as Kubernetes and ECS.
Our end goal, however, was a little different from the norm: we wished to ship and run the agent only as a standalone service, rather than providing a means for other ancillary containers to relay data back to DataDog.
With a little help from the DataDog documentation, we were able to assemble a Dockerfile that enabled only the desired built-in HTTP check integration, like so.
FROM datadog/agent:7
# Pin a stable hostname and disable the Agent features we do not need
ENV DD_HOSTNAME "httpcheck"
ENV DD_USE_DOGSTATSD "false"
ENV DD_PROCESS_AGENT_ENABLED "false"
ENV DD_APM_ENABLED "false"
ENV DD_LOGS_ENABLED "false"
ENV DD_ENABLE_METADATA_COLLECTION "false"
# Remove the default check configurations, leaving only our HTTP check
RUN rm -rf /etc/datadog-agent/conf.d
ADD http_check.yaml /etc/datadog-agent/conf.d/http_check.d/conf.yaml
An example definition of what can be found in the http_check.yaml file is presented below.
init_config:
instances:
  - name: Homepage
    url: https://www.mybuilder.com
    http_response_status_code: 200
    content_match: 'Hire an exceptional tradesperson'
    tls_verify: true
Upon experimenting with the built image, we noticed that, by default, the hostname is derived from the running container's instance ID. To keep a stable instance hostname within DataDog, we opted to override the DD_HOSTNAME environment variable.
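For this sort of local experimentation, the image can be built and run directly with Docker; a minimal sketch, assuming a valid API key (the image tag and placeholder values are illustrative):
docker build -t datadog-http-check .
docker run --rm -e DD_API_KEY=<your-api-key> --name http-check datadog-http-check
# In a second shell, the Agent's status output confirms the http_check instances are loaded
docker exec http-check agent status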
Deploying the Service
Now that we had built the service into a self-managed container, we investigated the platforms available to deploy and run it. There are many options offering such a capability, from AWS Fargate to Google Cloud Run. Upon review, we opted for the DigitalOcean App Platform, which provided a cost-effective means ($5 a month!) of spinning up and running a container.
Not only is App Platform cost-efficient, it also provides egress network access (a requirement for a service such as ours) and the ability to seamlessly integrate the deployment process into a GitHub repository pipeline. This allows us to simply push changes to the service repository within GitHub and let DigitalOcean handle building and deploying the new image.
We are big proponents of Infrastructure as Code (IaC) at MyBuilder, and another selling point for us was the ability to define the App Platform service within Terraform. Coupled with the DataDog provider, we were able to combine the two third-party services into our desired configuration like so.
resource "digitalocean_app" "datadog_http_check" {
spec {
name = "datadog-http-check"
region = "lon"
worker {
name = "datadog-http-check"
source_dir = "/"
dockerfile_path = "Dockerfile"
instance_count = 1
instance_size_slug = "basic-xxs"
github {
repo = "mybuilder/datadog-http-check"
branch = "main"
deploy_on_push = true
}
env {
key = "DD_API_KEY"
value = datadog_api_key.datadog_http_check.key
scope = "RUN_TIME"
type = "SECRET"
}
}
}
}
resource "datadog_api_key" "datadog_http_check" {
name = "datadog-http-check"
}
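For completeness, the configuration above assumes that both the DigitalOcean and DataDog Terraform providers have been declared and authenticated elsewhere; a minimal sketch, with illustrative variable names:
terraform {
  required_providers {
    digitalocean = {
      source = "digitalocean/digitalocean"
    }
    datadog = {
      source = "DataDog/datadog"
    }
  }
}
provider "digitalocean" {
  token = var.digitalocean_token
}
provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
}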
Finally, we were able to provision a DataDog monitor that alerts us to any issues detected by the newly created HTTP check service.
resource "datadog_monitor" "datadog_http_check" {
type = "service check"
name = "HTTP checks"
query = "\"http.can_connect\".over(\"*\").by(\"url\").last(4).count_by_status()"
message = <<-EOM
{{#is_alert}}
We are having issues connecting to {{url.name}}. {{check_message}}
{{/is_alert}}
{{#is_recovery}}
We are now able to connect to {{url.name}}. {{check_message}}
{{/is_recovery}}
EOM
monitor_thresholds {
critical = 3
warning = 1
ok = 1
}
}
One Small Gotcha…
One additional change we had to make to the built image for use with DigitalOcean App Platform was the inclusion of an overridden /proc/cpuinfo file. Upon investigation, it seems the /proc/cpuinfo output provided by the DigitalOcean host machine includes an unknown value for the CPU stepping attribute. The Go package used by the DataDog Agent expects this to be an integer and, as such, the Agent would critically restart upon initialisation. Fortunately, the package provides a means of supplying an overridden host /proc directory, within which we were able to stub out the stepping attribute to ensure the agent initialises correctly.
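A sketch of how this could look as additions to the Dockerfile (the stub directory name and cpuinfo contents are illustrative, and we are assuming the override is supplied via the HOST_PROC environment variable honoured by the Agent's underlying gopsutil package):
# Ship a stubbed proc directory whose cpuinfo reports an integer "stepping" value
ADD proc/ /proc-stub/
# Point the Agent's system probes at the stub rather than the host's /proc
ENV HOST_PROC /proc-stub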
Conclusion
To conclude, we are very happy with the ease with which we have been able to abstract this service to a higher level of compute, removing the concern of managing the underlying host machine in the process. As we now only maintain ownership of a Docker image, we can trivially move this service to another cloud provider platform in the future if we so wish.