A collection of Terraform modules to deploy the Prometheus ecosystem to Cloud Foundry.
The prometheus_all module is a good starting point as it includes all the other modules. Check the variables in prometheus_all for a description of all configuration options.
github.com/DFE-Digital/cf-monitoring
The PaaS exporter requires an account with the SpaceAuditor role on each monitored space as well as BillingManager on the whole organisation.
prometheus_all is a wrapper module abstracting all the other modules. It should be sufficient for most use cases, but the underlying modules can also be used directly.
The prometheus_all module creates two instances of the Prometheus application. A minimal configuration looks like this:
module "prometheus_all" {
  source = "git::https://github.com/DFE-Digital/cf-monitoring.git//prometheus_all"

  monitoring_instance_name = "teaching-vacancies"
  monitoring_org_name      = "dfe"
  monitoring_space_name    = "teaching-vacancies-monitoring"
  paas_exporter_username   = var.paas_exporter_username
  paas_exporter_password   = var.paas_exporter_password
  grafana_admin_password   = var.grafana_admin_password
}
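The var.* references in the configuration above have to be declared in the calling module. A minimal sketch, assuming Terraform 0.14 or later (needed for the sensitive argument); the description text is illustrative and not taken from the module:

```hcl
# Input variables referenced by the prometheus_all module block.
# Marking the passwords as sensitive hides their values in plan and apply
# output (assumes Terraform >= 0.14).
variable "paas_exporter_username" {
  type        = string
  description = "Account used by the PaaS exporter" # illustrative wording
}

variable "paas_exporter_password" {
  type      = string
  sensitive = true
}

variable "grafana_admin_password" {
  type      = string
  sensitive = true
}
```

The values can then be supplied through a variables file or through TF_VAR_<name> environment variables, which keeps the secrets out of version control.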
The git reference can be changed, for example to use the dev branch:
source = "git::https://github.com/DFE-Digital/cf-monitoring.git//prometheus_all?ref=dev"
The default retention policy in InfluxDB is 30 days, after which all metrics are deleted. It is possible to keep some metrics for 12 months by using InfluxDB downsampling and enabling the yearly Prometheus instance.
Install the InfluxDB client version 1.8 from https://portal.influxdata.com/downloads/
Install the conduit plugin:
cf install-plugin conduit
Connect to the InfluxDB instance:
cf conduit <influxdb instance> -- influx
Create the one_year retention policy:
CREATE RETENTION POLICY one_year ON defaultdb DURATION 52w REPLICATION 1
Create the continuous query to aggregate data automatically. For the billing data enter:
CREATE CONTINUOUS QUERY cost_1y ON defaultdb BEGIN SELECT max(value) AS value INTO defaultdb.one_year.cost FROM defaultdb.default_retention_policy.cost GROUP BY time(1d),* END
Prometheus-yearly is an extra Prometheus instance reading data from the one_year retention policy in InfluxDB. It is disabled by default. To enable it, set enable_prometheus_yearly to true:
module "prometheus_all" {
  source = "git::https://github.com/DFE-Digital/cf-monitoring.git//prometheus_all"
  ...
  enable_prometheus_yearly = true
}
It is possible to include modules selectively to help onboarding to prometheus_all step-by-step. See the list of modules in enabled_modules.
module "prometheus_all" {
  source = "git::https://github.com/DFE-Digital/cf-monitoring.git//prometheus_all"

  enabled_modules          = ["prometheus", "influxdb"]
  monitoring_instance_name = "teaching-vacancies"
  monitoring_org_name      = "dfe"
  monitoring_space_name    = "teaching-vacancies-monitoring"
}
By default, authentication to Grafana is only via username and password for the admin account. Authentication via Google single sign-on can also be configured. It provides read-only access to users by default, and additional permissions are not persisted.
It provides several datasources:
A number of Grafana dashboards are included and are usable out of the box to monitor your apps and services. By default they show all your resources; you can then filter them via drop-down menus.
You can add your own dashboards via the grafana_json_dashboards parameter.
See Grafana README
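As an illustration, custom dashboards could be loaded from JSON files kept next to the Terraform code. This is a sketch only: it assumes grafana_json_dashboards accepts a list of dashboard JSON strings (check the variable description in prometheus_all), and the dashboards/frontend.json path is hypothetical.

```hcl
module "prometheus_all" {
  source = "git::https://github.com/DFE-Digital/cf-monitoring.git//prometheus_all"

  # Other settings as in the minimal configuration are omitted here.

  # Assumption: a list of JSON documents, one string per dashboard.
  # dashboards/frontend.json is a hypothetical file exported from Grafana.
  grafana_json_dashboards = [
    file("${path.module}/dashboards/frontend.json"),
  ]
}
```

Keeping each dashboard in its own file means it can be exported from the Grafana UI and reviewed like any other change.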
Basic metrics are available in the CF databases dashboard. The PostgreSQL advanced dashboard provides more advanced metrics via the postgres_prometheus_exporter module.
See postgres_prometheus_exporter README
Generic Postgres alerting can be enabled for selected databases. This adds alerts that trigger as specified below.
Prerequisites:
Set the following variables in your Terraform code or in the env.tfvars.json file, as per your configuration, to enable generic alerting.
postgres_dashboard_url (string): the Grafana URL of the cf-databases dashboard
alertable_postgres_services (map): a map of the Postgres instances to have alerting enabled, with optional alert thresholds. Any thresholds not listed use their default values.
For example, in JSON format:
"postgres_dashboard_url": "https://grafana-service.london.cloudapps.digital/d/azzzBNMz",
"alertable_postgres_services": {
  "bat-qa/apply-postgres-qa": {
    "max_cpu": 65,
    "min_mem": 0.5,
    "min_stg": 2
  },
  "bat-qa/register-postgres-qa": {},
  "bat-qa/teacher-training-api-postgres-qa": {
    "min_mem": 0.5
  }
}
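When the variables are set directly in Terraform code rather than in a JSON variables file, the same settings can be written in HCL. A sketch, assuming the module accepts this structure unchanged; the values are copied from the JSON example and the other module settings are omitted:

```hcl
module "prometheus_all" {
  source = "git::https://github.com/DFE-Digital/cf-monitoring.git//prometheus_all"

  # Other settings as in the minimal configuration are omitted here.

  postgres_dashboard_url = "https://grafana-service.london.cloudapps.digital/d/azzzBNMz"

  alertable_postgres_services = {
    # All three thresholds overridden
    "bat-qa/apply-postgres-qa" = {
      max_cpu = 65
      min_mem = 0.5
      min_stg = 2
    }
    # No overrides: every threshold falls back to its default
    "bat-qa/register-postgres-qa" = {}
    # Only min_mem overridden
    "bat-qa/teacher-training-api-postgres-qa" = {
      min_mem = 0.5
    }
  }
}
```

The map keys must stay quoted because they contain a slash.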
Generic application alerting can be enabled for selected apps.
Prerequisites:
Set the following variables in your Terraform code or in the env.tfvars.json file, as per your configuration, to enable generic alerting.
apps_dashboard_url (string): the Grafana URL of the cf-apps dashboard
alertable_apps (map): a map of the app instances to have alerting enabled, with optional alert thresholds. Any thresholds not listed use their default values.
For example, in JSON format:
"apps_dashboard_url": "https://grafana-service.london.cloudapps.digital/d/azzzBNMz",
"alertable_apps": {
  "tra-dev/find-a-lost-trn-dev": {},
  "tra-dev/qualified-teachers-api-dev": {
    "response_threshold": 5
  }
}
If your application uses Redis, you may want to include a Redis metrics exporter for each Redis instance you use. This is done by passing in an array of strings, each of the form "space/service", for example:
redis_services = ["get_into_teaching/redis_service_one", "get_into_teaching/redis_service_two", ...]
A list of external endpoints which can be queried via /metrics. It can be used for apps deployed to Cloud Foundry or for any external service. The endpoints must be accessible via HTTPS.
Pass a list of applications deployed to Cloud Foundry and prometheus will find each individual instance and scrape metrics from them. The format is:
["<app1_name>.<internal_domain>[:port]", "<app2_name>.<internal_domain>[:port]"]
If the port is not specified, the default Cloud Foundry port will be used (8080).
Internal routing must be configured so that prometheus can access them.
prometheus_all outputs both the Prometheus app name and id to help create the network policy.
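A network policy allowing Prometheus to reach an app on the internal network could look like the sketch below. It assumes the cloudfoundry-community/cloudfoundry Terraform provider and its cloudfoundry_network_policy resource. The output name prometheus_app_id and the cloudfoundry_app.my_app resource are hypothetical placeholders: check the prometheus_all outputs for the real name.

```hcl
# Sketch only: lets the Prometheus app connect to the monitored app on the
# default Cloud Foundry port over the internal network.
resource "cloudfoundry_network_policy" "prometheus_to_my_app" {
  policy {
    # Assumption: hypothetical output name, see the prometheus_all outputs.
    source_app = module.prometheus_all.prometheus_app_id

    # Hypothetical app resource defined elsewhere in your configuration.
    destination_app = cloudfoundry_app.my_app.id

    # The provider expects the port as a string; 8080 is the default app port.
    port     = "8080"
    protocol = "tcp"
  }
}
```

One policy block is needed per monitored app, and the port should match the one given in the internal apps list.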
To allow useful aggregation and optimise time series storage, applications should decorate their metrics with a label called app_instance representing the id of the Cloud Foundry app instance. It can be obtained at runtime from the CF_INSTANCE_INDEX environment variable.
For Ruby applications, yabeda is a powerful framework for exposing custom metrics, and plugins such as yabeda-rails and yabeda-sidekiq provide many metrics out of the box.
It is recommended to decorate the yabeda metrics as follows:
require 'json'

# Tag every metric with Cloud Foundry metadata when running on the platform.
if ENV.key?('VCAP_APPLICATION')
  vcap_config = JSON.parse(ENV['VCAP_APPLICATION'])

  Yabeda.configure do
    default_tag :app, vcap_config['name']
    default_tag :app_instance, ENV['CF_INSTANCE_INDEX']
    default_tag :organisation, vcap_config['organization_name']
    default_tag :space, vcap_config['space_name']
  end
end
A default configuration is provided, but it does not send any notifications. You can configure Slack notifications via a webhook, or provide your own configuration.
Deploying apps that depend on pulling images from Docker Hub can fail with the error "You have reached your pull rate limit" if the pull is not authenticated to Docker Hub.
Docker Hub credentials can be passed into the modules as follows:
```
docker_credentials = {
  username = var.dockerhub_username
  password = var.dockerhub_password
}
```