cf-monitoring

Cloud Foundry monitoring

Collection of terraform modules to deploy the prometheus ecosystem to Cloud foundry.

The prometheus_all module is a good starting point as it includes all the other modules. Check the variables in prometheus_all for a description of all configuration options.

Source

github.com/DFE-Digital/cf-monitoring

Table of contents

Prerequisites

prometheus_all

Wrapper module abstracting all the other modules. It should be sufficient for most use cases but underlying modules can also be used directly.

The prometheus_all module creates two instances of the Prometheus application:

Minimal configuration

module prometheus_all {
  source = "git::https://github.com/DFE-Digital/cf-monitoring.git//prometheus_all"

  monitoring_instance_name = "teaching-vacancies"
  monitoring_org_name      = "dfe"
  monitoring_space_name    = "teaching-vacancies-monitoring"
  paas_exporter_username   = var.paas_exporter_username
  paas_exporter_password   = var.paas_exporter_password
  grafana_admin_password   = var.grafana_admin_password
}

Use a specific cf-monitoring version

The git reference can be changed. For example for the dev branch:

source = "git::https://github.com/DFE-Digital/cf-monitoring.git//prometheus_all?ref=dev"

Retention policy

The default retention policy in influxdb is 30 days. After which all the metrics are deleted. It is possible to keep some metrics for 12 months using influxdb downsampling and enable yearly prometheus.

Influxdb downsampling

Prometheus-yearly

Prometheus-yearly is an extra prometheus instance reading data from the one_year retention policy in influxdb. It is disabled by default. To enable it set enable_prometheus_yearly to true:

module prometheus_all {
  source = "git::https://github.com/DFE-Digital/cf-monitoring.git//prometheus_all"
  ...
  enable_prometheus_yearly = true
}

Enable specific modules

It is possible to include modules selectively to help onboarding to prometheus_all step-by-step. See the list of modules in enabled_modules.

module prometheus_all {
  source = "git::https://github.com/DFE-Digital/cf-monitoring.git//prometheus_all"
  enabled_modules          = ["prometheus", "influxdb"]
  monitoring_instance_name = "teaching-vacancies"
  monitoring_org_name      = "dfe"
  monitoring_space_name    = "teaching-vacancies-monitoring"
}

Grafana

By default authentication is only via username/password for the admin account. Autentication via Google single-sign-on can be configured. It provides readonly access to users by default. Additional permissions are not persisted.

It provides several datasources:

A number of Grafana dashboards are included and are usable out-of-the-box to monitor your apps and services. By default it shows all your resources, then you can filter them via drop-down menus.

You can add your own dashboards via the grafana_json_dashboards parameter.

See Grafana README

PostgreSQL

Basic metrics are available in the CF databases dashboard. The PostgreSQL advanced dashboard provides more advanced metrics via the postgres_prometheus_exporter module.

See postgres_prometheus_exporter README

Generic PostgreSQL alerting

Generic Postgres alerting can be enabled for selected databases.

This will add alerts that will trigger as specified below

PreReqs.

  1. Monitoring must be configured for the postgres instances
  2. Alerting must already be configured for your service (alertmanager)

Set the following variables in tf or env.tfvars.json file as per your configuration to enable generic alerting.

postgres_dashboard_url (string): the grafana url for the cf-databases dashboard

alertable_postgres_services (map): a map of the postgres instances to have alerting enabled, and optional alert thresholds. If any thresholds are not listed they will default as below

e.g. (for json format)

"postgres_dashboard_url": "https://grafana-service.london.cloudapps.digital/d/azzzBNMz"

"alertable_postgres_services": {
  "bat-qa/apply-postgres-qa": {
    "max_cpu": 65,
    "min_mem": 0.5,
    "min_stg": 2
  },
  "bat-qa/register-postgres-qa": {
  },
  "bat-qa/teacher-training-api-postgres-qa": {
    "min_mem": 0.5
  }
}

Generic Application alerting

Generic Application alerting can be enabled for selected apps.

Set the following variables in tf or env.tfvars.json file as per your configuration to enable generic alerting.

apps_dashboard_url (string): the grafana url for the cf-apps dashboard

alertable_apps (map): a map of the app instances to have alerting enabled, and optional alert thresholds. If any thresholds are not listed they will default as below

PreReqs.

  1. Monitoring must be configured for the app instances
  2. Alerting must already be configured for your service (alertmanager)

e.g. (for json format)

"apps_dashboard_url": "https://grafana-service.london.cloudapps.digital/d/azzzBNMz"

"alertable_apps": {
  "tra-dev/find-a-lost-trn-dev": {
  },
  "tra-dev/qualified-teachers-api-dev": {
    "response_threshold": 5
  }
}

Redis Services

If your application uses Redis you may want to include a Redis metrics exporter for each instance of Redis you use. This is accomplished by passing in an array of strings. Each string takes the form of "space/service", for example:

redis_services = [ "get_into_teaching/redis_service_one" , "get_into_teaching/redis_service_two" , ... ]

External exporters

List of external endpoints which can be queried via /metrics. Can be used for apps deployed to Cloud foundry or any external services.

They must be accessible via https.

Internal applications

Configuration

Pass a list of applications deployed to Cloud Foundry and prometheus will find each individual instance and scrape metrics from them. The format is:

["<app1_name>.<internal_domain>[:port]", "<app2_name>.<internal_domain>[:port]"]

If the port is not specified, the default Cloud Foundry port will be used (8080).

Internal routing must be configured so that prometheus can access them. prometheus_all outputs both prometheus app name and id to help create the network policy.

Important

To allow useful aggregation and optimise time series storage, the applications should decorate the metrics with a label called app_instance representing the id of the Cloud Foundry app instance. It can be obtained at runtime from the CF_INSTANCE_INDEX environment variable.

Yabeda

For ruby applications, the yabeda is a powerful framework to expose custom metrics and provides a lot of metrics out of the box such as yabeda-rails and yabeda-sidekiq.

It is recommended to decorate the yabeda metrics as such:

if ENV.key?('VCAP_APPLICATION')
  vcap_config = JSON.parse(ENV['VCAP_APPLICATION'])

  Yabeda.configure do
    default_tag :app, vcap_config['name']
    default_tag :app_instance, ENV['CF_INSTANCE_INDEX']
    default_tag :organisation, vcap_config['organization_name']
    default_tag :space, vcap_config['space_name']
  end
end

alertmanager

A default configuration is provided but it doesn’t send any notification. You can configure slack to publish to a webhook or provide your own configuration.

See alertmanager README

Dockerhub pull rate limit

Deploying apps that depends on Dockerhub image pull can result in failure because of error You have reached your pull rate limit if not authenticated to Dockerhub.

Dockerhub credentials can be passed into the modules as follows:

``` docker_credentials = { username = var.dockerhub_username password = var.dockerhub_password }