Disaster Recovery

Overview

This document outlines how the platform is restored or operated during incidents affecting application availability or hosting infrastructure.

Hosting footprint (Azure)

Services run on Azure and are provisioned via Infrastructure as Code. Typical components include:
- Azure App Service (API, Admin, Frontend)
- Azure networking and supporting resources
- Azure SQL (or equivalent data persistence if configured)
- Container Registry and CI/CD pipelines

Most application components are stateless containers. Persistent state (e.g., database) must be recoverable via standard backup/restore procedures.

Recovery principles

Reprovision quickly using IaC.
Restore minimal viable service first (API, then Frontend/Admin).
Keep secrets out of repos; reconfigure via environment variables or secure stores.
Validate health endpoints before reopening traffic.
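
As an illustration of the last principle, the following is a minimal sketch of a gate that checks health endpoints before traffic is reopened. It assumes each app exposes an HTTP health endpoint that returns 200 when healthy; the function names and the example URLs are hypothetical and do not come from the platform itself.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for `url`, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with an error status (4xx/5xx)
    except OSError:
        return 0  # DNS failure, refused connection, timeout, and similar


def unhealthy_endpoints(
    urls: Iterable[str], probe: Callable[[str], int] = http_status
) -> List[str]:
    """Return the URLs whose probe result is anything other than 200."""
    return [url for url in urls if probe(url) != 200]


def safe_to_reopen(
    urls: Iterable[str], probe: Callable[[str], int] = http_status
) -> bool:
    """True only if at least one endpoint was checked and all returned 200."""
    urls = list(urls)
    return bool(urls) and not unhealthy_endpoints(urls, probe)


if __name__ == "__main__":
    # Hypothetical endpoints, listed in restore order: API first, then Frontend/Admin.
    endpoints = [
        "https://api.example.test/health",
        "https://frontend.example.test/health",
        "https://admin.example.test/health",
    ]
    print("Safe to reopen traffic:", safe_to_reopen(endpoints))
```

The `probe` parameter is injectable so the gate can be exercised without network access, for example in a pipeline dry run.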

Scenarios and actions

App Service or container failure
- Redeploy the affected app via CI/CD.
- If platform issue persists, recreate the Web App and rebind configuration.
- Verify health checks and application logs.
Region/resource outage
- Recreate infrastructure in the same or an alternate region using IaC variables (based on recommendations from the DfE ELZ teams).
- Re-deploy images from the container registry.
- Update DNS, Traffic Manager, or Front Door targets if endpoints change.
Database unavailability or corruption
- Fail over to the secondary (if configured) or perform a point-in-time restore.
- Re-point connection strings and run health checks.
- Validate application read/write paths.
Configuration or secret loss
- Re-seed environment variables/app settings from secure sources.
- Rotate credentials if exposure is suspected.
- Re-run deployment to ensure consistency.
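
For the configuration or secret loss scenario, a small pre-flight check can confirm that every required setting has been re-seeded before the deployment is re-run. This is a sketch only: the setting names below are hypothetical placeholders, and the real list would come from the securely stored documentation of required app settings.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Hypothetical setting names, for illustration only.
REQUIRED_SETTINGS = ("DATABASE_CONNECTION_STRING", "API_BASE_URL")


def missing_settings(
    required: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the required setting names that are absent or blank in `env`.

    When `env` is omitted, the current process environment is checked.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings(REQUIRED_SETTINGS)
    if missing:
        # Report names only; never print secret values.
        raise SystemExit("Missing app settings: " + ", ".join(missing))
    print("All required app settings are present.")
```

The check reports setting names only, so it is safe to run in pipeline logs.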

RTO/RPO targets

RTO: restore core service within the timeframe set by DfE standards, achieved by automating infrastructure and deployments.
RPO: data loss is bounded by database backup frequency and retention; use PITR or failover to reduce it.
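
To make the RPO relationship concrete: the data lost in an incident is the time between the last recoverable point (the latest backup or restore point) and the moment of failure. The sketch below works through that arithmetic. The 15-minute target and the timestamps are illustrative assumptions, not DfE figures.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recovery_point: datetime, incident_time: datetime) -> timedelta:
    """Time span of writes that cannot be recovered after the incident."""
    if incident_time < last_recovery_point:
        raise ValueError("incident_time must not precede last_recovery_point")
    return incident_time - last_recovery_point


def meets_rpo(
    last_recovery_point: datetime, incident_time: datetime, rpo: timedelta
) -> bool:
    """True if the data loss window is within the RPO target."""
    return data_loss_window(last_recovery_point, incident_time) <= rpo


if __name__ == "__main__":
    rpo_target = timedelta(minutes=15)            # illustrative target only
    last_point = datetime(2024, 1, 1, 10, 0)      # last recoverable point
    incident = datetime(2024, 1, 1, 10, 12)       # time of failure
    print("Data loss window:", data_loss_window(last_point, incident))
    print("Within RPO:", meets_rpo(last_point, incident, rpo_target))
```

With these example values the loss window is 12 minutes, which is inside the assumed 15-minute target. More frequent recovery points shrink the worst-case window.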

Operational checklist

Confirm scope and impact; notify stakeholders.
Choose recovery path (redeploy vs. reprovision).
Apply IaC to recreate resources if needed.
Deploy latest known-good container images.
Restore database (failover/PITR) if required.
Validate health checks, logs, and synthetic tests.
Restore traffic and monitor.
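
The checklist above is ordered, and later steps should not run if an earlier one fails. One way to encode that, as a sketch under assumptions, is a small runner that executes steps in sequence and stops at the first failure. The step callables here are placeholders; in practice each would wrap whatever pipeline, IaC, or database tooling the team actually uses.

```python
from typing import Callable, List, Optional, Tuple

# A step is a label plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_checklist(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and stop at the first failure.

    Returns (names of completed steps, name of the failed step or None).
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None


if __name__ == "__main__":
    # Placeholder actions; replace each lambda with a real check or command.
    steps: List[Step] = [
        ("Apply IaC", lambda: True),
        ("Deploy known-good images", lambda: True),
        ("Restore database", lambda: True),
        ("Validate health checks", lambda: True),
        ("Restore traffic", lambda: True),
    ]
    done, failed = run_checklist(steps)
    print("Completed:", done)
    print("Failed at:", failed)
```

Stopping at the first failure keeps traffic from being restored when an earlier validation step has not passed.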

Prevention and readiness

Regularly validate backups and restore procedures.
Keep IaC current with production state.
Automate smoke tests post-deploy.
Document environment variables and required app settings (stored securely).

Glossary

IaC: Infrastructure as Code
RTO: Recovery Time Objective
RPO: Recovery Point Objective
PITR: Point-in-time restore
Failover: Switching from a failed primary resource to a standby (secondary), either manually or automatically
Synthetic tests: Automated tests to validate application health
Synthetic monitoring: Automated monitoring to validate application health
Synthetic monitoring dashboard: Dashboard to monitor synthetic tests
Synthetic monitoring alerts: Alerts to notify stakeholders of test failures