
Disaster Recovery in the Cloud: Planning RTO and RPO Correctly


By Alec Chizhik, December 5, 2024


TL;DR

RTO (Recovery Time Objective) and RPO (Recovery Point Objective) define the disaster recovery (DR) design.
Multi-region active-active delivers an RTO under one minute, but costs two to three times as much as a single-region production setup.
Pilot Light and Warm Standby offer cost-efficient DR with RTOs of 10-60 minutes.
Infrastructure as Code makes DR environments reproducible and testable.
Regular DR testing is mandatory – 40% of companies that skip it fail catastrophically when disaster strikes.

The question isn’t whether a cloud region will fail, but when. AWS us-east-1, Azure South Central US, and GCP europe-west1 have all suffered significant outages in recent years. Companies without a tested disaster recovery plan face an existential question in those moments: How quickly can we get back online, and how much data will we lose?

RTO and RPO: The Two Metrics That Define Everything

Recovery Time Objective (RTO) answers: What’s the maximum acceptable downtime? An RTO of four hours means the service must be fully restored within four hours of failure.

Recovery Point Objective (RPO) answers: How much data loss is acceptable? An RPO of one hour means no more than the last hour’s worth of data may be lost.

Together, these metrics shape your DR architecture and its price tag. A near-zero RTO and RPO demand an active-active multi-region setup, the most expensive option. A 24-hour RTO and RPO can be met with daily backups, the most economical one. The real challenge lies in striking the right balance between business needs and budget constraints.


The Four Cloud DR Strategies

Backup & Restore: Regular backups stored in another region. During DR: Spin up infrastructure and restore from backup. RTO: 4-24 hours. RPO: Depends on backup frequency. Cost: Minimal (storage only). Best for non-critical workloads.

Pilot Light: Minimal infrastructure – like database replicas and DNS – runs continuously in the DR region. During DR: Scale up compute resources and reroute traffic. RTO: 30-60 minutes. RPO: Minutes (via replication). Cost: 10-20% of production costs.

Warm Standby: A scaled-down version of production runs continuously in the DR region. During DR: Scale up and redirect traffic. RTO: 10-30 minutes. RPO: Seconds. Cost: 30-50% of production costs.

Active-Active Multi-Region: Full production capacity runs in both regions, with traffic distributed across them. If one region fails, traffic automatically shifts to the other. RTO: < 1 minute. RPO: Near zero (exactly 0 only if writes are replicated synchronously). Cost: 200-300% of production costs (plus added complexity). Ideal for mission-critical, customer-facing services.
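To make the trade-off concrete, here is a minimal sketch that picks the cheapest strategy whose worst-case RTO still meets a target. The RTO bounds are the ranges quoted above; the `Strategy` class and the `cheapest_strategy` function are hypothetical names chosen for illustration, and RPO is left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    worst_rto_minutes: float  # upper bound of the RTO range quoted in the text


# Ordered from cheapest to most expensive, using the article's RTO ranges.
STRATEGIES = [
    Strategy("Backup & Restore", 24 * 60),
    Strategy("Pilot Light", 60),
    Strategy("Warm Standby", 30),
    Strategy("Active-Active Multi-Region", 1),
]


def cheapest_strategy(rto_target_minutes: float) -> str:
    """Return the cheapest strategy whose worst-case RTO meets the target."""
    for strategy in STRATEGIES:
        if strategy.worst_rto_minutes <= rto_target_minutes:
            return strategy.name
    raise ValueError("No strategy in the table meets this RTO target")


if __name__ == "__main__":
    for target in (24 * 60, 240, 45, 5):
        print(f"RTO target {target} min -> {cheapest_strategy(target)}")
```

Sizing against the worst-case figure is a deliberately conservative choice: a four-hour target maps to Pilot Light here, because Backup & Restore only meets it in the best case.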

Infrastructure as Code as a DR Enabler

DR plans based on manual runbooks crumble under pressure – stymied by stress, time constraints, and missing credentials. Infrastructure as Code (Terraform, CloudFormation, Pulumi) makes DR environments declarative and repeatable.

The pattern is simple: Production infrastructure is fully codified. During DR, that same code deploys into the target region – using region-specific variables. What once took hours manually now takes minutes. And it’s testable: DR drills run regularly to verify the code actually works.
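The "same code, region-specific variables" idea can be sketched in a few lines. The example below merges shared settings with per-region overrides and renders them as JSON, which could then be saved as a `.tfvars.json` file and passed to Terraform with `-var-file`. The service name, instance sizes, regions, and the `render_tfvars` helper are all assumptions made for illustration, not values from a real setup.

```python
import json

# Hypothetical shared settings: identical for production and DR.
BASE_VARIABLES = {
    "service_name": "shop-api",
    "instance_type": "m5.large",
    "instance_count": 6,
}

# Hypothetical per-region overrides: only these differ between deployments.
REGION_OVERRIDES = {
    "eu-central-1": {"role": "production"},
    "eu-west-1": {"role": "dr"},
}


def render_tfvars(region: str) -> str:
    """Merge shared and region-specific values into tfvars-style JSON."""
    if region not in REGION_OVERRIDES:
        raise KeyError(f"No overrides defined for region {region!r}")
    variables = {**BASE_VARIABLES, "region": region, **REGION_OVERRIDES[region]}
    return json.dumps(variables, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render_tfvars("eu-west-1"))
```

Keeping the overrides small is the point of the pattern: the less that differs between regions, the less can drift between production and DR.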

Data Replication: Synchronous vs. Asynchronous

Synchronous replication writes data to both regions before a write is acknowledged. RPO: 0 (no data loss). Downside: Latency increases, because every write waits for confirmation from the DR region. As a rule of thumb, it is only viable between regions with a round-trip latency below about 50 ms.

Asynchronous replication writes locally first, then replicates with a delay. RPO: Seconds to minutes (depending on replication lag). Advantage: No added write latency in production. The standard for cross-region DR.
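The trade-off between the two modes comes down to simple arithmetic, sketched below under simplifying assumptions: a synchronous write pays one cross-region round trip on top of the local commit, while an asynchronous setup risks losing whatever was written during the current replication lag. The function names and the sample numbers are illustrative only.

```python
def sync_write_latency_ms(local_write_ms: float, cross_region_rtt_ms: float) -> float:
    """A synchronous write completes only after the DR region acknowledges it."""
    return local_write_ms + cross_region_rtt_ms


def async_write_latency_ms(local_write_ms: float) -> float:
    """An asynchronous write returns as soon as the local commit succeeds."""
    return local_write_ms


def async_writes_at_risk(writes_per_second: float, replication_lag_s: float) -> float:
    """Writes not yet replicated if the primary region fails right now."""
    return writes_per_second * replication_lag_s


if __name__ == "__main__":
    # Assumed figures: 2 ms local commit, 40 ms round trip, 500 writes/s, 3 s lag.
    print(sync_write_latency_ms(2, 40))   # 42 ms per write, RPO of 0
    print(async_write_latency_ms(2))      # 2 ms per write
    print(async_writes_at_risk(500, 3))   # 1500 writes inside the loss window
```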

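Managed services like Aurora Global Database, Cosmos DB, and Cloud Spanner provide multi-region replication out of the box, each with a different consistency model: Aurora Global Database replicates asynchronously, Cosmos DB offers tunable consistency levels, and Cloud Spanner provides strong consistency across regions. All three eliminate the operational overhead of building custom replication pipelines.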

DR Testing: The Overlooked Success Factor

A DR plan that’s never tested isn’t a plan; it’s wishful thinking. Research shows that 40% of organizations that have never validated their DR plan fail outright during real incidents. Expired credentials, changed API endpoints, and incompatible data formats are pitfalls that only surface through rigorous testing.

Best practice: Quarterly DR drills where the DR region actually handles live traffic. Netflix pioneered continuous resilience testing with “Chaos Engineering” (Chaos Monkey, regional outage simulations). Today, AWS Fault Injection Service and Azure Chaos Studio bring managed chaos engineering to everyone.
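A drill only yields evidence if its outcome is measured against the objectives. The following sketch, with hypothetical names (`DrillResult`, `evaluate_drill`) and made-up timestamps, derives the achieved recovery time and data loss window from three timestamps and compares them with the RTO and RPO targets.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    failure_at: datetime          # when the simulated outage began
    restored_at: datetime         # when the DR region served traffic successfully
    last_replicated_at: datetime  # timestamp of the newest data present in DR


def evaluate_drill(result: DrillResult, rto: timedelta, rpo: timedelta) -> dict:
    """Compare measured recovery time and data loss window with the targets."""
    recovery_time = result.restored_at - result.failure_at
    data_loss_window = result.failure_at - result.last_replicated_at
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "rto_met": recovery_time <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


if __name__ == "__main__":
    drill = DrillResult(
        failure_at=datetime(2024, 12, 5, 10, 0, 0),
        restored_at=datetime(2024, 12, 5, 10, 42, 0),
        last_replicated_at=datetime(2024, 12, 5, 9, 59, 30),
    )
    print(evaluate_drill(drill, rto=timedelta(minutes=60), rpo=timedelta(minutes=1)))
```

Storing this dictionary with each drill report gives a simple trend line: if the measured recovery time creeps toward the RTO over successive drills, the plan needs attention before a real incident does the measuring.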

Frequently Asked Questions

How much does disaster recovery cost in the cloud?

It depends on your strategy: Backup & Restore costs only storage (a few euros per TB per month). Pilot Light runs at 10-20% of production costs. Warm Standby: 30-50%. Active-Active: 200-300%. For most mid-sized businesses, Pilot Light hits the sweet spot – balancing cost and availability.
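As a rough budgeting aid, the percentages above can be applied to a monthly production bill. This is a sketch only: the ranges are the ones quoted in this answer, the `dr_cost_range` helper is a hypothetical name, and the 20,000 figure is an assumed example value. Backup & Restore is omitted because its cost depends on stored volume rather than on production spend.

```python
# Cost ranges in percent of monthly production spend, as quoted in the answer.
COST_SHARE_PERCENT = {
    "Pilot Light": (10, 20),
    "Warm Standby": (30, 50),
    "Active-Active": (200, 300),
}


def dr_cost_range(strategy: str, production_monthly_cost: float) -> tuple[float, float]:
    """Return the (low, high) monthly cost estimate for a DR strategy."""
    low_pct, high_pct = COST_SHARE_PERCENT[strategy]
    return (
        production_monthly_cost * low_pct / 100,
        production_monthly_cost * high_pct / 100,
    )


if __name__ == "__main__":
    for name in COST_SHARE_PERCENT:
        low, high = dr_cost_range(name, 20_000)
        print(f"{name}: {low:,.0f} to {high:,.0f} per month")
```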

How often should you test DR?

At least quarterly. Mission-critical systems warrant monthly tests. Every drill should be documented: what worked, what didn’t, and what actions are needed. Automated DR testing – powered by Terraform and CI/CD – enables frequent validation without manual effort.

Is backup enough for DR?

For non-critical systems, yes. But if your RTO requirement is under four hours, backups fall short – the restore process simply takes too long. Restoring multi-terabyte databases can take hours. Pilot Light or Warm Standby become far better options.
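A back-of-the-envelope calculation shows why restores eat into the RTO. The sketch below estimates only the raw data transfer time; provisioning infrastructure, replaying logs, and validation come on top. The `restore_hours` function is a hypothetical helper, and the 250 MB/s sustained throughput is an assumed figure, not a provider guarantee.

```python
def restore_hours(database_tb: float, throughput_mb_per_s: float) -> float:
    """Estimate raw transfer time for a restore, ignoring provisioning and log replay."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    total_mb = database_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return total_mb / throughput_mb_per_s / 3600


if __name__ == "__main__":
    # Assumed example: a 5 TB database restored at a sustained 250 MB/s.
    print(f"{restore_hours(5, 250):.1f} hours")  # about 5.6 hours
```

Even under these fairly optimistic assumptions, the transfer alone exceeds a four-hour RTO.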

What happens during a multi-region outage?

Multi-region outages at a single provider are extremely rare – but possible (e.g., global DNS failures or widespread IAM bugs). For maximum resilience, enterprises adopt multi-cloud DR: production on AWS, DR on Azure – or vice versa. Complexity rises sharply, but for critical infrastructure, it’s justified.

How do I plan DR for Kubernetes workloads?

Velero is the de facto standard for Kubernetes DR: it backs up cluster state and persistent volumes, and can restore them into another region or cluster. For stateless workloads, GitOps (e.g., Argo CD) suffices – the cluster state lives in Git and can be deployed anywhere, anytime.

Header Image Source: Pexels / Jakub Zerdzicki
