Active-Passive Multi-Region Failover for a Read-Heavy API

Stand up a warm standby in a second region for the read-heavy profile API so DNS fails over automatically when the primary's health check goes red, without paying for a full duplicate stack while the standby is idle.

#networking #resilience #data #dns

03 - Scenario

Tindrloop, a 40-person "dating for developers" startup whose single us-east-1 deployment went dark for 90 minutes during the last regional event — long enough to trend on Hacker News for the wrong reasons.

Constraints

RTO < 5 minutes, fully automated — no human in the failover path
Idle standby cost < $50/month (no always-on compute duplication)
Everything as Terraform — no console clicks, reviewable in a PR
Data layer must be multi-region with single-digit-ms reads in both

05 - Steps

STEP_01

Scaffold two regions with aliased providers

Declare two AWS provider blocks — a default primary (us-east-1) and an aliased secondary (us-west-2). Use a data source to confirm each provider resolves to the region you expect, and adopt default tags so every resource is labelled with the lab name and which region it belongs to.
Terraform docs
- aws data source: aws_region Confirm each aliased provider resolves to the intended region
- aws guide: resource-tagging Apply default_tags on each provider so failover resources are traceable
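A minimal sketch of the two-provider scaffold described in this step. The tag keys and values (`Lab`, `Role`, `active-passive-failover`) and the provider version pin are illustrative assumptions, not values the lab prescribes.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # assumed pin; adjust to whatever your repo standardises on
    }
  }
}

# Default provider: the primary region.
provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      Lab  = "active-passive-failover" # hypothetical tag value
      Role = "primary"
    }
  }
}

# Aliased provider: the warm-standby region.
provider "aws" {
  alias  = "secondary"
  region = "us-west-2"

  default_tags {
    tags = {
      Lab  = "active-passive-failover" # hypothetical tag value
      Role = "secondary"
    }
  }
}

# Read back the region each provider actually resolves to.
data "aws_region" "primary" {}

data "aws_region" "secondary" {
  provider = aws.secondary
}

output "resolved_regions" {
  value = {
    primary   = data.aws_region.primary.name
    secondary = data.aws_region.secondary.name
  }
}
```

Any resource meant for the standby then sets `provider = aws.secondary`; resources without a `provider` argument land in us-east-1.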
STEP_02

Make the data layer multi-region

Create a DynamoDB table for profile reads and give it a replica in the secondary region so it becomes a global table. Enable streams and point-in-time recovery. Note in your README why a global table satisfies the single-digit-ms read constraint where cross-region reads would not.

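Hint: Global tables replicate through DynamoDB Streams, so the table needs streams enabled with NEW_AND_OLD_IMAGES whatever its billing mode. PAY_PER_REQUEST is a separate convenience: it spares you from configuring write-capacity auto scaling on every replica. Expect the replica to fail to create if streams are off.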
Terraform docs
- aws resource: aws_dynamodb_table The replica block turns a table into a multi-region global table
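One way the table could look, as a sketch rather than a reference implementation. The table name and the `profile_id` key are hypothetical; the lab does not specify a schema. The replica region is written as a literal because the `replica` block takes a region name, not a provider alias.

```hcl
resource "aws_dynamodb_table" "profiles" {
  name         = "tindrloop-profiles" # hypothetical table name
  billing_mode = "PAY_PER_REQUEST"    # no provisioned capacity to pay for while idle
  hash_key     = "profile_id"         # hypothetical key schema

  attribute {
    name = "profile_id"
    type = "S"
  }

  # Streams carry item changes to the replica.
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  # PITR on the primary-region table.
  point_in_time_recovery {
    enabled = true
  }

  # Adding a replica turns the table into a global table.
  replica {
    region_name            = "us-west-2"
    point_in_time_recovery = true # PITR is configured per replica
  }
}
```

Each region's API reads its local replica, which is why reads stay fast in both regions; replication between regions is asynchronous, which is what the RPO note in the deliverables needs to address.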
STEP_03

Add a health check on the primary

Define a Route 53 health check that polls your primary region's API health endpoint. Tune the failure threshold and request interval so a genuine outage trips it inside your 5-minute RTO without flapping on a single slow response.
Terraform docs
- aws resource: aws_route53_health_check failure_threshold × request_interval is your detection budget
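A sketch of a health check with a 90-second detection budget. The hostname `api-use1.example.com` and the `/healthz` path are placeholders for your primary region's endpoint; the threshold and interval are one reasonable starting point, not tuned values.

```hcl
resource "aws_route53_health_check" "primary_api" {
  fqdn          = "api-use1.example.com" # hypothetical primary-region hostname
  port          = 443
  type          = "HTTPS"
  resource_path = "/healthz" # hypothetical health path

  # Detection budget: 3 consecutive failures x 30 s = roughly 90 s.
  request_interval  = 30 # valid values are 10 or 30 seconds
  failure_threshold = 3  # one slow response alone will not trip the check

  tags = {
    Name = "primary-api-health"
  }
}
```

Dropping `request_interval` to 10 shortens detection to about 30 seconds, but fast-interval checks are billed at a higher rate, so weigh that against the idle-cost constraint. Whatever you choose, the record TTL from the next step adds to the total time before clients see the standby.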
STEP_04

Wire failover routing records

Create two Route 53 records for the same name — a PRIMARY failover record tied to the health check and a SECONDARY record pointing at the standby. When the health check goes unhealthy, Route 53 stops answering with the primary and serves the secondary automatically.
Terraform docs
- aws resource: aws_route53_record set_identifier + failover_routing_policy is the cutover mechanism
- aws resource: aws_cloudfront_distribution Optional — front both origins with CloudFront to keep TLS terminations warm
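The record pair might be sketched as follows. All four variables and the name `api.example.com` are assumptions made to keep the fragment self-contained; in a real module the health check ID would more likely be a direct reference to the health check resource.

```hcl
variable "zone_id" {
  type = string # hosted zone that owns the API name
}

variable "primary_endpoint" {
  type = string # hypothetical: DNS name of the us-east-1 API
}

variable "secondary_endpoint" {
  type = string # hypothetical: DNS name of the us-west-2 standby
}

variable "primary_health_check_id" {
  type = string # ID of the Route 53 health check on the primary
}

resource "aws_route53_record" "api_primary" {
  zone_id         = var.zone_id
  name            = "api.example.com" # hypothetical record name
  type            = "CNAME"
  ttl             = 60 # a short TTL keeps resolver caching inside the RTO
  records         = [var.primary_endpoint]
  set_identifier  = "primary"
  health_check_id = var.primary_health_check_id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

resource "aws_route53_record" "api_secondary" {
  zone_id        = var.zone_id
  name           = "api.example.com" # must match the primary record's name and type
  type           = "CNAME"
  ttl            = 60
  records        = [var.secondary_endpoint]
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }
}
```

The secondary record has no health check here, so Route 53 treats it as always available once the primary is marked unhealthy.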
STEP_05

Prove the failover

Force the primary health check unhealthy (block the health path or point it at a dead endpoint), then resolve the DNS name repeatedly and watch the answer flip to the secondary. Capture the wall-clock time from "primary down" to "DNS serving secondary" and compare it against your 5-minute RTO.
Terraform docs
- aws data source: aws_route53_zone Look up the hosted zone's ID and name servers so you can query its authoritative servers directly while testing
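One possible way to force the outage from Terraform, following the "dead endpoint" option above. This is a variant of the Step 3 health check, not an additional resource. The `simulate_outage` variable, the hostname, and both paths are hypothetical, and the approach assumes your API answers unknown paths with a non-2xx/3xx status.

```hcl
variable "simulate_outage" {
  description = "When true, point the health check at a path that should not exist"
  type        = bool
  default     = false
}

resource "aws_route53_health_check" "primary_api" {
  fqdn = "api-use1.example.com" # hypothetical primary-region hostname
  port = 443
  type = "HTTPS"

  # Route 53 treats only 2xx/3xx responses as healthy, so a missing path fails the check.
  resource_path = var.simulate_outage ? "/lab-forced-failure" : "/healthz"

  request_interval  = 30
  failure_threshold = 3
}
```

Apply with `-var simulate_outage=true`, note the time, then resolve the record name in a loop until the answer changes to the standby. Re-apply with the default to restore the primary and record how long the switch back takes as well.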

06 - Deliverables

A Terraform root module that provisions both regions and applies cleanly
A README documenting the measured failover time vs. the 5-minute RTO target
A short note on the RPO of the DynamoDB global table during the cutover
`terraform destroy` output proving the standby tears down to near-zero idle cost

07 - Rubric

Both providers are aliased and resources are correctly assigned per region 25%

DynamoDB global table replicates to the secondary with streams + PITR 25%

Failover records cut over automatically on an unhealthy health check 30%

Measured failover time is documented and within the 5-minute RTO 20%
