SHART.CLOUD | DESKTOP | LABROULETTE | AWS-MULTI-REGION-ACTIVE-PASSIVE.YAML
01 - Labroulette - Active-Passive Multi-Region Failover for a Read-Heavy API
AWS | intermediate | ~90 min | $4-9 | reviewed 6/22/2026

#networking #resilience #data #dns
03 - Scenario

Tindrloop is a 40-person "dating for developers" startup whose single us-east-1 deployment went dark for 90 minutes during the last regional event, long enough to trend on Hacker News for the wrong reasons.

Stand up a warm standby in a second region for the read-heavy profile API so DNS fails over automatically when the primary's health check goes red, without paying for a full duplicate stack while the standby is idle.

Constraints

  • RTO < 5 minutes, fully automated — no human in the failover path
  • Idle standby cost < $50/month (no always-on compute duplication)
  • Everything as Terraform — no console clicks, reviewable in a PR
  • Data layer must be multi-region, with single-digit-ms reads in both regions
Scenario AWS - intermediate
05 - Steps
  1. STEP_01

    Scaffold two regions with aliased providers

    Declare two AWS provider blocks — a default primary (us-east-1) and an aliased secondary (us-west-2). Use a data source to confirm each provider resolves to the region you expect, and adopt default tags so every resource is labelled with the lab name and which region it belongs to.
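
    The two provider blocks described above might look like the sketch below. The version pin, the tag keys (`Lab`, `Region`), and the output names are assumptions for illustration; the lab does not prescribe them.

    ```hcl
    terraform {
      required_providers {
        aws = {
          source  = "hashicorp/aws"
          version = "~> 5.0" # assumed pin; the lab does not specify one
        }
      }
    }

    # Default (un-aliased) provider: the primary region.
    provider "aws" {
      region = "us-east-1"

      default_tags {
        tags = {
          Lab    = "aws-multi-region-active-passive" # assumed tag key and value
          Region = "primary"
        }
      }
    }

    # Aliased provider: the secondary region.
    # Resources opt in with `provider = aws.secondary`.
    provider "aws" {
      alias  = "secondary"
      region = "us-west-2"

      default_tags {
        tags = {
          Lab    = "aws-multi-region-active-passive"
          Region = "secondary"
        }
      }
    }

    # Data sources that confirm which region each provider resolves to.
    data "aws_region" "primary" {}

    data "aws_region" "secondary" {
      provider = aws.secondary
    }

    output "primary_region" {
      value = data.aws_region.primary.name
    }

    output "secondary_region" {
      value = data.aws_region.secondary.name
    }
    ```

    Any resource that omits the `provider` argument lands in us-east-1, so the rubric's "correctly assigned per region" check comes down to spotting which resources carry `provider = aws.secondary`.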

    Terraform docs

  2. STEP_02

    Make the data layer multi-region

    Create a DynamoDB table for profile reads and give it a replica in the secondary region so it becomes a global table. Enable streams and point-in-time recovery. Note in your README why a global table satisfies the single-digit-ms read constraint where cross-region reads would not.

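    Hint: Global table replicas require streams to be enabled with the NEW_AND_OLD_IMAGES view type, whatever the billing mode; the apply fails if streams are off. PAY_PER_REQUEST is the simpler billing choice here, since provisioned tables should also have write-capacity autoscaling configured for every replica.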
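
    A minimal sketch of such a table, assuming the default provider targets us-east-1. The table name `tindrloop-profiles` and the key `profile_id` are hypothetical; the lab does not define a schema.

    ```hcl
    resource "aws_dynamodb_table" "profiles" {
      name         = "tindrloop-profiles" # hypothetical table name
      billing_mode = "PAY_PER_REQUEST"    # on-demand, so an idle table accrues no capacity charges
      hash_key     = "profile_id"         # hypothetical partition key

      attribute {
        name = "profile_id"
        type = "S"
      }

      # Streams carry item changes to the replica.
      stream_enabled   = true
      stream_view_type = "NEW_AND_OLD_IMAGES"

      # Point-in-time recovery for the table in the primary region.
      point_in_time_recovery {
        enabled = true
      }

      # Adding a replica block turns the table into a global table.
      replica {
        region_name            = "us-west-2"
        point_in_time_recovery = true # PITR is set per replica
      }
    }
    ```

    With a replica in each region, the API reads from the table copy in its own region instead of crossing the country on every request, which is the point to expand on in the README.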

    Terraform docs

  3. STEP_03

    Add a health check on the primary

    Define a Route 53 health check that polls your primary region's API health endpoint. Tune the failure threshold and request interval so a genuine outage trips it inside your 5-minute RTO without flapping on a single slow response.
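
    One possible shape for the health check is sketched below. The hostname `api-primary.example.com`, the `/health` path, and the threshold values are assumptions chosen for illustration, not values taken from the lab.

    ```hcl
    resource "aws_route53_health_check" "primary" {
      fqdn          = "api-primary.example.com" # hypothetical primary endpoint
      type          = "HTTPS"
      port          = 443
      resource_path = "/health" # hypothetical health path

      # A checker needs 3 consecutive failures, 10 s apart, before it reports
      # the endpoint unhealthy: roughly 30 s of sustained failure.
      # A single slow response does not trip it.
      failure_threshold = 3
      request_interval  = 10 # valid values are 10 or 30; the 10 s interval is priced higher

      tags = {
        Name = "primary-api-health" # assumed tag
      }
    }
    ```

    Raising `failure_threshold` or moving to the 30-second interval trades detection speed for stability, so it is worth re-checking the result against the 5-minute RTO after any change.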

    Terraform docs

  4. STEP_04

    Wire failover routing records

    Create two Route 53 records for the same name — a PRIMARY failover record tied to the health check and a SECONDARY record pointing at the standby. When the health check goes unhealthy, Route 53 stops answering with the primary and serves the secondary automatically.
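
    The record pair could be sketched as follows. Every variable here is hypothetical and stands in for values from your own module; the CNAME type and the 60-second TTL are also assumptions (a CNAME cannot sit at the zone apex, where an alias record would be needed instead).

    ```hcl
    # Hypothetical inputs; wire these to your own resources.
    variable "zone_id" { type = string }
    variable "record_name" { type = string } # e.g. api.example.com
    variable "primary_target" { type = string }
    variable "secondary_target" { type = string }
    variable "primary_health_check_id" { type = string }

    # Served while the attached health check is healthy.
    resource "aws_route53_record" "primary" {
      zone_id         = var.zone_id
      name            = var.record_name
      type            = "CNAME"
      ttl             = 60 # short TTL so resolvers pick up the change quickly
      set_identifier  = "primary"
      health_check_id = var.primary_health_check_id
      records         = [var.primary_target]

      failover_routing_policy {
        type = "PRIMARY"
      }
    }

    # Served only when the primary is considered unhealthy.
    resource "aws_route53_record" "secondary" {
      zone_id        = var.zone_id
      name           = var.record_name
      type           = "CNAME"
      ttl            = 60
      set_identifier = "secondary"
      records        = [var.secondary_target]

      failover_routing_policy {
        type = "SECONDARY"
      }
    }
    ```

    The TTL matters as much as the health check settings: resolvers may keep serving the primary answer until their cached copy expires.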

    Terraform docs

  5. STEP_05

    Prove the failover

    Force the primary health check unhealthy (block the health path or point it at a dead endpoint), then resolve the DNS name repeatedly and watch the answer flip to the secondary. Capture the wall-clock time from "primary down" to "DNS serving secondary" and compare it against your 5-minute RTO.
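
    Before measuring, it can help to have a rough expected figure to compare against. The sketch below does that arithmetic in Terraform locals. All four input values are assumptions, and the formula is only an estimate: it ignores health checker aggregation and any resolver that caches beyond the TTL.

    ```hcl
    locals {
      # Assumed inputs; substitute the values from your own configuration.
      request_interval_s = 10  # seconds between health check requests
      failure_threshold  = 3   # consecutive failures before "unhealthy"
      record_ttl_s       = 60  # TTL on the failover records
      rto_target_s       = 300 # the lab's 5-minute RTO

      # Time for the health check to notice the outage.
      detection_s = local.request_interval_s * local.failure_threshold

      # Detection plus the longest a resolver should cache the old answer.
      expected_failover_s = local.detection_s + local.record_ttl_s
    }

    output "expected_failover_seconds" {
      value = local.expected_failover_s # 30 + 60 = 90 with the values above
    }

    output "within_rto" {
      value = local.expected_failover_s < local.rto_target_s # true with the values above
    }
    ```

    If the measured wall-clock time is far above this estimate, the README is a good place to note where the extra time went.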

    Terraform docs

Steps 5 tasks
06 - Deliverables
  • A Terraform root module that provisions both regions and applies cleanly
  • A README documenting the measured failover time vs. the 5-minute RTO target
  • A short note on the RPO of the DynamoDB global table during the cutover
  • `terraform destroy` output proving the standby tears down to near-zero idle cost
Deliverables 4 required
07 - Rubric
  • Both providers are declared (default primary, aliased secondary) and resources are correctly assigned per region (25%)
  • DynamoDB global table replicates to the secondary with streams + PITR (25%)
  • Failover records cut over automatically on an unhealthy health check (30%)
  • Measured failover time is documented and within the 5-minute RTO (20%)
Rubric self-assessed