AWS

DynamoDB DR Setup

DynamoDB

Solution Overview:

CDC Data

  • This option can be used when global tables are not available.
  • DynamoDB has a native CDC feature called DynamoDB Streams.
  • A stream retains the change records for up to 24 hours.
  • We will use a Lambda function to read the data from the DynamoDB stream.
  • The stream triggers the Lambda function automatically through an event source mapping.
  • The Lambda function reads the change events from the stream and replicates them to the DynamoDB table in the DR region.
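
The replication step above could be sketched roughly as follows. This is a minimal illustration, not a production handler: the environment variable names DR_REGION and DR_TABLE are assumptions, the stream is assumed to include new images (view type NEW_IMAGE or NEW_AND_OLD_IMAGES), and retries, ordering conflicts, and binary attributes are not handled.

```python
import os
from typing import Any, Dict


def to_write_request(record: Dict[str, Any]) -> Dict[str, Any]:
    """Translate one DynamoDB Streams record into a write for the DR table."""
    event_name = record["eventName"]
    change = record["dynamodb"]
    if event_name in ("INSERT", "MODIFY"):
        # NewImage is already in DynamoDB attribute-value format,
        # which the low-level client accepts as-is.
        return {"op": "put", "Item": change["NewImage"]}
    if event_name == "REMOVE":
        return {"op": "delete", "Key": change["Keys"]}
    raise ValueError(f"unexpected eventName: {event_name}")


def replicate(event: Dict[str, Any], client: Any, table_name: str) -> int:
    """Apply every record in the batch to the DR table; return the record count."""
    count = 0
    for record in event.get("Records", []):
        request = to_write_request(record)
        if request["op"] == "put":
            client.put_item(TableName=table_name, Item=request["Item"])
        else:
            client.delete_item(TableName=table_name, Key=request["Key"])
        count += 1
    return count


def handler(event, context):
    # boto3 is imported lazily so the functions above can be tested offline.
    import boto3

    # DR_REGION and DR_TABLE are hypothetical environment variable names.
    client = boto3.client("dynamodb", region_name=os.environ["DR_REGION"])
    return {"replicated": replicate(event, client, os.environ["DR_TABLE"])}
```

Passing the client into replicate keeps the translation logic independent of AWS, so it can be exercised with a stub before the function is deployed.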

Historical Data Export

  • DynamoDB supports direct export of a table to an S3 bucket.
  • We will export the data to S3 in DynamoDB JSON format.
  • Then, in the DR region, we’ll run an import job, which creates the DR table from the exported data.
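
As a rough sketch, the export and import could be driven with boto3 as shown below. The helper names, the single string partition key, and on-demand billing are assumptions made for illustration. The export call requires point-in-time recovery to be enabled on the source table, and the import prefix is passed in explicitly because the export writes its data files under a generated sub-prefix.

```python
from typing import Any, Dict


def export_request(table_arn: str, bucket: str, prefix: str) -> Dict[str, Any]:
    """Arguments for dynamodb.export_table_to_point_in_time (primary region)."""
    return {
        "TableArn": table_arn,
        "S3Bucket": bucket,
        "S3Prefix": prefix,
        "ExportFormat": "DYNAMODB_JSON",
    }


def import_request(bucket: str, data_prefix: str, table_name: str,
                   hash_key: str, hash_key_type: str = "S") -> Dict[str, Any]:
    """Arguments for dynamodb.import_table (DR region).

    Assumes a table with only a partition key and on-demand billing.
    """
    return {
        "S3BucketSource": {"S3Bucket": bucket, "S3KeyPrefix": data_prefix},
        "InputFormat": "DYNAMODB_JSON",
        "InputCompressionType": "GZIP",  # export data files are gzip-compressed
        "TableCreationParameters": {
            "TableName": table_name,
            "AttributeDefinitions": [
                {"AttributeName": hash_key, "AttributeType": hash_key_type}
            ],
            "KeySchema": [{"AttributeName": hash_key, "KeyType": "HASH"}],
            "BillingMode": "PAY_PER_REQUEST",
        },
    }


def start_export(primary_region: str, request: Dict[str, Any]) -> Dict[str, Any]:
    import boto3  # lazy import keeps the request builders testable offline
    client = boto3.client("dynamodb", region_name=primary_region)
    return client.export_table_to_point_in_time(**request)


def start_import(dr_region: str, request: Dict[str, Any]) -> Dict[str, Any]:
    import boto3
    client = boto3.client("dynamodb", region_name=dr_region)
    return client.import_table(**request)
```

Both calls are asynchronous, so the import should only be started once the export has completed.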

Monitoring

  • CloudWatch will be used to monitor the replication.
  • For the Lambda function, the IteratorAge metric reports the age of the last record in each batch at the time it is processed, which shows how far the replication is lagging behind the stream.
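
One way to act on IteratorAge is a CloudWatch alarm. The sketch below only builds the arguments for cloudwatch.put_metric_alarm; the alarm name, the 60-second threshold, and the evaluation settings are illustrative assumptions and should be tuned to the actual RPO target.

```python
from typing import Any, Dict, Optional


def iterator_age_alarm(function_name: str,
                       threshold_ms: int = 60_000,
                       sns_topic_arn: Optional[str] = None) -> Dict[str, Any]:
    """Build put_metric_alarm arguments for the replication Lambda's IteratorAge."""
    alarm: Dict[str, Any] = {
        "AlarmName": f"{function_name}-iterator-age",  # hypothetical naming scheme
        "Namespace": "AWS/Lambda",
        "MetricName": "IteratorAge",  # reported in milliseconds
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanThreshold",
        # No datapoints means no stream activity, which is not a lag problem.
        "TreatMissingData": "notBreaching",
    }
    if sns_topic_arn:
        alarm["AlarmActions"] = [sns_topic_arn]
    return alarm


def create_alarm(region: str, alarm: Dict[str, Any]) -> None:
    import boto3  # lazy import so the builder can be tested without AWS access
    boto3.client("cloudwatch", region_name=region).put_metric_alarm(**alarm)
```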

Further Optimization:

  • By default, the stream triggers the Lambda function as soon as records are available.
  • This processes the events in near real time, but it also results in more Lambda invocations.
  • We can control this trade-off by setting the batch size or the batching window that the event source mapping uses when reading records from the stream.
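
The batch settings live on the event source mapping between the stream and the function. The following sketch builds the arguments for lambda.create_event_source_mapping; the default values and the TRIM_HORIZON starting position are assumptions chosen for illustration, and the range checks reflect the limits for DynamoDB stream sources as I understand them.

```python
from typing import Any, Dict


def stream_mapping_request(stream_arn: str, function_name: str,
                           batch_size: int = 100,
                           window_seconds: int = 0) -> Dict[str, Any]:
    """Build create_event_source_mapping arguments for a DynamoDB stream.

    A larger batch size or window means fewer invocations but higher
    replication latency.
    """
    if not 1 <= batch_size <= 10_000:
        raise ValueError("batch_size must be between 1 and 10000")
    if not 0 <= window_seconds <= 300:
        raise ValueError("window_seconds must be between 0 and 300")
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        # Start from the oldest retained record so no change is skipped.
        "StartingPosition": "TRIM_HORIZON",
        "BatchSize": batch_size,
        "MaximumBatchingWindowInSeconds": window_seconds,
    }


def create_mapping(region: str, request: Dict[str, Any]) -> Dict[str, Any]:
    import boto3  # lazy import so the builder can be tested without AWS access
    return boto3.client("lambda", region_name=region).create_event_source_mapping(**request)
```

For example, stream_mapping_request(arn, "ddb-replicator", batch_size=500, window_seconds=5) lets Lambda wait up to five seconds to collect as many as 500 records per invocation.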

Aurora DR setup:

Aurora

Solution Overview:

  • Aurora Global Database lets a cluster span multiple regions and manages the replication and failover out of the box.
  • We can create a complete cluster in the DR region, and it will replicate the data from the primary region.
  • A more cost-optimized way to set up the global cluster is to make the DR region cluster a headless cluster.
  • A headless cluster has only Aurora’s storage, without any writer or reader nodes.
  • Headless clusters can be provisioned via the CLI tool only.
  • During a disaster, we can add nodes to the headless cluster and start using its cluster endpoint for application connectivity.
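
The headless approach can be illustrated with the RDS API through boto3, which exposes the same operations as the CLI. The sketch below is an assumption-laden outline: the helper names are hypothetical, and encrypted clusters, subnet groups, and security groups would need additional parameters. The secondary cluster is created with a global cluster identifier and no instances; an instance is added only when the DR region has to serve traffic.

```python
from typing import Any, Dict


def headless_cluster_request(cluster_id: str, global_cluster_id: str,
                             engine: str, engine_version: str) -> Dict[str, Any]:
    """Arguments for rds.create_db_cluster in the DR region.

    No create_db_instance call follows, so the cluster holds storage only.
    """
    return {
        "DBClusterIdentifier": cluster_id,
        "GlobalClusterIdentifier": global_cluster_id,
        "Engine": engine,  # must match the primary cluster's engine
        "EngineVersion": engine_version,
    }


def dr_instance_request(cluster_id: str, instance_id: str,
                        instance_class: str, engine: str) -> Dict[str, Any]:
    """Arguments for rds.create_db_instance, used when the disaster occurs."""
    return {
        "DBInstanceIdentifier": instance_id,
        "DBClusterIdentifier": cluster_id,
        "DBInstanceClass": instance_class,
        "Engine": engine,
    }


def create_headless_cluster(dr_region: str, request: Dict[str, Any]) -> Dict[str, Any]:
    import boto3  # lazy import so the builders can be tested without AWS access
    return boto3.client("rds", region_name=dr_region).create_db_cluster(**request)


def add_dr_instance(dr_region: str, request: Dict[str, Any]) -> Dict[str, Any]:
    import boto3
    return boto3.client("rds", region_name=dr_region).create_db_instance(**request)
```

Keeping the two requests separate mirrors the runbook: the first is applied once at setup time, the second only during a failover.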

Monitoring

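  • RDS CloudWatch metrics report the replication lag; for a global database, the relevant metric is AuroraGlobalDBReplicationLag.
  • The minimum managed RPO is 20 seconds (managed RPO is a feature of Aurora PostgreSQL-based global databases), but Aurora typically replicates across regions in under a second.
  • When a managed RPO is set and the lag exceeds it, Aurora blocks commits on the primary until the lag drops back below the RPO.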