AWS

Enable Logical Replication On Aurora Without Reboot

aurora, postgresql

Logical replication in PostgreSQL is a change data capture (CDC) feature that streams database changes to consumers. AWS Aurora PostgreSQL supports logical replication, and it has to be enabled through the cluster parameter group by setting the rds.logical_replication parameter to 1. In RDS PostgreSQL, we normally enable logical replication in the parameter group and then reboot the instance to apply the change. But if your databases cannot afford the downtime of a reboot, you can't enable it that way. I had a similar situation: the reboot took more than 4 minutes, and the business would not allow a maintenance window longer than 2 minutes.

Workaround:

Then I found a small hack to enable logical replication on Aurora without a reboot, at the cost of a very brief downtime.

My reasoning is that the cluster parameter group is shared by all the nodes in the cluster, so every value in it is applied to all of them. If I enable the logical replication flag, it is applied to both the writer and the reader nodes, but wal_level will not show logical until a node is rebooted. Instead of rebooting, just do a failover: the reader node becomes the writer, and logical replication is enabled on that node.

Let's test it:

On the reader node, let's check the current wal_level before enabling it.

replica => show wal_level;
 wal_level
-----------
 replica

Now enable logical replication in the cluster parameter group and check wal_level on both nodes.
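The parameter change can be made in the console, but it can also be scripted. The following is a minimal sketch using boto3, which is an assumption on my side since the post does not say which tool was used. The function names and the group name in the usage comment are hypothetical. rds.logical_replication is a static parameter, so the apply method has to be pending-reboot.

```python
def logical_replication_change(group_name: str) -> dict:
    """Build the request for rds.modify_db_cluster_parameter_group."""
    return {
        "DBClusterParameterGroupName": group_name,
        "Parameters": [
            {
                "ParameterName": "rds.logical_replication",
                "ParameterValue": "1",
                # Static parameter: it only takes effect when an instance restarts.
                "ApplyMethod": "pending-reboot",
            }
        ],
    }


def enable_logical_replication(group_name: str) -> None:
    """Apply the change to a custom cluster parameter group."""
    import boto3  # imported lazily so the helper above works without AWS access

    rds = boto3.client("rds")
    rds.modify_db_cluster_parameter_group(**logical_replication_change(group_name))


# Usage (placeholder group name, needs AWS credentials):
# enable_logical_replication("my-aurora-cluster-params")
```

Keeping the request in its own function makes it easy to review exactly what will be sent before touching a production cluster.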

primary => show wal_level;
 wal_level
-----------
 replica

replica => show wal_level;
 wal_level
-----------
 replica

Up to this point, this is the expected behaviour. Now let's do the failover so that the reader becomes the writer, and check wal_level again.
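The failover itself can be triggered from the console or through the API. Here is a rough sketch with boto3, again an assumption rather than what was actually used for this test. pick_reader and fail_over_to_reader are hypothetical helper names, and the cluster identifier in the usage comment is a placeholder.

```python
def pick_reader(members: list) -> str:
    """Return the identifier of the first reader in a DBClusterMembers list."""
    for member in members:
        if not member["IsClusterWriter"]:
            return member["DBInstanceIdentifier"]
    raise ValueError("cluster has no reader instance to fail over to")


def fail_over_to_reader(cluster_id: str) -> str:
    """Promote a reader of the given Aurora cluster and return its identifier."""
    import boto3  # imported lazily so pick_reader works without AWS access

    rds = boto3.client("rds")
    cluster = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)["DBClusters"][0]
    target = pick_reader(cluster["DBClusterMembers"])
    rds.failover_db_cluster(
        DBClusterIdentifier=cluster_id,
        TargetDBInstanceIdentifier=target,
    )
    return target


# Usage (placeholder cluster name, needs AWS credentials):
# fail_over_to_reader("my-aurora-cluster")
```

Naming the target explicitly is optional. Without it, Aurora chooses the reader to promote based on the failover priority of the instances.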

replica => show wal_level;
 wal_level
-----------
 logical

Hehhhh!! It worked!!
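If you want to put a number on the interruption that a failover causes, one option is to probe the cluster endpoint while the failover runs. This is only a sketch under my own assumptions: it checks plain TCP reachability on port 5432 using the standard library, which is a weaker signal than a successful write, and all function names are hypothetical.

```python
import socket
import time


def can_connect(host: str, port: int = 5432, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def longest_outage(samples: list) -> float:
    """Longest gap, in seconds, from a first failed probe to the next success.

    `samples` is a time-ordered list of (timestamp, ok) pairs.
    """
    longest = 0.0
    down_since = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts  # outage starts
        elif ok and down_since is not None:
            longest = max(longest, ts - down_since)  # outage ends
            down_since = None
    return longest


def probe(host: str, duration: float = 60.0, interval: float = 0.5) -> float:
    """Probe `host` for `duration` seconds and return the longest outage seen."""
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), can_connect(host)))
        time.sleep(interval)
    return longest_outage(samples)
```

Start probe() against the cluster writer endpoint just before triggering the failover. The result is only as precise as the probe interval and the connection timeout.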

Conclusion:

This is still not a zero-downtime solution, but compared with a reboot, the failover took less than 10 seconds, which is good. I don't know whether the same thing will work on RDS with Multi-AZ. If someone has tried it, please comment below.