The snapshot in Debezium will do a historical data load from the source database to the Kafka topics. But generally its not a good practice to this if you have a huge data in your tables. Recently I have published many blog posts to perform this snapshot from Read Replica(with/without GTID, AWS Aurora). One guy commented that, in GCP the MySQL managed service is called CloudSQL. There we don’t have much control to stop replication, perform the modifications that we want. So how can we avoid snapshots in CloudSQL and take debezium snapshots from CloudSQL Read Replica? I have spent some time today and figured out a way to do this.
Neo4j’s multi datacenter deployments are well suited for a geo-distributed workload and also provide a better disaster recovery solution. But to be frank, its not an actual distributed databases like Google Spanner or CocroachDB. Here it’s just grouping/labeling your Neo4j Nodes with different data center names. Even though it has a lot more benefits, like load balancing to a particular group, replicating the data to read replica from the existing read replica instead of replicating from master and etc. Like my previous blog, this also just guides to setting up the Multi datacenter cluster in AWS and GCP.
Neo4j is one of the top-rated Graph database platforms which supports community based Graph database and Enterprise as well. If you want to make the Neo4j database would be highly available, then we have to go with Enterprise edition that has the feature called Causal Cluster. This in blog, we are going to see how to setup and configure Neo4j causal cluster on GCP and AWS cloud platforms. The Neo4j’s documentation has well explained about this cluster setup, but I ran into some issues while deploying this on my own. Those issues made me write this blog.
I have published enough Debezium MySQL connector tutorials for taking snapshots from Read Replica. To continue my research I wanted to do something for AWS RDS Aurora as well. But aurora is not using binlog bases replication. So we can’t use the list of tutorials that I published already. In Aurora, we can get the binlog file name and its position from its snapshot of the source Cluster. So I used a snapshot for loading the historical data, and once it’s loaded we can resume the CDC from the main cluster.
In my previous post, I have shown you how to take the snapshot from Read Replica with GTID for Debezium MySQL connector. GTID concept is awesome, but still many of us using the replication without GTID. For these cases, we can take a snapshot from Read replica and then manually push the Master binlog information to the offsets topic. Injecting manual entry for offsets topic is already documented in Debezium. I’m just guiding you the way to take snapshot from Read replica without GTID.
When you installed the Debezium MySQL connector, then it’ll start read your historical data and push all of them into the Kafka topics. This setting can we changed via
snapshot.mode parameter in the connector. But if you are going to start a new sync, then Debezium will load the existing data its called Snapshot. Unfortunately, if you have a busy transactional MySQL database, then it may lead to some performance issues. And your DBA will never agree to read the data from Master Node.[Disclaimer: I’m a DBA :) ]. So I was thinking of figuring out to take the snapshot from the Read Replica, once the snapshot is done, then start read the realtime data from the Master. I found this useful information in a StackOverflow answer.
In any relational database, if you didn’t close the session properly, then it’ll lock your DDL queries. It’s applicable to RedShift as well. A few days back I got a scenario that we have to run some DROP TABLE commands to create some lookup tables. But every time while triggering this DDL it got stuck. Then we realize there were some sessions that are still open and those sessions are causing this locking. There we 30+ sessions. I know we can fix this by properly closing the session from the application side. But in some emergency cases, we need to kill all open sessions or locking session in Redshift.
Debezium is providing out of the box CDC solution from various databases. In my last blog post, I have published how to configure the Debezium MySQL connector. This is the next part of that post. Once we deployed the debezium, to we need some kind of monitoring to keep track of whats happening in the debezium connector. Luckily Debezium has its own metrics that are already integrated with the connectors. We just need to capture them using the JMX exporter agent. Here I have written how to monitor Debezium MySQL connector with Prometheus and Grafana. But the dashboard is having the basic metrics only. You can build your own dashboard for more detailed monitoring.
We are living in the DataLake world. Now almost every organizations wants their reporting in Near Real Time. Kafka is of the best streaming platform for realtime reporting. Based on the Kafka connector, RedHat designed the Debezium which is an OpenSource product and high recommended for real time CDC from transnational databases. I referred many blogs to setup this cluster. But I found just basic installation steps. So I setup this cluster for AWS with Production grade and publishing this blog.
In my previous post, I explained how to unload all the tables in the RedShift database to S3 Bucket. But there was a limitation. We should export all the tables, you can’t specify some list of tables or all the tables in a specific schema. Its because of I can’t give comma separated values in RedShift stored procedure. But after spending few days I found a solution for this.