In my last post, I have experimented to reconstruct the complete
stl_querytext table in RedShift using
LISTAGG() function. But there is a limitation in that. If you read AWS Doc for LISTAGG, you can’t group the rows which are more than 65535 characters. What happens if we have tens of thousands of lines of SQL queries? It’ll throw the limitation error for sure. I didn’t find a much better solution in RedShift to solve this. So I decided to use other services which are cheaper to do this work. And that’s why you are here. Yes, this time I used AWS Athena an interactive analytics platform where we can query the data in an S3 bucket directly. Lets see how to reconstruct all the queries in
stl_querytext table using Athena.
If you are going to migrate your MySQL workloads to GCP’s managed database service CloudSQL, then you have to keep these points in mind. We have done a lot of CloudSQL migrations. But sometimes it’s not smooth as we thought. Generally, people don’t even think that these thinks will make the replication failure. I listing 3 things that ate our brain and time while migrating to CloudSQL.
If you are managing the RedShift clusters then STL_QUERY and STL_QUERYTEXT tables are not new to you. STL_Query can’t hold the complete SQL query instead we can use STL_QueryText to read the complete query. But there is a challenge, we can’t read that table as it is. Since your queries are saved in multiple rows. So we need to combine all these rows into a single row with LISTAGG function which is well documented here.
Good news for the RedShift customers now we can
GRANT column-level permissions on the tables. It’s not only limited to tables, but we can also grant on views and materialized views as well. When the Lake formation was announced, this feature was a part of it. But unfortunately, we need to use Redshift Spectrum to achieve this. The wait is over now. Redshift natively supports the column level restrictions. Im experimenting and walk through this feature and test all the statements mentioned in the Redshift documentation.
I have learned a lot more internal things about Google Cloud Spanner from past two days. I read some of the portions of the Spanner white paper and the deep internal things from the Google Cloud Next event videos from Youtube. I’ll share the video links here, but I want to summarize all the learnings in one place. Thats why I wrote this blog post. A special thanks to Deepti Srivastava(Product Manager for Spanner) who presented the Spanner Deep Dive sessions in the Google Cloud Next Event.
The snapshot in Debezium will do a historical data load from the source database to the Kafka topics. But generally its not a good practice to this if you have a huge data in your tables. Recently I have published many blog posts to perform this snapshot from Read Replica(with/without GTID, AWS Aurora). One guy commented that, in GCP the MySQL managed service is called CloudSQL. There we don’t have much control to stop replication, perform the modifications that we want. So how can we avoid snapshots in CloudSQL and take debezium snapshots from CloudSQL Read Replica? I have spent some time today and figured out a way to do this.
Neo4j’s multi datacenter deployments are well suited for a geo-distributed workload and also provide a better disaster recovery solution. But to be frank, its not an actual distributed databases like Google Spanner or CocroachDB. Here it’s just grouping/labeling your Neo4j Nodes with different data center names. Even though it has a lot more benefits, like load balancing to a particular group, replicating the data to read replica from the existing read replica instead of replicating from master and etc. Like my previous blog, this also just guides to setting up the Multi datacenter cluster in AWS and GCP.
Neo4j is one of the top-rated Graph database platforms which supports community based Graph database and Enterprise as well. If you want to make the Neo4j database would be highly available, then we have to go with Enterprise edition that has the feature called Causal Cluster. This in blog, we are going to see how to setup and configure Neo4j causal cluster on GCP and AWS cloud platforms. The Neo4j’s documentation has well explained about this cluster setup, but I ran into some issues while deploying this on my own. Those issues made me write this blog.
I have published enough Debezium MySQL connector tutorials for taking snapshots from Read Replica. To continue my research I wanted to do something for AWS RDS Aurora as well. But aurora is not using binlog bases replication. So we can’t use the list of tutorials that I published already. In Aurora, we can get the binlog file name and its position from its snapshot of the source Cluster. So I used a snapshot for loading the historical data, and once it’s loaded we can resume the CDC from the main cluster.
In my previous post, I have shown you how to take the snapshot from Read Replica with GTID for Debezium MySQL connector. GTID concept is awesome, but still many of us using the replication without GTID. For these cases, we can take a snapshot from Read replica and then manually push the Master binlog information to the offsets topic. Injecting manual entry for offsets topic is already documented in Debezium. I’m just guiding you the way to take snapshot from Read replica without GTID.