AWS Glue Custom Output File Size And Fixed Number Of Files

AWS Glue is the serverless version of an EMR cluster, and many organizations have now adopted it for their day-to-day big data workloads. I have written a blog in Searce's Medium publication on converting CSV/JSON files to Parquet using AWS Glue. To this day many people read it and implement it on their infrastructure, but a common complaint is that Glue produces a huge number of output files (the converted Parquet files) in S3; converting even a 100MB CSV file can produce 500+ Parquet files. So we need to customize the output file size and the number of files.
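A minimal sketch of the usual fix (not the exact code from the post): convert the DynamicFrame to a Spark DataFrame and repartition() it before writing, since the partition count decides how many Parquet files land in S3. The database, table, and bucket names below are placeholders.

```python
# Minimal AWS Glue PySpark sketch: control the number of output Parquet files.
# Database, table, and S3 paths are placeholders, not from the original post.
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the source CSV data from the Glue Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_csv_table"
)

# Convert to a Spark DataFrame and repartition: each partition becomes
# one output file, so 5 partitions means 5 Parquet files instead of 500+.
df = dyf.toDF().repartition(5)

glue_context.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(df, glue_context, "repartitioned"),
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/parquet-output/"},
    format="parquet",
)
```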

4 min read

RedShift Unload All Tables To S3

The Redshift UNLOAD function helps us export/unload data from tables directly to S3. It actually runs a SELECT query to get the results and then stores them in S3. But unfortunately, it supports only one table at a time, so you need a script that fetches all the table names into a variable and loops the UNLOAD query over that list. Here I have handled this the PL/SQL way, so you can export/unload all the tables to S3 with partitions.
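For illustration, here is a minimal Python sketch of that loop using psycopg2 (the post itself does it inside the database); the connection details, schema, bucket, and IAM role are placeholders.

```python
# Sketch of "unload all tables": fetch every table name, then run one
# UNLOAD per table. Credentials, bucket, and IAM role are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    port=5439, dbname="dev", user="admin", password="******"
)
conn.autocommit = True
cur = conn.cursor()

# Get all the tables in the target schema.
cur.execute("SELECT tablename FROM pg_tables WHERE schemaname = 'public'")
tables = [row[0] for row in cur.fetchall()]

# Loop the UNLOAD query over the list of tables.
for table in tables:
    cur.execute(f"""
        UNLOAD ('SELECT * FROM public.{table}')
        TO 's3://my-bucket/unload/{table}/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
        ALLOWOVERWRITE
    """)
```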

5 min read

How GCP Browser Based SSH Works

If you are using the GCP platform and are too lazy to set up a VPN or a bastion host, then you may be familiar with using an SSH connection via the browser. Yeah, it's just one click to log in to a Linux VM. Here are some questions for you.

3 min read

CloudWatch Custom Log Filter Alarm For Kinesis Load Failed Event

Kinesis Firehose pushes real-time data to S3, Redshift, Elasticsearch, and Splunk for real-time/near-real-time analytics. Sometimes the target may not be available due to maintenance or some other reason. In that case, Kinesis automatically pushes the data to S3 and creates a manifest file in the errors directory, so we can reload the data into our targets later. But unfortunately, there is no one-step action to set up a notification when a load fails. In this blog, I describe how we can set up a CloudWatch custom log filter alarm for Kinesis load failed events.
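As a hedged sketch of that setup with boto3: first a metric filter that turns failure events in the Firehose log group into a custom metric, then an alarm on that metric. The log group name, filter pattern, and SNS topic ARN here are assumptions, not taken from the post.

```python
# Sketch: metric filter on the Firehose log group plus an alarm on the
# resulting custom metric. Names, pattern, and ARNs are placeholders.
import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Turn matching log events (failed deliveries) into a custom metric.
logs.put_metric_filter(
    logGroupName="/aws/kinesisfirehose/my-stream",
    filterName="redshift-load-failed",
    filterPattern='"DeliveryToRedshift.Failed"',  # assumed pattern
    metricTransformations=[{
        "metricName": "FirehoseLoadFailed",
        "metricNamespace": "Custom/Firehose",
        "metricValue": "1",
    }],
)

# Alarm when any failed-load event appears, notifying an SNS topic.
cloudwatch.put_metric_alarm(
    AlarmName="firehose-redshift-load-failed",
    Namespace="Custom/Firehose",
    MetricName="FirehoseLoadFailed",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:dba-alerts"],
)
```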

2 min read

Relationalize Unstructured Data In AWS Athena with GrokSerDe

Managing logs in a centralized repository is one of the most common best practices in the DevOps world. Application logs, system logs, error logs, and even database logs are all pushed into your centralized repository. You can use the ELK stack or Splunk to visualize the logs and get better insights from them. But as a SQL guy, I wanted to solve this problem within the big data ecosystem (that is, with SQL). As part of that process, we can relationalize unstructured data in AWS Athena with the help of GrokSerDe.
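As a rough illustration (not the exact DDL from the post), a Grok-backed table can be created through the Athena API like this; the table name, S3 locations, and Grok pattern are placeholders for whatever your log format looks like.

```python
# Sketch of relationalizing a raw log with GrokSerDe via the Athena API.
# Database, table, locations, and the Grok pattern are illustrative only.
import boto3

athena = boto3.client("athena")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS logs.app_log (
  client_ip string,
  log_ts    string,
  message   string
)
ROW FORMAT SERDE 'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
  'input.format' = '%{IP:client_ip} %{TIMESTAMP_ISO8601:log_ts} %{GREEDYDATA:message}'
)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://my-bucket/raw-logs/'
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```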

2 min read

RedShift Unload to S3 With Partitions - Stored Procedure Way

Redshift UNLOAD is the fastest way to export data from a Redshift cluster. In the big data world, people generally use the data in S3 for a data lake, so it is important to make sure the data in S3 is partitioned; then we can use Athena, Redshift Spectrum, or EMR external tables to access it in an optimized way. If you are dealing with multiple tables, you can loop over the table names in a shell script or Python code, but as a SQL guy, I chose stored procedures to do this. It made the export process simple.
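A hedged sketch of that idea, not the post's actual procedure: loop over the distinct values of a partition column and run one UNLOAD per value into a Hive-style col=value/ prefix. The table, bucket, and IAM role are placeholders.

```python
# Sketch: create and call a Redshift stored procedure that UNLOADs each
# partition-column value to its own S3 prefix. Names are placeholders.
import psycopg2

PROC = """
CREATE OR REPLACE PROCEDURE unload_partitioned(tbl varchar, part_col varchar)
AS $$
DECLARE
  rec RECORD;
BEGIN
  -- One UNLOAD per distinct partition value, written under col=value/
  -- prefixes so Athena / Spectrum / EMR external tables see partitions.
  FOR rec IN EXECUTE 'SELECT DISTINCT ' || part_col || '::varchar AS v FROM ' || tbl
  LOOP
    EXECUTE 'UNLOAD ('
      || quote_literal('SELECT * FROM ' || tbl
                       || ' WHERE ' || part_col || ' = ' || quote_literal(rec.v))
      || ') TO '
      || quote_literal('s3://my-bucket/' || tbl || '/' || part_col || '=' || rec.v || '/')
      || ' IAM_ROLE ''arn:aws:iam::123456789012:role/UnloadRole'' ALLOWOVERWRITE';
  END LOOP;
END;
$$ LANGUAGE plpgsql;
"""

conn = psycopg2.connect("host=my-cluster.example.redshift.amazonaws.com "
                        "port=5439 dbname=dev user=admin password=******")
conn.autocommit = True
cur = conn.cursor()
cur.execute(PROC)
cur.execute("CALL unload_partitioned('public.sales', 'sale_date')")
```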

2 min read

MySQL Convert Binlog Based Replication To GTID Replication Without Downtime

This title may sound strange to new-age MySQL users, because from 5.7 onwards enabling GTID online is already supported. But a few of my mission-critical databases are still on 5.6 and handling 70k QPS, so I know that enabling GTID there needs downtime. In my case, though, GTID had already been enabled, yet the replication was still using the binlog file name and position to replicate the data.
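The cut-over itself is small once GTID is already on at both ends, which is the situation described above. A minimal sketch, with placeholder hostname and credentials, run against the replica:

```python
# Sketch: switch a replica from binlog file/position coordinates to
# GTID auto-positioning. Assumes GTID is already ON on both servers.
import pymysql

replica = pymysql.connect(host="replica.example.com", user="root",
                          password="******")
cur = replica.cursor()

cur.execute("STOP SLAVE")
# MASTER_AUTO_POSITION = 1 tells the replica to request transactions
# by GTID instead of by binlog file name and position.
cur.execute("CHANGE MASTER TO MASTER_AUTO_POSITION = 1")
cur.execute("START SLAVE")
```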

2 min read

MongoDB Add Node To Replica Set Without Initial Sync In GCP/AWS

Adding a new node to a MongoDB replica set with a huge amount of data takes a lot of time for the initial sync. Recently I was working on a replica set where I needed to replace all the nodes in an existing shard and add a new node to the shard's replica set. The data size was 3TB on each shard, and the entire infrastructure was on GCP. I planned to do this activity with the native approach, but later I realized it would not work as I expected.
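The trick, broadly, is to seed the new node's data directory from a disk snapshot of an existing member (done at the GCP/AWS level) so that joining the set only replays the recent oplog instead of a full initial sync. A minimal sketch of just the reconfig step, with placeholder hostnames; this mirrors what the rs.add() shell helper does:

```python
# Sketch: add a pre-seeded node to a replica set via replSetReconfig.
# Hostnames are placeholders; the data directory was seeded beforehand.
from pymongo import MongoClient

client = MongoClient("mongodb://primary.example.com:27017")

# Fetch the current config, append the new member, bump the version,
# and reconfigure the set.
config = client.admin.command("replSetGetConfig")["config"]
max_id = max(m["_id"] for m in config["members"])
config["members"].append({"_id": max_id + 1,
                          "host": "newnode.example.com:27017"})
config["version"] += 1
client.admin.command("replSetReconfig", config)
```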

4 min read

Database Mirroring is still a Mystery

Database Mirroring is a replication feature that supports automatic failover and is available in Standard Edition as well. With SQL Server 2016, Microsoft announced that Mirroring would be removed in future releases. But it is still there, and the documentation shows it is going to be deprecated soon. AlwaysOn Availability Groups are the permanent high-availability solution for SQL Server.

4 min read

Monitor Cassandra Clusters with Percona PMM - JMX Grafana and Prometheus

While reading this title, you may wonder what this guy is going to do. It's all about Prometheus's JMX exporter and Grafana. Yes, many companies have already implemented this, and Grafana has some cool dashboards. But as DBAs at Searce, we manage many customers, and all of them use many types of databases. So from a DBA's perspective, monitoring all the databases in one place is always a great thing, right? If you are a MySQL DBA, then you must have heard about PMM. It's an awesome monitoring tool, and it's open source. It also has dashboards for Linux metrics, MongoDB, and PostgreSQL.

3 min read