Its not the new term for DBAs. MySQL has an awesome parameter
innodb-file-per-tables allows MySQL to create separate files for each tables. This helped a lot to manage the disk space in more efficient way. But when we perform a large batch job for delete or update the data in MySQL tables, you may face this fragmentation issue. Comparing with SQL server, MySQL’s fragmentation is not high. I had a similar situation where my Disk space was consuming 80% and when I check the huge files in OS, one table’s
idb file consumed 300GB+. I know it has some wasted blocks(but not actually wasted, MySQL will use this space, it’ll not return this to OS) Then I checked the
information schema to find out the data size and its index size. It was 27GB only. Then I realize, we did a batch operation to delete many billions of records in that table.
If you are dealing with the realtime data stream from Kinesis to RedShift, then you may face this situation where Redshift was down due to some maintenance activity and kinesis firehose was not able to ingest the data. But it has awesome features to retry after the next 60 Minutes. I had a situation that, there was a password change on the RedShift cluster on staging infra and we didn’t notice that the data load was failing. Now we need to backfill all those data into redshift using manifest files. One simple way, list all the files in errors manifest folders and generate the copy command and run those commands in loop or import as a
.sql file. But I wanted to automate this process with lambda.
AWS Glue is the serverless version of EMR clusters. Many organizations now adopted to use Glue for their day to day BigData workloads. I have written a blog in Searce’s Medium publication for Converting the CSV/JSON files to parquet using AWS Glue. Till now its many people are reading that and implementing on their infra. But many people are commenting about the Glue is producing a huge number for output files(converted Parquet files) in S3, even for converting 100MB of CSV file will produce 500+ Parquet files. we need to customize this output file size and number of files.
unload function will help us to export/unload the data from the tables to S3 directly. It actually runs a select query to get the results and them store them into S3. But unfortunately, it supports only one table at a time. You need to create a script to get the all the tables then store it in a variable, and loop the unload query with the list of tables. Here I have done with PL/SQL way to handle this. You can Export/Unload all the tables to S3 with partitions.
If you are using GCP platform and lazy to setup VPN or bastion host, then you may familiar with using SSH connection via a browser. Yeah, its just one click to login to the Linux VM. Here are some questions for you.
Kinesis Firehose is pushing the realtime data to S3, Redshift, ElasticSearch, and Splunk for realtime/Near real-time analytics. Sometime the target may not available due to maintenance or any reason. So the Kinesis will automatically push the data to S3 and create the manifest file in the
errors directory. Then later we can reload the data into our targets. But unfortunately, there is no one-step action to set notification if the load is failed. In this blog, im writing how can we setup Cloudwatch custom log filter alarm for kinesis load failed events.
Managing the logs in a centralized repository is one of the most common best practices in the DevOps world. Application logs, system logs, error logs, and any databases logs also will be pushed into your centralized repository. You can use ELK stack or Splunk to visualize the logs to get better insights about it. But as a SQL guy, I wanted to solve this problem with Bigdata ecosystem(use SQL). As a part of that process, we can relationalize unstructured data in AWS Athena with the help of GrokSerDe.
Redshift unload is the fastest way to export the data from Redshift cluster. In BigData world, generally people use the data in S3 for DataLake. So its important that we need to make sure the data in S3 should be partitioned. So we can use Athena, RedShift Spectrum or EMR External tables to access that data in an optimized way. If you are dealing with multiple tables, then you can loop the table names in a shell script or Python code. But as a SQL guy, I choose stored procedures to do this. It made export process simple.
This title may be suitable for the new age MySQL Users. Because in 5.7 onwards its already supported to enable GTID online. But still few of my mission critical databases are in 5.6 and handling
70k QPS. So I know enabling GTID needs downtime for this. But in my case, the GTID has been already implemented. But still the replication is using Binlog file name and position for replicating the data.
Adding a new node to the MongoDB replica set with huge amount of data will take a lot of time to perform the initial sync. Recently I was working on a replica set where I need to replace all the nodes in the existing shard and add a new node to the shard’s replica set. The data size is 3TB on each shard. The entire infra is on GCP. I planned to do this activity with native approach, but later I realized it won’t work as I expected.