AWS Athena is a schema-on-read platform. It is one of the best services in AWS for building Data Lake solutions and running analytics on flat files stored in S3. Under the hood it actually uses Presto clusters, so most Presto functions are supported and you can query your data with standard SQL. You can create partitions to speed up your queries and reduce the cost of scanning. Here I'm going to explain how to automatically create AWS Athena partitions for CloudTrail logs between two dates. When Athena was introduced, I used it to analyze CloudTrail logs, which was very helpful for finding particular activities: who launched this instance, tracking a particular user's actions, and so on.
But the challenge was that I had 3 years of CloudTrail logs, which was a really huge amount of data. So if I ran a query to find when an instance was launched, Athena would scan the complete data set, which took more time and added extra cost for me. On top of that, I was using around 8 regions.
Here is my AWS CloudTrail Log path in S3.
So I need to create partitions based on region/year/month/day.
Executing the create-partition query by hand for 3 years is not an easy job, and I would also need to run it across 8 regions: that's roughly 8 regions × ~1,095 days ≈ 8,760 partition queries. Uuuuuhhhhh!! It doesn't make sense. So I created a Lambda function to make my job easier.
Create the table with Partitions
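A table partitioned by region, year, month, and day can be created with a DDL along these lines. This is a shortened sketch of the CloudTrail table definition from the AWS documentation (the full column list is longer, including the useridentity struct); the table name cloudtrail_logs and the &lt;AccountID&gt; placeholder are my own, so adjust them to your setup:

```sql
CREATE EXTERNAL TABLE cloudtrail_logs (
  eventversion STRING,
  eventtime STRING,
  eventsource STRING,
  eventname STRING,
  awsregion STRING,
  sourceipaddress STRING,
  useragent STRING,
  requestparameters STRING,
  responseelements STRING
  -- ...the remaining CloudTrail columns from the AWS docs go here...
)
PARTITIONED BY (region STRING, year STRING, month STRING, day STRING)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://sqladmin-cloudtraillog/AWSLogs/<AccountID>/CloudTrail/';
```

Note the PARTITIONED BY clause: these four partition columns are what the Lambda function below will populate, one partition per region per day.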
Create the Lambda function
Go to Lambda and create a new function with Python 3.6.
IAM role for Lambda:
S3 – ListBucket.
S3 – Read Object on the CloudTrail log bucket (my bucket name: sqladmin-cloudtraillog).
S3 – Write Object on the Athena results bucket (my results bucket: aws-athena-query-results-XXXXXXXXXXXXXXXX-us-east-1).
Athena – CreateNamedQuery and StartQueryExecution.
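The permissions above can be expressed as a policy roughly like this. This is a minimal sketch, not a tested policy; the XXXXXXXXXXXXXXXX account ID is a placeholder, and if your Athena uses the Glue Data Catalog you will also need the corresponding glue: permissions (e.g. glue:GetTable, glue:BatchCreatePartition) for adding partitions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::sqladmin-cloudtraillog",
        "arn:aws:s3:::sqladmin-cloudtraillog/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::aws-athena-query-results-XXXXXXXXXXXXXXXX-us-east-1/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "athena:CreateNamedQuery",
        "athena:StartQueryExecution",
        "athena:GetQueryExecution"
      ],
      "Resource": "*"
    }
  ]
}
```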
AWS Lambda can run for up to 5 minutes only. So if you want to create partitions for more than 3 months, give start_date and end_date ranges of no more than 3 months per execution, OR run this on a server.
Update: Lambda now supports 15-minute executions.
Parameters for S3 bucket and Athena
s3_bucket – Bucket name where your CloudTrail logs are stored.
s3_prefix – Path to your CloudTrail logs (give the prefix before the region. For example, if the logs are at s3://bucket/AWSLogs/AccountID/CloudTrail/region/year/month/day/log_files, then use the path AWSLogs/AccountID/CloudTrail/).
s3_output – Path where your Athena query results will be saved.
database – Name of the database where your CloudTrail logs table is located.
table_name – Name of your CloudTrail logs table.
start_date – Start date for creating partitions.
end_date – Last date for creating partitions.
Function to get the range of dates between two given dates
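A minimal sketch of such a helper (the name get_date_range and the YYYY-MM-DD input format are my assumptions):

```python
from datetime import datetime, timedelta

def get_date_range(start_date, end_date):
    """Return every date from start_date to end_date (inclusive) as datetime objects."""
    start = datetime.strptime(start_date, '%Y-%m-%d')
    end = datetime.strptime(end_date, '%Y-%m-%d')
    # One entry per day, so each day gets its own partition later.
    return [start + timedelta(days=n) for n in range((end - start).days + 1)]
```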
Function to execute the Athena query
Main function to automatically create AWS Athena partitions
Athena supports only 20 concurrent query executions. We need to make the Lambda/Python code wait for some time before executing the next loop iteration, so I added a 2-second sleep between iterations.