RedShift User Activity Log In Spectrum With Glue Grok
RedShift pushes the user activity log (useractivitylog) to our S3 bucket at hourly intervals. This file contains all the SQL queries that are executed on our RedShift cluster. Unfortunately, it's a raw text file, completely unstructured, so we can't use it directly for further analysis. I have already published a blog about querying this data with AWS Athena using a lot of substring and split_part functions, but that approach isn't efficient for scanning a massive amount of data. That's why I wanted to figure out a better way. The only way to structure unstructured data is to know its pattern and tell your database server how to retrieve the data with proper column names. So I decided to use a Glue Grok pattern to define the schema on top of the user activity log files. Then Spectrum, or even Athena, can query this data.
Glue Custom Classifier Grok Pattern:
I found a grok pattern for this user activity log data on an AWS forum, but it didn't work for me. I tried changing a few things, with no luck. Finally, with the help of AWS Support, we arrived at a working pattern.
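A quick way to sanity-check a Grok pattern before wiring it into a Glue classifier is to mirror it as a Python regex and run it against a sample record. The pattern and field names below are my reconstruction for illustration, not the exact pattern AWS Support provided; confirm the field layout against your own log files:

```python
import re

# Illustrative Grok pattern for a useractivitylog record (an assumption,
# not the exact pattern from AWS Support):
#   '%{TIMESTAMP_ISO8601:timestamp} UTC \[ db=%{DATA:db} user=%{DATA:user}
#    pid=%{DATA:pid} userid=%{DATA:userid} xid=%{DATA:xid} \]' LOG: %{GREEDYDATA:query}
# The same structure expressed as a Python regex:
LOG_RE = re.compile(
    r"'(?P<timestamp>\S+) UTC "
    r"\[ db=(?P<db>\S+) user=(?P<user>\S+) pid=(?P<pid>\d+) "
    r"userid=(?P<userid>\d+) xid=(?P<xid>\d+) \]' LOG: (?P<query>.*)"
)

# A sample record in the useractivitylog layout.
line = ("'2020-03-01T09:00:01Z UTC [ db=dev user=admin pid=12345 "
        "userid=100 xid=1234567 ]' LOG: select * from users;")
m = LOG_RE.match(line)
print(m.group("db"), m.group("user"), m.group("query"))
```

If the regex matches your sample lines and each named group captures the right piece, the equivalent Grok pattern should classify the files the same way.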
Challenge in the data:
The next challenge is that the useractivitylog file AWS generates is not properly structured. It isn't a one-line-per-record text file: if a query is long, that query contains newline characters, so a single record spans multiple lines. For the Grok pattern to parse it, each record has to be on a single line.
So I created a Lambda function that is triggered whenever a new useractivitylog file lands in the S3 bucket. It removes the embedded newline characters and uploads the cleansed file back to the bucket under a different prefix.
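A minimal sketch of such a function, assuming plain-text log files and an S3 put trigger (the bucket wiring, the `cleansed/` target prefix, and the record-start regex are my assumptions; see the previous blog for the full setup). The key idea: a line that does not start with a quoted timestamp is the continuation of a long query, so it is joined back onto the previous record:

```python
import re

# A record starts with a quoted ISO timestamp, e.g. '2020-03-01T09:00:01Z UTC [ ...
RECORD_START = re.compile(r"^'\d{4}-\d{2}-\d{2}T")

def flatten_records(text):
    """Join continuation lines (embedded newlines in long SQL queries)
    back onto the record they belong to, one record per line."""
    records = []
    for line in text.splitlines():
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            # Not a new record: this line is the tail of a multi-line query.
            records[-1] += " " + line
    return "\n".join(records) + "\n"

def lambda_handler(event, context):
    # Hypothetical S3-trigger wiring; adjust bucket/prefix to your setup.
    import boto3  # imported here so flatten_records stays usable without AWS
    s3 = boto3.client("s3")
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    s3.put_object(
        Bucket=bucket,
        Key="cleansed/" + key.split("/")[-1],
        Body=flatten_records(body).encode("utf-8"),
    )
```

`flatten_records` is a pure function, so you can unit-test the cleansing logic without touching AWS.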
Please refer to my previous blog to understand the Lambda function setup.
You can run the same Python code on an EC2 instance as well: download the files to /tmp/input/ and run the function on them.
Create Spectrum Schema:
Run the following query to create a Spectrum schema. Make sure the following prerequisites are in place:
The RedShift subnets should have a Glue endpoint, a NAT gateway, or an internet gateway.
The RedShift IAM role needs access to S3 and the Glue Data Catalog.
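A sketch of the schema-creation statement (the schema name, catalog database name, and role ARN below are placeholders; substitute your own):

```sql
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```

The `CREATE EXTERNAL DATABASE IF NOT EXISTS` clause creates the Glue catalog database if the crawler hasn't already created it.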
Crawl the data:
Now create a Glue crawler to crawl the S3 bucket containing the cleansed files. It'll create a table for you. There are two advantages here: you can query the same table with Athena, or query it through Redshift Spectrum. The data source is S3 and the target database is spectrum_db.
Once the crawler has finished, you can see this table in the Glue catalog, in Athena, and in the Spectrum schema as well.
Create Table in Athena with DDL:
If you don't want to spend money running the Glue crawler, simply paste the following DDL instead (but replace the table name and S3 location).
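A sketch of such a DDL using Athena's Grok SerDe (the table name, column set, `input.format` pattern, and S3 location are illustrative assumptions; use the pattern and columns from your own classifier):

```sql
CREATE EXTERNAL TABLE spectrum_db.useractivitylog (
  `timestamp` string,
  `db`        string,
  `user`      string,
  `pid`       string,
  `userid`    string,
  `xid`       string,
  `query`     string
)
ROW FORMAT SERDE 'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
  'input.format' = '\'%{TIMESTAMP_ISO8601:timestamp} UTC \[ db=%{DATA:db} user=%{DATA:user} pid=%{DATA:pid} userid=%{DATA:userid} xid=%{DATA:xid} \]\' LOG: %{GREEDYDATA:query}'
)
LOCATION 's3://your-bucket/cleansed/';
```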
We are done now. Let's run a sample query. Here I'm going to use Athena only, but the same query will work on Spectrum.
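For example, to find the most active users by number of SELECT statements (assuming the crawler produced a table named `useractivitylog` with `user` and `query` columns; adjust names to match your schema):

```sql
SELECT "user",
       count(*) AS query_count
FROM spectrum_db.useractivitylog
WHERE lower(query) LIKE 'select%'
GROUP BY "user"
ORDER BY query_count DESC
LIMIT 10;
```

Note that `user` is a reserved word, so it is quoted with double quotes in Athena.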