In my last post, I have experimented to reconstruct the complete
stl_querytext table in RedShift using
LISTAGG() function. But there is a limitation in that. If you read AWS Doc for LISTAGG, you can’t group the rows which are more than 65535 characters. What happens if we have tens of thousands of lines of SQL queries? It’ll throw the limitation error for sure. I didn’t find a much better solution in RedShift to solve this. So I decided to use other services which are cheaper to do this work. And that’s why you are here. Yes, this time I used AWS Athena an interactive analytics platform where we can query the data in an S3 bucket directly. Lets see how to reconstruct all the queries in
stl_querytext table using Athena.
Unload the data:
Athena can’t use the RedShift directly to query the data, we have to export the data into S3 bucket. Here we are going to do the same. But a minor change, the data in stl_querytext table has some special characters like
\n t what ill break our process in Athena. So while exporting the data, we have to replace then with whitespace.
But unfortunately, I was not able to use this SQL syntax in unload command, it didn’t replace anything. (I have request AWS Support to solve this, I’ll this blog once I got the solution), but we can create a view on top of the
stl_querytext to replace these characters.
Now lest unload the data to S3 bucket, make sure you have a proper IAM role to access the S3 bucket and change the bucket name and path as per your need.
Instead of creating the view, you can use the following query to unload the data.
If you want to add compress, please enable it, which will improve your Athena query performance.
Now lets go and create a table in Athena to query the data in S3.
We are almost done, now let’s write some SQL queries to reconstruct the stl_querytext table.
Reconstruct the SQL Queries:
Ohhh!!! Im impressed by its performance, Last time, when I perform the same activity on RedShift some times it took a min to complete, but Athena did a great job. Try it out and comment below if you face any issues.
I faced an issue with Athena recently, what happened is the array_agg() is not maintaining the query order. So some queries may be incorrect in the result reset. We can solve them by using
array_join(array_agg(text ORDER BY sequence), '') AS text and again this syntax will not support in Athena because it 1.7x version. Im not sure when AWS is going to update the Athena’s presto verison. But if you want to use presto in EMR, it’ll work. Use the following query.