RedShift Unload to S3 With Partitions - Stored Procedure Way
Redshift unload is the fastest way to export the data from Redshift cluster. In BigData world, generally people use the data in S3 for DataLake. So its important that we need to make sure the data in S3 should be partitioned. So we can use Athena, RedShift Spectrum or EMR External tables to access that data in an optimized way. If you are dealing with multiple tables, then you can loop the table names in a shell script or Python code. But as a SQL guy, I choose stored procedures to do this. It made export process simple.
In this procedure I just used the options which are suitable for me, but you can use the same procedure and do whatever customization you want. Also you can track the activity of this unload in a metadata table.
Define the parameters:
starttime- timestamp when the process started
endtime- timestamp when the process end
sql- SQL Query that you want to export its results to S3.
s3_path- location of S3 with prefix. Make sure you have end this string with
iamrole- IAM role ARN which has s3 write access.
delimiter- If you are exporting as CSV, you can define your delimiter.
un_year- Partition YEAR
un_month- Partition MONTH
un_day- Partition DAY
In this procedure, I used
GETDATE() function to pass current day, month, year into partition variables. If you want custom one, you can get these variables from input in stored procedure.
Here, Im getting the table name from input. Then I used multiple options like parallel, max file size, include headers and compress. If you don’t want to use this, you can remove these options from the
unload_query. Also if you need you can get the s3 location and other parameters from the user input. You can do many customization here.
Lets try this.
Get the unload History from Meta table:
select * from unload_meta; id | 1 tablename | bhuvi start_time | 2019-08-27 03:42:57 end_time | 2019-08-27 03:43:03 export_query | unload ('select * from bhuvi') to 's3://bhuvi-datalake/clicksteam/2019/8/27/bhuvi_' iam_role 'arn:aws:iam::123123123:role/myredshiftrole' delimiter '|' MAXFILESIZE 100 MB PARALLEL ADDQUOTES HEADER GZIP
In my next blog, I’ll write about how to automate this Unload Process in AWS Glue and convert the CSV to Parquet format.
I have written the updated version of this stored procedure to unload all of the tables in a database to S3. You can read it here: https://thedataguy.in/redshift-unload-all-tables-to-s3/