The Data Guy

Redshift Copy Script From SCT Agent Export S3 (Part 1)


AWS SCT extraction agents help pull data from various data sources and push it into targets; S3 and Snowball can also be targets for an SCT agent. But if we run the SCT agent with multiple tasks and multiple agents, they export the data into S3 or Snowball under a folder structure made of random strings (UUID-like). If you then want to load the data into RedShift, it is very difficult to import from such complex random-string paths. Recently we had first-hand experience with this, where we had to import around 20 TB+ of data from Snowball (later exported to S3) into RedShift. In this blog, I'm going to share my experience and a script that generates the RedShift COPY command from an SCT agent export in S3 or Snowball with random-string folders.


SCT Agent Terminology:

Before explaining the solution, let’s understand how the SCT agent works and its terminology.

  1. Once we create a migration task, it'll split the data export across all the extraction agents.
  2. One table's data will be exported in multiple chunks. These chunks are based on virtual partitions.
  3. Each virtual partition is treated as a subtask for the export.

So the export folder structure will be formed based on the above terminology.

Let's see an example.

Now if you start exporting the data, it'll use the following folder structure.
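As a rough, hypothetical sketch (the bucket name and the UUID-like folder names below are invented, not from an actual export), an export prefix with one task and two subtasks would look something like:

```
s3://migration-bucket/export-prefix/
└── 2f6e9a1c-.../          # task-level folder (random UUID-like name)
    ├── 7b3d5e0f-.../      # subtask 1 (one virtual partition of a table)
    └── 9c4a8d2e-.../      # subtask 2 (another virtual partition)
```

These random-string folder names are exactly what makes a plain COPY from a fixed path impractical.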


Let's compare this with our example.


S3 Export Terminology:

We understand the folder/directory structure, now let’s see what kind of files we’ll get inside these folders.

Refer to the below image, which shows a complete illustration of the SCT agent directory structure. Notice that the subtask ID and the statistics file's name are the same.

RedShift Copy Command From SCT Agent Export In S3


Here you can notice one thing: the export is already in good shape for loading the CSV files into RedShift. Just get all the SQL files and run them, so what is the challenge here? The scale: every table is split into many subtask chunks, so you would have to run a separate COPY for each chunk's folder.

Optimized Approach:

We can generate a custom SQL file with a COPY command that refers to all the CSV files across the multiple subfolders. So with one SQL command, we can load all of our CSV files into RedShift. How do we get that?

  1. Get the list of all the unit.manifest files.
  2. Download the manifest files to your system.
  3. Extract the CSV file locations from all the manifest files. For this, I'm going to use jq, a command-line JSON parser for Linux. We just extract the entries[] array from each unit.manifest file.
  4. Create a new manifest file that contains all the CSV file locations that are extracted from the previous step. And upload this to the new S3 path.
  5. Generate a SQL command that refers to the newly generated manifest.
  6. If you have many tables, instead of repeating these steps for every table, pass the table names and the S3 prefixes in a CSV file, and let the script do the complete work for you.
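Steps 3 and 4 can be sketched in isolation. The snippet below is a self-contained demo with two fabricated unit.manifest files (the bucket, paths, and entries are all invented); it extracts the entries[] arrays with jq and rebuilds them into one merged manifest. Note that jq -s '{entries: .}' achieves in one step what the sed commands in the merge script later on do:

```shell
# Fresh scratch directory for the demo
rm -rf /tmp/jq-demo && mkdir -p /tmp/jq-demo && cd /tmp/jq-demo

# Two fake unit.manifest files, mimicking the SCT export layout
cat > unit.manifest.0 <<'EOF'
{"entries":[{"url":"s3://migration-bucket/a/chunk0.csv.gz","mandatory":true}]}
EOF
cat > unit.manifest.1 <<'EOF'
{"entries":[{"url":"s3://migration-bucket/b/chunk1.csv.gz","mandatory":true}]}
EOF

# Step 3: pull the entries[] out of every manifest
for f in unit.manifest.*; do
  jq '.entries[]' "$f" >> unit.merge
done

# Step 4: wrap the merged entries back into one valid manifest
jq -s '{entries: .}' unit.merge > merged.manifest
cat merged.manifest
```

The resulting merged.manifest has the {"entries": [...]} shape that RedShift's COPY ... MANIFEST option expects.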

Remember how you did the export?

  1. From SCT, we can create a local task to export a single table; in that case the S3 prefix contains data only for that particular table.
  2. Or you can upload a CSV file to SCT containing a list of tables. In this case, your S3 prefix contains all the tables' files. But luckily, one subtask exports only one table's data, so it won't mix multiple tables' data in one subtask.

And make sure you install jq.

apt install jq    # Debian / Ubuntu
yum install jq    # RHEL / CentOS / Amazon Linux

Replace the following things in my script: the S3 bucket name (migration-bucket), the IAM role ARN, the Region, and the target schema name.

Script to generate SQL for individually exported tables:

Prepare a CSV file that contains the list of table names and their S3 prefixes.

E.g. filename: source_generate_copy
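For illustration, here's a hypothetical source_generate_copy with two tables (the table names and prefixes are made up); column 1 is the table name and column 2 is that table's S3 prefix inside the migration bucket:

```shell
cat > source_generate_copy <<'EOF'
orders,sct-export/task-01/orders
customers,sct-export/task-01/customers
EOF
```

Every loop below reads this same file, so one row per table is all the bookkeeping you need.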


Create necessary folders

mkdir /root/copy-sql  
mkdir /root/manifest-list  
mkdir /root/manifest_files

Download the Manifest

I want to download 20 files in parallel, so I used & and ((++count % 20 == 0 )) && wait. If you want to download more, use a higher number, but it's better to keep it below 100.

count=0
while read -r s_table; do
  table=$(echo "$s_table" | awk -F',' '{print $1}')
  s3_loc=$(echo "$s_table" | awk -F',' '{print $2}')
  # List every unit.manifest under this table's prefix
  aws s3 ls --human-readable --recursive s3://migration-bucket/$s3_loc/ | grep "unit.manifest" | awk -F ' ' '{print $5}' > /root/manifest-list/$table.manifest.list
  mkdir -p /root/manifest_files/$table/
  file_num=0
  while read -r r_manifest; do
    aws s3 cp s3://migration-bucket/$r_manifest /root/manifest_files/$table/unit.manifest.$file_num &
    file_num=$(( file_num + 1 ))
    ((++count % 20 == 0 )) && wait
  done < /root/manifest-list/$table.manifest.list
done < source_generate_copy
wait
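The batching trick (& to background each copy, plus ((++count % 20 == 0 )) && wait to pause after every batch) can be demonstrated on its own. This toy version runs 12 dummy jobs in batches of 5, logging to a temp file instead of calling aws s3 cp:

```shell
count=0
logfile=/tmp/batch-demo.log
: > "$logfile"    # truncate the log

for i in $(seq 1 12); do
  # Background "job": stands in for an aws s3 cp
  (sleep 0.1; echo "job $i done" >> "$logfile") &
  # After every 5th launch, wait for the whole batch to drain
  ((++count % 5 == 0)) && wait
done
wait              # catch the final partial batch

wc -l < "$logfile"    # prints 12
```

Without the periodic wait, all 12 jobs would run at once; with it, at most 5 are in flight at a time, which keeps the download script from overwhelming the network or the AWS CLI.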

Merge Manifest into one

while read -r s_table; do
  table=$(echo "$s_table" | awk -F',' '{print $1}')
  # Append the entries[] of every downloaded manifest into one stream
  for file in /root/manifest_files/$table/unit.manifest.*; do
    echo $file
    jq '.entries[]' "$file" >> /root/manifest_files/$table/unit.merge
  done
  # Slurp the merged entries into a JSON array, then rewrap it as {"entries" : [ ... ]}
  jq -s '.' /root/manifest_files/$table/unit.merge > /root/manifest_files/$table/$table.manifest
  sed -i '1c\{"entries" : ['  /root/manifest_files/$table/$table.manifest
  sed -i -e '$a\}'  /root/manifest_files/$table/$table.manifest
done < source_generate_copy

Upload Manifest

while read -r s_table; do
  table=$(echo "$s_table" | awk -F',' '{print $1}')
  aws s3 cp /root/manifest_files/$table/$table.manifest s3://migration-bucket/merged_manifest/
done < source_generate_copy

Generate COPY

Change the options as per your need.

while read -r s_table; do
  table=$(echo "$s_table" | awk -F',' '{print $1}')
  echo "COPY your_schema.$table from 's3://migration-bucket/merged_manifest/$table.manifest' MANIFEST iam_role 'arn:aws:iam::1231231231:role/s3-role' REGION 'ap-south-1' REMOVEQUOTES IGNOREHEADER 1 ESCAPE DATEFORMAT 'auto' TIMEFORMAT 'auto' GZIP DELIMITER '|' ACCEPTINVCHARS '?' COMPUPDATE FALSE STATUPDATE FALSE MAXERROR 0 BLANKSASNULL EMPTYASNULL  EXPLICIT_IDS"  > /root/copy-sql/copy-$table.sql
done < source_generate_copy

All tables' COPY commands will be generated and placed in the /root/copy-sql/ location.

I have written about how to generate the SQL command for bulk exported tables in Part 2.

· aws, redshift, shellscript, automation

