RedShift COPY Command From SCT Agent - Multiple Tables

AWS SCT extraction agents extract data from various sources to S3/Snowball. We have an option to export multiple tables at a time, but the data for these tables is randomly distributed across multiple subdirectories, based on the number of extraction agents. When you try to import the data from S3 to RedShift, you may not know which directory contains which table's data, so you cannot generate a COPY command for it directly. I have already blogged an article explaining the directory structure of the AWS SCT extraction agent, and I highly recommend reading that post before executing these scripts.

Multiple random output folders:

In my case, I had 7 AWS SCT extraction agents, and I selected 40+ tables during the export. Here is the structure of the SCT agents' output in S3.

[Screenshot: the SCT agents' output structure in S3, with the numbered parts described below]

  1. Bucket Name
  2. Folder Name that I gave during the export
  3. Extraction Agent ID
  4. Subtask ID
  5. Task ID
  6. CSV files (8 files in one Unit folder)
  7. Stats File - The naming convention is taskid.json

The actual CSV files are located under the Unit_X folders. One unit folder contains 8 files (as per my SCT settings).

How to find Table Names:

The only way to find the table name is to extract it from the stats file. Each task ID folder contains a bunch of CSV files that belong to the same table. So if we extract the table name from the stats file, then we can easily generate the COPY command.
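
For example, extracting the table name is a one-liner with jq. A sketch, assuming the stats file exposes the table name in a targetTableName field (which is what the script below relies on); the file and table names here are hypothetical:

jq -r '.targetTableName' 678ghi.json
# orders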

unit.manifest file:

Every Unit folder contains a file called unit.manifest. This manifest file has the exact S3 paths of all 8 files in that particular Unit folder. So we can directly use this manifest file to import the data into RedShift via the COPY command.

If you have 16 CSV files, then you will get 2 different unit folders (Unit_1 and Unit_2). Both unit folders contain a unit.manifest file, and the file name is the same in each (unit.manifest).
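
For reference, here is a minimal sketch of what a unit.manifest looks like. The S3 paths are hypothetical and the exact fields may differ; the only part the merge step below depends on is the entries array:

{
  "entries": [
    {"url": "s3://myexport-bucket/10tables-export/123abc/456def/678ghi/unit_1/file-1.csv.gz", "mandatory": true},
    {"url": "s3://myexport-bucket/10tables-export/123abc/456def/678ghi/unit_1/file-2.csv.gz", "mandatory": true}
  ]
}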

Logic behind this script:

Once again, I recommend reading my previous blog to understand this concept.

  1. List all the files from S3 and grep the stats files (they are JSON files, so grep for json).
  2. List all the unit.manifest files with their full S3 paths.
  3. The common string between a stats file and a unit.manifest file's path (the full path that you got from step 2) is the task ID.

Example: unit.manifest path:

s3://bucket/10tables-export/123abc/456def/678ghi/unit_1/unit.manifest

Each task ID folder contains its own stats file. For the above manifest file, the task ID is 678ghi, so this folder should have the stats file.

Stats file path:

s3://bucket/10tables-export/123abc/456def/678ghi/678ghi.json

678ghi is the common string between the manifest file and the stats file.
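
If you want to pull the task ID out of a manifest path by hand, something like this works for the layout above (a sketch that assumes the task ID is always the third-from-last path component):

echo "s3://bucket/10tables-export/123abc/456def/678ghi/unit_1/unit.manifest" | awk -F '/' '{print $(NF-2)}'
# 678ghi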

Stats file example:

[Screenshot: a sample stats (taskid.json) file]
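
A trimmed sketch of a stats file; the only field this workflow relies on is targetTableName, and the table name shown is hypothetical:

{
  "targetTableName": "orders"
}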

At this stage, you have all the manifest files, and they are organized table-wise.

Get the list of stats files

aws s3 ls --recursive s3://myexport-bucket/10tables-export/ | grep "json" | grep -v 'statistic.json'  | awk -F ' ' '{print $4}' > /root/group-tables/group-tables-id-file-list
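
For context, aws s3 ls --recursive prints four columns (date, time, size, key), which is why awk picks column 4. The output looks roughly like this (hypothetical):

2020-01-01 10:00:00    52345 10tables-export/123abc/456def/678ghi/678ghi.json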

Download all the stats files

mkdir -p /root/group-tables/group-tables-id-files/
count=0
while read -r id_file
do
# download in the background, waiting after every 20 parallel copies
aws s3 cp "s3://myexport-bucket/$id_file" /root/group-tables/group-tables-id-files/ &
((++count % 20 == 0)) && wait
done < /root/group-tables/group-tables-id-file-list

Get the list of all manifest files (with full paths)

# with --human-readable, the size is printed as two fields ("1.2 KiB"), so the key is now column 5
aws s3 ls --human-readable --recursive s3://myexport-bucket/10tables-export/ | grep "unit.manifest" | awk -F ' ' '{print $5}' > /root/group-tables/group.manifest.list

Separate the manifest files table-wise

mkdir -p /root/group-tables/manifest-list/
for file in /root/group-tables/group-tables-id-files/*
do
# the task ID is the stats file name without the .json extension
id=$(basename "$file" .json)
table_name=$(jq -r '.targetTableName' "$file")
# collect every unit.manifest key that belongs to this task ID into one list per table
grep "$id" /root/group-tables/group.manifest.list >> /root/group-tables/manifest-list/$table_name.manifest
done
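
After this step, each file under /root/group-tables/manifest-list/ holds the manifest keys for one table, one key per line, roughly like this (hypothetical):

10tables-export/123abc/456def/678ghi/unit_1/unit.manifest
10tables-export/123abc/456def/678ghi/unit_2/unit.manifest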

Download the manifest files - table-wise

mkdir -p /root/group-tables/manifest_files/
for file in /root/group-tables/manifest-list/*
do
# the table name is the list file name without the .manifest extension
table_name=$(basename "$file" .manifest)
mkdir -p /root/group-tables/manifest_files/$table_name
file_num=0
count=0
while read -r l_manifest
do
# rename each unit.manifest on download so they do not overwrite each other
aws s3 cp "s3://myexport-bucket/$l_manifest" /root/group-tables/manifest_files/$table_name/unit.manifest.$file_num &
file_num=$((file_num + 1))
((++count % 20 == 0)) && wait
done < "$file"
done

Merge the manifest files into a single manifest file per table.

for mf_dir in /root/group-tables/manifest_files/*
do
table=$(basename "$mf_dir")
# flatten the entries of every downloaded unit.manifest into one stream of objects
for file in "$mf_dir"/unit.manifest.*
do
jq '.entries[]' "$file" >> "$mf_dir"/unit.merge
done
# slurp the objects back into a JSON array, then wrap it in {"entries" : [...]}
jq -s '.' "$mf_dir"/unit.merge > "$mf_dir"/$table.manifest
sed -i '1c\{"entries" : [' "$mf_dir"/$table.manifest
sed -i -e '$a\}' "$mf_dir"/$table.manifest
done
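
The resulting $table.manifest should be a regular RedShift manifest again, roughly of this shape (paths hypothetical):

{"entries" : [
  {
    "url": "s3://myexport-bucket/10tables-export/123abc/456def/678ghi/unit_1/file-1.csv.gz",
    "mandatory": true
  }
]
}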

Upload the new manifest files to S3

for mf_dir in /root/group-tables/manifest_files/*
do
table=$(basename "$mf_dir")
aws s3 cp "$mf_dir"/$table.manifest s3://myexport-bucket/new-manifest-file/
done

Generate the COPY commands:

Please customize options like the region, file format, compression, IAM role, etc.

mkdir -p /root/group-tables/copy-sql
for mf_dir in /root/group-tables/manifest_files/*
do
table=$(basename "$mf_dir")
echo "COPY your_schema_name.$table FROM 's3://myexport-bucket/new-manifest-file/$table.manifest' MANIFEST iam_role 'arn:aws:iam::1231231231:role/Access-S3' REGION 'ap-south-1' REMOVEQUOTES IGNOREHEADER 1 ESCAPE DATEFORMAT 'auto' TIMEFORMAT 'auto' GZIP DELIMITER '|' ACCEPTINVCHARS '?' COMPUPDATE FALSE STATUPDATE FALSE MAXERROR 0 BLANKSASNULL EMPTYASNULL EXPLICIT_IDS" > /root/group-tables/copy-sql/copy-$table.sql
done
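
To execute the generated files, any SQL client that can talk to RedShift will do. A sketch with psql, where the cluster endpoint, database, user, and table name are placeholders:

psql -h your-cluster.abc123.ap-south-1.redshift.amazonaws.com -p 5439 -d yourdb -U youruser -f /root/group-tables/copy-sql/copy-orders.sql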

Conclusion:

This script saved us a lot of time when exporting 5K tables, since we didn't have to handle them one by one. Hope you find this helpful. Please feel free to leave a comment if you have any issues.
