Backup RedShift Objects To GitHub With Lambda
Back up RedShift database objects like schemas, tables, user grants, and user groups to GitHub with AWS Lambda. We can use the PyGithub library to commit the files from Lambda.
RedShift is built on top of PostgreSQL, but unfortunately there are no native dump tools to back up the RedShift objects. Of course, pg_dump is there, but it will not dump the structure with the column encoding, distribution style, etc. To solve this problem, AWS has an amazing repo, RedShift Utilities, which provides multiple admin views to generate the DDL for almost all the objects. I come from a SQL DBA background, and I read about a DBAtools feature that supports exporting SQL Server objects to GitHub. So I was thinking of implementing such an export utility for RedShift as well, and that's why you are here reading about it.
Prepare the infra:
- Lambda needs to run in a private subnet (for security), so those subnets' IP ranges must be whitelisted in your RedShift security group.
- The Lambda subnets should have a NAT gateway to communicate with GitHub.
- The Lambda function's security group should allow the RedShift port number and the HTTP/HTTPS ports in its outbound rules.
- It's better to use the RedShift master user account.
- In this blog, I'm using environment variables without encrypting them. I highly recommend using KMS or Parameter Store to encrypt the RedShift and GitHub credentials.
- The Lambda basic execution and Lambda VPC execution policies are enough for the Lambda IAM role. You can modify it if you need to.
- Create the following admin views in RedShift's Admin schema.
Lambda Layer for the PyGithub & psycopg2 Libraries:
PyGithub and psycopg2 are not natively available in AWS Lambda, so we are going to use a Lambda layer to provide these libraries. You can simply create a new layer and upload this ZIP file, or manually create the ZIP package by following these steps. I used the Python 3.8 runtime.
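If you build the layer ZIP yourself, remember that Lambda expects a Python layer's packages to sit under a `python/` folder at the root of the archive. A minimal sketch of that packaging step, assuming you have already run something like `pip install PyGithub psycopg2-binary -t <some folder>` (the `build_layer_zip` helper is mine, not part of the downloadable ZIP):

```python
import os
import zipfile

def build_layer_zip(src_dir, zip_path):
    """Zip the installed packages in src_dir so they sit under python/
    at the archive root, the layout Lambda expects for a Python layer."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store each file as python/<path relative to src_dir>
                arcname = os.path.join("python", os.path.relpath(full, src_dir))
                zf.write(full, arcname)
```

Upload the resulting ZIP when you create the layer, then attach the layer to the function.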
Github Access Token:
In the Python code, to authenticate to GitHub we can use either a username and password or an access token. I used an access token for this blog. You can follow this link to generate an access token.
- Once the layer and the token are ready, create a new Lambda function inside a VPC with the Python 3.8 runtime, and add the layer that you created in the previous step.
- Then from Function code –> Actions, select the Upload a zip file option. Download the ZIP file here and upload it.
- The function uses several environment variables, so create the following Env variables in Lambda.
REDSHIFT_DATABASE
REDSHIFT_USER
REDSHIFT_PASSWD
REDSHIFT_PORT
REDSHIFT_ENDPOINT
GITHUB_ACCESS_TOKEN
GITHUB_REPO
GITHUB_PREFIX
I recommend encrypting these with KMS or using Parameter Store. If you use KMS, then in the Lambda function you may need to make some changes in the environment variables section (under the #Env variables line). Also increase the function timeout to 15 minutes.
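Reading those variables can be wrapped in a small loader that fails fast when one is missing, which makes misconfiguration obvious in the Lambda logs. This is a sketch of my own; the `load_config` helper is hypothetical and not part of the downloadable ZIP:

```python
import os

# The eight variables listed above.
REQUIRED_VARS = [
    "REDSHIFT_DATABASE", "REDSHIFT_USER", "REDSHIFT_PASSWD",
    "REDSHIFT_PORT", "REDSHIFT_ENDPOINT",
    "GITHUB_ACCESS_TOKEN", "GITHUB_REPO", "GITHUB_PREFIX",
]

def load_config():
    """Return all required variables as a dict, or raise if any are unset."""
    missing = [v for v in REQUIRED_VARS if v not in os.environ]
    if missing:
        raise RuntimeError("Missing environment variables: %s" % missing)
    return {v: os.environ[v] for v in REQUIRED_VARS}
```

With KMS-encrypted variables you would decrypt each value here instead of reading it directly.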
This is just a code walkthrough, not the complete code. Hit this link to get the complete code as a ZIP file.
Get the available files from Github:
The PyGithub library can create a new file or update an existing file. So we have to make sure that if the file is not already in the repo we use the create call, else the update call. So first, get all the files from the repo.
```python
# Get all the existing files
repo = g.get_user().get_repo(GITHUB_REPO)
all_files = []
contents = repo.get_contents("")
while contents:
    file_content = contents.pop(0)
    if file_content.type == "dir":
        contents.extend(repo.get_contents(file_content.path))
    else:
        file = file_content
        all_files.append(str(file).replace('ContentFile(path="', '').replace('")', ''))
```
Execute the Admin views one by one
Get the list of queries from the queries directory and then execute them on RedShift. Fetch the results and save them into a file.
```python
query_list = ['gen_schema', 'gen_table', 'gen_usergroup', 'gen_adduser_group', 'gen_object_grants']
for query in query_list:
    # Read the SQL query
    with open('queries/' + str(query) + '.sql') as file:
        sqlquery = file.read()
    cur.execute(sqlquery)
    result = cur.fetchall()
    # Process all the rows one by one
    rows = []
    for row in result:
        rows.append(''.join(row).encode().decode('utf-8'))
    # Write the results to a file
    with open(path + query + '.sql', 'w') as file:
        for item in rows:
            file.write("%s\n" % item)
```
Upload files to GitHub
Read the contents of the output files generated in the previous step and upload them to Github. GITHUB_PREFIX is the variable that defines the directory inside the GitHub repo where the files are uploaded. Before uploading, check whether each file already exists in the repo: if not, create it; otherwise update it.
```python
# Read the result file
with open(path + query + '.sql', 'r') as file:
    content = file.read()
# Upload to github
git_prefix = GITHUB_PREFIX
git_file = git_prefix + query + '.sql'
if git_file in all_files:
    contents = repo.get_contents(git_file)
    repo.update_file(contents.path, gitcomment, content, contents.sha, branch="master")
    print(git_file + ' UPDATED')
else:
    repo.create_file(git_file, gitcomment, content, branch="master")
    print(git_file + ' CREATED')
```
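The create-or-update decision above can be isolated into a small pure function, which makes it easy to test without touching GitHub. This is a sketch of my own; `upsert_action` is a hypothetical helper name, not part of the original code:

```python
def upsert_action(query, all_files, git_prefix):
    """Given a query name, the list of files already in the repo, and the
    repo directory prefix, return the target path and whether the upload
    should be a create or an update."""
    git_file = git_prefix + query + ".sql"
    return git_file, ("update" if git_file in all_files else "create")
```

The Lambda handler would then call `repo.update_file` or `repo.create_file` based on the returned action.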
DevOps is changing the database culture as well. The intention of this blog is to keep the RedShift schema and objects on GitHub and integrate them with your CI/CD pipeline. Consider this blog a kickstart doc and keep improving your data pipeline.