The Data Guy

Backup RedShift Objects To GitHub With Lambda

Backup RedShift database objects like Schema, Tables, User grants, user groups to GitHub with AWS Lambda. We can use PyGithub library to commit the files from Lambda.

RedShift is built on top of PostgreSQL. But unfortunately, there are no native dump tools to backup the redshift objects. Of course, pg_dump is there, but it’ll not dump the structure with Column encode, Distribution style, etc. To solve this problem AWS have an amazing repo RedShift Utilities. We have multiple Admin views to generate the DDL for almost all the objects. Im coming from SQL DBA background and I read a feature from DBAtools that will support export SQL Server objects to Github. So I was thinking to implement such export utilitiy for RedShift as well. And thats why you are here to read about it.

Prepare the infra:

Lambda Layer for Github & psycopg2 Library:

PyGithub and psycopg2 are not nativly avaiable in AWS Lambda. So we are going to use a Lambda layer to support these libraries. You can simply create a new layer and upload this ZIP file. Or manually create the ZIP package by using this steps. I used Python 3.8 runtime.

Github Acces Token:

In the Python code, to make the authendication to Github we can use either username and password or access token. I used the access tokens. for this blog. You can follow this link to generate an access token.

Lambda function:

I recommend to encrypt it with KMS or use Parameter store. If you use KMS then from the lambda function, you may need to make some chnges on the Env variables section (under the #Env variables line). Increase the function timeout to 15mins.

Code flow

This is just a code walkthrough, not the complete code. Hit this link to get the complete code in ZIP file.

Get the available files from Github:

PyGithub library will create a new file or update an exising file. So we have make sure that if the file is not available already the use CREATE attribute else use UPDATE attribute. So get all the files from the repo.

# Get all the exsiting files
repo = g.get_user().get_repo(GITHUB_REPO)
all_files = []
contents = repo.get_contents("")
while contents:
    file_content = contents.pop(0)
    if file_content.type == "dir":
        contents.extend(repo.get_contents(file_content.path))
    else:
        file = file_content
        all_files.append(str(file).replace('ContentFile(path="','').replace('")',''))

Execute the Admin views one by one

Get the list of queries from the queries directory and then execute it on RedShift. Fetch the results and save them into a file.

query_list = ['gen_schema', 'gen_table', 'gen_usergroup', 'gen_adduser_group', 'gen_object_grants']
for query in query_list:

    # read SQL query
    with open ('queries/' + str(query) + '.sql') as file:
        sqlquery = file.read()
    cur.execute(sqlquery)
    result = cur.fetchall()
    # Process all the rows one by one
    rows = []
    for row in result:
        rows.append(''.join(row).encode().decode('utf-8'))
    
    # Write the results to a file
    with open(path + query + '.sql', 'w') as file:
        for item in rows:
            file.write("%s\n" % item)

Upload files to GitHub

Read the contents from the output files that are generated from the previous step and then upload it to Github. GITHUB_PREFIX is the variables used to define the directory inside the GitHub repo to upload the files. Before uploading them check the files are already available. If NO then create them else update them.

# Read the result file
with open(path + query + '.sql', 'r') as file:
    content = file.read()

# Upload to github
git_prefix = GITHUB_PREFIX
git_file = git_prefix + query + '.sql'
if git_file in all_files:
    contents = repo.get_contents(git_file)
    repo.update_file(contents.path, gitcomment, content, contents.sha, branch="master")
    print(git_file + ' UPDATED')
else:
    repo.create_file(git_file, gitcomment, content, branch="master")
    print(git_file + ' CREATED')

Conclusion:

DevOps is chaning the database culture as well. The intension of this blog to keep the RedShift schema or objects on Github and intergrate with your CI/CD pipeline. Consider this blog a kickstart doc and keep improvsing your data pipeline.

· aws, redshift, lambda, python, github

Advertisement


Loading Comments