Automate RedShift Vacuum And Analyze with Script
Vacuum and Analyze process in AWS Redshift is a pain point to everyone, most of us trying to automate with their favorite scripting languge. AWS RedShift is an enterprise data warehouse solution to handle petabyte-scale data for you. AWS also improving its quality by adding a lot more features like Concurrency scaling, Spectrum, Auto WLM, etc. But for a DBA or a RedShift admin its always a headache to vacuum the cluster and do analyze to update the statistics. Since its build on top of the PostgreSQL database. But RedShift will do the Full vacuum without locking the tables. And they can trigger the auto vacuum at any time whenever the cluster load is less. But for a busy Cluster where everyday 200GB+ data will be added and modified some decent amount of data will not get benefit from the native auto vacuum feature. You know your workload, so you have to set a scheduled vacuum for your cluster and even we had such a situation where we need to build some more handy utility for my workload.
Vacuum Analyze Utility:
We all know that AWS has an awesome repository for community contributed utilities. We can see a utility for Vacuum as well. But due to some errors and python related dependencies (also this one module is referring modules from other utilities as well). So we wanted to have a utility with the flexibility that we are looking for. And that’s why you are here. We developed(replicated) a shell-based vacuum analyze utility which almost converted all the features from the existing utility also some additional features like DRY RUN and etc. Lets see how it works.
You can get the script from my github repo.
To trigger the vacuum you need to provide three mandatory things.
- RedShift Endpoint
- User Name
- Database Name
This utility will not support cross database vacuum, it’s the PostgreSQL limitation. There are some other parameters that will get generated automatically if you didn’t pass them as an argument. Please refer to the below table.
|-u||User name (super admin user)|
|-P||password for the redshift user||use pgpass file|
|-s||Schema name to vacuum/analyze, for multiple schemas then use comma (eg: ‘schema1,schema2’)||ALL|
|-t||Table name to vacuum/analyze, for multiple tables then use comma (eg: ‘table1,table2’)||ALL|
|-b||Blacklisted tables, these tables will be ignored from the vacuum/analyze||Nothing|
|-k||Blacklisted schemas, these schemas will be ignored from the vacuum/analyze||Nothing|
|-w||WLM slot count to allocate limited memory||1|
|-q||querygroup for the vacuum/analyze, Default=default (for now I didn’t use this in script)||default|
|-a||Perform analyze or not [Binary value, if 1 then Perform 0 means don’t Perform]||1|
|-v||Perform vacuum or not [Binary value, if 1 then Perform 0 means don’t Perform]||1|
|-o||vacuum options [FULL, SORT ONLY, DELETE ONLY, REINDEX ]||SORT ONLY|
|-c||vacuum threshold percentage||80|
|-x||Filter the tables based on unsorted rows from svv_table_info||10|
|-f||Filter the tables based on stats_off from svv_table_info||10|
|-z||DRY RUN - just print the vacuum and analyze queries on the screen [1 Yes, 0 No]||0|
For this, you just need
psql client only, no need to install any other tools/software.
Run vacuum and Analyze on all the tables.
Run vacuum and Analyze on the schema sc1, sc2.
Run vacuum FULL on all the tables in all the schema except the schema sc1. But don’t want Analyze
Run Analyze only on all the tables except the tables tb1,tbl3.
Use a password on the command line.
Run vacuum and analyze on the tables where unsorted rows are greater than 10%.
Run the Analyze on all the tables in schema sc1 where stats_off is greater than 5.
Run the vacuum only on the table tbl1 which is in the schema sc1 with the Vacuum threshold 90%.
Run analyze only the schema sc1 but set the analyze_threshold_percent=0.01
Do a dry run (generate SQL queries) for analyze all the tables on the schema sc2.
Do a dry run (generate SQL queries) for both vacuum and analyze for the table tbl3 on all the schema.
Schedule different vacuum options based on the day
We’ll not full the Vacuum full on daily basis, so If you want to run vacumm only on Sunday and do vacuum
SORT ONLY on the other day’s without creating a new cron job you can handle this from the script.
Just remove this piece of the code.
And add this lines.
For vacumm and Analyze:
For Dry Run:
If you found any issues or looking for a feature please feel free to open an issue on the github page, also if you want to contribute for this utility please comment below.