RStoolKit Results: Ghost rows
RedShift is built on top of PostgreSQL database. In PostgreSQL you may familiar with the term Dead Tuples/Rows. If you delete any row or update any row that particular row is marked for delete. In Redshift update means delete the current row and insert the new row with the updated values. This old row is marked as delete. These rows are called as Ghost rows.
Find Ghost rows:
Use the following query to find the ghost rows for all the tables. You can refer this link to get more detailed information about the ghost rows. https://thedataguy.in/find-ghost-rows-redshift/
with cte as ( select table_id, status, eventtime from stl_vacuum where ( status = 'Started' or status like '%Started Delete Only%' or status like '%Finished%') ), result_set as ( select table_id , max(a.eventtime)as vacuum_timestamp from cte a where a.status like '%Finished%' group by table_id ) , raw_gh as( select query, tbl, perm_table_name , segment, sum(a.rows_pre_filter) as rows_pre_filter , sum(a.rows_pre_user_filter) as rows_pre_user_filter , sum(a.rows_pre_filter-a.rows_pre_user_filter)as ghrows from stl_scan a LEFT JOIN result_set c on a.tbl = c.table_id where a.starttime > coalesce(c.vacuum_timestamp, CURRENT_TIMESTAMP - INTERVAL '5 days') and perm_table_name not in ('Internal Worktable', 'S3') and is_rlf_scan = 'f' and (a.rows_pre_filter <> 0 and a.rows_pre_user_filter <> 0 ) group by segment, query, tbl, perm_table_name ), ran as( select *, dense_rank() over (partition by tbl order by query desc, segment desc) as rnk from raw_gh ) select b."schema", b."table", sum(rows_pre_filter) as total_rows, sum(rows_pre_user_filter) as valid_rows, sum(ghrows) as ghost_rows from ran a join pg_catalog.svv_table_info b on a.tbl = b.table_id where rnk = 1 group by "schema" , "table";
How to fix this problem:
Vacuum is the only way to remove the ghost rows from the tables. Anyhow RedShift also will run the vacuum process automatically when there is no load on the cluster.
- Run the vacuum FULL or Delete to remove the ghost rows.
VACUUM [ FULL | SORT ONLY | DELETE ONLY | REINDEX ] [ [ table_name ] [ TO threshold PERCENT ] [ BOOST ] ]
- Also run the analyze once the Vacuum is done.
ANALYZE [ VERBOSE ] [ [ table_name [ ( column_name [, ...] ) ] ] [ PREDICATE COLUMNS | ALL COLUMNS ]
- Vacuum options in RedShift
- Analyze in RedShift
- RedShift vacuum Utility - Python Based
- Automate Vacuum Analyze Utility - Shell based with more control