Find Ghost Rows/Dead Rows For All The Tables In RedShift
Ghost rows or Dead rows in RedShift is a Red flag for the cluster’s performance. In this article we are going to see how to find the ghost rows or dead rows in RedShift. RedShift is built on top of the PostgreSQL database. To support MVCC PostgreSQL will handle the delete and the updates in a different way. Delete will not remove the data from the disk/block. Instead, the row will be marked as deleted. It’s a soft delete. For Updates its a combination of Delete + Insert. The existing row will be marked as deleted and a new row will be inserted with the updated values. In Redshift, we have 3 Pseudo columns.
I hope the XID for the delete operation for a particular row will be added to this column. It’s internal to redshift, so I can’t give 100% assurance on this statement.
The rows marked as deleted are called Dead Rows(Dead Tuples) in PostgreSQL and in RedShift, we call it as Ghost Rows. These ghost rows will be removed while running the Vacuum. In RedShift, these rows will be removed by,
Vacuum Delete Only
Vacuum short will not remove the Ghost rows. Now how do we identify the Ghost rows in RedShift? Unfortunately, I try to find any system table or view, but nothing is available for now.
From the stl_vacuum_detail there is a column num_deleted_rows will tell you how many dead rows are removed by the vacuum process. It includes both Auto vacuum and manual vacuum details. But it’s only useful after the vacuum.
But STL_SCAN, you have two important columns.
rows_pre_filter - Number of rows read from the block without filtering the ghost rows.
rows_pre_user_filter - Number of rows after eliminating the ghost rows.
It makes sense right? The difference between the rows_pre_filter and rows_pre_user_filter will give you the ghost rows count. I have prepared a query to get the ghost rows info from this table.
The logic behind the query:
Get the recent vacuum timestamp and the table name for the tables from STL_VACUMM.
STL_SCAN has a lot of queries, but we need to pick the query that ran after the vacuum for a table.
Even if you have more queries ran on top of a table after the vacuum, then pick the latest one.