The Data Guy

Use AWS Glue DataBrew To Remove Any Sensitive And PII Information

Advertisement

AWS Glue DataBrew is a serverless service that helps data engineers to cleanse and format their RAW data without writing any single line of code. It’s announced just before reinvent 2020. It has 250+ build-in transformations which reduce the overall efforts for the transformation. In this post, I’ll show you how can we use this DataBrew to remove any PII and other sensitive information from the dataset.

Disclaimer: In this demo, I have used the dummy data which is generated from the https://www.mockaroo.com/.

Glue vs Glue DataBrew:

Solution Overview:

Transformations:

DataBrew Tech terms:

  1. DataSet - A table. It can be a single file or a bunch of files in S3 or a table from the Glue Data catalog.
  2. Project - A place to build your transformation.
  3. Recipe - A transformation logic.
  4. Job - A job that will apply the recipe on top of a dataset.

Sample Data:

create table userdata (
	id INT,
	first_name VARCHAR(50),
	last_name VARCHAR(50),
	email VARCHAR(50),
	gender VARCHAR(50),
	ip_address VARCHAR(20),
	phone VARCHAR(50),
	credit_card VARCHAR(50),
	card_type VARCHAR(50),
	transaction_id VARCHAR(40)
);
-- loading the data from S3.
COPY userdata from 's3://mybucket/raw/MOCK_DATA.csv' iam_role 'something something' CSV;

Data Preparation:

Create dataset:

Go the Glue DataBrew console and select the dataset. Then create a new Dataset. We already added this table into Glue, so select on database name where the table has located and select the userdata table. During the process, the DataBrew needs to get the data out from RedShift and do its process, it needs an S3 bucket to stage the files. So select the S3 path for this.

Create project:

The project is the place where we build the transformation recipes. Create a new project and give a name. Under the recipe details select the Create new recipe.

Once its created you can see sample data.

Sample dataset

Mock the email address:

Select the email column and from the menu select clean -> Replace -> Replace value or pattern.

Mock email

From the value to be replaced, select the Regex option and use the following regex as ^[a-zA-Z0-9+_.-][email protected] Replace with value [email protected]

Mock email

Then click the apply button.

Mock the phone number column:

This is also a regex-based replace.

Keep last 4 digits in CARD number

The sample data has a different format for the card numbers. It’s all integer but the total number of characters in the column is not the same. So we’ll create a new column with the last 4 digits from the CARD Number column and delete the original card number column.

Delete the original CARD column:

We don’t this original column, so select the column. From the menu bar select column -> Delete.

Delete column

Rename the email and phone column:

To indicate the columns are mocked, we will rename the columns as mock_email and mock_phone. Select column -> click the navigation dots -> Rename -> mock_email.

Delete column

Repeat the same for the phone column.

Export the recipe:

From the recipe section, you can see all the transformations. If you want to use this recipe on a regular basis or use it for other projects, we need to publish it. Also, you can export it as YAML and JSON.

Export recipe

If you click on the lineage option, you can see the visual representation of this transformation.

lineage

Run the Job:

This job is saved automatically, so you can directly click on the Create Job button on the right top corner or navigate to Glue DataBrew -> Projects -> select the project -> Run Job.

It’ll ask you to select the target S3 path where the cleansed data will be uploaded. You can add multiple paths(same data will be copied) with multiple formats like CSV, JSON, Parquet and etc.

Run the Job

From the jobs tab, you can see the historical job execution.

Historical run

Job lineage

Concusion:

This is a small PoC that we can use DataBrew for removing sensitive information, but it has more options to play with the data like Join multiple tables, make pivot tables and etc.

Gotchas:

· glue, databrew, etl, security

Advertisement


Loading Comments