AWS, Cloud Computing

3 Mins Read

Data Pre-Processing using SageMaker Data Wrangler – Part 2


Introduction to SageMaker Data Wrangler

As pipelines ingest ever-larger volumes of data from many different sources, the preprocessing needed to manage that data becomes correspondingly difficult. To handle these preprocessing steps, Amazon SageMaker provides a purpose-built capability called SageMaker Data Wrangler. With Data Wrangler, we can process large amounts of data within the pipeline itself; we only need to set up a flow of preprocessing steps inside the Data Wrangler service.


Implementing Data Wrangler Flow

Use an Amazon SageMaker Data Wrangler flow, or data flow, to create and modify a data preparation pipeline. The data flow connects the datasets, transformations, and analyses (steps) you create and can be used to define your pipeline. Each Data Wrangler flow has an Amazon EC2 instance associated with it.

  • Navigate to the Amazon SageMaker Studio console and create a flow under SageMaker Data Wrangler.


  • Now select the instance type based on the preprocessing steps required in the pipeline.


  • After clicking Save, the instance is attached to the Data Wrangler flow.
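Under the hood, a flow created in the Studio UI is saved as a `.flow` file, which is JSON describing a graph of nodes. A heavily simplified sketch of inspecting one (the node fields below are illustrative assumptions, not the full schema):

```python
import json

# A simplified stand-in for a .flow file: real files contain more
# fields per node (parameters, trained_parameters, outputs, etc.).
flow_json = """
{
  "nodes": [
    {"node_id": "src-1",  "type": "SOURCE",    "operator": "sagemaker.s3_source"},
    {"node_id": "type-1", "type": "TRANSFORM", "operator": "sagemaker.spark.infer_data_types"},
    {"node_id": "tf-1",   "type": "TRANSFORM", "operator": "sagemaker.spark.handle_missing"}
  ]
}
"""

flow = json.loads(flow_json)

# Count node kinds to understand the shape of the pipeline.
kinds = {}
for node in flow["nodes"]:
    kinds[node["type"]] = kinds.get(node["type"], 0) + 1

print(kinds)  # {'SOURCE': 1, 'TRANSFORM': 2}
```

Keeping the `.flow` file under version control is a convenient way to review and reuse a preparation pipeline outside the Studio UI.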

Data Flow UI

  • When we import a dataset, it appears as the source in the Data Flow UI. Data Wrangler automatically infers the type of each column in the dataset and creates a new data frame named Data types. We can select this frame to update the inferred data types.


  • Each time we perform a transform step, we create a new data frame. When multiple transform steps (other than Join or Concatenate) are applied to the same dataset, they are stacked.
  • Join and Concatenate create standalone steps that contain the new joined or concatenated dataset. The following diagram shows a data flow with a join between two datasets, as well as two stacks of steps.

[Diagram: a data flow with a join between two datasets and two stacks of transform steps]
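Conceptually, the Data types step and the stacked transforms behave like successive operations on a data frame. A rough local analogy in pandas (the dataset and column names here are made up for illustration):

```python
import pandas as pd

# Toy dataset standing in for an imported source.
df = pd.DataFrame({"age": ["34", "29", "41"], "city": ["Pune", "Delhi", None]})

# "Data types" step: fix the inferred dtype of a column.
df = df.astype({"age": "int64"})

# Stacked transform 1: fill missing values.
df["city"] = df["city"].fillna("Unknown")

# Stacked transform 2: derive a new column from an existing one.
df["is_senior"] = df["age"] > 40

print(df["city"].tolist())  # ['Pune', 'Delhi', 'Unknown']
```

Each transform in Data Wrangler produces a new frame in the flow, much as each pandas statement above produces an updated `df`.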

Adding Steps to a Data Flow

  • We can add steps to the flow by clicking Edit Data Types to change the structure of the data frame.
  • We can add an Add Transform step to transform the columns present in the pipeline.
  • We can add an Add Analysis step to analyze our data at any point in the data flow.
  • We can join two datasets using the Join functionality inside the flow.
  • Concatenating two datasets to form a new dataset is also possible in the data flow.
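The Join and Concatenate steps correspond to familiar data-frame operations. A small pandas analogy (the schemas below are made up for illustration):

```python
import pandas as pd

# Two toy datasets standing in for nodes in the flow.
customers = pd.DataFrame({"id": [1, 2], "name": ["Asha", "Ravi"]})
orders = pd.DataFrame({"id": [1, 1, 2], "amount": [250, 120, 90]})

# Join step: an inner join on the shared key, like Data Wrangler's Join node.
joined = customers.merge(orders, on="id", how="inner")

# Concatenate step: stack two frames with the same schema row-wise.
more_customers = pd.DataFrame({"id": [3], "name": ["Meera"]})
all_customers = pd.concat([customers, more_customers], ignore_index=True)

print(len(joined))         # 3
print(len(all_customers))  # 3
```

As in the flow, both operations produce a standalone new dataset rather than stacking onto an existing one.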

Deleting Steps from a Data Flow

  • We can delete an individual step for nodes in the data flow that have a single input.
  • We cannot delete individual steps for source, join, and concatenate nodes.
  • To delete a step in the Data Wrangler flow:
  1. Choose the group of steps that contains the step to delete.
  2. Choose the icon next to the step.
  3. Choose Delete.

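Programmatically, deleting a single-input step amounts to removing one node from the flow's graph and rewiring its children to its input. A sketch under a simplified, assumed node layout (the real `.flow` schema has more fields):

```python
# Simplified flow graph: each node names its inputs (schema is illustrative).
flow = {
    "nodes": [
        {"node_id": "src-1", "inputs": []},
        {"node_id": "tf-1", "inputs": ["src-1"]},
        {"node_id": "tf-2", "inputs": ["tf-1"]},
    ]
}

def delete_step(flow, node_id):
    """Remove a single-input node and rewire its children to its parent."""
    node = next(n for n in flow["nodes"] if n["node_id"] == node_id)
    if len(node["inputs"]) != 1:
        # Mirrors the UI rule: source/join/concatenate nodes can't be deleted.
        raise ValueError("only single-input steps can be deleted")
    parent = node["inputs"][0]
    flow["nodes"].remove(node)
    for other in flow["nodes"]:
        other["inputs"] = [parent if i == node_id else i for i in other["inputs"]]
    return flow

delete_step(flow, "tf-1")
print([n["node_id"] for n in flow["nodes"]])  # ['src-1', 'tf-2']
```

Note how the guard reflects the restriction above: a node with zero or multiple inputs (source, join, concatenate) is rejected.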

Conclusion

Amazon SageMaker Data Wrangler helps preprocess data within the pipeline. Previously, no single service maintained data integrity during preprocessing while also providing transformations and multiple feature engineering steps, such as handling missing values, dealing with imbalanced data, and handling outliers, automatically in the pipeline itself. SageMaker Studio provides these features, and we can use them in real-time MLOps projects as well, both for the preprocessing stage and for loading the data into a data warehouse.


About CloudThat

CloudThat is an official AWS (Amazon Web Services) Advanced Consulting Partner and Training Partner and a Microsoft Gold Partner, helping people develop knowledge of the cloud and helping businesses aim for higher goals using best-in-industry cloud computing practices and expertise. We are on a mission to build a robust cloud computing ecosystem by disseminating knowledge on technological intricacies within the cloud space. Our blogs, webinars, case studies, and white papers enable all the stakeholders in the cloud computing sphere.

Drop a query if you have any questions regarding SageMaker, and I will get back to you quickly.

To get started, go through our Consultancy page and Managed Services Package, CloudThat's offerings.

FAQs

1. How is code secured with Amazon SageMaker?

ANS: – Amazon SageMaker stores code in ML storage volumes, which are secured by security groups and can optionally be encrypted at rest.

2. What safety measures are SageMaker packed with?

ANS: – It ensures the encryption of all artifacts in transit and at rest. Model artifacts can be stored in encrypted Amazon S3 buckets. SageMaker notebooks, training jobs, and endpoints can be secured with AWS Key Management Service (KMS) keys. The API and the SageMaker console support SSL connections.

WRITTEN BY Arslan Eqbal

