Extract Data from an Image Using AWS Textract

Overview

Many businesses and government organizations still extract data manually from scanned documents such as PDFs, tables, and forms, a process that is slow, expensive, and prone to errors. Modern technology has largely solved this problem for structured forms, from which data can be extracted without human intervention. In other cases, however, data arrives in a wide variety of unstructured documents with no consistent layout. Amazon Textract uses machine learning to process many kinds of documents and to extract text, forms, and tables without manual effort or custom extraction code.

About Amazon Textract

Amazon Textract is a highly scalable machine learning (ML) service that automatically extracts text, handwriting, and data from documents such as images and PDFs. Beyond plain text detection, it can also analyze a document to identify related text, tables, key-value pairs, and selection elements.

When an Amazon Textract operation processes a document, it returns the results as either an array of Block objects or an array of ExpenseDocument objects. Both contain information about the items that were detected, including their location in the document and their relationship to other items in the document.
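
To make the Block structure more concrete, here is a minimal sketch that walks a hand-written sample response. The sample values (IDs, text, confidence scores) are illustrative assumptions, not real Textract output, and the helper names are hypothetical.

```python
from collections import Counter

# Hand-written sample in the general shape of a Textract response.
# All values are illustrative only.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1",
         "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
        {"BlockType": "LINE", "Id": "l1", "Text": "Invoice 1001",
         "Confidence": 99.1,
         "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.05,
                                      "Width": 0.3, "Height": 0.02}},
         "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"BlockType": "WORD", "Id": "w1", "Text": "Invoice", "Confidence": 99.3},
        {"BlockType": "WORD", "Id": "w2", "Text": "1001", "Confidence": 98.9},
    ]
}


def count_block_types(response):
    """Count how many blocks of each BlockType the response contains."""
    return Counter(block["BlockType"] for block in response.get("Blocks", []))


def child_texts(response, block_id):
    """Return the Text of every CHILD block related to the given block."""
    by_id = {block["Id"]: block for block in response.get("Blocks", [])}
    texts = []
    for relationship in by_id[block_id].get("Relationships", []):
        if relationship["Type"] == "CHILD":
            for child_id in relationship["Ids"]:
                child = by_id[child_id]
                if "Text" in child:
                    texts.append(child["Text"])
    return texts


print(count_block_types(sample_response))
print(child_texts(sample_response, "l1"))
```

The location of an item is carried in Geometry, and the relationship between items (here, a line and its words) is carried in Relationships.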

Use Cases

  • Importing documents and forms into business applications
  • Creating smart search indexes
  • Creating automated workflows for document processing
  • Maintaining compliance in document archives
  • Extracting text for Natural Language Processing (NLP)
  • Extracting text for document classification

Architecture Diagram

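[Figure: architecture diagram. An image uploaded to the S3 bucket triggers the Lambda function, which calls Amazon Textract and writes the extracted text to CloudWatch Logs.]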

Steps to Set Up Amazon S3

Step 1: Open the Amazon S3 console.

Step 2: Click Create bucket. Enter a bucket name (e.g., data-extract-from-image) and select the Region in which you want to work.

Step 3: Leave the remaining settings as they are and click Create bucket at the bottom of the page.
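
The same bucket could also be created from code instead of the console. The following is a minimal sketch using boto3, assuming credentials are already configured; the helper names are hypothetical. One detail worth knowing is that the us-east-1 Region must not be passed as a LocationConstraint, while every other Region must be.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build the arguments for S3 create_bucket for the given Region."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and rejects an explicit constraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(bucket_name, region):
    """Create the bucket. Requires boto3 and valid AWS credentials."""
    import boto3  # imported here so the helper above works without boto3

    s3 = boto3.client("s3", region_name=region)
    return s3.create_bucket(**create_bucket_kwargs(bucket_name, region))


# Example arguments for a bucket in Asia Pacific (Mumbai).
print(create_bucket_kwargs("data-extract-from-image", "ap-south-1"))
```

Bucket names are globally unique, so the example name from this post may already be taken and you may need to choose your own.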

Steps to Set Up AWS Lambda

Step 1: Open the AWS Lambda console.

Step 2: Click Create function and enter a function name (e.g., textract-lambda). Then select the Python 3.9 runtime.

Step 3: Select an execution role, which defines the permissions of your Lambda function. Choose the option to create a new role with basic Lambda permissions and click Create function.

Step 4: Open the function's Configuration tab and choose Permissions. Then click the role name to open the role in the IAM console.

Step 5: Attach the AmazonTextractFullAccess and AWSLambdaExecute managed policies to the Lambda execution role.
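
If you prefer to script this step, the sketch below attaches the same two AWS managed policies with the IAM attach_role_policy call. The role name is a placeholder (the console generates its own name for the new role), and the helper names are hypothetical.

```python
# The two AWS managed policies named in Step 5.
MANAGED_POLICIES = ("AmazonTextractFullAccess", "AWSLambdaExecute")


def managed_policy_arn(policy_name):
    """Return the ARN of an AWS managed policy with no path prefix."""
    return f"arn:aws:iam::aws:policy/{policy_name}"


def attach_policies(iam_client, role_name):
    """Attach each managed policy to the role and return the ARNs used."""
    arns = [managed_policy_arn(name) for name in MANAGED_POLICIES]
    for arn in arns:
        iam_client.attach_role_policy(RoleName=role_name, PolicyArn=arn)
    return arns


# Usage (requires boto3 and credentials; the role name is a placeholder):
#   import boto3
#   attach_policies(boto3.client("iam"), "my-textract-lambda-role")
```

Full-access policies are convenient for a demo; a production role would normally be scoped down to the specific Textract and S3 actions it needs.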

Step 6: Add the S3 bucket as a trigger for the Lambda function so that the function runs whenever an object is created in the bucket.
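
Behind the console, an S3 trigger is a bucket notification configuration that points at the function. The sketch below builds such a configuration; the function ARN (including the account ID) is a made-up placeholder, and the optional suffix filter is an addition for illustration rather than part of the steps above.

```python
def s3_trigger_configuration(lambda_arn, suffix=None):
    """Build an S3 notification configuration that invokes a Lambda
    function for every object-created event, optionally filtered by
    key suffix (for example ".png")."""
    config = {
        "LambdaFunctionArn": lambda_arn,
        "Events": ["s3:ObjectCreated:*"],
    }
    if suffix:
        config["Filter"] = {
            "Key": {"FilterRules": [{"Name": "suffix", "Value": suffix}]}
        }
    return {"LambdaFunctionConfigurations": [config]}


# Placeholder ARN; substitute the ARN of your own function.
EXAMPLE_ARN = "arn:aws:lambda:us-east-1:123456789012:function:textract-lambda"
print(s3_trigger_configuration(EXAMPLE_ARN, suffix=".png"))

# Applying it (requires boto3 and credentials):
#   import boto3
#   boto3.client("s3").put_bucket_notification_configuration(
#       Bucket="data-extract-from-image",
#       NotificationConfiguration=s3_trigger_configuration(EXAMPLE_ARN),
#   )
```

When the trigger is added through the Lambda console, the permission that lets S3 invoke the function is added for you; when scripting, that permission has to be granted separately before S3 will accept the configuration.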

Step 7: Add the code to the Lambda function. The code uses the boto3 detect_document_text API, which detects text in the input document and returns it as machine-readable text. After adding the code, save it and click the Deploy button. (GitHub Link)
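
For readers without access to the linked code, the following is a minimal sketch of what such a handler might look like. It is not the author's implementation. It assumes the function is invoked by an S3 object-created event and simply prints each detected line so that it ends up in the function's logs; the textract_client parameter is an addition that makes the handler easy to exercise with a stub.

```python
from urllib.parse import unquote_plus


def extract_lines(textract_client, bucket, key):
    """Call detect_document_text on an S3 object and return the text
    of every LINE block in the response."""
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]


def lambda_handler(event, context, textract_client=None):
    """Entry point: extract text from each uploaded object in the event."""
    if textract_client is None:
        import boto3  # available by default in the Lambda Python runtime

        textract_client = boto3.client("textract")

    results = {}
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (spaces become "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        lines = extract_lines(textract_client, bucket, key)
        for line in lines:
            print(line)  # printed output is written to CloudWatch Logs
        results[key] = lines
    return results
```

detect_document_text is a synchronous call intended for single-page images; multi-page PDFs are typically handled with the asynchronous start_document_text_detection operation instead.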

Step 8: Upload an invoice image to the data-extract-from-image bucket.

Step 9: Open the function's log group in CloudWatch. The log events contain the text extracted from the image.
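
The logs can also be read programmatically. This is a minimal sketch that assumes the function writes to the default log group, which Lambda names /aws/lambda/ followed by the function name; the helper name is hypothetical.

```python
def recent_log_messages(logs_client, function_name, limit=50):
    """Return up to `limit` log messages from the function's default
    CloudWatch log group."""
    response = logs_client.filter_log_events(
        logGroupName=f"/aws/lambda/{function_name}",
        limit=limit,
    )
    return [event["message"].rstrip() for event in response.get("events", [])]


# Usage (requires boto3 and credentials):
#   import boto3
#   for message in recent_log_messages(boto3.client("logs"), "textract-lambda"):
#       print(message)
```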

Conclusion

In this blog, we learned how to use the Amazon Textract API to extract data from an image without any ML experience. This approach can speed up document-driven decision-making and can be applied in industries that handle physical or scanned documents such as legal documents, purchase receipts, inventory reports, invoices, and purchase orders. We will discuss more use cases of other AWS services in our upcoming blogs.

About CloudThat

CloudThat is an official AWS (Amazon Web Services) Advanced Consulting Partner and Training Partner and a Microsoft Gold Partner, helping people develop knowledge of the cloud and helping their businesses aim for higher goals using best-in-industry cloud computing practices and expertise. We are on a mission to build a robust cloud computing ecosystem by disseminating knowledge on technological intricacies within the cloud space. Our blogs, webinars, case studies, and white papers support stakeholders across the cloud computing sphere.

Drop a query if you have any questions regarding Textract and I will get back to you quickly.

To get started, go through our Consultancy page and Managed Services Package, which describe CloudThat's offerings.

FAQs

1. What document formats does Amazon Textract support?

ANS: – Amazon Textract currently supports PNG, JPEG, TIFF, and PDF formats. With the synchronous APIs, you can send images either as an S3 object or as a byte array. With the asynchronous APIs, you send documents as S3 objects. If your document is already in one of the supported formats, do not convert or resample it before uploading it to Amazon Textract.
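
As an illustration of the two input styles, the sketch below builds the Document argument either from raw bytes or from an S3 location, after a simple file-extension check. The extension list is an assumption derived from the formats named above, and the helper names are hypothetical.

```python
from pathlib import Path

# File extensions assumed to correspond to the supported formats.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".pdf"}


def is_supported(filename):
    """Check the file extension against the supported formats."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def document_from_bytes(data):
    """Document argument for synchronous calls using a byte array."""
    return {"Bytes": data}


def document_from_s3(bucket, key):
    """Document argument that references an object stored in S3."""
    return {"S3Object": {"Bucket": bucket, "Name": key}}


# Usage with a local file (requires boto3 and credentials):
#   import boto3
#   with open("invoice.png", "rb") as handle:
#       document = document_from_bytes(handle.read())
#   boto3.client("textract").detect_document_text(Document=document)
```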

2. In which AWS Regions is Amazon Textract available?

ANS: – Amazon Textract is currently available in the US East (N. Virginia), US East (Ohio), US West (Oregon), US West (N. California), AWS GovCloud (US-West), AWS GovCloud (US-East), Canada (Central), EU (Ireland), EU (London), EU (Frankfurt), EU (Paris), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Seoul), and Asia Pacific (Mumbai) Regions.

3. Are there any limits on the number of queries (questions) I can ask per document?

ANS: – Queries are processed on a per-page basis, and information can be extracted using queries through synchronous or asynchronous operations. A maximum of 15 queries per page is supported for synchronous operations. A maximum of 30 queries per page is supported for asynchronous operations.
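
To show how the synchronous limit might be enforced on the client side, here is a minimal sketch that builds the arguments for an analyze_document call with the QUERIES feature. The helper name and the example question are hypothetical; the limit of 15 comes from the answer above.

```python
MAX_SYNC_QUERIES_PER_PAGE = 15  # synchronous limit quoted in the answer above


def build_queries_request(bucket, key, questions):
    """Build analyze_document arguments for a single-page document,
    rejecting more questions than the synchronous limit allows."""
    if len(questions) > MAX_SYNC_QUERIES_PER_PAGE:
        raise ValueError(
            f"At most {MAX_SYNC_QUERIES_PER_PAGE} queries per page are "
            f"supported synchronously, got {len(questions)}"
        )
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {"Queries": [{"Text": q} for q in questions]},
    }


# Usage (requires boto3 and credentials):
#   import boto3
#   request = build_queries_request(
#       "data-extract-from-image", "invoice.png", ["What is the invoice total?"]
#   )
#   response = boto3.client("textract").analyze_document(**request)
```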

WRITTEN BY Modi Shubham Rajeshbhai

Shubham Modi is a Research Associate - Data and AI/ML at CloudThat. He is focused, enthusiastic, and keen to learn new things in Data Science on the Cloud. He has worked with AWS, Azure, Machine Learning, and many other technologies.
