Customizing Amazon Textract for Smarter Document Analysis

Introduction

Amazon Textract is a machine learning service that extracts text, forms, and tables from scanned documents. For many organizations, the pre-trained model alone does not meet every document processing requirement. This is where Custom Queries come in: with Amazon Textract adapters, you can customize the model’s behavior for your own documents and improve extraction accuracy on the fields your business cares about.

Importance of Custom Queries

In today’s data-driven environment, businesses grapple with high volumes of unstructured or semi-structured documents such as invoices, contracts, and receipts that can vary widely in format. Manual data extraction from these documents often becomes time-consuming and error-prone, increasing costs and compliance risks.

To address this challenge, Amazon Textract introduces Custom Queries, enabling organizations to train the model on specific, predefined questions (e.g., “What is the total amount due?” or “What is the invoice number?”) and focus extraction efforts on the most crucial details.
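
In API terms, each predefined question becomes one entry in the QueriesConfig parameter of the AnalyzeDocument operation. The following is a minimal sketch in Python; the helper name build_queries_config and the aliases are illustrative choices, not part of the original walkthrough.

```python
def build_queries_config(questions):
    """Turn {alias: question} into the QueriesConfig structure for AnalyzeDocument."""
    return {
        "Queries": [
            # "Alias" is an optional short label that makes answers easier to look up.
            {"Text": text, "Alias": alias}
            for alias, text in questions.items()
        ]
    }


queries_config = build_queries_config({
    "TOTAL_DUE": "What is the total amount due?",
    "INVOICE_NUMBER": "What is the invoice number?",
})
```

Keeping the questions in one dictionary makes it easier to use the same wording during annotation and at inference time.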

By guiding the model to search for targeted information rather than capturing every bit of text, Custom Queries speed up document processing and boost accuracy. This targeted approach frees up resources, reduces the need for manual review, and scales across diverse document types, which makes it valuable for any organization that wants to streamline data extraction and improve operational efficiency.

Creating an Adapter

The first step in working with Custom Queries is to create an adapter. This custom component tells Amazon Textract how to process your documents based on the unique questions you want to ask.

1. Sign in to the AWS Management Console and navigate to the Amazon Textract console.

2. Under the “Custom Queries” section, select “Create Adapter.”

3. Provide the adapter details, such as its name and configuration options, including whether to enable automatic updates (which keep the adapter up to date with any pre-trained features).

4. After creation, the console displays the new adapter’s details, including the adapter name and Adapter ID, which confirms that the adapter was created.
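
The same adapter can also be created through the API instead of the console. The sketch below assumes Python with boto3 and the Textract CreateAdapter operation; the adapter name, description, and region are placeholders, and the helper function names are hypothetical.

```python
def build_create_adapter_request(name, description="", auto_update=True):
    """Build keyword arguments for the Textract CreateAdapter operation."""
    request = {
        "AdapterName": name,
        "FeatureTypes": ["QUERIES"],  # this walkthrough customizes the Queries feature
        "AutoUpdate": "ENABLED" if auto_update else "DISABLED",
    }
    if description:
        request["Description"] = description
    return request


def create_adapter(request, region="us-east-1"):
    """Send the request; needs boto3 and AWS credentials with Textract access."""
    import boto3  # imported lazily so the builder above works without the AWS SDK

    client = boto3.client("textract", region_name=region)
    return client.create_adapter(**request)["AdapterId"]


request = build_create_adapter_request("invoice-adapter", "Invoice fields")
# adapter_id = create_adapter(request)  # uncomment to call AWS
```

The returned Adapter ID is the same identifier the console shows on the adapter details page.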

Dataset Creation

Following the creation of your adapter, you must create a training dataset for it. This means uploading sample documents or images, which will be used to train your adapter to detect the kinds of information you are interested in.

1. Choose your adapter and click on “Create Dataset.”

2. You can manually split the dataset into training and testing subsets or use the Autosplit option, which automatically divides the documents.

3. Upload files either from your local machine or an Amazon S3 bucket. For this tutorial, a local machine will do, but uploading your files directly to Amazon S3 is always preferred, especially when working with larger datasets.

4. In the Test dataset details section, you can select Upload documents from your computer or Import documents from Amazon S3 bucket. For this tutorial, select Upload documents from your computer.

5. Choose Create dataset.

6. After creating the dataset, you will be taken to the Dataset details page. This page lists all the documents in your dataset and shows which part of the dataset (train or test) each document has been assigned to, under the “Dataset assigned to” column. You can also view the following:

1. Document name
2. Document status
3. Number of pages in the document
4. Document type
5. Document size
6. Whether the document is part of the training set or the testing set

7. Select Add documents to Dataset and add at least five documents to your training and testing datasets. If you previously selected Autosplit, you can add all the documents simultaneously.

Once the documents are uploaded, Amazon Textract organizes them into a training set (used to teach the model) and a testing set (used to evaluate performance).
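
If you split the dataset manually, the assignment could be prepared with a small script like the one below. This does not describe how Autosplit works internally; the 80/20 ratio, the function name, and the file names are assumptions for illustration, and the minimum of five documents is applied to each subset.

```python
import random

MIN_DOCS_PER_SPLIT = 5  # based on the "at least five documents" guidance above


def split_dataset(documents, test_fraction=0.2, seed=0):
    """Shuffle documents reproducibly and split them into (train, test) lists."""
    if len(documents) < 2 * MIN_DOCS_PER_SPLIT:
        raise ValueError("need at least 10 documents to fill both subsets")
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    # Reserve test_fraction for testing, but never fewer than the minimum,
    # and always leave at least the minimum for training.
    n_test = max(MIN_DOCS_PER_SPLIT, round(len(shuffled) * test_fraction))
    n_test = min(n_test, len(shuffled) - MIN_DOCS_PER_SPLIT)
    return shuffled[n_test:], shuffled[:n_test]


docs = [f"invoice_{i:03d}.pdf" for i in range(30)]  # hypothetical file names
train, test = split_dataset(docs)
```

A fixed seed keeps the split stable between runs, so a document does not move from the test set into the training set when you add more data later.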

Annotating Documents

With the dataset in place, the next critical step is annotation. This step involves linking specific queries to portions of the document where the answer can be found. For example, if you’re looking for the “Invoice Amount,” you need to mark the specific text containing this information.

1. In the console, select Create Queries and define the questions you want the model to answer (e.g., “What is the invoice date?”).

2. You can choose between text-based responses or binary responses (Yes/No questions).

3. Auto-labeling is available to assign labels to documents automatically. This is highly recommended for first-time users.

4. After auto-labeling, you can verify the labels’ accuracy and make necessary corrections.

This ensures that the model learns to associate the right query with the correct part of the document.

Training the Adapter

Once the documents are annotated, it’s time to train your adapter. This process involves using the labeled training dataset to teach the model how to recognize the answers to your specific queries.

1. In the console, navigate to your adapter’s details page and select Train Adapter.

2. The training process can take some time, depending on the size of your dataset. Once training is complete, you can evaluate the model’s performance by running it on the test dataset.
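
Training can also be started and monitored through the API, where each training run produces a new adapter version. The sketch below assumes a boto3 Textract client, a dataset manifest file already stored in Amazon S3, and an output bucket; the bucket names, manifest key, helper names, and polling interval are placeholders.

```python
import time

IN_PROGRESS = "CREATION_IN_PROGRESS"


def start_training(client, adapter_id, manifest_bucket, manifest_key, output_bucket):
    """Start training by creating a new adapter version; returns the version id."""
    response = client.create_adapter_version(
        AdapterId=adapter_id,
        DatasetConfig={
            "ManifestS3Object": {"Bucket": manifest_bucket, "Name": manifest_key}
        },
        OutputConfig={"S3Bucket": output_bucket},
    )
    return response["AdapterVersion"]


def wait_for_training(client, adapter_id, version, poll_seconds=60, max_polls=120):
    """Poll GetAdapterVersion until the status leaves CREATION_IN_PROGRESS."""
    for _ in range(max_polls):
        status = client.get_adapter_version(
            AdapterId=adapter_id, AdapterVersion=version
        )["Status"]
        if status != IN_PROGRESS:
            return status  # e.g. "ACTIVE" on success, "CREATION_ERROR" on failure
        time.sleep(poll_seconds)
    raise TimeoutError("adapter version is still training")


# Usage (requires AWS credentials):
#   import boto3
#   client = boto3.client("textract")
#   version = start_training(client, adapter_id, "my-bucket", "manifest.jsonl", "my-output-bucket")
#   print(wait_for_training(client, adapter_id, version))
```

Passing the client in as a parameter keeps the polling logic easy to test with a stand-in object before running it against AWS.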

Evaluating and Improving the Adapter

After training the model, you need to assess its accuracy. Amazon Textract provides detailed performance metrics to help you understand how well your adapter is performing. If the results are unsatisfactory, you can retrain the adapter with additional annotations or more data to improve its accuracy.
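
The performance metrics can also be read programmatically. The sketch below assumes the response shape of the GetAdapterVersion operation, whose EvaluationMetrics field reports F1 score, precision, and recall for both the baseline model and the adapter version; the helper name and the sample values are hypothetical.

```python
def summarize_metrics(adapter_version_response):
    """Return per-feature gains of the adapter over the baseline model."""
    summary = {}
    for entry in adapter_version_response.get("EvaluationMetrics", []):
        baseline, adapter = entry["Baseline"], entry["AdapterVersion"]
        summary[entry["FeatureType"]] = {
            name: round(adapter[name] - baseline[name], 4)
            for name in ("F1Score", "Precision", "Recall")
        }
    return summary


# Hand-written sample shaped like a GetAdapterVersion response (values are made up).
sample_response = {
    "Status": "ACTIVE",
    "EvaluationMetrics": [
        {
            "FeatureType": "QUERIES",
            "Baseline": {"F1Score": 0.70, "Precision": 0.75, "Recall": 0.65},
            "AdapterVersion": {"F1Score": 0.90, "Precision": 0.95, "Recall": 0.85},
        }
    ],
}
gains = summarize_metrics(sample_response)
```

A positive gain means the adapter outperformed the baseline on the test dataset; a gain near zero or below suggests adding more annotated documents before retraining.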

Using the Adapter for Document Analysis

Once satisfied with the model’s performance, you can use the adapter to process new documents. This involves submitting documents to the AnalyzeDocument API together with your adapter’s ID and version, so that Textract uses your custom-trained adapter to answer the predefined queries.
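
A request of this kind could be sketched as follows, assuming Python with boto3. The adapter ID, version, and sample response are placeholders, and the helper names are hypothetical; the response parsing relies on QUERY blocks pointing to QUERY_RESULT blocks through ANSWER relationships.

```python
def build_analyze_request(image_bytes, queries, adapter_id, adapter_version):
    """Build keyword arguments for AnalyzeDocument with a custom adapter."""
    return {
        "Document": {"Bytes": image_bytes},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {"Queries": [{"Text": q} for q in queries]},
        "AdaptersConfig": {
            "Adapters": [{"AdapterId": adapter_id, "Version": adapter_version}]
        },
    }


def extract_answers(response):
    """Map each query text to its first answer text, or None if unanswered."""
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}
    answers = {}
    for block in blocks.values():
        if block["BlockType"] != "QUERY":
            continue
        answer_ids = [
            block_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "ANSWER"
            for block_id in rel["Ids"]
        ]
        texts = [blocks[i].get("Text", "") for i in answer_ids if i in blocks]
        answers[block["Query"]["Text"]] = texts[0] if texts else None
    return answers


# Hand-written sample shaped like an AnalyzeDocument response (values are made up).
sample_response = {
    "Blocks": [
        {"Id": "q1", "BlockType": "QUERY",
         "Query": {"Text": "What is the invoice number?"},
         "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
        {"Id": "a1", "BlockType": "QUERY_RESULT", "Text": "INV-1042", "Confidence": 97.0},
        {"Id": "q2", "BlockType": "QUERY",
         "Query": {"Text": "What is the total amount due?"}},
    ]
}
answers = extract_answers(sample_response)

# Usage (requires AWS credentials):
#   import boto3
#   client = boto3.client("textract")
#   with open("invoice.png", "rb") as f:
#       request = build_analyze_request(f.read(), ["What is the invoice number?"], adapter_id, "1")
#   print(extract_answers(client.analyze_document(**request)))
```

For best results, use the same query wording at inference time that you used when annotating the training documents.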

Conclusion

Custom Queries in Amazon Textract enhance the accuracy and efficiency of document processing by allowing you to tailor the model to your specific needs. Whether you’re extracting information from invoices, contracts, or forms, the ability to ask custom questions helps your business automate document handling with precision. By creating and training custom adapters, annotating datasets, and refining model performance, you can use Amazon Textract to streamline your document workflows.

By following this blog, you can harness the full potential of Amazon Textract’s Custom Queries, making document processing smarter and more adaptable to your unique requirements.

FAQs

1. What are Custom Queries in Amazon Textract?

ANS: – Custom Queries allow you to tailor Amazon Textract to answer specific questions about documents, such as “What is the total amount due?” or “What is the invoice number?” for more precise text extraction.

2. How do I create a custom adapter for queries in Amazon Textract?

ANS: – To create a custom adapter, sign in to the AWS Management Console, navigate to Amazon Textract, and select “Create Adapter” to set up the configuration and begin customizing your document processing.