Overview
In today’s AI-driven world, efficiently handling large volumes of data is key to unlocking valuable insights. One critical aspect of data processing, especially in Retrieval-Augmented Generation (RAG) systems, is how data is chunked. Traditional methods, such as fixed-size chunking or no chunking, do not always yield the best retrieval performance. This blog introduces two advanced data chunking techniques, semantic chunking and hierarchical chunking, along with the option to apply custom chunking logic using AWS Lambda. These approaches, now available in Amazon Bedrock Knowledge Bases, aim to preserve contextual integrity and improve retrieval quality.
Introduction
Data chunking isn’t simply about breaking data into smaller pieces; it’s about transforming it into a format that language models and retrieval systems can work with effectively. The real question is not “How do I chunk my data?” but “How can I best organize my data so that it’s efficient for retrieval and task completion?” With this in mind, Amazon Bedrock Knowledge Bases offers advanced chunking strategies such as semantic chunking and hierarchical chunking. These techniques partition data in more refined ways, improving the ability of foundation models (FMs) to retrieve relevant and coherent information from a large corpus.
Advanced Data Chunking in Amazon Bedrock
Amazon Bedrock Knowledge Bases introduces two new chunking methods beyond traditional techniques: semantic chunking and hierarchical chunking. These new methods focus on preserving the context and relationships between different parts of the data, thereby improving the quality of results generated by RAG models.
Semantic Chunking
Semantic chunking focuses on breaking data into segments based on meaning and context. Instead of simply chopping data into equal parts, this method analyzes the relationships between sentences or paragraphs, creating chunks that preserve the integrity of the information. This is especially useful in cases where maintaining the semantic meaning is critical, such as in legal or technical documents.
For example, consider a technical manual that describes complex machinery operations. Semantic chunking ensures that instructions and descriptions related to specific functions stay together, making it easier for the model to retrieve and provide coherent responses.
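To make the idea concrete, here is a minimal Python sketch of semantic chunking. It is an illustration only, not the algorithm Amazon Bedrock runs: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the names `semantic_chunks`, `buffer_size`, and `breakpoint_percentile` are hypothetical. The sketch compares each sentence group with the next one and starts a new chunk wherever the difference is unusually large. It does not enforce a maximum chunk size.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_chunks(sentences, buffer_size=1, breakpoint_percentile=95):
    """Split where the distance between neighboring sentence groups is unusually high."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []

    # Embed each sentence together with its neighbors (the "buffer").
    groups = [
        embed(" ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1]))
        for i in range(len(sentences))
    ]
    distances = [
        cosine_distance(groups[i], groups[i + 1]) for i in range(len(groups) - 1)
    ]

    # Nearest-rank percentile: only distances above this cutoff become breakpoints.
    ranked = sorted(distances)
    rank = max(1, math.ceil(breakpoint_percentile / 100 * len(ranked)))
    cutoff = ranked[rank - 1]

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > cutoff:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    manual = [
        "Open the valve slowly.",
        "Close the valve after use.",
        "The warranty covers two years.",
        "The warranty excludes misuse.",
    ]
    for chunk in semantic_chunks(manual, buffer_size=0, breakpoint_percentile=50):
        print(chunk)
```

With very short inputs, a high percentile leaves no distance above the cutoff, so the sketch returns a single chunk; the example therefore uses a lower percentile than would be typical for real documents.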
To use semantic chunking in Amazon Bedrock:
- In the Knowledge Base creation process, choose the Advanced (customization) option under chunking and parsing configurations.
- Select Semantic chunking from the drop-down menu.
- Configure parameters such as:
  - Max buffer size for grouping surrounding sentences: This defines how many neighboring sentences to include when evaluating semantic similarity. A buffer size of 1 includes the current sentence, the one before, and the one after.
  - Max token size for a chunk: The maximum number of tokens a chunk can contain, ranging from 20 to 8,192 tokens.
  - Breakpoint threshold: The percentile of dissimilarity between neighboring sentence groups above which a chunk boundary is drawn, with a recommended value of 95. A higher threshold produces fewer, larger chunks.
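The same settings can also be expressed programmatically. The sketch below builds the configuration as a plain Python dict. The key names follow the `vectorIngestionConfiguration` shape of the Bedrock `CreateDataSource` API as understood at the time of writing, so verify them against the current API reference. The helper name `semantic_chunking_config` and the default of 300 tokens are assumptions made for illustration; the 20 to 8,192 range comes from the description above.

```python
def semantic_chunking_config(buffer_size=1, max_tokens=300, breakpoint_percentile=95):
    """Build a semantic chunking configuration dict (hypothetical helper)."""
    # Range taken from the parameter description in this post.
    if not 20 <= max_tokens <= 8192:
        raise ValueError("max_tokens must be between 20 and 8192")
    return {
        "chunkingConfiguration": {
            "chunkingStrategy": "SEMANTIC",
            "semanticChunkingConfiguration": {
                "bufferSize": buffer_size,
                "maxTokens": max_tokens,
                "breakpointPercentileThreshold": breakpoint_percentile,
            },
        }
    }


if __name__ == "__main__":
    print(semantic_chunking_config())
```

Assuming the boto3 `bedrock-agent` client, a dict like this would be passed as the `vectorIngestionConfiguration` argument of `create_data_source`, alongside the knowledge base ID and the data source definition.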
Hierarchical Chunking
Hierarchical chunking organizes data into a tree-like structure, breaking it into larger parent chunks and smaller child chunks. This structure enables efficient and granular information retrieval, making it easier for the model to retrieve relevant data based on its inherent relationships.
For instance, in an academic paper, hierarchical chunking can break the document into sections like introduction, methodology, and conclusion (parent chunks), with each section further divided into sub-sections or paragraphs (child chunks). During retrieval, the system searches within the child chunks and returns the corresponding parent chunk, combining precise matching with comprehensive context.
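The search-the-child, return-the-parent pattern can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Bedrock implementation: words stand in for tokens, plain word overlap stands in for embedding similarity, and the function names `split_words`, `build_hierarchy`, and `retrieve` are hypothetical.

```python
def split_words(text, size, overlap=0):
    """Split text into windows of `size` words, repeating `overlap` words between windows."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    windows, start = [], 0
    while start < len(words):
        windows.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
        start += step
    return windows


def build_hierarchy(text, parent_size, child_size, child_overlap=0):
    """Return parent chunks and (child_text, parent_index) pairs."""
    parents = split_words(text, parent_size)
    children = [
        (child, parent_id)
        for parent_id, parent in enumerate(parents)
        for child in split_words(parent, child_size, child_overlap)
    ]
    return parents, children


def retrieve(query, parents, children):
    """Match the query against child chunks, then return the parent of the best match."""
    query_terms = set(query.lower().split())

    def score(item):
        child_text, _ = item
        return len(query_terms & set(child_text.lower().split()))

    _, parent_id = max(children, key=score)
    return parents[parent_id]


if __name__ == "__main__":
    text = "alpha beta gamma delta epsilon zeta eta theta"
    parents, children = build_hierarchy(text, parent_size=4, child_size=2)
    print(retrieve("eta", parents, children))
```

Matching on small child chunks keeps the search precise, while returning the parent gives the model the wider context around the match.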
To implement hierarchical chunking:
- Select Hierarchical chunking under the Advanced (customization) options during Knowledge Base creation.
- Configure the following:
  - Max parent token size: The maximum number of tokens a parent chunk can contain (up to 8,192 tokens).
  - Max child token size: The maximum number of tokens a child chunk can contain, usually around 300 tokens.
  - Overlap tokens between chunks: The number of tokens repeated between consecutive child chunks, typically about 20% of the max child token size.
By maintaining the hierarchical relationship between parent and child chunks, this method ensures that context is preserved across different levels of granularity. This is ideal for complex, nested datasets like legal contracts or research papers.
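As with semantic chunking, the hierarchical settings can be written as a configuration dict. The sketch below is hedged in the same way: the key names follow the `CreateDataSource` API shape as understood at the time of writing and should be checked against the current API reference. The helper name `hierarchical_chunking_config` and the default parent size of 1,500 tokens are assumptions; the overlap is derived as 20% of the child size, as described above.

```python
def hierarchical_chunking_config(
    parent_max_tokens=1500, child_max_tokens=300, overlap_percent=20
):
    """Build a hierarchical chunking configuration dict (hypothetical helper)."""
    if not 0 < child_max_tokens < parent_max_tokens <= 8192:
        raise ValueError("expected 0 < child_max_tokens < parent_max_tokens <= 8192")
    # Convert the percentage into a token count relative to the child size.
    overlap_tokens = child_max_tokens * overlap_percent // 100
    return {
        "chunkingConfiguration": {
            "chunkingStrategy": "HIERARCHICAL",
            "hierarchicalChunkingConfiguration": {
                # First entry is the parent level, second entry is the child level.
                "levelConfigurations": [
                    {"maxTokens": parent_max_tokens},
                    {"maxTokens": child_max_tokens},
                ],
                "overlapTokens": overlap_tokens,
            },
        }
    }


if __name__ == "__main__":
    print(hierarchical_chunking_config())
```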
Custom Processing with AWS Lambda
If the built-in chunking options aren’t sufficient for your use case, you can implement custom chunking logic using AWS Lambda functions. With AWS Lambda, you can go beyond chunking and apply custom logic for metadata processing or advanced data parsing.
To set up custom processing:
- Write an AWS Lambda function that defines your custom chunking logic, or integrate a chunking method from a framework such as LangChain or LlamaIndex.
- If you use such a framework, create an AWS Lambda layer that packages it.
- Choose the appropriate Lambda function in the chunking and parsing configuration in the Knowledge Base creation workflow.
This level of customization allows you to tailor the chunking process to your specific requirements, adding another layer of flexibility to your RAG model.
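As a rough sketch of what the chunking logic inside such a Lambda function might look like, the Python below re-splits a batch of content items into smaller word-based chunks while carrying metadata along. The field names `contentBody`, `contentType`, and `contentMetadata` are assumed to mirror the batch format that Knowledge Bases hands to a transformation Lambda; confirm them in the AWS documentation. The function name `rechunk` and the `max_words` parameter are hypothetical, and the Lambda handler itself, including reading input from and writing output to Amazon S3, is deliberately left out.

```python
def rechunk(file_contents, max_words=50):
    """Split each item's body into chunks of at most `max_words` words.

    `file_contents` is a list of dicts with the assumed keys
    `contentBody`, `contentType`, and `contentMetadata`.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    chunks = []
    for item in file_contents:
        words = item.get("contentBody", "").split()
        for start in range(0, len(words), max_words):
            chunks.append(
                {
                    "contentBody": " ".join(words[start:start + max_words]),
                    # Type and metadata are copied unchanged onto every chunk.
                    "contentType": item.get("contentType", ""),
                    "contentMetadata": dict(item.get("contentMetadata", {})),
                }
            )
    return chunks


if __name__ == "__main__":
    batch = [
        {
            "contentBody": "one two three four five",
            "contentType": "text",
            "contentMetadata": {"source": "manual"},
        }
    ]
    for chunk in rechunk(batch, max_words=2):
        print(chunk["contentBody"])
```

Keeping the chunking logic in a pure function like this makes it easy to test locally before wiring it into a Lambda handler.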
Conclusion
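Effective data chunking is critical for improving retrieval efficiency and accuracy in AI systems. Semantic chunking and hierarchical chunking in Amazon Bedrock Knowledge Bases go beyond fixed-size splitting by preserving the meaning and structure of the source data.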
For even more control, custom chunking logic can be implemented using AWS Lambda, providing flexibility to adapt the chunking process to unique requirements.
FAQs
1. What is the difference between semantic chunking and hierarchical chunking?
ANS: – Semantic chunking focuses on dividing the data based on meaning and context, ensuring that related information stays together. On the other hand, hierarchical chunking organizes the data into a tree-like structure, with parent and child chunks, to maintain contextual relationships across different levels of the document.
2. When should I use custom chunking with AWS Lambda?
ANS: – Custom chunking with AWS Lambda should be used when the built-in chunking options don’t meet the specific needs of your application. For instance, if you have unique chunking requirements based on your data format or need to apply additional metadata processing, a custom AWS Lambda function can provide the flexibility to address these challenges.
WRITTEN BY Suresh Kumar Reddy
Yerraballi Suresh Kumar Reddy is working as a Research Associate - Data and AI/ML at CloudThat. He is a self-motivated and hard-working Cloud Data Science aspirant who is adept at using analytical tools for analyzing and extracting meaningful insights from data.