Overview
In the era of data-driven decision-making, the ability to efficiently query and analyze vast amounts of data has become a critical asset for businesses. Amazon Athena, a serverless interactive query service provided by Amazon Web Services (AWS), empowers organizations to analyze data stored in Amazon S3 using standard SQL queries without complex infrastructure management. One of the key features that contribute to Amazon Athena’s exceptional performance is partitioning. In this blog, we’ll dive into Amazon Athena partitioning, exploring its benefits, implementation, and best practices for optimizing data analysis.
Introduction
Partitioning is a data organization technique that divides a large dataset into smaller, more manageable subsets or partitions based on specific criteria. These criteria typically align with the natural structure of the data, such as dates, categories, or geographical locations. Partitioning enhances query performance by reducing the amount of data that needs to be scanned when executing queries.
Use Cases
- Customer Segmentation: Analyzing data by specific attributes (e.g., region, time frame) to enable targeted analysis.
- Multi-Account Organization: In an organization where each account produces data such as events or application logs, partitioning the data by account ID keeps it isolated, secure, and easy to query.
Benefits of Amazon Athena Partitioning
- Improved Query Performance: Partitioning allows Amazon Athena to skip scanning irrelevant partitions when executing queries. This optimization significantly reduces query execution time and improves overall performance, enabling faster and more responsive data analysis (see the example after this list).
- Cost Efficiency: By scanning only the relevant partitions, Amazon Athena reduces the data processed, leading to cost savings. As you pay based on the amount of data scanned, partitioning can help optimize your query costs.
- Simplified Data Management: Partitioning aligns data storage with the logical structure of the dataset. This organization simplifies data management, making it easier to locate, query, and maintain specific subsets of data.
- Enhanced Data Exploration: Partitioning enables data analysts and scientists to focus on specific dataset segments during exploratory analysis. This targeted approach streamlines the process of uncovering insights and trends within the data.
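To illustrate partition elimination, here is a minimal sketch, assuming a hypothetical logs table that is already partitioned by a dt string column:

-- Prunes to a single partition: Athena scans only the data under dt = '2023-08-15'
SELECT count(*) FROM logs WHERE dt = '2023-08-15';

-- No partition filter: Athena scans every partition in the table
SELECT count(*) FROM logs;

The first query is billed only for the bytes in the matching partition, while the second pays for a full scan of the dataset.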
Implementing Partitioning in Amazon Athena
To harness the benefits of partitioning in Amazon Athena, follow these key steps:
- Data Preparation: Before partitioning, ensure that your data is appropriately structured. Partition keys should be carefully chosen based on the nature of your data and the type of queries you plan to run. For instance, partitioning by date can be a logical choice if your dataset contains sales data.
- Partitioning Columns: Define the columns acting as partition keys when creating your Amazon Athena table. These partitioning columns determine how your data is divided into partitions. Use the PARTITIONED BY clause when creating your table, specifying the partitioning columns (a sketch follows this list).
- Directory Structure: In Amazon S3, each partition is represented by a separate subdirectory. The directory structure should mirror the partition keys, helping Amazon Athena efficiently navigate and scan the relevant data during query execution.
- Query Optimization: Leverage partition elimination in your SQL queries by specifying the partition key values in the WHERE clause. This instructs Amazon Athena to scan only the relevant partitions, enhancing query performance.
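As a minimal sketch, assuming a hypothetical sales dataset stored as JSON (the bucket, table, and column names below are placeholders), a date-partitioned table could be created and queried like this:

CREATE EXTERNAL TABLE sales (
    order_id string,
    customer_id string,
    amount double
)
PARTITIONED BY (sale_date string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://my-sales-bucket/data/';

-- Register a partition, then query only its subset of the data
ALTER TABLE sales ADD PARTITION (sale_date = '2023-08-15')
    LOCATION 's3://my-sales-bucket/data/sale_date=2023-08-15/';

SELECT sum(amount) FROM sales WHERE sale_date = '2023-08-15';

If the data uses Hive-style key=value prefixes as above, MSCK REPAIR TABLE sales can discover all partitions at once instead of adding them one by one.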
Steps to Set Up Partitioning in Amazon Athena
Before you begin creating tables, you should understand more about the requirements for partitioning tables. This can help you create the tables that you need.
AWS CloudTrail records all AWS account activity and delivers the logs to Amazon S3. Because CloudTrail generates a large number of events every day, querying across all event logs during an incident is difficult and time consuming. The demonstration below shows how to set up partitioning for AWS CloudTrail logs; the same approach can be applied to other logs stored in an Amazon S3 bucket with the structure described in Step 1.
Step 1: Access the AWS console, locate the Amazon S3 bucket, and identify the folder. Note the folder’s path, which can serve as a partitioning key.
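For reference, CloudTrail delivers log files under a prefix of the following form (the bucket name and account ID below are the same placeholders used throughout this post):

s3://aws-cloudtrail-logs-111111111111-2ca53272/AWSLogs/111111111111/CloudTrail/us-east-1/2023/08/15/<log-file>.json.gz

The <region>/<yyyy>/<MM>/<dd>/ portion of the path is what the region and timestamp partition keys will map to.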
Step 2: Create a Table using partitioning
- Modify the LOCATION clause within the CREATE TABLE statement.
Format:

LOCATION 's3://cloudtrail_bucket_name/AWSLogs/<Account_ID>/CloudTrail/'
Example:

LOCATION 's3://aws-cloudtrail-logs-111111111111-2ca53272/AWSLogs/111111111111/CloudTrail'
- In the PARTITIONED BY clause, add entries for the timestamp and region as strings, as in the following example:
PARTITIONED BY (timestamp string, region string)
- In the TBLPROPERTIES clause, enable partition projection, define the partition column types, and set the Amazon S3 location template, whose variables act as dynamic inputs at query time:
TBLPROPERTIES (
    'projection.enabled'='true',
    'projection.timestamp.format'='yyyy/MM/dd',
    'projection.timestamp.interval'='1',
    'projection.timestamp.interval.unit'='DAYS',
    'projection.timestamp.range'='2020/01/01,NOW',
    'projection.timestamp.type'='date',
    'projection.region.type'='injected',
    'storage.location.template'='s3://aws-cloudtrail-logs-111111111111-2ca53272/AWSLogs/111111111111/CloudTrail/${region}/${timestamp}')
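In this configuration, 'projection.enabled' turns on partition projection, so Athena computes partition values at query time instead of reading them from the metastore. The projection.timestamp.* properties tell Athena to generate daily date values in yyyy/MM/dd format from 2020/01/01 up to the present, while 'projection.region.type'='injected' means the region values are not enumerated and must be supplied in each query. Finally, 'storage.location.template' maps each (region, timestamp) pair to its Amazon S3 prefix.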
Here is the complete sample DDL statement with the partitioning schema:
CREATE EXTERNAL TABLE demo_cloudtrail_partition(
    eventVersion STRING,
    userIdentity STRUCT<
        type: STRING,
        principalId: STRING,
        arn: STRING,
        accountId: STRING,
        invokedBy: STRING,
        accessKeyId: STRING,
        userName: STRING,
        sessionContext: STRUCT<
            attributes: STRUCT<
                mfaAuthenticated: STRING,
                creationDate: STRING>,
            sessionIssuer: STRUCT<
                type: STRING,
                principalId: STRING,
                arn: STRING,
                accountId: STRING,
                userName: STRING>,
            ec2RoleDelivery: string,
            webIdFederationData: map<string,string>
        >
    >,
    eventTime STRING,
    eventSource STRING,
    eventName STRING,
    awsRegion STRING,
    sourceIpAddress STRING,
    userAgent STRING,
    errorCode STRING,
    errorMessage STRING,
    requestparameters STRING,
    responseelements STRING,
    additionaleventdata STRING,
    requestId STRING,
    eventId STRING,
    readOnly STRING,
    resources ARRAY<STRUCT<
        arn: STRING,
        accountId: STRING,
        type: STRING>>,
    eventType STRING,
    apiVersion STRING,
    recipientAccountId STRING,
    serviceEventDetails STRING,
    sharedEventID STRING,
    vpcendpointid STRING,
    tlsDetails struct<
        tlsVersion: string,
        cipherSuite: string,
        clientProvidedHostHeader: string>
)
PARTITIONED BY (
    `timestamp` string,
    `region` string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://aws-cloudtrail-logs-111111111111-2ca53272/AWSLogs/111111111111/CloudTrail'
TBLPROPERTIES (
    'projection.enabled'='true',
    'projection.timestamp.format'='yyyy/MM/dd',
    'projection.timestamp.interval'='1',
    'projection.timestamp.interval.unit'='DAYS',
    'projection.timestamp.range'='2020/01/01,NOW',
    'projection.timestamp.type'='date',
    'projection.region.type'='injected',
    'storage.location.template'='s3://aws-cloudtrail-logs-111111111111-2ca53272/AWSLogs/111111111111/CloudTrail/${region}/${timestamp}')
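Because partition projection is enabled, you do not need to run MSCK REPAIR TABLE or ALTER TABLE ADD PARTITION as new CloudTrail logs arrive; Athena derives the partitions directly from the template at query time.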
Step 3: Run the Queries
To use the partitions, you must supply the partition keys (region and timestamp) in the WHERE clause of your query, as shown below.
Example query:
SELECT * FROM "<db name>"."<table name>" WHERE region = '<aws-region>' AND timestamp = '<yyyy/mm/dd>' LIMIT 10;
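For instance, a hypothetical filled-in query against the demo_cloudtrail_partition table above (the region and date values are placeholders) could look like this:

SELECT eventname, eventtime, sourceipaddress
FROM "default"."demo_cloudtrail_partition"
WHERE region = 'us-east-1' AND timestamp = '2023/08/15'
LIMIT 10;

Note that because region is an injected projection column, a query without an equality condition on region will fail rather than fall back to a full scan.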
Conclusion
Amazon Athena’s partitioning feature is a powerful tool for optimizing data analysis and query performance.
Whether analyzing sales trends, user behavior, or any other dataset, Amazon Athena partitioning empowers you to explore and extract value from your data quickly and precisely.
Drop a query if you have any questions regarding Amazon Athena and we will get back to you quickly.
About CloudThat
CloudThat is an official AWS (Amazon Web Services) Advanced Consulting Partner and Training partner, AWS Migration Partner, AWS Data and Analytics Partner, AWS DevOps Competency Partner, Amazon QuickSight Service Delivery Partner, AWS EKS Service Delivery Partner, and Microsoft Gold Partner, helping people develop cloud knowledge and helping businesses aim for higher goals using best-in-industry cloud computing practices and expertise. We are on a mission to build a robust cloud computing ecosystem by disseminating knowledge on technological intricacies within the cloud space. Our blogs, webinars, case studies, and white papers enable all the stakeholders in the cloud computing sphere.
To get started, go through our Consultancy page and Managed Services Package to explore CloudThat’s offerings.
FAQs
1. Can I partition data stored in other AWS services, like Amazon Redshift or Amazon RDS, with Amazon Athena?
ANS: – No, Amazon Athena partitions data stored in Amazon S3. If you have data in other AWS services, you may need to transfer or copy it to Amazon S3 before partitioning it with Amazon Athena.
2. What column should I choose for partitioning?
ANS: – The choice of partitioning column depends on your stored data structure and query patterns. Typically, it’s best to choose a column that is frequently used in WHERE clauses of your queries.
3. Can I change the partitioning column later without affecting data?
ANS: – Yes, you can change the partitioning column, but doing so typically requires recreating the table with the new partitioning scheme. The underlying data in Amazon S3 is not affected, since Athena tables are only metadata over the stored objects.
WRITTEN BY Kashyap Nitinbhai Shani
Kashyap Nitinbhai Shani is a Research Associate at CloudThat. He is interested in learning advanced technologies and gaining insights into new and upcoming cloud services. He likes writing tech blogs and learning new languages.