Overview
Amazon Athena now supports Apache Spark, a popular open-source distributed processing framework designed for fast analytics workloads on data of any size.
Introduction
Amazon Athena automatically scales by executing queries in parallel, so results come back quickly even for large datasets and complex queries.
Using Apache Spark with Athena
Amazon Athena enables interactive data analytics and exploration with Apache Spark without the need to plan, configure, or manage resources. Running Apache Spark applications on Athena means submitting Spark code for processing and receiving the results directly, without additional setup. You can use the simplified notebook experience in the Amazon Athena console to build Apache Spark applications with Python or the Athena notebook APIs. Apache Spark on Amazon Athena is serverless and provides automatic, on-demand scaling to meet changing data volumes and processing requirements.
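To give a feel for the notebook experience, here is a minimal sketch of a cell you might run, assuming the SparkSession that Athena notebooks preinitialize as `spark`; the S3 path and the "region" column are hypothetical placeholders:

```python
# Minimal sketch of an Athena for Apache Spark notebook cell.
# Assumes the preinitialized SparkSession `spark` that Athena notebooks
# provide; the S3 path and "region" column are hypothetical.
df = spark.read.parquet("s3://my-example-bucket/sales/")
df.printSchema()
df.groupBy("region").count().show()
```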
AWS Regions where Amazon Athena for Apache Spark is currently available –
- Asia Pacific (Tokyo)
- Europe (Ireland)
- US East (N. Virginia)
- US East (Ohio)
- US West (Oregon)
Here is a list of some of the preinstalled Python libraries that can be used directly –
```
boto3==1.24.31
botocore==1.27.31
certifi==2022.6.15
charset-normalizer==2.1.0
cycler==0.11.0
cython==0.29.30
docutils==0.19
fonttools==4.34.4
idna==3.3
jmespath==1.0.1
joblib==1.1.0
kiwisolver==1.4.4
matplotlib==3.5.2
mpmath==1.2.1
numpy==1.23.1
packaging==21.3
pandas==1.4.3
patsy==0.5.2
pillow==9.2.0
plotly==5.9.0
pmdarima==1.8.5
pyathena==2.9.6
pyparsing==3.0.9
python-dateutil==2.8.2
pytz==2022.1
requests==2.28.1
s3transfer==0.6.0
scikit-learn==1.1.1
scipy==1.8.1
seaborn==0.11.2
six==1.16.0
statsmodels==0.13.2
sympy==1.10.1
tenacity==8.0.1
threadpoolctl==3.1.0
urllib3==1.26.10
pyarrow==9.0.0
```
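Since these libraries come preinstalled, a notebook cell can import them directly with no pip install step. A minimal sketch, assuming a Spark DataFrame `df` with a hypothetical "region" column from an earlier cell:

```python
# Sketch: use preinstalled libraries directly in a notebook cell.
import pandas as pd
import matplotlib.pyplot as plt

# Pull a small aggregate down to pandas for plotting.
# `df` and the "region" column are assumed from an earlier cell.
pdf = df.groupBy("region").count().toPandas()
pdf.plot.bar(x="region", y="count")
plt.show()
```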
Setting up Apache Spark on Amazon Athena
To begin using Apache Spark on Amazon Athena, you must first create a Spark-enabled workgroup. After switching to the workgroup, you can create a new notebook or open an existing one. When you open a notebook in Athena, a new session starts automatically, and you can work directly in the Athena notebook editor.
Steps to create a Spark-enabled workgroup in Athena
- Head to the Athena console https://console.aws.amazon.com/athena/
- In the navigation pane, choose Workgroups, click Create workgroup, and enter a name for your workgroup.
- For Analytics Engine, choose Apache Spark.
- To use the example notebook for this tutorial, click Turn on example notebook. This optional feature adds an example notebook named example-notebook-<random string> to your workgroup, along with AWS Glue-related permissions that let the notebook create, show, and delete databases and tables in your account, plus Amazon S3 read permissions for the sample dataset. (A programmatic sketch of creating such a workgroup follows these steps.)
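For reference, the same workgroup can also be created programmatically. A hedged sketch using boto3, where the workgroup name, execution role ARN, and results bucket are placeholders:

```python
# Sketch: create a Spark-enabled workgroup programmatically with boto3.
# The workgroup name, execution role ARN, and results bucket are placeholders.
import boto3

athena = boto3.client("athena")
athena.create_work_group(
    Name="spark-demo-workgroup",
    Configuration={
        "EngineVersion": {"SelectedEngineVersion": "PySpark engine version 3"},
        "ExecutionRole": "arn:aws:iam::111122223333:role/AthenaSparkExecutionRole",
        "ResultConfiguration": {"OutputLocation": "s3://my-athena-results-bucket/"},
    },
)
```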
Switching workgroups and opening notebook explorer
- On the Workgroups page of the Athena console, select the button next to the Spark-enabled workgroup you just created.
- Choose Actions -> Switch workgroup. (You will be notified by the console that you have changed to the new workgroup.)
- Choose Notebook explorer from the console navigation pane.
The Notebook explorer can be used in multiple ways (a sketch of the equivalent notebook APIs follows this list) –
- A notebook can be opened in a new session by selecting its connected name.
- Use the Action menu to rename, delete or export the notebook.
- Choose Import file to import the notebook.
- Click on Create Notebook to create a new notebook.
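These explorer actions have programmatic equivalents. A sketch using the Athena notebook APIs in boto3, with placeholder workgroup and file names:

```python
# Sketch: the notebook explorer actions have API equivalents in the Athena
# SDK; the workgroup and file names below are placeholders.
import boto3

athena = boto3.client("athena")

# Create an empty notebook in the Spark-enabled workgroup.
created = athena.create_notebook(
    WorkGroup="spark-demo-workgroup",
    Name="my-first-notebook",
)
print(created["NotebookId"])

# Import an existing .ipynb file instead of creating one from scratch.
with open("analysis.ipynb") as f:
    athena.import_notebook(
        WorkGroup="spark-demo-workgroup",
        Name="imported-analysis",
        Type="IPYNB",
        Payload=f.read(),
    )
```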
Running the example notebook
A dataset of New York City taxi trips is queried in the example notebook.
To run the example notebook
- From Notebook explorer, select the linked name of the example notebook. This opens the notebook in the notebook editor and starts a notebook session with the default parameters. You are notified that a new Apache Spark session has started with the default settings (a maximum of 20 DPUs).
- Select the Run button once for each cell in the notebook to run the cells sequentially and view the results.
- Scroll down to each cell to see the results and bring new cells into view.
- Cells that include calculations display a progress bar showing the percentage complete, the elapsed time, and the estimated time remaining. (A sketch of the kind of aggregation the example notebook performs follows this list.)
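For illustration, here is a rough sketch of the kind of aggregation the example notebook performs; the S3 path and schema below are assumptions, not the notebook's actual code:

```python
# Rough sketch of the kind of aggregation the example notebook performs;
# the S3 path and schema here are assumptions, not the notebook's actual code.
taxi = spark.read.csv(
    "s3://my-example-bucket/nyc-taxi/yellow_tripdata.csv",  # hypothetical path
    header=True,
    inferSchema=True,
)
taxi.createOrReplaceTempView("trips")
spark.sql(
    "SELECT passenger_count, COUNT(*) AS trip_count "
    "FROM trips GROUP BY passenger_count ORDER BY passenger_count"
).show()
```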
Terminating a Session
In the notebook editor, choose the Session menu and click Terminate. When the Confirm session termination prompt appears, choose Confirm. You return to the notebook editor, and your notebook is saved.
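Sessions can also be terminated programmatically. A minimal sketch using the Athena TerminateSession API, with a placeholder session ID:

```python
# Sketch: terminate a session programmatically via the TerminateSession API;
# the session ID is a placeholder.
import boto3

athena = boto3.client("athena")
athena.terminate_session(SessionId="session-id-placeholder")
```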
Creating your notebook
- From the navigation pane, choose Notebook explorer or Notebook editor.
- Do one of the following:
  - In Notebook explorer, choose Create notebook.
  - In Notebook editor, choose Create Notebook, or click the plus (+) button to add a notebook.
- Enter a name for the notebook in the Create notebook dialog box.
- Optionally, expand Session parameters to fill in values for the optional parameters. (A sketch of the corresponding API parameters follows these steps.)
- Click Create.
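The optional session parameters in the dialog correspond roughly to the EngineConfiguration of the StartSession API. A hedged sketch with placeholder values:

```python
# Sketch: the optional session parameters map roughly to the
# EngineConfiguration of the StartSession API; values are placeholders.
import boto3

athena = boto3.client("athena")
session = athena.start_session(
    WorkGroup="spark-demo-workgroup",
    Description="ad-hoc analysis session",
    EngineConfiguration={
        "CoordinatorDpuSize": 1,      # DPUs for the Spark driver
        "MaxConcurrentDpus": 20,      # session-wide DPU ceiling
        "DefaultExecutorDpuSize": 1,  # DPUs per executor
    },
)
print(session["SessionId"])
```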
Supported data and storage formats
Apache Spark natively supports common data formats, including CSV, JSON, ORC, Parquet, and plain text. See Data Sources in the Apache Spark documentation for further details on Spark data sources.
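A quick sketch of reading a few of these formats in a notebook cell; all S3 paths are hypothetical:

```python
# Sketch: reading a few of Spark's natively supported formats; all S3 paths
# are hypothetical.
csv_df = spark.read.csv("s3://my-bucket/data.csv", header=True)
json_df = spark.read.json("s3://my-bucket/data.json")
parquet_df = spark.read.parquet("s3://my-bucket/data/")
orc_df = spark.read.orc("s3://my-bucket/data.orc")
```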
Monitoring Apache Spark calculations with CloudWatch metrics
When the Publish CloudWatch metrics option for your Spark-enabled workgroup is chosen, Athena posts metrics related to calculations to Amazon CloudWatch. In the CloudWatch console, you can build personalized dashboards and configure alarms and triggers for metrics.
Athena publishes the following metric to the CloudWatch console under the AmazonAthenaForApacheSpark namespace:
- DPUCount – the number of DPUs consumed during the session.
The DPUCount metric has the following dimensions –
- SessionId – It is the ID of the session where calculations are submitted.
- Workgroup – Name of the workgroup.
For the Amazon CloudWatch console to display metrics for Spark-enabled workgroups –
- Head to the CloudWatch console at https://console.aws.amazon.com/cloudwatch/
- Choose Metrics -> All metrics from the navigation pane and select the AmazonAthenaForApacheSpark namespace from the list.
To view the metrics using the AWS CLI –
>> aws cloudwatch list-metrics --namespace "AmazonAthenaForApacheSpark"
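To retrieve actual datapoints rather than just listing metrics, here is a hedged boto3 sketch; the dimension values are placeholders and the statistic choice is illustrative:

```python
# Sketch: pull recent DPUCount datapoints with boto3. The dimension values
# are placeholders, and the statistic choice is illustrative.
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")
response = cloudwatch.get_metric_statistics(
    Namespace="AmazonAthenaForApacheSpark",
    MetricName="DPUCount",
    Dimensions=[
        {"Name": "SessionId", "Value": "session-id-placeholder"},
        {"Name": "WorkGroup", "Value": "spark-demo-workgroup"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Maximum"],
)
print(response["Datapoints"])
```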
Conclusion
Amazon Athena lets you conduct Apache Spark data analytics and exploration interactively, without having to plan, configure, or manage resources. Running Apache Spark applications on Athena means simply submitting Spark code for processing and receiving the results directly, with no further configuration.
About CloudThat
CloudThat is an official AWS (Amazon Web Services) Advanced Consulting Partner and Training Partner, and a Microsoft Gold Partner, helping people build cloud expertise and enabling their businesses to aim for higher goals using industry-best cloud computing practices. We are on a mission to build a robust cloud computing ecosystem by disseminating knowledge on technological intricacies within the cloud space. Our blogs, webinars, case studies, and white papers enable all the stakeholders in the cloud computing sphere.
Drop a query if you have any questions regarding Amazon Athena, and I will get back to you quickly.
To get started, go through our Consultancy page and Managed Services Package, CloudThat's offerings.
FAQs
1. What is DPUCount?
ANS: – DPUCount is the CloudWatch metric that reports the number of data processing units (DPUs) a session consumes. A DPU is a measure of processing power consisting of 4 virtual CPUs of compute capacity and 16 GB of memory.
2. How will the session be managed if you need to work on multiple projects simultaneously?
ANS: – You can make a session specifically for each project you need to work on at once, and the sessions will be independent.
3. What are magic commands, and how to use them?
ANS: – Magic commands, or magics, are special commands that you can run in a notebook cell. For example, %env displays the environment variables in a notebook session. A single percent sign (%) indicates a magic function that operates on a single line of input, called a line magic. Magics that span multiple lines and are preceded by a double percent sign (%%) are called cell magic functions or cell magics. (A short notebook sketch follows.)
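A short notebook sketch of both kinds, with the caveat that which magics are available depends on the notebook environment:

```python
# Notebook-cell sketch of the two kinds of magics; support for specific
# magics depends on the notebook environment.

# Cell 1 – a line magic on its own line:
%env

# Cell 2 – a cell magic on the first line applies to the whole cell:
%%sql
SELECT 1
```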
WRITTEN BY Sahil Kumar
Sahil Kumar works as a Subject Matter Expert - Data and AI/ML at CloudThat. He is a certified Google Cloud Professional Data Engineer. He has a great enthusiasm for cloud computing and a strong desire to learn new technologies continuously.