AWS Athena
Data analysis is a very complex process, and efforts have always been made to make it easy. There are many tools for analytics, and even the popular tech giant Amazon offers an AWS service called Amazon Athena. This Amazon Athena tutorial will guide you through the basic and advanced use of Amazon Athena.
Amazon Athena is an interactive data analysis tool used to process complex queries in a relatively short amount of time. It is serverless. Hence, there is no hassle to set up, and no infrastructure management is required. It is not a database service. Therefore, you pay for the queries you run. You just point your data into S3, define the necessary schema, and you are good to go with a standard SQL. Learn all about Amazon Web Services with AWS Training.
Introduction to Amazon Athena
On 20 November 2016, Amazon launched Athena as one of its services. As mentioned earlier, Amazon Athena is a serverless query service that analyzes data using standard SQL stored in Amazon S3. With a few clicks in the AWS Management Console, customers can point Amazon Athena at their data stored in Amazon S3 and run queries using standard SQL to retrieve results in seconds.
With Amazon Athena, there is no infrastructure to set up or manage, and the customer only pays for the queries they run. Amazon Athena scales automatically, executing queries in parallel, giving fast results even with large datasets and complex queries.
Difference Between Microsoft SQL Server And Amazon Athena
Features | Microsoft SQL Server | Amazon Athena |
---|---|---|
DEFINITION | Microsoft SQL Server is a database management and analysis system. | Amazon Athena is an interactive query service that makes data analysis easy. |
USAGE | Used for DML, DCL, DDL, and TCL operations on Database. | Used for DML operations on Database. |
BENEFITS |
|
|
INTEGRATION |
|
|
LIMITATIONS |
|
|
Creating Table in Athena
We use live resources, so you’re only charged for the queries you run, but not for the datasets you use, and if you want to upload your data files to Amazon S3, then charges apply.
To query S3 file data, you must have an external table associated with the file structure. We can create external tables in two ways:
- Manually.
- Using AWS Glue Crawler.
To create an external table manually, follow the correct structure CREATE DETAILS CREATE EXTERNAL TABLE and specify the correct format and exact location. An example is given below:
Creating an external table manually
The created external tables are stored in the AWS Glue Catalog. Glu Clover parses the input file structure and creates a metadata table defined in the Glu Data Catalog.
The Crawler uses the AWS IAM (Identity and Access Management) role to allow archived data and data catalogs. You must have permission to pass roles to the Crawler to access crawled Amazon S3 paths.
Go to AWS Glue, select “Add Table,” and select the option “Add Table Using Crawler”.
Add tables using a glue crawler.
Give the Crawler a name. Let’s say, for example, a car-crawler
Enter crawler name
Choose the path in Amazon S3, where the file is saved.
If you plan to query only one file, you can choose either the S3 file path or the S3 folder path to query all files in a folder with the same structure.
Enter crawler name
Choose the path in Amazon S3, where the file is saved.
If you plan to query only one file, you can choose either the S3 file path or the S3 folder path to query all files in a folder with the same structure.
as the path.
Create an IAM role with permissions to the S3 object whose target you want to query or select an existing IAM role (that has sufficient privileges to access the S3 object).
Choose a database that contains external tables and optionally choose a prefix to be added to the external table name.
Select Database and prefix for external tables
Click Finish to create the Glue Crawler
- Run crawler
The external table has been created under the specified Database. Now you can query the S3 object using this.
- Select data from external table
Since we put a file, the query “select * from json_files” returns one record in the file. Let’s try to put another file with the same structure in the same S3 folder and query the external table again.
If you query the same EXTERNAL table, you will see two rows returned instead of one.
When the same external table is queried, you will get two records. It is because there are two files in the S3 folder with the desired structure. You can perform many operations on the data. For example, the following Query will UNNEST the array in the result set.
Accessing Amazon Athena
Athena is very easy to reach, and one can either:
These are some of the ways to access Amazon Athena. By now, you know everything important about Amazon Athena, and let me tell you about the different features of Athena.
Features of Athena
Among the many services provided by Amazon, Athena is one of the services. It has several features that make it suitable for data analysis. Let’s take a look at the different features one by one.
- Easy Implementation: Athena does not require installation, and it can also be accessed by the AWS CLI directly from the AWS console.
- Serverless: It is serverless, so the end-user does not need to worry about infrastructure, configuration, scaling, or failure. Athena takes care of everything herself.
- Pay Per Query: Athena charges you only for the Query you run, i.e., the amount of data managed per Query. You can save a lot if you can compress them and format your dataset accordingly.
- Fast: Athena is a very fast analysis tool. It can execute complex queries quickly by breaking them down into simpler queries and running them in parallel, then combining the results to give the desired output.
- Secure: With the help of IAM policies and AWS identities, Athena gives you complete control over the data set. Since data is stored in S3 buckets, IAM policies can help you manage users’ controls.
- Highly Available: With the assurance of AWS, Athena is highly available, and the user can execute queries round the clock. As is AWS 99.999% available, so is Athena.
- Integration: The best feature of Athena is that it can be integrated with AWS Glue. AWS Glue will help the user to build a better-integrated data repository. It helps you create better versions of data, better tables, views, etc.
As of now, you will be impressed with AWS Athena. Now that you know a lot about Athena. Let’s roll up our sleeves and understand Athena’s work by doing a small demo.
Demo (Comparison between Amazon Athena and MySQL)
In this Amazon Athena tutorial, now we will compare MySQL and Athena and understand how even simple queries take less time to execute in Athena.
- Loading the CSV file to MySQL took around 1 hour, but in Athena, it took just 3 minutes to upload the CSV file to S3 and 0.42 seconds to create a table.
- Select Query. Select * from table.
- Select Query in Athena.
- Select Query in MySQL.
- Select a specific column from the table.
Select a specific column in Athena
- Select a specific column in MySQL.
- Getting the count of a specific column.
- Count of a specific column in Athena.
- Counting the number of records in the table.
Count all records in Athena
Select Query with a specified range.
Select Query within the said range in Athena.
Select Query within the said range in MySQL.
What are the limitations of Amazon Athena?
- Optimization is limited to queries. For example, data already stored in S3 cannot be optimized.
- No indexing options. Indexing options commonly appear in traditional databases. Without indexing, the operation load on Athena increases, potentially affecting performance.
- Efficient queries require partitioning. To enable efficient queries, data must first be partitioned, and partitions must then be managed for what best fits performance needs.
- Stored procedures, parameterized queries, and Presto federated connectors are not supported. Amazon Athena Federated Query is needed to connect data sources.
- When querying a table with thousands of partitions, Athena can time out.
- Source files that start with an underscore or a dot are treated as hidden.
- The row and column size cannot exceed 32 megabytes.
- Athena does not support querying data in S3 Glacier and S3 Glacier Deep Archive storage classes.
- Statements such as CREATE TABLE LIKE, DESCRIBE INPUT and DESCRIBE OUTPUT, EXECUTE ? USING, MERGE and UPDATE are not supported.
AWS Athena vs. AWS Glue
Since its initial release in August 2017, AWS Glu has been operating as a fully-managed Extract, Transform and Load (ETL) service. It comes with three primary components:
- A flexible scheduler to handle job monitoring
- An ETL engine capable of generating Scala or Python code
- A data catalog that serves as a central metadata repository
AWS Glue helps you find and transform data sets and prepare them for discovery and querying with these tools.
So, you should be able to use AWS Athena with AWS Glue. Subsequent data catalogs will create, store, and retrieve table metadata (or schemas) as queried by Athena.
What are the advantages and disadvantages of using AWS Athena?
AWS Athena, as it turned out, is a double-edged sword. The features that make it conveniently cheap and accessible are the ones that may limit you somewhat.
Pros of AWS Athena
- Serverless: Since it is delivered as a fully managed serverless service, AWS Athena saves you all the hassle that comes with managing infrastructure. You don’t have to worry about setting up clusters, regulating capacity, or loading data.
- Cost-effective: AWS Athena is not only cost-effective but also quite affordable as compared to its close competitors. It is because the service doesn’t charge you for count instances. And instead, you only pay for the queries you’re running.
- Widely Accessible: As a service that runs its queries using standard SQL, AWS Athena is widely accessible to anyone – not just developers and engineers. It can be adopted by business analysts and other data professionals, as standard SQL queries are very simple and straightforward.
- Flexibility: Amazon Athena’s open and versatile architecture doesn’t limit you to any specific vendor, technology, or device. For example, you can work with a wide range of open-source file formats and freely switch between query engines without adjusting schemas.
Cons of AWS Athena
- No Data Optimization: AWS Athena doesn’t offer a lot of customization capabilities. The farthest you can go here is to optimize the queries – not the underlying data. Even when you try to replace Amazon S3 data using AWS Glu, you still need to be careful not to harm other services that access the same data.
- Shared Resources: According to Amazon’s Service Level Agreement (SLA), all AWS Athena users worldwide share the same resource when running their queries. This multi-tenancy approach can trigger resource stress from time to time, leading to fluctuating query performance.
- Reduction in data manipulation operations: AWS Athena is just a query service, and you will find only one query engine here. It does not come with a built-in Data Manipulation Language (DML) interface for inserting, deleting, and updating data.
- Requires data partitioning: If you intend to run your SQL queries efficiently, you may want to partition the data sets stored in Amazon S3. The number of partitions you have to create will affect the speed and performance of your queries to a great extent. For example, every 500 partitions scanned will increase your query time by one second.
- Lack of indices: While indexing has always been a built-in provision in traditional databases, you do not get this privilege with AWS Athena. As such, you should expect challenges in operations such as consolidating large tables.
How is AWS Athena priced?
As we’ve already said, AWS Athena, follows a pricing schedule that charges you based on the queries you choose to run in your data analysis.
Amazon calculates the bytes and then rounds them to the nearest megabyte, 10MB being the minimum amount per Query.
You should expect to pay $5 for every terabyte (TB) of data you can afford. In the meantime, you will not be charged for failed queries, statements for managing partitions, as well as Data Definition Language (DDL) statements.
But that’s not all. Amazon further makes it possible for you to reduce per-query costs by as much as 30% to 90%. You just need to split, compress or convert your data to a columnar format.