Pivot table in AWS Athena (or Presto)

Steve Ng
Analytics Vidhya
Published in
4 min readMar 28, 2020

--

Background

Pivot table is one of the most commonly used features in Excel for reporting purposes. The user does not need a programming background to perform data analysis.

+---------------+-------+---------------+--------------+
| Item Category | Price | Region | Sales Person |
+---------------+-------+---------------+--------------+
| Furniture | 10 | North America | Jane |
| Technology | 5 | Asia Pacific | Jane |
| Furniture | 12 | North America | Tom |
| Technology | 8 | North America | Tom |
+---------------+-------+---------------+--------------+

Assume a sales report as above (heavily simplified for demo purpose) on the orders sold by all the salespeople across all regions. Pivot table enables multi-dimensional questions.

Example: Total sales per region per category

In order to answer this query, one might create a pivot table on Excel setting the below configuration

  • Row = Item Category
  • Column = Region
  • Value = sum (price)

And the output as shown below will answer the question.

+---------------+--------------+---------------+
| Item Category | Asia Pacific | North America |
+---------------+--------------+---------------+
| Furniture | | 22 |
| Technology | 5 | 8 |
+---------------+--------------+---------------+

Say your company is doing very well, and there were 10 million orders for last month. Now imagine your analyst running pivot table on an excel with 10 million rows. Hopefully, he or she will still be sticking around for next month’s analysis.

In this post, we will describe how we can achieve this on Athena — a highly scalable manner to query huge amounts of data.

Amazon Athena

Athena user interface. On the left is the list of tables. On the right is to execute queries and view the result

Athena is a managed query service provided by AWS. The query engine is based on Presto. Athena supports most operator from presto and is a popular choice to query data in s3. With the recent federated query announcement, Athena can also query other data sources such as Postgres. Truly splendid.

Walking through an example

Example: Total sales per region per category

In this scenario, the user will select from excel on Pivot table at

  • row = item category
  • Colum = regions
  • value = sum (price)

Query to achieve the above through Athena. Assuming the table is called order

The query above will produce the below output. Exactly the same output as the earlier example if we used the pivot table from Excel.

+---------------+--------------+---------------+
| Item Category | Asia Pacific | North America |
+---------------+--------------+---------------+
| Furniture | | 22.0 |
| Technology | 5.0 | 8.0 |
+---------------+--------------+----------------

Explanation

Firstly on the inner query. Example output of the inner query as shown below.

+---------------+---------------------------------------+
| Item Category | kv1 |
+---------------+---------------------------------------+
| Furniture | {North America=[10, 12]} |
| Technology | {Asia Pacific=[5], North America=[8]} |
+---------------+---------------------------------------+

The column kv1 is populated mainly with the function multimap_agg multimap_agg(“region”, amount) kv1.

As per the docs

multimap_agg(key, value) → map<K,array<V>>Returns a multimap created from the input key / value pairs. Each key can be associated with multiple values.

Notice column kv1 is a map with key = region and values = all the price in the region. Based on the above, it will indicate that

  • For the Furniture category, there were 2 orders made in North America at $10 and $12 respectively
  • For the Technology category, there was 1 order from Asia Pacific of $5 and 1 order from North America at $8

After the inner query, the last step would be to sum the value. Presto provides a variety of array helper functions. In order to sum up the value, we will use the reduce operator.

reduce(array(T), initialState S, inputFunction(S, T, S), outputFunction(S, R)) → RReturns a single value reduced from array. inputFunction will be invoked for each element in array in order. In addition to taking the element, inputFunction takes the current state, initially initialState, and returns the new state. outputFunction will be invoked to turn the final state into the result value. It may be the identity function (i -> i).

reduce(kv1[‘Asia Pacific’], 0.0, (s, x) -> s + x, s-> s) AS “Asia Pacific”

What this line does is

  1. look at the row value and pick up the value of Asia Pacific with kv1[‘Asia Pacific’]
  2. Initialize reduce to sum up the value
  3. Create a column as “Asia Pacific”

Conclusion

I hope the above serves as a quick guide to Pivot table on Athena.

If you notice, we have to identify the regions: Asia Pacific and North America in advance. Unfortunately, I can’t find a way to dynamically generate the list. Please let me know if you have a suggestion.

--

--

Steve Ng
Analytics Vidhya

simply curious about new technology on the block