SQL for Data Analytics III

6 minute read

Notes on Postgresql for Data Analysis – III

Subqueries

Subqueries, also known as inner queries and/or nested queries are tools that allows us to answer more complex questions from what we normally can from a single database table. A subquery allows us to query an existing table which itself was created from a prior/inital query to the db. Let’s look at the following example:

  • This is the inner query:
    SELECT DATE_TRUNC('day', occurred_at) AS day,
       channel,
       COUNT(*) AS event_count
    FROM web_events
    GROUP BY 1, 2
    ORDER BY 1
    LIMIT 5;
    

    and it will return the following output:

Subquery_1

  • We can now create a subquery to get more insights from the data we retrieved from the inner query. We do it by wrapping the inner query in parenthesis and type the statements for the subquery at the the top of the inner query:
    SELECT channel,
       AVG(event_count) AS avg_event_count
    FROM (SELECT DATE_TRUNC('day', occurred_at) AS day,
         channel,
         COUNT(*) AS event_count
         FROM web_events
         GROUP BY 1, 2) AS sub
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 5;
    

    Here we see that in the subquery we are selecting the channel and also, we are aggregating on the AVG of the event_count columnn that was produced by the inner query. Keep in mind that the inner query always runs first and creates the initial output table. The we use the FROM on the inner query because we want to retrieve data from the inner query. Notice that after the closing parenthesis of the inner query we alias the subquery AS sub and we can also pass other arguments such GROUP BY, ORDER BY and LIMIT to our subquery. It is best practice to write queries in a readable manner. Although the intended output would be the same– do not write a query in a single line. This makes the query difficult to read. Indentation works best at making the query more readable as seen above.

Sub-query-2

In the query above we created a table by running an inner query and the queried that table using FROM in the subquery. This worked because we returned a table but if the inner query returns a single value then we we must use that value in a logical statement such as WHERE, HAVING, CASE and even a SELECT statement. As an example, let’s say that we want to look at all the orders from the month the that the first ever order was placed. We would use a inner query to return the date/month for the first order and then use a subquery to query the other conditions.

SELECT DATE_TRUNC('month', MIN(occurred_at))
       FROM orders;

This will return the date we are after:

Subquery-3

Now we can create a subquery to get all the data about the orders that were placed in the same month as the first order:

SELECT *
FROM orders
WHERE DATE_TRUNC('month', occurred_at) =
      (SELECT DATE_TRUNC('month', MIN(occurred_at)) AS min_month
      FROM orders)
ORDER BY occurred_at;

Subquer-4

Notice that we did not alias the subquery below. This is because when the inner query returns a single value and not a table then the subquery does not need to be aliased.

Examples:

  1. Return the name of the sales rep in each region that has the largest amount in total_amt_usd. This seems faily easy but it is in fact a complex query. The results that we need are spread across multiple tables so first, we would need to combine all the tables into a single table and sum the total amount of sales.
    SELECT sales_reps.name AS sales_rep,
       region.name AS region_name,
       SUM(orders.total_amt_usd) AS total_amt
    FROM sales_reps
    JOIN accounts
    ON sales_reps.id = accounts.sales_rep_id
    JOIN orders
    ON orders.account_id = accounts.id
    JOIN region
    ON region.id = sales_reps.region_id
    GROUP BY 1, 2
    ORDER BY 3;
    

Once we run this query, we get the following output:

Subquery-5

The next step would be to get the highest /MAX total amount for each region. We can do this by adding a subquery:

SELECT region_name, MAX(total_amt) AS max_total_amt,
FROM (SELECT sales_reps.name AS sales_rep,
             region.name AS region_name,
             SUM(orders.total_amt_usd) AS total_amt
       FROM sales_reps
       JOIN accounts
       ON sales_reps.id = accounts.sales_rep_id
       JOIN orders
       ON orders.account_id = accounts.id
       JOIN region
       ON region.id = sales_reps.region_id
       GROUP BY 1, 2) AS table1
GROUP BY 1;

Now we get:

Subquery-6

Last, we can create another subquery and join the two tables above by matching the region and the amount:

SELECT table3.sales_rep, 
       table3.region_name,
       table3.total_amt
FROM (SELECT region_name, MAX(total_amt) AS max_total_amt
FROM (SELECT sales_reps.name AS sales_rep,
             region.name AS region_name,
             SUM(orders.total_amt_usd) AS total_amt
       FROM sales_reps
       JOIN accounts
       ON sales_reps.id = accounts.sales_rep_id
       JOIN orders
       ON orders.account_id = accounts.id
       JOIN region
       ON region.id = sales_reps.region_id
       GROUP BY 1, 2) AS table1
GROUP BY 1) AS table2
JOIN (SELECT sales_reps.name AS sales_rep,
             region.name AS region_name,
             SUM(orders.total_amt_usd) AS total_amt
      FROM sales_reps
      JOIN accounts
      ON sales_reps.id = accounts.sales_rep_id
      JOIN orders
      ON orders.account_id = accounts.id
      JOIN region
      ON region.id = sales_reps.region_id
      GROUP BY 1, 2
ORDER BY 3 DESC) AS table3
ON table3.region_name = table2.region_name AND table3.total_amt = table2.max_total_amt;

Subquery-8

WITH Statements

WITH statements also known as Common Table Expressions CTE have the same purpose as subqueries but they are much common in practice as they improve efficiency and readability. As we saw in the subquery example above, subqueries can get pretty complex and its logic becomes hard to follow. CTEs can help break these subqueries into separate components so that the logic is easier to follow. When we run a CTE, it creates a table just as any other table in the database from which we can do further queries. Let’s look at an example. We need to find the daily number of events from each channel. Before we solve it using the following query:

SELECT channel, AVG(events) AS average_events
FROM (SELECT DATE_TRUNC('day',occurred_at) AS day,
             channel, COUNT(*) as events
      FROM web_events 
      GROUP BY 1,2) sub
GROUP BY channel
ORDER BY 2 DESC;

We can improve this query by creating a CTE using a WITH statement. First, we put the subquery inside the WITH statement and alias the table as events:

WITH events AS (
            SELECT DATE_TRUNC('day',occurred_at) AS day,
             channel, COUNT(*) as events
      FROM web_events 
      GROUP BY 1,2)

Now that the new table events has been created, we can query directly from this table:

WITH events AS (
          SELECT DATE_TRUNC('day',occurred_at) AS day, 
                            channel, COUNT(*) as events
          FROM web_events 
          GROUP BY 1,2)

SELECT channel, AVG(events) AS average_events
FROM events
GROUP BY channel
ORDER BY 2 DESC;

In this example, only one table was created but we have the ability to create as many tables as necessary with the following syntax:

WITH table1 AS (
          SELECT *
          FROM web_events),

     table2 AS (
          SELECT *
          FROM accounts)


SELECT *
FROM table1
JOIN table2
ON table1.account_id = table2.id;

Example

  1. Return the name of the sales rep in each region that has the largest amount in total_amt_usd. We already answer this question using sub queries but now– lets use a WITH statement.
    WITH t1 AS (
      SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
       FROM sales_reps s
       JOIN accounts a
       ON a.sales_rep_id = s.id
       JOIN orders o
       ON o.account_id = a.id
       JOIN region r
       ON r.id = s.region_id
       GROUP BY 1,2
       ORDER BY 3 DESC), 
    t2 AS (
       SELECT region_name, MAX(total_amt) total_amt
       FROM t1
       GROUP BY 1)
    SELECT t1.rep_name, t1.region_name, t1.total_amt
    FROM t1
    JOIN t2
    ON t1.region_name = t2.region_name AND t1.total_amt = t2.total_amt;
    
  2. Return how many orders were placed for the region with the largest total_smt_usd:
    WITH t1 AS (
       SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
       FROM sales_reps s
       JOIN accounts a
       ON a.sales_rep_id = s.id
       JOIN orders o
       ON o.account_id = a.id
       JOIN region r
       ON r.id = s.region_id
       GROUP BY r.name), 
    t2 AS (
       SELECT MAX(total_amt)
       FROM t1)
    SELECT r.name, COUNT(o.total) total_orders
    FROM sales_reps s
    JOIN accounts a
    ON a.sales_rep_id = s.id
    JOIN orders o
    ON o.account_id = a.id
    JOIN region r
    ON r.id = s.region_id
    GROUP BY r.name
    HAVING SUM(o.total_amt_usd) = (SELECT * FROM t2);
    

Tags:

Updated: