SQL for Data Analytics III
Notes on Postgresql for Data Analysis – III
Subqueries
Subqueries, also known as inner queries and/or nested queries are tools that allows us to answer more complex questions from what we normally can from a single database table. A subquery allows us to query an existing table which itself was created from a prior/inital query to the db. Let’s look at the following example:
- This is the inner query:
SELECT DATE_TRUNC('day', occurred_at) AS day, channel, COUNT(*) AS event_count FROM web_events GROUP BY 1, 2 ORDER BY 1 LIMIT 5;
and it will return the following output:
- We can now create a subquery to get more insights from the data we retrieved from the inner query. We do it by wrapping the inner query in parenthesis and type the statements for the subquery at the the top of the inner query:
SELECT channel, AVG(event_count) AS avg_event_count FROM (SELECT DATE_TRUNC('day', occurred_at) AS day, channel, COUNT(*) AS event_count FROM web_events GROUP BY 1, 2) AS sub GROUP BY 1 ORDER BY 2 DESC LIMIT 5;
Here we see that in the subquery we are selecting the channel and also, we are aggregating on the
AVG
of theevent_count
columnn that was produced by the inner query. Keep in mind that the inner query always runs first and creates the initial output table. The we use theFROM
on the inner query because we want to retrieve data from the inner query. Notice that after the closing parenthesis of the inner query we alias the subqueryAS sub
and we can also pass other arguments suchGROUP BY
,ORDER BY
andLIMIT
to our subquery. It is best practice to write queries in a readable manner. Although the intended output would be the same– do not write a query in a single line. This makes the query difficult to read. Indentation works best at making the query more readable as seen above.
In the query above we created a table by running an inner query and the queried that table using FROM
in the subquery. This worked because we returned a table but if the inner query returns a single value then we we must use that value in a logical statement such as WHERE
, HAVING
, CASE
and even a SELECT
statement. As an example, let’s say that we want to look at all the orders from the month the that the first ever order was placed. We would use a inner query to return the date/month for the first order and then use a subquery to query the other conditions.
SELECT DATE_TRUNC('month', MIN(occurred_at))
FROM orders;
This will return the date we are after:
Now we can create a subquery to get all the data about the orders that were placed in the same month as the first order:
SELECT *
FROM orders
WHERE DATE_TRUNC('month', occurred_at) =
(SELECT DATE_TRUNC('month', MIN(occurred_at)) AS min_month
FROM orders)
ORDER BY occurred_at;
Notice that we did not alias the subquery below. This is because when the inner query returns a single value and not a table then the subquery does not need to be aliased.
Examples:
- Return the name of the sales rep in each region that has the largest amount in
total_amt_usd
. This seems faily easy but it is in fact a complex query. The results that we need are spread across multiple tables so first, we would need to combine all the tables into a single table and sum the total amount of sales.SELECT sales_reps.name AS sales_rep, region.name AS region_name, SUM(orders.total_amt_usd) AS total_amt FROM sales_reps JOIN accounts ON sales_reps.id = accounts.sales_rep_id JOIN orders ON orders.account_id = accounts.id JOIN region ON region.id = sales_reps.region_id GROUP BY 1, 2 ORDER BY 3;
Once we run this query, we get the following output:
The next step would be to get the highest /MAX
total amount for each region. We can do this by adding a subquery:
SELECT region_name, MAX(total_amt) AS max_total_amt,
FROM (SELECT sales_reps.name AS sales_rep,
region.name AS region_name,
SUM(orders.total_amt_usd) AS total_amt
FROM sales_reps
JOIN accounts
ON sales_reps.id = accounts.sales_rep_id
JOIN orders
ON orders.account_id = accounts.id
JOIN region
ON region.id = sales_reps.region_id
GROUP BY 1, 2) AS table1
GROUP BY 1;
Now we get:
Last, we can create another subquery and join the two tables above by matching the region and the amount:
SELECT table3.sales_rep,
table3.region_name,
table3.total_amt
FROM (SELECT region_name, MAX(total_amt) AS max_total_amt
FROM (SELECT sales_reps.name AS sales_rep,
region.name AS region_name,
SUM(orders.total_amt_usd) AS total_amt
FROM sales_reps
JOIN accounts
ON sales_reps.id = accounts.sales_rep_id
JOIN orders
ON orders.account_id = accounts.id
JOIN region
ON region.id = sales_reps.region_id
GROUP BY 1, 2) AS table1
GROUP BY 1) AS table2
JOIN (SELECT sales_reps.name AS sales_rep,
region.name AS region_name,
SUM(orders.total_amt_usd) AS total_amt
FROM sales_reps
JOIN accounts
ON sales_reps.id = accounts.sales_rep_id
JOIN orders
ON orders.account_id = accounts.id
JOIN region
ON region.id = sales_reps.region_id
GROUP BY 1, 2
ORDER BY 3 DESC) AS table3
ON table3.region_name = table2.region_name AND table3.total_amt = table2.max_total_amt;
WITH Statements
WITH
statements also known as Common Table Expressions CTE have the same purpose as subqueries but they are much common in practice as they improve efficiency and readability.
As we saw in the subquery example above, subqueries can get pretty complex and its logic becomes hard to follow. CTEs can help break these subqueries into separate components so that the logic is easier to follow. When we run a CTE, it creates a table just as any other table in the database from which we can do further queries. Let’s look at an example.
We need to find the daily number of events from each channel. Before we solve it using the following query:
SELECT channel, AVG(events) AS average_events
FROM (SELECT DATE_TRUNC('day',occurred_at) AS day,
channel, COUNT(*) as events
FROM web_events
GROUP BY 1,2) sub
GROUP BY channel
ORDER BY 2 DESC;
We can improve this query by creating a CTE using a WITH
statement. First, we put the subquery inside the WITH
statement and alias the table as events:
WITH events AS (
SELECT DATE_TRUNC('day',occurred_at) AS day,
channel, COUNT(*) as events
FROM web_events
GROUP BY 1,2)
Now that the new table events has been created, we can query directly from this table:
WITH events AS (
SELECT DATE_TRUNC('day',occurred_at) AS day,
channel, COUNT(*) as events
FROM web_events
GROUP BY 1,2)
SELECT channel, AVG(events) AS average_events
FROM events
GROUP BY channel
ORDER BY 2 DESC;
In this example, only one table was created but we have the ability to create as many tables as necessary with the following syntax:
WITH table1 AS (
SELECT *
FROM web_events),
table2 AS (
SELECT *
FROM accounts)
SELECT *
FROM table1
JOIN table2
ON table1.account_id = table2.id;
Example
- Return the name of the sales rep in each region that has the largest amount in
total_amt_usd
. We already answer this question using sub queries but now– lets use aWITH
statement.WITH t1 AS ( SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt FROM sales_reps s JOIN accounts a ON a.sales_rep_id = s.id JOIN orders o ON o.account_id = a.id JOIN region r ON r.id = s.region_id GROUP BY 1,2 ORDER BY 3 DESC), t2 AS ( SELECT region_name, MAX(total_amt) total_amt FROM t1 GROUP BY 1) SELECT t1.rep_name, t1.region_name, t1.total_amt FROM t1 JOIN t2 ON t1.region_name = t2.region_name AND t1.total_amt = t2.total_amt;
- Return how many orders were placed for the region with the largest
total_smt_usd
:WITH t1 AS ( SELECT r.name region_name, SUM(o.total_amt_usd) total_amt FROM sales_reps s JOIN accounts a ON a.sales_rep_id = s.id JOIN orders o ON o.account_id = a.id JOIN region r ON r.id = s.region_id GROUP BY r.name), t2 AS ( SELECT MAX(total_amt) FROM t1) SELECT r.name, COUNT(o.total) total_orders FROM sales_reps s JOIN accounts a ON a.sales_rep_id = s.id JOIN orders o ON o.account_id = a.id JOIN region r ON r.id = s.region_id GROUP BY r.name HAVING SUM(o.total_amt_usd) = (SELECT * FROM t2);