SQL for Data Analytics I

17 minute read

Notes on SQL for Data Analytics –

Entity Relational Diagrams

An Entity Relational Diagram (ERD) is a common way to view data in a database. These diagrams help us visualize data and they include:

  1. The names of the tables
  2. The columns in each table
  3. The way the tables work together

We normally store data in spreadsheets. We can have multiple data fields spread spread across multiple spreadheets. We can use an ERD to make sense of these data, where each spreasheet is represented by a table. At the top of the diagram we’ll see the name of the spreadsheet/table and below the column names will be lsited. Using SQL we can access to all of these data by querying one or multiple tables.

Structured Query Language

SQL is the standard way to access data stored is relational databases and it has many advantages:

  • SQL is relatively easy to understand and learn
  • It allows us to access large amounts of data directly where it is stored without the need to copy the data into some other program to view it and it reduces overload, meaning ‘spreadsheet won’t crash while handling large volumes of data’
  • It is easy to audit and replicate. For instance, in Excel we’d need to click at individual cells to see how they are calculated whereas in SQL we can read a query from top to bottom.
  • SQL can run queries across multiple tables at once which is great for performing aggregations that are similar to Pivot tables in Excel with a majot difference that Excel maxes out at ~1 millon rows and SQL can perform these operations across billions of rows at the time.

Attributes of SQL

  • Data Integrity is ensured – only the data we want to enter is entered and only certain people are able to write to the database (db)
  • Data can be accessed quickly – SQL quickly returns results/data from the db and code can be optimized to improve performance
  • Data can be easily shared – multiple people within an organization can access the data concurrently. Every user would get exactly the same data ensuring consistency.

Key points about the data stored in SQL

  • Data in dbs are stored in tables – just like in Excel spreadhseets
  • All the data in the same column must be of the same datatype (str, int, boolean). If a columnn has numeric/quantitative data then every element of the column must be the same. If we have text in a quantitative column we cannot perform any aggregations (mean, sum, etc.) and would get errors.
  • Consistent datatypes makes working with dbs fast.

Types of SQL Databases

There are several SQL databases such as MySQL, Access, Oracle, Microsoft SQLServer, Postgres SQL. Fundamentally, all these databases are the same and code can be applied across different paltforms but there are subtle differences in syntax and functions. SQL can also be written in other platforms such as Python, Scala and HaDoop. The exercises in this notebook are based on Postgres SQL.

SQL Elements

SQL has different elements. The most basic of them is the Statement. A statement is simply a piece of SQL code. Statements tell the db what we want to do with the data. For instance:

  • CREATE stament – is used to make a new table in the db
  • DROP TABLE – is used to remove a table from the db
  • SELECT – is used to read and display data. SELECT staments are commonly referred as queries.

Let’s look at the SELECT statement in more detail.

We can think of the SELECT stament as a form we have to fill to retrieve data. The form contains questions such as : Where do you want to get your datafrom? What elements from that table would you like to retrieve? We get the idea. These questions are structured in the same order, every time. Some of them are mandatory and other optional. When writing a SELECT stament, these questions are represented by a single word, such as SELECT and FROM These words are called clauses.

The FROM clause tells the query which data/table to use. The SELECT clause tells the query which columns to read from the table. We can select multiple columns from a table by passing the table names separated by commas and we can also select ALL columns by passing an asterisk *. The SELECT and FROM clauses are mandatory

SELECT *
FROM db.table 

Installing PostgreSQL (MAC-OS)

In the shell run:

brew install postgresql

Getting Started

Start PostgreSQL. In the shell run:

brew services start postgresql

To stop:

brew services stop postgresql

Once PostgreSQL has been started:

psql postgres

If executed successfully we are now working inside the psql terminal. To create a new user and password:

CREATE ROLE newuser WITH LOGIN PASSWORD 'password'

ALTER ROLE newuser CREATEDB;

New user has been created, we can now exit the psql terminal so we can login into it again with the new user.

$ psql postgres -U newuser

We should now see:

postgres=>

To create a new database:

postgres=> CREATE DATABASE database_name:

Exit the current terminal q

Access the recently created db by:

psql database_name

To create a table in that db:

database_name=# CREATE TABLE table_name (column_1 INTEGER, column_2 VARCHAR);

I have data in a .csv file that I want to copy to my postgresql database. The csv contains made-up accounts data. It’s name is accounts and this is just one table of the database. To copy the csv into the databse first:

  • Create the table in the database – think of it as create a spreadsheet.
  • Create the columns in the table. They have to match what we have in the spreadsheet.
  • Once the table is set up type the following commands:
    • COPY table_name # the table we want to copy into
    • FROM 'absolute_path_to_your_speadsheet'
    • DELIMITER ',' CSV HEADER;

If executed successfully, we will see a message: COPY 100 or whatever number of records that were copied.

SQL_IMG_1

Queries and Staments

A few things to know about SQL syntax:

  • SQL statements are not case sensitive. We can type the commands in either upper or lower case and they will execute normally but it is good practice to capitalize SQL statements. This will make query easier to read and understand.
  • Avoid spaces in variables and table & column names. Uisng underscores is good practice when naming columns and variables. If there’s a space in a column name, in PostgreSQL we must wrap the column name in double quotes ”” column name”“. In other SQL environments [ ] are used.
  • SQL ingnores spaces in queries. We can write a query in a single as well as multiple lines.
  • The use of semicolons (;) varies depending on the SQL environment. Some environments required the use of semicolons to execute the query. Overal it is good practice to use them.
  • Be diligent when running queries and examining its results. Just because the query runs sucessfully doesn’t mean the output is exactly what we wanted.

We can run queries using the db and table shown above.

SELECT * FROM orders;

This query will return all the data from the orders table. The asterisk means everything.

If we want to retrieve data from a specific column, we pass that column name in the query. If we want more than one column but not all, pass the column names separated by commas:

SELECT id, account_id, total FROM orders;

Note that the query will return all the rows in the table. If we are exploring the data or do not want all the rows we can specify a limit for the returned rows.

SELECT * FROM orders LIMIT 10;

We can sort the data retrieved by the query by using the ORDER BY statement. ORDER BY can be used in a ascending and descending way. As an example, if want to retrieve all the data sorted by the last ten orders from the orders table, I would type this query:

SELECT * FROM orders ORDER BY occurred_at DESC LIMIT 10;

SQL_IMG_2

Retrieve the id, account_id and total_amt_usd data from the 5 largest orders:

SELECT id, account_id, total_amt_usd FROM orders ORDER BY total_amt_usd DESC LIMIT 5;

SQL_IMG_3

We can also use ORDER BY in accross multiple columns. For instance, if we want to order by account_id first and then by order amount:

SELECT id, account_id, total_amt_usd

FROM orders

ORDER BY account_id, total_amt_usd DESC

LIMIT 20;

SQL_IMG_4

If we are looking to filter data by a specific value or criteria we can use the WHERE clause. Let’s say we want to retrieve orders that belong to account_id 1001 only, we would write:

SELECT id, account_id, total_amt_usd

FROM orders

WHERE account_id = 1001

ORDER BY total_amt_usd DESC;

SQL_IMG_5

WHERE statements can be used with numerical operators such as >, <, >=, <=, =. WHERE can also be used on non-numerical data by using operators equals = and not equal != but the evaluating condition must be inside single quotes. WHERE name = 'John'

We can also perform calculations betwen columns ( +, -, *, /) to get insights on our data. The results would be stored in a new column. By default the column would have a generic name but we can alias it using AS to give it a more descriptive name. The column will not be created permanently in the database as it is only created as a result of the query.
These columns are called Derived / Calculated Columns.

As an example calculate the unit price by dividing the standard amount by the standard quantity. Name the derived column unit_price. Return the first 10 items.

SELECT id, account_id, standard_amt_usd, standard_qty, (standard_amt_usd / standard_qty) AS unit_price

FROM orders

LIMIT 10

SQL_IMG_6

Logical Operators

SQL has several Logical Operators such as:

  • LIKE – its functionality is similar to WHERE = but for cases when we do not exactly what we’re looking for. For instance, we have a column containing web traffic. We want to retrieve data that contain the word ‘google’ to see how much of the traffic came from google. If we do WHERE url = 'google' it would return zero results as the URLs contain many more information. To get all the URLs that contain google we write the following query: SELECT * FROM web_traffic WHERE url LIKE '%google%' Note the use of the % wildcard that includes the charaters before and after the word google. Let’s say we want to retrieve the name of companies that start with the leter C we would use LIKE 'C%'

  • IN – similar to WHERE = but for more than one condition. We can use IN with the WHERE statement when we can want to look at info for two companies: WHERE name IN ('Microsoft', 'Apple').

  • NOT – this is used with LIKE and IN but to retrieve data NOT LIKE or NOT IN cases. Let’s use the google example above. If we run the query SELECT * FROM web_traffic WHERE url NOT LIKE '%google%; would return every other url which does not contain the word google. I think of if as an except statement. Select everyhting from the web_traffic table expect results that contain google.

  • AND & BETWEEN these are used to combine operations where all combined conditions must equal to true. For isntance, we can use AND to look at transactions that occurred during a specific period of time: WHERE occurred_at >= '2020-01-01' AND occurred_at < '2020-05-01' The BETWEEN statements makes the query above cleaner and it returns the same output. WHERE occurred_at BETWEEN '2020-01-01' AND '2020-01-01'

  • OR this is used to combine operations where one or more conditions must be equal to true. Similar to AND, OR can combine multiple statements. Keep in mind that we need to specify the column name each time we use OR. If we have multiple OR is good practice to encase them in parenthesis to ensure they are executed correctly.

Joins

In real world application databases all the data do not reside in the same table, instead they are stored in multiple tables. For instance,account data and orders data are two different objects and it is best if they are stored separetely. Storing data this way allows for faster queries because we do not have to query a massive database. When creating a database, it is important to think about how data will be stored. This is known as normalization and it is a very important concept in database design. Things to considered when working on database normalization:

  • Are the tables storing logical groupings of data?
  • Can I make changes in a single location , rather than in many tables from the same information?
  • Can I access and manipulate data efficiently and quickly? As an analyst, normalization is something we will encounter frequently as we will be working with data existing databases rather than creating a new database. But it is important to be aware of this concept and what it represents.

JOINS are important when we are working with multiple tables and when we want to combine data from these tables. To write a JOIN we need:

  1. The SELECT and FROM clauses.
  2. The JOIN clause. Think of JOIN as second JOIN clause as it tells the query the additional table we want to pull data from.
  3. The ON clause which specifies the relationship between these two tables.

Example:

SELECT orders.*, accounts.*

FROM orders

JOIN accounts

ON orders.account_id = accounts.id

LIMIT 5

SQL_IMG_7

In the query above we joined all the columns from both tables. We can join any columns that we want in return by name. Note that the table name always goes on the right side of the ‘.’ and the column name on the left. Let’s say we wanted to get just the name of the customer and the date when the order was placed. We would write:

SELECT accounts.name, orders.occured_at
FROM orders
JOIN accounts
ON orders.account_id = accounts.id
LIMIT 10;

SQL_IMG_10

Types of Joins

Types_of_Joins_IMG

INNER JOIN

When doing an inner join, it only returns rows that are present in both tables. ###LEFT JOIN When we perform a query, the table we pass to FROM is consider the left table and the table we pass to JOIN is the right table. To change the JOIN into an LEFT JOIN we simply add the word LEFT to the JOIN clause. This query will return the results that match with the right table, just like an inner join but– it will also any results in the left table that did not match the right table. Its syntax is:

SELECT
FROM left table
LEFT JOIN right table

RIGHT JOIN

This type of joins acts the opposite of the left join and when used it will return all the matching elements from the right table as well as the elements that did not match from the left table. Keep in mind that right joins are not used often in practice and we barely see them in real applications. Both left and right joins are interchangeable and left joins are the standard.

OUTER JOINS

This type of join will return the inner join result set as well as any unmatched rows from either of the two tables being joined. Outter Joins are also rare in practice.

TIP:

  • If you have two or more columns in a SELECT that have the same name after the table name such as accounts.name and sales_reps.name you will need to alias them. Otherwise it will only show one of the columns.

Keys

Primary Key

A primary key (PK) is a unique column in a particular table. This is the first column in each of our tables.

Foreign Key

A foreign key is a column in one table that is the PK in a different table.

Exercise:

Join the sales_rep and region tables together.

Keys_IMG

ON orders.account_id = accounts.id;

###Joining more than two tables:

Keys_IMG_2

We can use the following SQL code to join the 3 columns:

SELECT *
FROM web_events
JOIN accounts
ON web_events.account_id = accounts.id
JOIN orders
ON accounts.id = orders.account_id

We could also JOIN specific columns form any of the three tables by using a SELECT stament which specifies the table that we want to pull the column from as well as the column name:

SELECT web_events.channel, accounts.name, orders.total
FROM web_events
JOIN accounts
ON web_events.account_id = accounts.id
JOIN orders
ON accounts.id = orders.account_id

Alias

When joining multipel tables together, it is useful and convenient to give each table an alias. Normally, the alias is just the first letter of the column name so if the column name is orders we can alias it as o.

FROM orders AS o
JOIN accounts AS a

It is also possible to create an alias without using the AS. This would produce the same result as above:

FROM order o
JOIN accounts a

Answering questions using SQL based on the following ERD:

Exer_ERD

  1. Return a table for all web_events where the account name is ‘Walmart’. The table should contain data for the primary_poc, time of the event (ocurred_at), and the channel and a column displaying the account name.
    SELECT accounts.name, 
      accounts.primary_poc, 
      web_events.occurred_at, 
      web_events.channel
    FROM accounts
    JOIN web_events
    ON web_events.account_id = accounts.id
    WHERE accounts.name = 'Walmart';
    
  2. Return a table that provides the region for each sales_rep and their associated accounts. Sort results alphabetically on account_name
    SELECT region.name reg_name, 
      sales_reps.name rep, 
      accounts.name
    FROM sales_reps
    JOIN region
    ON sales_reps.region_id = region.id
    JOIN accounts 
    ON accounts.sales_rep_id = sales_reps.id
    ORDER BY accounts.name;
    
  3. Return the name of each region for every order as well as the account name and unit price paid (total_amt_usd / total) be careful not to divide by 0.
    SELECT region.name AS region_name, 
     accounts.name AS account_name, 
     (total_amt_usd / (total + 0.01)) AS unit_price
    FROM accounts
    JOIN orders
    ON accounts.id = orders.account_id
    JOIN sales_reps
    ON sales_reps.id = accounts.sales_rep_id
    JOIN region
    ON sales_reps.region_id = region.id;
    
  4. Return a table that contains the region name, sales rep name and account name for the Midwest region. Sort the accounts names alphabetically.
    SELECT region.name AS region name, 
      sales_reps.name as rep name, 
      accounts.name AS accounts,  
    FROM sales_reps
    JOIN region
    ON sales_reps.region_id = region.id
    JOIN accounts
    ON accounts.sales_reps_id = sales_reps.id
    WHERE region.name = 'Midwest'
    ORDER BY accounts.name
    
  5. Return a table that contains the region name, sales rep name and account name for the Midwest region but only return data for sales reps whose first name starts with S.
    SELECT region.name AS region_name, 
      sales_reps.name as rep_name, 
      accounts.name AS accounts
    FROM sales_reps
    JOIN region
    ON sales_reps.region_id = region.id
    JOIN accounts
    ON accounts.sales_rep_id = sales_reps.id
    WHERE region.name = 'Midwest' AND sales_reps.name LIKE 'S%'
    ORDER BY accounts.name
    
  6. Return a table that contains the region name, sales rep name and account name for the Midwest region but only return data for sales reps whose last name starts with M.
    SELECT region.name AS region_name, 
      sales_reps.name as rep_name, 
      accounts.name AS accounts
    FROM sales_reps
    JOIN region
    ON sales_reps.region_id = region.id
    JOIN accounts
    ON accounts.sales_rep_id = sales_reps.id
    WHERE region.name = 'Midwest' AND sales_reps.name LIKE '% M%'
    ORDER BY accounts.name
    
  7. Return a table that contains the region name, account name and unit price. However, only return data if the standard order quantity is > 100.
    SELECT region.name AS region_name, 
      accounts.name AS accounts,
      (orders.total_amt_usd / (orders.total + 0.01)) AS unit_price
    FROM sales_reps
    JOIN region
    ON sales_reps.region_id = region.id
    JOIN accounts
    ON accounts.sales_rep_id = sales_reps.id
    JOIN orders
    ON orders.account_id = accounts.id
    WHERE orders.standard_qty > 100
    ORDER BY accounts.name
    
  8. Return a table that contains the region name, account name and unit price. However, only return data if the standard order quantity is > 100 and if the poster order quantity is > 50. Order by unit_price.
    SELECT region.name AS region_name, 
      accounts.name AS accounts,
      (orders.total_amt_usd / (orders.total + 0.01)) AS unit_price
    FROM sales_reps
    JOIN region
    ON sales_reps.region_id = region.id
    JOIN accounts
    ON accounts.sales_rep_id = sales_reps.id
    JOIN orders
    ON orders.account_id = accounts.id
    WHERE orders.standard_qty = 100 AND orders.poster_qty > 50
    ORDER BY unit_price
    
  9. Return a table that contains the unique channel values for account 1001 Walmart. Hint: using SELECT DISTINCT returns unique values.
    SELECT DISTINCT accounts.name AS name, 
               web_events.channel AS channel
    FROM accounts
    JOIN web_events
    ON accounts.id = web_events.account_id
    WHERE accounts.id = 1001;
    
  10. Return a table with that contains data on when the transaction took place, the account name, the order total, the order total amount for all the orders that took place in 2008.
    SELECT orders.occurred_at, 
      accounts.name, 
      orders.total,
      orders.total_amt_usd
    FROM accounts
    JOIN orders
    ON orders.account_id = accounts.id
    WHERE orders.occurred_at BETWEEN '01-01-2015' AND '01-01-2016'
    ORDER BY orders.occurred_at DESC;
    

Tags:

Updated: