SQL for Data Analytics IV

3 minute read

Notes on Postgresql for Data Analysis – IV

Cleaning Data

In SQL we have the option to clean up our data so that it is more suitable for analysis. To achieve this we can use the following functions:

LEFT

The LEFT function pulls characters from the left side of the string and presents them as separate strings. For instance, if we would want to pull the 3 digits area code from a list of phone numbers– we’d do the following:

SELECT name,
       phone_number
       LEFT(phone_number, 3) AS area_code

The RIGHT function, on the other hand, pulls characters from the right side of the string and represents them as a separate string so to build up form the example above– we would add the following line:

RIGHT(phone_number, 8) AS phone_num_only

LENGTH

The LENGTH function returns the length of the string.

The LEFT & RIGHT functions are pretty straight forward and they work perfectly for data fields that are well-defined, clean and has a certain number of characters, such as the phone number. Location data such as city, state aren’t as easy to work with using left and right. For these we can use functions:

POSITION

The POSITION function returns the position of a string counting from the left. This function also allows us to specify a sub-string and it returns a numerical value that is equal to how many spaces from the left that particular value is.

POSITION(',', IN city_state) AS comma_position

The STRPOS is a function equivalent to POSITION but with a slightly different syntax:

STRPOS(city_state, ',') AS sub_comma_pos

Keep in mind that both functions POSITION & STRPOS are case sensitive so if we want to look for a character regardless of its case then we must convert the whole string into lower or upper case with the use of the following functions:

LOWER(city_state) AS lower

UPPER(city_state) AS upper

Now to retrieve the city from the city-state pair we would run the following line– notice we include a $-1$ in the query to remove the comma:

LEFT(city_state, POSITION(',' IN city_state) -1) AS city

Example:

  1. Separate the full name into first_name & last_name
    SELECT LEFT(primary_poc, STRPOS(primary_poc, ' ') -1 ) first_name, 
       RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) last_name
    FROM accounts;
    

    What if we want to join the first and last name fields? Then we will use the CONCATENATE function.

CONCATENATE(first_name, ' ', last_name) AS full_name

Alternatively, we can use a double pipw || to perform the same concatenation:

first_name || ' ' || last_name AS full_name_alt

Example:

Let’s create a fictious enail account by concatenating the first_name, last_name, company_name and .com. The email should have the following format: john.doe@sql.com

WITH t1 AS (
 SELECT LEFT(primary_poc, STRPOS(primary_poc, ' ') -1 ) first_name,  
        RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) last_name, name
 FROM accounts)
SELECT first_name, last_name, CONCAT(first_name, '.', last_name, '@', name, '.com')
FROM t1;

SQL_IMG1

In real-world applications, it is commmon to find disparate data. This is especially the case when we are working with dates. Dates can be found in a variety of formats and it’s likely that they will need to be ‘cleaned’ before we work with them. Let’s assume that each portion of the date has been entered as a separate column and that the month is in text form. To fix this, we would need to convert the month to its numeric value, concatenate all date columns and then cast this result into a date type that SQL can understand. We would approach it like this:

# first clean month
DATE_PART('month', TO_DATE(month, 'month')) AS clean_month

#  second -- concatenate all date fields
year || '-' || DATE_PART('month', TO_DATE(month, 'month')) || '-' || day AS concat_date

# cast into date type

CAST(year || '-' || DATE_PART('month', TO_DATE(month, 'month')) || '-' || day) AS formatted_date

CAST

The CAST function allows us to change columns from one data type to another. We saw its syntax above but we can also use a short hand. Rather that wrapping all the argunments inside the CAST function we can simply add :: at the end of the statement:

(year || '-' || DATE_PART('month', TO_DATE(month, 'month')) || '-' || day):: AS formatted_date

Tags:

Updated: