SQL for Data Analytics IV
Notes on Postgresql for Data Analysis – IV
Cleaning Data
In SQL we have the option to clean up our data so that it is more suitable for analysis. To achieve this we can use the following functions:
LEFT
The LEFT
function pulls characters from the left side of the string and presents them as separate strings. For instance, if we would want to pull the 3 digits area code from a list of phone numbers– we’d do the following:
SELECT name,
phone_number
LEFT(phone_number, 3) AS area_code
RIGHT
The RIGHT
function, on the other hand, pulls characters from the right side of the string and represents them as a separate string so to build up form the example above– we would add the following line:
RIGHT(phone_number, 8) AS phone_num_only
LENGTH
The LENGTH
function returns the length of the string.
The LEFT
& RIGHT
functions are pretty straight forward and they work perfectly for data fields that are well-defined, clean and has a certain number of characters, such as the phone number. Location data such as city, state aren’t as easy to work with using left and right. For these we can use functions:
POSITION
The POSITION
function returns the position of a string counting from the left. This function also allows us to specify a sub-string and it returns a numerical value that is equal to how many spaces from the left that particular value is.
POSITION(',', IN city_state) AS comma_position
The STRPOS
is a function equivalent to POSITION
but with a slightly different syntax:
STRPOS(city_state, ',') AS sub_comma_pos
Keep in mind that both functions POSITION
& STRPOS
are case sensitive so if we want to look for a character regardless of its case then we must convert the whole string into lower or upper case with the use of the following functions:
LOWER(city_state) AS lower
UPPER(city_state) AS upper
Now to retrieve the city from the city-state pair we would run the following line– notice we include a $-1$ in the query to remove the comma:
LEFT(city_state, POSITION(',' IN city_state) -1) AS city
Example:
- Separate the full name into
first_name
&last_name
SELECT LEFT(primary_poc, STRPOS(primary_poc, ' ') -1 ) first_name, RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) last_name FROM accounts;
What if we want to join the first and last name fields? Then we will use the
CONCATENATE
function.
CONCATENATE(first_name, ' ', last_name) AS full_name
Alternatively, we can use a double pipw ||
to perform the same concatenation:
first_name || ' ' || last_name AS full_name_alt
Example:
Let’s create a fictious enail account by concatenating the first_name
, last_name
, company_name
and .com
. The email should have the following format: john.doe@sql.com
WITH t1 AS (
SELECT LEFT(primary_poc, STRPOS(primary_poc, ' ') -1 ) first_name,
RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) last_name, name
FROM accounts)
SELECT first_name, last_name, CONCAT(first_name, '.', last_name, '@', name, '.com')
FROM t1;
In real-world applications, it is commmon to find disparate data. This is especially the case when we are working with dates. Dates can be found in a variety of formats and it’s likely that they will need to be ‘cleaned’ before we work with them.
Let’s assume that each portion of the date has been entered as a separate column and that the month is in text form. To fix this, we would need to convert the month to its numeric value, concatenate all date columns and then cast this result into a date
type that SQL can understand. We would approach it like this:
# first clean month
DATE_PART('month', TO_DATE(month, 'month')) AS clean_month
# second -- concatenate all date fields
year || '-' || DATE_PART('month', TO_DATE(month, 'month')) || '-' || day AS concat_date
# cast into date type
CAST(year || '-' || DATE_PART('month', TO_DATE(month, 'month')) || '-' || day) AS formatted_date
CAST
The CAST
function allows us to change columns from one data type to another. We saw its syntax above but we can also use a short hand. Rather that wrapping all the argunments inside the CAST
function we can simply add ::
at the end of the statement:
(year || '-' || DATE_PART('month', TO_DATE(month, 'month')) || '-' || day):: AS formatted_date