Working with Timeseries Data


Notes on Python/pandas for time series data, from DataCamp's Manipulating Time Series course

Date & time series functionality

  • At the core – data types for date & time information
    • Objects for points in time and for periods of time
    • Attributes & methods reflect time-related details
  • Sequences of Dates & Periods
    • Series or DataFrame columns
    • Index: converts the entire DataFrame into a time series
  • Many Series/DataFrame methods rely on time information in the index to provide time-series functionality

The basic building block is pandas Timestamp:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# create a timestamp manually
time_stamp = pd.Timestamp(datetime(2020, 1, 1))

time_stamp
# we can also use a date string instead of a datetime object 
# to create the timestamp
pd.Timestamp('2020-01-01')

Timestamp('2020-01-01 00:00:00')

pd.Timestamp exposes various attributes and methods, so we can access the different elements of the timestamp. For instance:

# to get the year
time_stamp.year

2020

# to get the week day
time_stamp.day_name()

Wednesday

We can also access more attributes and methods on pd.Timestamp, such as: .second, .minute, .hour, .day, .month, .quarter, .year, .weekday, .dayofweek, .weekofyear, .dayofyear
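For example, continuing with the time_stamp created above (the expected values are shown as comments):

# a few more elements of the same timestamp
time_stamp.quarter     # 1
time_stamp.dayofweek   # 2 (Monday is 0, so Wednesday is 2)
time_stamp.dayofyear   # 1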

Period & Frequency

The pd.Period object has a freq attribute that stores the frequency info:

period = pd.Period('2020-01')
period

Period('2020-01', 'M')

# change frequency to daily 'D'
period.asfreq('D')

Period('2020-01-31', 'D')

# we can do simple math with periods
period + 2

Period('2020-03', 'M')

There are other frequencies (and their aliases) that we can use besides month 'M' and day 'D':

  • Hour: H
  • Day: D
  • Week: W
  • Month: M
  • Quarter: Q
  • Year: A
  • Business Day: B
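As a quick sketch of how a couple of these aliases behave (the expected results are shown as comments):

# the same monthly period expressed at other frequencies
pd.Period('2020-01', freq='M').asfreq('Q')   # Period('2020Q1', 'Q-DEC')
pd.Period('2020-01', freq='M').asfreq('A')   # Period('2020', 'A-DEC')

# a short range of business days (weekends are skipped)
pd.date_range(start='2020-01-01', periods=4, freq='B')
# DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-06'], dtype='datetime64[ns]', freq='B')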

# create a sequence of timestamps
# 12 monthly timestamps

index = pd.date_range(start='2019-01-01', periods=12, freq='M')
index
DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',
               '2019-05-31', '2019-06-30', '2019-07-31', '2019-08-31',
               '2019-09-30', '2019-10-31', '2019-11-30', '2019-12-31'],
              dtype='datetime64[ns]', freq='M')

A DatetimeIndex is a sequence of Timestamp objects with frequency info. We can convert a DatetimeIndex to a PeriodIndex:

index[0]

Timestamp('2019-01-31 00:00:00', freq='M')

index.to_period()
PeriodIndex(['2019-01', '2019-02', '2019-03', '2019-04', '2019-05', '2019-06',
             '2019-07', '2019-08', '2019-09', '2019-10', '2019-11', '2019-12'],
            dtype='period[M]', freq='M')
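The conversion also works the other way round: a PeriodIndex can be turned back into a DatetimeIndex with .to_timestamp(), which by default should give the start of each period:

index.to_period().to_timestamp()
# should give month-start timestamps, e.g. DatetimeIndex(['2019-01-01', '2019-02-01', ...], freq='MS')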

Now that we have a DatetimeIndex, we can use it to create a time-series DataFrame.

data = np.random.random(size=(12,2))

df = pd.DataFrame(data=data, index=index)

df.head(10)

                   0	       1
2019-01-31	0.685220	0.456167
2019-02-28	0.809490	0.905552
2019-03-31	0.454890	0.683387
2019-04-30	0.860495	0.341041
2019-05-31	0.156462	0.030216
2019-06-30	0.885845	0.085404
2019-07-31	0.995309	0.719039
2019-08-31	0.417888	0.032380
2019-09-30	0.167757	0.773513
2019-10-31	0.631097	0.583592
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 12 entries, 2019-01-31 to 2019-12-31
Freq: M
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       12 non-null     float64
 1   1       12 non-null     float64
dtypes: float64(2)
memory usage: 608.0 bytes
# print the day of week and the day's name for the df above

for day in index:
  print(day.dayofweek, day.day_name())

3 Thursday
3 Thursday
6 Sunday
1 Tuesday
4 Friday
6 Sunday
2 Wednesday
5 Saturday
0 Monday
3 Thursday
5 Saturday
1 Tuesday

Timeseries Transformations

Basic timeseries transformations include:

  • Parsing string dates and converting them to datetime64

  • Selecting & slicing for specific subperiods

  • Setting & changing DatetimeIndex frequency:

    • Upsampling (creating more data) vs Downsampling (aggregating data)

When we load a dataset (a .csv file) that contains dates, most of the time these dates will be read in as plain strings (object dtype). To convert them to the correct datatype, pandas provides the pd.to_datetime() function, which parses the date strings and converts them to datetime64.

df.date = pd.to_datetime(df.date)
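If the string layout is known in advance, passing an explicit format can make the parsing unambiguous (and often faster). The pattern below is an assumption for strings like '2015-02-04 17:51:00':

df.date = pd.to_datetime(df.date, format='%Y-%m-%d %H:%M:%S')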

Now that it is converted to the correct data type, we can set it as index:

df.set_index('date', inplace=True)

We now have a complete timeseries dataframe.

TIP: When using pd.read_csv we can parse the dates and set the index in one step:

df = pd.read_csv("datasetname.csv", parse_dates=['column_name_with_dates'], index_col='column_name_with_dates')

df = pd.read_csv('/occupancy.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8143 entries, 0 to 8142
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           8143 non-null   object 
 1   Temperature    8143 non-null   float64
 2   Humidity       8143 non-null   float64
 3   Light          8143 non-null   float64
 4   CO2            8143 non-null   float64
 5   HumidityRatio  8143 non-null   float64
 6   Occupancy      8143 non-null   int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 445.4+ KB
df.head()
date	Temperature	Humidity	Light	CO2	HumidityRatio	Occupancy
0	2015-02-04 17:51:00	23.18	27.2720	426.0	721.25	0.004793	1
1	2015-02-04 17:51:59	23.15	27.2675	429.5	714.00	0.004783	1
2	2015-02-04 17:53:00	23.15	27.2450	426.0	713.50	0.004779	1
3	2015-02-04 17:54:00	23.15	27.2000	426.0	708.25	0.004772	1
4	2015-02-04 17:55:00	23.10	27.2000	426.0	704.50	0.004757	1
df.date = pd.to_datetime(df.date)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8143 entries, 0 to 8142
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           8143 non-null   datetime64[ns]
 1   Temperature    8143 non-null   float64       
 2   Humidity       8143 non-null   float64       
 3   Light          8143 non-null   float64       
 4   CO2            8143 non-null   float64       
 5   HumidityRatio  8143 non-null   float64       
 6   Occupancy      8143 non-null   int64         
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 445.4 KB
df.set_index('date', inplace=True)
df.head()
	Temperature	Humidity	Light	CO2	HumidityRatio	Occupancy
date						
2015-02-04 17:51:00	23.18	27.2720	426.0	721.25	0.004793	1
2015-02-04 17:51:59	23.15	27.2675	429.5	714.00	0.004783	1
2015-02-04 17:53:00	23.15	27.2450	426.0	713.50	0.004779	1
2015-02-04 17:54:00	23.15	27.2000	426.0	708.25	0.004772	1
2015-02-04 17:55:00	23.10	27.2000	426.0	704.50	0.004757	1
# plot the df

df.plot(title="Occupancy DF Plot", subplots=True, figsize=(12, 10))
plt.tight_layout()
plt.show()

Subplot_1

Partial String Indexing

We can pass a string representing a period of time (a year, for instance) to retrieve all the values for that particular year (in recent pandas versions this kind of row selection should be done with .loc, e.g. df.loc['2015']):

df['2015'].info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8143 entries, 2015-02-04 17:51:00 to 2015-02-10 09:33:00
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Temperature    8143 non-null   float64
 1   Humidity       8143 non-null   float64
 2   Light          8143 non-null   float64
 3   CO2            8143 non-null   float64
 4   HumidityRatio  8143 non-null   float64
 5   Occupancy      8143 non-null   int64  
dtypes: float64(5), int64(1)
memory usage: 445.3 KB

We can also pass a ‘slice’ or a range of dates that we want to retrieve data for:

df['2015-2-5' : '2015-2-6'].info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2880 entries, 2015-02-05 00:00:00 to 2015-02-06 23:58:59
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Temperature    2880 non-null   float64
 1   Humidity       2880 non-null   float64
 2   Light          2880 non-null   float64
 3   CO2            2880 non-null   float64
 4   HumidityRatio  2880 non-null   float64
 5   Occupancy      2880 non-null   int64  
dtypes: float64(5), int64(1)
memory usage: 157.5 KB
# retrieve the CO2 readings for a particular day
df.loc['2015-2-5', 'CO2']
date
2015-02-05 00:00:00    456.500000
2015-02-05 00:01:00    458.500000
2015-02-05 00:02:00    459.666667
2015-02-05 00:03:00    464.000000
2015-02-05 00:04:00    465.000000
                          ...    
2015-02-05 23:55:00    439.500000
2015-02-05 23:55:59    445.000000
2015-02-05 23:57:00    442.500000
2015-02-05 23:57:59    443.500000
2015-02-05 23:58:59    444.000000
Name: CO2, Length: 1440, dtype: float64
# change the frequency of the data to hourly
df.asfreq('H').info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 136 entries, 2015-02-04 17:51:00 to 2015-02-10 08:51:00
Freq: H
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Temperature    136 non-null    float64
 1   Humidity       136 non-null    float64
 2   Light          136 non-null    float64
 3   CO2            136 non-null    float64
 4   HumidityRatio  136 non-null    float64
 5   Occupancy      136 non-null    int64  
dtypes: float64(5), int64(1)
memory usage: 7.4 KB
df.asfreq('H').plot(subplots=True, figsize = (12,10));

subplot_2
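The asfreq('H') call above simply reindexes the minute-level data onto an hourly grid. The upsampling vs downsampling distinction from the transformations list can be sketched roughly as follows (the target frequencies and the forward-fill method are just illustrative choices):

# downsample the minute-level data to daily averages (aggregation)
df.resample('D').mean()

# upsample to 30-second intervals, forward-filling the newly created rows
df.asfreq('30S', method='ffill')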

# import library for example datasets
from vega_datasets import data as vds

df_stocks = vds.stocks()
df_stocks.head(5)
	symbol	date	price
0	MSFT	2000-01-01	39.81
1	MSFT	2000-02-01	36.35
2	MSFT	2000-03-01	43.22
3	MSFT	2000-04-01	28.37
4	MSFT	2000-05-01	25.45
df_stocks.tail()
	symbol	date	price
555	AAPL	2009-11-01	199.91
556	AAPL	2009-12-01	210.73
557	AAPL	2010-01-01	192.06
558	AAPL	2010-02-01	204.62
559	AAPL	2010-03-01	223.02

df_stocks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 560 entries, 0 to 559
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   symbol  560 non-null    object        
 1   date    560 non-null    datetime64[ns]
 2   price   560 non-null    float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 13.2+ KB
# Create a new dataframe only for AAPL stock

# use .copy() so that adding columns later does not raise a SettingWithCopyWarning
df_aapl = df_stocks.loc[df_stocks['symbol'] == 'AAPL'].copy()

df_aapl.head()
	symbol	date	price
437	AAPL	2000-01-01	25.94
438	AAPL	2000-02-01	28.66
439	AAPL	2000-03-01	33.95
440	AAPL	2000-04-01	31.01
441	AAPL	2000-05-01	21.00
# set date as index
df_aapl.set_index('date', inplace=True)

# Create an empty dataframe for stock prices
prices = pd.DataFrame()

# Select data for each year and concatenate with prices
for year in ['2007', '2008', '2009']:
    price_per_year = df_aapl.loc[year, ['price']].reset_index(drop=True)
    price_per_year.rename(columns={'price': year}, inplace=True)
    prices = pd.concat([prices, price_per_year], axis=1)

# Plot prices
prices.plot(title="AAPL Stock for 2007 to 2009")
plt.show()

graph_3

Basic Timeseries Calculations

Some of the basic calculations we can do with a time-series dataset in pandas, based on the DatetimeIndex:

  • Shift or lag values backward or forward in time
  • Get the difference in values for a given time period
  • Compute the percent change over any number of periods

df_aapl.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 123 entries, 437 to 559
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   symbol  123 non-null    object        
 1   date    123 non-null    datetime64[ns]
 2   price   123 non-null    float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 3.8+ KB

Moving Data between past & future

Shifting data is useful so that values can be compared at different points in time. As the name indicates, .shift() moves the data along the DatetimeIndex. .shift():

  • defaults to periods=1, i.e. it moves the data one period into the future
  • a positive periods value shifts the data forward in time, while a negative value (e.g. periods=-1) lags the data backwards
df_aapl['shifted'] = df_aapl.price.shift()
df_aapl.head()
	symbol	price	shifted
date			
2000-01-01	AAPL	25.94	NaN
2000-02-01	AAPL	28.66	25.94
2000-03-01	AAPL	33.95	28.66
2000-04-01	AAPL	31.01	33.95
2000-05-01	AAPL	21.00	31.01
df_aapl['lagged'] = df_aapl.price.shift(periods=-1)
df_aapl.tail()
	symbol	price	shifted	lagged
date				
2009-11-01	AAPL	199.91	188.50	210.73
2009-12-01	AAPL	210.73	199.91	192.06
2010-01-01	AAPL	192.06	210.73	204.62
2010-02-01	AAPL	204.62	192.06	223.02
2010-03-01	AAPL	223.02	204.62	NaN

Calculate the one-period rate of change

We can use the shifted data we just created to calculate the change between periods, which is the basis of the financial return. The rate of change is calculated as follows:

\[x_t / x_{t-1}\]
# create a new column and calculate the financial return
df_aapl['change'] = df_aapl.price.div(df_aapl.shifted)
df_aapl.head()
	symbol	price	shifted	lagged	change
date					
2000-01-01	AAPL	25.94	NaN	28.66	NaN
2000-02-01	AAPL	28.66	25.94	33.95	1.104857
2000-03-01	AAPL	33.95	28.66	31.01	1.184578
2000-04-01	AAPL	31.01	33.95	21.00	0.913402
2000-05-01	AAPL	21.00	31.01	26.19	0.677201

Calculating one-period percent change
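Expressed as a formula, the return column computed below is the percent change:

\[\left(\frac{x_t}{x_{t-1}} - 1\right) \times 100\]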

# create a new column: subtract 1 and multiply by 100 to get the percent return
df_aapl['return'] = df_aapl.change.sub(1).mul(100)
df_aapl.head()
	symbol	price	shifted	lagged	change	return
date						
2000-01-01	AAPL	25.94	NaN	28.66	NaN	NaN
2000-02-01	AAPL	28.66	25.94	33.95	1.104857	10.485736
2000-03-01	AAPL	33.95	28.66	31.01	1.184578	18.457781
2000-04-01	AAPL	31.01	33.95	21.00	0.913402	-8.659794
2000-05-01	AAPL	21.00	31.01	26.19	0.677201	-32.279910
Built-in timeseries change

.diff()

  • This method calculates the difference in value between two adjacent periods. In finance, it is used to calculate the change in close price since the last day the stock was traded.

\[x_t - x_{t-1}\]
# create a new column and calculate the diff()

df_aapl['diff'] = df_aapl.price.diff()
df_aapl.head()
	symbol	price	shifted	lagged	change	return	diff
date							
2000-01-01	AAPL	25.94	NaN	28.66	NaN	NaN	NaN
2000-02-01	AAPL	28.66	25.94	33.95	1.104857	10.485736	2.72
2000-03-01	AAPL	33.95	28.66	31.01	1.184578	18.457781	5.29
2000-04-01	AAPL	31.01	33.95	21.00	0.913402	-8.659794	-2.94
2000-05-01	AAPL	21.00	31.01	26.19	0.677201	-32.279910	-10.01
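A quick sanity check, using the columns already in the dataframe: the new diff column should match the price minus the shifted price, so the comparison below should return True.

# .diff() is equivalent to subtracting the shifted series from the original
df_aapl.price.sub(df_aapl.shifted).equals(df_aapl['diff'])   # True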

Calculating % change

.pct_change()

  • Calculates the percent change for two adjacent periods
\[\frac{x_t - x_{t-1}}{x_{t-1}}\]
# create a new column and calculate the percentage change

df_aapl['pct_change'] = df_aapl.price.pct_change().mul(100)
df_aapl.head()
	symbol	price	shifted	lagged	change	return	diff	pct_change
date								
2000-01-01	AAPL	25.94	NaN	28.66	NaN	NaN	NaN	NaN
2000-02-01	AAPL	28.66	25.94	33.95	1.104857	10.485736	2.72	10.485736
2000-03-01	AAPL	33.95	28.66	31.01	1.184578	18.457781	5.29	18.457781
2000-04-01	AAPL	31.01	33.95	21.00	0.913402	-8.659794	-2.94	-8.659794
2000-05-01	AAPL	21.00	31.01	26.19	0.677201	-32.279910	-10.01	-32.279910
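The earlier bullet list mentioned computing the percent change over any number of periods; .pct_change() accepts a periods argument for that. A minimal sketch (the column name pct_change_3m is just an illustrative choice):

# percent change versus 3 periods (here, 3 months) earlier
df_aapl['pct_change_3m'] = df_aapl.price.pct_change(periods=3).mul(100)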

Exercise:

Set the frequency of the data to business days, shift the AAPL stock price 90 business days into the past and into the future, and plot the results. The exercise assumes the df contains daily prices; the AAPL data used here is monthly, so converting it to business-day frequency will mostly insert missing values.

# set the frequency to business days
df_apple = df_aapl[['price']].asfreq('B')

# create lagged and shifted columns
df_apple['lagged'] = df_apple.price.shift(periods=-90)
df_apple['shifted'] = df_apple.price.shift(periods=90)

# plot the price, lagged and shifted columns
df_apple[['price', 'lagged', 'shifted']].plot(title="Apple Stock Price Comparison 90 Days into Past and Future")
plt.show()
