Working with Timeseries Data
Notes on Python/pandas for TimeSeries Data from Datacamp’s Manipulating Time Series course
Date & time series functionality
- At the core: data types for date & time information
  - Objects for points in time & periods of time
  - Attributes & methods reflect time-related details
- Sequences of dates & periods
  - Series or DataFrame columns
  - Index: converts the entire DataFrame into a time series
- Many Series/DataFrame methods rely on time information in the index to provide time-series functionality
The basic building block is the pandas Timestamp:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
# create a timestamp manually
time_stamp = pd.Timestamp(datetime(2020, 1, 1))
time_stamp
# we can also use a date string instead of a datetime object
# to create the timestamp
pd.Timestamp('2020-01-01')
Timestamp('2020-01-01 00:00:00')
pd.Timestamp has various attributes that give access to different elements of the timestamp. For instance:
# to get the year
time_stamp.year
2020
# to get the week day
time_stamp.day_name()
Wednesday
We can also access other pd.Timestamp attributes such as:
- .second, .minute, .hour
- .day, .month, .quarter, .year
- .weekday, .dayofweek, .weekofyear, .dayofyear
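For instance, continuing with the time_stamp created above (a quick sketch; the commented values are what these attributes return for 2020-01-01):
time_stamp.month        # 1
time_stamp.quarter      # 1
time_stamp.dayofweek    # 2 (Monday=0, so Wednesday)
time_stamp.dayofyear    # 1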
Period & Frequency
The Period object has a freq attribute that stores the frequency info:
period = pd.Period('2020-01')
period
Period('2020-01', 'M')
# change frequency to daily 'D'
period.asfreq('D')
Period('2020-01-31', 'D')
# we can do simple math with periods
period + 2
Period('2020-03', 'M')
There are other frequency aliases we can use besides month 'M' and day 'D':
- Hour: 'H'
- Day: 'D'
- Week: 'W'
- Month: 'M'
- Quarter: 'Q'
- Year: 'A'
- Business Day: 'B'
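As a quick sketch of how a few of these aliases behave (the commented outputs assume pandas' default anchoring, e.g. weeks ending on Sunday and quarters/years ending in December):
pd.Period('2020-01-01', freq='W')
# Period('2019-12-30/2020-01-05', 'W-SUN')
pd.Period('2020-01-01', freq='Q')
# Period('2020Q1', 'Q-DEC')
pd.date_range(start='2020-01-01', periods=3, freq='B')
# DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[ns]', freq='B')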
# create a sequence of timestamps
# 12 monthly timestamps
index = pd.date_range(start='2019-01-01', periods=12, freq='M')
index
DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',
'2019-05-31', '2019-06-30', '2019-07-31', '2019-08-31',
'2019-09-30', '2019-10-31', '2019-11-30', '2019-12-31'],
dtype='datetime64[ns]', freq='M')
DatetimeIndex is a sequence of Timestamp objects with frequency info. We can convert a DatetimeIndex to a PeriodIndex:
index[0]
Timestamp('2019-01-31 00:00:00', freq='M')
index.to_period()
PeriodIndex(['2019-01', '2019-02', '2019-03', '2019-04', '2019-05', '2019-06',
'2019-07', '2019-08', '2019-09', '2019-10', '2019-11', '2019-12'],
dtype='period[M]', freq='M')
Now that we have a DatetimeIndex, we can use it to create a time-series DataFrame.
data = np.random.random(size=(12,2))
df = pd.DataFrame(data=data, index=index)
df.head(10)
0 1
2019-01-31 0.685220 0.456167
2019-02-28 0.809490 0.905552
2019-03-31 0.454890 0.683387
2019-04-30 0.860495 0.341041
2019-05-31 0.156462 0.030216
2019-06-30 0.885845 0.085404
2019-07-31 0.995309 0.719039
2019-08-31 0.417888 0.032380
2019-09-30 0.167757 0.773513
2019-10-31 0.631097 0.583592
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 12 entries, 2019-01-31 to 2019-12-31
Freq: M
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 12 non-null float64
1 1 12 non-null float64
dtypes: float64(2)
memory usage: 608.0 bytes
# print the day of week and the day's name for the df above
for day in index:
    print(day.dayofweek, day.day_name())
3 Thursday
3 Thursday
6 Sunday
1 Tuesday
4 Friday
6 Sunday
2 Wednesday
5 Saturday
0 Monday
3 Thursday
5 Saturday
1 Tuesday
Timeseries Transformations
Basic timeseries transformations include:
- Parsing string dates and converting them to datetime64
- Selecting & slicing specific subperiods
- Setting & changing the DatetimeIndex frequency: upsampling (creating more rows) vs. downsampling (aggregating data); a small sketch follows this list
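As a rough illustration of upsampling vs. downsampling on a small synthetic series (a minimal sketch; .asfreq() is used later in this section, while .resample() appears here only for contrast):
# a tiny quarterly series
quarterly = pd.Series([1.0, 2.0, 3.0, 4.0],
                      index=pd.date_range('2019-01-01', periods=4, freq='QS'))
# upsampling: quarterly -> monthly adds rows, which are NaN unless we fill them
monthly = quarterly.asfreq('MS')                         # new months are NaN
monthly_filled = quarterly.asfreq('MS', method='ffill')  # forward-fill the gaps
# downsampling: monthly -> yearly aggregates the existing rows
yearly_mean = monthly_filled.resample('A').mean()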
When we load a dataset (.csv file) that contains dates, most of the time these dates will be treated as string objects. To convert these strings to the correct data type, pandas has the pd.to_datetime() function, which parses the date string and converts it to datetime64.
df.date = pd.to_datetime(df.date)
Now that it is converted to the correct data type, we can set it as index:
df.set_index('date', inplace=True)
We now have a complete timeseries DataFrame.
TIP: When using pd.read_csv we can parse dates and set the index in one step:
df = pd.read_csv("datasetname.csv", parse_dates=['column_name_with_dates'], index_col='column_name_with_dates')
df = pd.read_csv('/occupancy.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8143 entries, 0 to 8142
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 8143 non-null object
1 Temperature 8143 non-null float64
2 Humidity 8143 non-null float64
3 Light 8143 non-null float64
4 CO2 8143 non-null float64
5 HumidityRatio 8143 non-null float64
6 Occupancy 8143 non-null int64
dtypes: float64(5), int64(1), object(1)
memory usage: 445.4+ KB
df.head()
date Temperature Humidity Light CO2 HumidityRatio Occupancy
0 2015-02-04 17:51:00 23.18 27.2720 426.0 721.25 0.004793 1
1 2015-02-04 17:51:59 23.15 27.2675 429.5 714.00 0.004783 1
2 2015-02-04 17:53:00 23.15 27.2450 426.0 713.50 0.004779 1
3 2015-02-04 17:54:00 23.15 27.2000 426.0 708.25 0.004772 1
4 2015-02-04 17:55:00 23.10 27.2000 426.0 704.50 0.004757 1
df.date = pd.to_datetime(df.date)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8143 entries, 0 to 8142
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 8143 non-null datetime64[ns]
1 Temperature 8143 non-null float64
2 Humidity 8143 non-null float64
3 Light 8143 non-null float64
4 CO2 8143 non-null float64
5 HumidityRatio 8143 non-null float64
6 Occupancy 8143 non-null int64
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 445.4 KB
df.set_index('date', inplace=True)
df.head()
Temperature Humidity Light CO2 HumidityRatio Occupancy
date
2015-02-04 17:51:00 23.18 27.2720 426.0 721.25 0.004793 1
2015-02-04 17:51:59 23.15 27.2675 429.5 714.00 0.004783 1
2015-02-04 17:53:00 23.15 27.2450 426.0 713.50 0.004779 1
2015-02-04 17:54:00 23.15 27.2000 426.0 708.25 0.004772 1
2015-02-04 17:55:00 23.10 27.2000 426.0 704.50 0.004757 1
# plot the df
df.plot(title="Occupancy DF Plot", subplots=True, figsize = (12,10))
plt.tight_layout();
plt.legend(loc='best')
plt.show();
Partial String Indexing
We can pass a string representing a period of time (year) to retrieve all the values for that particular year:
df['2015'].info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8143 entries, 2015-02-04 17:51:00 to 2015-02-10 09:33:00
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Temperature 8143 non-null float64
1 Humidity 8143 non-null float64
2 Light 8143 non-null float64
3 CO2 8143 non-null float64
4 HumidityRatio 8143 non-null float64
5 Occupancy 8143 non-null int64
dtypes: float64(5), int64(1)
memory usage: 445.3 KB
We can also pass a 'slice', i.e. a range of dates, for which we want to retrieve data:
df['2015-2-5' : '2015-2-6'].info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2880 entries, 2015-02-05 00:00:00 to 2015-02-06 23:58:59
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Temperature 2880 non-null float64
1 Humidity 2880 non-null float64
2 Light 2880 non-null float64
3 CO2 2880 non-null float64
4 HumidityRatio 2880 non-null float64
5 Occupancy 2880 non-null int64
dtypes: float64(5), int64(1)
memory usage: 157.5 KB
# retrieve data for a particular day
df.loc['2015-2-5', 'CO2']
date
2015-02-05 00:00:00 456.500000
2015-02-05 00:01:00 458.500000
2015-02-05 00:02:00 459.666667
2015-02-05 00:03:00 464.000000
2015-02-05 00:04:00 465.000000
...
2015-02-05 23:55:00 439.500000
2015-02-05 23:55:59 445.000000
2015-02-05 23:57:00 442.500000
2015-02-05 23:57:59 443.500000
2015-02-05 23:58:59 444.000000
Name: CO2, Length: 1440, dtype: float64
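Partial string indexing also works for time ranges within a single day, for example (a minimal sketch; the times are just an illustration):
# CO2 readings for the morning of Feb 5 only
df.loc['2015-02-05 08:00':'2015-02-05 12:00', 'CO2'].head()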
# change the frequency of the data to hourly
df.asfreq('H').info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 136 entries, 2015-02-04 17:51:00 to 2015-02-10 08:51:00
Freq: H
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Temperature 136 non-null float64
1 Humidity 136 non-null float64
2 Light 136 non-null float64
3 CO2 136 non-null float64
4 HumidityRatio 136 non-null float64
5 Occupancy 136 non-null int64
dtypes: float64(5), int64(1)
memory usage: 7.4 KB
df.asfreq('H').plot(subplots=True, figsize = (12,10));
# import library for example datasets
from vega_datasets import data as vds
df_stocks = vds.stocks()
df_stocks.head(6)
symbol date price
0 MSFT 2000-01-01 39.81
1 MSFT 2000-02-01 36.35
2 MSFT 2000-03-01 43.22
3 MSFT 2000-04-01 28.37
4 MSFT 2000-05-01 25.45
5 MSFT 2000-06-01 32.54
df_stocks.tail()
symbol date price
555 AAPL 2009-11-01 199.91
556 AAPL 2009-12-01 210.73
557 AAPL 2010-01-01 192.06
558 AAPL 2010-02-01 204.62
559 AAPL 2010-03-01 223.02
df_stocks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 560 entries, 0 to 559
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 symbol 560 non-null object
1 date 560 non-null datetime64[ns]
2 price 560 non-null float64
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 13.2+ KB
# Create a new dataframe only for AAPL stock
df_aapl = df_stocks.loc[(df_stocks['symbol'] == 'AAPL')]
df_aapl.head()
symbol date price
437 AAPL 2000-01-01 25.94
438 AAPL 2000-02-01 28.66
439 AAPL 2000-03-01 33.95
440 AAPL 2000-04-01 31.01
441 AAPL 2000-05-01 21.00
# set date as index
df_aapl.set_index('date', inplace=True)
# Create an empty dataframe for stock prices
prices = pd.DataFrame()
# Select data for each year and concatenate with prices
for year in ['2007', '2008', '2009']:
    price_per_year = df_aapl.loc[year, ['price']].reset_index(drop=True)
    price_per_year.rename(columns={'price': year}, inplace=True)
    prices = pd.concat([prices, price_per_year], axis=1)
# Plot prices
prices.plot(title="AAPL Stock for 2007 to 2009")
plt.show()
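An alternative way to build the same comparison table, assuming the date index set above, is to pivot the monthly prices into one column per year (a sketch, not from the course):
# alternative: pivot the monthly prices into one column per year
aapl = df_aapl.loc['2007':'2009', ['price']].copy()
aapl['year'] = aapl.index.year
aapl['month'] = aapl.index.month
prices_alt = aapl.pivot(index='month', columns='year', values='price')
prices_alt.plot(title="AAPL Stock for 2007 to 2009")
plt.show()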
Basic Timeseries Calculations
Some of the basic calculations we can do with a timeseries dataset using pandas, based on the DatetimeIndex:
- Shift or lag values back or forward in time
- Get differences in values for a given time period
- Compute the percent change over any number of periods
df_aapl.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 123 entries, 437 to 559
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 symbol 123 non-null object
1 date 123 non-null datetime64[ns]
2 price 123 non-null float64
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 3.8+ KB
Moving Data between past & future
Shifting data is useful so that values can be compared at different points in time. As the name indicates, .shift() moves the data within the DataFrame.
.shift():
- defaults to periods=1
- with a positive periods value the data is shifted forward (each value moves to a later timestamp); with a negative value (e.g. periods=-1) the data is shifted backwards, i.e. lagged to an earlier timestamp
df_aapl['shifted'] = df_aapl.price.shift()
df_aapl.head()
symbol price shifted
date
2000-01-01 AAPL 25.94 NaN
2000-02-01 AAPL 28.66 25.94
2000-03-01 AAPL 33.95 28.66
2000-04-01 AAPL 31.01 33.95
2000-05-01 AAPL 21.00 31.01
df_aapl['lagged'] = df_aapl.price.shift(periods=-1)
df_aapl.tail()
symbol price shifted lagged
date
2009-11-01 AAPL 199.91 188.50 210.73
2009-12-01 AAPL 210.73 199.91 192.06
2010-01-01 AAPL 192.06 210.73 204.62
2010-02-01 AAPL 204.62 192.06 223.02
2010-03-01 AAPL 223.02 204.62 NaN
Calculating the one-period change
We can use the shifted data we just created to calculate the period-over-period change, which is the basis of the financial return. The one-period change is the ratio:
\[x_t / x_{t-1}\]
# create a new column with the period-over-period change (ratio)
df_aapl['change'] = df_aapl.price.div(df_aapl.shifted)
df_aapl.head()
symbol price shifted lagged change
date
2000-01-01 AAPL 25.94 NaN 28.66 NaN
2000-02-01 AAPL 28.66 25.94 33.95 1.104857
2000-03-01 AAPL 33.95 28.66 31.01 1.184578
2000-04-01 AAPL 31.01 33.95 21.00 0.913402
2000-05-01 AAPL 21.00 31.01 26.19 0.677201
Calculating one-period percent change (return)
\[(x_t / x_{t-1} - 1) \times 100\]
# subtract 1 from the change and multiply by 100 to get the percent return
df_aapl['return'] = df_aapl.change.sub(1).mul(100)
df_aapl.head()
symbol price shifted lagged change return
date
2000-01-01 AAPL 25.94 NaN 28.66 NaN NaN
2000-02-01 AAPL 28.66 25.94 33.95 1.104857 10.485736
2000-03-01 AAPL 33.95 28.66 31.01 1.184578 18.457781
2000-04-01 AAPL 31.01 33.95 21.00 0.913402 -8.659794
2000-05-01 AAPL 21.00 31.01 26.19 0.677201 -32.279910
Built-in timeseries change: .diff()
- Calculates the difference in value between two adjacent periods. In finance, it is used to calculate the change in closing price since the last trading day.
$$x_t - x_{t-1}$$
# create a new column and calculate the diff()
df_aapl['diff'] = df_aapl.price.diff()
df_aapl.head()
symbol price shifted lagged change return diff
date
2000-01-01 AAPL 25.94 NaN 28.66 NaN NaN NaN
2000-02-01 AAPL 28.66 25.94 33.95 1.104857 10.485736 2.72
2000-03-01 AAPL 33.95 28.66 31.01 1.184578 18.457781 5.29
2000-04-01 AAPL 31.01 33.95 21.00 0.913402 -8.659794 -2.94
2000-05-01 AAPL 21.00 31.01 26.19 0.677201 -32.279910 -10.01
Calculating % change
.pct_change()
- Calculates the percent change for two adjacent periods
# create a new column and calculate the percentage change
df_aapl['pct_change'] = df_aapl.price.pct_change().mul(100)
df_aapl.head()
symbol price shifted lagged change return diff pct_change
date
2000-01-01 AAPL 25.94 NaN 28.66 NaN NaN NaN NaN
2000-02-01 AAPL 28.66 25.94 33.95 1.104857 10.485736 2.72 10.485736
2000-03-01 AAPL 33.95 28.66 31.01 1.184578 18.457781 5.29 18.457781
2000-04-01 AAPL 31.01 33.95 21.00 0.913402 -8.659794 -2.94 -8.659794
2000-05-01 AAPL 21.00 31.01 26.19 0.677201 -32.279910 -10.01 -32.279910
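The 'return' column computed manually above and the built-in 'pct_change' column should agree up to floating-point noise, which we can verify (a quick sanity check):
# compare the hand-computed return with pct_change (dropping the leading NaN)
np.allclose(df_aapl['return'].dropna(), df_aapl['pct_change'].dropna())
# expected: True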
Exercise:
Set the frequency of the data to business day ('B'), shift the AAPL stock price 90 days into the past and future, and plot the results. Make sure the df contains daily prices.
# keep only the price column and set the frequency to business days;
# the underlying data is monthly, so forward-fill to get daily prices
df_apple = df_aapl[['price']].asfreq('B').ffill()
# create lagged and shifted columns (90 business days into the future and the past)
df_apple['lagged'] = df_apple.price.shift(periods=-90)
df_apple['shifted'] = df_apple.price.shift(periods=90)
# plot the price, lagged and shifted columns
df_apple[['price', 'lagged', 'shifted']].plot(title="Apple Stock Price Comparison 90 Days into Past and Future")
plt.show()