Working with Timeseries Data II
Notes on Python/pandas for TimeSeries Data from Datacamp’s Manipulating Time Series course
Comparing Timeseries Growth Rate
- Timeseries data, stock prices for instance, are hard to compare when they trade at different levels. To compare them, we normalize each series so that it starts at 100.
- This is done by dividing all prices by the first element in the series.
- As a result, the first value equals 1 and each subsequent value reflects the change relative to that starting point.
- Multiplying the normalized series by 100 expresses the change relative to the initial price in percentage points. For instance, a move from 100 to 130 represents a 30 percentage point increase (a tiny toy illustration follows below).
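As a minimal sketch with a made-up four-value price series (the numbers are arbitrary):
import pandas as pd
# toy price series
prices = pd.Series([50.0, 55.0, 65.0, 60.0])
normalized = prices.div(prices.iloc[0])   # first value becomes 1
rebased = normalized.mul(100)             # first value becomes 100
print(rebased)                            # 100.0, 110.0, 130.0, 120.0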
Now let's work through a full example with real stock data.
# import packages
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Create a DataFrame containing the adjusted close price ('Adj Close') for Kodak ('KODK'), iRobot ('IRBT'), Amazon ('AMZN'), and Google ('GOOG') using pandas-datareader, which pulls the data from the web.
all_data = {ticker : web.get_data_yahoo(ticker)
for ticker in ['KODK', 'IRBT', 'AMZN', 'GOOG']}
price = pd.DataFrame({ticker : data['Adj Close']
for ticker, data in all_data.items()})
price.head()
KODK IRBT AMZN GOOG
Date
2015-09-14 14.99 29.889999 521.380005 623.239990
2015-09-15 15.41 30.320000 522.369995 635.140015
2015-09-16 15.45 30.270000 527.390015 635.979980
2015-09-17 15.68 31.340000 538.869995 642.900024
2015-09-18 15.58 30.330000 540.260010 629.250000
price.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1257 entries, 2015-09-14 to 2020-09-09
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 KODK 1257 non-null float64
1 IRBT 1257 non-null float64
2 AMZN 1257 non-null float64
3 GOOG 1257 non-null float64
dtypes: float64(4)
memory usage: 49.1 KB
# Select the first price for the stocks
price.iloc[0]
KODK 14.990000
IRBT 29.889999
AMZN 521.380005
GOOG 623.239990
Name: 2015-09-14 00:00:00, dtype: float64
This returns a Series containing the first row of the DataFrame, which we can now use to normalize the data.
normalized_df = price.div(price.iloc[0])
normalized_df.head()
KODK IRBT AMZN GOOG
Date
2015-09-14 1.000000 1.000000 1.000000 1.000000
2015-09-15 1.028019 1.014386 1.001899 1.019094
2015-09-16 1.030687 1.012713 1.011527 1.020442
2015-09-17 1.046031 1.048511 1.033546 1.031545
2015-09-18 1.039360 1.014721 1.036212 1.009643
We need a benchmark to compare performance against, so we import data for the SPDR S&P 500 ETF (SPY).
bench = {ticker : web.get_data_yahoo(ticker)
for ticker in ['SPY']}
bench_price = pd.DataFrame({ticker : data['Adj Close']
for ticker, data in bench.items()})
bench_price.head()
SPY
Date
2015-09-14 177.454758
2015-09-15 179.672806
2015-09-16 181.229980
2015-09-17 180.822556
2015-09-18 177.867691
bench_price.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1257 entries, 2015-09-14 to 2020-09-09
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SPY 1257 non-null float64
dtypes: float64(1)
memory usage: 19.6 KB
Now that we have the stock price and the benchmark data, we can combine them into a single dataframe using pd.concat
price_concat = pd.concat([price, bench_price], axis=1).dropna()
price_concat.head(3)
KODK IRBT AMZN GOOG SPY
Date
2015-09-14 14.99 29.889999 521.380005 623.239990 177.454758
2015-09-15 15.41 30.320000 522.369995 635.140015 179.672806
2015-09-16 15.45 30.270000 527.390015 635.979980 181.229980
price_concat.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1257 entries, 2015-09-14 to 2020-09-09
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 KODK 1257 non-null float64
1 IRBT 1257 non-null float64
2 AMZN 1257 non-null float64
3 GOOG 1257 non-null float64
4 SPY 1257 non-null float64
dtypes: float64(5)
memory usage: 58.9 KB
We can use the concatenated dataframe to compare the prices against the benchmark. We first normalize the data and multiply by 100, then plot it to see how each stock performed against SPY.
normalized = price_concat.div(price_concat.iloc[0]).mul(100)
normalized.plot(title="Stocks / SPY",figsize = (8,6));
To see the performance difference for each individual stock relative to the benchmark in percentage points, we can subtract the normalized SPY from the normalized stock prices. Use .sub() with the keyword axis=0 to align the Series index with the DataFrame index; this makes pandas subtract the Series from each column. Then plot the data to see how each stock performed relative to the benchmark.
diff = normalized.sub(normalized['SPY'], axis=0)
diff.head()
KODK IRBT AMZN GOOG SPY
Date
2015-09-14 0.000000 0.000000 0.000000 0.000000 0.0
2015-09-15 1.551945 0.188686 -1.060044 0.659458 0.0
2015-09-16 0.941284 -0.856097 -0.974716 -0.083274 0.0
2015-09-17 2.705238 2.953289 1.456723 1.256654 0.0
2015-09-18 3.703261 1.239368 3.388462 0.731619 0.0
diff[['KODK', 'IRBT', 'AMZN', 'GOOG']].plot(title="Stock performance / SPY", figsize = (8,6));
Changing Frequency - Resampling
We can change the frequency of a DatetimeIndex using .asfreq(). Keep in mind that changing the frequency also changes the underlying data.
When upsampling, converting the data to a higher frequency, we create new rows and need to tell pandas how to fill or interpolate the missing data.
When downsampling, we reduce the number of rows and need to tell pandas how to aggregate the existing data.
We can use the following methods to manipulate our data:
- .asfreq()
- .reindex()
- .resample() + a transformation method
Let's illustrate this concept by creating some random quarterly data and resampling it.
# Resample data
dates = pd.date_range(start='2019', periods=4, freq='Q')
data = range(1, 5)
quaterly = pd.Series(data=data, index=dates)
quaterly
2019-03-31 1
2019-06-30 2
2019-09-30 3
2019-12-31 4
Freq: Q-DEC, dtype: int64
# change to monthly frequency
monthly = quaterly.asfreq('M')
monthly
2019-03-31 1.0
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 2.0
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 3.0
2019-10-31 NaN
2019-11-30 NaN
2019-12-31 4.0
Freq: M, dtype: float64
Upsampling to fill in missing values
We can convert a Series to a DataFrame by applying the .to_frame() method, passing the desired column name as a parameter.
monthly = monthly.to_frame('baselines')
monthly
baselines
2019-03-31 1.0
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 2.0
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 3.0
2019-10-31 NaN
2019-11-30 NaN
2019-12-31 4.0
We can impute the missing values using the following methods:
- forward fill – method='ffill'
- back fill – method='bfill'
- fill with 0 – fill_value=0
monthly['ffill'] = quaterly.asfreq('M', method='ffill')
monthly['bfill'] = quaterly.asfreq('M', method='bfill')
monthly['value'] = quaterly.asfreq('M', fill_value=0)
monthly
baselines ffill bfill value
2019-03-31 1.0 1 1 1
2019-04-30 NaN 1 2 0
2019-05-31 NaN 1 2 0
2019-06-30 2.0 2 2 2
2019-07-31 NaN 2 3 0
2019-08-31 NaN 2 3 0
2019-09-30 3.0 3 3 3
2019-10-31 NaN 3 4 0
2019-11-30 NaN 3 4 0
2019-12-31 4.0 4 4 4
Reindexing
Let's say we want a monthly DatetimeIndex for a full year; we can use .reindex(). In this case pandas aligns the existing data with the new monthly values and creates missing values elsewhere, and we can use the same filling methods we saw above to impute the missing data.
dates = pd.date_range(start='2019',
periods=12,
freq='M')
quaterly.reindex(dates)
2019-01-31 NaN
2019-02-28 NaN
2019-03-31 1.0
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 2.0
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 3.0
2019-10-31 NaN
2019-11-30 NaN
2019-12-31 4.0
Freq: M, dtype: float64
quaterly.reindex(dates, method='bfill')
2019-01-31 1
2019-02-28 1
2019-03-31 1
2019-04-30 2
2019-05-31 2
2019-06-30 2
2019-07-31 3
2019-08-31 3
2019-09-30 3
2019-10-31 4
2019-11-30 4
2019-12-31 4
Freq: M, dtype: int64
quaterly.reindex(dates, method='ffill')
2019-01-31 NaN
2019-02-28 NaN
2019-03-31 1.0
2019-04-30 1.0
2019-05-31 1.0
2019-06-30 2.0
2019-07-31 2.0
2019-08-31 2.0
2019-09-30 3.0
2019-10-31 3.0
2019-11-30 3.0
2019-12-31 4.0
Freq: M, dtype: float64
Frequency Conversion & Transformation Methods
Pandas' .resample() follows a logic similar to .groupby(): it groups the data within resampling periods and applies one or several methods to each group. It takes the value created by each method and assigns it a new date within the resampling period. The new date is determined by the offset, and it can be located at the start, the end, or a custom location.
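As a minimal sketch of where the new dates land, using made-up daily data (the offset aliases 'M' and 'MS' are the standard pandas aliases for month end and month start):
import numpy as np
import pandas as pd
# made-up daily data to resample
idx = pd.date_range('2019-01-01', periods=90, freq='D')
ts = pd.Series(np.arange(90), index=idx)
# monthly mean labelled at the period end ('M') ...
print(ts.resample('M').mean().head(3))
# ... versus the period start ('MS'); the group means are the same,
# only the label (date) placement differs
print(ts.resample('MS').mean().head(3))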
Let’s apply these concepts to the Civilian Unemployment Rate data for the US.
I am going to use Quandl to retrieve the unemployment data. You can pip install it by running the command below in a terminal or a notebook cell:
!pip3 install quandl
import quandl
# pass your API key and dataset name
quandl.ApiConfig.api_key = "Enter your API Key"
# retrieve unemployment rate data
df_unemployment = quandl.get("FRED/UNRATE")
df_unemployment.tail()
Value
Date
2020-04-01 14.7
2020-05-01 13.3
2020-06-01 11.1
2020-07-01 10.2
2020-08-01 8.4
df_unemployment.plot(title="US Unemployment Rate", figsize = (8,6));
df_unemployment.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 872 entries, 1948-01-01 to 2020-08-01
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Value 872 non-null float64
dtypes: float64(1)
memory usage: 13.6 KB
As we can see, the dataset contains the unemployment rate for the US from 1948 to now (Aug 2020). It has a monthly frequency as each data point is entered at the beginning of each month.
To make it more manageable, we will create a new data set that contains only the last 14 years of data.
We will then reindex these data to weekly values. Reindexing will create NaNs that we will have to impute.
df_unempl_14_years = df_unemployment['2006':]
df_unempl_14_years.head()
Value
Date
2006-01-01 4.7
2006-02-01 4.8
2006-03-01 4.7
2006-04-01 4.7
2006-05-01 4.6
df_unempl_14_years.plot(title='US Unemployment Last 14 Years')
# Create weekly dates
weekly_dates_un = pd.date_range(start=df_unempl_14_years.index.min(),
end=df_unempl_14_years.index.max(),
freq='W')
# reindex from monthly to weekly data
weekly_un = df_unempl_14_years.reindex(weekly_dates_un)
weekly_un
Value
2006-01-01 4.7
2006-01-08 NaN
2006-01-15 NaN
2006-01-22 NaN
2006-01-29 NaN
... ...
2020-06-28 NaN
2020-07-05 NaN
2020-07-12 NaN
2020-07-19 NaN
2020-07-26 NaN
weekly_un.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 761 entries, 2006-01-01 to 2020-07-26
Freq: W-SUN
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Value 27 non-null float64
dtypes: float64(1)
memory usage: 11.9 KB
# Create a new column and forward fill the data
weekly_un['ffill'] = weekly_un.Value.ffill()
weekly_un['interpolated'] = weekly_un.Value.interpolate()
# plot data
weekly_un.plot(figsize = (8,6));
Downsampling & Aggregating Methods
We can define downsampling as reducing the frequency of our timeseries data. For instance, we can reduce hourly data to daily data and daily to monthly data. To achieve this, we need to determine how to summarize/aggregate the existing data – the 24 hourly data points will become a single datapoint when downsampling from hourly to daily.
The options for aggregating the data include the mean, median, or last value, and the choice is determined by the context and needs of our analysis.
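As a quick sketch with hypothetical hourly data, downsampling to daily frequency collapses each day's 24 hourly points into a single value per aggregation:
import numpy as np
import pandas as pd
np.random.seed(0)
# hypothetical hourly series covering three days
hourly = pd.Series(np.random.randn(72),
                   index=pd.date_range('2020-01-01', periods=72, freq='H'))
# each daily value aggregates that day's 24 hourly observations
daily = hourly.resample('D').agg(['mean', 'median', 'last'])
print(daily)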
For this example, we can use the CO2 Concentration dataset from vega_datasets, which contains monthly values starting in 1958.
# import library for toy datasets
from vega_datasets import data as vds
df_co2 = vds.co2_concentration()
df_co2.head(5)
Date CO2 adjusted CO2
0 1958-03-01 315.70 314.44
1 1958-04-01 317.46 315.16
2 1958-05-01 317.51 314.71
3 1958-07-01 315.86 315.19
4 1958-08-01 314.93 316.19
df_co2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 741 entries, 0 to 740
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 741 non-null object
1 CO2 741 non-null float64
2 adjusted CO2 741 non-null float64
dtypes: float64(2), object(1)
memory usage: 17.5+ KB
By inspecting the dataframe's info we can see that it is not yet a timeseries dataframe, so we must convert the Date column from object to datetime and set it as the index, giving us a DatetimeIndex.
df_co2.Date = pd.to_datetime(df_co2.Date)
# set date as index
df_co2.set_index('Date', inplace=True)
df_co2.head()
CO2 adjusted CO2
Date
1958-03-01 315.70 314.44
1958-04-01 317.46 315.16
1958-05-01 317.51 314.71
1958-07-01 315.86 315.19
1958-08-01 314.93 316.19
df_co2.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 741 entries, 1958-03-01 to 2020-04-01
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CO2 741 non-null float64
1 adjusted CO2 741 non-null float64
dtypes: float64(2)
memory usage: 17.4 KB
#plot data
df_co2.plot(title='CO2 Concentration', figsize=(10,8));
# Change frequency from monthly to yearly
# and aggregate using the mean
df_co2_yearly = df_co2.resample('Y').mean()
# display head
df_co2_yearly.head()
CO2 adjusted CO2
Date
1958-12-31 315.333750 315.300000
1959-12-31 315.981667 315.977500
1960-12-31 316.909167 316.907500
1961-12-31 317.645000 317.638333
1962-12-31 318.454167 318.448333
df_co2_yearly.plot(title="CO2 Yearly Average Concentrations", figsize=(10,8))
We aggregated the data using .mean(), but we also have other options. We can apply multiple aggregations at once by using the .agg() method and passing a list of aggregation functions such as median and std. Note that the snippet below applies .agg() to df_co2_yearly, which already holds a single value per year, so the standard deviation comes back as NaN; applying the same call to the original monthly df_co2 would produce a meaningful spread for each year.
df_co2_yearly.resample('Y').agg(['mean', 'std']).head()
CO2 adjusted CO2
mean std mean std
Date
1958-12-31 315.333750 NaN 315.300000 NaN
1959-12-31 315.981667 NaN 315.977500 NaN
1960-12-31 316.909167 NaN 316.907500 NaN
1961-12-31 317.645000 NaN 317.638333 NaN
1962-12-31 318.454167 NaN 318.448333 NaN
ax = df_co2[['CO2']].plot(figsize=(10, 8))
yearly = df_co2[['CO2']].resample('Y').mean()
yearly.add_suffix('_yearly').plot(ax=ax) # ax to plot on same plot
We can revisit the price dataframe we created earlier to practice resampling, changing frequency, and aggregation methods.
price.head()
KODK IRBT AMZN GOOG
Date
2015-09-14 14.99 29.889999 521.380005 623.239990
2015-09-15 15.41 30.320000 522.369995 635.140015
2015-09-16 15.45 30.270000 527.390015 635.979980
2015-09-17 15.68 31.340000 538.869995 642.900024
2015-09-18 15.58 30.330000 540.260010 629.250000
# create monthly averages
monthly_average = price.resample('M').mean()
monthly_average.head()
KODK IRBT AMZN GOOG
Date
2015-09-30 15.434615 29.640000 526.396160 621.773071
2015-10-31 15.185909 30.333636 566.743181 663.592718
2015-11-30 13.173000 31.819500 657.695499 735.388498
2015-12-31 12.955454 34.823182 669.262279 755.354548
2016-01-31 9.513684 32.277369 601.061578 718.495792
# plot data
monthly_average.plot(title="Stocks Monthly Average", subplots=True, figsize=(10,8));
ax = price.plot(figsize=(10, 8))
m_avg = price.resample('M').mean()
m_avg.add_suffix('_m_avg').plot(ax=ax)
# calculate daily returns for each stock
# .squeeze() would convert a single-column frame to a Series; with several
# columns it returns the DataFrame unchanged and pct_change works column-wise
daily_returns = price.squeeze().pct_change()
# Resample and calculate stats
stats = daily_returns.resample('M').agg(['mean', 'median', 'std'])
# plot stats
stats.plot(title="Stocks Stats", subplots=True, figsize=(12,12));
Working with Window Functions Using Pandas
Windows are useful because they allow us to work with sub-periods of our timeseries data. We can calculate metrics of the data inside the window and the results can be assigned to a new dataframe whose datapoints represent a summary of the datapoints in the original series.
The two main types of windows are:
- Rolling windows – these maintain a fixed size while sliding over the timeseries, so each data point is the result of a given number of observations.
- Expanding windows – these grow with the timeseries, so the calculation that produces a new datapoint is the result of all previous datapoints.
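Here is a tiny toy comparison of the two window types (the numbers are arbitrary):
import pandas as pd
s = pd.Series([1, 2, 3, 4, 5])
# rolling: a fixed window of the last 3 observations
print(s.rolling(window=3).mean())   # NaN, NaN, 2.0, 3.0, 4.0
# expanding: the window grows to include every observation so far
print(s.expanding().mean())         # 1.0, 1.5, 2.0, 2.5, 3.0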
Let’s calculate a rolling average using pandas:
# create a new dataframe for AMZN stock price
# (single bracket returns a Series)
# (double bracket returns a DataFrame)
amzn_price = price[['AMZN']].copy()  # .copy() so we can safely add columns later
amzn_price.head()
AMZN
Date
2015-09-14 521.380005
2015-09-15 522.369995
2015-09-16 527.390015
2015-09-17 538.869995
2015-09-18 540.260010
amzn_price.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1257 entries, 2015-09-14 to 2020-09-09
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AMZN 1257 non-null float64
dtypes: float64(1)
memory usage: 59.6 KB
amzn_price.plot(title='AMZN Stock Price')
# Calculate the rolling average
# Integer-based window size
amzn_price.rolling(window=30).mean().plot()
# Offset-based window size
# 30D = 30 calendar days
amzn_price.rolling(window='30D').mean().plot()
ax = amzn_price.plot(figsize=(10, 8))
roll_30 = amzn_price.rolling(window='30D').mean()
roll_90 = amzn_price.rolling(window='90D').mean()
roll_360 = amzn_price.rolling(window='360D').mean()
roll_30.add_suffix('_mean_30').plot(ax=ax)
roll_90.add_suffix('_mean_90').plot(ax=ax)
roll_360.add_suffix('_mean_360').plot(ax=ax)
Calculating Multiple Rolling Metrics
As we've seen before, we can pass a list of metrics as an argument to .agg():
mul_metrics = amzn_price.rolling(window='90D').agg(['mean', 'std'])
# plot metrics
mul_metrics.plot(subplots=True);
# Calculate quantiles
# create rolling window
rolling = amzn_price.rolling(360)['AMZN']
# define quantiles
amzn_price['q10'] = rolling.quantile(.1)
amzn_price['q50'] = rolling.quantile(.5)
amzn_price['q90'] = rolling.quantile(.9)
amzn_price.plot();
Expanding Windows with Pandas
We can use expanding windows to calculate metrics up to the current date. They return a new timeseries that reflects all historical values. Expanding windows are useful for calculating the running rate of return and the running min/max.
Pandas gives us two options to calculate expanding windows:
- .expanding(), which is used just like .rolling()
- the cumulative methods .cumsum(), .cumprod(), and .cummin()/.cummax()
# implementing the basic idea
df = pd.DataFrame({'data' : range(5)})
df['expanding_sum'] = df.data.expanding().sum()
df['cumulative_sum'] = df.data.cumsum()
df
data expanding_sum cumulative_sum
0 0 0.0 0
1 1 1.0 1
2 2 3.0 3
3 3 6.0 6
4 4 10.0 10
How to calculate a running return
- Single-period return \(r_t\): current price over last price minus 1, \(r_t = \frac{P_t}{P_{t-1}} - 1\)
- Multi-period return: the product of \((1 + r_t)\) over all periods minus 1, \(R_T = (1+r_1)(1 + r_2)\ldots(1+r_T) - 1\). For example, single-period returns of 10%, 20%, and -5% compound to \(1.10 \times 1.20 \times 0.95 - 1 = 0.254\), or 25.4% (see the small check below).
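As a quick check of that arithmetic, the hand-compounded figure matches the running cumprod-based calculation used next (the return values are hypothetical):
import pandas as pd
# hypothetical single-period returns: 10%, 20%, -5%
r = pd.Series([0.10, 0.20, -0.05])
manual = (1 + r).prod() - 1              # 1.10 * 1.20 * 0.95 - 1 = 0.254
running = r.add(1).cumprod().sub(1)      # running multi-period return
print(manual, running.iloc[-1])          # both ~0.254 (up to float rounding)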
# calculate percentage change
pr = price.pct_change()
pr_1 = pr.add(1)
cumulative_return = pr_1.cumprod().sub(1)
cumulative_return.mul(100).plot(figsize=(8,6));
Calculating running min and max
amzn_price.head()
AMZN q10 q50 q90
Date
2015-09-14 521.380005 NaN NaN NaN
2015-09-15 522.369995 NaN NaN NaN
2015-09-16 527.390015 NaN NaN NaN
2015-09-17 538.869995 NaN NaN NaN
2015-09-18 540.260010 NaN NaN NaN
amz_price = amzn_price[['AMZN']].copy()  # fresh copy before adding running min/max columns
amz_price.head()
AMZN
Date
2015-09-14 521.380005
2015-09-15 522.369995
2015-09-16 527.390015
2015-09-17 538.869995
2015-09-18 540.260010
amz_price['running_min'] = amz_price.AMZN.expanding().min()
amz_price['running_max'] = amz_price.AMZN.expanding().max()
amz_price.plot(figsize=(10,8));
Calculating Rolling Annual Rate of Return
def multi_period_return(period_returns):
return np.product(period_returns + 1) - 1
pr = amz_price.AMZN.pct_change()
r = pr.rolling('360D').apply(multi_period_return)
amz_price['Rolling 1yr Return'] = r.mul(100)
amz_price.plot(subplots=True, figsize=(10,8));
To apply these new skills, we are going to calculate the cumulative return on a $1000 investment in Amazon vs. Google.
# Create a new dataframe just for google and amazon prices
investment_comparison = price[['AMZN', 'GOOG']]
investment_comparison.head()
AMZN GOOG
Date
2015-09-14 521.380005 623.239990
2015-09-15 522.369995 635.140015
2015-09-16 527.390015 635.979980
2015-09-17 538.869995 642.900024
2015-09-18 540.260010 629.250000
# define investment variable
investment = 1000
# calculate returns
returns = investment_comparison.pct_change()
# calculate cumulative returns
r1 = returns.add(1)
cumulative_rtrn = r1.cumprod()
# calculate and plot the investment
cumulative_rtrn.mul(investment).plot(title=("AMZN vs GOOG 1000 Investment Comparison"), figsize=(10,8))
It is clear that Amazon greatly outperformed Google over the entire period. To look deeper, we can examine 1-year sub-periods and see how they compare.
# lets look at the data again
investment_comparison.head()
AMZN GOOG
Date
2015-09-14 521.380005 623.239990
2015-09-15 522.369995 635.140015
2015-09-16 527.390015 635.979980
2015-09-17 538.869995 642.900024
2015-09-18 540.260010 629.250000
# Define mult_period_return function
def multi_period_return(period_returns):
return np.product(period_returns + 1) - 1
# calculate daily returns
daily_rtrn = investment_comparison.pct_change()
# calculate rolling annual returns
rolling_annual_returns = daily_rtrn.rolling('360D').apply(multi_period_return)
# multiply by 100 and plot returns
rolling_annual_returns.mul(100).plot(figsize=(10,8))
It is clear that Amazon stock outperformed Google for the most part, but there are a few instances where Google performed slightly better than Amazon.
Random Walk & Simulations
Predicting stock prices is a difficult task. Many of the models used for prediction assume that stock returns follow a random process, so we can generate random numbers to simulate this behavior.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
from numpy.random import seed, normal
from scipy.stats import norm
seed(42)
random_returns = normal(loc=0, scale=0.01, size=1000)
sns.distplot(random_returns, fit=norm, kde=False);
Create a random price path
return_series = pd.Series(random_returns)
random_prices = return_series.add(1).cumprod().sub(1)
random_prices.mul(100).plot();
Calculate SPY returns
bench_price.head()
SPY
Date
2015-09-14 177.454758
2015-09-15 179.672806
2015-09-16 181.229980
2015-09-17 180.822556
2015-09-18 177.867691
bench_price['returns'] = bench_price.SPY.pct_change()
# plot price and return
bench_price.plot(subplots=True);
sns.distplot(bench_price.returns.dropna().mul(100), fit=norm);
# generate random and SPY 500 returns
from numpy.random import choice
sample = bench_price.returns.dropna()
# number of observations
n_obs = bench_price.returns.count()
random_walk = choice(sample, size=n_obs)
random_walk = pd.Series(random_walk, index=sample.index)
random_walk.head()
Date
2015-09-15 -0.000752
2015-09-16 0.004779
2015-09-17 -0.008254
2015-09-18 -0.024997
2015-09-21 -0.000259
dtype: float64
start = bench_price.SPY.first('D')
start
Date
2015-09-14 177.454758
Name: SPY, dtype: float64
spy_random = start.append(random_walk.add(1))
spy_random.head()
Date
2015-09-14 177.454758
2015-09-15 0.999248
2015-09-16 1.004779
2015-09-17 0.991746
2015-09-18 0.975003
dtype: float64
# calculate cumulative product
bench_price['SPY500_random'] = spy_random.cumprod()
# plot data
bench_price[['SPY', 'SPY500_random']].plot()
Correlations – Relationships between Timeseries
In timeseries analysis, correlation is a key measure of the linear relationship between two variables. In financial markets, as well as in other industries, correlation is used for prediction and risk management.
We can use pandas' .corr() to calculate correlations and seaborn's jointplot() to visualize pairwise linear relationships.
price.head()
KODK IRBT AMZN GOOG
Date
2015-09-14 14.99 29.889999 521.380005 623.239990
2015-09-15 15.41 30.320000 522.369995 635.140015
2015-09-16 15.45 30.270000 527.390015 635.979980
2015-09-17 15.68 31.340000 538.869995 642.900024
2015-09-18 15.58 30.330000 540.260010 629.250000
daily_returns = price.pct_change()
sns.jointplot(x='AMZN', y='GOOG', data=daily_returns);
sns.jointplot(x='IRBT', y='KODK', data=daily_returns);
# calculate correlations
correlations = daily_returns.corr()
correlations
KODK IRBT AMZN GOOG
KODK 1.000000 0.080010 0.058488 0.083955
IRBT 0.080010 1.000000 0.340479 0.378362
AMZN 0.058488 0.340479 1.000000 0.680849
GOOG 0.083955 0.378362 0.680849 1.000000
# plot in a heatmap
sns.heatmap(correlations, annot=True);