Working with Timeseries Data II
Notes on Python/pandas for TimeSeries Data from Datacamp’s Manipulating Time Series course
Comparing Timeseries Growth Rate
- Timeseries data, stock prices for instance, are hard to compare when they trade at different levels. To compare them, we normalize each series so that it starts at 100.
- This is done by dividing all prices by the first element in the series.
- As a result, the first value equals 1 and each subsequent value reflects the change relative to that starting point.
- Multiplying the normalized series by 100 expresses the change relative to the initial price in percentage points. For instance, a move from 100 to 130 represents a 30 percentage point increase (a tiny toy illustration follows below).
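As a minimal sketch with a made-up four-value price series (the numbers are arbitrary):
import pandas as pd
# toy price series
prices = pd.Series([50.0, 55.0, 65.0, 60.0])
normalized = prices.div(prices.iloc[0])   # first value becomes 1
rebased = normalized.mul(100)             # first value becomes 100
print(rebased)                            # 100.0, 110.0, 130.0, 120.0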
Now let's work through a full example with real stock data.
# import packages
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Create a DataFrame containing the adjusted close price ('Adj Close') for Kodak ('KODK'), iRobot ('IRBT'), Amazon ('AMZN'), and Google ('GOOG') using pandas-datareader, which pulls the data from the web.
all_data = {ticker : web.get_data_yahoo(ticker)
for ticker in ['KODK', 'IRBT', 'AMZN', 'GOOG']}
price = pd.DataFrame({ticker : data['Adj Close']
for ticker, data in all_data.items()})
price.head()
KODK IRBT AMZN GOOG
Date
2015-09-14 14.99 29.889999 521.380005 623.239990
2015-09-15 15.41 30.320000 522.369995 635.140015
2015-09-16 15.45 30.270000 527.390015 635.979980
2015-09-17 15.68 31.340000 538.869995 642.900024
2015-09-18 15.58 30.330000 540.260010 629.250000
price.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1257 entries, 2015-09-14 to 2020-09-09
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 KODK 1257 non-null float64
1 IRBT 1257 non-null float64
2 AMZN 1257 non-null float64
3 GOOG 1257 non-null float64
dtypes: float64(4)
memory usage: 49.1 KB
# Select the first price for the stocks
price.iloc[0]
KODK 14.990000
IRBT 29.889999
AMZN 521.380005
GOOG 623.239990
Name: 2015-09-14 00:00:00, dtype: float64
This returns a Series containing the first row of the DataFrame, which we can now use to normalize the data.
normalized_df = price.div(price.iloc[0])
normalized_df.head()
KODK IRBT AMZN GOOG
Date
2015-09-14 1.000000 1.000000 1.000000 1.000000
2015-09-15 1.028019 1.014386 1.001899 1.019094
2015-09-16 1.030687 1.012713 1.011527 1.020442
2015-09-17 1.046031 1.048511 1.033546 1.031545
2015-09-18 1.039360 1.014721 1.036212 1.009643
We need a benchmark to compare performance against, so we import data for the SPDR S&P 500 ETF (SPY).
bench = {ticker : web.get_data_yahoo(ticker)
for ticker in ['SPY']}
bench_price = pd.DataFrame({ticker : data['Adj Close']
for ticker, data in bench.items()})
bench_price.head()
SPY
Date
2015-09-14 177.454758
2015-09-15 179.672806
2015-09-16 181.229980
2015-09-17 180.822556
2015-09-18 177.867691
bench_price.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1257 entries, 2015-09-14 to 2020-09-09
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SPY 1257 non-null float64
dtypes: float64(1)
memory usage: 19.6 KB
Now that we have the stock price and the benchmark data, we can combine them into a single dataframe using pd.concat
price_concat = pd.concat([price, bench_price], axis=1).dropna()
price_concat.head(3)
KODK IRBT AMZN GOOG SPY
Date
2015-09-14 14.99 29.889999 521.380005 623.239990 177.454758
2015-09-15 15.41 30.320000 522.369995 635.140015 179.672806
2015-09-16 15.45 30.270000 527.390015 635.979980 181.229980
price_concat.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1257 entries, 2015-09-14 to 2020-09-09
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 KODK 1257 non-null float64
1 IRBT 1257 non-null float64
2 AMZN 1257 non-null float64
3 GOOG 1257 non-null float64
4 SPY 1257 non-null float64
dtypes: float64(5)
memory usage: 58.9 KB
We can use the concatenated dataframe to compare the prices against the benchmark. We first normalize the data and multiply by 100, then plot it to see how each stock performed against SPY.
normalized = price_concat.div(price_concat.iloc[0]).mul(100)
normalized.plot(title="Stocks / SPY",figsize = (8,6));
To see the performance difference for each individual stock relative to the benchmark in percentage points, we can subtract the normalized SPY from the normalized stock prices. Use .sub() with the keyword axis=0 to align the Series index with the DataFrame index; this makes pandas subtract the Series from each column. Then plot the data to see how each stock performed relative to the benchmark.
diff = normalized.sub(normalized['SPY'], axis=0)
diff.head()
KODK IRBT AMZN GOOG SPY
Date
2015-09-14 0.000000 0.000000 0.000000 0.000000 0.0
2015-09-15 1.551945 0.188686 -1.060044 0.659458 0.0
2015-09-16 0.941284 -0.856097 -0.974716 -0.083274 0.0
2015-09-17 2.705238 2.953289 1.456723 1.256654 0.0
2015-09-18 3.703261 1.239368 3.388462 0.731619 0.0
diff[['KODK', 'IRBT', 'AMZN', 'GOOG']].plot(title="Stock performance / SPY", figsize = (8,6));
Changing Frequency - Resampling
We can change the frequency of a DatetimeIndex using .asfreq(). Keep in mind that changing the frequency also changes the underlying data.
When upsampling, converting the data to a higher frequency, we create new rows and need to tell pandas how to fill or interpolate the missing data.
When downsampling, we reduce the number of rows and need to tell pandas how to aggregate the existing data.
We can use the following methods to manipulate our data:
- .asfreq()
- .reindex()
- .resample() + a transformation method
Let's illustrate this concept by creating some random quarterly data and resampling it.
# Resample data
dates = pd.date_range(start='2019', periods=4, freq='Q')
data = range(1, 5)
quaterly = pd.Series(data=data, index=dates)
quaterly
2019-03-31 1
2019-06-30 2
2019-09-30 3
2019-12-31 4
Freq: Q-DEC, dtype: int64
# change to monthly frequency
monthly = quaterly.asfreq('M')
monthly
2019-03-31 1.0
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 2.0
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 3.0
2019-10-31 NaN
2019-11-30 NaN
2019-12-31 4.0
Freq: M, dtype: float64
Upsampling to fill in missing values
We can convert a Series to a DataFrame by applying the .to_frame() method, passing the desired column name as a parameter.
monthly = monthly.to_frame('baselines')
monthly
baselines
2019-03-31 1.0
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 2.0
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 3.0
2019-10-31 NaN
2019-11-30 NaN
2019-12-31 4.0
We can impute the missing values using the following methods:
- forward fill – method='ffill'
- back fill – method='bfill'
- fill with 0 – fill_value=0
monthly['ffill'] = quaterly.asfreq('M', method='ffill')
monthly['bfill'] = quaterly.asfreq('M', method='bfill')
monthly['value'] = quaterly.asfreq('M', fill_value=0)
monthly
baselines ffill bfill value
2019-03-31 1.0 1 1 1
2019-04-30 NaN 1 2 0
2019-05-31 NaN 1 2 0
2019-06-30 2.0 2 2 2
2019-07-31 NaN 2 3 0
2019-08-31 NaN 2 3 0
2019-09-30 3.0 3 3 3
2019-10-31 NaN 3 4 0
2019-11-30 NaN 3 4 0
2019-12-31 4.0 4 4 4
Reindexing
Let's say we want a monthly DatetimeIndex for a full year; we can use .reindex(). In this case pandas aligns the existing data with the new monthly values and creates missing values elsewhere, and we can use the same filling methods we saw above to impute the missing data.
dates = pd.date_range(start='2019',
periods=12,
freq='M')
quaterly.reindex(dates)
2019-01-31 NaN
2019-02-28 NaN
2019-03-31 1.0
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 2.0
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 3.0
2019-10-31 NaN
2019-11-30 NaN
2019-12-31 4.0
Freq: M, dtype: float64
quaterly.reindex(dates, method='bfill')
2019-01-31 1
2019-02-28 1
2019-03-31 1
2019-04-30 2
2019-05-31 2
2019-06-30 2
2019-07-31 3
2019-08-31 3
2019-09-30 3
2019-10-31 4
2019-11-30 4
2019-12-31 4
Freq: M, dtype: int64
quaterly.reindex(dates, method='ffill')
2019-01-31 NaN
2019-02-28 NaN
2019-03-31 1.0
2019-04-30 1.0
2019-05-31 1.0
2019-06-30 2.0
2019-07-31 2.0
2019-08-31 2.0
2019-09-30 3.0
2019-10-31 3.0
2019-11-30 3.0
2019-12-31 4.0
Freq: M, dtype: float64
Frequency Conversion & Transformation Methods
Pandas' .resample() follows a logic similar to .groupby(): it groups the data within resampling periods and applies one or several methods to each group. It takes the value created by each method and assigns it a new date within the resampling period. The new date is determined by the offset, and it can be located at the start, the end, or a custom location.
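As a minimal sketch of where the new dates land, using made-up daily data (the offset aliases 'M' and 'MS' are the standard pandas aliases for month end and month start):
import numpy as np
import pandas as pd
# made-up daily data to resample
idx = pd.date_range('2019-01-01', periods=90, freq='D')
ts = pd.Series(np.arange(90), index=idx)
# monthly mean labelled at the period end ('M') ...
print(ts.resample('M').mean().head(3))
# ... versus the period start ('MS'); the group means are the same,
# only the label (date) placement differs
print(ts.resample('MS').mean().head(3))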
Let’s apply these concepts to the Civilian Unemployment Rate data for the US.
I am going to use Quandl to retrieve the unemployment data. You can pip install it by running the command below in a terminal or a notebook cell:
!pip3 install quandl
import quandl
# pass your API key and dataset name
quandl.ApiConfig.api_key = "Enter your API Key"
# retrieve unemployment rate data
df_unemployment = quandl.get("FRED/UNRATE")
df_unemployment.tail()
Value
Date
2020-04-01 14.7
2020-05-01 13.3
2020-06-01 11.1
2020-07-01 10.2
2020-08-01 8.4
df_unemployment.plot(title="US Unemployment Rate", figsize = (8,6));
df_unemployment.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 872 entries, 1948-01-01 to 2020-08-01
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Value 872 non-null float64
dtypes: float64(1)
memory usage: 13.6 KB
As we can see, the dataset contains the unemployment rate for the US from 1948 to now (Aug 2020). It has a monthly frequency as each data point is entered at the beginning of each month.
To make it more manageable, we will create a new data set that contains only the last 14 years of data.
We will then reindex these data to weekly values. Reindexing will create NaNs that we will have to impute.
df_unempl_14_years = df_unemployment['2006':]
df_unempl_14_years.head()
Value
Date
2006-01-01 4.7
2006-02-01 4.8
2006-03-01 4.7
2006-04-01 4.7
2006-05-01 4.6
df_unempl_14_years.plot(title='US Unemployment Last 14 Years')
# Create weekly dates
weekly_dates_un = pd.date_range(start=df_unempl_14_years.index.min(),
end=df_unempl_14_years.index.max(),
freq='W')
# reindex from monthly to weekly data
weekly_un = df_unempl_14_years.reindex(weekly_dates_un)
weekly_un
Value
2006-01-01 4.7
2006-01-08 NaN
2006-01-15 NaN
2006-01-22 NaN
2006-01-29 NaN
... ...
2020-06-28 NaN
2020-07-05 NaN
2020-07-12 NaN
2020-07-19 NaN
2020-07-26 NaN
weekly_un.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 761 entries, 2006-01-01 to 2020-07-26
Freq: W-SUN
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Value 27 non-null float64
dtypes: float64(1)
memory usage: 11.9 KB
# Create a new column and forward fill the data
weekly_un['ffill'] = weekly_un.Value.ffill()
weekly_un['interpolated'] = weekly_un.Value.interpolate()
# plot data
weekly_un.plot(figsize = (8,6));
Downsampling & Aggregating Methods
We can define downsampling as reducing the frequency of our timeseries data. For instance, we can reduce hourly data to daily data and daily to monthly data. To achieve this, we need to determine how to summarize/aggregate the existing data – the 24 hourly data points will become a single datapoint when downsampling from hourly to daily.
The options for aggregating the data include the mean, median, or last value, and the choice is determined by the context and needs of our analysis.
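As a quick sketch with hypothetical hourly data, downsampling to daily frequency collapses each day's 24 hourly points into a single value per aggregation:
import numpy as np
import pandas as pd
np.random.seed(0)
# hypothetical hourly series covering three days
hourly = pd.Series(np.random.randn(72),
                   index=pd.date_range('2020-01-01', periods=72, freq='H'))
# each daily value aggregates that day's 24 hourly observations
daily = hourly.resample('D').agg(['mean', 'median', 'last'])
print(daily)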
For this example, we can use the CO2 Concentration dataset from vega_datasets, which contains monthly values starting in 1958.
# import library for toy datasets
from vega_datasets import data as vds
df_co2 = vds.co2_concentration()
df_co2.head(5)
Date CO2 adjusted CO2
0 1958-03-01 315.70 314.44
1 1958-04-01 317.46 315.16
2 1958-05-01 317.51 314.71
3 1958-07-01 315.86 315.19
4 1958-08-01 314.93 316.19
df_co2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 741 entries, 0 to 740
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 741 non-null object
1 CO2 741 non-null float64
2 adjusted CO2 741 non-null float64
dtypes: float64(2), object(1)
memory usage: 17.5+ KB
By inspecting the dataframe's info we can see that it is not yet a timeseries dataframe, so we must convert the Date column from object to datetime and set it as the index, giving us a DatetimeIndex.
df_co2.Date = pd.to_datetime(df_co2.Date)
# set date as index
df_co2.set_index('Date', inplace=True)
df_co2.head()
CO2 adjusted CO2
Date
1958-03-01 315.70 314.44
1958-04-01 317.46 315.16
1958-05-01 317.51 314.71
1958-07-01 315.86 315.19
1958-08-01 314.93 316.19
df_co2.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 741 entries, 1958-03-01 to 2020-04-01
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CO2 741 non-null float64
1 adjusted CO2 741 non-null float64
dtypes: float64(2)
memory usage: 17.4 KB
#plot data
df_co2.plot(title='CO2 Concentration', figsize=(10,8));
# Change frequency from monthly to yearly
# and aggregate using the mean
df_co2_yearly = df_co2.resample('Y').mean()
# display head
df_co2_yearly.head()
CO2 adjusted CO2
Date
1958-12-31 315.333750 315.300000
1959-12-31 315.981667 315.977500
1960-12-31 316.909167 316.907500
1961-12-31 317.645000 317.638333
1962-12-31 318.454167 318.448333
df_co2_yearly.plot(title="CO2 Yearly Average Concentrations", figsize=(10,8))
We aggregated the data using .mean(), but we also have other options. We can apply multiple aggregations at once by using the .agg() method and passing a list of aggregation functions such as median and std. Note that the snippet below applies .agg() to df_co2_yearly, which already holds a single value per year, so the standard deviation comes back as NaN; applying the same call to the original monthly df_co2 would produce a meaningful spread for each year.
df_co2_yearly.resample('Y').agg(['mean', 'std']).head()
CO2 adjusted CO2
mean std mean std
Date
1958-12-31 315.333750 NaN 315.300000 NaN
1959-12-31 315.981667 NaN 315.977500 NaN
1960-12-31 316.909167 NaN 316.907500 NaN
1961-12-31 317.645000 NaN 317.638333 NaN
1962-12-31 318.454167 NaN 318.448333 NaN
ax = df_co2[['CO2']].plot(figsize=(10, 8))
yearly = df_co2[['CO2']].resample('Y').mean()
yearly.add_suffix('_yearly').plot(ax=ax) # ax to plot on same plot
We can revisit the price dataframe we created earlier to practice resampling, changing frequency, and aggregation methods.
price.head()
KODK IRBT AMZN GOOG
Date
2015-09-14 14.99 29.889999 521.380005 623.239990
2015-09-15 15.41 30.320000 522.369995 635.140015
2015-09-16 15.45 30.270000 527.390015 635.979980
2015-09-17 15.68 31.340000 538.869995 642.900024
2015-09-18 15.58 30.330000 540.260010 629.250000
# create monthly averages
monthly_average = price.resample('M').mean()
monthly_average.head()
KODK IRBT AMZN GOOG
Date
2015-09-30 15.434615 29.640000 526.396160 621.773071
2015-10-31 15.185909 30.333636 566.743181 663.592718
2015-11-30 13.173000 31.819500 657.695499 735.388498
2015-12-31 12.955454 34.823182 669.262279 755.354548
2016-01-31 9.513684 32.277369 601.061578 718.495792
# plot data
monthly_average.plot(title="Stocks Monthly Average", subplots=True, figsize=(10,8));
ax = price.plot(figsize=(10, 8))
m_avg = price.resample('M').mean()
m_avg.add_suffix('_m_avg').plot(ax=ax)
# calculate daily returns for each stock
# .squeeze() would convert a single-column frame to a Series; with several
# columns it returns the DataFrame unchanged and pct_change works column-wise
daily_returns = price.squeeze().pct_change()
# Resample and calculate stats
stats = daily_returns.resample('M').agg(['mean', 'median', 'std'])
# plot stats
stats.plot(title="Stocks Stats", subplots=True, figsize=(12,12));
Working with Window Functions Using Pandas
Windows are useful because they allow us to work with sub-periods of our timeseries data. We can calculate metrics of the data inside the window and the results can be assigned to a new dataframe whose datapoints represent a summary of the datapoints in the original series.
The two main types of windows are:
- Rolling windows – these maintain a fixed size while sliding over the timeseries, so each data point is the result of a given number of observations.
- Expanding windows – these grow with the timeseries, so the calculation that produces a new datapoint is the result of all previous datapoints.
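Here is a tiny toy comparison of the two window types (the numbers are arbitrary):
import pandas as pd
s = pd.Series([1, 2, 3, 4, 5])
# rolling: a fixed window of the last 3 observations
print(s.rolling(window=3).mean())   # NaN, NaN, 2.0, 3.0, 4.0
# expanding: the window grows to include every observation so far
print(s.expanding().mean())         # 1.0, 1.5, 2.0, 2.5, 3.0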
Let’s calculate a rolling average using pandas:
# create a new dataframe for AMZN stock price
# (single bracket returns a Series)
# (double bracket returns a DataFrame)
amzn_price = price[['AMZN']].copy()  # .copy() so we can safely add columns later
amzn_price.head()
AMZN
Date
2015-09-14 521.380005
2015-09-15 522.369995
2015-09-16 527.390015
2015-09-17 538.869995
2015-09-18 540.260010
amzn_price.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1257 entries, 2015-09-14 to 2020-09-09
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AMZN 1257 non-null float64
dtypes: float64(1)
memory usage: 59.6 KB
amzn_price.plot(title='AMZN Stock Price')
# Calculate the rolling average
# Integer-based window size
amzn_price.rolling(window=30).mean().plot()
# Offset-based window size
# 30D = 30 calendar days
amzn_price.rolling(window='30D').mean().plot()
ax = amzn_price.plot(figsize=(10, 8))
roll_30 = amzn_price.rolling(window='30D').mean()
roll_90 = amzn_price.rolling(window='90D').mean()
roll_360 = amzn_price.rolling(window='360D').mean()
roll_30.add_suffix('_mean_30').plot(ax=ax)
roll_90.add_suffix('_mean_90').plot(ax=ax)
roll_360.add_suffix('_mean_360').plot(ax=ax)
Calculating Multiple Rolling Metrics
As we've seen before, we can pass a list of metrics as an argument to .agg():
mul_metrics = amzn_price.rolling(window='90D').agg(['mean', 'std'])
# plot metrics
mul_metrics.plot(subplots=True);
# Calculate quantiles
# create rolling window
rolling = amzn_price.rolling(360)['AMZN']
# define quantiles
amzn_price['q10'] = rolling.quantile(.1)
amzn_price['q50'] = rolling.quantile(.5)
amzn_price['q90'] = rolling.quantile(.9)
amzn_price.plot();
Expanding Windows with Pandas
We can use expanding windows to calculate metrics up to the current date. They return a new timeseries that reflects all historical values. Expanding windows are useful for calculating the running rate of return and the running min/max.
Pandas gives us two options to calculate expanding windows:
- .expanding(), which is used just like .rolling()
- the cumulative methods .cumsum(), .cumprod(), and .cummin()/.cummax()
# implementing the basic idea
df = pd.DataFrame({'data' : range(5)})
df['expanding_sum'] = df.data.expanding().sum()
df['cumulative_sum'] = df.data.cumsum()
df
data expanding_sum cumulative_sum
0 0 0.0 0
1 1 1.0 1
2 2 3.0 3
3 3 6.0 6
4 4 10.0 10
How to calculate a running return
- Single-period return \(r_t\): current price over last price minus 1, \(r_t = \frac{P_t}{P_{t-1}} - 1\)
- Multi-period return: the product of \((1 + r_t)\) over all periods minus 1, \(R_T = (1+r_1)(1 + r_2)\ldots(1+r_T) - 1\). For example, single-period returns of 10%, 20%, and -5% compound to \(1.10 \times 1.20 \times 0.95 - 1 = 0.254\), or 25.4% (see the small check below).
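As a quick check of that arithmetic, the hand-compounded figure matches the running cumprod-based calculation used next (the return values are hypothetical):
import pandas as pd
# hypothetical single-period returns: 10%, 20%, -5%
r = pd.Series([0.10, 0.20, -0.05])
manual = (1 + r).prod() - 1              # 1.10 * 1.20 * 0.95 - 1 = 0.254
running = r.add(1).cumprod().sub(1)      # running multi-period return
print(manual, running.iloc[-1])          # both ~0.254 (up to float rounding)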
# calculate percentage change
pr = price.pct_change()
pr_1 = pr.add(1)
cumulative_return = pr_1.cumprod().sub(1)
cumulative_return.mul(100).plot(figsize=(8,6));
Calculating running min and max
amzn_price.head()
AMZN q10 q50 q90
Date
2015-09-14 521.380005 NaN NaN NaN
2015-09-15 522.369995 NaN NaN NaN
2015-09-16 527.390015 NaN NaN NaN
2015-09-17 538.869995 NaN NaN NaN
2015-09-18 540.260010 NaN NaN NaN
amz_price = amzn_price[['AMZN']].copy()  # fresh copy before adding running min/max columns
amz_price.head()
AMZN
Date
2015-09-14 521.380005
2015-09-15 522.369995
2015-09-16 527.390015
2015-09-17 538.869995
2015-09-18 540.260010
amz_price['running_min'] = amz_price.AMZN.expanding().min()
amz_price['running_max'] = amz_price.AMZN.expanding().max()
amz_price.plot(figsize=(10,8));
Calculating Rolling Annual Rate of Return
def multi_period_return(period_returns):
return np.product(period_returns + 1) - 1
pr = amz_price.AMZN.pct_change()
r = pr.rolling('360D').apply(multi_period_return)
amz_price['Rolling 1yr Return'] = r.mul(100)
amz_price.plot(subplots=True, figsize=(10,8));
To apply these new skills, we are going to calculate the cumulative return on a $1000 investment in Amazon vs. Google.
# Create a new dataframe just for google and amazon prices
investment_comparison = price[['AMZN', 'GOOG']]
investment_comparison.head()
AMZN GOOG
Date
2015-09-14 521.380005 623.239990
2015-09-15 522.369995 635.140015
2015-09-16 527.390015 635.979980
2015-09-17 538.869995 642.900024
2015-09-18 540.260010 629.250000
# define investment variable
investment = 1000
# calculate returns
returns = investment_comparison.pct_change()
# calculate cumulative returns
r1 = returns.add(1)
cumulative_rtrn = r1.cumprod()
# calculate and plot the investment
cumulative_rtrn.mul(investment).plot(title=("AMZN vs GOOG 1000 Investment Comparison"), figsize=(10,8))
It is clear that Amazon greatly outperformed Google over the entire period. To look deeper, we can examine 1-year sub-periods and see how they compare.
# lets look at the data again
investment_comparison.head()
AMZN GOOG
Date
2015-09-14 521.380005 623.239990
2015-09-15 522.369995 635.140015
2015-09-16 527.390015 635.979980
2015-09-17 538.869995 642.900024
2015-09-18 540.260010 629.250000
# Define mult_period_return function
def multi_period_return(period_returns):
return np.product(period_returns + 1) - 1
# calculate daily returns
daily_rtrn = investment_comparison.pct_change()
# calculate rolling annual returns
rolling_annual_returns = daily_rtrn.rolling('360D').apply(multi_period_return)
# multiply by 100 and plot returns
rolling_annual_returns.mul(100).plot(figsize=(10,8))
It is clear that Amazon stock outperformed Google for the most part, but there are a few instances where Google performed slightly better than Amazon.
Random Walk & Simulations
Predicting stock prices is a difficult task. Many of the models used for prediction assume that stock returns follow a random process, so we can generate random numbers to simulate this behavior.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
from numpy.random import seed, normal
from scipy.stats import norm
seed(42)
random_returns = normal(loc=0, scale=0.01, size=1000)
sns.distplot(random_returns, fit=norm, kde=False);
Create a random price path
return_series = pd.Series(random_returns)
random_prices = return_series.add(1).cumprod().sub(1)
random_prices.mul(100).plot();
Calculate SPY returns
bench_price.head()
SPY
Date
2015-09-14 177.454758
2015-09-15 179.672806
2015-09-16 181.229980
2015-09-17 180.822556
2015-09-18 177.867691
bench_price['returns'] = bench_price.SPY.pct_change()
# plot price and return
bench_price.plot(subplots=True);
sns.distplot(bench_price.returns.dropna().mul(100), fit=norm);
# generate random and SPY 500 returns
from numpy.random import choice
sample = bench_price.returns.dropna()
# number of observations
n_obs = bench_price.returns.count()
random_walk = choice(sample, size=n_obs)
random_walk = pd.Series(random_walk, index=sample.index)
random_walk.head()
Date
2015-09-15 -0.000752
2015-09-16 0.004779
2015-09-17 -0.008254
2015-09-18 -0.024997
2015-09-21 -0.000259
dtype: float64
start = bench_price.SPY.first('D')
start
Date
2015-09-14 177.454758
Name: SPY, dtype: float64
spy_random = start.append(random_walk.add(1))
spy_random.head()
Date
2015-09-14 177.454758
2015-09-15 0.999248
2015-09-16 1.004779
2015-09-17 0.991746
2015-09-18 0.975003
dtype: float64
# calculate cumulative product
bench_price['SPY500_random'] = spy_random.cumprod()
# plot data
bench_price[['SPY', 'SPY500_random']].plot()
Correlations – Relationships between Timeseries
In timeseries analysis, correlation is a key measure of the linear relationship between two variables. In financial markets, as well as in other industries, correlation is used for prediction and risk management.
We can use pandas' .corr() to calculate correlations and seaborn's jointplot() to visualize pairwise linear relationships.
price.head()
KODK IRBT AMZN GOOG
Date
2015-09-14 14.99 29.889999 521.380005 623.239990
2015-09-15 15.41 30.320000 522.369995 635.140015
2015-09-16 15.45 30.270000 527.390015 635.979980
2015-09-17 15.68 31.340000 538.869995 642.900024
2015-09-18 15.58 30.330000 540.260010 629.250000
daily_returns = price.pct_change()
sns.jointplot(x='AMZN', y='GOOG', data=daily_returns);
sns.jointplot(x='IRBT', y='KODK', data=daily_returns);
# calculate correlations
correlations = daily_returns.corr()
correlations
KODK IRBT AMZN GOOG
KODK 1.000000 0.080010 0.058488 0.083955
IRBT 0.080010 1.000000 0.340479 0.378362
AMZN 0.058488 0.340479 1.000000 0.680849
GOOG 0.083955 0.378362 0.680849 1.000000
# plot in a heatmap
sns.heatmap(correlations, annot=True);