How to slice on DateTime objects more efficiently and compute a given statistic at each iteration?
I am dealing with a pandas `DataFrame` where the index is a DateTime object and the columns represent minute-by-minute returns on several stocks from the SP500 index, together with a column of returns from the index itself. It is fairly long (100 stocks, 1510 trading days, minute-by-minute data each day) and looks like this (only three stocks for the sake of example):
```
DateTime          SPY     AAPL    AMZN    T
2014-01-02 9:30   0.032   -0.01   0.164   0.007
2014-01-02 9:31   -0.012  0.02    0.001   -0.004
2014-01-02 9:32   -0.015  0.031   0.004   -0.001
```
I am trying to compute the betas of each stock for each day and for each 30-minute window. The beta of a stock in this case is defined as the covariance between its returns and the SPY returns, divided by the variance of SPY over the same period. My desired output is a 3-dimensional numpy array `beta_HF`, where `beta_HF[s, i, j]` means the beta of stock `s` on day `i` in window `j`. At the moment, I am computing the betas in the following way (let `returns` be the full dataframe):
```python
trading_days = pd.unique(returns.index.date)
window = "30min"
moments = pd.date_range(start="9:30", end="16:00", freq=window).time

def dispersion(trading_days, moments, df, verbose=True):
    index = 'SPY'
    beta_HF = np.zeros((df.shape[1] - 1, len(trading_days), len(moments) - 1))
    for i, day in enumerate(trading_days):
        daily_data = df[df.index.date == day]
        start_time = dt.time(9, 30)
        for j, end_time in enumerate(moments[1:]):
            moment_data = daily_data.between_time(start_time, end_time)
            covariances = np.array([moment_data[index].cov(moment_data[symbol]) for symbol in df])
            beta_HF[:, i, j] = covariances[1:] / covariances[0]
        if verbose:
            if np.remainder(i, 100) == 0:
                print("Current Trading Day: {}".format(day))
    return beta_HF
```
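As a sanity check on the definition itself, the beta for a single window reduces to one covariance divided by one variance. A minimal sketch with hypothetical numbers (the series values here are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical one-window sample: five minute-returns for SPY and one stock
spy   = pd.Series([0.032, -0.012, -0.015, 0.020, 0.001])
stock = pd.Series([-0.010, 0.020, 0.031, -0.004, 0.002])

# Beta as defined above: cov(stock, SPY) / var(SPY)
beta = spy.cov(stock) / spy.var()

# Cross-check against the full 2x2 covariance matrix (same ddof=1 convention)
cov_mat = np.cov(np.vstack([spy, stock]))
assert np.isclose(beta, cov_mat[0, 1] / cov_mat[0, 0])
```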
The `dispersion()` function generates the correct output. However, I understand that I am looping over long iterables, which is not very efficient. I am looking for a more efficient way to "slice" the dataframe at each 30-minute window of each day in the sample and compute the covariances. Effectively, for each slice, I need to compute 101 numbers (100 covariances + 1 variance). On my local machine (a 2013 Retina i5 MacBook Pro) it takes around 8 minutes to compute everything. I tested it on a research server at my university and the computing time was basically the same, which probably implies that computing power is not the bottleneck but that this part of my code is of low quality. I would appreciate any ideas on how to make this faster.
One might point out that parallelization is the way to go here, since the elements of `beta_HF` never interact with each other, so the problem seems easy to parallelize. However, I have never implemented anything with parallelization, so I am very new to these concepts. Any ideas on how to make the code run faster? Thanks a lot!
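For what it's worth, the day-level loop can be handed to an executor with only light restructuring. This is a hedged sketch, not the question's original method: `betas_for_day` and `dispersion_parallel` are names invented here, and it uses threads for simplicity; for true multi-core speedups on CPU-bound work you would typically swap in `concurrent.futures.ProcessPoolExecutor` or `joblib` instead.

```python
import datetime as dt
import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def betas_for_day(daily_data, moments):
    """Beta matrix (n_stocks, n_windows) for one day; 'SPY' must be the first column."""
    start = dt.time(9, 30)
    out = np.zeros((daily_data.shape[1] - 1, len(moments) - 1))
    for j, end in enumerate(moments[1:]):
        window = daily_data.between_time(start, end)
        cov = np.array([window['SPY'].cov(window[s]) for s in daily_data])
        out[:, j] = cov[1:] / cov[0]  # cov[0] is var(SPY)
    return out

def dispersion_parallel(df, trading_days, moments, workers=4):
    per_day = [df[df.index.date == day] for day in trading_days]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        results = list(ex.map(betas_for_day, per_day, [moments] * len(per_day)))
    return np.stack(results, axis=1)  # shape (stocks, days, windows), like beta_HF
```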
You can use pandas `Grouper` to group your data by frequency. The only drawbacks are that you cannot have overlapping windows and that it will iterate over times at which no data exists.
The first issue basically means that the window will slide from 9:30–9:59 to 10:00–10:29 instead of from 9:30–10:00 to 10:00–10:30.
The second issue comes into play during holidays and overnight, when no trading takes place. Hence, if you have a large period without trading, you might want to split the `DataFrame` and combine the pieces afterwards.
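To see both drawbacks concretely, here is a small illustration (with made-up data) of how `pd.Grouper` anchors 30-minute bins to the clock and emits empty bins across the overnight gap:

```python
import numpy as np
import pandas as pd

# Two short trading days, minute data, with an overnight gap in between
idx = pd.date_range("2014-01-02 09:30", "2014-01-02 10:29", freq="min").append(
    pd.date_range("2014-01-03 09:30", "2014-01-03 10:29", freq="min"))
df = pd.DataFrame({"SPY": np.random.rand(len(idx))}, index=idx)

sizes = df.groupby(pd.Grouper(freq="30min")).size()
# Bins are anchored to the clock: 09:30-09:59 falls into the 09:30 bin, etc.
# The overnight gap produces a long run of empty (size 0) bins.
```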
Create example data
```python
import pandas as pd
import numpy as np

time = pd.date_range(start="2014-01-02 09:30", end="2014-01-02 16:00", freq="min")
time = time.append(
    pd.date_range(start="2014-01-03 09:30", end="2014-01-03 16:00", freq="min")
)
df = pd.DataFrame(data=np.random.rand(time.shape[0], 4) - 0.5,
                  index=time,
                  columns=['SPY', 'AAPL', 'AMZN', 'T'])
```
define the range you want to use
```python
freq = '30min'
obs_per_day = len(pd.date_range(start="9:30", end="16:00", freq="30min"))
trading_days = len(pd.unique(df.index.date))
```
make a function to calculate the beta values
```python
def beta(df):
    if df.empty:                           # return NaN when no trading takes place
        return np.nan
    mat = df.to_numpy()                    # numpy is faster than pandas
    m = mat.mean(axis=0)
    mat = mat - m[np.newaxis, :]           # demean
    dof = mat.shape[0] - 1                 # degrees of freedom
    if dof != 0:                           # check that the data has more than one observation
        mat = mat.T.dot(mat[:, 0]) / dof   # covariance with the first column (SPY)
        return mat[1:] / mat[0]            # beta
    else:
        return np.zeros(mat.shape[1] - 1)  # return zeros for too-short groups, e.g. 16:00
```
and in the end use `df.groupby().apply()`
```python
res = df.groupby(pd.Grouper(freq=freq)).apply(beta)
res = np.array([k for k in res.values if ~np.isnan(k).any()])  # remove NaN
res = res.reshape([trading_days, obs_per_day, df.shape[1] - 1])
```
Note that the result is in a slightly different shape than yours.
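If you prefer the question's `beta_HF[s, i, j]` axis order, a transpose converts between the two layouts. A small sketch with a toy array standing in for `res`:

```python
import numpy as np

# Toy stand-in for res with shape (days, windows, stocks)
res = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)

# Move stocks to the front: (stocks, days, windows), matching beta_HF[s, i, j]
beta_HF = res.transpose(2, 0, 1)
assert beta_HF.shape == (4, 2, 3)
assert beta_HF[1, 0, 2] == res[0, 2, 1]
```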
The results also differ a bit because of the different window sliding. To check whether the results are the same, simply try something like this:
```python
trading_days = pd.unique(df.index.date)

# Your result
moments1 = pd.date_range(start="9:30", end="10:00", freq="30min").time
beta(df[df.index.date == trading_days[0]].between_time(moments1[0], moments1[1]))

# mine
moments2 = pd.date_range(start="9:30", end="10:00", freq="29min").time
beta(df[df.index.date == trading_days[0]].between_time(moments2[0], moments2[1]))
```