How to slice on DateTime objects more efficiently and compute a given statistic at each iteration?

I am dealing with a pandas dataframe where the index is a DateTime object and the columns represent minute-by-minute returns on several stocks from the S&P 500 index, together with a column of returns for the index itself. It's fairly long (100 stocks, 1510 trading days, minute-by-minute data each day) and looks like this (only three stocks shown for the sake of example):

DateTime          SPY     AAPL    AMZN    T
2014-01-02 9:30   0.032  -0.01    0.164   0.007
2014-01-02 9:31  -0.012   0.02    0.001  -0.004
2014-01-02 9:32  -0.015   0.031   0.004  -0.001

I am trying to compute the betas of each stock for each day and for each 30-minute window. The beta of a stock here is defined as the covariance between its returns and the SPY returns, divided by the variance of SPY over the same period: beta_s = Cov(r_s, r_SPY) / Var(r_SPY). My desired output is a 3-dimensional numpy array beta_HF where beta_HF[s, i, j] means the beta of stock s on day i in window j. At the moment I am computing the betas in the following way (let returns be the full dataframe):

import datetime as dt

import numpy as np
import pandas as pd

trading_days = pd.unique(returns.index.date)
window = "30min"
moments = pd.date_range(start="9:30", end="16:00", freq=window).time

def dispersion(trading_days, moments, df, verbose=True):
    index = 'SPY'
    beta_HF = np.zeros((df.shape[1] - 1, len(trading_days), len(moments) - 1))

    for i, day in enumerate(trading_days):
        daily_data = df[df.index.date == day]
        start_time = dt.time(9, 30)

        for j, end_time in enumerate(moments[1:]):
            # slice from 9:30 up to the current half-hour mark
            moment_data = daily_data.between_time(start_time, end_time)
            # covariances[0] is Var(SPY), since SPY is the first column
            covariances = np.array([moment_data[index].cov(moment_data[symbol])
                                    for symbol in df])
            beta_HF[:, i, j] = covariances[1:] / covariances[0]

        if verbose and np.remainder(i, 100) == 0:
            print("Current Trading Day: {}".format(day))

    return beta_HF

The dispersion() function generates the correct output. However, I understand that I am looping over long iterables, which is not very efficient. I am looking for a more efficient way to "slice" the dataframe at each 30-minute window of each day in the sample and compute the covariances. Effectively, for each slice, I need to compute 101 numbers (100 covariances + 1 variance). On my local machine (a 2013 Retina i5 MacBook Pro) it takes around 8 minutes to compute everything. I tested it on a research server at my university and the computing time was essentially the same, which suggests that computing power is not the bottleneck; the quality of my code in this part is. I would appreciate any ideas on how to make this faster.

One might point out that parallelization is the way to go here, since the elements of beta_HF never interact with each other, so the problem seems embarrassingly parallel. However, I have never implemented anything with parallelization, so I am very new to these concepts. Any ideas on how to make the code run faster? Thanks a lot!
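Just to illustrate what I mean by independent elements, here is the rough, untested kind of thing I have in mind (betas_for_day is a hypothetical helper that simply reuses the inner loop of dispersion(); I have no idea whether this is the right way to do it):

from multiprocessing import Pool

def betas_for_day(daily_data):
    # hypothetical helper: one day's (n_stocks, n_windows) beta matrix,
    # reusing the inner loop of dispersion() above
    out = np.zeros((daily_data.shape[1] - 1, len(moments) - 1))
    for j, end_time in enumerate(moments[1:]):
        moment_data = daily_data.between_time(dt.time(9, 30), end_time)
        cov = np.array([moment_data['SPY'].cov(moment_data[s]) for s in daily_data])
        out[:, j] = cov[1:] / cov[0]
    return out

if __name__ == "__main__":  # guard required for multiprocessing on macOS/Windows
    chunks = [returns[returns.index.date == day] for day in trading_days]
    with Pool() as pool:
        beta_HF = np.stack(pool.map(betas_for_day, chunks), axis=1)  # (stock, day, window)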

1 Answer

You can use pandas Grouper to group your data by frequency. The only drawbacks are that you cannot have overlapping windows and that it will iterate over times where no data exists.

The first issue basically means that the window will slide from 9:30-9:59 to 10:00-10:29 instead of from 9:30-10:00 to 10:00-10:30 (see the illustration after the example data below).

The second issue comes into play during holidays and overnight, when no trading takes place. Hence, if you have a large period without trading, you might want to split the DataFrame into chunks and combine the results afterwards.
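A minimal sketch of such a split, assuming you want to start a new chunk wherever consecutive timestamps are more than a day apart (split_on_gaps is my own helper name, not a pandas function):

import pandas as pd

def split_on_gaps(df, max_gap="1D"):
    # a new chunk starts wherever consecutive timestamps are more than
    # max_gap apart; cumsum() turns the flags into a running chunk id
    new_chunk = df.index.to_series().diff() > pd.Timedelta(max_gap)
    return [chunk for _, chunk in df.groupby(new_chunk.cumsum())]

You would then run the groupby/apply step below on each chunk and concatenate the results.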

Create example data

import pandas as pd
import numpy as np

time = pd.date_range(start="2014-01-02 09:30",
                     end="2014-01-02 16:00", freq="min")
time = time.append(
    pd.date_range(start="2014-01-03 09:30",
                  end="2014-01-03 16:00", freq="min")
)
df = pd.DataFrame(data=np.random.rand(time.shape[0], 4) - 0.5,
                  index=time, columns=['SPY', 'AAPL', 'AMZN', 'T'])
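As a quick illustration of the left-closed bins mentioned above (my addition, not part of the original recipe), you can inspect the boundaries of the first group:

for key, grp in df.groupby(pd.Grouper(freq="30min")):
    print(key, "->", grp.index.min(), "to", grp.index.max())
    break  # only the first bin: 09:30 -> 09:30 to 09:59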

Define the range you want to use

freq = '30min'
obs_per_day = len(pd.date_range(start="9:30", end="16:00", freq=freq))  # 14 bins: 9:30, 10:00, ..., 16:00
trading_days = len(pd.unique(df.index.date))

Make a function to calculate the beta values

def beta(df):
    if df.empty:  # return NaN when no trading takes place
        return np.nan
    mat = df.to_numpy()  # numpy is faster than pandas here
    m = mat.mean(axis=0)
    mat = mat - m[np.newaxis, :]  # demean

    dof = mat.shape[0] - 1  # degrees of freedom
    if dof != 0:  # check that the data has more than one observation
        mat = mat.T.dot(mat[:, 0]) / dof  # covariance of every column with the first (SPY)
        return mat[1:] / mat[0]  # betas: Cov(stock, SPY) / Var(SPY)
    else:
        return np.zeros(mat.shape[1] - 1)  # zeros for too-short bins, e.g. the lone 16:00 row
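For instance, applied to the first 30-minute window of one day of the example data (my illustration):

first_window = df.loc["2014-01-02"].between_time("09:30", "09:59")
print(beta(first_window))  # three betas: AAPL, AMZN, T against SPY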

And in the end, use df.groupby() with a pd.Grouper and apply():

res = df.groupby(pd.Grouper(freq=freq)).apply(beta)
res = np.array([k for k in res.values if ~np.isnan(k).any()])  # remove NaN bins (no trading)
res = res.reshape([trading_days, obs_per_day, df.shape[1] - 1])

Note that the result is in a slightly different shape than yours (see the transpose note after the snippet below). The results also differ a bit because of the different window sliding. To check whether the results are the same, simply try something like this:

trading_days = pd.unique(df.index.date)

# Your result
moments1 = pd.date_range(start="9:30", end="10:00", freq="30min").time
beta(df[df.index.date == trading_days[0]].between_time(moments1[0], moments1[1]))

# Mine
moments2 = pd.date_range(start="9:30", end="10:00", freq="29min").time
beta(df[df.index.date == trading_days[0]].between_time(moments2[0], moments2[1]))
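If you need the exact beta_HF[s, i, j] layout from the question, a transpose of res should do it (my addition, not part of the original code):

# res has shape (days, windows, stocks); reorder the axes to
# (stocks, days, windows) to match beta_HF[s, i, j]
beta_HF = res.transpose(2, 0, 1)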