How to slice on DateTime objects more efficiently and compute a given statistic at each iteration?
I am dealing with a pandas `DataFrame` where the index is a DateTime object and the columns represent minute-by-minute returns on several stocks from the SP500 index, together with a column of returns from the index itself. It is fairly long (100 stocks, 1510 trading days, minute-by-minute data each day) and looks like this (only three stocks for the sake of example):
```
DateTime          SPY     AAPL    AMZN    T
2014-01-02 9:30   0.032   -0.01   0.164   0.007
2014-01-02 9:31   -0.012  0.02    0.001   -0.004
2014-01-02 9:32   -0.015  0.031   0.004   -0.001
```
I am trying to compute the betas of each stock for each day and for each 30-minute window. The beta of a stock in this case is defined as the covariance between its returns and the SPY returns, divided by the variance of SPY over the same period. My desired output is a 3-dimensional numpy array `beta_HF`, where `beta_HF[s, i, j]` means the beta of stock `s` on day `i` in window `j`. At the moment, I am computing the betas in the following way (let `returns` be the full dataframe):
```python
trading_days = pd.unique(returns.index.date)
window = "30min"
moments = pd.date_range(start="9:30", end="16:00", freq=window).time

def dispersion(trading_days, moments, df, verbose=True):
    index = 'SPY'
    beta_HF = np.zeros((df.shape[1] - 1, len(trading_days), len(moments) - 1))
    for i, day in enumerate(trading_days):
        daily_data = df[df.index.date == day]
        start_time = dt.time(9, 30)
        for j, end_time in enumerate(moments[1:]):
            moment_data = daily_data.between_time(start_time, end_time)
            covariances = np.array([moment_data[index].cov(moment_data[symbol]) for symbol in df])
            beta_HF[:, i, j] = covariances[1:] / covariances[0]
        if verbose:
            if np.remainder(i, 100) == 0:
                print("Current Trading Day: {}".format(day))
    return beta_HF
```
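As a sanity check on the definition itself, the beta for a single window reduces to one covariance divided by one variance. A minimal sketch with hypothetical numbers (the series values here are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical one-window sample: five minute-returns for SPY and one stock
spy   = pd.Series([0.032, -0.012, -0.015, 0.020, 0.001])
stock = pd.Series([-0.010, 0.020, 0.031, -0.004, 0.002])

# Beta as defined above: cov(stock, SPY) / var(SPY)
beta = spy.cov(stock) / spy.var()

# Cross-check against the full 2x2 covariance matrix (same ddof=1 convention)
cov_mat = np.cov(np.vstack([spy, stock]))
assert np.isclose(beta, cov_mat[0, 1] / cov_mat[0, 0])
```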
The `dispersion()` function generates the correct output. However, I understand that I am looping over long iterables, which is not very efficient. I am looking for a more efficient way to "slice" the dataframe at each 30-minute window of each day in the sample and compute the covariances. Effectively, for each slice, I need to compute 101 numbers (100 covariances + 1 variance). On my local machine (a 2013 Retina i5 MacBook Pro) it takes around 8 minutes to compute everything. I tested it on a research server at my university and the computing time was basically the same, which probably implies that computing power is not the bottleneck but that this part of my code is of low quality. I would appreciate any ideas on how to make this faster.
One might point out that parallelization is the way to go here, since the elements of `beta_HF` never interact with each other, so the problem seems easy to parallelize. However, I have never implemented anything with parallelization, so I am very new to these concepts. Any ideas on how to make the code run faster? Thanks a lot!
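For what it's worth, the day-level loop can be handed to an executor with only light restructuring. This is a hedged sketch, not the question's original method: `betas_for_day` and `dispersion_parallel` are names invented here, and it uses threads for simplicity; for true multi-core speedups on CPU-bound work you would typically swap in `concurrent.futures.ProcessPoolExecutor` or `joblib` instead.

```python
import datetime as dt
import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def betas_for_day(daily_data, moments):
    """Beta matrix (n_stocks, n_windows) for one day; 'SPY' must be the first column."""
    start = dt.time(9, 30)
    out = np.zeros((daily_data.shape[1] - 1, len(moments) - 1))
    for j, end in enumerate(moments[1:]):
        window = daily_data.between_time(start, end)
        cov = np.array([window['SPY'].cov(window[s]) for s in daily_data])
        out[:, j] = cov[1:] / cov[0]  # cov[0] is var(SPY)
    return out

def dispersion_parallel(df, trading_days, moments, workers=4):
    per_day = [df[df.index.date == day] for day in trading_days]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        results = list(ex.map(betas_for_day, per_day, [moments] * len(per_day)))
    return np.stack(results, axis=1)  # shape (stocks, days, windows), like beta_HF
```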
You can use pandas `Grouper` to group your data by frequency. The only drawbacks are that you cannot have overlapping windows and that it will iterate over times at which no data exists.
The first issue basically means that the window will slide from 9:30–9:59 to 10:00–10:29 instead of from 9:30–10:00 to 10:00–10:30.
The second issue comes into play during holidays and overnight, when no trading takes place. Hence, if you have a large period without trading, you might want to split the `DataFrame` and combine the pieces afterwards.
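To see both drawbacks concretely, here is a small illustration (with made-up data) of how `pd.Grouper` anchors 30-minute bins to the clock and emits empty bins across the overnight gap:

```python
import numpy as np
import pandas as pd

# Two short trading days, minute data, with an overnight gap in between
idx = pd.date_range("2014-01-02 09:30", "2014-01-02 10:29", freq="min").append(
    pd.date_range("2014-01-03 09:30", "2014-01-03 10:29", freq="min"))
df = pd.DataFrame({"SPY": np.random.rand(len(idx))}, index=idx)

sizes = df.groupby(pd.Grouper(freq="30min")).size()
# Bins are anchored to the clock: 09:30-09:59 falls into the 09:30 bin, etc.
# The overnight gap produces a long run of empty (size 0) bins.
```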
Create example data
```python
import pandas as pd
import numpy as np

time = pd.date_range(start="2014-01-02 09:30", end="2014-01-02 16:00", freq="min")
time = time.append(
    pd.date_range(start="2014-01-03 09:30", end="2014-01-03 16:00", freq="min")
)
df = pd.DataFrame(data=np.random.rand(time.shape[0], 4) - 0.5,
                  index=time,
                  columns=['SPY', 'AAPL', 'AMZN', 'T'])
```
define the range you want to use
```python
freq = '30min'
obs_per_day = len(pd.date_range(start="9:30", end="16:00", freq="30min"))
trading_days = len(pd.unique(df.index.date))
```
make a function to calculate the beta values
```python
def beta(df):
    if df.empty:                           # return NaN when no trading takes place
        return np.nan
    mat = df.to_numpy()                    # numpy is faster than pandas
    m = mat.mean(axis=0)
    mat = mat - m[np.newaxis, :]           # demean
    dof = mat.shape[0] - 1                 # degrees of freedom
    if dof != 0:                           # check that the data has more than one observation
        mat = mat.T.dot(mat[:, 0]) / dof   # covariance with the first column (SPY)
        return mat[1:] / mat[0]            # beta
    else:
        return np.zeros(mat.shape[1] - 1)  # return zeros for too-short groups, e.g. 16:00
```
and in the end use `df.groupby().apply()`
```python
res = df.groupby(pd.Grouper(freq=freq)).apply(beta)
res = np.array([k for k in res.values if ~np.isnan(k).any()])  # remove NaN
res = res.reshape([trading_days, obs_per_day, df.shape[1] - 1])
```
Note that the result is in a slightly different shape than yours.
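If you prefer the question's `beta_HF[s, i, j]` axis order, a transpose converts between the two layouts. A small sketch with a toy array standing in for `res`:

```python
import numpy as np

# Toy stand-in for res with shape (days, windows, stocks)
res = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)

# Move stocks to the front: (stocks, days, windows), matching beta_HF[s, i, j]
beta_HF = res.transpose(2, 0, 1)
assert beta_HF.shape == (4, 2, 3)
assert beta_HF[1, 0, 2] == res[0, 2, 1]
```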
The results also differ a bit because of the different window sliding. To check whether the results are the same, simply try something like this:
```python
trading_days = pd.unique(df.index.date)

# Your result
moments1 = pd.date_range(start="9:30", end="10:00", freq="30min").time
beta(df[df.index.date == trading_days[0]].between_time(moments1[0], moments1[1]))

# mine
moments2 = pd.date_range(start="9:30", end="10:00", freq="29min").time
beta(df[df.index.date == trading_days[0]].between_time(moments2[0], moments2[1]))
```