Get fitted coefficient of linear regression equation
I have a dataset with predicted and observed data. The equation that predicts the data is: y = Af * T * sqrt(g * h)
with Af = constant (currently 1.35), T = wave period, g = gravitational acceleration (9.81 m/s²), and h = wave height.
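For illustration, evaluating the equation in Python looks like this (the T and h values below are hypothetical placeholders, not my actual data):

import numpy as np

Af = 1.35                        # literature value
g = 9.81                         # gravitational acceleration
T = np.array([5.0, 6.2, 7.1])    # wave periods (placeholder values)
h = np.array([1.2, 1.5, 2.0])    # wave heights (placeholder values)
y_pred = Af * T * np.sqrt(g * h)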
I'd like to use linear regression to find the best-fitted coefficient (Af in the equation), so that the predicted values are closer to the observed data.
Using Af = 1.35 (a value suggested in the literature) results in r^2 = 0.5676. Ideally, I'd use Python to find the best-fitted coefficient for my data:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X = np.array([11.52, 11.559, 12.31, 16.46, 11.84, 7.38, 9.99, 16.72,
              11.617, 11.77, 6.48, 9.035, 12.87, 11.18, 6.75])
y = np.array([25.51658407, 24.61306145, 19.4007494, 24.85111923, 25.99397106,
              14.30284824, 17.69451713, 27.37460301, 22.23326366, 18.44905152,
              10.28001306, 10.68681843, 28.85399089, 14.02840557, 18.41941787]).reshape((-1, 1))

# note: this overwrites the arrays above with the iris demo data
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0).fit(X, y)
print(clf.coef_, clf.intercept_)
Here X = the observed/measured values in the field and y = the values predicted by the equation. I'm having difficulty incorporating the actual equation and finding the best fit for Af.
Using the data you provided, but taking the log, you can fit an ordinary least squares model from scikit-learn. Since the equation has the form y = Af * X, taking logs gives log(y) = log(Af) + log(X), so the intercept of a linear fit of log(y) on log(X) estimates log(Af):
import numpy as np
import pandas as pd
import pymc3 as pm
import theano.tensor as tt
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as smf

X = np.log(np.array([11.52, 11.559, 12.31, 16.46, 11.84, 7.38, 9.99, 16.72,
                     11.617, 11.77, 6.48, 9.035, 12.87, 11.18, 6.75]))
y = np.log(np.array([25.51658407, 24.61306145, 19.4007494, 24.85111923, 25.99397106,
                     14.30284824, 17.69451713, 27.37460301, 22.23326366, 18.44905152,
                     10.28001306, 10.68681843, 28.85399089, 14.02840557, 18.41941787]))

reg = LinearRegression().fit(X.reshape(-1, 1), y.reshape(-1, 1))
print('Af estimate: ', np.exp(reg.intercept_))
This gives the following estimate of Af:
Af estimate: [2.4844087]
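To check how much the fitted coefficient improves on Af = 1.35, you can compare r^2 on the original (untransformed) scale. Here is a sketch reusing the arrays and reg from above; note the fitted log-log model also has a free slope, so its back-transformed prediction is Af * X**slope:

from sklearn.metrics import r2_score

X_orig, y_orig = np.exp(X), np.exp(y)   # back-transform the log arrays

slope = reg.coef_[0][0]
Af_hat = np.exp(reg.intercept_[0])
y_fit = Af_hat * X_orig ** slope        # prediction of the fitted log-log model

print('r2 with Af = 1.35 (slope fixed at 1): ', r2_score(y_orig, 1.35 * X_orig))
print('r2 with the fitted model: ', r2_score(y_orig, y_fit))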
Since you don't seem to be interested in predicting new data with the model, but rather in the best estimate of the linear model's parameters, you could use statsmodels:
results = smf.ols('y ~ X + 1', data=pd.DataFrame({'y': y, 'X': X})).fit()
# the Af estimate is exp of the intercept; results.params['X'] would be the slope, not log(Af)
print('Statsmodels Af estimate: ', np.exp(results.params['Intercept']))
which gives 2.4844, the same value as before, as expected, since both fit the same ordinary least squares model. The r^2 of this fit (results.rsquared) is similar to the one you quote.
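If you also want an uncertainty estimate from statsmodels, the confidence interval of the intercept translates directly into one for Af. This is a small add-on to the fit above, not part of the original suggestion:

# 95% confidence interval for Af, via the intercept of the log-log fit
ci = results.conf_int().loc['Intercept']   # columns 0 (lower) and 1 (upper)
print('Af 95% CI: ', np.exp(ci[0]), '-', np.exp(ci[1]))
print('r2 of the log-log fit: ', results.rsquared)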
Lastly, my suggestion would be to use pymc3 and get a full Bayesian fit, which lets you naturally estimate the uncertainty of the quantity you want to measure. There is undoubtedly a bit of a learning curve to pymc3, but it is a great package for probabilistic programming. It allows you to estimate the full posterior of your parameter space, which is what most people are really interested in when fitting a model. An implementation for your problem could be the following:
with pm.Model() as model:
    # Priors
    alpha = pm.Normal('alpha', mu=1.35, sd=5)     # centered around the literature value
    beta = pm.HalfNormal('beta', sd=10)           # only positive values, as it goes into the sqrt; is height always positive here?
    sigma = pm.HalfNormal('sigma', sd=1)
    beta2 = pm.Deterministic('beta2', tt.sqrt(beta * 9.81))   # g is very well known
    alpha_f = pm.Deterministic('alpha_f', tt.exp(alpha))      # directly track the output value we want

    # Likelihood
    likelihood = pm.Normal('y', mu=alpha + beta2 * X, sigma=sigma, observed=y)

    # Sampling
    trace = pm.sample(init='adapt_diag')

print(pm.summary(trace))
This gives the following output:
          mean     sd  hpd_3%  hpd_97%  ...  ess_sd  ess_bulk  ess_tail  r_hat
alpha    0.781  0.544  -0.232    1.864  ...   309.0     440.0     406.0   1.01
beta     0.091  0.044   0.013    0.167  ...   517.0     438.0     359.0   1.01
sigma    0.259  0.056   0.172    0.368  ...   530.0     479.0     147.0   1.00
beta2    0.917  0.229   0.439    1.316  ...   434.0     438.0     359.0   1.01
alpha_f  2.535  1.552   0.465    5.224  ...   317.0     440.0     406.0   1.01
and you can see that there is a lot of uncertainty in Af: the 3%-97% HPD interval for alpha_f spans roughly 0.47 to 5.2.
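If you want more than the summary table, the posterior samples of Af are available directly from the trace; for example, using numpy on the samples, with pm.plot_posterior as the built-in visualization:

af_samples = trace['alpha_f']
print('posterior median Af: ', np.median(af_samples))
print('3%-97% interval: ', np.percentile(af_samples, [3, 97]))
pm.plot_posterior(trace, var_names=['alpha_f'])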
However, it is important to consider the data that goes in, too, and not to overinterpret the results. At the moment you don't provide any uncertainty in either y or X, nor their covariance matrix. It is very unlikely that you know these quantities perfectly, so the rational thing to do is to include these uncertainties in your modelling. pymc3 allows this to be implemented naturally. My implementation above includes an estimate of uncertainty from the data itself, but you may well have your own uncertainty from the measurement device. See also this question and this paper.
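As a rough sketch of how a known measurement uncertainty on X could enter the model (x_err below is a hypothetical placeholder for your instrument's standard deviation; the priors mirror the model above):

with pm.Model() as model_err:
    # same priors as above
    alpha = pm.Normal('alpha', mu=1.35, sd=5)
    beta = pm.HalfNormal('beta', sd=10)
    sigma = pm.HalfNormal('sigma', sd=1)
    beta2 = pm.Deterministic('beta2', tt.sqrt(beta * 9.81))
    alpha_f = pm.Deterministic('alpha_f', tt.exp(alpha))

    # latent 'true' X, observed with known measurement error x_err (hypothetical value)
    x_err = 0.05
    x_true = pm.Normal('x_true', mu=X.mean(), sd=10, shape=len(X))
    x_obs = pm.Normal('x_obs', mu=x_true, sd=x_err, observed=X)

    # the regression now uses the latent x_true instead of the noisy observations
    likelihood = pm.Normal('y', mu=alpha + beta2 * x_true, sigma=sigma, observed=y)
    trace_err = pm.sample(init='adapt_diag')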