Regression Analysis is a mathematical technique for estimating the value of parameters in a linear equation for the creation of extremely useful mathematical models that quantify the relationship between variables. The term regression comes from Sir Fransis Galton for the term previously described by the great mathematician Fredrik Gauss and Legendre as “least square”. Legenre and Gauss both applied regression analysis to the observation of planetary movements to make inferences and predictions about their future positions in the sky.
Regression analysis is one of the most important mathematical tools ever discovered. The applications to linear regression include finding the connection between cigarette smoking and lung cancer, quantifying systemic risk of a stock investment, determining the impact of breast feeding on psychological health and the impacts of different home improvements on the potential resale value of a home. Regression has even been used to forecast the expected increase in crime as a result of an increase in the price of heroin. These analyses can better help society plan for the future, determine the toxicity of products, and help individuals generate income through real estate.
Regression analysis is also misused and abused in data analysis so there must be a series of hypothesis testing conducting after a regression has been calculated. Hypothesis testing is a key feature in determining the statistical significance of the independent variable parameters. The topic of hypothesis testing and model fitting will be a discussion for another post, but keep in mind that there may be various assumptions and values involved in linear regression which must be carefully examined before coming to conclusions about quantitative relationship.
The parameters used in linear regression for one independent variable are estimated with the following equations:
The following is a proof showing that these equations are the maximum likelihood estimates for the slope (alpha) and the coefficient (beta) of the linear equation estimate in the two dimensional case:
The assumptions about the error term in the Classical Linear Regression Model are,
We take the likelihood, which is the product of the probability distribution of the model we are estimated, and then take the natural logarithm of that equation. Finally we take partial derivatives with respect to alpha and beta and solve the system of two equation and two unknowns to get the maximum likelihood estimates for alpha and beta,
These estimates for alpha and beta are the ones that are used in statistics textbooks when dealing with the estimation of the parameters of a single-variable regression.
Imagine that you are a real estate investor and you have purchased a home to remodel and then resell on the market. Past experience tells you that two of the most important factors that determine the price of a home is the number of bathrooms and restrooms it has. You want to know if you should add a bathroom or a bedroom to the 3 bedroom 2 bathroom house in order to get the highest return on your investment. There are other factors such as location, quality of schools, uniformity within the neighborhood, and lot size which are also key indicators of value. In order to conduct a simple multivariate regression analysis we will not consider these variables, but they very well could be included in the model if they create a better estimate of value; for the sake of brevity these other variables will be omitted-the ordinary least square calculation would be exactly the same if they were included.
Your ask your real estate broker to send you a list of homes within a 10 mile radius that have sold in the last couple of months-this helps control for the other factors mentioned above which might increased the variance of our estimate. The broker hands you over a list that has information on the selling price, number of restroom, and number of bedrooms for the last couple of months which you will used to run your analysis to answer the investment question. The question is, should you add an additional bedroom or bathroom to your newly acquired home if you want maximize the return on investment for the construction of either addition to the property.
(Prices) ( Restrooms) ( Bedrooms)
We want to design a model of the form: $latex y\left(\beta_i\right)= \beta_0 +\beta_1\left(x_1\right)+\beta_2\left(x_2\right)$. The betas are the objective of our calculations beta0 is the intercept of the regression line which has no real economic interpretation, beta1 demonstrates the change in home price from adding an additional restroom keeping the number of bedrooms constant, and beta 2 is the change in a homes price from adding an additional bedroom while keeping the number of restrooms constant.
In order to calculate the $latex \beta_i’s$ by hand you will need to know some linear/matrix algebra. Namely taking inverses, transposes and matrix multiplication will be part of the handwritten calculations for our model. Here is the linear algebra notation that will generate the least square estimates of $latex \beta_i’s$:
Where “A” is the matrix of restrooms and bedroom columns and the vector “b” is the price vector. This calculation can be tedious, even for the small amount of data in this problem so a more convenient way of calculating the beta coefficients is detailed in the Excel section.
One way to organize your data on Excel to ease the multivariate regression analysis is to format your data in the following way;
(Prices) ( Restrooms) ( Bedrooms)
Highlight a 3×5 area of cells on Excel then type the following function, =LINEST(known y’s, known x’s, const, stats). The “known y’s” is your home prices and you can input this series by highlighting the column with the home prices in the range. The “known x’s” can be highlighted as a block and these series can be placed into the formula. For the const term just leave a space , “, ,”and for the stats parameter type in TRUE.
The ouputput you should get if this is done correctly is;
The betas are on the first row; $latex \beta_0=84927.98$, $latex \beta_1=9542.18$, and $latex \beta_2=78009.25$. We can now plug the betas into our original model:
The other numbers in the output matrix are important to conduct hypothesis testing and assess the fit of the model, but this will be left out of the analysis until a later post.
The model says that adding a bedroom to a house will increase its value, on average, $78009.25. The model also says that if you add an additional restroom to the house you will increase its value by $9542.18.
If the cost of adding an additional bedroom is $60,000, then the return to investment would be ($78009.25-$60,000)/$60,000=30%, but if the cost of adding an additional restroom was $6,500, then the return on investment would be ($9542.18-$4,500)/$6,500=46%. If the objective is to maximize return on money spent on upgrading the property then the obvious choice would be to add an additional restroom.