## Quantifying Multicollinearity

Multicollinearity is one of the more serious problems that can arise in a regression analysis, or even in a simple frequency and mean analysis with a causal interpretation. Regression analysis assumes a degree of independence between the explanatory factors. In practice, however, many of the explanatory variables are correlated with each other. Imagine you are trying to understand which factors are leading causes of lung cancer. There could be a host of socioeconomic and demographic factors that are correlated with behaviors such as smoking. It may be that people who smoke are more likely to 1) work in factories with carcinogens, 2) have poorer diets, 3) live closer to toxic waste dumps or under power lines. How can one separate out the important causal risk factors for the development of lung cancer when all of these factors move together? This inseparability is the essence of what econometricians call the multicollinearity problem.

Fortunately, we don't have to guess or speculate about the severity of this problem. One can use a regression analysis to quantify it. The statistic is called the Variance Inflation Factor (VIF), and it is computed from the R-squared of the regression of the suspected multicollinear variable on the other explanatory variables: VIF = 1 / (1 - R-squared). A VIF of 1 means the variable is uncorrelated with the other explanatory variables, and the VIF grows without bound as that R-squared approaches 1. Here is the formal mathematical syntax for the VIF in the context of a causal regression between smoking and lung cancer. This is...
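As a rough illustration of the VIF calculation, the sketch below regresses each explanatory variable on the others and converts the resulting R-squared into 1 / (1 - R-squared). The data are synthetic and the variable names (`smoking`, `factory_work`, `diet`) and the helper `vif` are hypothetical, chosen only to echo the lung cancer example; this is not the author's own code or dataset.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)    # ordinary least squares
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

# Synthetic data (an assumption for illustration, not real survey data).
rng = np.random.default_rng(0)
n = 500
smoking = rng.normal(size=n)
factory_work = 0.9 * smoking + 0.3 * rng.normal(size=n)  # moves with smoking
diet = rng.normal(size=n)                                # independent of both

X = np.column_stack([smoking, factory_work, diet])
vifs = [vif(X, j) for j in range(X.shape[1])]

for name, value in zip(["smoking", "factory_work", "diet"], vifs):
    print(f"{name}: VIF = {value:.2f}")
```

In this setup the two variables that move together should show a much larger VIF than the independent one, whose VIF should sit close to 1.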
