Collinearity

MP223 - Applied Econometrics Methods for the Social Sciences

Eduard Bukin

R setup

library(tidyverse)       # for data wrangling
library(alr4)            # for the data sets
library(GGally)          # for scatter-plot matrices (ggpairs)
library(ggpmisc)         # for plot annotations
library(parameters)      # for model parameter tables
library(performance)     # for model diagnostics
library(see)             # for plotting model diagnostics
library(car)             # for vif()
library(broom)           # for tidying model output
library(modelsummary)    # for regression tables
library(texreg)          # for screenreg()
library(report)          # for report_table()

ggplot2::theme_set(ggplot2::theme_bw())

knitr::opts_chunk$set(
  fig.width = 12,
  fig.asp = 0.618,
  fig.retina = 3,
  dpi = 300,
  out.width = "100%", 
  message = FALSE,
  echo = TRUE, 
  cache = TRUE
)

my_gof <- function(fit_obj, digits = 4) {
  sum_fit <- summary(fit_obj)
  
  stars <- 
    pf(sum_fit$fstatistic[1],
       sum_fit$fstatistic[2], 
       sum_fit$fstatistic[3],
       lower.tail=FALSE) %>% 
    symnum(corr = FALSE, na = FALSE, 
           cutpoints = c(0,  .001,.01,.05,  1),
           symbols   =  c("***","**","*"," ")) %>% 
    as.character()
  
  list(
    # `R^2` = sum_fit$r.squared %>% round(digits),
    # `Adj. R^2` = sum_fit$adj.r.squared %>% round(digits),
    # `Num. obs.` = sum_fit$residuals %>% length(),
    `Num. df` = sum_fit$df[[2]],
    `F statistic` = 
      str_c(sum_fit$fstatistic[1] %>% round(digits), " ", stars)
  )
}

# Function for screening many regressors
screen_many_regs <-
  function(fit_obj_list, ..., digits = 4, single.row = TRUE) {
    
    if (inherits(fit_obj_list, "lm")) 
      fit_obj_list <- list(fit_obj_list)
    
    if (length(rlang::dots_list(...)) > 0)  
      fit_obj_list <- fit_obj_list %>% append(rlang::dots_list(...))
    
    # browser()
    fit_obj_list %>%
      screenreg(
        custom.note =
          map2_chr(., seq_along(.), ~ {
            str_c("Model ", .y, " ", as.character(.x$call)[[2]])
          }) %>%
          c("*** p < 0.001; ** p < 0.01; * p < 0.05", .) %>%
          str_c(collapse = "\n") ,
        digits = digits,
        single.row = single.row,
        custom.gof.rows =
          map(., ~my_gof(.x, digits)) %>%
          transpose() %>%
          map(unlist),
        reorder.gof = c(3, 4, 5, 1, 2)
      )
  }

Assumption 3. No Perfect Collinearity (Some variation in X Variables)

Collinearity or Multicollinearity

  • No collinearity means

    • none of the regressors can be written as an exact linear combination of the other regressors in the model.
  • For example:

    • in \(Y = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3\) ,
    • where \(X_3 = X_2 + X_1\) ,
    • all \(X\) are collinear.
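
A minimal sketch of what R does in this situation, using made-up data (the names x1, x2, x3 are illustrative): lm() detects the exact dependency and returns NA for the aliased coefficient.

Code
set.seed(1)
dd <- tibble(x1 = runif(50), x2 = runif(50)) %>%
    mutate(x3 = x1 + x2,                         # exact linear combination
           y  = 1 + 2 * x1 - x2 + rnorm(50))
coef(lm(y ~ x1 + x2 + x3, data = dd))            # x3 is NA (aliased)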

Consequences of collinearity:

  • inflated standard errors and unstable estimates for the collinear variables;

  • individually insignificant coefficients despite a significant joint F test.

Detection of collinearity:

  • Scatter plot; Correlation matrix;

  • Model specification;

  • Step-wise regression approach;

  • Variance Inflation Factor;

Solution to collinearity:

  • Re-specify the model;

  • Choose different regressors;

  • See also: Weisberg (2005); Wooldridge (2020).

Collinearity examples

Example 0.1

\[\hat{output} = \hat\beta_1 land + \hat\beta_2 seeds + \hat\beta_3 fertilizers + \hat\beta_4 others + \hat\beta_5 total\]

  • where \(total = seeds + fertilizers + others\)

  • Important

    Is there a multicollinearity problem here?

  • Danger

    YES! Definitely!

  • Tip

    Coefficient \(\hat\beta_5\) is aliased and won't be estimated (R reports NA, as in the sketch above).

Example 0.2

\[\hat{output} = \hat\beta_1 land + \hat\beta_2 seeds + \hat\beta_3 fertilizers + \hat\beta_4 others\]

  • where \(seeds\) and \(fertilizers\) are highly correlated with each other,

  • VIF of \(seeds\) and \(fertilizers\) is 12.2

  • our key variable is \(land\).

  • Important

    Is there a multicollinearity problem here?

  • Danger

    Not really!

  • Because \(fertilizers\) is a control variable, and removing it could cause omitted variable bias (OVB)!

  • If we really want to reduce the VIF (see the sketch below):

    • disaggregate fertilizers, for example into mineral and organic; or
    • aggregate fertilizers and seeds into a single input.
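
A hedged sketch with simulated data (all names and numbers here are made up): aggregating the two nearly collinear inputs brings the VIFs back down.

Code
set.seed(3)
dd <- tibble(land   = runif(200, 1, 10),
             seeds  = runif(200, 1, 5),
             others = runif(200, 1, 5)) %>%
    mutate(fertilizers = seeds + rnorm(200, sd = 0.3),  # nearly collinear with seeds
           output = land + seeds + fertilizers + others + rnorm(200))
car::vif(lm(output ~ land + seeds + fertilizers + others, dd))    # seeds, fertilizers inflated
car::vif(lm(output ~ land + I(seeds + fertilizers) + others, dd)) # VIFs near 1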

Example 0.3

Same model, but in logs:

\[log(\hat{output}) = \hat\beta_1 log(land) + \hat\beta_2 log(seeds) + \hat\beta_3 log(fertilizers) + \\ \hat\beta_4 log(others) + \hat\beta_5 log(total) \]

where \(total = seeds + fertilizers + others\)

  • Important

    Is there a multicollinearity problem here?

  • Think!

  • \(\log(a) + \log(b) = \log(a \cdot b)\), but \(\log(seeds + fertilizers + others)\) cannot be written as a linear combination of \(\log(seeds)\), \(\log(fertilizers)\), and \(\log(others)\).

  • Danger

    Not really: the logged regressors may be strongly correlated, but they are not perfectly collinear (see the sketch below).
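
A quick check on made-up data: with \(total = seeds + fertilizers + others\), the logged regression estimates all five coefficients (no NA), because the dependency among the raw inputs is not linear in logs.

Code
set.seed(4)
dd <- tibble(land        = runif(100, 1, 10),
             seeds       = runif(100, 1, 5),
             fertilizers = runif(100, 1, 5),
             others      = runif(100, 1, 5)) %>%
    mutate(total  = seeds + fertilizers + others,
           output = land * seeds * rexp(100))   # any positive outcome will do
coef(lm(log(output) ~ log(land) + log(seeds) + log(fertilizers) +
            log(others) + log(total), data = dd))  # no aliased coefficient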

Example 0.4

Same model but with a quadratic term:

\[\hat{output} = \hat\beta_1 land + \hat\beta_2 land^2 + \hat\beta_3 seeds + \hat\beta_4 fertilizers + \hat\beta_5 others\]

  • Important

    Is there a multicollinearity problem here?

  • Think!

  • Danger

    Not really

  • \(land^2\) is not a linear combination of \(land\);

  • a linear combination involves \(land + land\), not \(land \times land\) (see the sketch below);
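
A minimal sketch on made-up data: both \(land\) and \(land^2\) are estimable, since the squared term is not an exact linear function of the level.

Code
set.seed(2)
dd <- tibble(land = runif(100, 1, 10)) %>%
    mutate(output = 5 + 2 * land - 0.1 * land^2 + rnorm(100))
coef(lm(output ~ land + I(land^2), data = dd))  # both terms estimated, no NA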

Collinearity example 1:

Collinearity detection by checking the model specification

Perfect collinearity with dummy variable (1)

  • We want to build a naive regression, where the wage is a function of sex (female and male):

  • \(\text{wage} = \beta_0 + \beta_1 \cdot \text{female} + \beta_2 \cdot \text{male}\)

  • The data is fictional:

Code
library(tidyverse)
n <- 14
set.seed(122)
dta <- 
    tibble(female = as.integer(round(runif(n), 0))) %>% 
    mutate(male = as.integer(1 - female),
           wage = 10 - 3 * male + runif(n, -3, 3))
glimpse(dta)
Rows: 14
Columns: 3
$ female <int> 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1
$ male   <int> 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0
$ wage   <dbl> 10.847522, 7.167989, 4.941890, 7.477957, 9.391538, 8.087289, 9.…

Perfect collinearity with dummy variable (2)


===============================================
             Model 1           Model 2         
-----------------------------------------------
(Intercept)   8.75 (0.47) ***   6.09 (0.63) ***
male         -2.65 (0.78) **                   
female                          2.65 (0.78) ** 
-----------------------------------------------
R^2           0.49              0.49           
Adj. R^2      0.45              0.45           
Num. obs.    14                14              
Num. df      12                12              
F statistic  11.45 **          11.45 **        
===============================================
*** p < 0.001; ** p < 0.01; * p < 0.05
Model 1 wage ~ male
Model 2 wage ~ female

Perfect collinearity with dummy variable (3)


=================================================================
             Model 1           Model 2           Model 3         
-----------------------------------------------------------------
(Intercept)   8.75 (0.47) ***   6.09 (0.63) ***   6.09 (0.63) ***
male         -2.65 (0.78) **                                     
female                          2.65 (0.78) **    2.65 (0.78) ** 
-----------------------------------------------------------------
R^2           0.49              0.49              0.49           
Adj. R^2      0.45              0.45              0.45           
Num. obs.    14                14                14              
Num. df      12                12                12              
F statistic  11.45 **          11.45 **          11.45 **        
=================================================================
*** p < 0.001; ** p < 0.01; * p < 0.05
Model 1 wage ~ male
Model 2 wage ~ female
Model 3 wage ~ female + male

Perfect collinearity with dummy variable (4)


====================================================================================
             Model 1           Model 2           Model 3           Model 4          
------------------------------------------------------------------------------------
(Intercept)   8.75 (0.47) ***   6.09 (0.63) ***   6.09 (0.63) ***                   
male         -2.65 (0.78) **                                         6.09 (0.63) ***
female                          2.65 (0.78) **    2.65 (0.78) **     8.75 (0.47) ***
------------------------------------------------------------------------------------
R^2           0.49              0.49              0.49               0.97           
Adj. R^2      0.45              0.45              0.45               0.97           
Num. obs.    14                14                14                 14              
Num. df      12                12                12                 12              
F statistic  11.45 **          11.45 **          11.45 **          221.44 ***       
====================================================================================
*** p < 0.001; ** p < 0.01; * p < 0.05
Model 1 wage ~ male
Model 2 wage ~ female
Model 3 wage ~ female + male
Model 4 wage ~ 0 + female + male
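
Note what the tables show: in Model 3, \(female + male = 1\) reproduces the intercept column exactly, so R drops the aliased male dummy and the fit is identical to Model 2. Model 4 removes the intercept (wage ~ 0 + female + male), so both dummies become estimable and their coefficients are simply the group means of wage (8.75 for females, 6.09 for males); this is also why the R^2 of the no-intercept model is not comparable to the others.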

Near collinearity example 2:

An example of water consumption in a region as a function of population, year and annual precipitation.

  • Near collinearity occurs when variables are highly, but not perfectly, correlated.

Problem

\[\text{log(muniUse)} = \hat \beta_0 + \hat \beta_1 \cdot \text{year} + \hat \beta_2 \cdot \text{muniPrecip} + \\ \hat \beta_3 \cdot \text{log(muniPop)} + \hat u\]

  • log(muniUse) - total water consumption, in logarithm;

  • muniPrecip - precipitation level in March-September, when irrigation is needed;

  • log(muniPop) - total municipal population, in logarithm;

  • year - the calendar year (a linear time trend);

  • What could be the ex-ante expectations about the coefficients?
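
One plausible set of ex-ante expectations, consistent with the estimates that follow: \(\hat\beta_2 < 0\) (more precipitation means less irrigation demand), \(\hat\beta_3 > 0\) (more people consume more water), and \(\hat\beta_1 > 0\) (an upward time trend in consumption).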

Data preparation and description

Code
precip_dta <- 
    alr4::MinnWater %>%
    as_tibble() %>% 
    mutate(`log(muniPop)` = log(muniPop),
           `log(muniUse)` = log(muniUse)) %>% 
    select(year, muniPrecip, `log(muniPop)`, `log(muniUse)`) 
glimpse(precip_dta, n = 20)
Rows: 24
Columns: 4
$ year           <int> 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2…
$ muniPrecip     <dbl> 21.0, 25.6, 14.5, 15.1, 18.8, 17.4, 22.6, 22.8, 16.4, 2…
$ `log(muniPop)` <dbl> 15.02581, 15.01623, 15.00598, 14.99355, 14.98164, 14.96…
$ `log(muniUse)` <dbl> 4.834693, 4.833898, 4.925077, 4.936630, 4.985659, 4.972…
Code
report::report_table(precip_dta) %>% as_tibble()
# A tibble: 4 × 11
  Variable     n_Obs    Mean     SD Median   MAD    Min    Max Skewness Kurtosis
  <chr>        <int>   <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>    <dbl>    <dbl>
1 year            24 2000.   7.07   2.00e3 8.90  1.99e3 2.01e3    0       -1.2  
2 muniPrecip      24   19.9  4.40   1.95e1 4.74  1.23e1 2.89e1    0.254   -0.593
3 log(muniPop)    24   14.9  0.0879 1.49e1 0.103 1.47e1 1.50e1   -0.253   -1.10 
4 log(muniUse)    24    4.81 0.108  4.81e0 0.118 4.61e0 4.99e0   -0.208   -0.756
# … with 1 more variable: n_Missing <int>

Visual detection: scatter plots and correlation
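
The scatter-plot matrix itself is not reproduced here; a minimal way to generate it with the GGally package loaded above:

Code
# Pairwise scatter plots and correlations; year and log(muniPop)
# should show a near-perfect correlation.
GGally::ggpairs(precip_dta)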

Detection: Step-wise regression approach (1)

Code
fit3.1 <- lm(`log(muniUse)` ~ muniPrecip , precip_dta)
fit3.2 <- lm(`log(muniUse)` ~ muniPrecip + year, precip_dta)
fit3.3 <- lm(`log(muniUse)` ~ muniPrecip + `log(muniPop)`, precip_dta)
fit3.4 <- lm(`log(muniUse)` ~ muniPrecip + year  + `log(muniPop)`, precip_dta)
screen_many_regs(fit3.1, single.row = T, digits = 2)

=============================
             Model 1         
-----------------------------
(Intercept)   5.00 (0.10) ***
muniPrecip   -0.01 (0.00)    
-----------------------------
R^2           0.15           
Adj. R^2      0.11           
Num. obs.    24              
Num. df      22              
F statistic   3.84           
=============================
*** p < 0.001; ** p < 0.01; * p < 0.05
Model 1 `log(muniUse)` ~ muniPrecip

Detection: Step-wise regression approach (2)

Code
screen_many_regs(fit3.1, fit3.2, single.row = T, digits = 2)

================================================
             Model 1           Model 2          
------------------------------------------------
(Intercept)   5.00 (0.10) ***  -20.16 (2.73) ***
muniPrecip   -0.01 (0.00)       -0.01 (0.00) ***
year                             0.01 (0.00) ***
------------------------------------------------
R^2           0.15               0.83           
Adj. R^2      0.11               0.82           
Num. obs.    24                 24              
Num. df      22                 21              
F statistic   3.84              51.83 ***       
================================================
*** p < 0.001; ** p < 0.01; * p < 0.05
Model 1 `log(muniUse)` ~ muniPrecip
Model 2 `log(muniUse)` ~ muniPrecip + year

Detection: Step-wise regression approach (3)

Code
screen_many_regs(fit3.1, fit3.2, fit3.3, single.row = T, digits = 2)

======================================================================
                Model 1           Model 2            Model 3          
----------------------------------------------------------------------
(Intercept)      5.00 (0.10) ***  -20.16 (2.73) ***  -10.25 (1.55) ***
muniPrecip      -0.01 (0.00)       -0.01 (0.00) ***   -0.01 (0.00) ***
year                                0.01 (0.00) ***                   
`log(muniPop)`                                         1.03 (0.10) ***
----------------------------------------------------------------------
R^2              0.15               0.83               0.85           
Adj. R^2         0.11               0.82               0.83           
Num. obs.       24                 24                 24              
Num. df         22                 21                 21              
F statistic      3.84              51.83 ***          58.53 ***       
======================================================================
*** p < 0.001; ** p < 0.01; * p < 0.05
Model 1 `log(muniUse)` ~ muniPrecip
Model 2 `log(muniUse)` ~ muniPrecip + year
Model 3 `log(muniUse)` ~ muniPrecip + `log(muniPop)`

Detection: Step-wise regression approach (4)

Code
screen_many_regs(fit3.1, fit3.2, fit3.3, fit3.4, 
                 single.row = T, digits = 2)

=========================================================================================
                Model 1           Model 2            Model 3            Model 4          
-----------------------------------------------------------------------------------------
(Intercept)      5.00 (0.10) ***  -20.16 (2.73) ***  -10.25 (1.55) ***  -1.28 (11.51)    
muniPrecip      -0.01 (0.00)       -0.01 (0.00) ***   -0.01 (0.00) ***  -0.01  (0.00) ***
year                                0.01 (0.00) ***                     -0.01  (0.01)    
`log(muniPop)`                                         1.03 (0.10) ***   1.92  (1.14)    
-----------------------------------------------------------------------------------------
R^2              0.15               0.83               0.85              0.85            
Adj. R^2         0.11               0.82               0.83              0.83            
Num. obs.       24                 24                 24                24               
Num. df         22                 21                 21                20               
F statistic      3.84              51.83 ***          58.53 ***         38.52 ***        
=========================================================================================
*** p < 0.001; ** p < 0.01; * p < 0.05
Model 1 `log(muniUse)` ~ muniPrecip
Model 2 `log(muniUse)` ~ muniPrecip + year
Model 3 `log(muniUse)` ~ muniPrecip + `log(muniPop)`
Model 4 `log(muniUse)` ~ muniPrecip + year + `log(muniPop)`

Detection: Step-wise regression approach (5)

  • Each collinear variable individually contributes substantially to the R-squared, but including them jointly yields little additional improvement;

  • Individually, collinear variables are highly significant, but when included jointly they become weakly significant or insignificant.

Detection: Variance Inflation Factor (1)

  • The Variance Inflation Factor (VIF) is a simple measure of the harm produced by collinearity:

  • The square root of the VIF indicates how much the confidence interval for \(\beta\) is expanded relative to similar uncorrelated data

    • (assuming that such data might exist, for example, in a designed experiment).
  • If VIF > 4 (a conservative threshold) or VIF > 10 (a common rule of thumb), the variable may be collinear with another variable.
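
A hedged sketch of the definition, \(VIF_j = 1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing regressor \(j\) on all the other regressors (the helper vif_by_hand is made up for illustration):

Code
vif_by_hand <- function(data, regressors) {
  purrr::map_dbl(purrr::set_names(regressors), function(x) {
    others <- setdiff(regressors, x)
    # backticks protect non-syntactic names such as `log(muniPop)`
    fml <- as.formula(paste0("`", x, "` ~ ",
                             paste0("`", others, "`", collapse = " + ")))
    r2 <- summary(lm(fml, data = data))$r.squared
    1 / (1 - r2)
  })
}
# e.g. vif_by_hand(precip_dta, c("muniPrecip", "year", "log(muniPop)"))
# should match car::vif(fit3.4) used below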

Detection: Variance Inflation Factor (2)

  1. Compute VIF for regression

  2. See where VIF exceeds 4 (or where the square root of VIF exceeds 2).

  3. Explore correlation between regressors, revise the model.

  4. Discuss correlations in the data.

  5. Explain why collinear variables are kept, if that is the case.

Detection: Variance Inflation Factor (3)

coef(fit3.4)
   (Intercept)     muniPrecip           year `log(muniPop)` 
   -1.27839362    -0.01055911    -0.01113242     1.91735477 
library(car) #?vif
vif(fit3.4)
    muniPrecip           year `log(muniPop)` 
      1.032013     116.969782     117.095745 
  • Collinearity is present.

  • It is between year and log(muniPop).

  • Given that log(muniPop) captures both the annual linear trend and other variation in population growth, we keep log(muniPop) and drop year.

Final revised model without collinearity

screen_many_regs(fit3.3)

=====================================
                Model 1              
-------------------------------------
(Intercept)     -10.2544 (1.5526) ***
muniPrecip       -0.0103 (0.0021) ***
`log(muniPop)`    1.0251 (0.1043) ***
-------------------------------------
R^2               0.8479             
Adj. R^2          0.8334             
Num. obs.        24                  
Num. df          21                  
F statistic      58.5313 ***         
=====================================
*** p < 0.001; ** p < 0.01; * p < 0.05
Model 1 `log(muniUse)` ~ muniPrecip + `log(muniPop)`

Interpretation:

  • This is a log-level and log-log model. We must interpret it accordingly.

  • An increase in precipitation by one unit of rainfall causes a \(0.0103 \cdot 100 = 1.03\) % (percent) decrease in water consumption, holding all other factors fixed (because with the log-level transformation \(\%\Delta y = 100 \beta \Delta x\)).

  • An increase in population by 1 (one) % (percent) causes a \(1.025\) % (percent) increase in water consumption, holding all other factors fixed (because with the log-log transformation \(\%\Delta y = \beta \%\Delta x\)).

Collinearity example 3:

Collinearity detection by checking the model specification

Perfect collinearity (1)

  • Explain the voting outcome in the election for party “A”

    • (variable voteA, the % of votes received by party “A” out of all votes)
  • as a function of:

    • the % of campaign expenditure by party “A” (variable shareA) and
    • the % of expenditure by party “B” (shareB).
  • \(\hat{voteA} = \hat\beta_0 + \hat\beta_1 shareA + \hat\beta_2 shareB\)

  • where \(shareB = 100 - shareA\) (the shares are measured in percent)


\[\hat{voteA} = \hat\beta_0 + \hat\beta_1 shareA + \hat\beta_2 (100 - shareA)\]
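
Substituting \(shareB = 100 - shareA\) shows that only two parameters are identified:

\[\hat{voteA} = (\hat\beta_0 + 100\,\hat\beta_2) + (\hat\beta_1 - \hat\beta_2)\, shareA\]

so R must drop one of the perfectly collinear regressors, as the output below confirms.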

Perfect collinearity (2)

Code
woolvote <- wooldridge::vote1 %>% 
    as_tibble() %>% 
    mutate(shareB = 100 - shareA, 
           democB = 1-democA) %>% 
    select(voteA, democA, democB,
           shareA, shareB, expendA, expendB)
glimpse(woolvote, n = 20)
Rows: 173
Columns: 7
$ voteA   <int> 68, 62, 73, 69, 75, 69, 59, 71, 76, 73, 68, 71, 52, 79, 50, 64…
$ democA  <int> 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1,…
$ democB  <dbl> 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,…
$ shareA  <dbl> 97.40767, 60.88104, 97.01476, 92.40370, 72.61247, 96.38355, 78…
$ shareB  <dbl> 2.592331, 39.118961, 2.985237, 7.596298, 27.387527, 3.616447, …
$ expendA <dbl> 328.296, 626.377, 99.607, 319.690, 159.221, 570.155, 696.748, …
$ expendB <dbl> 8.737, 402.477, 3.065, 26.281, 60.054, 21.393, 193.915, 7.695,…
Code
report::report_table(woolvote) %>% as_tibble()
# A tibble: 7 × 11
  Variable n_Obs    Mean      SD Median   MAD     Min    Max Skewness Kurtosis
  <chr>    <int>   <dbl>   <dbl>  <dbl> <dbl>   <dbl>  <dbl>    <dbl>    <dbl>
1 voteA      173  50.5    16.8     50    22.2 16        84   -0.0579     -1.28
2 democA     173   0.555   0.498    1     0    0         1   -0.223      -1.97
3 democB     173   0.445   0.498    0     0    0         1    0.223      -1.97
4 shareA     173  51.1    33.5     50.8  47.9  0.0946   99.5 -0.00206    -1.46
5 shareB     173  48.9    33.5     49.2  47.9  0.505    99.9  0.00206    -1.46
6 expendA    173 311.    281.     243.  280.   0.302  1471.   1.34        2.49
7 expendB    173 305.    306.     222.  268.   0.930  1548.   1.39        1.96
# … with 1 more variable: percentage_Missing <dbl>

Perfect collinearity (3)

fit_vote_1 <- lm(voteA ~ shareA + shareB, data = woolvote)
screen_many_regs(fit_vote_1, single.row = T, digits = 2)

===============================
             Model 1           
-------------------------------
(Intercept)    26.81 (0.89) ***
shareA          0.46 (0.01) ***
-------------------------------
R^2             0.86           
Adj. R^2        0.86           
Num. obs.     173              
Num. df       171              
F statistic  1017.66 ***       
===============================
*** p < 0.001; ** p < 0.01; * p < 0.05
Model 1 voteA ~ shareA + shareB
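
To see which regressor was dropped and the exact dependence behind it, the base-R alias() function can be applied to the fitted model (a diagnostic sketch, not part of the original output):

Code
# Reports shareB as a linear combination of the intercept and shareA:
alias(fit_vote_1)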

References

Weisberg, Sanford. 2005. Applied Linear Regression. John Wiley & Sons, Inc. https://doi.org/10.1002/0471704091.
Wooldridge, Jeffrey M. 2020. Introductory Econometrics: A Modern Approach. South-Western. https://www.cengage.uk/shop/isbn/9781337558860.