
MP223 - Applied Econometrics Methods for the Social Sciences

Eduard Bukin

R setup

library(tidyverse)       # for data wrangling
library(alr4)            # for the data sets


# Global knitr chunk options for figures and code output
knitr::opts_chunk$set(
  fig.width = 12,
  fig.asp = 0.618,
  fig.retina = 3,
  dpi = 300,
  out.width = "100%", 
  message = FALSE,
  echo = TRUE, 
  cache = TRUE
)

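# Extra goodness-of-fit rows (residual df and the overall F statistic with stars)
my_gof <- function(fit_obj, digits = 4) {
  sum_fit <- summary(fit_obj)
  # Assumed reconstruction: p-value of the overall F test, turned into stars
  stars <- 
    pf(sum_fit$fstatistic[1],
       sum_fit$fstatistic[2], sum_fit$fstatistic[3],
       lower.tail=FALSE) %>% 
    symnum(corr = FALSE, na = FALSE, 
           cutpoints = c(0,  .001,.01,.05,  1),
           symbols   =  c("***","**","*"," ")) %>% 
    as.character()
  list(
    # `R^2` = sum_fit$r.squared %>% round(digits),
    # `Adj. R^2` = sum_fit$adj.r.squared %>% round(digits),
    # `Num. obs.` = sum_fit$residuals %>% length(),
    `Num. df` = sum_fit$df[[2]],
    `F statistic` = 
      str_c(sum_fit$fstatistic[1] %>% round(digits), " ", stars)
  )
}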

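# Function for screening many regressors
# Accepts one lm object, a list of lm objects, or several lm objects via `...`
screen_many_regs <-
  function(fit_obj_list, ..., digits = 4, single.row = TRUE) {
    # inherits() is safe for objects with more than one class (e.g. glm)
    if (inherits(fit_obj_list, "lm")) 
      fit_obj_list <- list(fit_obj_list)
    if (length(rlang::dots_list(...)) > 0)  
      fit_obj_list <- fit_obj_list %>% append(rlang::dots_list(...))
    # Assumed reconstruction: the table is printed with texreg::screenreg()
    fit_obj_list %>%
      texreg::screenreg(
        # Footnote: significance legend plus the formula of every model
        custom.note =
          map2_chr(., seq_along(.), ~ {
            str_c("Model ", .y, " ", as.character(.x$call)[[2]])
          }) %>%
          c("*** p < 0.001; ** p < 0.01; * p < 0.05", .) %>%
          str_c(collapse = "\n") ,
        digits = digits,
        single.row = single.row,
        custom.gof.rows =
          map(., ~my_gof(.x, digits)) %>%
          transpose() %>%
          map(unlist),  # assumed step: one vector per goodness-of-fit row
        reorder.gof = c(3, 4, 5, 1, 2)
      )
  }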

Assumption 3. No Perfect Collinearity (Some variation in X Variables)

Collinearity or Multicollinearity

  • No collinearity means

    • none of the regressors can be written as an exact linear combination of the other regressors in the model.
  • For example:

    • in \(Y = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3\) ,
    • where \(X_3 = X_2 + X_1\) ,
    • all \(X\) are collinear.
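A minimal sketch of the example above, on simulated data (the variables and the seed are hypothetical, chosen only for illustration). It checks the rank of the regressor matrix: with \(X_3 = X_2 + X_1\) the three columns span only two dimensions.

```r
# Hypothetical simulated regressors; X3 is an exact linear combination of X1 and X2
set.seed(1)
n  <- 50
X1 <- rnorm(n)
X2 <- rnorm(n)
X3 <- X2 + X1

# Regressor matrix without an intercept, as in the equation above
X <- cbind(X1, X2, X3)

# Numerical rank of the matrix versus the number of columns
rank_X <- qr(X)$rank
rank_X    # 2
ncol(X)   # 3
```

A rank below the number of columns means that at least one coefficient cannot be estimated separately from the others.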

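Consequence of collinearity:

  • with perfect collinearity, the coefficients of the collinear variables cannot be estimated separately;

  • with near collinearity, the estimates remain unbiased but become imprecise and unstable: standard errors are inflated and individual coefficients may look insignificant;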

Detection of collinearity:

  • Scatter plot; Correlation matrix;

  • Model specification;

  • Step-wise regression approach;

  • Variance Inflation Factor;

Solution to collinearity:

  • Re-specify the model;

  • Choose different regressors;

  • See also:

Collinearity examples

Example 0.1

\[\hat{output} = \hat\beta_1 land + \hat\beta_2 seeds + \hat\beta_3 fertilizers + \hat\beta_4 others + \hat\beta_5 total\]

  • where \(total = seeds + fertilizers + others\)

  • Important

    Is there a multicollinearity problem here?

  • Danger

    YES! Definitely!

  • Tip

    Coefficient \(\hat\beta_5\) is aliased and won't be estimated
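A minimal sketch of Example 0.1 on simulated data. The data frame `farm`, the value ranges and the "true" coefficients are hypothetical and serve only to show how `lm()` reports an aliased coefficient.

```r
# Hypothetical farm data; `total` is the exact sum of three other regressors
set.seed(42)
n <- 100
farm <- data.frame(
  land        = runif(n, 1, 50),
  seeds       = runif(n, 10, 100),
  fertilizers = runif(n, 5, 80),
  others      = runif(n, 1, 30)
)
farm$total  <- farm$seeds + farm$fertilizers + farm$others
farm$output <- 2 * farm$land + 0.5 * farm$seeds + 0.8 * farm$fertilizers +
  0.3 * farm$others + rnorm(n)

# lm() detects the exact linear dependence and drops the last collinear term
fit_01 <- lm(output ~ land + seeds + fertilizers + others + total, data = farm)
coef(fit_01)   # the coefficient on `total` is NA
```

The `NA` is R's way of saying that the coefficient is aliased: the other four slopes are still estimated.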

Example 0.2

\[\hat{output} = \hat\beta_1 land + \hat\beta_2 seeds + \hat\beta_3 fertilizers + \hat\beta_4 others\]

  • where \(seeds\) and \(fertilizers\) are highly correlated with each other,

  • VIF of \(seeds\) and \(fertilizers\) is 12.2

  • our key variable is \(land\).

  • Important

    Is there a multicollinearity problem here?

  • Danger

    Not really!

  • Because \(fertilizers\) is a control variable and we may introduce omitted variable bias (OVB) if we remove it!

  • If we really want to reduce VIF…:

    • Disaggregate fertilizers into mineral and organic, for example.
    • Aggregate fertilizers and seeds

Example 0.3

Same model but in log:

\[\begin{aligned} log(\hat{output}) = {} & \hat\beta_1 log(land) + \hat\beta_2 log(seeds) + \hat\beta_3 log(fertilizers) \\ & + \hat\beta_4 log(others) + \hat\beta_5 log(total) \end{aligned}\]

where \(total = seeds + fertilizers + others\)

  • Important

    Is there a multicollinearity problem here?

  • Think!

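  • \(log(a) + log(b) = log(a * b)\), but \(log(a + b) \ne log(a) + log(b)\), so \(log(total)\) is not an exact linear combination of the other logged regressors.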

  • Danger

    Not really
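A minimal sketch of Example 0.3 on simulated data (all variables, ranges and coefficients are hypothetical). It shows that `lm()` estimates every coefficient once the regressors enter in logs, because the log of a sum is not a linear function of the individual logs.

```r
# Hypothetical inputs; `total` is still the exact sum of three inputs
set.seed(42)
n <- 100
land        <- runif(n, 1, 50)
seeds       <- runif(n, 10, 100)
fertilizers <- runif(n, 5, 80)
others      <- runif(n, 1, 30)
total       <- seeds + fertilizers + others
output <- exp(0.4 * log(land) + 0.2 * log(seeds) + 0.2 * log(fertilizers) +
                0.1 * log(others) + rnorm(n, sd = 0.1))

# In logs the exact linear dependence disappears
fit_03 <- lm(log(output) ~ log(land) + log(seeds) + log(fertilizers) +
               log(others) + log(total))
coef(fit_03)   # no NA: all coefficients are estimated
```

Perfect collinearity is gone, but `log(total)` may still be strongly correlated with the other logged inputs, so checking the VIF remains a good idea.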

Example 0.4

Same model but with a quadratic term:

\[\hat{output} = \hat\beta_1 land + \hat\beta_2 land^2 + \hat\beta_3 seeds + \hat\beta_4 fertilizers + \hat\beta_5 others\]

  • Important

    Is there a multicollinearity problem here?

  • Think!

  • Danger

    Not really

  • \(land^2\) is not a linear combination of \(land\) ;

  • A linear combination is a weighted sum such as \(land + land\), not a product such as \(land \times land\) ;
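A minimal sketch of Example 0.4 with one regressor and its square, on simulated data (variable ranges and coefficients are hypothetical). Both terms are estimated, although they are highly correlated.

```r
# Hypothetical data with a concave effect of land on output
set.seed(7)
n <- 100
land   <- runif(n, 1, 50)
output <- 3 * land - 0.03 * land^2 + rnorm(n)

# I() protects the square inside the formula
fit_04 <- lm(output ~ land + I(land^2))
coef(fit_04)            # intercept, land and land^2 are all estimated

# The correlation is high but strictly below 1
cor_land <- cor(land, land^2)
cor_land
```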

Collinearity example 1:

Collinearity detection by checking the model specification

Perfect collinearity with dummy variable (1)

  • We want to build a naive regression, where the wage is a function of sex (female and male):

  • \(\text{wage} = \beta_0 + \beta_1 \cdot \text{female} + \beta_2 \cdot \text{male}\)

  • The data is fictional:

# Simulate 14 people: a random female dummy and its exact complement, male
n <- 14
dta <- 
    tibble(female = as.integer(round(runif(n), 0))) %>% 
    mutate(male = as.integer(1 - female),
           wage = 10 - 3 * male + runif(n, -3, 3))
glimpse(dta)
Rows: 14
Columns: 3
$ female <int> 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1
$ male   <int> 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0
$ wage   <dbl> 10.847522, 7.167989, 4.941890, 7.477957, 9.391538, 8.087289, 9.…

Perfect collinearity with dummy variable (2)

             Model 1           Model 2         
(Intercept)   8.75 (0.47) ***   6.09 (0.63) ***
male         -2.65 (0.78) **                   
female                          2.65 (0.78) ** 
R^2           0.49              0.49           
Adj. R^2      0.45              0.45           
Num. obs.    14                14              
Num. df      12                12              
F statistic  11.45 **          11.45 **        
*** p < 0.001; ** p < 0.01; * p < 0.05
Model 1 wage ~ male
Model 2 wage ~ female

Perfect collinearity with dummy variable (3)

             Model 1           Model 2           Model 3         
(Intercept)   8.75 (0.47) ***   6.09 (0.63) ***   6.09 (0.63) ***
male         -2.65 (0.78) **                                     
female                          2.65 (0.78) **    2.65 (0.78) ** 
R^2           0.49              0.49              0.49           
Adj. R^2      0.45              0.45              0.45           
Num. obs.    14                14                14              
Num. df      12                12                12              
F statistic  11.45 **          11.45 **          11.45 **        
*** p < 0.001; ** p < 0.01; * p < 0.05
Model 1 wage ~ male
Model 2 wage ~ female
Model 3 wage ~ female + male

Perfect collinearity with dummy variable (4)

             Model 1           Model 2           Model 3           Model 4          
(Intercept)   8.75 (0.47) ***   6.09 (0.63) ***   6.09 (0.63) ***                   
male         -2.65 (0.78) **                                         6.09 (0.63) ***
female                          2.65 (0.78) **    2.65 (0.78) **     8.75 (0.47) ***
R^2           0.49              0.49              0.49               0.97           
Adj. R^2      0.45              0.45              0.45               0.97           
Num. obs.    14                14                14                 14              
Num. df      12                12                12                 12              
F statistic  11.45 **          11.45 **          11.45 **          221.44 ***       
*** p < 0.001; ** p < 0.01; * p < 0.05
Model 1 wage ~ male
Model 2 wage ~ female
Model 3 wage ~ female + male
Model 4 wage ~ 0 + female + male
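A sketch of how the four models in the table could be fitted, using base R and an arbitrary seed, so the numbers will differ from the table. The object names `fit1` to `fit4` are assumptions. Model 3 is the dummy variable trap: `female + male` equals the intercept column, so one dummy is dropped.

```r
# Simulated data in the spirit of the slide (arbitrary seed, different draws)
set.seed(123)
n <- 14
female <- as.integer(round(runif(n)))
male   <- 1L - female
wage   <- 10 - 3 * male + runif(n, -3, 3)
dta    <- data.frame(female, male, wage)

fit1 <- lm(wage ~ male, dta)               # base group: female
fit2 <- lm(wage ~ female, dta)             # base group: male
fit3 <- lm(wage ~ female + male, dta)      # dummy variable trap
fit4 <- lm(wage ~ 0 + female + male, dta)  # no intercept, both dummies

coef(fit3)  # `male` is NA: it is aliased with the intercept and `female`
coef(fit4)  # each coefficient is the mean wage of its group
```

Model 4 is a valid re-parametrisation, but its R^2 is not comparable with the other three: without an intercept, R computes R^2 relative to zero rather than to the mean of the dependent variable, which explains the jump to 0.97 in the table.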

Near collinearity example 3:

An example of water consumption in a region as a function of population, year and annual precipitation.

  • Near collinearity occurs when variables are highly correlated.


\[\begin{aligned} \text{log(muniUse)} = {} & \hat \beta_0 + \hat \beta_1 \cdot \text{year} + \hat \beta_2 \cdot \text{muniPrecip} \\ & + \hat \beta_3 \cdot \text{log(muniPop)} + \hat u \end{aligned}\]

  • log(muniUse) - total water consumption in logarithm;

  • muniPrecip - precipitation level in March-September, when irrigation is needed;

  • log(muniPop) - total population in logarithm;

  • year - year

  • What could be the ex-ante expectations about the coefficients?

Data preparation and description

precip_dta <- 
    alr4::MinnWater %>%
    as_tibble() %>% 
    mutate(`log(muniPop)` = log(muniPop),
           `log(muniUse)` = log(muniUse)) %>% 
    select(year, muniPrecip, `log(muniPop)`, `log(muniUse)`) 
glimpse(precip_dta, n = 20)
Rows: 24
Columns: 4
$ year           <int> 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2…
$ muniPrecip     <dbl> 21.0, 25.6, 14.5, 15.1, 18.8, 17.4, 22.6, 22.8, 16.4, 2…
$ `log(muniPop)` <dbl> 15.02581, 15.01623, 15.00598, 14.99355, 14.98164, 14.96…
$ `log(muniUse)` <dbl> 4.834693, 4.833898, 4.925077, 4.936630, 4.985659, 4.972…
report::report_table(precip_dta) %>% as_tibble()
# A tibble: 4 × 11
  Variable     n_Obs    Mean     SD Median   MAD    Min    Max Skewness Kurtosis
  <chr>        <int>   <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>    <dbl>    <dbl>
1 year            24 2000.   7.07   2.00e3 8.90  1.99e3 2.01e3    0       -1.2  
2 muniPrecip      24   19.9  4.40   1.95e1 4.74  1.23e1 2.89e1    0.254   -0.593
3 log(muniPop)    24   14.9  0.0879 1.49e1 0.103 1.47e1 1.50e1   -0.253   -1.10 
4 log(muniUse)    24    4.81 0.108  4.81e0 0.118 4.61e0 4.99e0   -0.208   -0.756
# … with 1 more variable: n_Missing <int>

Visual detection: scatter plots and correlation
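A minimal sketch of a visual check, using simulated stand-ins for the three regressors (the data frame `regs` and all numbers are hypothetical, not the Minnesota data). A correlation matrix and a scatter plot matrix reveal which pair of regressors moves almost one-to-one.

```r
# Hypothetical regressors: log population follows an almost linear time trend
set.seed(2024)
year    <- 1988:2011
log_pop <- 14.7 + 0.0125 * (year - 1988) + rnorm(24, sd = 0.005)
precip  <- rnorm(24, mean = 20, sd = 4)
regs    <- data.frame(year, precip, log_pop)

# Pairwise correlations between the regressors
cor_mat <- cor(regs)
round(cor_mat, 2)

# Scatter plot matrix: a near-perfect line signals near collinearity
pairs(regs)
```

With the data from the slides, the same two calls could be applied to the regressor columns of `precip_dta`.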

Detection: Step-wise regression approach (1)

fit3.1 <- lm(`log(muniUse)` ~ muniPrecip , precip_dta)
fit3.2 <- lm(`log(muniUse)` ~ muniPrecip + year, precip_dta)
fit3.3 <- lm(`log(muniUse)` ~ muniPrecip + `log(muniPop)`, precip_dta)
fit3.4 <- lm(`log(muniUse)` ~ muniPrecip + year  + `log(muniPop)`, precip_dta)
screen_many_regs(fit3.1, single.row = T, digits = 2)

             Model 1         
(Intercept)   5.00 (0.10) ***
muniPrecip   -0.01 (0.00)    
R^2           0.15           
Adj. R^2      0.11           
Num. obs.    24              
Num. df      22              
F statistic   3.84           
*** p < 0.001; ** p < 0.01; * p < 0.05
Model 1 `log(muniUse)` ~ muniPrecip

Detection: Step-wise regression approach (2)

screen_many_regs(fit3.1, fit3.2, single.row = T, digits = 2)

             Model 1           Model 2          
(Intercept)   5.00 (0.10) ***  -20.16 (2.73) ***
muniPrecip   -0.01 (0.00)       -0.01 (0.00) ***
year                             0.01 (0.00) ***
R^2           0.15               0.83           
Adj. R^2      0.11               0.82           
Num. obs.    24                 24              
Num. df      22                 21              
F statistic   3.84              51.83 ***       
*** p < 0.001; ** p < 0.01; * p < 0.05
Model 1 `log(muniUse)` ~ muniPrecip
Model 2 `log(muniUse)` ~ muniPrecip + year

Detection: Step-wise regression approach (3)

screen_many_regs(fit3.1, fit3.2, fit3.3, single.row = T, digits = 2)

                Model 1           Model 2            Model 3          
(Intercept)      5.00 (0.10) ***  -20.16 (2.73) ***  -10.25 (1.55) ***
muniPrecip      -0.01 (0.00)       -0.01 (0.00) ***   -0.01 (0.00) ***
year                                0.01 (0.00) ***                   
`log(muniPop)`                                         1.03 (0.10) ***
R^2              0.15               0.83               0.85           
Adj. R^2         0.11               0.82               0.83           
Num. obs.       24                 24                 24              
Num. df         22                 21                 21              
F statistic      3.84              51.83 ***          58.53 ***       
*** p < 0.001; ** p < 0.01; * p < 0.05
Model 1 `log(muniUse)` ~ muniPrecip
Model 2 `log(muniUse)` ~ muniPrecip + year
Model 3 `log(muniUse)` ~ muniPrecip + `log(muniPop)`

Detection: Step-wise regression approach (4)

screen_many_regs(fit3.1, fit3.2, fit3.3, fit3.4, 
                 single.row = T, digits = 2)

                Model 1           Model 2            Model 3            Model 4          
(Intercept)      5.00 (0.10) ***  -20.16 (2.73) ***  -10.25 (1.55) ***  -1.28 (11.51)    
muniPrecip      -0.01 (0.00)       -0.01 (0.00) ***   -0.01 (0.00) ***  -0.01  (0.00) ***
year                                0.01 (0.00) ***                     -0.01  (0.01)    
`log(muniPop)`                                         1.03 (0.10) ***   1.92  (1.14)    
R^2              0.15               0.83               0.85              0.85            
Adj. R^2         0.11               0.82               0.83              0.83            
Num. obs.       24                 24                 24                24               
Num. df         22                 21                 21                20               
F statistic      3.84              51.83 ***          58.53 ***         38.52 ***        
*** p < 0.001; ** p < 0.01; * p < 0.05
Model 1 `log(muniUse)` ~ muniPrecip
Model 2 `log(muniUse)` ~ muniPrecip + year
Model 3 `log(muniUse)` ~ muniPrecip + `log(muniPop)`
Model 4 `log(muniUse)` ~ muniPrecip + year + `log(muniPop)`

Detection: Step-wise regression approach (5)

  • Both collinear variables individually contribute substantially to the R-Squared; but when included jointly, there is no big improvement;

  • Individually, collinear variables are highly significant, but when included jointly, they are weakly significant or not significant.
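The pattern described above can be reproduced in a small simulation (the variables `x1`, `x2`, `y` and the helper `se()` are hypothetical). Two nearly identical regressors each explain `y` well on their own, but the standard error explodes once both enter the model.

```r
# Hypothetical data: x2 is x1 plus a little noise, so the two are near collinear
set.seed(99)
n  <- 24
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)
y  <- 1 + x1 + rnorm(n, sd = 0.5)

fit_a  <- lm(y ~ x1)
fit_b  <- lm(y ~ x2)
fit_ab <- lm(y ~ x1 + x2)

# Helper: standard error of one coefficient
se <- function(fit, term) summary(fit)$coefficients[term, "Std. Error"]

# Standard error of x1 alone versus jointly with x2
se_alone <- se(fit_a, "x1")
se_joint <- se(fit_ab, "x1")
c(alone = se_alone, joint = se_joint)

# R-squared of the three models
r2 <- c(a = summary(fit_a)$r.squared,
        b = summary(fit_b)$r.squared,
        ab = summary(fit_ab)$r.squared)
r2
```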

Detection: Variance Inflation Factor (1)

  • The Variance Inflation Factor (VIF) is a simple measure of the harm produced by collinearity.

  • The square root of the VIF indicates how much the confidence interval for \(\beta\) is expanded relative to similar uncorrelated data

    • (assuming that such data might exist, for example, in a designed experiment).
  • Rule of thumb: if VIF > 4 (stricter threshold) or VIF > 10 (looser threshold), the variable may be collinear with another variable.
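For reference, the standard textbook definition of the VIF for regressor \(X_j\), where \(R_j^2\) is the R-squared from regressing \(X_j\) on all other regressors and \(SST_j\) is the total sum of squares of \(X_j\) (notation introduced here, not in the slides):

\[VIF_j = \frac{1}{1 - R_j^2}, \qquad Var(\hat\beta_j) = \frac{\sigma^2}{SST_j (1 - R_j^2)} = \frac{\sigma^2}{SST_j} \cdot VIF_j\]

So \(R_j^2 = 0.75\) gives \(VIF_j = 4\) (the standard error doubles), and \(R_j^2 = 0.9\) gives \(VIF_j = 10\).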

Detection: Variance Inflation Factor (2)

  1. Compute VIF for regression

  2. See where VIF exceeds 4 (or the square root of VIF exceeds 2).

  3. Explore correlation between regressors, revise the model.

  4. Discuss correlations in the data.

  5. Explain why variables are kept (if that is the case).

Detection: Variance Inflation Factor (3)

coef(fit3.4)
   (Intercept)     muniPrecip           year `log(muniPop)` 
   -1.27839362    -0.01055911    -0.01113242     1.91735477 
library(car) #?vif
vif(fit3.4)
    muniPrecip           year `log(muniPop)` 
      1.032013     116.969782     117.095745 
  • Collinearity is present.

  • It is between year and log(muniPop)

  • Given that log(muniPop) captures an annual linear trend and other variations in the population growth, we should keep log(muniPop) instead of the year.
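To see what a VIF measures, it can be computed by hand with auxiliary regressions: regress each regressor on all the others and take \(1 / (1 - R^2)\). The sketch below uses simulated data that only mimics the structure of the example (the data frame `sim` and the helper `vif_manual()` are hypothetical). For models with numeric regressors like this one, it should agree with `car::vif()`.

```r
# Hypothetical data: log population is almost a linear function of year
set.seed(11)
n       <- 24
year    <- 1988:2011
log_pop <- 14.7 + 0.0125 * (year - 1988) + rnorm(n, sd = 0.008)
precip  <- rnorm(n, mean = 20, sd = 4)
log_use <- -10 + 1.0 * log_pop - 0.01 * precip + rnorm(n, sd = 0.04)
sim     <- data.frame(log_use, precip, year, log_pop)

fit <- lm(log_use ~ precip + year + log_pop, data = sim)

# VIF of each regressor from its auxiliary regression on the other regressors
vif_manual <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]   # drop the intercept column
  sapply(colnames(X), function(j) {
    others <- X[, colnames(X) != j, drop = FALSE]
    r2 <- summary(lm(X[, j] ~ others))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(fit)
vifs
sqrt(vifs)   # factor by which each confidence interval is widened
```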

Final revised model without collinearity


                Model 1              
(Intercept)     -10.2544 (1.5526) ***
muniPrecip       -0.0103 (0.0021) ***
`log(muniPop)`    1.0251 (0.1043) ***
R^2               0.8479             
Adj. R^2          0.8334             
Num. obs.        24                  
Num. df          21                  
F statistic      58.5313 ***         
*** p < 0.001; ** p < 0.01; * p < 0.05
Model 1 `log(muniUse)` ~ muniPrecip + `log(muniPop)`


  • The model is log-level in precipitation and log-log in population. We must interpret the coefficients accordingly.

  • An increase in precipitation by 1 (one) unit (1 mm of rainfall) changes water consumption by \(100 \cdot (-0.0103) = -1.03\) % (percent), that is, a decrease of about 1.03 %, holding all other factors fixed (because with the log-level transformation \(\%\Delta y \approx 100 \beta \Delta x\)).

  • An increase in population by 1 (one) % (percent) increases water consumption by about \(1.025\) % (percent), holding all other factors fixed (because with the log-log transformation \(\%\Delta y \approx \beta \%\Delta x\)).
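The percentage interpretations are approximations that work well for small coefficients. A short sketch comparing them with the exact changes implied by the model, using the rounded coefficients from the table above (the object names are hypothetical):

```r
# Rounded estimates taken from the final model table
b_precip <- -0.0103   # log-level coefficient
b_pop    <-  1.0251   # log-log coefficient

# Log-level: approximate and exact % change in y for a one-unit increase in x
approx_precip <- 100 * b_precip
exact_precip  <- 100 * (exp(b_precip) - 1)

# Log-log: approximate and exact % change in y for a 1 % increase in x
approx_pop <- b_pop
exact_pop  <- 100 * (1.01^b_pop - 1)

round(c(approx_precip = approx_precip, exact_precip = exact_precip,
        approx_pop = approx_pop, exact_pop = exact_pop), 4)
```

Here the approximate and exact values differ only in the third decimal, so the simple reading of the coefficients is adequate.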

Collinearity example 2:

Collinearity detection by checking the model specification

Perfect collinearity (1)

  • Explain the election outcome for party “A”

    • (variable voteA, the % of all votes cast for party “A”)
  • as a function of:

    • the share of party “A” in total campaign expenditure, in % (variable shareA), and
    • the share of party “B” in total campaign expenditure, in % (shareB).
  • \(\hat{voteA} = \hat\beta_0 + \hat\beta_1 shareA + \hat\beta_2 shareB\)

  • where \(shareB = 100 - shareA\), because both shares are measured in percent

\[\hat{voteA} = \hat\beta_0 + \hat\beta_1 shareA + \hat\beta_2 (100 - shareA)\]

Perfect collinearity (2)

woolvote <- wooldridge::vote1 %>% 
    as_tibble() %>% 
    mutate(shareB = 100 - shareA, 
           democB = 1-democA) %>% 
    select(voteA, democA, democB,
           shareA, shareB, expendA, expendB)
glimpse(woolvote, n = 20)
Rows: 173
Columns: 7
$ voteA   <int> 68, 62, 73, 69, 75, 69, 59, 71, 76, 73, 68, 71, 52, 79, 50, 64…
$ democA  <int> 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1,…
$ democB  <dbl> 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,…
$ shareA  <dbl> 97.40767, 60.88104, 97.01476, 92.40370, 72.61247, 96.38355, 78…
$ shareB  <dbl> 2.592331, 39.118961, 2.985237, 7.596298, 27.387527, 3.616447, …
$ expendA <dbl> 328.296, 626.377, 99.607, 319.690, 159.221, 570.155, 696.748, …
$ expendB <dbl> 8.737, 402.477, 3.065, 26.281, 60.054, 21.393, 193.915, 7.695,…
report::report_table(woolvote) %>% as_tibble()
# A tibble: 7 × 11
  Variable n_Obs    Mean      SD Median   MAD     Min    Max Skewness Kurtosis
  <chr>    <int>   <dbl>   <dbl>  <dbl> <dbl>   <dbl>  <dbl>    <dbl>    <dbl>
1 voteA      173  50.5    16.8     50    22.2 16        84   -0.0579     -1.28
2 democA     173   0.555   0.498    1     0    0         1   -0.223      -1.97
3 democB     173   0.445   0.498    0     0    0         1    0.223      -1.97
4 shareA     173  51.1    33.5     50.8  47.9  0.0946   99.5 -0.00206    -1.46
5 shareB     173  48.9    33.5     49.2  47.9  0.505    99.9  0.00206    -1.46
6 expendA    173 311.    281.     243.  280.   0.302  1471.   1.34        2.49
7 expendB    173 305.    306.     222.  268.   0.930  1548.   1.39        1.96
# … with 1 more variable: percentage_Missing <dbl>

Perfect collinearity (3)

fit_vote_1 <- lm(voteA ~ shareA + shareB, data = woolvote)
screen_many_regs(fit_vote_1, single.row = T, digits = 2)

             Model 1           
(Intercept)    26.81 (0.89) ***
shareA          0.46 (0.01) ***
R^2             0.86           
Adj. R^2        0.86           
Num. obs.     173              
Num. df       171              
F statistic  1017.66 ***       
*** p < 0.001; ** p < 0.01; * p < 0.05
Model 1 voteA ~ shareA + shareB
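The table silently omits `shareB`. A minimal sketch of how the dropped term and the exact dependence behind it can be inspected with `coef()` and base R's `alias()`, using simulated data with the same structure (the variables and coefficients below are hypothetical, not the `vote1` data):

```r
# Hypothetical data: the two expenditure shares always add up to 100
set.seed(5)
n      <- 173
shareA <- runif(n, 0, 100)
shareB <- 100 - shareA
voteA  <- 27 + 0.46 * shareA + rnorm(n, sd = 6)

fit_v <- lm(voteA ~ shareA + shareB)
coef(fit_v)              # shareB is NA: it was dropped as perfectly collinear

# alias() reports how the dropped term depends on the retained ones
al <- alias(fit_v)$Complete
al                       # shareB = 100 * (Intercept) - 1 * shareA
```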


Weisberg, Sanford. 2005. Applied Linear Regression. John Wiley & Sons, Inc.
Wooldridge, Jeffrey M. 2020. Introductory Econometrics: A Modern Approach. South-Western.