Simple regression

MP223 - Applied Econometrics Methods for the Social Sciences

Eduard Bukin

R setup

# load packages
library(tidyverse)       # for data wrangling
library(alr4)            # for the data sets
# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw())
# set default figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 8,
  fig.asp = 0.618,
  fig.retina = 3,
  dpi = 300,
  out.width = "80%"
)

Data

Pearson-Lee data

  • The data used were published in (Pearson and Lee 1903).

  • Karl Pearson collected data on over 1100 families in England between 1893 and 1898.

  • Heights of mothers (mheight) and daughters (dheight) were recorded for 1375 observations.

  • We rely on the examples of simple linear regression (SLR) in (Weisberg 2005).

Data loading and preparation

  • Loading data
  • Converting the data frame into a tibble() object
  • Renaming variables
  • Glimpse of data
Code
# Note, we use `sample_n(400)` to randomly select only 400 observations
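# Optional (not in the original slides): call set.seed() first, e.g.
# set.seed(1234), to make the random sample below reproducible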
dta <- 
  alr4::Heights %>% 
  as_tibble() %>% 
  rename(mother_height = mheight,
         daughter_height = dheight) %>% 
  sample_n(400)
glimpse(dta)
Rows: 400
Columns: 2
$ mother_height   <dbl> 58.8, 65.4, 65.5, 63.2, 60.1, 61.2, 60.8, 63.7, 63.8, …
$ daughter_height <dbl> 62.7, 66.2, 62.8, 62.2, 65.1, 64.8, 63.4, 64.3, 62.5, …

Exploring data (Scatter plot)

Code
plt <- 
  dta %>% 
  ggplot() + 
  aes(x = mother_height, y = daughter_height) + 
  geom_point(alpha = 0.5) + 
  theme_minimal()
plt

  • What does alpha = 0.5 stand for?
  • Add axis labels.
  • Change the color of the points (add color = "red" after alpha, separating the arguments with a comma).
Code
dta %>% 
  ggplot() + 
  aes(x = mother_height, y = daughter_height) + 
  geom_point(alpha = 0.5) + 
  labs(x = "___________", 
       y = "___________")

Simple Linear Regression

Simple regression line

\(Y = \color{green}{f(X)} + \text{Error term} = \color{green}{E[Y|X]} + \epsilon\)

Code
fit <- 
  lm(daughter_height ~ mother_height, data = dta)
plt <- 
  plt + 
  labs(x = "Mothers' height", 
       y = "Daughter height") + 
  geom_smooth(method = "lm", 
              color = "green", 
              se = FALSE) + 
  theme_minimal()
plt

Simple regression

\[\Large{Y = \beta_0 + \beta_1 X}\]

  • \(Y\): dependent variable, observed values
  • \(X\): independent variable
  • \(\beta_1\): True slope
  • \(\beta_0\): True intercept
  • Note: in the population regression function, there is no error term!
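
To see the difference, here is a small simulated sketch (hypothetical numbers, not the Heights data): with assumed true values \(\beta_0 = 30\) and \(\beta_1 = 0.5\), the population regression function \(E[Y|X]\) contains no error term, while the observed \(Y\) scatter around it because of \(\epsilon\).

Code
# a simulated sketch with assumed "true" parameters
set.seed(1)
beta_0 <- 30                            # assumed true intercept
beta_1 <- 0.5                           # assumed true slope
x <- runif(100, min = 55, max = 70)     # regressor
epsilon <- rnorm(100, mean = 0, sd = 2) # error term
e_y_given_x <- beta_0 + beta_1 * x      # population regression function: no error term
y <- e_y_given_x + epsilon              # observed values include the error
head(cbind(e_y_given_x, y))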

Estimated simple regression

\[\Large{\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X}\]

  • \(\hat{Y}\): fitted values, predicted values
  • \(\hat{\beta}_1\): Estimated slope
  • \(\hat{\beta}_0\): Estimated intercept
  • No error term!
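
To make this concrete, we can compute the fitted values by hand from the estimated coefficients (a minimal sketch using the fit object estimated above; no error term is involved):

Code
b0 <- coef(fit)[["(Intercept)"]]     # estimated intercept
b1 <- coef(fit)[["mother_height"]]   # estimated slope
y_hat <- b0 + b1 * dta$mother_height # fitted values "by hand"
head(y_hat)
head(fitted(fit))                    # the same values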

Residuals

\(Y = \color{green}{f(X)} + \color{blue}{\text{Error term}} = \color{green}{E[Y|X]} + \color{blue}{\epsilon}\)

Code
plt + 
  geom_segment(
    aes(x = mother_height, xend = mother_height,
        y = daughter_height, yend = predict(fit)), 
    color = "blue",
    alpha = 0.4
  )

Residuals

  • \(\text{residuals} = \text{observed} - \text{predicted} = \epsilon = Y - \hat{Y}\)

  • \({Y = \hat{\beta}_0 + \hat{\beta}_1 X + \epsilon}\)

  • \(\epsilon\) is the error term or residual

  • For each specific observation \(i\):

  • residual \(e_i = y_i - \hat{y}_i\)

  • squared residual \(e_i^2 = (y_i - \hat{y}_i)^2\)
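
As a sketch, the residuals can be computed by hand and compared with what R reports (using the fit object from above):

Code
e <- dta$daughter_height - predict(fit)  # observed minus predicted
head(e)
head(resid(fit))                         # the same values
head(e^2)                                # squared residuals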

Ordinary Least Square (OLS)

Ordinary Least Square (OLS)

  • OLS “finds” values for \(\hat{\beta}_0\) and \(\hat{\beta}_1\);

  • each new pair of values for \(\hat{\beta}_0\) and \(\hat{\beta}_1\) generates a new regression line;

Code
plt + 
  geom_segment(
    aes(x = mother_height, xend = mother_height,
        y = daughter_height, yend = predict(fit)), 
    color = "blue",
    alpha = 0.4
  ) + 
  geom_abline(intercept = 33, slope = 0.49, color = "black") + 
  geom_abline(intercept = 5, slope = 0.95, color = "black") + 
  geom_abline(intercept = 25, slope = 0.62, color = "black")

Ordinary Least Square (OLS)

OLS finds the values of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize the sum of squared residuals:

\[ \Large{ SSR = \sum_{i=1}^{n}{e_i^2} = \sum_{i=1}^{n}{(y_i - \hat{y}_i)^2} = e_1^2 + e_2^2 + \dots + e_n^2 } \]
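
A quick numerical illustration (a sketch; ssr_line() is a small helper written here, and the exact numbers depend on the random sample): the OLS line has a smaller SSR than the three candidate lines drawn above.

Code
sum(resid(fit)^2)   # SSR of the OLS line
# helper to compute the SSR of an arbitrary line
ssr_line <- function(b0, b1) {
  sum((dta$daughter_height - (b0 + b1 * dta$mother_height))^2)
}
ssr_line(33, 0.49)  # all three candidate lines
ssr_line(5, 0.95)   # have a larger SSR
ssr_line(25, 0.62)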

Properties of OLS

  • The regression line goes through the center of all points \((\bar{X}, \bar{Y})\).

  • The sum of the residuals (not squared) is zero: \(\sum_{i=1}^n e_i = 0\)

  • Zero covariance between residuals and regressors: \(Cov(X, \epsilon) = 0\)

  • The predicted value of \(Y\) when all regressors are at their means \(\bar{X}\) is the mean \(\bar{Y}\): \(E[Y|\bar{X}] = \bar{Y}\)
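
These properties can be verified numerically (a sketch using the fit object; the checks hold up to floating-point error):

Code
sum(resid(fit))                     # (1) sum of residuals is zero
cov(dta$mother_height, resid(fit))  # (2) zero covariance with the regressor
# (3) prediction at the mean of X equals the mean of Y
predict(fit, newdata = tibble(mother_height = mean(dta$mother_height)))
mean(dta$daughter_height)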

Interpretation

Regression coefficients

fit <- lm(daughter_height ~ mother_height, data = dta)
fit

Call:
lm(formula = daughter_height ~ mother_height, data = dta)

Coefficients:
  (Intercept)  mother_height  
      29.4551         0.5498  
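
To use these numbers programmatically, coef() returns them as a named vector (a small sketch):

Code
coef(fit)                     # named vector of estimates
coef(fit)[["mother_height"]]  # extract just the slope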

Regression summary (1/3)

summary(fit)

Call:
lm(formula = daughter_height ~ mother_height, data = dta)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.6255 -1.5767 -0.0878  1.3916  9.0083 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   29.45511    2.91727   10.10   <2e-16 ***
mother_height  0.54979    0.04662   11.79   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.337 on 398 degrees of freedom
Multiple R-squared:  0.2589,    Adjusted R-squared:  0.2571 
F-statistic: 139.1 on 1 and 398 DF,  p-value: < 2.2e-16

Regression summary (2/3)

library(broom)
tidy(fit)
# A tibble: 2 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     29.5      2.92        10.1 1.71e-21
2 mother_height    0.550    0.0466      11.8 9.88e-28
glance(fit)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.259         0.257  2.34      139. 9.88e-28     1  -906. 1818. 1830.
# … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Regression summary (3/3)

library(parameters)
parameters(fit)
Parameter     | Coefficient |   SE |         95% CI | t(398) |      p
---------------------------------------------------------------------
(Intercept)   |       29.46 | 2.92 | [23.72, 35.19] |  10.10 | < .001
mother height |        0.55 | 0.05 | [ 0.46,  0.64] |  11.79 | < .001
library(performance)
performance(fit)
# Indices of model performance

AIC      |      BIC |    R2 | R2 (adj.) |  RMSE | Sigma
-------------------------------------------------------
1818.301 | 1830.276 | 0.259 |     0.257 | 2.331 | 2.337

Intercept

  • Important to interpret in the context of the data.

  • The value of \(Y\) when all \(X\) are zero; if \(X = 0\) lies outside the range of the data, the intercept has no substantive interpretation.
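
Here the intercept is only a mathematical anchor: no mother is 0 inches tall, so the prediction at mother_height = 0 has no substantive meaning (a sketch; the value equals the estimated intercept):

Code
predict(fit, newdata = tibble(mother_height = 0))
coef(fit)[["(Intercept)"]]  # the same number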

Slope

The marginal effect: the average change in \(Y\) when \(X\) changes by one unit, keeping all other regressors fixed.

  • When the mother’s height increases by 1 inch, the height of the daughter increases on average by \(\hat{\beta}_1\) inches, keeping other variables constant.
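
Equivalently, the difference between the predictions for two mothers whose heights differ by one inch equals \(\hat{\beta}_1\) (a sketch with hypothetical heights of 63 and 64 inches):

Code
p <- predict(fit, newdata = tibble(mother_height = c(63, 64)))
diff(p)                       # equals the estimated slope
coef(fit)[["mother_height"]]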

Residuals and Fitted values

Residuals

# With `[1:20]` we force R to print only the first 20 values
residuals(fit)[1:20]
         1          2          3          4          5          6          7 
 0.9173961  0.7887997 -2.6661790 -2.0016682  2.6026726  1.6979065  0.5178215 
         8          9         10         11         12         13         14 
-0.1765618 -2.0315406 -1.5917107 -5.4523061 -5.1774125  0.8022473 -1.4757112 
        15         16         17         18         19         20 
-3.4674550  4.0178215 -0.6813279 -3.6462641  1.8927151 -4.2860940 
resid(fit)[1:20]
         1          2          3          4          5          6          7 
 0.9173961  0.7887997 -2.6661790 -2.0016682  2.6026726  1.6979065  0.5178215 
         8          9         10         11         12         13         14 
-0.1765618 -2.0315406 -1.5917107 -5.4523061 -5.1774125  0.8022473 -1.4757112 
        15         16         17         18         19         20 
-3.4674550  4.0178215 -0.6813279 -3.6462641  1.8927151 -4.2860940 

Fitted

# With `[1:20]` we force R to print only the first 20 values
fitted(fit)[1:20]
       1        2        3        4        5        6        7        8 
61.78260 65.41120 65.46618 64.20167 62.49733 63.10209 62.88218 64.47656 
       9       10       11       12       13       14       15       16 
64.53154 64.09171 62.55231 62.27741 61.39775 66.67571 62.16746 62.88218 
      17       18       19       20 
65.08133 65.24626 62.60728 65.68609 

Residuals vs fitted (1/3)

dta_1 <- 
  dta %>% 
  mutate(fitted = fitted(fit)) %>% 
  mutate(residuals = resid(fit))

glimpse(dta_1)
Rows: 400
Columns: 4
$ mother_height   <dbl> 58.8, 65.4, 65.5, 63.2, 60.1, 61.2, 60.8, 63.7, 63.8, …
$ daughter_height <dbl> 62.7, 66.2, 62.8, 62.2, 65.1, 64.8, 63.4, 64.3, 62.5, …
$ fitted          <dbl> 61.78260, 65.41120, 65.46618, 64.20167, 62.49733, 63.1…
$ residuals       <dbl> 0.9173961, 0.7887997, -2.6661790, -2.0016682, 2.602672…

Residuals vs fitted (2/3)

dta_1 %>% 
  ggplot() + 
  aes(x = fitted, y = residuals) + 
  geom_point()
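
A horizontal reference line at zero (an optional addition) makes it easier to see whether the residuals are centered around zero across the range of fitted values:

Code
dta_1 %>% 
  ggplot() + 
  aes(x = fitted, y = residuals) + 
  geom_point() + 
  geom_hline(yintercept = 0, linetype = "dashed", color = "red")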

Residuals vs fitted (3/3)

library(performance)
library(see)
check_model(fit, check = "linearity", panel = FALSE)

Takeaways

Takeaway

  • Simple linear regression
  • OLS
  • Slope and Intercept (interpretation)
  • Fitted values
  • Residuals
  • Residuals vs Fitted
  • fitting a regression: lm()
  • regression summary and diagnostics: summary(), tidy(), glance(), parameters(), performance(), check_model(), fitted(), residuals(), resid()
  • packages: broom, parameters and performance

Homework

Create an R Script out of the R code in the presentation.

References

References

Pearson, Karl, and Alice Lee. 1903. “On the Laws of Inheritance in Man: I. Inheritance of Physical Characters.” Biometrika 2 (4): 357. https://doi.org/10.2307/2331507.
Weisberg, Sanford. 2005. Applied Linear Regression. John Wiley & Sons, Inc. https://doi.org/10.1002/0471704091.