MP223 - Applied Econometrics Methods for the Social Sciences
Eduard Bukin
# load packages
library(tidyverse) # for data wrangling
library(alr4) # for the data sets
# set default theme and larger font size for ggplot2
# set default figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 8,
  fig.asp = 0.618,
  fig.retina = 3,
  dpi = 300,
  out.width = "80%"
)
The data used were published in (Pearson and Lee 1903).
Karl Pearson collected data on over 1100 families in England between 1893 and 1898.
Heights of mothers (mheight) and daughters (dheight) were recorded for 1375 observations.
We rely on the simple linear regression (SLR) examples in (Weisberg 2005).
Rows: 400
Columns: 2
$ mother_height <dbl> 58.8, 65.4, 65.5, 63.2, 60.1, 61.2, 60.8, 63.7, 63.8, …
$ daughter_height <dbl> 62.7, 66.2, 62.8, 62.2, 65.1, 64.8, 63.4, 64.3, 62.5, …
\[\Large{Y = \beta_0 + \beta_1 X}\]
\[\Large{\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X}\]
\(\text{residuals} \\ = \text{observed} - \text{predicted} \\ = \epsilon = Y - \hat{Y}\)
\({Y = \hat{\beta}_0 + \hat{\beta}_1 X + \epsilon}\)
\(\epsilon\) is the error term or residual
For each specific observation \(i\)
residual \(e_i = y_i - \hat{y}_i\)
squared residual \(e_i^2 = (y_i - \hat{y}_i)^2\)
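As a minimal sketch of these definitions, the snippet below computes residuals and squared residuals for the first nine observations shown in the data preview. The intercept and slope are arbitrary illustrative values for one candidate line, not OLS estimates.

```r
# First nine observations from the data preview
mother_height   <- c(58.8, 65.4, 65.5, 63.2, 60.1, 61.2, 60.8, 63.7, 63.8)
daughter_height <- c(62.7, 66.2, 62.8, 62.2, 65.1, 64.8, 63.4, 64.3, 62.5)

# Candidate line (illustrative values, not the fitted model)
b0 <- 33
b1 <- 0.49

y_hat <- b0 + b1 * mother_height   # predicted values
e     <- daughter_height - y_hat   # residuals: observed - predicted
e_sq  <- e^2                       # squared residuals

data.frame(mother_height, daughter_height, y_hat, e, e_sq)
```

Changing `b0` or `b1` changes every residual, which is why each candidate line has its own set of squared residuals.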
OLS “finds” values for \(\hat{\beta}_0\) and \(\hat{\beta}_1\);
each new pair of values of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) generates a new regression line.
plt +
  # vertical segments from each observation to the fitted line (the residuals)
  geom_segment(
    aes(x = mother_height, xend = mother_height,
        y = daughter_height, yend = predict(fit)),
    color = "blue",
    alpha = 0.4
  ) +
  # three alternative candidate lines for comparison
  geom_abline(intercept = 33, slope = 0.49, color = "black") +
  geom_abline(intercept = 5, slope = 0.95, color = "black") +
  geom_abline(intercept = 25, slope = 0.62, color = "black")
OLS finds the values of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize the sum of squared residuals (SSR):
\[ \Large{ SSR = \sum_{i=1}^{n}{e_i^2} = \sum_{i=1}^{n}{(y_i - \hat{y}_i)^2} = e_1^2 + e_2^2 + \dots + e_n^2 } \]
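The minimization can be illustrated with a short sketch. It uses only the first nine observations from the data preview, so the numbers differ from the full-sample results. The helper `ssr()` and the object `fit_small` are hypothetical names introduced here. The three candidate lines reuse the intercept and slope pairs of the black lines in the plot code.

```r
# First nine observations from the data preview
mother_height   <- c(58.8, 65.4, 65.5, 63.2, 60.1, 61.2, 60.8, 63.7, 63.8)
daughter_height <- c(62.7, 66.2, 62.8, 62.2, 65.1, 64.8, 63.4, 64.3, 62.5)

# Sum of squared residuals for a line with intercept b0 and slope b1
ssr <- function(b0, b1) {
  sum((daughter_height - (b0 + b1 * mother_height))^2)
}

# SSR of the three candidate lines
candidates <- c(
  line_1 = ssr(33, 0.49),
  line_2 = ssr(5, 0.95),
  line_3 = ssr(25, 0.62)
)

# OLS fit on the same nine observations
fit_small <- lm(daughter_height ~ mother_height)
b_ols     <- unname(coef(fit_small))
ssr_ols   <- ssr(b_ols[1], b_ols[2])

candidates
ssr_ols   # no other line can give a smaller SSR on these data
```

Any other intercept and slope pair plugged into `ssr()` should return a value at least as large as `ssr_ols`.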
The regression line passes through the point of means \((\bar{X}, \bar{Y})\), the center of all points.
The sum of the residuals (not squared) is zero: \(\sum_{i=1}^n e_i = 0\)
Zero correlation between residuals and regressors: \(Cov(X,\epsilon) = 0\)
The predicted value of \(Y\) when all regressors are at their means \(\bar{X}\) is the mean \(\bar{Y}\): \(E[Y|\bar{X}] = \bar{Y}\)
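These properties can be checked numerically. The sketch below again uses only the first nine observations from the data preview. `dta_small` and `fit_small` are hypothetical names, and "zero" means zero up to floating-point error.

```r
# First nine observations from the data preview
dta_small <- data.frame(
  mother_height   = c(58.8, 65.4, 65.5, 63.2, 60.1, 61.2, 60.8, 63.7, 63.8),
  daughter_height = c(62.7, 66.2, 62.8, 62.2, 65.1, 64.8, 63.4, 64.3, 62.5)
)

fit_small <- lm(daughter_height ~ mother_height, data = dta_small)
e <- resid(fit_small)

# 1. Residuals sum to zero
sum_e <- sum(e)

# 2. Residuals are uncorrelated with the regressor
cov_xe <- cov(dta_small$mother_height, e)

# 3. The prediction at the mean of X equals the mean of Y
pred_at_mean <- unname(predict(
  fit_small,
  newdata = data.frame(mother_height = mean(dta_small$mother_height))
))

c(sum_e = sum_e, cov_xe = cov_xe)
c(pred_at_mean = pred_at_mean, mean_y = mean(dta_small$daughter_height))
```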
Call:
lm(formula = daughter_height ~ mother_height, data = dta)

Residuals:
    Min      1Q  Median      3Q     Max
-6.6255 -1.5767 -0.0878  1.3916  9.0083

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   29.45511    2.91727   10.10   <2e-16 ***
mother_height  0.54979    0.04662   11.79   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.337 on 398 degrees of freedom
Multiple R-squared: 0.2589, Adjusted R-squared: 0.2571
F-statistic: 139.1 on 1 and 398 DF, p-value: < 2.2e-16
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 29.5 2.92 10.1 1.71e-21
2 mother_height 0.550 0.0466 11.8 9.88e-28
Parameter | Coefficient | SE | 95% CI | t(398) | p
(Intercept) | 29.46 | 2.92 | [23.72, 35.19] | 10.10 | < .001
mother height | 0.55 | 0.05 | [ 0.46, 0.64] | 11.79 | < .001
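The t values and 95% confidence intervals in these tables follow directly from the estimates and standard errors. As a rough check, the sketch below recomputes them from the reported numbers, with 398 residual degrees of freedom. The variable names are illustrative.

```r
# Reported estimates and standard errors from the regression output
estimate  <- c(intercept = 29.45511, mother_height = 0.54979)
std_error <- c(intercept = 2.91727,  mother_height = 0.04662)
df_resid  <- 398

# t value: estimate divided by its standard error
t_value <- estimate / std_error

# 95% confidence interval: estimate +/- critical t value * standard error
t_crit  <- qt(0.975, df = df_resid)
ci_low  <- estimate - t_crit * std_error
ci_high <- estimate + t_crit * std_error

round(cbind(estimate, std_error, t_value, ci_low, ci_high), 2)
```

Up to rounding, the result should agree with the `t value` and `95% CI` columns shown above.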
Important in the context of the data.
Intercept \(\hat{\beta}_0\): the predicted value of \(Y\) when all \(X\) are zero.
Slope \(\hat{\beta}_1\): the marginal effect, that is, the average change in \(Y\) when \(X\) changes by one unit, keeping all other regressors fixed.
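A small worked example may help with reading the two coefficients. It plugs the reported estimates into the fitted line. `predict_height()` is a hypothetical helper, and the mother heights 64 and 65 are arbitrary illustrative values.

```r
# Reported OLS estimates
b0 <- 29.45511   # intercept
b1 <- 0.54979    # slope on mother_height

# Predicted daughter height for a given mother height
predict_height <- function(mother_height) b0 + b1 * mother_height

# Intercept: prediction at mother_height = 0, far outside the observed data
predict_height(0)

# Slope: one extra unit of mother height changes the prediction by b1
predict_height(65) - predict_height(64)

# Prediction for the first observation in the data preview
predict_height(58.8)
```

The intercept is a pure extrapolation here, since no mother has a height near zero. This is why the interpretation has to be judged in the context of the data.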
Rows: 400
Columns: 4
$ mother_height <dbl> 58.8, 65.4, 65.5, 63.2, 60.1, 61.2, 60.8, 63.7, 63.8, …
$ daughter_height <dbl> 62.7, 66.2, 62.8, 62.2, 65.1, 64.8, 63.4, 64.3, 62.5, …
$ fitted <dbl> 61.78260, 65.41120, 65.46618, 64.20167, 62.49733, 63.1…
$ residuals <dbl> 0.9173961, 0.7887997, -2.6661790, -2.0016682, 2.602672…
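The `residuals` column is simply the observed value minus the fitted value. The sketch below checks this for the first five rows of the preview above. How the columns were added is not shown in the source, so the base R calls in the comments are an assumption.

```r
# First five rows of the preview
daughter_height <- c(62.7, 66.2, 62.8, 62.2, 65.1)
fitted_values   <- c(61.78260, 65.41120, 65.46618, 64.20167, 62.49733)
residuals_shown <- c(0.9173961, 0.7887997, -2.6661790, -2.0016682, 2.602672)

# residual = observed - fitted
residuals_check <- daughter_height - fitted_values

round(cbind(residuals_shown, residuals_check), 4)

# With the full model object (assumed to be called `fit`, as in the plot code),
# the two columns could be created along these lines:
# dta$fitted    <- fitted(fit)
# dta$residuals <- resid(fit)
```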
Create an R Script out of the R code in the presentation.