MP223 - Applied Econometrics Methods for the Social Sciences
Eduard Bukin
What is ceteris paribus?
What is the Selection Bias?
How is Selection Bias different from the OVB?
What are the long, short and auxiliary regressions?
What is the OVB formula?
Why is selection bias causing a problem?
Does more years of schooling cause higher wages?
Jacob Mincer was the first to try to quantify the return to schooling (see Mincer 1974) by estimating the log of annual earnings (\(\ln Y_i\)) as a function of years of education (\(s_i\)) and potential work experience (\(x_i\)) in the following fashion:
\[ \ln Y_i = \alpha + \rho s_i + \beta_1 x_i + \beta_2 x^{2}_i + \varepsilon_i \qquad(1)\]
\[Y_i = \alpha + \rho s_i + \gamma A^{'}_{i} + \varepsilon_i \qquad(2)\]
Randomized trials/experiments (Joshua D. Angrist and Pischke 2009, Ch 1-2.; Joshua D. Angrist and Pischke 2014, Ch. 1);
Regression analysis (Joshua D. Angrist and Pischke 2009, Ch 3.; Joshua D. Angrist and Pischke 2014, Ch. 2);
Instrumental variables
DID - Difference-in-Differences;
RDD - Regression Discontinuity Design;
Endogeneity is another term for the selection bias problem
\[Y_i = \alpha + \rho s_i + \gamma A^{'}_{i} + \varepsilon_i \\ Y_i = \alpha^S + \rho^S s_i + \varepsilon^{S}_i\]
In practice, endogeneity means that
If variance of \(s_i\) is truly independent of \(Y_i\), \(s_i\) is exogenous.
Omitted Variable Bias
Measurement Error
Simultaneity
Long model: \(Y_i = \alpha + \rho s_i + \gamma A^{'}_{i} + \varepsilon_i\)
Short model: \(Y_i = \alpha^S + \rho^S s_i + \varepsilon^{S}_i\)
If \(s_i\) and \(A_i\) are correlated, we can assume a linear relationship between them:
\[ A_i = \delta_0 + \delta_1 s_i + \upsilon_i \]
\[ \Rightarrow Y_i = \alpha + \rho s_i + \gamma (\delta_0 + \delta_1 s_i + \upsilon_i) + \varepsilon_i \]
\[ = \underbrace{(\alpha + \gamma \delta_0)}_{\alpha^S} + \underbrace{(\rho + \gamma \delta_1)}_{\rho^S} s_i + \underbrace{(\varepsilon_i + \gamma \upsilon_i)}_{\varepsilon_i^S} \]
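The OVB formula above can be checked numerically. The following is a minimal sketch on simulated data; the sample size and all parameter values (\(\rho = 0.10\), \(\gamma = 0.30\), \(\delta_1 = 0.5\)) are illustrative assumptions, not estimates from real data.

```r
# Simulated data; every number below is an assumption chosen for illustration.
set.seed(1)
n <- 10000
s <- rnorm(n, mean = 12, sd = 2)          # schooling
A <- 1 + 0.5 * s + rnorm(n)               # ability: delta_0 = 1, delta_1 = 0.5
Y <- 2 + 0.10 * s + 0.30 * A + rnorm(n)   # long model: rho = 0.10, gamma = 0.30

long  <- lm(Y ~ s + A)   # long regression
short <- lm(Y ~ s)       # short regression (A omitted)
aux   <- lm(A ~ s)       # auxiliary regression of the omitted variable on s

# Short coefficient vs. rho + gamma * delta_1 built from the long and auxiliary fits
rho_short   <- unname(coef(short)["s"])
ovb_formula <- unname(coef(long)["s"] + coef(long)["A"] * coef(aux)["s"])
c(short = rho_short, formula = ovb_formula)
```

The two numbers should agree up to floating point error, because the OVB formula is an algebraic identity of OLS, and both should be close to \(0.10 + 0.30 \cdot 0.5 = 0.25\) rather than to the true \(\rho = 0.10\).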
We want to estimate the model: \(Y_i = \alpha + \beta s^*_i + e_i\), but we only observe the mismeasured regressor \(s_i = s^*_i + m_i\),
Desired coefficient \(\beta = \frac{Cov(Y_i, s^*_i)}{Var(s^*_i)}\)
But with the erroneous data, we estimate biased coefficient \(\beta_b\)
\[ \beta_b = \frac{Cov(Y_i, s_i)}{Var(s_i)} = \frac{Cov(\alpha+\beta s^*_i + e_i, s^*_i + m_i)}{Var(s_i)} \\ = \frac{\beta \cdot Cov(s^*_i, s^*_i)}{Var(s_i)} = \beta \frac{Var(s^{*}_i)}{Var(s_i)} \]
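The attenuation result \(\beta_b = \beta \, Var(s^*_i)/Var(s_i)\) can be illustrated with a small simulation. This is a sketch under assumed values: the measurement error \(m_i\) is taken to be classical (independent of \(s^*_i\) and \(e_i\)), and the variances and \(\beta = 0.10\) are made up for the example.

```r
# Simulated data; all values are illustrative assumptions.
set.seed(2)
n      <- 10000
beta   <- 0.10
s_star <- rnorm(n, mean = 12, sd = 2)     # true schooling, Var = 4
m      <- rnorm(n, mean = 0, sd = 1)      # classical measurement error, Var = 1
s      <- s_star + m                      # observed (mismeasured) schooling
Y      <- 1 + beta * s_star + rnorm(n, sd = 0.5)

beta_b    <- unname(coef(lm(Y ~ s))["s"])   # estimate with erroneous data
predicted <- beta * var(s_star) / var(s)    # attenuation formula
c(estimated = beta_b, predicted = predicted, true = beta)
```

With these assumed variances the ratio \(Var(s^*_i)/Var(s_i)\) is about \(4/5\), so the estimate is pulled toward zero, to roughly \(0.08\).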
Simultaneity occurs if at least two variables are jointly determined.
The prototypical case is a system of demand and supply equations:
Number of police officers and the crime rate.
See M. J. Wooldridge (2020, Ch. 17) for more details on the problem and solutions.
There are the same five “lethal” weapons against endogeneity as there were against the selection bias:
Recall the short (\(Y_i = \alpha^S + \rho^S s_i + \varepsilon^{S}_i\)) and long (\(Y_i = \alpha + \rho s_i + \gamma A^{'}_{i} + \varepsilon_i\)) models.
Imagine that we have:
Estimate the first stage: \(s_i = \pi_0 + \pi_1 Z_i + \nu_i\)
Substitute \(s_i\) with the fitted values from the first stage \(\hat{s}_i\)
Estimate the second stage: \(Y_i = \alpha^{IV} + \rho^{IV} \hat{s}_i + \varepsilon^{IV}_i\)
where
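library(dplyr)  # provides %>%, as_tibble(), filter(), mutate(), if_all()

# Keep observations with complete data on all variables used below
wg1 <- wooldridge::wage2 %>% as_tibble() %>%
  filter(if_all(c(wage, educ, exper, meduc), ~ !is.na(.)))

# OLS benchmark
ols <- lm(log(wage) ~ educ + exper + I(exper^2), wg1)

# First stage: education on the instrument (mother's education) and controls
first_stage <- lm(educ ~ meduc + exper + I(exper^2), wg1)

# Second stage: replace educ with its first-stage fitted values
# Note: standard errors from this manual second stage are not valid 2SLS standard errors
second_stage <- lm(log(wage) ~ educ_fit + exper + I(exper^2),
                   wg1 %>% mutate(educ_fit = fitted(first_stage)))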
# A tibble: 9 × 4
parameter OLS `First stage` `Second stage`
<chr> <chr> <chr> <chr>
1 (Intercept) "5.4864*** (0.1308)" "12.7886*** (0.4575)" "4.3947*** (0.35…
2 educ "0.0802*** (0.0068)" "" ""
3 meduc "" "0.2199*** (0.0227)" ""
4 educ_fit "" "" "0.1518*** (0.02…
5 exper "0.0147 (0.0143)" "-0.0455 (0.0684)" "0.0170 (0.0151)"
6 I(exper^2) "0.0003 (0.0006)" "-0.0070* (0.0029)" "0.0010 (0.0007)"
7 N "857" "857" "857"
8 R-sq. adj. "0.1387" "0.2898" "0.0482"
9 F Statistics (df) "47*** (3)" "117*** (3)" "15*** (3)"
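Because the true return is unknown in real data, it can help to repeat the two-stage procedure on simulated data where the truth is known. The sketch below is an illustration only: the data-generating process, the coefficient values, and the variable names are assumptions, with an instrument `Z` that is independent of the omitted ability `A` by construction.

```r
# Simulated data; all values are illustrative assumptions.
set.seed(3)
n <- 20000
A <- rnorm(n)                               # unobserved ability
Z <- rnorm(n)                               # instrument, independent of A
s <- 12 + 0.8 * Z + 1.0 * A + rnorm(n)      # schooling depends on Z and A
Y <- 1 + 0.10 * s + 0.30 * A + rnorm(n)     # true rho = 0.10

rho_ols  <- unname(coef(lm(Y ~ s))["s"])            # short regression, A omitted
s_hat    <- fitted(lm(s ~ Z))                       # first stage
rho_2sls <- unname(coef(lm(Y ~ s_hat))["s_hat"])    # second stage
rho_cov  <- cov(Y, Z) / cov(s, Z)                   # IV as a ratio of covariances

c(OLS = rho_ols, two_stage = rho_2sls, cov_ratio = rho_cov)
```

With a single instrument, the two-stage estimate and the ratio \(Cov(Y_i, Z_i)/Cov(s_i, Z_i)\) coincide, and both should be near the assumed \(\rho = 0.10\), while OLS is pushed upward by the omitted ability.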
IV estimates are not unbiased, but they are consistent (Joshua D. Angrist and Krueger 2001).
Unbiasedness means the estimator has a sampling distribution centered on the parameter of interest in a sample of any size, while
Consistency only means that the estimator converges to the population parameter as the sample size grows.
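The distinction between the two properties can be explored with a small Monte Carlo experiment. This is a rough sketch, assuming a made-up data-generating process with a fairly weak first stage; `iv_once()` is a hypothetical helper written for this illustration.

```r
# Monte Carlo sketch; all values are illustrative assumptions.
set.seed(7)
rho <- 0.10

# One IV estimate from a fresh simulated sample of size n (single instrument)
iv_once <- function(n) {
  A <- rnorm(n)                              # unobserved ability
  Z <- rnorm(n)                              # instrument, independent of A
  s <- 12 + 0.3 * Z + A + rnorm(n)           # first stage
  Y <- 1 + rho * s + 0.30 * A + rnorm(n)     # outcome
  cov(Y, Z) / cov(s, Z)                      # IV estimate of rho
}

small <- replicate(500, iv_once(50))     # sampling distribution, n = 50
large <- replicate(500, iv_once(5000))   # sampling distribution, n = 5000

# Quartiles of the two sampling distributions
rbind(small = quantile(small, c(0.25, 0.5, 0.75)),
      large = quantile(large, c(0.25, 0.5, 0.75)))
```

In the small samples the estimates are widely dispersed and can be far from \(\rho\); in the large samples they concentrate around it, which is what consistency promises.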
\(Z_i\) that fails any of the relevance condition, the exclusion restriction, or the independence assumption;
\(Z_i\) that correlates with an omitted variable (OV):
Such instruments result in a much greater upward bias compared to OLS;
For example, consider the weather in Brazil, the supply price, and the quantity of coffee demanded:
The weather shifts the supply curve and is random, so it seems a plausible instrument for price in the demand model;
However, the weather in Brazil also determines supply expectations on the futures exchange, so it shifts the demand for coffee before the supply price is affected;
Weak instrument \(Z_i\):
When the instrument \(Z_i\) is only weakly correlated with the endogenous regressor \(s_i\);
Find a better one!
Weak instrument test:
Run the first stage regression with and without the IV;
Compare the F-statistics
This test does not ensure that our instruments are independent of omitted variable \(A^{'}_i\) or \(Y_i\);
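The comparison of first-stage regressions with and without the instrument can be sketched as follows. The data are simulated and the coefficients are assumptions; `first_stage_F()` is a hypothetical helper that returns the F-statistic for excluding the instrument. A widely used rule of thumb, commonly attributed to Staiger and Stock (1997), treats a first-stage F-statistic below about 10 as a sign of a weak instrument.

```r
# Simulated data; all values are illustrative assumptions.
set.seed(4)
n        <- 2000
x        <- rnorm(n)                 # exogenous control
Z_strong <- rnorm(n)                 # instrument with a sizeable first-stage effect
Z_weak   <- rnorm(n)                 # instrument with a tiny first-stage effect
s <- 12 + 0.5 * Z_strong + 0.02 * Z_weak + 0.3 * x + rnorm(n)

# F-test of the first stage with vs. without the instrument z
first_stage_F <- function(z) {
  restricted   <- lm(s ~ x)          # first stage without the instrument
  unrestricted <- lm(s ~ x + z)      # first stage with the instrument
  anova(restricted, unrestricted)$F[2]
}

F_strong <- first_stage_F(Z_strong)
F_weak   <- first_stage_F(Z_weak)
c(strong = F_strong, weak = F_weak)
```

A large F-statistic only speaks to relevance; as noted above, it says nothing about whether the instrument is independent of the omitted variable.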
Staiger and Stock (1997)
The number of instruments \(G\) exceeds the number of endogenous variables \(K\).
If you have a few candidates for IVs and one endogenous regressor:
Sargan’s overidentification test:
\(H_0:Cov(Z^{'}_i,\varepsilon^{IV}_i)=0\) - the covariance between the instrument and the error term is zero
\(H_1:Cov(Z^{'}_i,\varepsilon^{IV}_i)\neq0\)
Thus, by rejecting the \(H_0\), we conclude that at least one of the instruments is not valid.
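One common way to compute Sargan's statistic by hand is to regress the 2SLS residuals on all instruments and use \(n R^2\), which is compared with a \(\chi^2\) distribution with \(G - K\) degrees of freedom. The sketch below assumes simulated data with two instruments and one endogenous regressor; `sargan_p()` is a hypothetical helper, and the second call deliberately breaks the exclusion restriction by letting `Z2` affect the outcome directly.

```r
# Simulated data; all values are illustrative assumptions.
set.seed(5)
n  <- 5000
A  <- rnorm(n)                                   # unobserved ability
Z1 <- rnorm(n)
Z2 <- rnorm(n)
s  <- 12 + 0.6 * Z1 + 0.6 * Z2 + A + rnorm(n)    # first stage, G = 2 instruments
Y  <- 1 + 0.10 * s + 0.30 * A + rnorm(n)         # Z1, Z2 excluded from the outcome

# Sargan test p-value for one endogenous regressor (K = 1) and two instruments
sargan_p <- function(Y, s, Z1, Z2) {
  s_hat <- fitted(lm(s ~ Z1 + Z2))               # first stage
  b     <- unname(coef(lm(Y ~ s_hat)))           # 2SLS coefficients
  u     <- Y - b[1] - b[2] * s                   # residuals use observed s, not s_hat
  stat  <- length(Y) * summary(lm(u ~ Z1 + Z2))$r.squared
  pchisq(stat, df = 2 - 1, lower.tail = FALSE)   # df = G - K
}

p_valid   <- sargan_p(Y, s, Z1, Z2)              # both instruments valid
p_invalid <- sargan_p(Y + 0.5 * Z2, s, Z1, Z2)   # Z2 now affects Y directly
c(valid = p_valid, invalid = p_invalid)
```

The test needs more instruments than endogenous regressors, and it cannot detect the case where all instruments are invalid in the same direction.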
Wu-Hausman test for endogeneity tests if the variable that we are worried about is indeed endogenous.
\(H_0:Cov(s_i,\varepsilon_i)=0\) - the covariance between potentially endogenous variable and the error term is zero
\(H_1:Cov(s_i,\varepsilon_i) \neq 0\)
Thus, by rejecting the \(H_0\), we conclude that there is endogeneity and there might be a need for IV.
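A regression-based version of this test adds the first-stage residuals to the outcome equation and checks whether their coefficient differs from zero. The following is a minimal sketch on simulated data; the data-generating process and coefficient values are assumptions, and schooling is endogenous by construction.

```r
# Simulated data; all values are illustrative assumptions.
set.seed(6)
n <- 5000
A <- rnorm(n)                              # unobserved ability
Z <- rnorm(n)                              # instrument, independent of A
s <- 12 + 0.8 * Z + A + rnorm(n)           # schooling is endogenous through A
Y <- 1 + 0.10 * s + 0.30 * A + rnorm(n)

v_hat <- resid(lm(s ~ Z))                  # first-stage residuals
aug   <- lm(Y ~ s + v_hat)                 # augmented (control function) regression

# H0: coefficient on v_hat is zero, i.e. s is exogenous
p_endog <- summary(aug)$coefficients["v_hat", "Pr(>|t|)"]
rho_cf  <- unname(coef(aug)["s"])          # coefficient on s in the augmented regression
c(p_value = p_endog, rho = rho_cf)
```

A small p-value rejects exogeneity of \(s_i\). As a side effect, the coefficient on `s` in the augmented regression is numerically the same as the IV estimate \(Cov(Y_i, Z_i)/Cov(s_i, Z_i)\).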
Angrist, J. D., & Krueger, A. B. (1991). Does Compulsory School Attendance Affect Schooling and Earnings? The Quarterly Journal of Economics, 106, 979–1014. https://doi.org/10.2307/2937954
Recall Mincer’s regression (Equation 1) with monthly wage (\(Y_i\)) as a function of years of education (\(s_i\)) and years of experience (\(x_i\)).
Answer the following questions:
…
Ability bias!
Is it sufficient to use the IQ or knowledge of work index to resolve this bias?
What about creativity?
How to quantify the lottery-like chance of getting a decent job?
How to measure the connections?
Where to find an IV?
Use theory!!!
Think and speculate:
Analyze, what were/are the policies/environments that could mimic the experimental setting?
The reasoning about how a researcher uses theory and available observational data to approximate a real experiment is called an identification strategy!
Identification strategy:
Policy required students to enter school in the calendar year in which they turned six years old;
Children born in the fourth quarter enter school at age 5 3⁄4, while those born in the first quarter enter school at age 6 3⁄4;
Compulsory schooling laws require students to remain in school until their 16th birthdays;
The combination of school start age policies and compulsory schooling laws creates a natural experiment in which children are compelled to attend school for different lengths of time depending on their birthdays.
Source: (Joshua D. Angrist and Krueger 1991)
Quarter of birth;
The intuition is:
Only a small part of the variance in education (the part linked to the quarter of birth) is used to identify the return to education.
This small part of the variance arises from a natural experiment that is as good as random, thus ceteris paribus holds here.
IV estimates are very close to the OLS estimates;
What does it mean?
Before running a regression, ask the following four questions (see Joshua D. Angrist and Pischke 2009, Ch. 1)
What is the causal relationship of interest?
What is the experiment that could ideally be used to capture the causal effect of interest?
What is your identification strategy?
What is your mode of statistical inference?
Describe an ideal experiment.
Highlight the forces you’d like to manipulate and the factors you’d like to hold constant.
FUQs: fundamentally unidentified questions
Causal effect of race or gender;
Do children that start school 1 year later learn more in the primary school?
Use theory!
Analyze, what were/are the policies/environments that could mimic the experimental setting?
describes the population to be studied,
the sample to be used,
and the assumptions made when constructing standard errors.
choose appropriate statistical methods
apply them diligently.
Acemoglu, D., Johnson, S., & Robinson, J. A. (2001). The colonial origins of comparative development: An empirical investigation. American Economic Review, 91(5), 1369-1401.
What are the fundamental causes of the large differences in income per capita across countries?
with better “institutions,” more secure property rights, and less distortionary policies,
Institutions are a likely cause of income growth.
What would the ideal experiment be here?
Rich economies choose or can afford better institutions.
Economies that are different for a variety of reasons
To estimate the impact of institutions on income,
Current performance is caused by:
Current institutions, which are caused by
Early institutions, which are caused by
Settlements types during colonization, which are caused by
Settlers’ (potential) mortality or colonization risks.
(J. Angrist and Evans 1998) Angrist, J., & Evans, W. N. (1996). Children and their parents’ labor supply: Evidence from exogenous variation in family size.
What is the effect of an additional child on women’s labor market participation?
Conventional wisdom:
What would the ideal experiment be here?
Families without children are an inappropriate counterfactual
Rich families can afford more children: an inappropriate counterfactual
Families usually plan to have an additional child
We need a source of exogenous variation in the number of children
People may plan for a second child, but they cannot plan to have twins!