Eduard Bukin
In multiple regression, ceteris paribus is achieved by introducing control variables.
Warning
Bad, insufficient, or wrong controls leave us with selection bias.
In the context of regression analysis, selection bias is called OVB - Omitted Variable Bias.
\[Y_i = \alpha ^ l + \beta ^ l P_i + \gamma A_i + e^l_i\]
where:
\(Y_i\) is the outcome variable;
\(P_i\) is the key variable of interest;
\(A_i\) is the omitted variable;
\(\alpha ^ l\) , \(\beta ^ l\) are true regression coefficients;
\(\gamma\) is the effect of the omitted variable in the long model;
\(e^l_i\) are the true error terms.
\[Y_i = \alpha ^ s + \beta^s P_i + e^s_i\]
where:
\(Y_i\) is the outcome variable;
\(P_i\) is the key variable of interest;
\(\alpha ^ s\) , \(\beta ^ s\) are the estimates of regression coefficients in the short model;
\(e^s_i\) are the error terms of the short model.
Omitting variable \(A_i\) in the short model biases \(\beta^s\):
\[ \beta^s = \beta^l + \text{OVB} \]
We can measure Omitted Variable Bias (\(\text{OVB}\)) as:
\[ \text{OVB} = \beta^s - \beta^l \]
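OVB arises when both of the following hold:
\(A_i\) and \(Y_i\) relate to each other (\(\gamma \neq 0\) in the long regression);
\(P_i\) and \(A_i\) relate to each other (\(\pi_1 \neq 0\) in the auxiliary regression):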
\[ A_i = \pi_0 + \pi_1 P_i + u_i \]
With:
Long: \(Y_i = \alpha ^ l + \beta ^ l P_i + \gamma A_i + e^l_i\);
Short: \(Y_i = \alpha ^ s + \beta^s P_i + e^s_i\);
Auxiliary: \(A_i = \pi_0 + \pi_1 P_i + u_i\);
We can measure Omitted Variable Bias as:
\[\text{OVB} = \beta^s - \beta^l\]
\[\text{OVB} = \pi_1 \times \gamma\]
This follows from substituting the auxiliary regression into the long regression:
\[ \Rightarrow Y_i = \alpha ^ l + \beta ^ l P_i + \gamma \{\pi_0 + \pi_1 P_i + u_i \} + e^l_i \]
\[ \Rightarrow Y_i = \underbrace{\alpha ^ l + \gamma \pi_0}_{\alpha ^ s} + \underbrace{(\beta ^ l + \gamma \pi_1)}_{\beta^s} P_i + \underbrace{e^l_i + \gamma u_i}_{e^s_i} \]
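The identity can also be checked numerically. The sketch below is an illustration on simulated data, not part of the original material: the sample size, the coefficient values (\(\beta^l = 2\), \(\gamma = 3\), \(\pi_1 = 0.5\)) and all object names are assumptions chosen for the example. For OLS estimates computed on the same sample, \(\beta^s - \beta^l\) and \(\pi_1 \times \gamma\) should coincide up to floating-point error.

```r
# Simulated check of OVB = pi_1 * gamma (all numbers are illustrative assumptions)
set.seed(42)
n <- 1000

P <- rnorm(n)                        # key variable of interest
A <- 1 + 0.5 * P + rnorm(n)          # auxiliary relation: pi_0 = 1, pi_1 = 0.5
Y <- 4 + 2 * P + 3 * A + rnorm(n)    # long model: alpha = 4, beta = 2, gamma = 3

long  <- lm(Y ~ P + A)               # long regression
short <- lm(Y ~ P)                   # short regression, A omitted
aux   <- lm(A ~ P)                   # auxiliary regression

# Bias measured directly: beta^s - beta^l
ovb_direct  <- unname(coef(short)["P"] - coef(long)["P"])
# Bias from the formula: pi_1 * gamma
ovb_formula <- unname(coef(aux)["P"] * coef(long)["A"])

print(c(direct = ovb_direct, formula = ovb_formula))
```

Because \(\pi_1\) and \(\gamma\) are both positive in this setup, the short regression overstates the coefficient on \(P_i\).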
An omitted variable is one that we cannot include in the regression because we have no data on it.
Knowing the mathematics behind OVB, we can make an educated guess about the consequence of omitting the variable, that is, the direction of the bias (Angrist & Pischke, 2014):
Write down Short, Long and Auxiliary regressions
Justify potential signs of \(\pi_1\) and \(\gamma\);
Conclude how the OV biases our regression based on the formula: \(\text{OVB} = \pi_1 \times \gamma\).
OVB can bias estimates:
upwards (\(\text{OVB} > 0\)): increasing the effect of \(P_i\)
downwards (\(\text{OVB} < 0\)): decreasing the effect of \(P_i\)
rendering the effect of \(P_i\) insignificant
There is no full solution! Possible remedies:
Proxies;
Research design (Panel Regression/DiD, RDD);
Acknowledge presence of the OVB;
Discuss the bias;
In his work Schooling, Experience, and Earnings (Mincer, 1974), Jacob Mincer attempted to quantify the wage premium of schooling. He used the following regression equation:
\[ \log \text{wage}_i = \beta_0 + \beta_1 \text{educ}_i + \beta_2 \text{exper}_i+ \epsilon_i \]
Prove that omitting experience causes OVB!
Long: \(\log \text{wage}_i = \beta_0 + \beta_1 \text{educ}_i + \beta_2 \text{exper}_i+ \epsilon_i\)
Short: \(\log \text{wage}_i = \beta_0^s + \beta_1^s \text{educ}_i + \epsilon_i^s\)
Auxiliary: \(\text{exper}_i = \rho_0 + \rho_1 \text{educ}_i + u_i\)
Use literature and other empirical research to reinforce your claims.
\[\beta_2 > 0\]
\[\rho_1 < 0\]
\[\text{OVB} = \beta_2 \times \rho_1\]
Given our previous hypotheses:
\(\beta_2 > 0\), so its sign is \((+)\);
\(\rho_1 < 0\), so its sign is \((-)\).
\[\text{OVB} = (+) \times (-) < 0\]
Omitting experience in the short regression might cause a downward bias in the estimated effect of education. As a result, we may underestimate the effect of education on wages.
Suppose that we have estimated the equation:
\(\log \text{wage}_i = \beta_0 + \beta_1 \text{educ}_i + \beta_2 \text{exper}_i + \beta_3 \text{exper}^2_i+ \epsilon_i\)
Call:
lm(formula = log(wage) ~ educ, data = dta)
Coefficients:
(Intercept) educ
5.97306 0.05984
Call:
lm(formula = log(wage) ~ educ + exper, data = dta)
Coefficients:
(Intercept) educ exper
5.50271 0.07778 0.01978
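As a worked check, the two outputs above give the size of the bias on the education coefficient directly. Assuming both regressions were estimated on the same sample (an assumption, since the output does not state it), the OVB formula also lets us back out the implied auxiliary coefficient \(\rho_1\) without running the auxiliary regression:
\[ \text{OVB} = \beta_1^s - \beta_1 = 0.05984 - 0.07778 = -0.01794 \]
\[ \rho_1 = \frac{\text{OVB}}{\beta_2} = \frac{-0.01794}{0.01978} \approx -0.907 \]
The negative sign agrees with the hypothesis \(\rho_1 < 0\): in this sample, each additional year of education is associated with roughly 0.9 fewer years of experience, and the short regression understates the coefficient on education.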
Show how omitting ability biases the estimates of the effect of education on wages.
Suppose that we have estimated the following regression:
\[ \log \text{wage}_i = \beta_0^s + \beta_1^s \text{educ}_i + \beta_2^s \text{exper}_i + \beta_3^s \text{exper}^2_i + \epsilon_i^s \]
What other omitted variables may bias our estimates?
These are other human capital related variables: age, ability, motivation.
Write short, long and auxiliary regression
short:
long:
auxiliary:
Write the OVB formula:
Make hypotheses about the effect of the included variable on the omitted variable and of the omitted variable on the dependent variable:
Conclude about the bias:
Write short, long and auxiliary regression
short: \(\log \text{wage}_i = \beta_0^s + \beta_1^s \text{educ}_i + \beta_2^s \text{exper}_i + \beta_3^s \text{exper}^2_i + \epsilon_i^s\)
long: \(\log \text{wage}_i = \beta_0 + \beta_1 \text{educ}_i + \beta_2 \text{exper}_i + \beta_3 \text{exper}^2_i + \gamma \text{ability}_i + \epsilon_i\)
auxiliary: \(\text{ability}_i = \rho_0 + \rho_1 \text{educ}_i + \rho_2 \text{exper}_i + \rho_3 \text{exper}^2_i + u_i\)
Write the OVB formula: \(\text{OVB} = \gamma \times \rho_1\);
Make hypotheses about \(\gamma\) and \(\rho_1\):
\(\rho_1 > 0\), as more years of education are usually associated with higher abilities;
\(\gamma > 0\), as higher abilities are usually rewarded with a higher salary.
Conclude about the bias;
We might have an upward bias in the estimates of our regression.
Specifically the effect of education is overestimated.
In the long model we might observe a lower effect of education on wage.
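Written out with signs, in the same way as for the experience example, the hypotheses \(\rho_1 > 0\) and \(\gamma > 0\) give:
\[ \text{OVB} = \gamma \times \rho_1 = (+) \times (+) > 0 \quad \Rightarrow \quad \beta_1^s = \beta_1 + \text{OVB} > \beta_1 \]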
Estimate the short model
Estimate the long model where instead of abilities the IQ level is used as a proxy.
Calculate the extent of the OVB
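The three tasks above can be sketched in R as follows. This is a minimal illustration under stated assumptions, not the course solution: the original data set is not reproduced here, so the sketch simulates a data frame `sim`; the sample size, the coefficient values and the column name `IQ` are all hypothetical. With real data, only the three `lm()` calls and the two OVB lines would be needed.

```r
# Simulated stand-in for the wage data (all numbers are illustrative assumptions)
set.seed(1)
n <- 2000
educ  <- sample(9:18, n, replace = TRUE)                 # years of education
exper <- pmax(0, round(30 - educ + rnorm(n, sd = 5)))    # less experience with more schooling
IQ    <- 70 + 2.5 * educ + rnorm(n, sd = 12)             # proxy for ability
lwage <- 5 + 0.06 * educ + 0.02 * exper - 0.0003 * exper^2 +
  0.005 * IQ + rnorm(n, sd = 0.4)
sim <- data.frame(wage = exp(lwage), educ, exper, IQ)

# Short model: ability (IQ) omitted
short <- lm(log(wage) ~ educ + exper + I(exper^2), data = sim)
# Long model: IQ used as a proxy for ability
long  <- lm(log(wage) ~ educ + exper + I(exper^2) + IQ, data = sim)
# Auxiliary model: omitted variable regressed on all included regressors
aux   <- lm(IQ ~ educ + exper + I(exper^2), data = sim)

# Extent of the OVB on the education coefficient, computed two ways
ovb_direct  <- unname(coef(short)["educ"] - coef(long)["educ"])  # beta_1^s - beta_1
ovb_formula <- unname(coef(aux)["educ"] * coef(long)["IQ"])      # rho_1 * gamma

print(c(direct = ovb_direct, formula = ovb_formula))
```

Note that the auxiliary regression includes every regressor of the short model, not only education; the coefficient on `educ` in that regression plays the role of \(\rho_1\).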
OVB Formula (Short, Long and Auxiliary regressions)
Be ready to demonstrate how to use the OVB formula for making an educated guess about the direction of the bias during the exam.
Dale, S. B., & Krueger, A. B. (2002). Estimating the payoff to attending a more selective college: An application of selection on observables and unobservables. The Quarterly Journal of Economics, 117(4), 1491-1527.
Video 1. Selection Bias: Will You Make More Going to a Private University? From minute 6:30 to the end. https://youtu.be/6YrIDhaUQOE
Video 2. 2017 AEA Cross-Section Econometrics, Part 2. From 47:22.
You want to estimate the causal effect of union membership on employees’ wages. You estimate the following regression equation:
\[ \begin{aligned} \log \text{wage}_i = {} & \beta_0 + \beta_1 \text{union}_i + \beta_2 \text{experience}_i + \beta_3 \text{experience}^2_i \\ & + \beta_4 \text{married}_i + \beta_5 \text{sex}_i + \beta_6 \text{hours per week}_i + \epsilon_i \end{aligned} \]
Your colleagues suggest that you should include the individual’s education in the list of control variables, because omitting such a regressor biases the estimate.
Using the OVB formula, show whether or not omitting education causes OVB.
Calculate the extent of the OVB