class: center, middle, inverse, title-slide # PPOL 502-07: Reg. Methods for Policy Analysis ## Week 12: Predictive Models ### Alexander Podkul, PhD ### Spring 2022 --- ## Today's Class Outline * Course Schedule * Predictions (Review) * Prediction Errors * Using Predictions * Working with Out of Sample Data * Fitting the Model * Where We're Going Next * __Break__ * Working in Stata --- ## Course Schedule __Tonight (4/13)__ * {Nothing due} __Next week (4/20)__ * Data Project due __The final week (4/27)__ * {Nothing due} * Post suggested readings! (or email them to me) __The final exam (5/6)__ * Final exam! --- ## Data Project Comments General update: * (Unless you have an extenuating circumstance) Feedback has been distributed! * Projects seem to be in a good place! -- A few focus areas as we approach the final stretch: * Consider multiple specifications (telling the whole story) * Think hard about omitted variable bias * Perform diagnostics * Transforming variables may add to your story * It's never too late to consider interaction or quadratic models 😄 --- ## Reviewing Predictions So far this semester, we've already discussed a number of prediction-related topics: * Fitted Values * Adapting Predictions for `\(log(y)\)` * Standard Errors of Fitted Values --- ### Fitted Values (Review) Already in this course we've discussed making predictions from our models and our data. Early in the semester we called them 'fitted values' and later we used them in thinking about interpreting interaction models. If we estimated a model such that: `$$\hat{y} = 11 + 2x_1 + 5x_2 + 0.25x_3$$` -- we could find the following _fitted values_ (also known as _predictions_) from our data. <table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> x1 </th> <th style="text-align:right;"> x2 </th> <th style="text-align:right;"> x3 </th> <th style="text-align:right;"> y_i </th> <th style="text-align:right;"> y_pred </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 23.25 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 30 </td> <td style="text-align:right;"> 36.50 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 17.75 </td> </tr> </tbody> </table> --- ### Adapting Predictions for `\(log(y)\)` `$$log(y) = \beta_0 + \beta_1x_1 + \beta_2x_2$$` (Review) Because our errors are normal on the log scale (so that `\(y\)` itself is log-normally distributed), predictions from a log-level or log-log model are going to under-estimate the expected value of `\(y\)` when we simply convert our `\(\hat{log(y)}\)` back to `\(y\)`. -- <img src="Week12_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" /> --- ### Adapting Predictions for `\(log(y)\)` In other words, the naive approach `$$E(y|X) = e^{\hat{log(y)}}$$` would produce underestimates.
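To see the problem concretely, here is a quick simulation sketch in R (the data-generating process, coefficients, and seed below are purely illustrative, not from the slides):

```r
# Illustrative simulation: with errors that are normal on the log scale,
# exponentiating the fitted values of log(y) understates the average of y.
set.seed(502)                      # arbitrary seed
x <- runif(1000, 0, 10)
u <- rnorm(1000)                   # error is normal in logs
y <- exp(1 + 0.2 * x + u)          # so y itself is log-normal

fit   <- lm(log(y) ~ x)
naive <- exp(fitted(fit))          # e^(log(y)-hat), with no correction

mean(y)                            # noticeably larger than...
mean(naive)                        # ...the naive back-transformed predictions
```

The gap between those two means is exactly what the correction factor introduced next is meant to close.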
-- To fix this, we add a correction factor such that: `$$E(y|X) = \hat{\alpha}_0e^{\hat{log(y)}}$$` where we estimate `\(\hat{\alpha}\)` by looking at the average exponentiated residual or `\(\check{\alpha}\)` by standardizing the exponentiated values by the raw values of `\(y\)`. --- ### Standard Error of the Prediction (Review) Estimating a prediction is easy (plug it in the equation!). Estimating the standard error is a bit more complicated. -- If we have an estimated equation such that: `$$\hat{y} = \hat{\beta_0} + \hat{\beta_1}x_1 + \hat{\beta_2}x_2 + ... + \hat{\beta_k}x_k$$` -- then the parameter we aim to estimate is: `$$\theta_0 = \beta_0 + \beta_1c_1 + \beta_2c_2 + ... + \beta_kc_k$$` where `\(c\)` represents particular values (or constants) for each of `\(k\)` independent variables. -- Finally, our estimator of `\(\theta_0\)` becomes: `$$\hat{\theta_0} = \hat{\beta_0} + \hat{\beta_1}c_1 + \hat{\beta_2}c_2 + ... + \hat{\beta_k}c_k$$` --- ### Standard Error of the Prediction To obtain a standard error for `\(\hat{\theta_0}\)` , we need to consider the linear combination of our OLS estimators (because the value is dependent on all values of `\(\hat{\beta_j}\)`, unless `\(c_j\)` = 0). -- To find the standard error associated with the _expected value_ of `\(y\)`, we can: 1. re-arrange our parameter equation so that `$$\theta_0 = \beta_0 + \beta_1c_1 + \beta_2c_2 + ... + \beta_kc_k$$` `$$\beta_0 = \theta_0 - \beta_1c_1 - \beta_2c_2 - ... - \beta_kc_k$$` 2. substitute our rearranged parameter equation into our regression formula so that `$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_kx_k$$` `$$y = \theta_0 + \beta_1(x_1 - c_1) + \beta_2(x_2 - c_2) + ... + \beta_k(x_k - c_k)$$` -- ... which is the equivalent to estimating a regression for `\(y\)` on `\(x_1 - c_1\)`, `\(x_2 - c_2\)` and using the standard error from the intercept term! --- ### Prediction Examples <table class="texreg" style="margin: 10px auto;border-collapse: collapse;border-spacing: 0px;caption-side: bottom;color: #000000;border-top: 2px solid #000000;"> <caption>Statistical models</caption> <thead> <tr> <th style="padding-left: 5px;padding-right: 5px;"> </th> <th style="padding-left: 5px;padding-right: 5px;">Clinton Share</th> </tr> </thead> <tbody> <tr style="border-top: 1px solid #000000;"> <td style="padding-left: 5px;padding-right: 5px;">Intercept</td> <td style="padding-left: 5px;padding-right: 5px;">54.196<sup>***</sup></td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(2.014)</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Median Age</td> <td style="padding-left: 5px;padding-right: 5px;">-0.106<sup>*</sup></td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(0.054)</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Per Capita Income</td> <td style="padding-left: 5px;padding-right: 5px;">-0.001<sup>***</sup></td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(0.000)</td> </tr> <tr style="border-top: 1px solid #000000;"> <td style="padding-left: 5px;padding-right: 5px;">R<sup>2</sup></td> <td style="padding-left: 5px;padding-right: 5px;">0.052</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Adj. 
R<sup>2</sup></td> <td style="padding-left: 5px;padding-right: 5px;">0.051</td> </tr> <tr style="border-bottom: 2px solid #000000;"> <td style="padding-left: 5px;padding-right: 5px;">Num. obs.</td> <td style="padding-left: 5px;padding-right: 5px;">2704</td> </tr> </tbody> <tfoot> <tr> <td style="font-size: 0.8em;" colspan="2"><sup>***</sup>p < 0.001; <sup>**</sup>p < 0.01; <sup>*</sup>p < 0.05</td> </tr> </tfoot> </table> --- ### Prediction Examples <table class="texreg" style="margin: 10px auto;border-collapse: collapse;border-spacing: 0px;caption-side: bottom;color: #000000;border-top: 2px solid #000000;"> <caption>Statistical models</caption> <thead> <tr> <th style="padding-left: 5px;padding-right: 5px;"> </th> <th style="padding-left: 5px;padding-right: 5px;">Clinton Share</th> <th style="padding-left: 5px;padding-right: 5px;">Clinton Share</th> </tr> </thead> <tbody> <tr style="border-top: 1px solid #000000;"> <td style="padding-left: 5px;padding-right: 5px;">Intercept</td> <td style="padding-left: 5px;padding-right: 5px;">54.196<sup>***</sup></td> <td style="padding-left: 5px;padding-right: 5px;">43.807<sup>***</sup></td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(2.014)</td> <td style="padding-left: 5px;padding-right: 5px;">(0.413)</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Median Age</td> <td style="padding-left: 5px;padding-right: 5px;">-0.106<sup>*</sup></td> <td style="padding-left: 5px;padding-right: 5px;"> </td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(0.054)</td> <td style="padding-left: 5px;padding-right: 5px;"> </td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Per Capita Income</td> <td style="padding-left: 5px;padding-right: 5px;">-0.001<sup>***</sup></td> <td style="padding-left: 5px;padding-right: 5px;"> </td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(0.000)</td> <td style="padding-left: 5px;padding-right: 5px;"> </td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Median Age - 35</td> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">-0.106<sup>*</sup></td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(0.054)</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Per Capita Income - 10000</td> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">-0.001<sup>***</sup></td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(0.000)</td> </tr> <tr style="border-top: 1px solid #000000;"> <td style="padding-left: 5px;padding-right: 5px;">R<sup>2</sup></td> <td style="padding-left: 5px;padding-right: 5px;">0.052</td> <td style="padding-left: 5px;padding-right: 5px;">0.052</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Adj. R<sup>2</sup></td> <td style="padding-left: 5px;padding-right: 5px;">0.051</td> <td style="padding-left: 5px;padding-right: 5px;">0.051</td> </tr> <tr style="border-bottom: 2px solid #000000;"> <td style="padding-left: 5px;padding-right: 5px;">Num. 
obs.</td> <td style="padding-left: 5px;padding-right: 5px;">2704</td> <td style="padding-left: 5px;padding-right: 5px;">2704</td> </tr> </tbody> <tfoot> <tr> <td style="font-size: 0.8em;" colspan="3"><sup>***</sup>p < 0.001; <sup>**</sup>p < 0.01; <sup>*</sup>p < 0.05</td> </tr> </tfoot> </table> ... which provides the prediction (and significance) for when median age = 35 and per capita income = $10,000 (i.e. the predictions of the expected value of our dependent variable given these covariates!) --- ## Prediction Interval The previous slides are useful by producing the standard error of `\(\hat{y}\)`. However, when wrestling with predictions, we might be interested not in the standard error of the _expectation_ but rather the standard error of our _predictions_. --- ## Prediction Interval <img src="Week12_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> --- ## Prediction Interval <img src="Week12_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> --- ## Prediction Interval <img src="Week12_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> --- ## Prediction Interval <img src="Week12_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> --- ## Prediction Interval Generally speaking, the __confidence interval__ refers to the range of _expected_ values. A __prediction interval__ refers to the fuller uncertainty associated with a single value. -- Assume we have two intervals, both at 95%: - the confidence interval band reveals the 95% confidence of the interval holding the expected value (think about sampling error) - the prediction interval reveals the 95% confidence of the interval containing various predictions (think sampling error + population error) -- In other words, the prediction interval accounts for additional variance in unobserved error --- ### Calculating the Prediction Interval As mentioned on the previous slide, our prediction interval is the sum of sampling error and population error, or: `$$Var(\hat{e^0}) = Var(\hat{y^0}) + Var(u^0) = Var(\hat{y^0}) + \sigma^2$$` -- Expressed slightly differently, we tend to estimate a standard error of the prediction error, or: `$$se(\hat{e}^0) = \sqrt{se(\hat{y^{0}})^2 + \hat{\sigma^2}}$$` where: * `\(se(\hat{y^{0}})\)` is the estimated standard error of our prediction `\(\hat{y^0}\)`} * `\(\hat{\sigma^2}\)` is the variance of the residuals -- which we often express as an interval estimated in the familiar way as: `$$\hat{y^0} \pm t * se(\hat{e}^0)$$` --- ### Prediction Interval Example <table class="texreg" style="margin: 10px auto;border-collapse: collapse;border-spacing: 0px;caption-side: bottom;color: #000000;border-top: 2px solid #000000;"> <caption>Statistical models</caption> <thead> <tr> <th style="padding-left: 5px;padding-right: 5px;"> </th> <th style="padding-left: 5px;padding-right: 5px;">Clinton Share</th> <th style="padding-left: 5px;padding-right: 5px;">Clinton Share</th> </tr> </thead> <tbody> <tr style="border-top: 1px solid #000000;"> <td style="padding-left: 5px;padding-right: 5px;">Intercept</td> <td style="padding-left: 5px;padding-right: 5px;">54.196<sup>***</sup></td> <td style="padding-left: 5px;padding-right: 5px;">43.807<sup>***</sup></td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(2.014)</td> <td style="padding-left: 5px;padding-right: 5px;">(0.413)</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 
5px;">Median Age</td> <td style="padding-left: 5px;padding-right: 5px;">-0.106<sup>*</sup></td> <td style="padding-left: 5px;padding-right: 5px;"> </td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(0.054)</td> <td style="padding-left: 5px;padding-right: 5px;"> </td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Per Capita Income</td> <td style="padding-left: 5px;padding-right: 5px;">-0.001<sup>***</sup></td> <td style="padding-left: 5px;padding-right: 5px;"> </td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(0.000)</td> <td style="padding-left: 5px;padding-right: 5px;"> </td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Median Age - 35</td> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">-0.106<sup>*</sup></td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(0.054)</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Per Capita Income - 10000</td> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">-0.001<sup>***</sup></td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(0.000)</td> </tr> <tr style="border-top: 1px solid #000000;"> <td style="padding-left: 5px;padding-right: 5px;">R<sup>2</sup></td> <td style="padding-left: 5px;padding-right: 5px;">0.052</td> <td style="padding-left: 5px;padding-right: 5px;">0.052</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Adj. R<sup>2</sup></td> <td style="padding-left: 5px;padding-right: 5px;">0.051</td> <td style="padding-left: 5px;padding-right: 5px;">0.051</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Num. obs.</td> <td style="padding-left: 5px;padding-right: 5px;">2704</td> <td style="padding-left: 5px;padding-right: 5px;">2704</td> </tr> <tr style="border-bottom: 2px solid #000000;"> <td style="padding-left: 5px;padding-right: 5px;">RMSE</td> <td style="padding-left: 5px;padding-right: 5px;">9.918</td> <td style="padding-left: 5px;padding-right: 5px;">9.918</td> </tr> </tbody> <tfoot> <tr> <td style="font-size: 0.8em;" colspan="3"><sup>***</sup>p < 0.001; <sup>**</sup>p < 0.01; <sup>*</sup>p < 0.05</td> </tr> </tfoot> </table> --- ### Prediction Interval Example If we want to find the prediction interval where median age = 35 and per capita income is 10000, we can solve using: `$$se(\hat{e}^0) = \sqrt{se(\hat{y^{0}})^2 + \hat{\sigma^2}}$$` `$$se(\hat{y^{0}}) = 0.413$$` `$$\hat{\sigma} = 9.918$$` `$$se(\hat{e}^0) = \sqrt{0.413^2 + 9.918^2}$$` `$$se(\hat{e}^0) = 9.926$$` --- ### Prediction Interval Example <img src="Week12_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> --- ### Prediction Interval Example <img src="Week12_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> --- ## Using Predictions To take a step back, there is often a distinction between __explanatory modeling__ and __predictive modeling__ (Schmueli 2020). 
* Explanatory modeling - the use of statistical models for testing causal explanations, where we can make inferences to test hypotheses * Predictive modeling - applying a statistical model or algorithm to data for the purpose of predicting new (or future) observations -- A useful framework for distinguishing between the two is identifying the focus of the following axes: * _causation-association_ - explanatory seeks the causal; predictive identifies the association -- * _theory-data_ - theory often dictates explanatory models; data often drives predictive models -- * _retrospective-prospective_ - explanatory is backward looking; predictive is forward looking -- * _bias-variance_ - explanatory seeks to minimize bias; predictive to minimize variance --- ## Using Predictions `$$y = \beta_0 + \beta_1x_1 + \beta_2x_2$$` `$$\hat{y} = \hat{\beta_0} + \hat{\beta_1}x_1 + \hat{\beta_2}x_2$$` Using the same model, we might focus on different elements: * explanatory would likely focus on the estimates for `\(\beta\)` * predictive would likely focus on how well we can estimate the fitted values --- ## Working with Out of Sample Data So far in class we've mostly just spoken about our "data" and we've used that data to estimate various models for explanatory purposes. But as we pivot from explanatory to predictive we also need to think of our data differently. One way to think of our data is to reference "in-sample" vs. "out-of-sample" data. Generally, this refers to whether or not a particular observation was collected and used to estimate our model. --- ## Working with Out of Sample Data: Example 1 Imagine we estimate the following equation: `$$ViolentC = \beta_0 + \beta_1PoliceFunding + \beta_2RegB + \beta_3RegC + \beta_4RegD + \beta_5RegE$$` -- <table class="texreg" style="margin: 10px auto;border-collapse: collapse;border-spacing: 0px;caption-side: bottom;color: #000000;border-top: 2px solid #000000;"> <caption>Statistical models</caption> <thead> <tr> <th style="padding-left: 5px;padding-right: 5px;"> </th> <th style="padding-left: 5px;padding-right: 5px;">OLS</th> </tr> </thead> <tbody> <tr style="border-top: 1px solid #000000;"> <td style="padding-left: 5px;padding-right: 5px;">Intercept</td> <td style="padding-left: 5px;padding-right: 5px;">-392.982</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(240.271)</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Annual Police Funding</td> <td style="padding-left: 5px;padding-right: 5px;">16.883<sup>**</sup></td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(5.099)</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Region B</td> <td style="padding-left: 5px;padding-right: 5px;">122.952</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(199.184)</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Region C</td> <td style="padding-left: 5px;padding-right: 5px;">400.832<sup>*</sup></td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(195.919)</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Region D</td> <td style="padding-left: 5px;padding-right: 5px;">684.849<sup>**</sup></td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(195.454)</td>
</tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Region E</td> <td style="padding-left: 5px;padding-right: 5px;">649.676<sup>**</sup></td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(202.627)</td> </tr> <tr style="border-top: 1px solid #000000;"> <td style="padding-left: 5px;padding-right: 5px;">R<sup>2</sup></td> <td style="padding-left: 5px;padding-right: 5px;">0.482</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Adj. R<sup>2</sup></td> <td style="padding-left: 5px;padding-right: 5px;">0.423</td> </tr> <tr style="border-bottom: 2px solid #000000;"> <td style="padding-left: 5px;padding-right: 5px;">Num. obs.</td> <td style="padding-left: 5px;padding-right: 5px;">50</td> </tr> </tbody> <tfoot> <tr> <td style="font-size: 0.8em;" colspan="2"><sup>***</sup>p < 0.001; <sup>**</sup>p < 0.01; <sup>*</sup>p < 0.05</td> </tr> </tfoot> </table> --- ## Working with Out of Sample Data: Example 1 Out of sample data might help us answer... _What if State 41 in Region E increased its annual police funding by five points?_ Current value: 3545 Current fitted value: 1708 New fitted value: 1793 Confidence interval: [1284, 2301] Prediction interval: [778, 2807] -- _What if a new state were added to region B and had the maximum annual police funding?_ Current value: NA Current fitted value: NA New fitted value: 1181 Confidence interval: [547, 1816] Prediction interval: [98, 2265] --- ## Working with Out of Sample Data: Example 2 To explore a second example, now imagine using our admissions data we have estimated the model: `$$P(Admitted = 1) = \Phi(\beta_0 + \beta_1GRE + \beta_2CGPA + \beta_3Research)$$` -- <table class="texreg" style="margin: 10px auto;border-collapse: collapse;border-spacing: 0px;caption-side: bottom;color: #000000;border-top: 2px solid #000000;"> <caption>Statistical models</caption> <thead> <tr> <th style="padding-left: 5px;padding-right: 5px;"> </th> <th style="padding-left: 5px;padding-right: 5px;">Probit</th> </tr> </thead> <tbody> <tr style="border-top: 1px solid #000000;"> <td style="padding-left: 5px;padding-right: 5px;">Intercept</td> <td style="padding-left: 5px;padding-right: 5px;">-33.696<sup>***</sup></td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(3.588)</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">GRE Score</td> <td style="padding-left: 5px;padding-right: 5px;">0.055<sup>***</sup></td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(0.013)</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">CGPA</td> <td style="padding-left: 5px;padding-right: 5px;">1.914<sup>***</sup></td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(0.264)</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Research</td> <td style="padding-left: 5px;padding-right: 5px;">0.466<sup>**</sup></td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(0.171)</td> </tr> <tr style="border-top: 1px solid #000000;"> <td style="padding-left: 5px;padding-right: 5px;">AIC</td> <td style="padding-left: 5px;padding-right: 5px;">319.544</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">BIC</td> <td style="padding-left: 5px;padding-right: 5px;">336.403</td> </tr> <tr> <td 
style="padding-left: 5px;padding-right: 5px;">Log Likelihood</td> <td style="padding-left: 5px;padding-right: 5px;">-155.772</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Deviance</td> <td style="padding-left: 5px;padding-right: 5px;">311.544</td> </tr> <tr style="border-bottom: 2px solid #000000;"> <td style="padding-left: 5px;padding-right: 5px;">Num. obs.</td> <td style="padding-left: 5px;padding-right: 5px;">500</td> </tr> </tbody> <tfoot> <tr> <td style="font-size: 0.8em;" colspan="2"><sup>***</sup>p < 0.001; <sup>**</sup>p < 0.01; <sup>*</sup>p < 0.05</td> </tr> </tfoot> </table> --- ## Working with Out of Sample Data: Example 2 _What does the model predict for a student with a GRE score of 330, a GPA of 8, and no research experience?_ Probability: 44% --- ## Fitting the Model We may be tempted when building a model to have the _best_ fit possible (think: highest `\(R^2\)` or lowest `\(RMSE\)`, however that might not always be the smartest decision. In part, because we might accidentally come up with the perfect model for our _sample_ but not for our _population_. --- ## Fitting the Model Let's take an example where we have 100 cases in our population. We _know_ the relationship is: `$$y = 1 + .4x + x^2 + u$$` but let's say we sample 40 cases and try to estimate the full relationship. We might consider the following models: -- <table class="texreg" style="margin: 10px auto;border-collapse: collapse;border-spacing: 0px;caption-side: bottom;color: #000000;border-top: 2px solid #000000;"> <caption>Statistical models</caption> <thead> <tr> <th style="padding-left: 5px;padding-right: 5px;"> </th> <th style="padding-left: 5px;padding-right: 5px;">Model 1</th> <th style="padding-left: 5px;padding-right: 5px;">Model 2</th> <th style="padding-left: 5px;padding-right: 5px;">Model 3</th> <th style="padding-left: 5px;padding-right: 5px;">Model 4</th> </tr> </thead> <tbody> <tr style="border-top: 1px solid #000000;"> <td style="padding-left: 5px;padding-right: 5px;">Intercept</td> <td style="padding-left: 5px;padding-right: 5px;">-15.287<sup>*</sup></td> <td style="padding-left: 5px;padding-right: 5px;">-4.625</td> <td style="padding-left: 5px;padding-right: 5px;">-2.101</td> <td style="padding-left: 5px;padding-right: 5px;">-2.627</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(6.414)</td> <td style="padding-left: 5px;padding-right: 5px;">(4.356)</td> <td style="padding-left: 5px;padding-right: 5px;">(4.524)</td> <td style="padding-left: 5px;padding-right: 5px;">(5.662)</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">X</td> <td style="padding-left: 5px;padding-right: 5px;">11.024<sup>***</sup></td> <td style="padding-left: 5px;padding-right: 5px;">1.204</td> <td style="padding-left: 5px;padding-right: 5px;">3.314</td> <td style="padding-left: 5px;padding-right: 5px;">3.339</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(0.988)</td> <td style="padding-left: 5px;padding-right: 5px;">(1.464)</td> <td style="padding-left: 5px;padding-right: 5px;">(1.918)</td> <td style="padding-left: 5px;padding-right: 5px;">(1.951)</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">X^2</td> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">0.994<sup>***</sup></td> <td style="padding-left: 5px;padding-right: 5px;">0.286</td> <td style="padding-left: 
5px;padding-right: 5px;">0.390</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(0.134)</td> <td style="padding-left: 5px;padding-right: 5px;">(0.449)</td> <td style="padding-left: 5px;padding-right: 5px;">(0.800)</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">X^3</td> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">0.045</td> <td style="padding-left: 5px;padding-right: 5px;">0.025</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(0.027)</td> <td style="padding-left: 5px;padding-right: 5px;">(0.130)</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">X^4</td> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">0.001</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;"> </td> <td style="padding-left: 5px;padding-right: 5px;">(0.006)</td> </tr> <tr style="border-top: 1px solid #000000;"> <td style="padding-left: 5px;padding-right: 5px;">R<sup>2</sup></td> <td style="padding-left: 5px;padding-right: 5px;">0.766</td> <td style="padding-left: 5px;padding-right: 5px;">0.906</td> <td style="padding-left: 5px;padding-right: 5px;">0.913</td> <td style="padding-left: 5px;padding-right: 5px;">0.913</td> </tr> <tr> <td style="padding-left: 5px;padding-right: 5px;">Adj. R<sup>2</sup></td> <td style="padding-left: 5px;padding-right: 5px;">0.760</td> <td style="padding-left: 5px;padding-right: 5px;">0.901</td> <td style="padding-left: 5px;padding-right: 5px;">0.906</td> <td style="padding-left: 5px;padding-right: 5px;">0.903</td> </tr> <tr style="border-bottom: 2px solid #000000;"> <td style="padding-left: 5px;padding-right: 5px;">Num. obs.</td> <td style="padding-left: 5px;padding-right: 5px;">40</td> <td style="padding-left: 5px;padding-right: 5px;">40</td> <td style="padding-left: 5px;padding-right: 5px;">40</td> <td style="padding-left: 5px;padding-right: 5px;">40</td> </tr> </tbody> <tfoot> <tr> <td style="font-size: 0.8em;" colspan="5"><sup>***</sup>p < 0.001; <sup>**</sup>p < 0.01; <sup>*</sup>p < 0.05</td> </tr> </tfoot> </table> --- ## Fitting the Model We might assume that model four is the best model for making predictions considering the high `\(R^2\)` and adjusted `\(R^2\)`. -- But let's consider which model performs best on a _different_ random sample from our population. Here, we'll sample 20 new observations and consider the RMSE. We can compare the RMSE from this out-of-sample group to the in-sample. 
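Before looking at the results, here is a minimal sketch of how such a comparison could be run in R (the seed, error variance, and sampling below are illustrative; the table that follows reports the slides' own numbers, not this code's output):

```r
# Sketch: fit polynomials of degree 1-4 on 40 sampled cases,
# then compare RMSE on 20 fresh cases from the same population.
set.seed(502)                                              # arbitrary seed
pop   <- data.frame(x = runif(100, 0, 10))
pop$y <- 1 + 0.4 * pop$x + pop$x^2 + rnorm(100, sd = 10)   # known relationship; sd illustrative

in_idx <- sample(nrow(pop), 40)
train  <- pop[in_idx, ]
fresh  <- pop[-in_idx, ][1:20, ]                           # 20 new observations

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
for (p in 1:4) {
  fit <- lm(y ~ poly(x, p, raw = TRUE), data = train)
  cat("Degree", p,
      "| in-sample RMSE:",  round(rmse(train$y, predict(fit)), 2),
      "| new-sample RMSE:", round(rmse(fresh$y, predict(fit, newdata = fresh)), 2), "\n")
}
```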
-- <table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> RMSE Mod 1 </th> <th style="text-align:right;"> RMSE Mod 2 </th> <th style="text-align:right;"> RMSE Mod 3 </th> <th style="text-align:right;"> RMSE Mod 4 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Original 40 </td> <td style="text-align:right;"> 15.10489 </td> <td style="text-align:right;"> 11.4661 </td> <td style="text-align:right;"> 11.45303 </td> <td style="text-align:right;"> 11.08218 </td> </tr> <tr> <td style="text-align:left;"> New 20 </td> <td style="text-align:right;"> 35.18467 </td> <td style="text-align:right;"> 13.4988 </td> <td style="text-align:right;"> 12.94296 </td> <td style="text-align:right;"> 57.99199 </td> </tr> </tbody> </table> --- ### Overfitting and Underfitting .pull-left[ This example introduces us to: - underfitting: not great fit to the training data, not great fit to other data (high bias) - overfitting: great fit to our training data, not as great generalization to other data (such as the testing set) (high variance) ] .pull-right[ <img src="Week12_files/figure-html/unnamed-chunk-16-1.png" width="90%" /> ] --- ### Overfitting and Underfitting .pull-left[ This example introduces us to: - __underfitting__: not great fit to the training data, not great fit to other data (high bias) - overfitting: great fit to our training data, not as great generalization to other data (such as the testing set) (high variance) ] .pull-right[ <img src="Week12_files/figure-html/unnamed-chunk-17-1.png" width="90%" /> ] --- ### Overfitting and Underfitting .pull-left[ This example introduces us to: - underfitting: not great fit to the training data, not great fit to other data (high bias) - __overfitting__: great fit to our training data, not as great generalization to other data (such as the testing set) (high variance) ] .pull-right[ <img src="Week12_files/figure-html/unnamed-chunk-18-1.png" width="90%" /> ] --- ### Overfitting and Underfitting .pull-left[ This example introduces us to: - underfitting: not great fit to the training data, not great fit to other data (high bias) - overfitting: great fit to our training data, not as great generalization to other data (such as the testing set) (high variance) ] .pull-right[ <img src="Week12_files/figure-html/unnamed-chunk-19-1.png" width="90%" /> ] --- ### Prediction Workflow The workflow for creating predictive models is _slightly_ different than how we've discussed in the past -- in part because we are now quite sensitive to the accuracy of out-of-sample predictions. -- To prevent overfitting, we will often follow: * Randomly dividing our data into training data and testing data * Estimate a model using our training data (and _only_ our training data) * Use this model (estimated by the training data) to make predictions on the testing data * Assess the accuracy of our model by comparing the predictions with the actual observed values. -- And this workflow works for OLS, logit, probit, and many others. -- _Note_: This is a bit different than how we think of the same process for explanatory models. * We're _willingly_ using less data to estimate our model (gasp!) * We're no longer just using in-sample or hypothetical data points to render predictions. 
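Here is a minimal sketch of that workflow in R (the data frame, variable names, and 80/20 split below are all hypothetical, just to make the steps concrete):

```r
# Hypothetical data so the sketch is self-contained
set.seed(502)
df   <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
df$y <- 2 + 1.5 * df$x1 - 0.5 * df$x2 + rnorm(500)

# 1. Randomly divide the data into training and testing sets
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# 2. Estimate a model using the training data (and only the training data)
fit <- lm(y ~ x1 + x2, data = train)

# 3. Use that model to make predictions on the testing data
preds <- predict(fit, newdata = test)

# 4. Assess accuracy by comparing predictions with the observed values (here, RMSE)
sqrt(mean((test$y - preds)^2))
```

The same skeleton works for logit or probit: swap `lm()` for `glm()` with a binomial family and swap RMSE for a classification metric such as accuracy.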
--- ### Prediction Workflow Example Let's say we have a data set from the GSS (General Social Survey) and we're interested in developing a simple model to predict whether someone voted in the 2008 election.
--- ### Prediction Workflow Example __Step 1:__ We're going to randomly assign 80% of our data to be the "training set" and the remaining 20% of our data as the "test set". This leaves our training data with 1580 cases and our testing data with 394 cases. -- __Step 2:__ We can estimate a model using our training data (and only our training data) `$$P(Vote = 1) = \Lambda(\beta_0 + \beta_1Age + \beta_2Educ + \beta_3Sex + \beta_4Race + \beta_5Party + \beta_6Income)$$` -- __Step 3:__ We will use the model estimated in step 2 to make predictions on our test data --- ### Prediction Workflow Example __Step 4:__ Assess the accuracy of the model! Since we're using a simple logit model, let's use a different metric (more on this later). `$$Accuracy = \frac{Correct Predictions}{All Predictions}$$` `$$Accuracy = .745$$` --- ### K-fold Cross Validation __K-fold cross-validation__ builds on the validation-set method: it divides the dataset into `\(k\)` subsets and computes our performance metric `\(k\)` times, using each subset once as the test set. For example, imagine we have a model and dataset on which we want to use k-fold cross-validation with `\(k = 5\)`. 1. Randomly split the data into 5 subsets (A, B, C, D, and E). 2. For the first iteration, hold out subset A as the test set and use the remaining subsets (B, C, D, and E) as the training set. 3. Fit the model on the training subsets, test it on the held-out subset, and store the diagnostic we care about (e.g. prediction accuracy). 4. Repeat steps 2 and 3 for each subset. 5. Average the stored diagnostics and report the result as the "cross-validation performance metric." (A short code sketch of this procedure appears just after the model-metrics intro below.) -- _Why is this better than the validation set method?_ It is more robust because of the repeated randomization. -- _How do we decide `\(k\)`?_ Typically you will see `\(k\)` of 5 or 10, values that balance bias against variance: lower `\(k\)` tends to be more biased, while higher `\(k\)` tends to have larger variability. --- ### K-fold Cross Validation (K = 5) <img src="Week12_files/figure-html/unnamed-chunk-22-1.png" height="80%" style="display: block; margin: auto;" /> --- ### K-fold Cross Validation (K = 5) <img src="Week12_files/figure-html/unnamed-chunk-23-1.png" height="80%" style="display: block; margin: auto;" /> --- ### K-fold Cross Validation (K = 5) <img src="Week12_files/figure-html/unnamed-chunk-24-1.png" height="80%" style="display: block; margin: auto;" /> --- ### K-fold Cross Validation (K = 5) <img src="Week12_files/figure-html/unnamed-chunk-25-1.png" height="80%" style="display: block; margin: auto;" /> --- ### K-fold Cross Validation (K = 5) <img src="Week12_files/figure-html/unnamed-chunk-26-1.png" height="80%" style="display: block; margin: auto;" /> --- ### K-fold Cross Validation (K = 5) <img src="Week12_files/figure-html/unnamed-chunk-27-1.png" height="80%" style="display: block; margin: auto;" /> --- ### K-fold Cross Validation (K = 5) <img src="Week12_files/figure-html/unnamed-chunk-28-1.png" height="80%" style="display: block; margin: auto;" /> --- ### K-fold Cross Validation (K = 5) <img src="Week12_files/figure-html/unnamed-chunk-29-1.png" height="80%" style="display: block; margin: auto;" /> --- ### Model Metrics With the models that we've introduced in class (OLS, logit, probit), there are a number of different model metrics that we can use to assess model performance (think: goodness of fit).
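Before turning to those metrics, here is a brief sketch of the k-fold procedure described a few slides back (the data, model, and choice of `\(k\)` below are all illustrative):

```r
# Illustrative k-fold cross-validation with k = 5 and accuracy as the diagnostic
set.seed(502)
dat      <- data.frame(x = rnorm(500))
dat$vote <- rbinom(500, 1, plogis(-0.5 + 0.8 * dat$x))   # hypothetical binary outcome

k     <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))        # random fold assignment
acc   <- numeric(k)

for (i in 1:k) {
  train <- dat[folds != i, ]                             # k - 1 folds for estimation
  test  <- dat[folds == i, ]                             # held-out fold for testing
  fit   <- glm(vote ~ x, data = train, family = binomial)
  pred  <- as.numeric(predict(fit, newdata = test, type = "response") > 0.5)
  acc[i] <- mean(pred == test$vote)                      # store the diagnostic
}
mean(acc)                                                # cross-validation performance metric
```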
We're only going to scratch the surface here (and you don't need to memorize these at all) but... __OLS__ _(Continuous DV)_ `\(R^2\)` or `\(Adj. R^2\)` `\(RMSE\)` (or just `\(MSE\)`) `\(MAE\)` or Mean Absolute Error, `\(= \frac{1}{n}\sum|y_i - y_p|\)` `\(MAPE\)` or Mean Absolute Percentage Error, `\(= \frac{100\%}{n}\sum\frac{|y_i - y_p|}{y_i}\)` --- ### Model Metrics __Logit/Probit__ _(Binary DV)_ `\(Pseudo R^2\)` `\(Accuracy\)` - Proportion of Correct Predictions `\(Specificity\)` - Correctly Identified Negative Cases `\(Sensitivity\)` - Correctly Identified Positive Cases `\(Brier Score\)` - Evaluating the probability of the event, `\(= \frac{1}{N}\sum(p_i - y_i)^2\)` Each has different benefits and disadvantages, which we'll pick up again during the Stata session (a brief code sketch appears at the end of these slides). --- ### Applications So why do we want to do this again? Predictive models can get quite complicated but can begin with all the models that we've been estimating all semester long. There are a lot of applications, many of which are explicitly working their way into the research sphere. Some examples include: * developing microtargeting predictions for voters, political supporters, etc. * working to predict outcomes of various policy implementations (think Covid vaccine policies) * identifying opinion changes from smaller studies and extrapolating to broader populations Though as models get more complicated, we have to be careful to think through the unintended consequences of our predictive models. --- ## Where We're Going Next Missing data, and other data concerns __Reading__ * Wooldridge: 9-4, 9-5
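--- ### Appendix: Model Metrics in Code A minimal sketch of the binary-outcome metrics listed on the Model Metrics slides (the outcomes and predicted probabilities below are simulated and purely illustrative):

```r
# Simulated outcomes and predicted probabilities (illustrative only)
set.seed(502)
x      <- rnorm(200)
y_obs  <- rbinom(200, 1, plogis(x))   # observed 0/1 outcome
p_hat  <- plogis(0.9 * x)             # predicted probability of a 1
y_pred <- as.numeric(p_hat > 0.5)     # classify at a 0.5 cutoff

mean(y_pred == y_obs)                 # Accuracy
mean(y_pred[y_obs == 1])              # Sensitivity: share of 1s correctly identified
mean(1 - y_pred[y_obs == 0])          # Specificity: share of 0s correctly identified
mean((p_hat - y_obs)^2)               # Brier score
```

For continuous outcomes, `mean(abs(y - y_hat))` gives the MAE and `sqrt(mean((y - y_hat)^2))` the RMSE.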