Empirically Driven Variable Selection for the Estimation of Causal Effects with Observational Data
Author: Bryan Keller
Total Pages: 12
Release: 2016
OCLC: 1066671782
Book excerpt: Observational studies are common in educational research, where subjects self-select or are otherwise non-randomly assigned to different interventions (e.g., educational programs, grade retention, special education). Unbiased estimation of a causal effect with observational data depends crucially on the assumption of ignorability, which specifies that potential outcomes under different treatment conditions are independent of treatment assignment, given the observed covariates. The primary goals of this paper are to: (1) propose and evaluate an empirically driven method for identifying and removing potential instrumental and non-informative pretreatment variables based on their lack of association with the outcome (though bias will remain in the presence of unmeasured confounders); and (2) investigate, through simulation studies, the efficacy of three variable selection methods, as measured by their success in identifying instrumental variables (IVs) and non-informative variables (NVs) and by improvement in bias and mean squared error relative to no variable selection at all. Two simulation studies were conducted, and their conclusions are twofold. First, we show that for the two data-generation processes used here, preprocessing data to detect and remove potential instrumental and non-informative variables based on their relationships with the outcome improved the mean squared error of treatment effect estimation. This reflects the bias-variance trade-off: more sensitive variable selection methods do a better job of detecting and removing non-informative variables (which decreases variance) while simultaneously dropping more weak confounders (which increases bias). Second, we find that recursive feature elimination with random forests is a promising method for predictor selection, as evidenced by strong performance in its naïve implementation across both simulation studies.
Finally, we note that the ability of the methods to single out NVs and IVs depends in part upon the magnitudes and directions of correlations between unmeasured confounders and other predictors. Tables and figures are appended.
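
To make the idea of outcome-based variable selection concrete, the following is a minimal sketch of recursive feature elimination with a random forest, written with scikit-learn. It is not the paper's implementation: the simulated data, the coefficient values, the choice to include the treatment indicator as a column of the outcome model, the number of variables retained, and the names `simulate` and `select_by_outcome` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE


def simulate(n=500, seed=0):
    """Toy data (hypothetical): two confounders, one instrument, two noise variables."""
    rng = np.random.default_rng(seed)
    x1, x2, z1, n1, n2 = rng.normal(size=(5, n))
    # Treatment assignment depends on the confounders (x1, x2) and the instrument (z1).
    treat = (0.8 * x1 + 0.8 * x2 + 1.0 * z1 + rng.normal(size=n) > 0).astype(float)
    # The outcome depends on treatment and the confounders only.
    y = 2.0 * treat + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)
    names = ["treat", "x1", "x2", "z1", "n1", "n2"]
    X = np.column_stack([treat, x1, x2, z1, n1, n2])
    return X, y, names


def select_by_outcome(X, y, names, n_keep=3, seed=0):
    """Rank predictors of the outcome with RFE wrapped around a random forest.

    Assumption of this sketch: the treatment indicator is one of the columns,
    so a variable that matters only through treatment has little left to explain.
    """
    forest = RandomForestRegressor(n_estimators=100, random_state=seed)
    # At each step RFE refits the forest and drops the least important feature.
    rfe = RFE(estimator=forest, n_features_to_select=n_keep, step=1)
    rfe.fit(X, y)
    kept = [name for name, keep in zip(names, rfe.support_) if keep]
    dropped = [name for name, keep in zip(names, rfe.support_) if not keep]
    return kept, dropped


X, y, names = simulate()
kept, dropped = select_by_outcome(X, y, names)
print("kept:", kept)
print("dropped:", dropped)
```

In a real analysis the number of variables to retain would not be known in advance, so a cross-validated variant such as scikit-learn's `RFECV` may be a more natural choice. As the excerpt notes, a more aggressive cutoff risks discarding weak confounders along with instrumental and non-informative variables.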